This is the fifth in a series of blogs which outline a vision of a Modern Data Platform, its components, and the benefits that can be realised from taking a holistic view of your data assets.
In this blog, I will expand further on the Lifecycle component of the Modern Data Platform. Although the components of the Modern Data Platform can be subdivided and thought of separately, each works together with the others to deliver additional synergies for the organisation.
The purpose of a data lifecycle is to describe the journey of a data asset. Copies of the data asset are kept at each stage of the data lifecycle to form a data lineage. This is especially useful in analytical organisations, which may need to trace back through the history of the data to troubleshoot issues.
The data lifecycle will change in different types of organisations, depending on their data inputs, processing requirements and outputs. The lifecycle presented here is an example.
Data stored in the Consolidated Data Store is organised according to lifecycle stage, using either a folder-style structure or database schemas.
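As a minimal sketch of the folder-style option, a copy of a data asset could be located by combining a root, a stage folder and the asset name. The stage folder names follow the example lifecycle in this blog; the root `/datastore`, the asset name `customer_orders` and the function `asset_path` are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath

# Stage folder names taken from the example lifecycle in this blog.
# A real platform may name or order these differently.
STAGE_FOLDERS = ["plan", "ingest", "process", "model", "analyse", "publish", "integrate"]


def asset_path(root: str, stage: str, asset_name: str) -> PurePosixPath:
    """Return the folder holding the copy of an asset for one lifecycle stage."""
    if stage not in STAGE_FOLDERS:
        raise ValueError(f"unknown lifecycle stage: {stage}")
    return PurePosixPath(root) / stage / asset_name


# Hypothetical root and asset name.
print(asset_path("/datastore", "ingest", "customer_orders"))
# prints /datastore/ingest/customer_orders
```

The same idea carries over to the database option, where the stage would become a schema name rather than a folder.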
In this example data lifecycle, the stages are:
- Plan. This stage defines the actions to be taken to identify the resources and data assets required to produce the new data asset. It normally takes the form of a Data Processing design: an Office document which describes the data it will contain, its sources, expected audience, applicable data standards and policies, expected volumes, data obfuscation requirements and so on.
- Ingest. This stage is where new data is obtained. It is considered RAW data, and it might be newly obtained information or the output from another data asset.
- Process. This stage is where the RAW data from the Ingest stage is taken and prepared for use. Data is cleaned and matched with data standards: columns are renamed, formats are changed, and data is enriched or imputed where it is missing.
- Model. This stage is where the data is modelled. It is stored de-normalised where appropriate, has the appropriate primary/foreign key relationships created and is indexed, ready to be used in further analysis.
- Analyse. This stage is where further analysis is performed on the modelled data: creating aggregations or predictive models, describing facts, detecting patterns and testing hypotheses.
- Publish. This stage is where data is shared internally within the organisation or externally.
- Integrate. This stage is where the analyses produced are reused within other data assets.
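To make the Process stage more concrete, here is a small sketch of the kind of preparation it describes: renaming columns, changing formats and imputing missing values. The RAW rows, the column mapping, the day/month/year date format and the choice of mean imputation are all assumptions made for illustration, not a prescribed method.

```python
from datetime import datetime

# Hypothetical RAW records as they might arrive from the Ingest stage.
raw_rows = [
    {"Cust Name": " Alice ", "order_dt": "03/01/2024", "amount": "120.50"},
    {"Cust Name": "Bob", "order_dt": "04/01/2024", "amount": ""},
]

# Hypothetical mapping from RAW column names to the organisation's standard names.
COLUMN_NAMES = {"Cust Name": "customer_name", "order_dt": "order_date", "amount": "amount"}


def process(rows):
    """Rename columns, tidy values, convert formats and impute missing amounts."""
    renamed = [{COLUMN_NAMES[key]: value for key, value in row.items()} for row in rows]

    # Simple mean imputation; other imputation methods may suit the data better.
    known = [float(r["amount"]) for r in renamed if r["amount"]]
    mean_amount = sum(known) / len(known) if known else 0.0

    processed = []
    for r in renamed:
        processed.append({
            "customer_name": r["customer_name"].strip(),
            # Assumes RAW dates are day/month/year; stored as ISO 8601.
            "order_date": datetime.strptime(r["order_date"], "%d/%m/%Y").date().isoformat(),
            "amount": float(r["amount"]) if r["amount"] else mean_amount,
        })
    return processed


processed_rows = process(raw_rows)
print(processed_rows[1])
# prints {'customer_name': 'Bob', 'order_date': '2024-01-04', 'amount': 120.5}
```

Because the RAW copy is kept unchanged in the Ingest stage, a step like this can be re-run or corrected later without losing the original data.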
Not every data asset will feature in each lifecycle stage. For example, where existing data assets are being reused, they will already have been ingested, processed and modelled, so the new data asset may exist only in the Plan, Analyse, Publish and Integrate stages.
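The rule above can be sketched in a few lines. The `Stage` enumeration mirrors the example lifecycle; the function name `stages_for` and its flag are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    PLAN = 1
    INGEST = 2
    PROCESS = 3
    MODEL = 4
    ANALYSE = 5
    PUBLISH = 6
    INTEGRATE = 7


# Stages already covered by the source assets when existing data is reused.
INHERITED_WHEN_REUSED = {Stage.INGEST, Stage.PROCESS, Stage.MODEL}


def stages_for(reuses_existing_assets: bool) -> list[Stage]:
    """List the stages in which a new data asset will have its own copy."""
    if reuses_existing_assets:
        return [stage for stage in Stage if stage not in INHERITED_WHEN_REUSED]
    return list(Stage)


print([stage.name for stage in stages_for(reuses_existing_assets=True)])
# prints ['PLAN', 'ANALYSE', 'PUBLISH', 'INTEGRATE']
```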
The data lineage created using the data lifecycle is surfaced in the Data Catalogue component of the Modern Data Platform. Using graph database technology, relationships between data assets can be visualised to understand:
- Where data assets have come from (parent data assets)
- Where data assets have been used (child data assets)
- Which data assets are linked to other data assets
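The three questions above map naturally onto a graph of parent and child links. The sketch below uses a tiny in-memory class as a stand-in for a graph database; the class `Lineage`, its methods and the asset names are hypothetical and do not represent any particular product's API.

```python
from collections import defaultdict


class Lineage:
    """A tiny in-memory stand-in for a graph of data assets."""

    def __init__(self):
        self._parents = defaultdict(set)   # asset -> assets it was built from
        self._children = defaultdict(set)  # asset -> assets built from it

    def add_link(self, parent: str, child: str) -> None:
        """Record that `child` was produced using `parent`."""
        self._parents[child].add(parent)
        self._children[parent].add(child)

    def parents(self, asset: str) -> list[str]:
        """Where has this asset come from?"""
        return sorted(self._parents.get(asset, set()))

    def children(self, asset: str) -> list[str]:
        """Where has this asset been used?"""
        return sorted(self._children.get(asset, set()))

    def linked(self, asset: str) -> list[str]:
        """Which assets are directly linked to this one, in either direction?"""
        return sorted(set(self.parents(asset)) | set(self.children(asset)))


# Hypothetical data assets.
lineage = Lineage()
lineage.add_link("raw_sales", "sales_model")
lineage.add_link("sales_model", "sales_forecast")
lineage.add_link("sales_model", "regional_report")

print(lineage.parents("sales_model"))   # prints ['raw_sales']
print(lineage.children("sales_model"))  # prints ['regional_report', 'sales_forecast']
```

A graph database would store the same links as edges between nodes, which is what makes the relationships straightforward to visualise.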
The data lineage information is very powerful. It can be used, for example, to identify bad data flowing through the organisation or to understand where licensed data has been used.
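As an illustration of that kind of tracing, the sketch below walks the lineage breadth-first to find every data asset derived, directly or indirectly, from a given source. The `CHILDREN` mapping and the asset names are hypothetical; in practice the links would be read from the Data Catalogue.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets derived directly from it.
CHILDREN = {
    "licensed_postcodes": ["customer_model"],
    "customer_model": ["churn_analysis", "regional_report"],
    "churn_analysis": ["board_pack"],
    "regional_report": [],
    "board_pack": [],
}


def downstream(asset: str) -> list[str]:
    """Return every asset derived, directly or indirectly, from the given asset."""
    seen = {asset}
    queue = deque([asset])
    found = []
    while queue:
        current = queue.popleft()
        for child in CHILDREN.get(current, []):
            if child not in seen:  # guards against visiting an asset twice
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


print(downstream("licensed_postcodes"))
# prints ['customer_model', 'churn_analysis', 'regional_report', 'board_pack']
```

The same traversal answers both questions in the paragraph above: start from the asset known to contain bad data, or from the licensed source, and the result is the list of assets that need to be reviewed.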
Would you like to know more?
To find out how a Modern Data Platform can be applied within your own organisation to bring back control of your data, contact us on the link below.