Modern Data Platform: Storage

This is the third in a series of blogs which outline a vision of a Modern Data Platform, its components, and the benefits that can be realised from taking a holistic view of your data assets.
In this blog, I will expand further on the Storage components of the Modern Data Platform.  It is important to understand that although the components of the Modern Data Platform can be subdivided and thought of separately, they work together to deliver additional synergies for the organisation.

The Consolidated Data Store

The Consolidated Data Store is a logical entity comprising multiple storage technologies.  In the example shown here, it contains four physical data stores for the following types of data:
  • SQL – a relational data store for traditional table-held data
  • NoSQL – a data store for key/value pairs, JSON document-based data and graph data
  • Data Lake – an organisation-wide, Hadoop-based storage platform for big data sets
  • Document – an enterprise document management platform for Office documents

Note that the physical data stores within the Consolidated Data Store are not limited to the four in the example.  They could also include a cloud service, file shares, Hadoop clusters, cloud storage buckets or other technologies for storing data.

What makes this a consolidated data store is that all of the physical stores are managed in the same way.  Data in each store follows the same taxonomy and a consistent naming convention, so that users know where to find and store the assets they use.  The stores support the organisation's data lifecycle, and they are all secured using the same Common Security Model and audited by a Common Audit Model.
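
As an illustration of how a shared naming convention could be enforced across every physical store, here is a minimal Python sketch.  The pattern `<stage>/<domain>/<name>_v<version>`, the function `parse_asset_name` and the `AssetName` type are assumptions made for this example, not part of any product; the lifecycle stages are the ones named in the worked example later in this blog, and a real taxonomy would be defined by the organisation.

```python
import re
from dataclasses import dataclass

# Lifecycle stages named in this blog's worked example; a real taxonomy may differ.
LIFECYCLE_STAGES = ("plan", "ingest", "process", "model", "analyse")

# Hypothetical convention: <stage>/<business_domain>/<asset_name>_v<version>
ASSET_PATTERN = re.compile(
    r"^(?P<stage>[a-z]+)/(?P<domain>[a-z0-9_]+)/(?P<name>[a-z0-9_]+)_v(?P<version>\d+)$"
)


@dataclass(frozen=True)
class AssetName:
    stage: str
    domain: str
    name: str
    version: int


def parse_asset_name(path: str) -> AssetName:
    """Validate an asset path against the (assumed) convention and split it up."""
    match = ASSET_PATTERN.match(path)
    if match is None:
        raise ValueError(f"{path!r} does not follow <stage>/<domain>/<name>_v<version>")
    if match["stage"] not in LIFECYCLE_STAGES:
        raise ValueError(f"unknown lifecycle stage {match['stage']!r} in {path!r}")
    return AssetName(
        stage=match["stage"],
        domain=match["domain"],
        name=match["name"],
        version=int(match["version"]),
    )
```

Because the same check can run against a table name, a data lake path or a document title, every store can reject or flag assets that would otherwise be hard to find later.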

The physical stores can be delivered as on-premises services, Cloud Infrastructure as a Service (IaaS), Cloud Platform as a Service (PaaS), Cloud Software as a Service (SaaS) or a mixture of these.

The Consolidated Data Store approach allows metadata to be collected about the data assets stored in it, and its structure allows those assets to be catalogued automatically in the Data Catalogue component of the Modern Data Platform.

The Consolidated Data Store in Action

The following graphic depicts how data assets are stored within the different physical stores of the Consolidated Data Store, how they map onto the lifecycle for the development of a data asset, and examples of the processing applied to them, whether by automated standard functions and processes or by manual manipulation.

In this example, the lifecycle starts in the Plan stage with a business analyst planning a new dataset.  This user would capture details of the audience, formats, data sources and analysis required in a document.  This data asset is stored in the document data store for peer review.

We then move to the Ingest stage of the lifecycle.  The data sources are ingested into the Consolidated Data Store in the form of two new data assets: a JSON document stored in the NoSQL data store and a CSV file stored in the data lake.  A standard Python script is used to load both data assets into the SQL data store as separate tables so that they can be stored, joined and manipulated in the Process stage of the lifecycle.
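
A load script of this kind might look something like the sketch below.  It is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for the SQL data store, the JSON document and CSV file are inlined as strings rather than read from the NoSQL store and the data lake, and the table names, column names and helper functions are all hypothetical.

```python
import csv
import io
import json
import sqlite3


def _create_and_fill(conn, table, records):
    """Create `table` from the keys of the first record and insert every record."""
    if not records:
        raise ValueError(f"no records to load into {table}")
    columns = list(records[0].keys())
    # Identifiers cannot be bound as parameters, so they are quoted here;
    # the values themselves are always passed as bound parameters.
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(r.get(c) for c in columns) for r in records],
    )
    conn.commit()


def load_json_records(conn, table, json_text):
    """Load a JSON array of flat objects into a new table."""
    _create_and_fill(conn, table, json.loads(json_text))


def load_csv_records(conn, table, csv_text):
    """Load CSV text with a header row into a new table (values stay as text)."""
    _create_and_fill(conn, table, list(csv.DictReader(io.StringIO(csv_text))))


# Illustrative data only: an in-memory database and two tiny source assets.
conn = sqlite3.connect(":memory:")
load_json_records(
    conn,
    "ingest_customers",
    '[{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "Globex"}]',
)
load_csv_records(
    conn,
    "ingest_orders",
    "order_id,customer_id,amount\n10,1,250.00\n11,2,99.50\n",
)

# The two assets can now be joined; CSV values arrive as text, so the key is cast.
rows = conn.execute(
    "SELECT c.name, o.amount FROM ingest_orders AS o "
    "JOIN ingest_customers AS c ON c.customer_id = CAST(o.customer_id AS INTEGER) "
    "ORDER BY o.order_id"
).fetchall()
```

Loading both sources through one shared helper keeps the script "standard" in the sense used above: every ingested asset lands in the SQL data store in the same shape, and type clean-up is left to the Process stage.
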
A standard T-SQL stored procedure then reads, cleans and enriches the data.  When the data is ready for analysis, it is stored in a new table as a new asset in the Model stage of the lifecycle.
The data analyst then extracts the data from the SQL data store into an analytical tool, produces some further analysis by imputing and/or aggregating data and then stores the result in the SQL data store in the Analyse lifecycle stage as a new data asset.
This data asset is then used in a presentation tool for consumption by the end user community and stored as a file in the document store.
Finally, the analysed data set is exported as a CSV file into the data lake, so that it can be reused and integrated within other data assets in the future.
All of the data assets shown in this example would be logged in the Data Catalogue automatically, with metadata about them derived from the common taxonomy and from further analytics specific to each physical store.
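
To make the idea of automatic cataloguing a little more concrete, the sketch below shows one way an asset could be registered with basic metadata at the moment it is stored.  The `DataCatalogue` and `CatalogueEntry` classes, their fields and the asset names are hypothetical and exist only for this illustration; a real Data Catalogue component would hold far richer metadata and would persist it.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CatalogueEntry:
    asset_name: str
    physical_store: str   # e.g. "sql", "nosql", "data_lake", "document"
    lifecycle_stage: str  # e.g. "plan", "ingest", "process", "model", "analyse"
    size_bytes: int
    sha256: str
    registered_at: str


class DataCatalogue:
    """In-memory stand-in for a catalogue, keyed by (store, asset name)."""

    def __init__(self):
        self._entries = {}

    def register(self, asset_name, physical_store, lifecycle_stage, content: bytes):
        """Derive simple metadata from the asset's bytes and record it."""
        entry = CatalogueEntry(
            asset_name=asset_name,
            physical_store=physical_store,
            lifecycle_stage=lifecycle_stage,
            size_bytes=len(content),
            sha256=hashlib.sha256(content).hexdigest(),
            registered_at=datetime.now(timezone.utc).isoformat(),
        )
        self._entries[(physical_store, asset_name)] = entry
        return entry

    def find_by_stage(self, lifecycle_stage):
        """Return every registered asset in the given lifecycle stage."""
        return [e for e in self._entries.values() if e.lifecycle_stage == lifecycle_stage]


catalogue = DataCatalogue()
catalogue.register("customers.json", "nosql", "ingest", b'[{"customer_id": 1}]')
catalogue.register("orders.csv", "data_lake", "ingest", b"order_id,customer_id\n10,1\n")
catalogue.register("sales_analysis.csv", "data_lake", "analyse", b"name,total\nAcme,250.00\n")
ingested = catalogue.find_by_stage("ingest")
```

If registration is called from the same standard functions that write to each store, no asset reaches the Consolidated Data Store without a catalogue entry.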

Would you like to know more?

Would you like to know more, or to see how a Modern Data Platform can be applied within your own organisation to bring back control of your data?  Contact us using the link below.

About the author