What is Azure Data Catalog?
Azure Data Catalog is a cloud application from Microsoft that allows an organisation to construct an enterprise repository of its data assets. It is delivered by the Microsoft Azure public cloud as a Software as a Service (SaaS) application.
Its purpose is to provide a repository within an organisation where data assets can be registered and described. The application only stores metadata about each registered data asset, with the data remaining in its original location. Data assets can be registered from any location, whether on-premises, in Azure or in any other public cloud service. Azure Data Catalog also provides glossary functionality, allowing an organisation to maintain its own data vocabulary for use when registering and updating assets.
What is a data asset?
Examples of a data asset are:
- The database of a Line of Business (LoB) system
- Raw data ingested by the organisation from an external data source
- A dataset derived from other datasets
- A report
- A spreadsheet
- A Power BI model
What problems does it solve?
Azure Data Catalog provides functionality to help organisations solve the following problems:
- Discovery and reuse of data assets. It helps data consumers to discover existing data assets within the organisation so that they can enrich their own datasets, or reuse datasets that have already been generated by other departments.
- Accessing datasets. Each registered data asset stores metadata on where to find the dataset and how to access it
- Identifying who is responsible for upkeep of the data. Whether you have data owners, data stewards or data custodians, this information is registered as metadata against the asset, so consumers know who to speak to, to get more information about the data asset
- Governance. Collecting knowledge of all the data assets increases the data governance capability of an organisation. It also helps the Chief Information Officer to understand where data assets are held, and whether they are being managed correctly
- Identifying if datasets contain personally identifiable information, or sensitive data. Data asset metadata can store whether the asset contains personally identifiable information, or whether it has any sensitivity or usage restrictions
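To make the points above more concrete, here is a minimal sketch of the kind of metadata record a catalog might hold for one asset. The class name, field names and sample values are all hypothetical and are not taken from the Azure Data Catalog schema; the point is simply that the record describes where the data is, how to reach it, who looks after it and whether it is sensitive, while the data itself stays where it is.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataAssetRecord:
    """Hypothetical catalog entry: metadata only, never the data itself."""
    name: str
    location: str               # where the data lives (illustrative format)
    access_instructions: str    # how a consumer requests or gains access
    owner: str                  # data owner, steward or custodian to contact
    contains_pii: bool = False  # personally identifiable information flag
    usage_restrictions: List[str] = field(default_factory=list)

    def needs_restricted_handling(self) -> bool:
        # Sensitive if it holds PII or carries any usage restriction.
        return self.contains_pii or bool(self.usage_restrictions)


crm = DataAssetRecord(
    name="CRM customer table",
    location="sqlserver://crm-prod/Sales/dbo.Customer",
    access_instructions="Request read access from the data owner",
    owner="sales.data.owner@example.com",
    contains_pii=True,
)
weather = DataAssetRecord(
    name="Public weather feed",
    location="https://example.com/weather.csv",
    access_instructions="Open access",
    owner="bi.team@example.com",
)
print(crm.needs_restricted_handling(), weather.needs_restricted_handling())
```

A consumer browsing the catalog would see the owner and access instructions straight away, and any tooling built on the catalog could use the sensitivity flag to decide how the asset should be handled.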
What does it do?
Azure Data Catalog supports four use cases:
- Discovery of data assets within an organisation. Data assets registered with the catalog can be browsed or searched by data consumers to discover and reuse existing datasets
- Helping data consumers to understand data assets. The addition of metadata in the catalog allows data owners to describe the data and its business context, using a vocabulary that is understood across the organisation
- Assisting consumption of data. The catalog can contain links to the registered data asset, and describe which tools are required to access the data, and how access is requested
- Contributing to data knowledge. Registered data assets can be tagged in the catalog using an organisation’s taxonomy to increase the ease of discovery. Asset metadata can be edited by users of the data, and data lineage can be recorded, so that reports or aggregated information can be linked back to their original sources
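Two of these use cases, discovery by search and tracing a report back to its sources, can be illustrated with a small in-memory sketch. This is not how Azure Data Catalog is implemented; the function names, the dictionary layout and the sample assets are hypothetical, and the lineage walk assumes the "derived from" links contain no cycles.

```python
from typing import Dict, List, Set


def search_assets(assets: List[dict], term: str) -> List[str]:
    """Return names of assets whose name, description or tags match the term."""
    term = term.lower()
    return [
        a["name"]
        for a in assets
        if term in a["name"].lower()
        or term in a["description"].lower()
        or term in (t.lower() for t in a["tags"])
    ]


def original_sources(asset: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Walk lineage links back to assets that are not derived from anything."""
    parents = derived_from.get(asset, [])
    if not parents:
        return {asset}
    sources: Set[str] = set()
    for parent in parents:
        sources |= original_sources(parent, derived_from)
    return sources


catalog = [
    {"name": "CRM database", "description": "Line of business customer system",
     "tags": ["customer", "sales"]},
    {"name": "Web orders feed", "description": "Raw order data from the web shop",
     "tags": ["sales", "orders"]},
    {"name": "Sales report", "description": "Monthly revenue by region",
     "tags": ["finance"]},
]
lineage = {
    "Sales report": ["Sales summary"],
    "Sales summary": ["CRM database", "Web orders feed"],
}

print(search_assets(catalog, "customer"))
print(sorted(original_sources("Sales report", lineage)))
```

Tagging with a shared taxonomy is what makes the search step useful: the "Sales report" is found by name, while the two source systems are found through their tags.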
Who is it useful for?
Azure Data Catalog is suitable for use by any user persona that generates, consumes or manages an organisation's data. Examples of these users are:
- data consumers
- data owners
- BI teams
- data analysts and data scientists
The application provides role-based access control, allowing an organisation to provision the relevant access for each user persona and to control the visibility of objects in the catalog.
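As an illustration of the idea, the sketch below maps personas to permitted actions and filters which assets each persona can see. The persona keys, action names and the `visible_to` field are hypothetical and do not correspond to the product's actual roles or settings; it is only meant to show the shape of a role-based model.

```python
from typing import Dict, List, Optional, Set

# Hypothetical mapping of personas to the actions they may perform.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "data_consumer": {"search", "view"},
    "data_analyst": {"search", "view", "annotate"},
    "bi_team": {"search", "view", "annotate", "register"},
    "data_owner": {"search", "view", "annotate", "register", "restrict_visibility"},
}


def is_allowed(persona: str, action: str) -> bool:
    """Unknown personas get no permissions at all."""
    return action in ROLE_PERMISSIONS.get(persona, set())


def visible_assets(persona: str, assets: List[dict]) -> List[str]:
    """List asset names the persona may view; visible_to=None means everyone."""
    if not is_allowed(persona, "view"):
        return []
    return [
        a["name"]
        for a in assets
        if a["visible_to"] is None or persona in a["visible_to"]
    ]


assets = [
    {"name": "Product list", "visible_to": None},
    {"name": "Payroll extract", "visible_to": {"data_owner"}},
]

print(visible_assets("data_consumer", assets))
print(visible_assets("data_owner", assets))
print(is_allowed("data_analyst", "register"))
```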
Starting out with Azure Data Catalog
If you are interested in making a start with Azure Data Catalog, there are a few things to bear in mind. There are two editions of the product: a free edition, which is limited to 5,000 data assets, and a standard edition, which can manage up to 100,000 data assets. The supported data sources for registering in Azure Data Catalog include all the popular relational and non-relational sources. A full list can be found at;
Azure Data Catalog has built-in APIs (Application Programming Interfaces) that can be used to publish data to, or query data in, the catalog. They lend themselves to automated registration and publishing of data assets. Updating a data source over time is as simple as re-registering the asset, which can also be automated using these APIs.
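The sketch below shows roughly what an automated registration call could look like when prepared from a script. It only builds the request and does not send it. The base URL, the `api-version` value, the `views/tables` path and every field name in the body are assumptions based on our understanding of the REST API and should be checked against Microsoft's current reference before use; the function name and sample values are hypothetical, and authentication (an Azure Active Directory bearer token) is left out.

```python
import json
from urllib.parse import quote

BASE_URL = "https://api.azuredatacatalog.com"  # assumed endpoint
API_VERSION = "2016-03-30"                     # assumed API version


def build_registration_request(catalog: str, server: str, database: str,
                               schema: str, table: str, registered_by: str) -> dict:
    """Prepare (but do not send) a request to register a SQL Server table."""
    url = (f"{BASE_URL}/catalogs/{quote(catalog)}/views/tables"
           f"?api-version={API_VERSION}")
    body = {
        "properties": {
            "fromSourceSystem": False,
            "name": table,
            "dataSource": {"sourceType": "SQL Server", "objectType": "Table"},
            # Where the data lives; the catalog stores this, not the data.
            "dsl": {
                "protocol": "tds",
                "authentication": "windows",
                "address": {"server": server, "database": database,
                            "schema": schema, "object": table},
            },
            "lastRegisteredBy": {"upn": registered_by},
        }
    }
    return {
        "method": "POST",
        "url": url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


request = build_registration_request(
    catalog="DefaultCatalog", server="crm-prod", database="Sales",
    schema="dbo", table="Customer", registered_by="bi.team@example.com",
)
print(request["method"], request["url"])
```

Because updating an asset is a matter of registering it again, a scheduled job could rebuild and resend the same request whenever the source changes.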
Need something more?
Azure Data Catalog is a great starting point for organisations seeking greater efficiency and governance of their data assets. However, it is just the first step in the journey to a modern data platform. To get the most out of the product, consider:
- Combining the product with other components of the Microsoft data platform to gain efficiencies of using loosely coupled cloud components
- Data orchestration tooling to automate data asset discovery, registration of new assets and refresh of existing assets when they change
- How data assets change and become new assets in your data lifecycle
- How governance is enhanced and is represented in the catalog, for example, compliance with data standards
- The use of exception reporting with Power BI to understand where there are missing data assets, or registered assets with missing metadata
- The use of bot technology and natural language query to enable users of the application to surface the catalog data in different ways
- Incorporating Azure Machine Learning classification algorithms to help automatically generate metadata for a data asset, in line with the organisation’s taxonomy
- Using API access to expose catalog data to other systems
- How the security model for the product is mapped onto user personas within the organisation
- How the catalog technology can link with master data management, data quality services, and a common security and audit model
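The exception reporting idea in the list above can be sketched in a few lines. The example checks a list of catalog entries for missing metadata and produces a simple report that a tool such as Power BI could then visualise. The required field names, the function name and the sample entries are hypothetical; an organisation would substitute its own metadata standard.

```python
from typing import Dict, List

# Hypothetical metadata standard: every asset should have these filled in.
REQUIRED_FIELDS = ("description", "owner", "tags")


def missing_metadata_report(assets: List[dict]) -> Dict[str, List[str]]:
    """Map each non-compliant asset name to the fields it is missing."""
    report: Dict[str, List[str]] = {}
    for asset in assets:
        # A field counts as missing if it is absent or empty.
        missing = [f for f in REQUIRED_FIELDS if not asset.get(f)]
        if missing:
            report[asset["name"]] = missing
    return report


registered = [
    {"name": "CRM database", "description": "Customer system",
     "owner": "sales.data.owner@example.com", "tags": ["customer"]},
    {"name": "Web orders feed", "description": "", "owner": None,
     "tags": ["orders"]},
    {"name": "Sales report", "description": "Monthly revenue", "tags": []},
]

for name, fields in missing_metadata_report(registered).items():
    print(f"{name}: missing {', '.join(fields)}")
```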
If this sounds like something that your organisation can benefit from, contact risual to discuss how we can help you transform the role that data plays in your organisation.