Integrating Azure Synapse Analytics and Microsoft Purview

Wouldn’t it be brilliant if your company had a massively scalable data analytics suite combined with rich data asset discovery and lineage tracking functionality.

So what are your options?

Individually, Azure Synapse Analytics and Microsoft Purview are the premium choices. When combined, Synapse and Purview is the comprehensive solution for both data analytics and governance.

Synapse is becoming very popular as a data analytics platform for data-driven clients across the globe. It is a cloud-scale data analytics suite that unifies data ingestion and integration, big data analytics with SQL and Spark, and enterprise level data warehousing.

Purview is also getting noticed!

It is Microsoft’s flagship cloud data governance solution for cataloging, classifying and tracking assets across a company’s estate for both on-premises and cloud based data sources.

In this blog I’ll discuss:

  • How Purview catalogs Synapse database assets.
  • Configuring Purview integration in Synapse.
  • Searching the Purview data catalog from Synapse Studio.
  • Tracking data lineage in Synapse Integration Pipelines.

Cataloging Synapse Assets

Synapse can process data in Serverless, Spark and Dedicated SQL pools, or files held in Data Lake Storage Gen2. Data sources can hold multiple SQL or Lake databases and tables, or folders and files accordingly. This is challenging for finding the data assets, understanding them, and seeing how the objects link together within a Synapse Workspace.

Microsoft Purview solves this problem by cataloging the assets in a ‘data map’. Data Stewards can then add categorisations, metadata, contact details or other information to help data analysts or other business users to understand the data.

The first step is giving Purview the permissions to scan the components within the Synapse Workspace. This is achieved by a combination of assigning relevant RBAC and database roles to the Purview system managed identity. This allows Purview to scan and read the data components within the Workspace.

* High-level Purview -> Synapse connectivity
Image provided by Microsoft

Once connectivity is setup, we are in a position to register and scan the data sources. In our example we would register Azure Synapse Analytics and Azure Data Lake Storage Gen2 as the data sources. We would then setup Purview to scan each data source interactively or via a schedule to keep the data map up to date.

Example Purview Data Map source for Data Lake and Synapse Workspace

Each scan finds data assets within the registered sources which then get added to the data catalog collection. We can drill-down into the Data Folders and see the metadata for a particular file as shown below

Example Data Lake CSV file metadata retrieved from data asset scan

Configuring Purview integration into Synapse

So, how do we hook Purview up into the Synapse Workspace?

Well, it’s simple!

We just have to go into the Synapse Workspace and link it to Purview as show in the screenshot below. Note if the Purview Account is behind a firewall, then managed endpoints are required (networking not in scope for this blog. phew!)

Synapse Workspace Management Area

Searching the Purview data catalog from Synapse Studio

A good question is, “What are the benefits of combining Purview with Synapse?”.

I’d say my favourite feature is being able search the Purview catalog from within the Synapse Workspace. It means I can examine a curated catalog of documented data sources and data assets in one central place. I’ve even discovered data assets that I didn’t know existed or were interlinked with other components in my data landscape!

The following screenshot shows how I can search for csv files scanned by Purview from within the Synapse Workspace. It also tells me the last time the data source was scanned (bottom right)

Searching for scanned Purview data assets within Synapse

Tracking data lineage in Synapse Integration Pipelines

Synapse Integration Pipelines are commonly used for data ingestion from source to target systems. Data can then get manipulated and transformed in a Dedicated SQL Pool ready for reporting and analysis.

How do we check the lineage of data as it moves through the various ETL stages?

If Purview is connected to your Synapse Workspace, then any Copy Activity or Data Flow Activity automatically generates data lineage. This includes lineage on the data source, the activity used to transfer the data, and the target destination for where the data gets stored.

The screenshot below shows an example of data lineage of how a CSV file gets processed by Synapse Copy Activity. The end result is the data being copied into a SQL Dedicated Pool table.

Purview Copy Data Activity data lineage example

My Thoughts

Coming from a data background, combining Synapse and Purview is a big plus as I get a birds-eye view of the architectural landscape of my Synapse Workspace components. Working out how your data assets interlink is one of the biggest headaches for anyone involved in data.

This feels like a great solution!

Would you like to know more?

risual are currently building Azure Synapse Analytics data platforms for multiple clients right now! Contact us on the link below if your organisation wants to take advantage of a modern cloud technology approach to solving data problems.

About the author