Azure Synapse Analytics – Compute Pools

This is the second in a series of blogs about Microsoft’s key Azure cloud data analytics platform: Azure Synapse Analytics. For those unfamiliar with Synapse, it is essentially a product that unifies data integration, big data analytics and enterprise data warehousing.

The Synapse architecture is displayed below and shows how data flows from the sources on the left-hand side, is stored and transformed within Synapse, and finally produces an output for end-user consumption.

* Image taken from Microsoft website

Synapse Compute Pools

Synapse currently provides three different analytics runtimes, or ‘Pools’, for distinct workloads:

  • SQL Pool (Serverless and Dedicated Analytics Runtimes)
  • Spark Pool (Big Data Analytics)
  • Data Explorer Pool (Log and Telemetry Analysis)

This blog explains each pool’s usage and helps clients understand which is suitable for solving their business problems.

Serverless Pool

The Serverless Pool (also known as SQL on-demand) is a serverless distributed data processing system that allows big data to be analysed quickly. For example, the Serverless Pool can query huge CSV files held in a Data Lake and return results in seconds. Clients don’t need to be concerned with infrastructure setup: once the Synapse Workspace is created in the Azure Portal, the Serverless Pool can be accessed immediately.
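As an illustrative sketch of querying a CSV file straight from the Data Lake, a serverless T-SQL query might look like the following (the storage account, container and file path are placeholders):

```sql
-- Hypothetical example: query a CSV file directly from Azure Data Lake Storage
-- using the serverless SQL pool. The storage URL below is a placeholder.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/raw/orders.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS orders;
```

No table needs to exist beforehand; OPENROWSET reads the file in place, which is what makes ad-hoc data discovery so quick.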

The Serverless architecture consists of a Control Node and Compute Nodes and is underpinned by Azure Storage. The diagram below shows how the Control Node distributes user queries across multiple Compute Nodes using a Distributed Query Processing Engine (DQP). The Compute Nodes run parallel queries against the data held in Azure Storage and scale automatically depending on the resource requirements of the source query.

The following diagram displays the components of the Serverless Pool.

* Image taken from Microsoft Website

Serverless Pool Benefits

  • Data Discovery:  It’s easy to explore and query CSV, JSON or Parquet data directly from your data lakes.
  • No Infrastructure Setup:  The Serverless Pool comes as part of Synapse on Workspace creation; zero infrastructure setup is needed.
  • Cost Savings:  You only pay for the compute your queries actually use.
  • Saving Queries:  Previously written queries can be saved in the Workspace for future use.
  • Transformation:  Data can be transformed using T-SQL queries and saved back into the Data Lake, potentially for future Power BI usage.
  • Data Lakehouse:  A Logical Data Warehouse can be built over the Data Lake to avoid the overhead of additional data ingestion steps.
  • Scalability:  Serverless technology means the Pool can scale the amount of compute to match each query’s resource usage.
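The Transformation and Data Lakehouse points above can be sketched with CETAS (CREATE EXTERNAL TABLE AS SELECT), which writes query results back to the Data Lake. The data source, file format and paths below are hypothetical and would need to be created first:

```sql
-- Hypothetical sketch: transform raw data with T-SQL and persist the result
-- back to the Data Lake as Parquet. 'MyDataLake' and 'ParquetFormat' are
-- placeholder names for a pre-created EXTERNAL DATA SOURCE and FILE FORMAT.
CREATE EXTERNAL TABLE curated.DailySales
WITH (
    LOCATION = 'curated/daily_sales/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT CAST(order_date AS date) AS order_date,
       SUM(amount)              AS total_amount
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/raw/orders.csv',
    FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
) AS raw
GROUP BY CAST(order_date AS date);
```

The resulting Parquet files can then be queried by Power BI or by further serverless queries, which is the essence of the Logical Data Warehouse pattern.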

Dedicated SQL Pool

The Dedicated SQL Pool is used for intense data crunching and is optimised for ETL/ELT analytical workloads and large datasets. It uses one Control Node with multiple Compute Nodes for distributed processing, an approach commonly known as Massively Parallel Processing (MPP). The following diagram displays the components of the Dedicated SQL Pool.

* Image taken from Microsoft Website

The Dedicated SQL Pool performance level is scalable: Data Warehouse Units (DWUs) can be increased or decreased according to the ETL workload. Additionally, the SQL Pool can be paused when not in use, which helps reduce costs.
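As a sketch, scaling a Dedicated SQL Pool is a single T-SQL statement run against the workspace’s SQL endpoint (the pool name and performance level below are placeholders):

```sql
-- Hypothetical example: scale a Dedicated SQL Pool named 'SalesDW' to DW400c.
-- Run this against the master database of the workspace's SQL endpoint.
ALTER DATABASE SalesDW
MODIFY (SERVICE_OBJECTIVE = 'DW400c');
-- Pausing and resuming the pool is done from the Azure portal, PowerShell or
-- the Azure CLI rather than from T-SQL.
```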

Dedicated SQL Pool Benefits

  • Creation:  It’s very easy to create one or more Dedicated SQL Pools using the Synapse Workspace.
  • Scalability: The Dedicated Pool can be scaled up or down according to usage.  
  • Cost Savings:  It can be paused when not used to save costs.
  • Optimisation:  The database uses columnar (columnstore) storage for its relational tables, which gives better compression rates and improves query performance.
  • Compatibility:  The Dedicated Pool supports T-SQL, the common development language across SQL Server products.

Spark Pool

Apache Spark is popular with Data Scientists and Data Engineers for analysing big data and applying machine learning. It’s simple to create an Apache Spark Pool via the Synapse Workspace, and notebooks are supported for writing C# (.NET for Apache Spark), Scala, SQL or Python code. The architecture consists of one master node and multiple worker nodes in a cluster.
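As a sketch of a typical notebook cell (the storage paths and column name are placeholders, and the `spark` session is provided automatically by a Synapse notebook attached to a Spark pool):

```python
# Hypothetical notebook cell: read Parquet from the Data Lake, aggregate it,
# and write the result back. Paths and the 'region' column are placeholders.
df = spark.read.parquet("abfss://sales@mydatalake.dfs.core.windows.net/orders/")

summary = (df.groupBy("region")
             .count()
             .orderBy("region"))

summary.write.mode("overwrite").parquet(
    "abfss://sales@mydatalake.dfs.core.windows.net/curated/orders_by_region/")
```

The same pipeline could equally be written in Scala or C# against the same Spark pool, which is what makes the notebooks approachable for mixed teams.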

The following diagram displays the components of the Spark Pool.

* Image taken from Microsoft Website

Spark Pool Benefits

  • Creation:  It’s very easy to create one or more Spark Pools using the Synapse Workspace.
  • Libraries: The Spark Pool has high-level libraries that support streaming data, machine learning, SQL queries and graph processing.  
  • Compatibility:  Spark Pool notebooks support various programming languages for developers.
  • Scalability:  The Spark Pool can have auto-scale enabled, so nodes can be added or removed accordingly.

Data Explorer Pool

The Data Explorer Pool is used for real-time analysis of huge volumes of streaming data. Sources range from IoT devices to websites and other applications, and the Pool can unlock insights from log and telemetry data.

The following diagram displays the components of the Data Explorer Pool.

* Image taken from Microsoft Website

Data Explorer Pool Benefits

  • Ingestion:  Data can be easily ingested from sources like Event Hub, Data Lake or Kafka.
  • Less Complexity:  There is no need to build complex data models or scripts to consume the data.  
  • Speed:  Data is optimised on ingestion and available almost immediately, allowing demanding queries to run against the streaming data.
  • Kusto:  The Kusto Query Language (KQL) is available to explore telemetry or time-series data, allowing free-text searches within semi-structured data.
  • Scalability:  The Data Explorer Pool has compute and storage that can scale automatically to enable analytics on huge amounts of data (Petabytes!).
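As a sketch of what KQL looks like in practice, the query below counts errors per hour over the last day (the table and column names are placeholders):

```kusto
// Hypothetical example: hourly error counts from a telemetry table 'AppLogs'.
// The table and columns ('Timestamp', 'Level') are placeholder names.
AppLogs
| where Timestamp > ago(1d)
| where Level == "Error"
| summarize ErrorCount = count() by bin(Timestamp, 1h)
| order by Timestamp asc
```

The pipe-based style reads top to bottom, which tends to make exploratory queries over logs much quicker to write than the equivalent SQL.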

Would you like to know more?

risual is currently building Azure Synapse Analytics data platforms for multiple clients. Contact us on the link below if your organisation wants to take advantage of a modern cloud technology approach to solving data problems.

About the author