Azure Data Factory: Managed Airflow

In February 2023, Microsoft introduced Apache Airflow within Azure Data Factory, adding a way to orchestrate Python-based workflows at scale on Azure. Managed Airflow is a managed orchestration service for Apache Airflow that simplifies creating and managing Airflow environments. It also integrates Apache Airflow natively with Azure Active Directory, providing secure single sign-on (SSO).

Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as “workflows.”

In Airflow, a Directed Acyclic Graph (DAG) is the collection of all the tasks to be run, organised to reflect their relationships and dependencies. A DAG is defined in a Python script, which expresses the DAG’s structure as code.
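
To make the idea of tasks plus dependencies concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`), not Airflow's own API. The task names and the `run_order` helper are hypothetical and chosen purely for illustration; in a real Airflow DAG the same relationships would be declared with Airflow operators.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def run_order(deps):
    """Return one valid execution order for the task graph.

    TopologicalSorter raises graphlib.CycleError if the graph contains a
    cycle, which is the "acyclic" requirement of a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Any order it returns starts with `extract` and ends with `load`, because every task appears only after the tasks it depends on.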

Azure Data Factory (ADF) helps data engineers bring existing Apache Airflow workflows (DAGs) into ADF, where they run on a fully managed Airflow environment (also called the Airflow integration runtime). It combines Azure Data Factory’s trustworthiness, scale, security, and ease of administration with Apache Airflow’s extensibility and community-led updates, as a managed offering on Azure.

Use

Azure Data Factory delivers Pipelines to orchestrate data processes, whereas Managed Airflow offers Apache Airflow-based Python DAGs (python code-centric authoring) for defining the data orchestration process.

  • Managed Airflow is usually the better fit for people who have a strong Apache Airflow background, are already using Apache Airflow, or prefer to write and manage Python-based DAGs.
  • Pipelines are preferred when there is no wish to write and manage Python-based DAGs for data process orchestration.

By introducing Managed Airflow, Azure Data Factory offers multi-orchestration capabilities that cover visual, code-centric, and open-source (OSS) orchestration requirements.

Benefits

Managed Airflow lets you use existing Airflow and Python skills to create data workflows within Azure Data Factory without managing the underlying infrastructure for scalability, availability, and security.

Features

  1. Automatic Airflow setup – Quickly set up Apache Airflow by choosing an Apache Airflow version when you create a Managed Airflow environment. Azure Data Factory Managed Airflow sets up Apache Airflow for you using the same open-source code you can download on the Internet and provides the same familiar User Interface once set up and launched.
  2. Automatic scaling – Automatically scale Apache Airflow workers by setting the minimum and maximum number of workers in your environment. Azure Data Factory Managed Airflow monitors the workers in your environment. It uses its autoscaling component to add workers to meet demand until it reaches the maximum number of workers you defined.
  3. Built-in authentication – Enable Azure Active Directory (AAD) role-based authentication and authorization for your Apache Airflow web server by defining AAD RBAC access control policies, which gives Azure users single sign-on (SSO).
  4. Built-in security – Metadata is automatically encrypted with Azure-managed keys, so your environment is secure by default. Double encryption with a Customer-Managed Key (CMK) is also supported.
  5. Streamlined upgrades and patches – Azure Data Factory Managed Airflow periodically provides new versions of Apache Airflow. The ADF Managed Airflow team automatically updates and patches minor versions.
  6. Workflow monitoring – View Apache Airflow logs and Apache Airflow metrics in Azure Monitor to identify Apache Airflow task delays or workflow errors without needing additional third-party tools. ADF Managed Airflow automatically sends environment metrics and, if enabled, Apache Airflow logs to Azure Monitor.
  7. Azure integration – ADF Managed Airflow supports open-source integrations with Azure Data Factory pipelines, Azure Batch, Azure CosmosDB, Azure Key Vault, ACI, ADLS Gen2, Azure Kusto, as well as hundreds of built-in and community-created operators and sensors.
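
The minimum and maximum worker settings from the automatic scaling feature can be pictured as simple bounds on whatever worker count demand calls for. The sketch below is an assumption made for illustration only: the actual autoscaling policy of Managed Airflow is not described here, and `target_workers`, `queued_tasks`, and `tasks_per_worker` are hypothetical names.

```python
import math


def target_workers(queued_tasks, tasks_per_worker, min_workers, max_workers):
    """Clamp the worker count implied by demand to the configured range.

    Assumption: demand is modelled as queued tasks divided by a fixed
    per-worker capacity. The real service may use a different signal.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    needed = math.ceil(queued_tasks / tasks_per_worker)
    # Never drop below the minimum and never exceed the maximum.
    return max(min_workers, min(needed, max_workers))


# Hypothetical environment: between 1 and 5 workers, 4 tasks per worker.
print(target_workers(0, 4, 1, 5))
print(target_workers(10, 4, 1, 5))
print(target_workers(100, 4, 1, 5))
```

However large the queue grows, the result never exceeds the configured maximum, which is the behaviour the feature description promises.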

Limitations

Microsoft announced the following limitations:

  1. Managed Airflow will become available in other regions by general availability (GA).
  2. Data Sources connecting through Airflow should be publicly accessible.
  3. Blob Storage behind VNet is not supported during the public preview.
  4. DAGs inside a Blob Storage in VNet/behind Firewall are currently not supported.
  5. Azure Key Vault isn’t supported in linked services used to import DAGs.
  6. Airflow officially supports Blob Storage and ADLS, with some limitations.

About the author