Excel Data Pipelines

We need to talk about Excel

I began this blog series with a preface, Laying Data Pipelines in the Cloud, outlining my intention to run through the design and build of a number of different data pipelines in the cloud, in a variety of use cases.

Let’s kick off by discussing Excel.

Full disclosure, I’m an Excel fan! I’ve used it throughout my career, as a user and a developer: I’ve developed countless Excel VBA apps (unmanaged and managed code!), supported complex financial and analytical Excel data models. I’ve even trained hundreds of graduates how to use the advanced features of Excel.

However, in Data Engineering use cases, Excel spreadsheets have always been problematic.

Sure, if you’re working with an on-premises SQL Server, then SQL Server Integration Services (SSIS) usually makes light work of automating Excel data ingestion.

But what about working in a cloud data platform environment like Azure? How does it handle ingesting Excel data?

Well let’s look at this in the context of a hypothetical data pipeline requirement.

Business Requirement

An organisation requires a new data source to be integrated into their existing business intelligence reporting system. This system is built on the Azure Data Platform, fully Platform-as-a-Service (PaaS) , with business intelligence reports serviced by Power BI.

The required data is captured by a third party supplier who record the information on their IoT devices, but supply their data to the organisation in the following ways:

  • Authorised users can access an interactive portal which displays the live data from these devices
  • Manually download data from the portal via an Excel spreadsheet

Unfortunately data cannot be accessed from an API, or downloaded in any other format.

It is possible, however, to download the Excel spreadsheet on a daily schedule, and to send it to a designated email address.

So what are our options ?

Let’s assume the Excel spreadsheet is configured to be sent to an email address on a daily basis. Firstly, how do we get this spreadsheet into Azure storage from an email ?

That’s a straightforward activity for Power Automate to handle (and if you want to productionise this activity, you can do this in Azure using Logic Apps).

OK, so now the spreadsheet is in Azure storage, it should be straightforward to just ingest this into the organisation’s business intelligence reporting system, specifically the PaaS database/data warehouse that stores the business data, right…?

Well No, or Yes, depending on when you asked me!

Pre-June 2020: No

In a typical PaaS scenario, to get Excel data into a PaaS database, you’d need to convert the data into CSV format FIRST, and so I would have suggested these options; use:

  • Power Automate/Logic Apps
    • An Excel spreadsheet can be converted into CSV using the Create CSV Table action (Data Operations category) BUT this only works if you are opening a spreadsheet which has its data stored as an Excel Table
  • Azure Functions
    • Powershell or Python code can easily convert an Excel spreadsheet to CSV format. You can deploy such code to Azure Functions to create serverless apps that can be automated via Azure Data Factory (ADF).

Post-June 2020: Yes

You can now ingest an Excel spreadsheet directly into a database using Azure Data Factory!

* NOTE: Whilst it is possible to create an SSIS package that ingests Excel into a SQL database/data warehouse, and then lift and shift that on to Azure Data Factory via Azure SSIS Integration Runtime to perform this activity, the goal here is to go Azure ADF native.

Wrap up

I’ve referred to and provided links for many Azure services in this blog, all of which I will be discussing in further detail in future posts. In particular I’ll also run though an actual implementation of the hypothetical business requirement in this blog.

For now, I’ll leave you by framing the business requirement back to the premise in my first post:

  1. you begin with something, some input
    • An Excel spreadsheet
  2. you need an output, of some kind
    • Pre-June 2020, CSV output
    • Post-June 2020, data added/appended to a database table
  3. you transform your input to your output
    • Power Automate/Logic Apps and Azure Functions for the CSV output
    • Azure Data Factory natively for data ingestion direct to database table

Further information

For further details about the new native Excel connector for Azure Data Factory, look here: https://docs.microsoft.com/en-us/azure/data-factory/format-excel.

Would you like to know more about how cloud services can help improve your organisational data strategy? Then get in touch with risual at https://www.risual.com/contact/

Watch out for the next post soon, and keep staying safe people!

About the author