Simple explanation of Azure Data Factory

This article provides a business-friendly view of what Azure Data Factory is and where it fits in the data-tools landscape.

Businesses today generate more data than ever, and the volume keeps growing. They rely on a variety of software to run their operations, and each system produces its own data. That data arrives in different ways, from real-time feeds to large batches at regular intervals, and it can be structured, semi-structured, or unstructured.

Businesses want to turn this data into insights as quickly as possible. To do that, the data sources have to be combined and integrated into common storage, so that data analysts and data scientists can explore the data and find those insights.

As mentioned above, data comes from many different sources. These can be SaaS applications such as Google Analytics, Facebook Ads, Salesforce, and Zoho; on-premises databases such as Oracle and SQL Server; or other cloud platforms such as AWS and GCP.

One of the biggest challenges is that ordinary integration tools often lack the functionality to connect all of these sources and combine their data into common storage. This is where Microsoft Azure Data Factory comes into the picture and eases the process.

What is Azure Data Factory (ADF)?

Microsoft describes ADF as a fully managed, serverless data integration service for ingesting, preparing, and transforming all of your data.

Let’s go into more detail:

  1. ADF is a resource in an Azure subscription, and Microsoft manages the data factory for us. That means we don’t need to worry about installing the application or operating system, or about the scalability, availability, and security of the underlying infrastructure. That’s why it is called a fully managed service.
  2. ADF is serverless, which means resources scale with the workload without any infrastructure management on our side.
  3. ADF supports 90+ connectors (integrations) for ingesting data from various sources, and Microsoft is continually adding new ones. It is easy to connect to other major clouds such as AWS and GCP, as well as to ingest data from on-premises databases. We can build ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines in a code-free environment.
  4. In addition to the above, ADF helps orchestrate the execution of machine learning models (using Azure ML Studio or Databricks) and the publishing of dashboards using Power BI.
  5. Finally, ADF helps us monitor our data pipelines.
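For readers who want to see the difference between the ETL and ELT patterns mentioned in point 3, here is a minimal sketch in plain Python. It does not use ADF or any Azure service; the sample rows, the dictionary standing in for a warehouse, and all function names are made up for illustration only.

```python
# Toy illustration of ETL vs ELT using in-memory data; no Azure services involved.
# All names and sample rows are hypothetical.

RAW_ORDERS = [
    {"order_id": 1, "amount": "120.50", "country": "us"},
    {"order_id": 2, "amount": "80.00", "country": "de"},
]


def extract():
    """Pretend to read rows from a source system."""
    return [dict(row) for row in RAW_ORDERS]


def transform(rows):
    """Convert amounts to numbers and normalise country codes."""
    return [
        {
            "order_id": r["order_id"],
            "amount": float(r["amount"]),
            "country": r["country"].upper(),
        }
        for r in rows
    ]


def run_etl():
    """ETL: transform before loading, so only clean rows reach the store."""
    warehouse = {}
    warehouse["orders"] = transform(extract())
    return warehouse


def run_elt():
    """ELT: load the raw rows first, then transform inside the store."""
    warehouse = {}
    warehouse["orders_raw"] = extract()
    warehouse["orders"] = transform(warehouse["orders_raw"])
    return warehouse


if __name__ == "__main__":
    print(run_etl())
    print(run_elt())
```

Both functions end up with the same clean table. The difference is that the ELT version also keeps the raw copy in the target store and does the transformation there, which is the usual reason to choose it when the target has plenty of compute.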

Things to keep in mind if you are thinking of using ADF

  • ADF is not a data storage solution. It only provides the compute to process the data, and we need to store the data itself in Azure Storage or in a database.
  • ADF is not a data migration tool. We can use the Azure Database Migration Service for migrating our data from one database to another.
  • Complex data transformation is challenging in ADF because its support is limited compared to other code-free ETL tools. We can use Databricks or HDInsight for complex transformations, and ADF can still orchestrate that workflow. We may expect some future improvement from Microsoft in this area.
  • ADF is not designed for streaming datasets; for those we need to use Azure Event Hubs or other components. ADF is mainly used for loading and transforming data periodically.
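The point above about handing complex transformations to Databricks while ADF orchestrates the workflow can be sketched as follows. This is an illustration only: the dictionary is modeled on the general shape of an ADF pipeline definition (activities with a type and a dependsOn list), but every pipeline, dataset, linked service, and notebook name is hypothetical, and the definition has not been validated against the service. The run_order helper is our own and is not part of ADF; it just shows what "orchestration" means in terms of ordering.

```python
import json

# Hypothetical pipeline: copy raw data first, then run a Databricks notebook.
# The structure imitates an ADF pipeline definition; all names are made up.
pipeline = {
    "name": "DailySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawSales",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "OnPremSalesTable", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "RawSalesInLake", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},
                    "sink": {"type": "ParquetSink"},
                },
            },
            {
                "name": "TransformInDatabricks",
                "type": "DatabricksNotebook",
                # Only start once the copy step has succeeded.
                "dependsOn": [
                    {
                        "activity": "CopyRawSales",
                        "dependencyConditions": ["Succeeded"],
                    }
                ],
                "linkedServiceName": {
                    "referenceName": "DatabricksWorkspace",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {"notebookPath": "/pipelines/clean_sales"},
            },
        ]
    },
}


def run_order(pipeline_def):
    """Return activity names in an order that respects each dependsOn list."""
    activities = pipeline_def["properties"]["activities"]
    pending = {
        a["name"]: {d["activity"] for d in a.get("dependsOn", [])}
        for a in activities
    }
    ordered = []
    while pending:
        done = set(ordered)
        ready = sorted(name for name, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("circular or unknown dependency")
        for name in ready:
            ordered.append(name)
            del pending[name]
    return ordered


if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
    print(run_order(pipeline))
```

The heavy transformation logic lives in the Databricks notebook, while the pipeline only says what runs, in which order, and under which condition. In practice such a pipeline is usually assembled in the ADF visual designer rather than written by hand.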

This is just a simple introduction to ADF.