Transform

The second phase of an ETL process involves transforming the extracted data so that it is cleansed and meets a set of business requirements. Data is scrubbed of dirty records and prepared so that it fits the schema of the destination data model. Transformations are split into multiple activities for optimal data pipeline performance. This modular approach allows transformation activities to run in parallel and makes it easier to troubleshoot failed tasks. It also allows data engineers to easily implement additional transformation activities as new business requirements are added.

Depending on the complexity of the transformations, data may be loaded into one or more additional staging layers that serve as intermediary stages for partially transformed data. One example of this is splitting the different phases of data transformation into bronze, silver, and gold staging layers, described in the following list and illustrated in the sketch after it.

  • The bronze layer represents raw data ingested from different sources in a variety of formats. Some filtering may have happened to get the data to this stage, but data in the bronze layer is only minimally transformed.
  • The silver layer represents a more refined view of the data. Silver layer data has been scrubbed of dirty records and contains entities made up of fields from multiple bronze layer datasets combined with join or merge operations. Data in the silver layer is typically used for machine learning activities since it is cleansed but not yet summarized.
  • The gold layer represents aggregated datasets that are used by reporting applications. Calculations such as weekly sales and month-over-month averages are included in gold layer datasets.
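
To make these layers concrete, the following is a minimal PySpark sketch of moving sales data from bronze to silver to gold. The file paths, dataset names, and columns (sales_bronze, customers_bronze, order_date, amount, and so on) are illustrative assumptions, not a prescribed schema.

    # Minimal medallion-style sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

    # Bronze: raw data as ingested, with minimal transformation.
    sales_bronze = spark.read.json("/lake/bronze/sales/")
    customers_bronze = spark.read.json("/lake/bronze/customers/")

    # Silver: scrub dirty records and join related bronze datasets into one entity.
    sales_silver = (
        sales_bronze
        .dropna(subset=["order_id", "customer_id", "order_date", "amount"])
        .dropDuplicates(["order_id"])
        .join(customers_bronze.select("customer_id", "region"), "customer_id")
    )
    sales_silver.write.mode("overwrite").parquet("/lake/silver/sales/")

    # Gold: aggregate for reporting, for example weekly sales by region.
    weekly_sales_gold = (
        sales_silver
        .withColumn("week", F.weekofyear(F.to_date("order_date")))
        .groupBy("region", "week")
        .agg(F.sum("amount").alias("weekly_sales"))
    )
    weekly_sales_gold.write.mode("overwrite").parquet("/lake/gold/weekly_sales/")

In this sketch, the silver output is the cleansed but unsummarized dataset a machine learning workload would consume, while only the aggregated gold output is exposed to reporting applications.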

As in the extract phase, transformation activities can be built with GUI-based or code-based technologies. SQL Server Integration Services (SSIS) is an ETL tool commonly used in traditional, on-premises BI solutions. SSIS provides many connectors and transformation activities out of the box that allow developers to build sophisticated data engineering pipelines with a GUI. ADF provides a similar development experience for cloud-based ETL, offering a drag-and-drop experience with several out-of-the-box data transformation activities that can be chained together graphically. The core components of how ADF orchestrates ETL pipelines will be discussed in the section “Control Flows and Data Flows” later in this chapter, but as far as transformations are concerned, ADF can execute transformation activities in four ways:

  • External Compute Services—ADF can be used to automate the execution of custom-coded transformation activities that are hosted externally. These activities can be developed in several different languages and hosted on services such as Azure Databricks and Azure Functions. Stored procedures hosted on Azure SQL Database or Azure Synapse Analytics can also be invoked by ADF using the Stored Procedure Activity. Transformations that are developed from scratch give engineers more flexibility in how they implement business rules and handle different scenarios. ADF allows engineers to pass either static, predefined values or results from previous steps in a data pipeline as parameters to a custom-developed transformation activity so that it can transform data more dynamically; a sketch of starting such a parameterized pipeline run from code follows this list.
  • Mapping Data Flows—ADF gives data engineers the option to build no-code transformation pipelines with mapping data flows. These are very similar to data flow activities in SSIS, giving data engineers the ability to create transformation activities with a GUI. The benefit of a no-code solution like this is that the code performing the transformations and the compute running that code are abstracted away from the data engineer. This can greatly improve operational productivity by allowing engineers to focus purely on implementing business logic instead of optimizing code and compute infrastructure. Just as with transformation activities hosted on external compute services, ADF can pass static parameters or results from previously executed activities as parameters to mapping data flows to transform data dynamically.
  • Power Query—Previously known as wrangling data flows, the Power Query activity allows data engineers to perform no-code transformations on data in ADF. Power Query is a native component of Power BI and gives analysts the ability to perform transformation activities in a scalable manner. Power Query in ADF enables citizen data analysts to create their own pipelines without needing to know how to build sophisticated data engineering pipelines.
  • SSIS—Organizations have been building BI solutions for many years now, and if a solution involves SQL Server, there is probably an SSIS component involved. Rebuilding existing SSIS projects as ADF pipelines can be very time consuming if the existing SSIS footprint is sophisticated, which can be a blocker for organizations migrating to Azure. To alleviate these concerns, customers can choose to migrate their SSIS projects to ADF. Data engineers can use the Execute SSIS Package activity in their data pipelines, either as a standalone activity or chained to other ADF-native activities. Running an SSIS project in ADF requires a special compute infrastructure known as the Azure-SSIS integration runtime. Chapter 5 will discuss the Azure-SSIS integration runtime and other types of integration runtimes in further detail.
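
As an illustration of the parameter passing described above, the following is a hedged Python sketch that starts an ADF pipeline run and passes it a parameter using the azure-identity and azure-mgmt-datafactory packages. The subscription, resource group, factory, pipeline, and parameter names are placeholders, and the pipeline itself (for example, one containing a Stored Procedure Activity or a mapping data flow) is assumed to already exist.

    # Sketch only: resource names, the pipeline name, and parameters are assumptions.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    subscription_id = "<subscription-id>"      # placeholder
    resource_group = "rg-data-platform"        # hypothetical resource group
    factory_name = "adf-etl-demo"              # hypothetical data factory

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Start a run of an existing pipeline and hand it a parameter that its
    # activities (a Stored Procedure Activity, mapping data flow, etc.) can use.
    run = client.pipelines.create_run(
        resource_group,
        factory_name,
        "TransformSalesPipeline",              # hypothetical pipeline name
        parameters={"WatermarkDate": "2024-01-01"},
    )
    print(f"Started pipeline run {run.run_id}")

In practice, a value such as WatermarkDate would more often come from a trigger or from the output of a previous activity rather than a hard-coded literal.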
