We recently tackled a challenging ETL problem for a client who, in essence, handles the commute needs of an entire country :). The white paper below explains, at a high level, how the Azure platform was used to solve their problems.
As the country's most favored and widely used means of transportation, this client generates huge amounts of data every second. Every train and bus trip, and every rider getting from point A to point B, produces multiple data points; when this data is ingested into a relational database, we end up with multiple tables averaging 50 million rows each.
The data is first ingested into stage tables and then transformed through a series of steps: cleansing the data, running complex joins on multiple stage tables, and processing it against a standard set of reference tables to produce a handful of fact tables. At the end of all this, business analysts connect the fact tables to a visualization tool to draw insights about key performance indicators such as ridership, distance covered, revenue, and delays.
There were multiple challenges:
- The major challenge was the time it took to complete the cycle of extracting data from various sources, cleansing it, and transforming it through complex joins. The business wanted to bring the processing time for this ETL cycle down from a few hours to a few minutes.
- Legacy ETL tools were being used, resulting in a long cycle from data being generated at the source to being ingested, processed, analyzed, and visualized.
- Due to the volume of data involved, the client was often hitting the limits of a traditional OLTP database even with partitioning, careful indexing, and vertical scaling. Faster processing and results required more resources, which meant investing in them upfront.
Microsoft Azure's strong suite of data offerings proved to be a perfect fit for solving these problems. While SQL Server native tools like SSIS help ingest data from various sources, loading humongous amounts of data into a target Azure SQL Data Warehouse cluster was made faster by staging flat files in Blob Storage and loading them in parallel into Azure DW via PolyBase. Loading from disparate data sources into the warehouse at regular intervals was made easier by Azure Data Factory. Power BI was used to visualize the data for insights in near real time.
Solution Architecture Diagram and Details
- The data was received mainly in the form of flat files; this was the major data source
- Other data sources, such as existing relational databases, were also present
- The flat files generated were not in any standard format such as CSV or TSV. This can be challenging when loading into any target, because most tools break when fed non-standard file formats.
- Some of the files were processed on a stage VM using Unix tools like zcat and sed to convert them from a fixed-width, comma-separated, non-standard extended-ASCII format to a comma-separated UTF-8 format.
- Some of the files were processed using custom U-SQL code and Azure Data Lake Analytics.
- Data movement from other sources was scheduled or triggered on demand via Azure Data Factory.
- Processed files were uploaded in parallel to Azure Blob Storage using Python.
- External tables were created using PolyBase on a stage Azure SQL Data Warehouse cluster running at 1000 DWU capacity. Using CREATE TABLE AS SELECT (CTAS), data from the external tables was loaded into internal tables via PolyBase.
- After complex processing, the fact tables are loaded into a production Azure SQL Data Warehouse via Azure Data Factory.
- The stage Azure Data Warehouse is scaled down and paused to save costs until the next processing event.
- Power BI renders reports by connecting to the production Azure DW.
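To make the file-normalization step above concrete, here is a minimal Python sketch of converting fixed-width extended-ASCII records into UTF-8 CSV. The record layout and the cp1252 source encoding are illustrative assumptions, not the client's actual format (which, as noted, was handled with zcat and sed on a stage VM).

```python
import csv
import io

# Hypothetical fixed-width layout: (field name, width) pairs.
# The real files used a different, proprietary layout.
LAYOUT = [("trip_id", 10), ("stop_code", 8), ("fare_cents", 6)]

def parse_record(line: str) -> list[str]:
    """Slice one fixed-width record into trimmed field values."""
    fields, pos = [], 0
    for _name, width in LAYOUT:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields

def convert(src_bytes: bytes, src_encoding: str = "cp1252") -> str:
    """Decode extended-ASCII fixed-width records and emit CSV text
    that can be written out as UTF-8."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in LAYOUT])  # header row
    for line in src_bytes.decode(src_encoding).splitlines():
        if line.strip():
            writer.writerow(parse_record(line))
    return out.getvalue()
```

Doing the slicing explicitly like this (rather than relying on delimiter parsing) is what makes non-standard fixed-width files tractable for downstream loaders.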
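The parallel-upload step above can be sketched with a thread pool. This is a hedged illustration, not the client's actual script: the per-file upload function is injected so the sketch stays SDK-agnostic (in practice it would call something like `BlobClient.upload_blob` from the azure-storage-blob package), and the directory and container names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def build_upload_plan(local_dir: str, container: str) -> list[tuple[str, str]]:
    """Map each local CSV file to its target blob path (pure, easy to test)."""
    return [
        (str(path), f"{container}/{path.name}")
        for path in sorted(Path(local_dir).glob("*.csv"))
    ]

def upload_all(plan, upload_one, max_workers: int = 8) -> None:
    """Run upload_one(local_path, blob_path) concurrently for every file.

    Blob uploads are I/O-bound, so threads give near-linear speedup here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda job: upload_one(*job), plan))
```

Splitting the pure planning step from the I/O makes the parallelism easy to verify without touching Azure at all.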
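The PolyBase external-table and CTAS steps above follow a standard T-SQL pattern; a minimal sketch of the statements, built as Python strings (which would be executed against the warehouse via a driver such as pyodbc), is below. The table, data source, file format, and column names are illustrative, not the client's real schema.

```python
# Sketch of the PolyBase external-table + CTAS pattern. The external table
# maps files already sitting in Blob Storage; CTAS then pulls them into an
# internal distributed table in one parallel load. "AzureBlob" and
# "CsvFormat" stand in for a previously created EXTERNAL DATA SOURCE and
# EXTERNAL FILE FORMAT.

def external_table_sql(table: str, columns_sql: str, location: str) -> str:
    """T-SQL for a blob-backed external table in the ext schema."""
    return (
        f"CREATE EXTERNAL TABLE ext.{table} ({columns_sql}) "
        f"WITH (LOCATION = '{location}', DATA_SOURCE = AzureBlob, "
        f"FILE_FORMAT = CsvFormat);"
    )

def ctas_sql(table: str, distribution_col: str) -> str:
    """T-SQL CREATE TABLE AS SELECT: load the external table into an
    internal table, hash-distributed on the given column."""
    return (
        f"CREATE TABLE dbo.{table} "
        f"WITH (DISTRIBUTION = HASH({distribution_col})) "
        f"AS SELECT * FROM ext.{table};"
    )
```

Choosing a good hash-distribution column in the CTAS is what lets the subsequent complex joins run without excessive data movement between nodes.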
- The biggest benefit is the ability to scale Azure SQL Data Warehouse when capacity is needed. By provisioning more DWUs (Data Warehouse Units), processing times can be reduced dramatically.
- Once processing is done, the warehouse can be scaled down and even paused completely. The pause-and-resume feature of Azure SQL Data Warehouse is one of the first in the industry.
- Parallel loading from Azure Blob Storage makes it possible to load extremely large amounts of data into the warehouse in a matter of minutes.
- An entire ecosystem of tools like BCP, SSIS, SSRS, SSAS, and Power BI works seamlessly with Azure SQL Data Warehouse. Data Factory allows stitching various data sources and destinations together as a managed ETL tool.
- Data distribution across nodes can be controlled, and statistics can be created manually based on the kinds of queries that are executed most frequently.
- Dynamic Management Views allow examining the innards of query execution plans, the status of running queries, and more.
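The scale-up/scale-down and monitoring points above can be sketched as follows. `ALTER DATABASE ... SERVICE_OBJECTIVE` and the `sys.dm_pdw_exec_requests` DMV are real Azure SQL Data Warehouse features; the database name is a placeholder, connection handling (e.g. via pyodbc) is assumed, and pausing itself is done through the portal, PowerShell, or the REST API rather than T-SQL.

```python
# Sketch of the operational statements around a load window: scale the stage
# warehouse up before the load, watch running requests via a DMV, then scale
# back down afterwards to save cost.

def scale_sql(database: str, dwu: int) -> str:
    """T-SQL to change the warehouse's DWU service objective."""
    return f"ALTER DATABASE [{database}] MODIFY (SERVICE_OBJECTIVE = 'DW{dwu}');"

# Query the DMV for requests that are still in flight.
ACTIVE_REQUESTS_SQL = (
    "SELECT request_id, status, command "
    "FROM sys.dm_pdw_exec_requests "
    "WHERE status NOT IN ('Completed', 'Failed', 'Cancelled');"
)
```

A typical cycle would execute `scale_sql("StageDW", 1000)` before the nightly load and `scale_sql("StageDW", 100)` once `ACTIVE_REQUESTS_SQL` shows nothing left in flight.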
In the days to come we hope to deep dive into the kinds of problems we faced and how we solved them, and to write about best practices for Azure SQL Data Warehouse, PolyBase, Data Factory, Power BI, and more. Stay tuned!