Oops! Something went wrong while submitting the form.
August 4, 2020
ETL process overview: design, challenges and automation
Learn all about the ETL process. From extracting, transforming and loading basics to architecture and automation.
The Extract-Transform-Load process (ETL for short) is a set of procedures in the data pipeline. It collects raw data from its sources (extracts), cleans and aggregates data (transforms) and saves the data to a database or data warehouse (loads), where it is ready to be analyzed. The following diagram represents the data flow through the different ETL stages:
The ETL process is engineered in such a way that your data pipelines and analytics provide business value to your company. A well-engineered ETL pipeline has several benefits:
Information clarity. During ETL transformations, data is cleaned and joined across sources before it is saved in the database, where you can then analyze it. These operations allow you to work with clear information and disambiguate unclear, raw data.
Information completeness. A well-designed ETL pipeline includes all of the business sources which are relevant to your operations in a single place (the destination data warehouse/database). All of the information is complete, so there are no missing puzzle pieces.
Information quality. ETL processes validate data at extraction or correct/discard data at transformation. This ensures that the quality of data is always controlled before it is analyzed, thus increasing trust in the analysis and giving you the confidence to use data for business intelligence and data-driven decision making.
Information velocity. ETL processes can be designed to trigger the entire ETL pipeline whenever new data arises in the sources or when existing data is changed. You can therefore control the ‘freshness’ of the data, as well as the speed at which you make decisions based on signals in the outside world.
Novel business insights. The entire ETL process brings structure to your company’s information. This allows you to spend more time analyzing novel questions and acquiring new insights, rather than trying to perform procedures to get valuable data at each stage.
The ETL process is guided by engineering best practices. However, setting up your data pipelines accordingly can be tricky. Here, we dive into the logic and engineering involved in setting up a successful ETL process:
Extract explained (architectural design and challenges)
Transform explained (architectural design and challenges)
Load explained (architectural design and challenges)
ETL vs ELT
How Keboola automates much more than just your ETL process
The “Extract” stage of the ETL process involves collecting data from its data sources. This data will ultimately lead to the rows and columns of your analytic database.
Traditionally, extraction meant getting data from Excel files and Relational Management Database Systems, as these were the primary sources of information for businesses (e.g. purchase orders written in Excel). With the increase in Software as a Service (SaaS) applications, the majority of businesses now find valuable information in the apps themselves, e.g. Facebook for advertising performance, Google Analytics for website utilization, Salesforce for sales activities, etc. Today, data extraction is mostly about obtaining information from an app’s storage via APIs or webhooks.
Extract architecture design
When designing the software architecture for extracting data, there are 3 possible approaches to implementing the core solution:
Full-extraction. Each extraction collects all data from the source and pushes it down the data pipeline.
Incremental extraction. At each new cycle of the extraction process (e.g. every time the ETL pipeline is run), only the new data is collected from the source, along with any data that has changed since the last collection. For example, data collection via API.
Source-driven extraction. The source notifies the ETL system that data has changed, and the ETL pipeline is run to extract the changed data. For example, data collection via webhooks.
Below are the pros and cons of each architecture design, so that you can better understand the trade-offs of each ETL process design choice:
The data extraction part of the ETL process poses several challenges. A lot of the problems arise from the architectural design of the extraction system:
Data latency. Depending on how fast you need data to make decisions, the extraction process can be run with lower or higher frequencies. The tradeoff is between stale or late data at lower frequencies vs higher computational resources needed at higher frequencies.
Data volume. The volume of data extraction affects system design. The solutions for low-volume data do not scale well as data quantity increases. With large amounts of data, you need to implement parallel extraction solutions, which are complex and difficult to maintain from an engineering perspective.
Source limits. You need to be aware of the source limitations when extracting data. For example, some sources (such as APIs and webhooks) have imposed limitations on how much data you can extract simultaneously. Your engineers need to work around these barriers to ensure system reliability.
Datavalidation. Either you validate data at extraction (before pushing it down the ETL pipeline), or at the transformation stage. When validating data at extraction, check for missing data (e.g. are some fields empty, even though they should have returned data?) and corrupted data (e.g. are some returned values nonsensical, such as a Facebook ad having -3 clicks?).
Orchestration. Based on your choices of data latency, volume, source limits and data quality (validation), you need to orchestrate your extraction scripts to run at specified times or triggers. This can become complex if you implement a mixed model of architectural design choices (which people often do in order to accommodate for different business cases of data use).
Monitoring. You need to monitor your extraction system on several levels:
Resources. How much computational power and memory is allocated?
Errors. Have there been any errors which have caused missing or corrupted data?
Reliability. Have the extraction scripts run at all?
Disparate sources. Working with different data sources causes problems with overhead and management. The variety of sources increases the data management surface by increasing the demands for monitoring, orchestration and error fixes.
With the increasing dependency on third-party apps for doing business, the extraction process must address several API challenges as well:
Variability & inconsistency. Every API is designed differently, whether you are using apps from giants like Facebook or small software companies. The variability and inconsistencies between their designs will cause your engineers to spend a lot of time on implementing the extractors for each source.
Lack of documentation. APIs are notoriously lacking in proper documentation. Every time your use case deviates from the core minimum (which is documented), engineering hours need to be allocated to understand how to execute on your vision for the extractor within the API.
(Breaking) Changes. APIs change frequently and break your extractors. Whether it’s because of version updates or expiring access tokens, APIs require a lot of monitoring and maintenance hours.
High complexity. APIs expose data in multiple locations (endpoints) and complex aggregations. This is especially true for SaaS apps, which cover multiple business use cases and collect a lot of different data. You need to budget engineering hours for navigating this complexity.
The “Transform” stage of the ETL process takes the data that has been collected at the extractor stage and changes (transforms) it before saving it to the analytic database. There are multiple transformations:
Data cleaning. Data cleaning involves identifying suspicious data and correcting or removing it. For example:
Remove missing data
Recode missing data into NULLs or 0s or “#NA”
Recode different versions of the same data to a common denominator. For example, “M”, 1, “male”, “masculine” to “Male”
One-hot encode categorical data
Convert data types to standard forms. For example, convert DateTime objects and Unix timestamps to the same data type
Data enriching. Data enriching involves adding new information to the raw data already collected.
Join different sources. For example, create customer information blobs, which join information from a variety of purchasing apps.
Deduplication. Identify which information is duplicated and remove the copycat.
Calculated fields. For example, calculate the lifetime value of the customers at import, or the number of their consecutive purchases.
In reality, though, the majority of work is done via data cleaning. If you would like to dig deeper into the intricacies of data cleansing, check out The Ultimate Guide to Data Cleaning.
Transform architecture design
When designing the architecture of data transformation, there are multiple things to consider:
Order of operations. The order in which transform rules are applied to incoming data can affect the end result. For instance, imagine we have two transform scripts. The first one processes data to compute the consecutive number of purchases made by a customer. The second transformation process drops purchase information from the data pipeline unless there is a shipping address. If we drop the row for a customer with a missing shipping address before we calculate the consecutive order, the end result is going to be two different purchase orders.
Business logic. Transforms often implement business logic, such as calculating a customer’s lifetime value or their consecutive orders. Your architecture needs to be designed so it can handle missing or corrupt data and transform orders, thus supporting business logic implementation.
Algorithmic efficiency. Because transforms go through the extracted data, they sometimes need to handle heavy loads. Algorithmic efficiency in the design of transforms can make a difference in the time required for a transform to execute, or whether it will time-out your system. Take this simple example: implementing a dictionary solution for 1M rows transformation vs a for loop results is a difference of a couple of orders of magnitude.
Quality assurance. Transformations are often the place where data is validated against a set of criteria (e.g. do not import customer information unless we have their email) and monitored for data quality. At this stage, a lot of ETL processes are designed with alerts to notify developers of errors, as well as rules, which are preventing data from passing on the data pipeline unless it matches certain criteria.
There are several challenges when dealing with transformations:
Lack of business logic. Oftentimes, it becomes clear that there is a lack of business logic given the data we receive from the extract phase. As an example: the business rule for determining a new customer is the date of their first product purchase. But what do we do for customers who paid for shipping, but not for a product?
Changing business logic. As company operations evolve, business definitions change. Even small changes to business logic can have multiple effects on transforms, especially if the change in one transform affects others which depend on it.
Hard business logic. Sometimes, implementing something trivial from a business perspective can be challenging from an engineering perspective.
Changing source data. APIs can change their response payloads, data can become corrupted, or your system might migrate to a new SaaS… so you need to implement a different transform logic. In addition to the decoupling issues, changing source data requires constant monitoring and maintenance of the transform stage.
Scaling complexity. Transforms present challenges when the ETL processes evolve. The more transforms you implement, the harder it is to keep track of their mutual effects.
“Load” involves taking data from the transform stage and saving it to a target data store (relational database, NoSQL data store, data warehouse, or data lake), where it is ready for analysis.
Load architecture design
There are three possible designs for architecting data being loaded into a destination warehouse/database. Here, we explore them alongside their pros and cons:
There are several challenges in the loading stage of the data pipeline:
Order of insertion. The order of insertion can affect the end result. If a table has a foreign key constraint, it might prevent you from inserting data into that table (and would probably skip it), unless you first insert matching data in another table.
Schema changes. The schema represents what the destination (database or data warehouse) expects the data to look like. As your business evolves, the schema is often updated to reflect changes in business operations. The resulting need for schema updates can lead to a waste of engineering hours, as well as unintended consequences for the entire system (e.g. data quality validations might break when the form of data breaks).
Within the ETL (traditional process), data is extracted to the staging area (either in-memory data structures or temporary databases) before it is transformed and loaded into the analytic (OLAP) database for analysis.
ELT (extract-load-transform) takes advantage of the new data warehousing technologies (e.g. BigQuery, Amazon Redshift, Snowflake…) by loading the raw data into a data warehouse or data lake first and transforming the data on the fly when it is needed for analysis. ELT is preferred for operations working with extremely large volumes of data or with real-time data.
The main conceptual difference is the final step of the process: in ETL, clean data is loaded in the target destination store. In ELT, loading data happens before transformations - the final step is transforming the data just before data analysis.
Even though the end result is the same (i.e. data is ready for analysis), there is a tradeoff between ETL and ELT, which needs to be made clear:
Regardless of your preference (ETL or ELT), there are several architectural considerations to keep in mind.
Which ETL tool should you choose?
There are plenty of ETL tools which automate, accelerate, and take care of your ETL processes for you.
Their main value is that they accelerate the data pipeline delivery (you can shorten the time it takes to get insights from months to weeks), automatize ETL processes, and externalize maintenance costs (the tools take care of all breaking changes, updates and overall maintenance).