The “Extract” stage of the ETL process involves collecting structured and unstructured data from its data sources. This data is ultimately consolidated into a single data repository.
Traditionally, extraction meant getting data from Excel files and Relational Database Management Systems (RDBMS), as these were the primary sources of information for businesses (e.g. purchase orders written in Excel). With the rise of Software as a Service (SaaS) applications, the majority of businesses now find valuable information in the apps themselves, e.g. Facebook for advertising performance, Google Analytics for website utilization, Salesforce for sales activities, etc.
Extracted data may come in several formats, such as relational databases, XML, JSON, and others, and from a wide range of data sources, including cloud, hybrid, and on-premise environments, CRM systems, data storage platforms, analytic tools, etc. But today, data extraction is mostly about obtaining information from an app’s storage via APIs or webhooks.
Now let’s look at the three possible architecture designs for the extract process.
1.1. Extract architecture design
When designing the software architecture for extracting data, there are three possible approaches to implementing the core solution:
Full-extraction. Each extraction collects all data from the source and pushes it down the data pipeline.
Incremental extraction. At each new cycle of the extraction process (e.g. every time the ETL pipeline is run), only the new data is collected from the source, along with any data that has changed since the last collection. For example, data collection via API.
Source-driven extraction. The source notifies the ETL system that data has changed, and the ETL pipeline is run to extract the changed data. For example, data collection via webhooks.
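The incremental approach can be sketched as a small polling loop that keeps a “last run” watermark and asks the source only for records changed since then. This is a minimal sketch, not a real client: `fetch_page` is a hypothetical callable wrapping the source API, and its parameter names are assumptions.

```python
import datetime

def incremental_extract(fetch_page, last_run):
    """Pull only records created or updated since the previous run.

    `fetch_page` is a hypothetical API wrapper: it accepts an
    `updated_since` watermark and a page `cursor`, and returns
    (records, next_cursor), with next_cursor=None on the last page.
    """
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(updated_since=last_run, cursor=cursor)
        records.extend(page)
        if cursor is None:  # no more pages to fetch
            break
    # The new watermark becomes `last_run` for the next pipeline cycle
    return records, datetime.datetime.now(datetime.timezone.utc)
```

A source-driven design inverts this flow: instead of polling, a webhook handler receives the changed records and pushes them down the same pipeline.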
Below are the pros and cons of each architecture design, so that you can better understand the trade-offs of each ETL process design choice:
1.2. Extract challenges
The data extraction part of the ETL process poses several challenges. Knowing them can help you prepare for and avoid the issues before they arise.
Data latency. Depending on how fast you need data to make decisions, the extraction process can be run with lower or higher frequencies. The tradeoff is between stale or late data at lower frequencies vs higher computational resources needed at higher frequencies.
Data volume. The volume of data extraction affects system design. The solutions for low-volume data do not scale well as data quantity increases. With large amounts of data, you need to implement parallel extraction solutions, which are complex and difficult to maintain from an engineering perspective.
Source limits. You need to be aware of the source limitations when extracting data. For example, some sources (such as APIs and webhooks) have imposed limitations on how much data you can extract simultaneously. Your data engineers need to work around these barriers to ensure system reliability.
Data validation. Either you validate data at extraction (before pushing it down the ETL pipeline), or at the transformation stage. When validating data at extraction, check for missing data (e.g. are some fields empty, even though they should have returned data?) and corrupted data (e.g. are some returned values nonsensical, such as a Facebook ad having -3 clicks?).
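Extraction-time validation can be as simple as a function that flags missing fields and nonsensical values before a record enters the pipeline. The field names below are illustrative, not taken from any real API:

```python
def validate_ad_record(record):
    """Basic extraction-time checks on one record (a dict):
    required fields must be present, and values must be sane."""
    errors = []
    # Missing data: fields that should have returned a value
    for field in ("ad_id", "clicks", "impressions"):
        if record.get(field) is None:
            errors.append(f"missing field: {field}")
    # Corrupted data: an ad can never have a negative click count
    clicks = record.get("clicks")
    if isinstance(clicks, int) and clicks < 0:
        errors.append(f"corrupted value: clicks={clicks}")
    return errors
```

Records that come back with a non-empty error list can be quarantined or logged instead of being pushed down the pipeline.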
Orchestration. Based on your choices of data latency, volume, source limits and data quality (validation), you need to orchestrate your extraction scripts to run at specified times or triggers. This can become complex if you implement a mixed model of architectural design choices (which people often do in order to accommodate for different business cases of data use).
Disparate sources. Working with different data sources causes problems with overhead and management. The variety of sources increases the demands for monitoring, orchestration and error fixes.
Monitoring. You need to monitor your extraction system on several levels:
Resources. How much computational power and memory is allocated?
Errors. Have there been any errors that have caused missing or corrupted data?
Reliability. Have the extraction scripts run at all?
With the increasing dependency on third-party apps for doing business, the extraction process must address several API challenges as well:
Variability & inconsistency. Every API is designed differently, whether you are using apps from giants like Facebook or small software companies. The variability and inconsistencies between their designs will cause your engineers to spend a lot of time on implementing the extractors for each source.
Lack of documentation. APIs are notoriously lacking in proper documentation. Every time your use case deviates from the core minimum (which is documented), engineering hours need to be allocated to understand how to execute on your vision for the extractor within the API.
(Breaking) Changes. APIs change frequently and break your extractors. Whether it’s because of version updates or expiring access tokens, APIs require a lot of monitoring and maintenance hours.
High complexity. APIs expose data in multiple locations (endpoints) and complex aggregations. This is especially true for SaaS apps, which cover multiple business use cases and collect a lot of different data. You need to budget engineering hours for navigating this complexity.
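A common work-around for source rate limits is to retry with exponential backoff when the API signals that the quota is exhausted. This is a sketch under assumptions: `RateLimitError` stands in for whatever exception your (hypothetical) API client raises, and `api_call` is any zero-argument callable.

```python
import time

class RateLimitError(Exception):
    """Stand-in for the exception a real API client raises on quota limits."""

def call_with_backoff(api_call, max_retries=5, base_delay=1.0):
    """Retry `api_call` with exponentially growing pauses between attempts."""
    for attempt in range(max_retries):
        try:
            return api_call()
        except RateLimitError:
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limit persisted after retries")
```

Wrapping every extractor call this way keeps a temporary quota breach from failing the whole pipeline run.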
The “Transform” stage of the ETL process takes the data that has been collected at the extractor stage and changes (transforms) it before saving it to the analytic database. There are multiple transformations:
Data cleaning. Data cleaning involves identifying suspicious data and correcting or removing it. For example:
Remove missing data
Recode missing data into NULLs or 0s or “#NA”
Recode different versions of the same data to a common denominator. For example, “M”, 1, “male”, “masculine” to “Male”
One-hot encode categorical data
Convert data types to standard forms. For example, convert DateTime objects and Unix timestamps to the same data type
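A few of the cleaning rules above can be sketched as a single row-level function. The field names (`gender`, `signup_ts`) are illustrative, not from any particular schema:

```python
from datetime import datetime, timezone

# Recode different versions of the same value to a common denominator
GENDER_MAP = {"M": "Male", 1: "Male", "male": "Male", "masculine": "Male"}

def clean_row(row):
    """Apply basic cleaning rules to one raw record (a dict)."""
    cleaned = dict(row)
    # Unmapped or missing values are recoded to "#NA"
    cleaned["gender"] = GENDER_MAP.get(row.get("gender"), "#NA")
    # Convert Unix timestamps to a standard datetime type
    ts = row.get("signup_ts")
    cleaned["signup_ts"] = (
        datetime.fromtimestamp(ts, tz=timezone.utc) if ts is not None else None
    )
    return cleaned
```

In a real pipeline these rules would run over every extracted record before the data moves on to enrichment.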
Data enriching. Data enriching involves adding new information to the raw data already collected. For example:
Join different sources. Create customer information blobs, which join information from a variety of purchasing apps.
Deduplication. Identify which information is duplicated and remove the copycat.
Calculated fields. For example, calculate the lifetime value of the customers at data sets import, or the number of their consecutive purchases.
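The three enrichment steps above (joining sources, deduplication, and calculated fields) can be combined in one pass. This is a minimal sketch; the field names (`id`, `customer_id`, `amount`, `lifetime_value`) are assumptions:

```python
def enrich_customers(customers, purchases):
    """Join purchase data onto customer records, drop duplicate
    customers, and add a calculated lifetime-value field."""
    # Join: index purchase amounts by customer
    by_customer = {}
    for p in purchases:
        by_customer.setdefault(p["customer_id"], []).append(p["amount"])

    enriched, seen = [], set()
    for c in customers:
        if c["id"] in seen:  # deduplication: remove the copycat
            continue
        seen.add(c["id"])
        amounts = by_customer.get(c["id"], [])
        # Calculated field: lifetime value as the sum of all purchases
        enriched.append({**c, "lifetime_value": sum(amounts)})
    return enriched
```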
In reality, though, the majority of work is done via data cleaning. If you would like to dig deeper into the intricacies of data cleansing, check out The Ultimate Guide to Data Cleaning.
2.1. Transform architecture design
When designing the architecture of data transformation, there are multiple things to consider:
Order of operations. The order in which transform rules are applied to incoming data can affect the end result. For instance, imagine we have two transform scripts. The first computes the number of consecutive purchases made by a customer. The second drops purchase rows from the data pipeline unless they have a shipping address. If we drop the rows with missing shipping addresses before we calculate the consecutive purchases, we end up with a different purchase count than if we run the scripts the other way around.
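The order-of-operations effect is easy to demonstrate with toy data: filtering out rows without a shipping address before counting yields a different result than counting first.

```python
# Toy purchase data for one customer; field names are illustrative
purchases = [
    {"customer": "a", "shipping_address": "12 Main St"},
    {"customer": "a", "shipping_address": None},
    {"customer": "a", "shipping_address": "34 Side St"},
]

def count_purchases(rows):
    return sum(1 for r in rows if r["customer"] == "a")

def drop_missing_address(rows):
    return [r for r in rows if r["shipping_address"] is not None]

# Count first, then filter: the customer has 3 purchases
count_then_filter = count_purchases(purchases)
# Filter first, then count: only 2 purchases survive the filter
filter_then_count = count_purchases(drop_missing_address(purchases))
assert count_then_filter != filter_then_count  # 3 vs. 2
```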
Business logic. Transforms often implement business logic, such as calculating a customer’s lifetime value or their consecutive orders. Your architecture needs to be designed so it can handle missing or corrupt data and transform orders, thus supporting business logic implementation.
Algorithmic efficiency. Because transforms go through the extracted data, they sometimes need to handle heavy loads. Algorithmic efficiency in the design of transforms can make a difference in the time required for a transform to execute, or whether it will time out your system. Take this simple example: for a 1M-row transformation, using a dictionary (hash) lookup instead of scanning a list in a for loop can cut execution time by a couple of orders of magnitude.
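The lookup-structure effect can be measured directly: checking membership in a list is a linear scan, while checking a set (hash-based, like a dictionary) is constant time per row. The numbers below are arbitrary toy data.

```python
import time

rows = list(range(1_000_000))
lookup_list = list(range(0, 1_000_000, 1000))  # 1,000 reference values
lookup_set = set(lookup_list)                  # same values, hash-indexed

start = time.perf_counter()
# O(rows * len(lookup_list)): each membership test scans the list
hits_linear = sum(1 for r in rows[:10_000] if r in lookup_list)
linear_time = time.perf_counter() - start

start = time.perf_counter()
# O(rows): each membership test is a constant-time hash lookup
hits_hashed = sum(1 for r in rows[:10_000] if r in lookup_set)
hashed_time = time.perf_counter() - start

assert hits_linear == hits_hashed  # same result, very different cost
```

On a full million-row pass the gap widens further, which is exactly the kind of difference that decides whether a transform finishes or times out.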
Quality assurance. Transformations are often the place where data is validated against a set of criteria (e.g. do not import customer information unless we have their email) and monitored for data quality. At this stage, a lot of ETL processes are designed with alerts that notify developers of errors, as well as rules that prevent data from passing down the pipeline unless it matches certain criteria.
2.2. Transform challenges
There are several challenges when dealing with transformations:
Lack of business logic. Oftentimes, the data we receive from the extract phase reveals gaps in the business logic. As an example: the business rule for determining a new customer is the date of their first product purchase. But what do we do for customers who paid for shipping, but not for a product?
Changing business logic. As company operations evolve, business definitions change. Even small changes to business logic can have multiple effects on transforms, especially if the change in one transform affects others which depend on it.
Hard business logic. Sometimes, implementing something trivial from a business perspective can be challenging from an engineering perspective.
Changing source data. APIs can change their response payloads, data can become corrupted, or your system might migrate to a new SaaS… so you need to implement different transform logic. On top of these coupling issues, changing source data requires constant monitoring and maintenance of the transform stage.
Scaling complexity. Transforms present challenges when the ETL processes evolve. The more transforms you implement, the harder it is to keep track of their mutual effects.
“Load” involves taking data from the transform stage and saving it to a target database (a relational SQL database, NoSQL data store, data warehouse, or data lake), where it is ready for big data analysis.
3.1. Load architecture design
There are three possible designs for loading data into a destination warehouse/database or other target system: full load, incremental batch load, and incremental stream load.
Here, we explore them alongside their pros and cons:
3.2. Load challenges
There are several challenges in the loading stage of the data pipeline:
Order of insertion. The order of insertion can affect the end result. If a table has a foreign key constraint, the database will reject (or skip) rows inserted into that table unless matching data has first been inserted into the referenced table.
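The foreign key problem is easy to reproduce with an in-memory SQLite database: inserting an order before its customer fails, so the parent table must be loaded first. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id))"
)

# Loading the child row before its parent violates the constraint...
try:
    conn.execute("INSERT INTO orders VALUES (1, 42)")
    inserted_out_of_order = True
except sqlite3.IntegrityError:
    inserted_out_of_order = False

# ...so the referenced (parent) table must be loaded first.
conn.execute("INSERT INTO customers VALUES (42)")
conn.execute("INSERT INTO orders VALUES (1, 42)")
```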
Schema changes. The schema represents what the destination (database or data warehouse) expects the data to look like. As your business evolves, the schema is often updated to reflect changes in business operations. The resulting need for schema updates can lead to a waste of engineering hours, as well as unintended consequences for the entire system (e.g. data quality validations might break when the form of data breaks).
Data quality. Suspicious data is sometimes formatted in such a way that it circumvents all of your data validation at extraction and transformation. As a result, you need additional data quality monitoring to assure data quality in your database or data warehouse.
The best way to reduce errors, save time and overcome the challenges of the ETL process is to automate all the steps using an ETL tool. Use one to set up your ETL process once and reuse it forever. Speaking of forever…
Keboola offers a forever-free no-questions-asked account you might want to try out if you are building an ETL pipeline. With hundreds of components available and drag’n’drop capabilities, you will build a data flow in minutes. Create a free account and try it for yourself.
4. ETL vs ELT
Within the ETL (traditional process), data is extracted to the staging area (either in-memory data structures or temporary databases) before it is transformed and loaded into the analytic (OLAP) database for analysis.
ELT (extract-load-transform) takes advantage of the new data warehousing technologies (e.g. BigQuery, Amazon Redshift, Snowflake…) by loading the raw data into a data warehouse or data lake first and transforming the data on the fly when it is needed for analysis. ELT is preferred for operations working with extremely large volumes of data or with real-time data.
To put it another way: the main conceptual difference is the final step of the process. In ETL, clean data is loaded into the target destination store. In ELT, loading happens before transformation, and the final step is transforming the data just before analysis.
Even though the end result is the same (i.e. data is ready for analysis), there is a tradeoff between ETL and ELT, which needs to be made clear:
Regardless of your preference (ETL or ELT), there are several architectural considerations to keep in mind.
5. The benefits of the ETL process
A thought-out ETL process can drive true business value and benefits such as:
Information clarity. During ETL transformations, data is cleaned and joined across sources before it is saved in the database, where you can then analyze it. These operations allow you to work with clear, well-structured information instead of messy raw data.
Information completeness. A well-designed ETL pipeline unites all business sources in a single place (the destination data warehouse/database). All of the information is complete, so there are no missing puzzle pieces.
Information quality. ETL processes validate data at extraction or correct/discard data at transformation. This ensures that the quality of data is always controlled before it is analyzed, thus increasing trust in the analysis and giving you the confidence to use data for business intelligence, data analytics, machine learning algorithms or other data science projects.
Information velocity. ETL processes can be designed to trigger the entire ETL pipeline whenever new data arises in the sources or when existing data is changed. You can therefore control the ‘freshness’ of the data, as well as the speed at which you make decisions based on signals in the outside world.
Novel business insights. The entire ETL process brings structure to your company’s information. This allows you to spend more time analyzing novel questions and acquiring new insights, rather than trying to perform procedures to get valuable data at each stage.
6. Which ETL tool should you choose?
There are plenty of ETL tools which automate, accelerate, and take care of your ETL processes for you. What makes them valuable is that they:
Accelerate the data pipeline delivery. You can shorten the time it takes to get insights from months to weeks.
Automate ETL processes. Reduce manual tasks and create a more efficient and reliable workflow.
Externalize maintenance costs. The tools take care of all breaking changes, updates and overall maintenance.
Picking the right tool for your company is a whole new chapter in your ETL process journey, and you can choose your adventure by either: