8 TYPES OF ETL DATA TRANSFORMATIONS AND HOW TO AUTOMATE THEM.
Working with raw or unprocessed data often leads to poor decision-making.
Accordingly, the ETL process - the foundation of all data pipelines - devotes an entire section to T, transformations: the act of cleaning, molding, and reshaping data into a valuable format.
In this blog, you’ll find:
Data transformation is part of an ETL process and refers to preparing data for analysis. This involves cleaning (removing duplicates, fill-in missing values), reshaping (converting currencies, pivot tables), and computing new dimensions and metrics. This process requires some technical knowledge and is usually done by data engineers or data scientists.
In a typical ETL process, data transformation follows data extraction, where raw data is extracted to the staging area (an intermediate, often in-memory storage).
After data is transformed, it is then loaded to its data store: a target database (such as the relational databases MySQL or PostgreSQL), a data warehouse, a data lake, or even multiple destinations.
Usually, cleaned data is loaded to business intelligence (BI) tools, where it is ready for visualization and analytics done by the business users.
ETL is an idealized form of data architecture that portrays data pipelines as sequentially linear processes.
However, the reality is usually messier.
Usually, the steps of the ETL process overlap and are done in parallel wherever possible, to get the freshest data available ASAP.
That is why when it comes to data engineering architecture there are two distinct ways of incorporating transformations into data pipelines:
Both paradigms have advantages and shortcomings and are better thought of as two strategies for different DataOps challenges.
ETL is usually better suited for smaller data pipelines, while ELT is the go-to design pattern for big data.
Irrespective of the architecture of your choice, all etl transformations can be categorized into a couple of prototypical types, which we are going to breakdown in this blog.
Recommended read: Complete ETL process overview.
Depending on the type of transformation you employ to clean your data, you could be using one of the prototypes of data transformation.
Data filtering is one of the simplest transformations. Certain columns or rows are filtered out before saving or inserting the data into the data storage.
Example: For the table sales_2021, you filter out all data from orders that were placed before 2021-01-01.
Data mapping is one of the most common types of data transformations across all operations and industries. Data mapping (also called translation or conversion), takes one data input and changes it to its equivalent in another format.
It takes on many forms such as unifying capitalizations, converting strings to the same encoding standard, synchronizing date and time values, aligning units across different granulations, etc.
Example 1: Take product names from your eCommerce shop across all language variants and translate them into English before inserting them into the product_details table.
Example 2: You are running an app and are interested in usage analytics. The data you collect is time-stamped with the local timezone of each user. Before extracting and saving events to your event_raw table, you convert local time zones to UTC timestamps. For example, Mario and Jessica are playing each other in your app’s game. But Mario plays it in Italy at 5 pm, while Jessica plays it in New York at 11 am (both local time). To avoid confusing timestamps, you convert events from both Mario and Jessica to UTC (3 pm).
Example 3: You are building a machine learning classifier that helps you determine which customers are more likely to repurchase. The data you extract from the CRM says “female” or “male”, which your algorithm would not understand. Therefore you recode the categorical variable into 0 for “male” and 1 for “female” before saving the data to the database.
Duplicated data repeats the same information two or more times. Data deduplication removes duplicated data before inserting it into the data warehouse.
Example: After collecting distribution information from your ERP, you find the same item being shipped twice:
Derived variables or metrics compute attributes from other data points and variables.
Example: You create a column “profit” in your orders table that is computed by subtracting taxes, shipping_costs, and COGS from your sales_total for each row (aka order).
Before saving the data, you sort it by date, so it is easier to find via search.
Merge related data that was collected from multiple data sources.
Example: Join Facebook Ads spend data, Google Ads spend data and Linkedin Ads spend data into a single table called digital_advertising_spend.
In Keboola, you can instantly pull data from different sources using the 250+ pre-built connectors that include Google Ads, Facebook Ads, LinkedIn Ads, and Mailchimp.
You can create a data pipeline in X steps:
Data aggregation is a form of summarization. Aggregation is done by picking a relevant dimension and grouping a metric across that dimension.
It is often performed for metrics that are commonly used to speed up analytics.
Example 1: Compute the total number of new customers (grouped metric) on each date (dimension).
Example 2: Find total sales (grouped metric) by geographical region (dimension).
When working with unstructured and semistructured data (such as strings, JSON, or XML files), you split the data before saving it into a table.
Example 1: You get the customer address in the form of the string “Boulevard avenue 23, Portobello PO1234, UK”. Before saving it into the customer table, you split it into the columns:
The choice of transformation type ultimately depends on your business needs.
Get yourself on the right track by following this process:
Keboola is a data platform as a service that automates the majority of data integration process to free up valuable engineering resources.
When it comes to data transformations, Keboola has many smart features data engineers love:
Take Keboola for a spin and check for yourself what it can do for your ETL pipelines and transformations.