Working with raw or unprocessed data often leads to poor decision-making.
This is why data scientists, engineers, and other analytics professionals reportedly spend up to 80% of their time finding, cleaning, and organizing data.
Accordingly, the ETL process, the foundation of all data pipelines, devotes an entire step to the T, transformation: the act of cleaning, molding, and reshaping data into a valuable format.
ETL, short for Extract, Transform, Load, refers to the three necessary steps to get data from its raw format to its final, valuable state:

1. Extract: collect raw data from its sources.
2. Transform: clean, mold, and reshape the data into a valuable format.
3. Load: save the transformed data into its destination storage.
Once data is loaded it can either be further transformed or directly used by data analysts or other stakeholders for decision-making. For example, properly cleaned data can be pushed to business intelligence (BI) tools, where it is ready for visualization and analytics.
ETL is a form of data architecture. As such, it is idealized, and it portrays data pipelines as sequentially linear processes.
Of course, the reality is messier. In practice, the steps of the ETL process overlap and run in parallel wherever possible, so that the freshest data is available as soon as possible.
But when it comes to data engineering architecture there are two distinct ways of incorporating transformations into data pipelines.
The first is ETL as explained above. ETL is the traditional process of extracting data to the staging area, where it is transformed before being loaded into its final destination storage.
The second architectural paradigm is ELT. Notice the L before the T: the data loading happens before the transformations. Thanks to advances in data warehousing technologies, the cost of storage has dropped drastically, making it practical to load raw data first and transform it later.
Raw data is extracted and loaded into a data warehouse (e.g. BigQuery, Amazon Redshift, Snowflake, …). Only afterward is the data transformed.
Both paradigms have advantages and shortcomings and are better thought of as two strategies for different DataOps challenges.
ETL is usually better suited for smaller data pipelines, while ELT is the go-to design pattern for big data.
Irrespective of the architecture you choose, all transformations can be categorized into a handful of prototypical types.
Data transformation, also called data cleansing or data cleaning, is the process of manipulating data from its raw form into its final, cleaned, valuable shape.
Whichever transformations you employ to clean your data, they will typically fall into one of the following prototypes.
Data filtering is one of the simplest transformations. Certain columns or rows are filtered out before saving or inserting the data into the data storage.
Example: For the table sales_2021, you filter out all data from orders that were placed before 2021-01-01.
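A minimal sketch of this filter in Python (the order records and column names are hypothetical, for illustration only):

```python
from datetime import date

# Hypothetical raw orders, as they might arrive from the extract step.
orders = [
    {"order_id": 1, "placed_on": date(2020, 12, 30), "total": 99.0},
    {"order_id": 2, "placed_on": date(2021, 1, 5), "total": 49.0},
    {"order_id": 3, "placed_on": date(2021, 3, 17), "total": 25.0},
]

# Keep only orders placed on or after 2021-01-01 for the sales_2021 table.
sales_2021 = [o for o in orders if o["placed_on"] >= date(2021, 1, 1)]
```

The same filter could just as well be a `WHERE` clause in SQL before loading.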
Data mapping, also called translation or conversion, takes one data input and changes it to its equivalent in another format.
Example 1: Take product names from your eCommerce shop across all language variants and translate them into English before inserting them into the product_details table.
Example 2: You are running an app and are interested in usage analytics. The data you collect is time-stamped with the local timezone of each user. Before extracting and saving events to your event_raw table, you convert local time zones to UTC timestamps. For example, Mario and Jessica are playing each other in your app’s game. But Mario plays it in Italy at 5 pm, while Jessica plays it in New York at 11 am (both local time). To avoid confusing timestamps, you convert events from both Mario and Jessica to UTC (3 pm).
Example 3: You are building a machine learning classifier that helps you determine which customers are more likely to repurchase. The data you extract from the CRM says “female” or “male”, which your algorithm would not understand. Therefore you recode the categorical variable into 0 for “male” and 1 for “female” before saving the data to the database.
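Example 3 can be sketched in a few lines of Python; the mapping table and customer records below are made up for illustration:

```python
# Recode the categorical CRM field into numbers the classifier can use:
# "male" -> 0, "female" -> 1, mirroring the example above.
gender_codes = {"male": 0, "female": 1}

customers = [
    {"customer_id": 101, "gender": "female"},
    {"customer_id": 102, "gender": "male"},
]

for customer in customers:
    customer["gender"] = gender_codes[customer["gender"]]
```

Any data mapping follows the same shape: a lookup from the source format to its equivalent in the target format.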
Data mapping is one of the most common types of data transformations across all operations and industries.
It takes on many forms not mentioned here, such as unifying capitalization, converting strings to the same encoding standard, synchronizing datetimes, aligning units across different granularities, etc.
Duplicated data repeats the same information two or more times. Data deduplication removes duplicated data before inserting it into the data warehouse.
Example: After collecting distribution information from your ERP, you find the same item being shipped twice. The data is unnecessarily duplicated, so you insert only one row into the final database.
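A simple deduplication pass in Python, keeping the first occurrence of each row (the shipment records are hypothetical):

```python
# Hypothetical ERP extract with one row accidentally repeated.
shipments = [
    {"shipment_id": "S-1", "item": "SKU-42", "qty": 3},
    {"shipment_id": "S-1", "item": "SKU-42", "qty": 3},  # duplicate
    {"shipment_id": "S-2", "item": "SKU-7", "qty": 1},
]

seen = set()
deduplicated = []
for row in shipments:
    key = tuple(sorted(row.items()))  # the whole row acts as the key
    if key not in seen:
        seen.add(key)
        deduplicated.append(row)
```

In practice you would pick a business key (e.g. the shipment ID) rather than the entire row, depending on what counts as a duplicate in your domain.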
Derived variables or metrics compute attributes from other data points and variables.
Example: You create a column “profit” in your orders table that is computed by subtracting taxes, shipping_costs, and COGS from your sales_total for each row (aka order).
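The profit calculation can be sketched as follows; the figures and column names are made up to match the example:

```python
# Hypothetical orders table, one dict per order.
orders = [
    {"sales_total": 120.0, "taxes": 20.0, "shipping_costs": 5.0, "cogs": 60.0},
    {"sales_total": 80.0, "taxes": 13.0, "shipping_costs": 4.0, "cogs": 40.0},
]

# Derive the "profit" column from the other variables, row by row.
for order in orders:
    order["profit"] = (
        order["sales_total"]
        - order["taxes"]
        - order["shipping_costs"]
        - order["cogs"]
    )
```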
Data sorting orders rows by a chosen column. Example: Before saving the data, you sort it by date, so it is easier to find via search.
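Sorting by date is a one-liner in Python; the event rows below are hypothetical:

```python
# ISO-formatted date strings sort correctly as plain strings.
events = [
    {"date": "2021-03-05", "event": "signup"},
    {"date": "2021-01-12", "event": "login"},
    {"date": "2021-02-20", "event": "purchase"},
]

events.sort(key=lambda e: e["date"])  # oldest first
```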
Data joining, or merging, combines related data that was collected from different sources.
Example: Join Facebook Ads spend data, Google Ads spend data and Linkedin Ads spend data into a single table called digital_advertising_spend.
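As a sketch, the three sources can be unioned into one table with a column marking where each row came from (the per-channel extracts below are invented):

```python
# Hypothetical per-channel spend extracts.
facebook_ads = [{"date": "2021-06-01", "spend": 100.0}]
google_ads = [{"date": "2021-06-01", "spend": 80.0}]
linkedin_ads = [{"date": "2021-06-01", "spend": 40.0}]

# Union the three sources, tagging each row with its channel.
digital_advertising_spend = [
    {**row, "channel": channel}
    for channel, rows in [
        ("facebook", facebook_ads),
        ("google", google_ads),
        ("linkedin", linkedin_ads),
    ]
    for row in rows
]
```

If the sources share a key (such as date), the same merge could instead be expressed as a SQL `JOIN` producing one row per key.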
Data aggregation is a form of summarization. Aggregation is done by picking a relevant dimension and grouping a metric across that dimension.
Example 1: Compute the total number of new customers (grouped metric) on each date (dimension).
Example 2: Find total sales (grouped metric) by geographical region (dimension).
Data aggregation is often performed in advance for commonly used metrics, to speed up analytics.
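Example 2 can be sketched as a group-by in plain Python (the order rows and region labels are hypothetical):

```python
# Hypothetical orders with a region dimension and a sales metric.
orders = [
    {"region": "EU", "sales": 100.0},
    {"region": "EU", "sales": 50.0},
    {"region": "US", "sales": 75.0},
]

# Group the sales metric by the region dimension.
sales_by_region = {}
for order in orders:
    sales_by_region[order["region"]] = (
        sales_by_region.get(order["region"], 0.0) + order["sales"]
    )
```

In SQL this is the classic `SELECT region, SUM(sales) ... GROUP BY region`.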
When working with unstructured and semi-structured data (such as strings, JSON, or XML files), you often split the data before saving it into a table.
Example: You get the customer address as the string “Boulevard avenue 23, Portobello PO1234, UK”. Before saving it into the customer table, you split it into separate columns, such as street, city, postal code, and country.
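One way to sketch this split in Python (the column names are an assumption; real-world addresses are far less regular and usually need a dedicated parser):

```python
raw_address = "Boulevard avenue 23, Portobello PO1234, UK"

# Split on commas, then peel the postal code off the middle segment.
street, city_and_code, country = [part.strip() for part in raw_address.split(",")]
city, postal_code = city_and_code.rsplit(" ", 1)

customer_row = {
    "street": street,
    "city": city,
    "postal_code": postal_code,
    "country": country,
}
```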
The choice of transformation type ultimately depends on your business needs.
Get yourself on the right track by following this process:
Keboola is an end-to-end data operations platform that automates the majority of data work to free up valuable engineering resources.
Its main focus is to automate the entire data integration process, so it has built many automation and smart features surrounding transformations:
1. Language-agnostic. Write your transformation in SQL or pick a language of your choice (such as Python, R, Julia, …).
2. Orchestrated. All transformations can be scheduled via orchestrations, so are easily repeatable without the need for someone to manually run the transformation scripts.
3. Centralized. Access all transformations from the “Transformations” tab within the Keboola dashboard. This centralization gives your team visibility over the transformations in place and liberates precious time that is usually spent organizing and searching for existing data cleaning scripts. Simply scroll through the list of transformations or use the search bar to find the transformation you are looking for.
4. Full traceability. Each transformation is fully monitored, so you do not have to worry about regulatory compliance. Inspect each transformation run under the “Jobs” tab.
5. Automatic versioning. Play around and tinker with your data cleaning. You can always revert changes with versioning.