How To
July 21, 2021
How to do data transformation in your ETL process?
Learn all about ETL transformations and how to automate them in Keboola.

Working with raw or unprocessed data often leads to poor decision-making.

This explains why data scientists, engineers, and other analytics professionals spend over 80% of their time finding, cleaning, and organizing data.

Accordingly, the ETL process - the foundation of all data pipelines - devotes an entire step to the T, transformation: the act of cleaning, molding, and reshaping data into a valuable format.

What is the ETL process?

The ETL process (or ETL pipeline) is named after the three steps needed to get data from its raw format to its final, valuable state:

  1. Extract. Collect data from all of your data sources. Raw data can reside in your in-house CRM or ERP, or come from external sources such as the Facebook Ads API or Google Ads API. Irrespective of where your data is produced, the extraction phase gathers it from its sources. 
  2. Transform. Following data extraction, raw data is pushed to the staging area (an intermediate, often in-memory storage) where it is transformed, processed, and cleaned.
  3. Load. Finally, the clean data is moved to its data store: a target database (such as the relational databases MySQL or PostgreSQL), a data warehouse, a data lake, or even multiple destinations. 

Once data is loaded it can either be further transformed or directly used by data analysts or other stakeholders for decision-making. For example, properly cleaned data can be pushed to business intelligence (BI) tools, where it is ready for visualization and analytics.
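To make the three steps concrete, here is a minimal sketch in Python using pandas and SQLAlchemy. The export URL, column names, and connection string are hypothetical placeholders, not part of any specific tool:

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: collect raw data from a source (a hypothetical CSV export here).
raw_orders = pd.read_csv("https://example.com/exports/orders.csv")

# Transform: clean the data in the staging area (in memory here).
orders = raw_orders.dropna(subset=["order_id"])               # drop incomplete rows
orders["order_date"] = pd.to_datetime(orders["order_date"])   # normalize dates

# Load: write the cleaned data into the target database.
engine = create_engine("postgresql://user:password@localhost/warehouse")
orders.to_sql("orders", engine, if_exists="replace", index=False)
```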

Run a 100% data-driven business without any extra hassle.
Pay as you go, starting with our free tier.

Two ways to architect transformations - ETL vs ELT

ETL is an architectural pattern. As such, it is an idealization: it portrays data pipelines as strictly sequential, linear processes. 

Of course, the reality is messier. As always. 

Usually, the steps of the ETL process overlap and are done in parallel wherever possible, to get the freshest data available ASAP.

But when it comes to data engineering architecture, there are two distinct ways of incorporating transformations into data pipelines.

The first is ETL as explained above. ETL is the traditional process of extracting data to the staging area, where it is transformed before being loaded into its final destination storage.

The second architectural paradigm is ELT. Notice the L before the T: the data is loaded before it is transformed. Thanks to advances in data warehousing technology, the cost of storing raw data has dropped drastically, which makes this order practical. 

Raw data is extracted and loaded into a data warehouse (e.g. BigQuery, Amazon Redshift, Snowflake…). Only afterward is the data transformed.
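As a rough illustration of the reversed order, the sketch below (again in Python, with a Postgres connection standing in for a cloud warehouse; all names are placeholders) loads the raw extract first and only then transforms it using the warehouse's own compute:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/warehouse")

# Load first: land the raw extract in the warehouse untouched.
raw = pd.read_csv("https://example.com/exports/orders.csv")
raw.to_sql("orders_raw", engine, if_exists="replace", index=False)

# Transform afterward, inside the warehouse itself.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE orders_clean AS
        SELECT order_id, CAST(order_date AS DATE) AS order_date, sales_total
        FROM orders_raw
        WHERE order_id IS NOT NULL
    """))
```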

Both paradigms have advantages and shortcomings and are better thought of as two strategies for different DataOps challenges (dive into the details here).

ETL is usually better suited for smaller data pipelines, while ELT is the go-to design pattern for big data.

Irrespective of which architecture you choose, all transformations fall into a handful of prototypical types.

Types of transformations

Data transformation - also called data cleansing or data cleaning - is the process of manipulating data from its raw form into its final, cleaned, valuable shape.

Whatever cleaning your data needs, it will usually fall under one of the following prototypical transformations.

Data Filtering 

Data filtering is one of the simplest transformations. Certain columns or rows are filtered out before saving or inserting the data into the data storage.

Example: For the table sales_2021, you filter out all data from orders that were placed before 2021-01-01.
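A minimal pandas sketch of this filter, assuming a hypothetical orders extract with an order_date column:

```python
import pandas as pd

# Hypothetical raw extract of orders.
orders = pd.read_csv("orders_export.csv", parse_dates=["order_date"])

# Filter out everything placed before 2021-01-01 before loading sales_2021.
sales_2021 = orders[orders["order_date"] >= "2021-01-01"]
```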

Data Mapping

Data mapping, also called translation or conversion, takes one data input and changes it to its equivalent in another format.

Example 1: Take product names from your eCommerce shop across all language variants and translate them into English before inserting them into the product_details table.

Example 2: You are running an app and are interested in usage analytics. The data you collect is time-stamped with the local timezone of each user. Before saving events to your event_raw table, you convert local timestamps to UTC. For example, Mario and Jessica are playing each other in your app’s game: Mario plays in Italy at 5 pm, while Jessica plays in New York at 11 am (both local time). To avoid ambiguous timestamps, you convert the events from both Mario and Jessica to UTC (3 pm). 

Example 3: You are building a machine learning classifier that helps you determine which customers are more likely to repurchase. The data you extract from the CRM says “female” or “male”, which your algorithm would not understand. Therefore you recode the categorical variable into 0 for “male” and 1 for “female” before saving the data to the database. 

Data mapping is one of the most common types of data transformations across all operations and industries. 

It takes on many forms not mentioned here, such as unifying capitalization, converting strings to the same encoding standard, synchronizing datetimes, aligning units across different granularities, etc. 
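A short pandas sketch of Examples 2 and 3, with made-up column names: the timezone conversion reproduces the Mario/Jessica case, and the recoding maps the gender strings to 0/1:

```python
import pandas as pd

events = pd.DataFrame({
    "user":      ["Mario", "Jessica"],
    "played_at": ["2021-07-21 17:00", "2021-07-21 11:00"],
    "timezone":  ["Europe/Rome", "America/New_York"],
})

# Localize each timestamp to the user's timezone, then convert to UTC.
events["played_at_utc"] = events.apply(
    lambda row: pd.Timestamp(row["played_at"], tz=row["timezone"]).tz_convert("UTC"),
    axis=1,
)  # both rows end up at 15:00 UTC

# Recode a categorical variable for a machine learning model (Example 3).
customers = pd.DataFrame({"gender": ["female", "male", "female"]})
customers["gender_code"] = customers["gender"].map({"male": 0, "female": 1})
```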

Data Deduplication

Duplicated data repeats the same information two or more times. Data deduplication removes duplicated data before inserting it into the data warehouse. 

Example: After collecting distribution information from your ERP, you find the same shipment recorded twice.

The data is unnecessarily duplicated, so you insert only one row into the final database.
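A minimal deduplication sketch in pandas, with a hypothetical shipments extract:

```python
import pandas as pd

# Hypothetical ERP extract where one shipment appears twice.
shipments = pd.DataFrame({
    "shipment_id": [1001, 1001, 1002],
    "item":        ["widget", "widget", "gadget"],
    "qty":         [5, 5, 3],
})

# Drop exact duplicate rows, keeping the first occurrence of each.
shipments_clean = shipments.drop_duplicates()
```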

Derived Variables

Derived variables or metrics are attributes computed from other data points and variables.

Example: You create a column “profit” in your orders table that is computed by subtracting taxes, shipping_costs, and COGS from your sales_total for each row (aka order). 
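A sketch of that derived column in pandas, assuming the four input columns exist in the orders table (the numbers are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "sales_total":    [120.0, 80.0],
    "taxes":          [20.0, 13.0],
    "shipping_costs": [10.0, 8.0],
    "cogs":           [60.0, 40.0],
})

# Derive profit from the other columns, row by row.
orders["profit"] = (
    orders["sales_total"] - orders["taxes"]
    - orders["shipping_costs"] - orders["cogs"]
)
```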

Data Sorting or Ordering

Data sorting orders rows by one or more attributes. Example: before saving the data, you sort it by date, so the most recent records are easy to find.
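A one-line sort in pandas, on a hypothetical orders table:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":   [3, 1, 2],
    "order_date": pd.to_datetime(["2021-03-01", "2021-01-15", "2021-02-10"]),
})

# Sort by date (newest first) before loading.
orders_sorted = orders.sort_values("order_date", ascending=False)
```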

Joining data from multiple sources

Merge related data that was collected from different sources.

Example: Join Facebook Ads spend data, Google Ads spend data, and LinkedIn Ads spend data into a single table called digital_advertising_spend. 
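A minimal pandas sketch, assuming the three channel extracts already share a unified schema (date, channel, spend - the values are made up); since the schemas match, the "join" here is a union of rows:

```python
import pandas as pd

# Hypothetical per-channel extracts with a unified schema.
facebook = pd.DataFrame({"date": ["2021-07-01"], "channel": ["facebook"], "spend": [120.0]})
google   = pd.DataFrame({"date": ["2021-07-01"], "channel": ["google"],   "spend": [95.0]})
linkedin = pd.DataFrame({"date": ["2021-07-01"], "channel": ["linkedin"], "spend": [40.0]})

# Stack the three sources into one digital_advertising_spend table.
digital_advertising_spend = pd.concat([facebook, google, linkedin], ignore_index=True)
```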

Aggregating data

Data aggregation is a form of summarization. Aggregation is done by picking a relevant dimension and grouping a metric across that dimension.

Example 1: Compute the total number of new customers (grouped metric) on each date (dimension).

Example 2: Find total sales (grouped metric) by geographical region (dimension). 

Aggregations are often precomputed for commonly used metrics to speed up analytics. 
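A sketch of Example 2 in pandas, grouping a made-up sales table by region:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "EU", "US"],
    "amount": [100.0, 150.0, 200.0],
})

# Total sales (grouped metric) by geographical region (dimension).
sales_by_region = sales.groupby("region", as_index=False)["amount"].sum()
```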

Splitting data

When working with unstructured and semistructured data (such as strings, JSON, or XML files), you split the data before saving it into a table.

Example 1: You get the customer address in the form of the string “Boulevard avenue 23, Portobello PO1234, UK”. Before saving it into the customer table, you split it into the columns below (see the sketch after the list):

  • street, 
  • city, 
  • zip_code, and 
  • country.
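A minimal pandas sketch of this split; the string layout (comma-separated parts, with city and zip code together in the middle chunk) is assumed from the example above:

```python
import pandas as pd

customers = pd.DataFrame({
    "address": ["Boulevard avenue 23, Portobello PO1234, UK"],
})

# Split the address into its comma-separated parts.
parts = customers["address"].str.split(", ", expand=True)
customers["street"]  = parts[0]
customers["country"] = parts[2]

# The middle chunk holds city and zip code ("Portobello PO1234").
city_zip = parts[1].str.rsplit(" ", n=1, expand=True)
customers["city"]     = city_zip[0]
customers["zip_code"] = city_zip[1]
```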

How to choose which transformation to apply?

The choice of transformation type ultimately depends on your business needs.

Get yourself on the right track by following this process:

  1. Identify the target table schemas. Prepare a list of tables and their schemas (columns, possible values, etc.) describing how the data should ideally look once it has been cleaned and loaded into the database or data warehouse.
  2. Record the gap between target and incoming data. Contrast the ideal target state with the raw data you collected in the extraction step. Note down what transformation (types) would be needed to convert the raw data into the cleaned data.
  3. Implement the changes needed to bridge the gap between extracted data and target data. Write SQL-based queries that clean the raw data. Validate the transformations before setting them on autopilot with the ETL tool of your choice. 


Take control of your transformations with Keboola

Keboola is an end-to-end data operations platform that automates the majority of data work to free up valuable engineering resources.

Its main focus is to automate the entire data integration process, so it offers many automation and smart features around transformations:

1. Language-agnostic. Write your transformation in SQL or pick a language of your choice (such as Python, R, Julia, …). 

2. Orchestrated. All transformations can be scheduled via orchestrations, so they are easily repeatable without someone having to run the transformation scripts manually. 

3. Centralized. Access all transformations from the “Transformations” tab within the Keboola dashboard. This centralization gives your team visibility over the transformations in place and frees up precious time that is usually spent organizing and searching for existing data cleaning scripts. Simply scroll through the list of transformations or use the search bar to find the one you are looking for.

4. Full traceability. Each transformation run is fully monitored, so you can demonstrate regulatory compliance without extra work. Inspect each run under the “Jobs” tab.

5. Automatic versioning. Play around and tinker with your data cleaning. You can always revert changes with versioning.

Take Keboola for a spin and check for yourself what it can do for your ETL pipelines and transformations. No worries, Keboola has an always-free plan, so you can check it out on your own schedule. 
