Learn everything about ETL testing - from definition to a step-by-step guide and automation.
Companies use their data to accelerate business growth and overtake their competitors. To achieve this, they invest a lot in their ETL (extract-transform-load) operations, which take raw data and transform it into actionable information.
It’s no wonder, then, that ETL testing is a crucial part of a well-functioning ETL process, since the ETL process generates mission-critical data.
In this guide, we explore ETL testing: from its benefits and best practices to specific techniques that will set you up for success.
Create a free account and automate ETL testing at no cost.
1. What is ETL testing?
ETL testing is performed to assert a high quality of data for operations.
In the classical ETL paradigm (on which we have written an extensive guide), companies Extract raw data from different data sources, Transform it to conform with business rules, and Load it into the database or data warehouse of their choice. Later, the data is either consumed by an internal Business Intelligence (BI) system or used as production data in a customer-facing application.
To make solid, data-driven business decisions and make customers happy, we need to validate, verify, qualify, and guarantee high standards of data in general - all before they enter the BI system or production applications.
The task of implementing quality control over data can seem daunting, but don’t fret! We’ll be presenting the best practices of testing the ETL data pipeline to guarantee that your data is as expected before it is further piped into analytics and products.
2. Why test? The benefits of ETL testing
Quality assurance of data done via ETL testing offers multiple benefits:
- Increased trust in data. Validated data is trustworthy data. Quality data not only gives you the confidence to reach conclusions as a data analyst and scientist, it also builds trust between you and business decision-makers. Showing that the data has been thoroughly tested removes ambiguity and settles doubts over the reliability of data used in making business decisions.
- Detect bugs before they become detrimental. Constructing the ETL testing suite allows you to detect bugs before they cause harm to your operations. For instance, an alert for exporting data prevents you from transforming that data (aggregating it), and inserting it into the database. Such aggregate inserts are harder to debug, especially when only some records are corrupted. Their corruption can be obscured by the aggregation. Preparing the ETL testing properly allows you to detect and prevent the propagation of bugs within your data ecosystem.
- A deeper knowledge of data. When you test the ETL pipeline, you dive deep into the specifics of data. Why does the field “total” in the orders table have such high extreme values (e.g. $120.000)? Dissecting, inspecting, and analyzing data for test cases provides you with a deeper appreciation of the data structure, which can be useful for further engineering, analytic, and scientific work down the line.
3. When do we use ETL testing?
There are 5 main settings in which ETL testing is used:
- First ETL setup. When we establish a new ETL pipeline from scratch, it is imperative to set up ETL testing to guarantee that the pipeline behaves as expected.
- Migration tests. New data warehouse testing. When we migrate historical data from old legacy systems to a new data warehouse, we set up the ETL testing process to check whether or not the data migration causes any breaks between the new and old storage systems.
- Change of tests. Application upgrades and changes to existing products usually change the underlying data models and schemas, as well as business logic. In these circumstances, we need to change the tests to reflect the new business and technological constraints.
- Report testing. Data is often fueled by the database or storage location to BI tools for data analysis. Business users access that data to generate reports for decision-making. Report testing determines whether the data in the reports follows the same logic, format, structure, and shape as can be found in the source data and target database.
- Regular tests. A well-oiled data operations process requires ETL testing to be part of a regular testing suite, which is launched periodically to confirm that the ETL process is still working as intended. As an example: data is extracted from different sources, usually from third party apps, inhouse text files, and other databases. Because the source systems are not fully controlled by the ETL tester, they can change without warning. A common example is an API provider who alters its schema without notifying its consumers. Regularly running ETL tests against the data sources allows you to continuously keep an eye on the entire process to verify that it’s still up to specifications.
4. How to set up ETL testing: The 7-stage process
Getting ETL testing up and running can seem intimidating and technologically challenging, but it can be boiled down to a set of 7 steps:
- Specify business requirements. Clarify the business objectives of the ETL process, define the expected scope and outcomes of ETL testing, and identify the relevant data models, schemas, and business rules which will apply to the tests. We always begin with a top-down approach, as it allows us to understand the specifications needed to address the business use case of acquiring data from an ETL pipeline.
- Define test cases. Test cases are specific examples of what you want to test. On a conceptual level, test cases answer the “what” question of what it is that we’re testing. For example, are we checking duplicate data in the data sources, or whether the data types in the target database conform to the data types of the source database? On a technical level, they answer the “how” question of testing, and are usually implemented as SQL queries against the data or by generating test data values to be fed into the system. We take an in-depth look at testing techniques and cases in the next section.
- Extract data and run tests. Implement the relevant data extractions and run test cases against the extracted data. We usually monitor the ETL system for data counts, aggregated features of extracted data (min, max, avg) and metadata before moving onto the transformation stage.
- Transform data and run tests. Transform the data according to the business rules and needs (e.g. drop irrelevant data, clean data of outliers and missing records, aggregate data, etc.). Run tests against the transformed data to verify that the transformation queries are working as intended. This can be difficult since the expected outcome is unclear, so we often feed manually generated test data to the ETL system and check whether the transformed data complies with our expectations before moving onto the next stage.
- Load data into the target database and run tests. Before loading data into the target destination, we need to perform a series of tests to guarantee data quality, such as counts of records, checks that corrupted data has been cleaned or rejected, etc. This must be done before committing the records from staging to the warehouse.
- Run end-to-end tests. Setting up tests for each stage of the ETL process is comprehensive but not exhaustive. Sometimes, a change of tests in the extraction phase can have consequences down the line at the loading phase. Running the entire ETL testing process end-to-end ensures that we haven’t accidentally altered the reliability of our tests.
- Generate a testing report. Lastly, generate a summary report specifying which tests were run, which ones were altered, and which ones were dropped from the ETL testing pipeline. This is a vital part of ETL testing, used to inform decision-makers and stakeholders of the entire procedure and how to interpret the data from the ETL pipeline.
5. Comprehensive list of ETL testing techniques with test cases
There are multiple types of ETL testing, which you can perform at various stages of the data flow through your ETL pipeline:
5.1 Metadata testing
What is metadata? Metadata is information about the data within the schema, for example, the column names, the data type of a column, the (uniqueness, foreign key, NOT NULL) constraints imposed on the data, etc.
There are multiple tests that we can run against the schema of a database, but the most commonly used test cases include:
- Data type checks. Check whether the data type of the source data matches that of the target data. For example, if you extract data of the type NUMBER, but load it into the destination database into a column of type VARCHAR, that is a data type error. Usually, these errors are easy to spot at the analytic stage, when a NUMBER method does not apply to your VARCHAR column. But sometimes, you want to assert that the data type has changed. For example, if you export Unix timestamps (in the data type TIMESTAMP on SQL Server), but want to save it as a DATETIME object in a target Postgres database, your data type check will not monitor for equality of data types.
- Data length checks. Assert that the length of a field is as specified. For example, if you have a column field which is defined to be a maximum of 150 characters in the target data warehouse, it cannot accept comments of 300 characters from source data without truncating them (and causing valuable data loss!).
- Constraint checks. Check that NOT NULL, uniqueness, and foreign keys have been applied.
- Naming convention checks. Incorrect data field naming will not corrupt your data, but it will cause issues when locating tables for analysis. Be sure to implement the rules of engagement when it comes to the casing (e.g. lowercased, snake_cased, CamelCased), meaningful appendices (e.g. fact tables need to be appended “_FACT”), etc.
5.2 Data completeness testing
Data completeness testing assures that all of the expected data is extracted from the source and loaded into the target without loss or corruption in the process. There are multiple test cases:
- Row count validation. We count the records in the source data and compare it to the count in the target data to see if they match.
- Aggregate column data values validation. Similar to row count, we perform aggregate functions over a column (min, max, avg) and compare it in both the source column and the target column.
- Value distribution validation. For more sensitive and mission-critical data, it is not enough to only understand the gross aggregate metrics (min, max, avg) - we need a more fine-tuned comparison. If that’s the case, we bin the values within a column and perform counts for each bin, then compare it bin-by-bin between the source and target data.
- Primary key validation. Whenever we operate with crucial entity data (for example, customer ID data or country code mappings), we must assert that the same unique values are present in both source and destination data. Often, the unique data is the primary key of a table. In those cases, we perform a record-by-record comparison to check whether or not the data is present in both tables.
- Full-value comparison validation. Certain data regulators require the datasets to match entirely, and there must be no loss or corruption of information. In these circumstances, we write test cases that check whether the tuples of all values within a row match between the source and destination tables.
5.3 Data quality testing
Data quality testing allows us to check that the business rules and logic have been applied to the ETL pipeline. The test cases include:
- Duplicate data validation. We confirm that the records returned at extraction are not duplicated before we load them.
- Rules validation. If we have business rules in place (e.g. a customer can only order a maximum of 5 items per order), we check to see whether those rules are reflected in the data.
- Data integrity validation. We check whether uniqueness and foreign key constraints have been observed. For example, a foreign key column needs to have a matching primary key in another table (no null examples).
5.4 Data transformation testing
Data transformation is one of the trickiest elements to test. We usually have two options:
- Test transformation procedures. Use premade test data that you feed to your system, and check whether the test data has been cleaned and transformed according to your business rules. This approach validates that all transformation procedures work as expected, and have not broken down with changes to the ETL system or testing suite. This use case is an ideal candidate for automation.
- Test transformation results. Run your transformation stage as you normally would. Additionally, implement the same transformation in another language onto the same source data that you fed to your ETL system. For example, compute averages of source data columns within a Jupyter Notebook pandas DataFrame. Finally, compare your independently obtained results with the system-transformed results to see if the two are in sync.
5.5 Regression testing
Regression tests refer to the tests that we run when there’s an important change in the production systems. Let’s say that you work for Twitter. The development team has changed the business constraint of maximum tweet length from 280 characters to 350 characters. Your regression tests would run the following to validate the production data:
- Check that the destination database does not automatically reject tweets longer than 280 characters.
- Check that the destination database does not truncate tweets longer than 280 characters.
- Check that older tweets are not corrupted once you change the max_length constraint in the tweet_text column.
The verification of data in regression tests is domain-specific, so you must think about how to apply regression tests to your production validation testing.
5.6 End-to-end testing
End-to-end testing, also called data integration testing, is used to find out how data fits beyond the ETL pipeline.
As an example, let’s say that the data is fed from the destination warehouse to a BI tool like Looker. Your end-to-end tests would check whether the reports in Looker use data which has the same row counts, primary keys, data distribution, and aggregated metrics (min, max, avg) as your warehouse and your source data.
Integration tests allow you to have an additional layer of control over the quality of data once it leaves your ETL pipeline.
5.7 Performance testing
Performance testing refers to a higher-level analysis than the tests mentioned above. Instead of looking at quality assurance within the data, it checks for quality assurance of the entire infrastructure.
Performance testing determines whether your ETL system can scale with increased data volumes and ingestion velocity.
The main goal is to identify the performance of each query session and eliminate bottlenecks to optimize overall performance.
You should profile queries with the tools needed to understand their performance. These tools depend on the specific vendor you use (e.g. Postgres comes with an out-of-the-box suite of performance analytic tools using EXPLAIN and ANALYZE; SQL Server has the famous Profiler, which allows you to analyze events on a granular level, etc.), so take a good look into the documentation of your specific toolbox.
6. The future of ETL testing: Automation
Automation of ETL testing is extremely beneficial. Not only does it save time that would otherwise be spent on manual testing, automating the testing pipeline is less prone to human error, and can be scaled and re-run without wasting additional management hours on reframing your ETL testing infrastructure.
Usually, we adopt ETL testing tools to unlock the potential of automation.
Keboola offers such tools, acting as a one-stop shop for all of your data operation needs. As an end-to-end data platform, you can use Keboola to:
1. Develop your entire ETL pipeline in a matter of minutes:
a) Set up your extraction procedures with a couple of clicks.
b) Implement the transformation logic layer.
2. Connect your destination databases without additional coding.
a) Instrumentalize the entire ETL pipeline with ETL tests.
b) Orchestrate scripts to achieve test automation.
Because automation is imbued in the infrastructure of Keboola, you can set up your ETL tests alongside your ETL logic in a centralized manner. This gives you a better overview of the system and allows you to alter code from a single place.
With its rich ecosystem of apps, which have been developed as a plug-and-play design, you can additionally click-to-integrate to set up ETL test monitoring. This means that you’ll always be in the know when it comes to the performance of your ETL system.
Ready to give it a try? Automate ETL testing at no cost. Create a free account.