Learn more about data quality, its importance for your business, and how to improve it.
We’ve all heard the war stories born out of wrong data:
- Important packages are sent to the wrong customer.
- Double payments are made to suppliers due to corrupted invoicing records.
- Sales opportunities are missed because of incomplete product records.
These stories don’t just make you and your company look foolish; they also cause serious economic damage. And the more your enterprise relies on data, the greater the potential for harm.
Here, we take a look at what data quality is and how the entire data quality management process can be improved.
What is data quality?
Defining data quality is an elusive task. Even though we have an intuitive feeling that it relates to data of high standards, the exact definition is tough to pin down. Various institutions, academics, and industry experts have tried to specify the characteristics of data integrity in their definitions of data quality.
For example, Fleckenstein and Fellows (2018) refer to high-quality data as data that "are fit for their intended uses in operations, decision making and planning". In a similar vein, the National Institute of Standards and Technology defines data quality as: "the usefulness, accuracy, and correctness of data for its application".
So, unless we are a student trying to pass an exam in data management processes, why do we care about these definitions? It’s clear from the definitions above that both are oriented towards the pragmatic aspects of data quality. Having high-quality data allows us to plan, make decisions, and use data in various applications.
But why does this matter? Data quality has huge ramifications for the business’s bottom line. Having a clear definition of what constitutes data quality allows us to measure and fix it.
Let’s dive deeper into why data quality is so important.
Why is data quality important?
The war stories mentioned in the introduction speak volumes about the importance of data. But the quality of data is important for a multitude of other reasons:
- Data quality affects the bottom line. Low-quality or corrupted data will affect your business operations from a financial standpoint. From increased expenses when making mistakes (returns of goods sold, double invoicing, etc.) to loss of financial opportunities (negotiating lower supply costs, missing out on sales due to incomplete data or lack of customer trust, etc.), low-quality data costs more than it first might seem.
- Data quality affects trust in data. When issues with data quality are discovered, you lose trust. Customers may not trust you because you’ve made mistakes, while business leaders might not find the data reliable for decision-making. Whatever the case, low data quality has long-term damaging effects on the reputation of data and the people who take care of it.
- High-quality data is necessary for data products. We’re running businesses in an age when more and more products depend on data. Whether it’s applications that use customer data to provide services (financial investment apps, sports apps, etc.) or machine learning products that base their entire performance on data, having high-quality data for your product is the same as having high-quality fuel for your rocket ship. Unless the fuel is of a superior standard, the rocket is not going to fly. Or as machine learning engineers say: “Garbage in, garbage out.” Bad data is just not going to cut it. Ensuring that data is as good as it possibly can be is a prerequisite for a high-performing product line.
What are the common data quality issues?
There are as many issues with data quality as there are data experts with war stories.
Ask any data engineer or architect and they will gladly share how a database design or analytics implementation led to a massive business debacle.
To understand the recurrent issues surrounding data quality, we have to group these issues around common themes, which are known as the dimensions of data quality.
There are multiple dimensions of data quality which matter:
- Data accessibility or availability. Access to data is necessary if we want to analyze it and draw conclusions that lead to profitable business insights. Issues regarding data accessibility can happen at any stage along the ETL pipeline. Our data collection could be broken, skipping the import of some datasets into our database, or we could encounter a problem with sharing permissions, which prevents analysts from accessing the data required for their analysis. This also hinders the collaboration between different analysts because they lack access to the data that is needed to work together.
- Data accuracy or correctness. Accuracy refers to how well the data reflects the real world that it’s trying to describe. This characteristic of data quality is hard to specify in data-quality standards because accuracy issues take on many forms, from changing addresses that are not updated within customer records to misspellings and wrongful insertions. Data accuracy is usually asserted by applying business rules within the data cleansing process, which checks the data for correctness.
- Data completeness or comprehensiveness. Missing data values always present an issue within data operations. Ensuring that the records are complete is one of the characteristics of high-quality data. During the data cleaning process, the data assets with missing values are either removed or they are imputed with the best estimates as replacements.
- Data consistency, coherence, or clarity. When two records about the same unit hold conflicting information, they are not just inconsistent - they also dampen your ability to make data-driven decisions. And let’s not even think about the regulatory compliance issues you can get into if your financial reports show inconsistent data...
- Data relevance, pertinence, or usefulness. You might have collected all of the data in the world, but it’s completely useless if it’s not relevant to your analysis and your business. Collecting relevant or useful data (and discarding the rest) is part of data quality assurance.
- Data timeliness or latency. How quickly is the data available to us? If there is a delay between collecting data from its data sources and analyzing it, we could lose out on the potential of real-time analytics. If the delays are even longer, we might produce reports before all of the data is available, thus creating a gap between what is reported (with missing data) and what is actually true (once the delayed data arrives).
- Data uniqueness. Some data is unique by design, such as the UUID of your product, or the identity of your customers. The common issue in data quality is record duplication, whereby the same information is inserted multiple times. This issue usually arises during data entry, especially if it’s done manually.
- Data validity or reasonableness. Valid data are those that are in line with the business or technical constraints. For example, your customer is probably not 140 years old, so it’s likely that there’s a validity issue here. But validity does not just refer to semantic constraints (such as age). It also includes the distribution of data and its aggregated metrics. Looking at the mean, median, mode, standard deviations, outliers, and other statistical characteristics allows you to discern the validity of your data.
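Several of these dimensions translate directly into checks you can run. The snippet below is a minimal sketch in plain Python, not a definitive implementation: the records, the field names (id, email, age), and the 0 to 120 age range are all assumptions made up for illustration. It measures completeness, looks for duplicated identifiers, and flags implausible ages.

```python
from collections import Counter

# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": None, "age": 140},
    {"id": 2, "email": "bo@example.com", "age": 29},
]

def completeness(rows, field):
    """Share of rows where `field` is present and not None."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)

def duplicate_keys(rows, key):
    """Key values that appear more than once (a uniqueness check)."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def invalid_ages(rows, low=0, high=120):
    """Rows whose age falls outside an assumed plausible range."""
    return [
        r for r in rows
        if r["age"] is not None and not (low <= r["age"] <= high)
    ]

report = {
    "email_completeness": completeness(records, "email"),
    "duplicate_ids": duplicate_keys(records, "id"),
    "invalid_age_ids": [r["id"] for r in invalid_ages(records)],
}
print(report)
```

In this made-up sample, one email is missing, the id 2 appears twice, and the 140-year-old customer fails the validity rule. Real business rules would replace the thresholds used here.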
Who is responsible for data quality?
Data quality is everyone’s business because good data quality allows everyone to trust the process and do their best work. However, depending on the type of operations you run, different people might be responsible for asserting high-quality data.
In enterprises and cross-organizational deployments, there is usually a data management team in charge of asserting data quality. The team comprises a data manager, who oversees the entire data quality assurance operation, practitioners who resolve technical conflicts, and data stewards. The latter are responsible for communicating data quality issues and problem resolutions across the silos within the business.
In smaller organizations, startups, and home-businesses, the responsibility often falls on the shoulders of the ‘data person’ (data scientist, business analyst, or data engineer) or someone from the IT department.
How do these teams and individuals achieve high-quality data? They go through the cycle of data quality management and improve it.
How to improve data quality
Improving the quality of your data follows a sequence of best practices:
- Start by setting up a data governance framework. The data governance framework specifies which standards you will follow and what business requirements and rules need to be applied to achieve high-quality data. This also includes regulatory compliance, i.e. how your data quality practices fulfill the European Union's General Data Protection Regulation (GDPR) and/or California Consumer Privacy Act (CCPA) regulations.
- Set up KPIs or goals for data quality. Identify the data quality dimensions that need fixing and specify them as KPIs. A common way to assess how much ‘data accuracy’ has been improved is to measure the number of data assets (tables, databases, ETL pipelines, etc.) that you have checked for accuracy issues. Make sure that you also set up a logging system for data quality reporting.
- Profile data and establish a list of issues. Data profiling refers to the analysis of data which produces a report on data distribution, frequencies, central tendencies, and deviations. This can then be used in understanding the structural level of data. Use this and other analyses to compile a list of issues which need fixing.
- Fix the issues. It’s as simple as that - fix them. This is usually done by data practitioners (hands-on data managers, data engineers, and data scientists) by cleaning the data (we have written a long guide on the best practices for cleaning data - check it out here). Be sure to log every fix so that you can generate a report of all the findings.
- Iterate or prevent issues from recurring. Fixing data quality issues is cyclical. Once you’re done, you need to recheck your data platforms to verify that everything meets the standards set out in your data governance framework. If it doesn’t, you need to re-clean the data. Advanced approaches prevent data quality issues from recurring, which we expand on in the next section.
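To make the profiling and logging steps more concrete, here is a small sketch that uses only Python's standard library. The column name (order_total), the sample values, and both thresholds are assumptions for illustration. The outlier rule based on the median absolute deviation is just one simple option among many, not a prescribed method.

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_quality")

def profile(values):
    """Summarize a numeric column: size, missing values, center, spread.

    Assumes the column holds at least one non-missing value.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }

def find_issues(name, values, max_missing_ratio=0.1, mad_factor=5):
    """Return readable issues for one column (thresholds are assumptions)."""
    summary = profile(values)
    issues = []
    if summary["missing"] / summary["count"] > max_missing_ratio:
        issues.append(
            f"{name}: {summary['missing']} of {summary['count']} values missing"
        )
    present = [v for v in values if v is not None]
    median = summary["median"]
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median(abs(v - median) for v in present)
    for v in present:
        if mad and abs(v - median) > mad_factor * mad:
            issues.append(f"{name}: suspicious value {v}")
    return issues

order_totals = [20.0, 22.5, None, 19.0, 21.0, 950.0]  # made-up sample data
issues = find_issues("order_total", order_totals)
for issue in issues:
    log.info(issue)  # keep a record so that fixes can be reported later
```

Running the same checks again after each round of fixes gives you a simple way to track whether the list of issues is shrinking.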
How to ensure data quality in the long run
Whether or not you have gone through the process of asserting data quality before and have cleaned your data, there are several issues which are always going to demand your attention:
- Entropy. No matter how well you cleaned your resources before, data is alive and being constantly updated, so new errors are likely to emerge.
- The nature of big data. Big data is best characterized by the 3 Vs: volume, velocity, and variety. Volume refers to how the quantity of data is increasing every day. Velocity relates to how data production is accelerating. And variety refers to how data takes many different forms: while most data in the past was relational (database tables, Excel records, etc.), a lot of data nowadays is unstructured (text files, website link streams, video recordings, etc.). Companies that use data in their decision-making or products sway towards big data and its various advantages and issues. Tapping into the potential of big data means that we also face the challenges of scaling our infrastructure for data collection without causing issues (such as corrupted and missing data), as well as adjusting our quality assurance process to the demands of unstructured data.
- Regulations. Regulations such as GDPR and CCPA are just some of the legal requirements that we have to comply with. Novel regulations are introduced and existing ones are updated, which demands constant supervision and changes to the data quality assurance work that we undertake.
So, how do companies keep their data in check with all of these factors influencing data quality?
The answer is through quality software that’s based on best practices. Good software helps us to manage data in several ways to assure its quality:
- Prevents violations. Good software prevents data quality issues from arising. For example, you might set up (primary key) constraints for your relational table which prevent duplicate records from being inserted.
- Monitors data pipelines. Good software monitors your data platforms and notifies you whenever it suspects corrupted data, or sounds the alarm when corruption actually happens (e.g. a data-collection pipeline fails).
- Automates critical ETL processes. Cleaning data boils down to a set of repetitive commands executed in your favorite language (SQL, Python, etc.). Good software allows you to automate these ETL processes so that your data is consistently of high quality.
- … and more.
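As an illustration of preventing violations at the source, the sketch below uses Python's built-in sqlite3 module with an in-memory database. SQLite is chosen here only because it needs no setup, and the customers table with its columns is a hypothetical example. The primary key rejects a duplicate record, and a NOT NULL constraint rejects an incomplete one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO customers (id, email) VALUES (?, ?)", (1, "ana@example.com")
)

incoming = [
    (1, "ana@example.com"),  # duplicate id: violates the primary key
    (2, None),               # missing email: violates NOT NULL
    (3, "bo@example.com"),   # valid row
]

rejected = []
for row in incoming:
    try:
        conn.execute("INSERT INTO customers (id, email) VALUES (?, ?)", row)
    except sqlite3.IntegrityError as exc:
        # The database refuses the row; record why instead of storing bad data.
        rejected.append((row, str(exc)))

stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(f"stored={stored}, rejected={len(rejected)}")
```

Only the original row and the one valid incoming row end up in the table. The two rejected rows are kept in a list together with the reason, which could feed a data quality report.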
A platform to manage data quality
Good software can help you to manage the overall data quality of your data assets.
Keboola is an example of such software. As a unified DataOps platform, Keboola lets you:
- Set up your data pipeline within the platform itself. The entire ETL process (extracting data sources, transforming raw data by cleaning it, and loading the data into your database of choice) can be achieved in just a couple of clicks.
- Set up your data cleaning process within transformations to guarantee the data quality standards of your data governance framework.
- Orchestrate your transformation to run automatically and rest assured that it will always provide you with reliable data.
- Monitor the end-to-end data pipeline for reliability.
But Keboola takes it a step further:
- Fully complies with global regulatory demands (GDPR, CCPA, and many more).
- Offers best-in-industry levels of security.
- Allows collaboration between all of your data parties. Access issues are a thing of the past with Keboola’s granular and intuitive permission control.
- Scales seamlessly. Do you want big data? Not a problem with Keboola. The infrastructure takes care of itself, so you won’t suffer growing pains if you choose to include more sources or different data assets.
Ready to give it a try? Check out everything that Keboola has to offer on this (forever) free plan. Yes, forever.