Guided solutions and best practices for the most common data quality problems
Can you put a number to how much bad data is costing your business?
“Every year, poor data quality costs organizations an average of $12.9 million” - Gartner
This includes everything from marketing losing 21 cents of every advertising dollar due to misinformed business decisions to data engineers wasting more than 3h daily resolving data quality issues.
Luckily, you can guarantee high-quality data with 9 simple checks. In this article, we’ll look at:
Below we showcase the 9 most common problems companies face when trying to improve the quality of their data.
Data-driven decision-making is only as effective as the data used to make business decisions. If the data is missing, business choices get detached from the competitive insights that data can provide.
There are three main drivers for inaccessible data during decision-making time:
To make your data accessible, diagnose the cause of the problem and apply one of the following solutions:
“Best-in-Class companies are 30% more likely to have a dedicated data catalog solution” - Aberdeen Strategy & Research, 2020
Data accuracy measures how well your data matches the real world. An example of inaccurate data is a customer changing their residence, but your CRM shipping address doesn’t get updated alongside. Or your sales rep entering the shipping address while talking to your customer on the phone, but mistyping the street name.
Multiple factors create inaccurate data, including human entry or updating errors, data drift, and data decay.
There isn't a single solution to data accuracy. The best approach is to apply business rules to your data cleaning procedure to verify the accuracy of the data. For example, verifying if every CRM shipping address corresponds to a real-world location via the Google Maps API.
From incomplete insights to machine learning models learning only on a (biased) subset of data, missing data can lead your data products and decision-making astray.
Multiple drivers lead to missing values, including bugs in data integration pipelines, errors from data providers, human entry and update errors, and the nature of the data collection (e.g. survey data often have missing fields because people don’t answer every question).
Missing values are often represented as NULL values in SQL systems, or as NaNs or Nones in other systems. Run your data quality tests by checking for missing values in your dataset (column-wise and row-wise). If you’re certain there should be no missing values (e.g. primary key column, unique identifiers), you can create NOT NULL values constraints for your SQL table.
When you look up a customer’s purchase history and the CRM says they only bought one item, while the transaction database shows the customer is a regular big spender you start to question the trust you have in reporting.
Data inconsistency refers to conflicting information about the same operational unit (such as a customer, product, business branch, etc.). This inconsistency erodes trust in data, makes data less usable for decision-making, and can even lead to regulatory compliance issues.
Inconsistencies arise whenever a company grows. This can be due to multiple unsynchronized data integration pipelines, databases that are not fully migrated, or ad hoc datasets prepared for a specific use case but then integrated into the source of truth storage, etc.
You need to ensure that the same data in different locations matches. This can be achieved with multiple data quality tests:
Sometimes we have so much data that the majority of it is irrelevant. For example, the weather on the date our customer purchases our software product. Sure, maybe there is a relationship between rainy days and purchases, but how strong is it if we’re not an umbrella company?
We import every data source imaginable, irrespective of the need for it. Additionally, by not sanitizing the data before making it available for analysis, we risk introducing irrelevant data into the insights process.
Improve your filtering processes. Consider either not importing the irrelevant data at all (limit extraction in the ETL paradigm) or limiting which data moves from the data lake to the data warehouse (data transformation layer in the ELT process).
Data freshness is important in decision-making. Some data ages like fine wine (transaction data), while other data gets irrelevant and stale as time passes by. This particularly affects customer data - for example, shipping addresses change, and purchases made 17 months ago should not influence recent upsell suggestions.
Data becomes stale as a matter of nature. You’ll always have to fight data staleness. Additionally, concept drift and business need changes affect which data is relevant right now.
There is no need to only look at real-time data. But make sure to filter records based on their timeliness. This will be analysis specific - for recent purchases, introduce a sensible business rule (e.g. 1 week for fast-moving goods, 1 year for long B2B sales cycles) that filters out data based on its freshness.
Invalid data doesn’t make sense. The term 'data validity' is sometimes used to encompass all data quality testing, but here, we refer to it in terms of how "reasonable" data is. For example, a customer cannot be 160 years old (semantic invalidity), and the average number of meals per day can’t be over 50 (statistical outlier).
At best, invalid data misguides your business decision-making. At worst, statistically invalid data can severely skew your analytic and machine learning models, making them biased towards outliers and statistical anomalies.
Invalid data is caused by multiple processes: human entry and update errors, and statistical anomalies of systems (some outliers truly exist, but can be deceitful when analyzed).
Use data quality tests that check for invalid data:
Duplicated data values can be problematic in and of themselves. They bias judgment toward duplicated entries. But sometimes duplicated data records can be classified as invalid data. For example, non-unique social numbers or identifiers can be problematic when searching for a particular customer’s data.
Multiple causes: human entry errors, integration mistakes, database migration or replication errors, etc.
Identify duplicate data values and either remove both (or all) copies, keep the first/last entry, or deduplicate - introducing logic to disambiguate duplicate entries.
Data can change without that making sense. This can range from formatting issues (for example, a column representing the purchase value of orders suddenly displaying dates instead of dollar values) to removed columns, and everything in between.
Data formatting issues typically occur due to human error and migration errors. Meanwhile, schema changes (missing, altered, or added columns) usually take place when third-party providers modify the schema at the source (for instance, during an API change).
Make sure to test for data formatting issues. Create checks and alerts for your schema changes as part of the regular data quality assurance by running additional tests.
Where (and when) to run your tests depends on your general data architecture and use cases:
Irrespective of when and where you run data quality checks, make observability a priority. Make data quality part of your data governance and introduce it in every data quality check.
Data quality management should not be the sole responsibility of your data team. Everyone who uses data for their business needs should understand the general quality of data they work with. And should also be responsible for the quality of the data.
How can everyone be involved when data quality checks are usually run in data quality tools by 3rd party providers (which require specialized knowledge) or by using open-source programming libraries (and everyone does not know how to code)?
By using tools that offer you all possible testing solutions:
With Keboola, you can run all three methods of data quality checks.
Pro tip: High data quality is not just a result of running data quality checks automatically, consistently, and thoroughly. High data quality also comes from correctly designing your data pipelines. Check our how-to guide that walks you through the 7 steps for higher quality data operations.
Keboola is a data stack as a service that automates all your data operations, including data quality checks.
With Keboola you can:
Build your ETL and ELT data pipelines with peace of mind and automate data quality checks in Keboola today.