How to clean data to make it ready for analysis and machine learning.
While digging through data, Anna spots an interesting trend: some customers buy three times more than others. A segment of super-high spenders? This could make it rain for the company! She rushes to her boss to show him the data, only to hear:
“Yeah, I know. We have a bug, which inserts every order three times in the database. Were you not paying attention during our daily meeting?”
This mistake can be easily avoided by cleaning your data before analyzing it and jumping to conclusions, or worse, showing it to others.
In this guide you will learn steps you can take to ensure you are always working with clean and analysis-ready data:
Put on your detective hat and jot down everything interesting, surprising, or even weird.
Look for the following:
Knowing the problem is half the battle. The other half is solving it.
How do you solve it, though? One ring might rule them all, but one approach is not going to cut it with all your data cleaning problems.
Depending on the type of data-cleaning challenge you’re facing, you’ll need different data-scrubbing techniques.
Ignoring missing data is like ignoring the holes in your boat while at sea - you’ll sink.
Start by spotting all the different disguises missing data wears. It appears in values such as 0, “0”, empty strings, “Not Applicable”, “NA”, “#NA”, None, NaN, NULL, or Inf.
Programmers before you might have put default values instead of missing data (“firstname.lastname@example.org”). When you have a general idea of what your missing data looks like, it is time to answer the crucial question:
“Is missing data telling me something valuable?”
Sometimes missing data carries valuable information of its own. For example, an empty “payment_date” field can tell you the customer has not paid yet. If you are trying to separate your customers into late payers vs on-time payers, this is valuable. If not, omit the field from your analysis.
There are three main approaches to cleaning missing data:
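The three approaches usually cited are dropping incomplete rows, imputing a substitute value, or flagging the missingness as a feature of its own. A minimal pandas sketch, first unifying the disguises from above into NaN (the orders table and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical orders data with missing values hiding under several disguises.
orders = pd.DataFrame({
    "payment_value": [120.0, 0, "NA", 85.5, None],
    "email": ["a@example.org", "firstname.lastname@example.org", "",
              "b@example.org", "c@example.org"],
})

# Step 1: unify the disguises into a single canonical marker (NaN).
sentinels = [0, "0", "", "NA", "#NA", "Not Applicable", "Inf"]
orders = orders.replace(sentinels, np.nan)

# Step 2: pick one of the three approaches.
dropped = orders.dropna(subset=["payment_value"])  # 1) drop incomplete rows

numeric = pd.to_numeric(orders["payment_value"])
imputed = orders.assign(payment_value=numeric.fillna(numeric.median()))  # 2) impute

flagged = orders.assign(is_missing=orders["payment_value"].isna())  # 3) flag and keep
```

Which approach is right depends on the question from above: if the missingness itself is informative, flag it instead of dropping or imputing it away.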
Outliers are data points that are at an extreme. They usually have very high or very low values:
How to interpret those? Outliers usually signify either very interesting behavior or a broken collection process. Both are valuable information but proceed with cleaning only if the behavior is actually interesting.
There are three approaches to dealing with outliers:
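For illustration, here is the common 1.5 × IQR rule for detecting outliers, followed by the three typical treatments: remove them, cap (winsorize) them, or keep them for separate analysis. The values are made up:

```python
import pandas as pd

# Hypothetical order values with one suspiciously large entry.
values = pd.Series([12, 15, 14, 13, 16, 15, 14, 900])

# Detect outliers with the 1.5 * IQR rule.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

removed = values[~is_outlier]        # 1) remove outliers
capped = values.clip(lower, upper)   # 2) cap (winsorize) them
interesting = values[is_outlier]     # 3) keep them for separate analysis
```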
This is another red flag for your collection process. Examples of contaminated data include:
With corrupted data, there is not much you can do except remove it.
This process usually requires a lot of domain expertise, so if you are lacking the knowledge make sure to consult non-analytical members of your team.
It’s also important to check and fix any leakages your data collection pipeline has so that the data corruption does not repeat with future data collection.
“Wait, did we sell ‘Apples’, ‘apples’, or ‘APPLES’ this month? And what is this ‘monitor stand’ for $999 under the same product ID?”
Data inconsistencies are inevitable, especially with manual data entry, where the possibility of human error is higher (e.g. when salespeople enter the product info on proforma invoices by hand).
Data analysis via visualizations is one of the best ways to spot inconsistent representations of the same elements in your database.
Prepare bar charts per product category, or do a count of rows by category if this is easier.
When you spot the inconsistency, standardize all elements into the same format. Humans might understand that ‘apples’ is the same as ‘Apples’ (capitalization) which is the same as ‘appels’ (misspelling), but computers think those three refer to three different things altogether. Lowercasing as default and correcting typos are your friends here.
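A sketch of that workflow in pandas: count category frequencies to spot the variants, then lowercase everything and map known misspellings (the sales table is hypothetical):

```python
import pandas as pd

# Hypothetical sales records with inconsistent product names.
sales = pd.DataFrame({"product": ["Apples", "apples", "APPLES", "appels", "Bananas"]})

# Count rows by category to spot inconsistent spellings of the same product.
counts = sales["product"].value_counts()

# Standardize: lowercase and trim everything, then correct known misspellings.
corrections = {"appels": "apples"}
sales["product"] = sales["product"].str.lower().str.strip().replace(corrections)
```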
Similarly to corrupted data, invalid data is illogical.
For example, users who spend -2 hours on our app, or a person whose age is 170. Unlike corrupted data, invalid data does not result from faulty collection processes, but from issues with data processing.
Cleaning invalid data mostly means amending the functions and transformations which caused the data to be invalid.
If this is not possible, remove the invalid data.
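When the upstream fix is out of reach, a fallback is to filter out rows that break basic validity rules. A sketch with pandas; the column names and thresholds are illustrative:

```python
import pandas as pd

# Hypothetical user records containing logically impossible values.
users = pd.DataFrame({
    "hours_in_app": [2.5, -2.0, 4.0],
    "age": [34, 170, 28],
})

# Keep only rows that pass basic validity rules:
# time spent cannot be negative, and human age should fall in a plausible range.
valid = users[(users["hours_in_app"] >= 0) & (users["age"].between(0, 120))]
```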
Duplicate data is damaging to your analysis because it can inflate or deflate your numbers.
There are different reasons why duplicate values happen:
There are three ways to remove duplicates:
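Whichever way you choose, pandas’ drop_duplicates covers the mechanics: exact-row duplicates, duplicates by a business key, or flagging them for inspection first. The data below is made up, echoing the triple-insert bug from the intro:

```python
import pandas as pd

# Hypothetical orders where a bug inserted each order three times.
orders = pd.DataFrame({
    "order_id": [101, 101, 101, 102, 102, 102],
    "amount": [50, 50, 50, 75, 75, 75],
})

deduped = orders.drop_duplicates()                  # drop exact-row duplicates
by_key = orders.drop_duplicates(subset="order_id")  # dedupe by a business key
dup_mask = orders.duplicated(subset="order_id", keep="first")  # flag for inspection
```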
On your data cleaning journey you can encounter problems that are specific to the data type.
Strings are usually the messiest part of the data cleaning process because they are often human-generated and hence prone to errors.
The common cleaning techniques for strings involve:
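A few of the typical techniques - trimming whitespace, collapsing repeated spaces, stripping stray punctuation, and standardizing case - sketched in plain Python (the input string is made up):

```python
import re

# A hypothetical messy, human-entered product name.
raw = "  Wireless   Mouse!! (Black) \n"

cleaned = raw.strip()                         # trim surrounding whitespace
cleaned = re.sub(r"\s+", " ", cleaned)        # collapse repeated whitespace
cleaned = re.sub(r"[^\w\s()-]", "", cleaned)  # strip stray punctuation
cleaned = cleaned.lower()                     # standardize casing
```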
Dates and times can be tricky. Sometimes an error will not be apparent until you run computations on them.
The cleaning process involves:
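One way to surface date problems early is to parse everything up front and let unparseable or impossible values fall out as NaT, then sanity-check the result. A pandas sketch with made-up values:

```python
import pandas as pd

# Hypothetical payment dates: one valid, one impossible (Feb 30), one garbage.
raw_dates = pd.Series(["2023-01-15", "2023-02-30", "not a date"])

# errors="coerce" turns unparseable or impossible values into NaT
# instead of raising, so you can inspect what failed.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Sanity check: flag payments dated in the future.
in_future = parsed > pd.Timestamp.now()
```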
Rescale your numeric data to a common range (usually between 0 and 1 for normalization).
This matters when features live on very different scales: if your payment_value column holds numbers in the 1000s while quantity_purchased is in the 10s, rescaling makes the two features comparable (0.8 and 0.3, for instance).
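Both min-max normalization and z-score standardization are one-liners in pandas. A sketch reusing the hypothetical payment_value and quantity_purchased columns:

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "payment_value": [1000.0, 3000.0, 5000.0],
    "quantity_purchased": [10.0, 20.0, 50.0],
})

# Min-max normalization: rescale each column into the [0, 1] range.
normalized = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: zero mean, unit variance per column.
standardized = (df - df.mean()) / df.std()
```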
Standardizing and normalizing data has other advantages:
A lot of data cleaning is about preventing data issues rather than resolving data issues.
This is especially true for structural errors that can occur during measurement, data transfer, or other situations and can lead to inconsistent data, data duplication, or contamination.
But unlike the treatment advised above, you are not going to solve structural errors by applying cleaning techniques to them. You can clean the data all you want, but at the next import, the structural errors will produce unreliable data again.
To solve them, you will have to check your ETL pipeline and how you collect and transform data from their raw data sources to identify where the source of structural errors is and remove it.
Once cleaned, repeat steps 1 and 2.
This is helpful for three reasons:
Cleaning data is not the sole domain of data science.
High-quality data is necessary for any type of data analytics and decision-making. From startups launching the next Google search algorithm to business enterprises relying on Microsoft Excel for their business intelligence - clean data is the pillar upon which data-driven decision-making rests.
By now it is clear how important data cleaning is. But it still takes way too long and it is not the most intellectually stimulating challenge.
To avoid losing time, you can use Keboola to automate a lot of repetitive tasks.
Keboola is a data stack as a service that is packed with features that help you prepare data:
Keboola is designed to automate and streamline all of your data management tasks, so you can cut the time-consuming manual work and spend more time on strategic tasks that benefit the business.
Data cleansing, also called data cleaning or scrubbing, removes errors, duplicates, and irrelevant data from a raw dataset, creating reliable visualizations and models for business decisions.
What is an example of data cleaning?
Merging duplicate product listings (for example, the same product listed separately for each color or size) into a single, organized listing. This helps prevent confusion for customers and ensures accurate inventory management.
What method is used for data validation?
You can clean data manually by following an 11-step process or use automated software tools such as Keboola.
What tools are used for data cleaning?
Users commonly refer to Keboola, OpenRefine, Trifacta, DataWrangler, and Microsoft Power Query as some of the best data cleaning tools currently on the market.
What is the difference between data cleaning and data transformation?
Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data, while data transformation involves converting and reshaping the data to meet specific analysis requirements or business needs.