How to Clean Data: The Ultimate Guide (2023)

Contents

Example H2

Example H3

How To

January 17, 2023

Updated on

5 min read

How to Clean Data: The Ultimate Guide (2023)

No items found.

How to clean data to make it ready for analysis and machine learning.

Download for Free

Oops! Something went wrong while submitting the form. Try it again please.

Scroll to download

While digging through data, Anna spots an interesting trend - some customers buy 3 times more than others. A segment of super-high spenders? This could make it rain for the company! She rushes to her boss to show him the data, only to hear:

“Yeah, I know. We have a bug, which inserts every order three times in the database. Were you not paying attention during our daily meeting?”

Aw-kward.

This mistake can be easily avoided by cleaning your data before analyzing it and jumping to conclusions, or worse, showing it to others.

In this guide you will learn steps you can take to ensure you are always working with clean and analysis-ready data:

Identify the problematic data
Remove, encode, and fill in any missing data
Remove outliers or analyze them separately
Purge contaminated data and correct leaking pipelines
Standardize inconsistent data
Amend the functions and transformations that cause invalid data
Eliminate duplicated data
Foresee and prevent type issues
Standardize and normalize data
Remove structural errors
Rinse and repeat‍‍

MCP Online Meetup + Feedback

Complete the form below to get your complimentary copy.

Oops! Something went wrong while submitting the form.

Download our free data-cleaning checklist to identify and resolve any quality issues with your data in just 11 steps.

Download the checklist

Data cleaning is a 3-step process

Step 1: Find out what is wrong with your data

Put on your detective hat and jot down everything interesting, surprising, or even weird.

Look for the following:

Any rows with empty values. Are there entire columns with no data? Which data is missing and why?
Find outliers by checking distributions to see which groups or ranges are more heavily represented in your data set. Remember, visualizations are your friends.
Anything that stands-out as weird. Are there impossible values? Like “date of birth: male”, “address: -1234”.
Make sure the data is consistent. For example, some product names are sometimes written in uppercase and other times in lowercase.

Step 2: Resolving corrupt, inaccurate, or irrelevant data

Knowing the problem is half the battle. The other half is solving it.

How do you solve it, though? One ring might rule them all, but one approach is not going to cut it with all your data cleaning problems.

Depending on the type of data-cleaning challenge you’re facing, you’ll need different data-scrubbing techniques.

2.1 Missing data

Ignoring missing data is like ignoring the holes in your boat while at sea - you’ll sink.

Start by spotting all the different disguises missing data wears. It appears in values such as 0, “0”, empty strings, “Not Applicable”, “NA”, “#NA”, None, NaN, NULL, or Inf.

Programmers before you might have put default values instead of missing data (“email@company.com”). When you have a general idea of what your missing data looks like, it is time to answer the crucial question:

‍“Is missing data telling me something valuable?”

Data might be missing as an alternative to saving other valuable information. For example, an empty “payment_date” row can tell you the customer has not paid yet. If you are trying to separate your customers into late payers vs on-time payers, this is valuable. If not, you should omit it from your analysis.

There are three main approaches to cleaning missing data:

Drop rows and/or columns with missing data. If the missing data is not valuable, just drop the rows (i.e. specific customers, sensor reading, or other individual exemples) from your analysis. If entire columns are filled with missing data, drop them as well. There is no need to analyze the column “Quantity of NewAwesomeProduct Bought” if no one has bought it yet.
Recode missing data into a different format. Numerical computations can break down with missing data. Recoding missing values into a different column saves the day. For example, the column “payment_date” with empty rows can be recoded into a column “payed_yet” with 0 for “no” and 1 for “yes”.
Fill in missing values with “best guesses.” Use moving averages and backfilling to estimate the most probable values of data at that point. This is especially crucial for time-series analyses, where missing data can distort your conclusions.

2.2 Outliers

Outliers are data points that are at an extreme. They usually have very high or very low values:

An antarctic sensor reading the temperature of 100º
A customer who buys $0.01 worth of merchandise per year

How to interpret those? Outliers usually signify either very interesting behavior or a broken collection process. Both are valuable information but proceed with cleaning only if the behavior is actually interesting.

There are three approaches to dealing with outliers:

Remove outliers from the analysis. Having outliers can mess up your analysis by bringing the averages up or down and in general distorting your statistics. Remove them by removing the upper and lower X-percentile of your data.
Segment data so outliers are in a separate group. Put all the “normal-looking” data in one group, and outliers in another. This is especially useful for analysis of interest. You might find out that your highest paying customers, who actually buy 3 times above average, are an interesting target for marketing and sales.
Keep outliers, but use different statistical methods for analysis. Weighted means (which put more weight on the “normal” part of the distribution) and trimmed means are two common approaches of analyzing datasets with outliers, without suffering the negative consequences of outliers.

2.3 Contaminated data

This is another red flag for your collection process. Examples of contaminated data include:

Wind turbine data in your water plant data set.
Purchase information in your customer address data set.
Future data in your current event time-series data.

With corrupted data, there is not much you can do except for removing it.

This process usually requires a lot of domain expertise, so if you are lacking the knowledge make sure to consult non-analytical members of your team.

It’s also important to check and fix any leakages your data collection pipeline has so that the data corruption does not repeat with future data collection.

2.4 Inconsistent data

“Wait, did we sell ‘Apples’, ‘apples’, or ‘APPLES’ this month? And what is this ‘monitor stand’ for $999 under the same product ID?”

Data inconsistencies are inevitable. Especially with manual data entry when there is a higher possibility of human error (e.g. when salespeople enter the product info on proforma invoices manually).

Data analysis via visualizations is one of the best ways to spot inconsistent representations of the same elements in your database.

Prepare bar charts per product category, or do a count of rows by category if this is easier.

When you spot the inconsistency, standardize all elements into the same format. Humans might understand that ‘apples’ is the same as ‘Apples’ (capitalization) which is the same as ‘appels’ (misspelling), but computers think those three refer to three different things altogether. Lowercasing as default and correcting typos are your friends here.

2.5 Invalid data

Similarly to corrupted data, invalid data is illogical.

For example, users who spend -2 hours on our app, or a person whose age is 170. Unlike corrupted data, invalid data does not result from faulty collection processes, but from issues with data processing.

‍Cleaning invalid data mostly means amending the functions and transformations which caused the data to be invalid.

If this is not possible, remove the invalid data.

2.6 Duplicate data

This is damaging to your analysis because it can either deflate/inflate your numbers.

There are different reasons why duplicate values happen:

Data is combined from different sources and each source brings in the same data to your database.
The user might submit information twice by clicking on the submit button.
Your data collection code is off and inserts the same records multiple times.

There are three ways to remove duplicates:

Find the same records and delete all but one.
Compare matched data and take the most relevant one. Usually, this is the most recent one.
Use clustering to combine data into entities. For example, gather all the information you have on a customer Martin Solveig in one entity.

2.7 Data type issues

On your data cleaning journey you can encounter problems that are specific to the data type.

2.7.1 Cleaning string

Strings are usually the messiest part of data cleaning process because they are often human-generated and hence prone to errors.

The common cleaning techniques for strings involve:

Standardizing casing across the strings
Removing whitespace and newlines
Removing stop words (for some linguistic analyses)
Hot-encoding categorical variables represented as strings
Correcting typos
Standardizing encodings

2.7.2 Cleaning date and time

Dates and time can be tricky. Sometimes the error will not be apparent until you are running computations on the date and times.

The cleaning process involves:

Double-check all your dates and times are either a DateTime object or a Unix timestamp (via type coercion). Do not be tricked by strings pretending to be a DateTime object, like “24 Oct 2019”. Check for data type and coerce where necessary.
Make sure to have DateTime in the correct timezone. DateTime objects are often recorded with the time zone or without one. Either of those can cause problems. If you do not care about internationalization, convert all DateTime objects to your timezone.

2.8 Standardize and normalize data

Rescale your numeric data to a common range (usually between 0 and 1 for normalization).

This is important in cases where, for example, if your payment_value column had numbers in the 1000s and your quantity_purchased 10s, those two features can become comparable (0.8 and 0.3 for instance).

Standardizing and normalizing data has other advantages:

Data becomes comparable. Let us say your customer has purchased $3000 of goods in their lifetime (column payment_value = 3000), by making 6 orders (column order_quantity = 6). By standardizing and normalizing numerical values to a common scale, those can be compared.
Faster algorithmic convergence. Normalization and standardization allow for algorithms to converge faster, aka they spend less computing power and time to arrive at their final solution.
Many statistical methods may not work properly if the data is not distributed normally (for example in a bell shape). Standardizing your variables helps you reshape the data in such a form that the statistics still work.

2.9 Structural errors

A lot of data cleaning is about preventing data issues rather than resolving data issues.

This is especially true for structural errors that can occur during measurement, data transfer, or other situations and can lead to inconsistent data, data duplication, or contamination.

But unlike the treatment advised above, you are not going to solve structural errors by applying cleaning techniques to them. Because you can clean the data all you want, but at the next import, the structural errors will produce unreliable data again.

To solve them, you will have to check your ETL pipeline and how you collect and transform data from their raw data sources to identify where the source of structural errors is and remove it.

Download our free data-cleaning checklist to identify and resolve any quality issues with your data in just 11 steps.

Step 3: Rinse and repeat

Once cleaned, repeat steps 1 and 2.

This is helpful for three reasons:

You might have missed something. Repeating the cleaning process helps you catch those pesky hidden issues.
Discover new issues. For example, once you removed outliers from your dataset, you noticed that the data is not bell-shaped anymore and needs reshaping before you can analyze it.
Learn more about your data. Every time you sweep through your dataset and look at the distributions of values, you learn more about your data, which gives you hunches as to what to analyze.

Cleaning data is not the sole domain of data science.

High-quality data is necessary for any type of data analytics and decision-making. From startups launching the next Google search algorithm to business enterprises relying on Microsoft Excel for their business intelligence - clean data is the pillar upon which data-driven decision-making rests.

Automate data cleaning with Keboola

By now it is clear how important data cleaning is. But it still takes way too long and it is not the most intellectually stimulating challenge.

To avoid losing time, you can use Keboola to automate a lot of repetitive tasks.

Keboola is a data stack as a service that is packed with features that help you prepare data:

Build ETL data pipelines end-to-end and schedule them with Orchestrations.
Enable data scientists to set up their own data workflows using no-code and low-code features.
The in-built Data Catalog helps you document and share data across the board.

Keboola is designed to automate and streamline all of your data management tasks, so you can cut the time-consuming manual work and spend more time on strategic tasks that benefit the business.

Try Keboola for free today.

Frequently asked questions about data cleaning

What is data cleaning and why is it important?

Data cleansing, also called data cleaning or scrubbing, removes errors, duplicates, and irrelevant data from a raw dataset, creating reliable visualizations and models for business decisions.

What is an example of data cleaning?

Removing any duplicate product listings such as different colors or sizes into a single, organized listing. This helps prevent confusion for customers and ensures accurate inventory management.

What method is used for data validation?

You can clean data manually by following an 11-step process or use automated software tools such as Keboola.

What tools are used for data cleaning?

Users commonly refer to Keboola, OpenRefine, Trifacta, DataWrangler and Microsoft Power Query as some of the best data cleaning tools currently available in the market.

What is the difference between data cleaning and data transformation?

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data, while data transformation involves converting and reshaping the data to meet specific analysis requirements or business needs.

Download for Free

Oops! Something went wrong while submitting the form. Try it again please.

How to Clean Data: The Ultimate Guide (2023)

MCP Online Meetup + Feedback

Download our free data-cleaning checklist to identify and resolve any quality issues with your data in just 11 steps.

Data cleaning is a 3-step process

Step 1: Find out what is wrong with your data