How to clean your data to make it ready for analysis and machine learning
While digging through data, Anna spots an interesting trend - some customers buy 3 times more than others. A segment of super-high spenders? This could make it rain for the company! She rushes to her boss, to show them the data, only to hear: “Yeah, I know. We have a bug, which inserts every order three times in the database. Were you not paying attention during our daily meeting?” Aw-kward.
The mistake behind jumping to conclusions (or even slow-walking there) is analyzing data before cleaning it. The solution is obvious. Clean your data before analyzing it, or worse, showing it to others.
Knowing how to clean data is advantageous for three reasons:
In this guide, we will take you through the process of getting your hands dirty with cleaning data. Get ready, because we will dive into the practical aspects and little details that make the big picture shine brighter.
Start data cleaning by determining what is wrong with your data.
Look for the following:
Wear your detective hat and jot down everything interesting, surprising or even weird.
Knowing the problem is half the battle. The other half is solving it.
How do you solve it, though? One ring might rule them all, but one approach is not going to cut it with all your data cleaning problems. Depending on the type of data dirt you’re facing, you’ll need different cleaning techniques.
Sometimes you will have rows with missing values. Sometimes, almost entire columns will be empty. What to do with missing data? Ignoring it is like ignoring the holes in your boat while at sea - you’ll sink.
Start by spotting all the different disguises missing data wears. It appears in values such as 0, “0”, empty strings, “Not Applicable”, “NA”, “#NA”, None, NaN, NULL or Inf. Programmers before you might have put default values instead of missing data (“firstname.lastname@example.org”). When you have a general idea of what your missing data looks like, it is time to answer the crucial question:
“Is missing data telling me something valuable?”
Data might be missing as an alternative to saving other valuable information. For example, an empty “payment_date” row can tell you the customer has not paid yet. If you are trying to separate your customers into late payers vs on-time payers, this is valuable. If not, you should omit it from your analysis.
There are 3 main approaches to cleaning missing data:
Outliers are data points which are at an extreme. They usually have very high or very low values:
How to interpret those? Outliers usually signify either very interesting behavior or a broken collection process. Both are valuable information (hey, check your sensors, before checking your outliers), but proceed with cleaning only if the behavior is actually interesting.
There are three approaches to dealing with outliers:
Contaminated data is another red flag for your collection process. Examples of contaminated data include:
The last one is particularly sneaky. Imagine having a row of financial trading information for each day. Columns (or features) would include the date, asset type, asking price, selling price, the difference in asking price from yesterday, the average asking price for this quarter. The average asking price for this quarter is the source of contamination. You can only compute the averages once the quarter is over, but that information would not be given to you on the trading date - thus introducing future data, which contaminates the present data.
With corrupted data, there is not much you can do except for removing it. This requires a lot of domain expertise. When lacking domain knowledge, consult non-analytical members of your team. Make sure to also fix any leakages your data collection pipeline has so that the data corruption does not repeat with future data collection.
“Wait, did we sell ‘Apples’, ‘apples’, or ‘APPLES’ this month? And what is this ‘monitor stand’ for $999 under the same product ID?”
You have to expect inconsistency in your data. Especially when there is a higher possibility of human error (e.g. when salespeople enter the product info on proforma invoices manually).
The best way to spot inconsistent representations of the same elements in your database is to visualize them. Plot bar charts per product category. Do a count of rows by category if this is easier.
When you spot the inconsistency, standardize all elements into the same format. Humans might understand that ‘apples’ is the same as ‘Apples’ (capitalization) which is the same as ‘appels’ (misspelling), but computers think those three refer to three different things altogether. Lowercasing as default and correcting typos are your friends here.
Similarly to corrupted data, invalid data is illogical. For example, users who spend -2 hours on our app, or a person whose age is 170. Unlike corrupted data, invalid data does not result from faulty collection processes, but from issues with data processing (usually during feature preparation or data cleaning).
Let us walk through an example. You are preparing a report for your CEO about the average time spent in your recently launched mobile app. Everything works fine, the activities time looks great, except for a couple of rogue examples. You notice some users spent -22 hours in the app. Digging deeper, you go to the source of this anomaly. In-app time is calculated as finish_hour - start_hour. In other words, someone who started using the app at 23:00 and finished at 01:00 in the morning would have for their time_in_app -22 hours (1 - 23 = - 22). Upon realizing that, you can correct the computations to prevent such illogical data.
Cleaning invalid data mostly means amending the functions and transformations which caused the data to be invalid. If this is not possible, we remove the invalid data.
Duplicate data means the same values repeating for an observation point. This is damaging to our analysis because it can either deflate/inflate our numbers (e.g. we count more customers than there actually are, or the average changes because some values are more often represented).
There are different sources of duplicate data:
There are three ways to eliminate duplicates:
Depending on which data type you work with (DateTime objects, strings, integers, decimals or floats), you can encounter problems specific to data types.
Strings are usually the messiest part of data cleaning because they are often human-generated and hence prone to errors.
The common cleaning techniques for strings involve:
Especially the last one can cause a lot of problems. Encodings are the way of translating between the 0’s and 1’s of computers and the human-readable representation of text. And as there are different languages, there are different encodings.
Everyone has seen strings of the type �����. Which meant our browser or computer could not decode the string. It is the same as trying to play a cassette on your gramophone. Both are made for music, but they represent it in different ways. When in doubt, go for UTF-8 as your encoding standard.
Dates and time can be tricky. Sometimes the error is not apparent until doing computations (like the activity duration example above) on date and times. The cleaning process involves:
Standardization and normalization are used to rescale your numeric data to a common range (usually between 0 and 1 for normalization). So if your payment_value column had numbers in the 1000s and your quantity_purchased 10s, those two features can become comparable (0.8 and 0.3 for instance).
Why does this matter?
Standardizing and normalizing data has several advantages:
Even though we treated data issues comprehensively, there is a class of problems with data, which arise due to structural errors. Structural errors arise during measurement, data transfer, or other situations.
Structural errors can lead to inconsistent data, data duplication, or contamination. But unlike the treatment advised above, you are not going to solve structural errors by applying cleaning techniques to them. Because you can clean the data all you want, but at the next import, the structural errors will produce unreliable data again.
Structural errors are given special treatment to emphasize that a lot of data cleaning is about preventing data issues rather than resolving data issues.
So you need to review your engineering best practices. Check your ETL pipeline and how you collect and transform data from their raw data sources to identify where the source of structural errors is and remove it.
Once cleaned, you repeat steps 1 and 2. This is helpful for three reasons:
Data scientists spend 80% of their time cleaning and organizing data because of the associated benefits. Or as the old machine learning wisdom goes:
Garbage in, garbage out.
All algorithms can do is spot patterns. And if they need to spot patterns in a mess, they are going to return “mess” as the governing pattern. Aka clean data beats fancy algorithms any day.
But cleaning data is not in the sole domain of data science.
High-quality data are necessary for any type of decision-making. From startups launching the next Google search algorithm to business enterprises relying on Microsoft Excel for their business intelligence - clean data is the pillar upon which data-driven decision-making rests.
By now it is clear how important data cleaning is. But it still takes way too long. And it is not the most intellectually stimulating challenge. To avoid losing time, while not neglecting the data cleaning process, data practitioners automate a lot of repetitive cleaning tasks.
Mainly there are two branches of data cleaning that you can automate:
Whether automation is your cup of tea or not, remember the main steps when cleaning data:
Keep a list of those steps by your side and make sure your data gives you the valuable insights you need.