Discover how data exploration is used and how to derive value from it.
Data exploration is the first step of data analysis. It lets you uncover what the dataset you are working with looks like:
Data scientists, data engineers, data analysts and business users regularly use data exploration as part of their pipeline to understand data, uncover hidden insights and prepare data for further analysis.
Let’s take a deeper dive into data exploration: how exactly is it used, how to perform data exploration properly to avoid any traps, and how to derive value from data exploration?
There are 4 ways in which data exploration provides value:
Exploratory Data Analysis (EDA) is a field of data analytics that analyzes data by describing the dataset’s main characteristics either through descriptive statistics (mean, standard deviation) or by visualizing the data points.
Data exploration can serve as a standalone approach to data analytics, allowing you to quickly uncover the tendencies and attributes of the datasets you are working with.
This is especially true of large datasets, where it might be hard to “eyeball” the tendencies and structures in the data points.
Let’s look at an example.
Looking at raw data alone, it might be hard to understand if there is a pattern of increasing, decreasing, or stagnating customer acquisitions:
But if you visualize the data points on a scatter plot, you can easily see that customer acquisition is showing an increasing trend through time (yaaay!).
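The same trend can also be checked numerically. The sketch below fits a least-squares line to a made-up series of daily new-customer counts and reads the direction of the trend off the slope. The numbers and the `trend_slope` helper are hypothetical and only illustrate the idea; they are not the data behind the example above.

```python
from statistics import mean

# Hypothetical daily new-customer counts (illustrative only).
daily_new_customers = [3, 5, 4, 6, 5, 7, 6, 8, 9, 8, 10, 11]
days = list(range(1, len(daily_new_customers) + 1))

def trend_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

slope = trend_slope(days, daily_new_customers)
direction = "increasing" if slope > 0 else "decreasing" if slope < 0 else "flat"
print(f"Slope: {slope:.2f} new customers per day ({direction})")
```

A positive slope corresponds to the upward drift you would see on the scatter plot. It does not replace the plot, which also shows how noisy the points are around the line.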
Data exploration is an integral part of the entire ETL data pipeline. Especially with new data sources, before you transform the raw data into cleaned and sanitized data, you first explore it to determine how to clean it:
Check out Keboola’s data cleaning checklist to make sure you don’t skip any data preparation steps.
Machine learning algorithms rarely work out-of-the-box. A lot of data preparation is needed before data science can bear fruit.
For example, depending on your machine learning model, you might need to sanitize your data in different ways:
Data exploration can also be used for more advanced big data techniques, such as data mining.
Data mining is an advanced form of data exploration. Through the use of automation and search algorithms, data mining discovers structures and relationships within data automatically.
Data exploration is not just used for building predictive models with machine learning. It is also useful for data management.
Consistently exploring metadata can help you quickly identify where your data quality is suffering.
By building quality assurance metrics such as “the share of missing values needs to be under 5% of all data” (as an example) you can track the health of your data ecosystem through exploratory analysis.
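As a rough illustration, a metric like this can be computed in a few lines. The sketch below assumes records arrive as dictionaries and that a missing value is represented by `None`; the sample records, field names, and the `missing_share` helper are all hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"customer_id": 1, "email": "a@example.com", "country": "CZ"},
    {"customer_id": 2, "email": None, "country": "US"},
    {"customer_id": 3, "email": "c@example.com", "country": None},
    {"customer_id": 4, "email": "d@example.com", "country": "UK"},
]

MAX_MISSING_SHARE = 0.05  # the 5% threshold from the example metric

def missing_share(rows):
    """Fraction of all cells that are missing (None)."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)

share = missing_share(records)
status = "OK" if share < MAX_MISSING_SHARE else "ALERT"
print(f"Missing values: {share:.1%} of all cells -> {status}")
```

Running a check like this on every load, and recording the result over time, is one way to turn exploratory analysis into an ongoing data quality signal.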
There are three classes of data exploration techniques, each with its own advantages and disadvantages.
Descriptive analytics offers a quick glance at data through two types of metrics:
Measures of centrality include the mean (“average”), mode (most commonly occurring data point), and median (the point in the middle of all data points - 50% of data points are above it, 50% are below it).
When your boss asks you “What is our average customer order size?”, it is unclear how to answer.
Usually, data scientists and data analysts default to the mean, saying “On average, our customers order $24 of products.”
But the number could be deceiving. If you have a couple of high-ticket items ($1000) and a lot of more reasonably priced products (<$10), the mean could be skewed by the high-priced outliers. In these cases it is better to answer the question by saying “Over half of our customers order products at the <$10 price point” (median) or “The most commonly bought articles cost $9.90” (mode).
To help disambiguate such circumstances, data analysts and data scientists often report measures of dispersion alongside measures of centrality, such as the standard deviation, variance, or the range (from min to max).
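To make the difference concrete, here is a small sketch using Python's built-in `statistics` module. The order values are invented for illustration: mostly cheap orders plus two high-ticket ones.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical order values in dollars: many cheap orders, two high-ticket ones.
orders = [9.9, 9.9, 9.9, 8.5, 7.0, 9.9, 6.5, 9.0, 1000.0, 1000.0]

print(f"mean:   {mean(orders):.2f}")    # pulled upward by the two $1000 orders
print(f"median: {median(orders):.2f}")  # the middle of the sorted orders
print(f"mode:   {mode(orders):.2f}")    # the most common order value
print(f"std:    {pstdev(orders):.2f}")  # population standard deviation
print(f"range:  {min(orders):.2f} to {max(orders):.2f}")
```

With data like this, the mean lands above $200 even though eight of the ten orders are under $10. The large standard deviation and range are the hint that the mean alone should not be trusted.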
The main advantage of descriptive analytics is its speed of insight. With just a few numbers (mean, standard deviation), you understand what the data is telling you.
The disadvantage of descriptive analytics is that the insights might be misleading.
For example, distributions with the same measures of centrality and dispersion could tell completely different stories.
Imagine the following visualization, where “value” refers to the amount a customer spent on their order and “density” tells you what percentage of customers fall in a value bracket. Each color is a different customer profile (but it turns out all three stories have the same mean and standard deviation):
The red story tells you there are differences between customers, but the majority spend around $100 on their order. The green story tells you that the customers are equally likely to spend anywhere between $70 and $130 on their purchase. And the blue story tells you that you have two customer profiles: one more price-averse and one more willing to spend.
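The effect is easy to reproduce. The sketch below simulates three hypothetical customer populations (bell-shaped, flat, and two-group), rescales each to exactly the same mean and standard deviation, and then compares how many customers actually spend close to the mean. The sample size, the target values, and the $90 to $110 bracket are arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the simulation is repeatable
N = 5_000

def rescale(values, target_mean=100.0, target_std=17.0):
    """Shift and scale values so they have the target mean and std."""
    m, s = mean(values), pstdev(values)
    return [target_mean + (v - m) / s * target_std for v in values]

# Three differently shaped populations, forced to share mean and std.
bell = rescale([random.gauss(0, 1) for _ in range(N)])
flat = rescale([random.uniform(-1, 1) for _ in range(N)])
two_groups = rescale(
    [random.gauss(random.choice([-1, 1]), 0.3) for _ in range(N)]
)

def share_near_mean(values, low=90, high=110):
    """Fraction of values falling inside the [low, high] bracket."""
    return sum(low <= v <= high for v in values) / len(values)

for name, values in [("bell", bell), ("flat", flat), ("two groups", two_groups)]:
    print(
        f"{name:10s} mean={mean(values):.1f} std={pstdev(values):.1f} "
        f"share within $90-$110={share_near_mean(values):.0%}"
    )
```

All three rows report the same mean and standard deviation, yet the share of customers near the mean differs widely. In the two-group case, comparatively few customers spend anywhere near the "average" order.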
Visualizations are usually necessary to see the “big picture” behind descriptive analytics.
Visualizations help us gauge the relationships between data.
Scatter plots can help us see if there are trends in data:
Histograms plot the number or frequency of values occurring within each range:
The histogram above visualizes the customer acquisition data in buckets. For example, the bucket “2.00-4.00” tells you on how many days you saw 2-4 new customers. From the visualization, you can easily see that on the majority of days you saw 4-6 new customers, and rarely over 10 new customers.
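The bucketing behind a histogram can be sketched without any plotting library. The daily counts below are hypothetical, and the sketch assumes half-open buckets (a day with exactly 4 new customers falls into “4.00-6.00”, not “2.00-4.00”), which may differ from how a particular charting tool draws its bins.

```python
from collections import Counter

# Hypothetical new-customer counts per day (illustrative only).
daily_new_customers = [3, 5, 4, 6, 5, 7, 4, 5, 2, 9, 5, 4, 11, 6, 5, 3]
BUCKET_WIDTH = 2

def bucket_label(value, width=BUCKET_WIDTH):
    """Label of the half-open bucket [low, low + width) that contains value."""
    low = (value // width) * width
    return f"{low:.2f}-{low + width:.2f}"

# Count how many days fall into each bucket.
histogram = Counter(bucket_label(v) for v in daily_new_customers)

# Print a text histogram, buckets ordered by their lower edge.
for label in sorted(histogram, key=lambda s: float(s.split("-")[0])):
    print(f"{label:>11} | {'#' * histogram[label]}")
```

In this made-up series the “4.00-6.00” bucket is by far the tallest, which is the kind of pattern a histogram makes visible at a glance.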
Bar charts are useful for categorical data. They help us understand which category occurs most often in our data:
The bar chart above helps us visualize the computers you have in-house. As you can see, you are well stocked on Lenovo, but have run out of Windows laptops (need to reorder soon!). Also, Chromebooks might be out of stock soon, so you might need to reorder those as well.
Visualizations are great because they allow us to quickly spot trends and relationships between data. However, when working with machine learning models, they might be insufficient. You might need to use more rigorous, statistical approaches.
Statistical approaches allow you to quantify what you see in visualizations and data. For example, you can determine whether a data point that looks like an outlier (“more than 10 new customers in a day”) is actually an outlier and will affect your further analysis.
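One common way to make such a call is Tukey's interquartile range (IQR) rule, which flags values that lie more than 1.5 IQRs outside the middle 50% of the data. The sketch below applies it to a hypothetical series of daily new-customer counts. The article does not say which test to use, so treat the rule, the threshold `k=1.5`, and the data as assumptions.

```python
from statistics import quantiles

# Hypothetical new-customer counts per day (illustrative only).
daily_new_customers = [3, 5, 4, 6, 5, 7, 4, 5, 2, 9, 5, 4, 11, 6, 5, 3]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(daily_new_customers)
print("Flagged as outliers:", outliers)
```

Whether a flagged point should be removed, capped, or kept depends on the analysis that follows. The rule only tells you which points deserve a closer look.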
Statisticians and data scientists use multiple approaches to sanitize their data during data exploration:
With so many different data exploration techniques, data practitioners turn to specialized tools to help them speed up data analytics.
Broadly there are two sets of data exploration tools:
There is a third option between the world of custom scripting and automated visualizations.
Keboola and ThoughtSpot can help you unlock the best of both worlds.
Keboola centralizes and automates your entire stack of data tools within a single solution, consolidating your data ingestion, exploration, cleaning, storage, outputting, analysis, and data science apps into a single platform.
Keboola gets your data house in order. It brings all your disparate tools under a single, centralized roof.
With Keboola you can automate the entire data pipeline:
Once you automate your data, you send it to ThoughtSpot. ThoughtSpot helps you expose your data to anyone in the company so they can perform their own data explorations.
ThoughtSpot turns your static data into Liveboards that allow your users to easily build custom visualizations and drill down or up in the data model at scale, and it offers AI-powered features that turn natural language queries into intelligent database queries.
Here's a tutorial for how to set up Keboola and ThoughtSpot:
Let’s jump on a quick call to discuss how to make your data exploration easier.