Data Exploration: Theory & Techniques


December 27, 2022


Discover how data exploration is used and how to derive value from it.


Data exploration is the first step of data analysis. It lets you uncover what the dataset you are working with looks like:

How big is the dataset (number of rows, columns/features, shape of the data)?
What are the variables or features of the dataset?
How are data points distributed? Are there weird outliers?
What are the relationships between the data points?
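A first pass at the questions above can be sketched in a few lines of Python. This is only an illustration: the `rows` sample and the `first_look` helper are hypothetical names, and the data is assumed to be a list of dictionaries where `None` marks a missing value.

```python
# Hypothetical sample: each dict is one row; None marks a missing value.
rows = [
    {"day": 1, "new_customers": 3, "channel": "ads"},
    {"day": 2, "new_customers": 5, "channel": "organic"},
    {"day": 3, "new_customers": None, "channel": "ads"},
    {"day": 4, "new_customers": 6, "channel": "organic"},
]

def first_look(rows):
    """Return basic facts about a list-of-dicts dataset."""
    columns = list(rows[0].keys()) if rows else []
    summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary["columns"][col] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary

summary = first_look(rows)
print(summary["n_rows"], "rows x", summary["n_columns"], "columns")
for name, info in summary["columns"].items():
    print(name, info)
```

The per-column counts of missing and distinct values are usually enough to decide which columns deserve a closer look.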


Data scientists, data engineers, data analysts and business users regularly use data exploration as part of their pipeline to understand data, uncover hidden insights and prepare data for further analysis.

Let’s take a deeper dive into data exploration: how it is used, how to perform it properly while avoiding common traps, and how to derive value from it.


Download our free data-cleaning checklist to identify and resolve any quality issues with your data in just 11 steps.


1. What are data exploration’s use cases?

There are four ways in which data exploration provides value:

1.1 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a field of data analytics that analyzes data by describing the dataset’s main characteristics either through descriptive statistics (mean, standard deviation) or by visualizing the data points.

Data exploration can serve as a standalone approach to data analytics, letting you quickly uncover the tendencies and attributes of the datasets you are working with.

This is especially true of large datasets, where it might be hard to “eyeball” the tendencies and structures in the data points.

Let’s look at an example.

Looking at raw data alone, it can be hard to tell whether customer acquisitions are increasing, decreasing, or stagnating.

But if you visualize the data points on a scatter plot, you can easily see that customer acquisition is showing an increasing trend over time (yaaay!).
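To back up what a scatter plot suggests with a number, you can fit a straight line through the points and look at its slope. The sketch below uses made-up daily counts and a hypothetical `trend_slope` helper. It illustrates the idea and is not the method behind the original chart.

```python
# Hypothetical daily counts of new customers.
days = [1, 2, 3, 4, 5, 6, 7, 8]
new_customers = [2, 3, 3, 5, 4, 6, 7, 8]

def trend_slope(xs, ys):
    """Least-squares slope of ys against xs (change in y per unit of x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance_x

slope = trend_slope(days, new_customers)
direction = "increasing" if slope > 0 else "decreasing" if slope < 0 else "flat"
print(f"slope = {slope:.2f} customers/day ({direction})")
```

A positive slope agrees with an upward-sloping cloud of points. A slope near zero would suggest stagnation.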

1.2 Data cleaning

Data exploration is an integral part of the entire ETL data pipeline. Especially with new data sources, you might explore the raw data before transforming it into cleaned and sanitized data, to determine how to clean it:

Remove outliers
Understand where data is missing values and what to do with that data (discard the rows with missing values, fill in the gaps with the best guesstimate, …)
Run data quality checks: is there inconsistent data? For example, when plotting frequency distributions of categorical variables, do you see “Apple” also appearing as “Apple Inc.”? Those two should be standardized to a common company name.
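The consistency check from the list above can be sketched as a frequency count before and after standardization. The `companies` sample, the `CANONICAL` mapping, and the `standardize` helper are all hypothetical. In practice the mapping would come from inspecting your own data.

```python
from collections import Counter

# Hypothetical raw column; None marks a missing value.
companies = ["Apple", "Apple Inc.", "Microsoft", None, "apple", "Microsoft Corp.", "Apple"]

# Hypothetical mapping from observed variants (lowercased) to one canonical name.
CANONICAL = {
    "apple": "Apple",
    "apple inc.": "Apple",
    "microsoft": "Microsoft",
    "microsoft corp.": "Microsoft",
}

def standardize(name):
    """Map a raw label to its canonical form; leave unknown labels unchanged."""
    if name is None:
        return None
    key = name.strip().lower()
    return CANONICAL.get(key, name.strip())

raw_counts = Counter(companies)
clean_counts = Counter(standardize(c) for c in companies)

print(raw_counts)    # variants show up as separate categories
print(clean_counts)  # variants are merged under one name
```

Comparing the two counters shows both the spelling variants and the number of missing entries that still need a decision.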

Check out Keboola’s data cleaning checklist to make sure you don’t skip any data preparation steps.

1.3 Machine learning and Data mining

Machine learning algorithms rarely work out of the box. A lot of data preparation is needed before data science can bear fruit.

For example, depending on your machine learning model, you might need to sanitize your data in different ways:

Linear regression will perform poorly in the presence of outliers. Visualizing your data with box plots can help you determine where those outliers are.
Some graph algorithms will demand acyclic connections between your data, which you can easily check by visualizing your graph data as a network.
Other constraints are specific to individual machine learning models.
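The box-plot check mentioned above has a simple numeric counterpart: flag anything beyond 1.5 times the interquartile range from the quartiles, which is the usual whisker rule. The sketch below assumes that rule and uses made-up order values. `iqr_outliers` is a hypothetical helper name.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical order values in dollars, with one very large order.
orders = [8, 9, 9, 10, 10, 11, 12, 12, 13, 1000]
print(iqr_outliers(orders))
```

Quartile definitions differ slightly between tools, so borderline points may be flagged by one library and not by another.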

Data exploration can also be used for more advanced big data techniques, such as data mining.

Data mining is an advanced form of data exploration. Using automation and search algorithms, it discovers structures and relationships within data without manual inspection.

1.4 Data management

Data exploration is not just used for building predictive models with machine learning. It is also useful for data management.

Consistently exploring metadata can help you quickly identify where your data quality is suffering.

By building quality assurance metrics such as “the share of missing values needs to be under 5% of all data” (as an example), you can track the health of your data ecosystem through exploratory analysis.
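A metric like the one above can be sketched as a small check that runs on every load. The 5% threshold comes from the example in the text. The `missing_share` helper and the sample `table` are hypothetical.

```python
def missing_share(rows):
    """Share of cells that are missing (None) across all rows and columns."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row.values() if value is None)
    return missing / total if total else 0.0

# Threshold taken from the 5% example; adjust to your own quality targets.
MAX_MISSING_SHARE = 0.05

# Hypothetical table: 4 rows x 5 columns = 20 cells, 2 of them missing.
table = [
    {"id": 1, "name": "a", "price": 9.9, "qty": 1, "city": "Oslo"},
    {"id": 2, "name": "b", "price": None, "qty": 2, "city": "Rome"},
    {"id": 3, "name": "c", "price": 5.0, "qty": 1, "city": None},
    {"id": 4, "name": "d", "price": 7.5, "qty": 3, "city": "Lima"},
]

share = missing_share(table)
healthy = share < MAX_MISSING_SHARE
print(f"missing share = {share:.1%}, healthy = {healthy}")
```

Tracking this number over time, rather than checking it once, is what turns it into a health indicator.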


2. The 3 best data exploration techniques

There are three classes of data exploration techniques, each with its own advantages and disadvantages.

2.1 Descriptive analytics

Descriptive analytics offers a quick glance at data through two types of metrics:

Measures of centrality
Measures of dispersion

Measures of centrality include the mean (“average”), mode (most commonly occurring data point), and median (the point in the middle of all data points - 50% of data points are above it, 50% are below it).

When your boss asks you “What is our average customer order size?”, it is not obvious which measure to answer with.

Usually, data scientists and data analysts default to the mean, by saying “On average, our customers order $24 of products.”

But the number could be deceiving. If you have a couple of high-ticket items ($1000) and a lot of more reasonably priced products (<$10), the mean will be pulled up by the high-priced outliers. In these cases, it is better to answer the question by saying “Over half of our customers order products at the <$10 price point” (median) or “The most commonly bought articles cost $9.90” (mode).

To help disambiguate such circumstances, data analysts and data scientists often report measures of dispersion alongside measures of centrality, such as the standard deviation, variance, or the range (from min to max).
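The effect of a few expensive orders on the mean is easy to reproduce. The order values below are invented for illustration. The calculations use Python's standard `statistics` module.

```python
import statistics

# Hypothetical order values in dollars: many cheap orders, two expensive ones.
orders = [9.9, 9.9, 9.9, 8.5, 7.0, 9.0, 6.5, 9.9, 1000.0, 1000.0]

mean = statistics.mean(orders)
median = statistics.median(orders)
mode = statistics.mode(orders)
stdev = statistics.stdev(orders)  # sample standard deviation
value_range = max(orders) - min(orders)

print(f"mean={mean:.2f} median={median:.2f} mode={mode}")
print(f"stdev={stdev:.2f} range={value_range:.2f}")
```

Here the mean lands far above what a typical customer spends, while the median and mode stay close to the cheap items. The large standard deviation is the hint that the mean should not be reported on its own.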

The main advantage of descriptive analytics is its speed of insight. With just a few numbers (mean, standard deviation), you understand what the data is telling you.

The disadvantage of descriptive analytics is that the insights might be misleading.

For example, distributions with the same measures of centrality and dispersion could tell completely different stories.

Imagine the following visualization, where “value” refers to the amount a customer spent on their order and “density” tells you what percentage of customers fall in a value bracket. Each color is a different customer profile (but it turns out all three stories have the same mean and standard deviation):

The red story tells you there are differences between customers, but the majority spend around $100 on their order. The green story tells you that the customers are equally likely to spend anywhere between $70 and $130 on their purchase. And the blue story tells you that you have two customer profiles: one more price-averse and one more willing to spend.
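The same effect can be reproduced with numbers. The two samples below are constructed (not taken from the chart) so that they share a mean of 100 and a population standard deviation of 10, yet look very different. `profile` is a hypothetical helper.

```python
import math
import statistics

# Hypothetical order values, constructed so both share mean 100 and std 10.
two_groups = [90, 110] * 4                              # half spend less, half spend more
offset = 10 * math.sqrt(2)
one_peak = [100 - offset, 100, 100, 100 + offset] * 2   # half spend exactly 100

def profile(values):
    """Return (mean, population std, number of orders within $5 of the mean)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    near_mean = sum(1 for v in values if abs(v - mean) < 5)
    return mean, std, near_mean

for name, values in [("two_groups", two_groups), ("one_peak", one_peak)]:
    mean, std, near_mean = profile(values)
    print(f"{name}: mean={mean:.1f} std={std:.1f} orders near mean={near_mean}")
```

Both samples report the same two summary numbers, but in one of them nobody spends close to the mean at all.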

Visualizations are usually necessary to see the “big picture” behind descriptive analytics.

2.2 Visualizations

Visualizations help us gauge the relationships between data.

Scatter plots can help us see if there are trends in data:

Histograms plot how often values occur within given ranges:

The histogram above visualizes the customer acquisition data in buckets. For example, the bucket “2.00-4.00” tells us how many days you saw 2-4 new customers. From the visualization, you can easily see that the majority of days you saw 4-6 new customers, and rarely over 10 new customers.
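The bucketing behind such a histogram can be sketched without any plotting library. The daily counts are invented and `histogram` is a hypothetical helper. Each bucket covers `[start, start + width)`.

```python
from collections import Counter

# Hypothetical number of new customers per day.
daily_new_customers = [1, 3, 4, 5, 5, 4, 6, 7, 5, 4, 2, 9, 11, 5, 4]

def histogram(values, width=2):
    """Count values per bucket [start, start + width), keyed by bucket start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

width = 2
buckets = histogram(daily_new_customers, width=width)
for start, count in buckets.items():
    # One '#' per day that fell into this bucket.
    print(f"{start:>2}-{start + width:<2} {'#' * count}")
```

With this sample, the 4-6 bucket is by far the tallest, and days with 10 or more new customers are rare.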

Bar charts are useful for categorical data. They help us understand which category occurs most often in our data:

The bar chart above helps us visualize the computers you have in-house. As you can see, you are well stocked on Lenovo, but have run out of Windows laptops (need to reorder soon!). Chromebooks might also be out of stock soon, so you might need to reorder those as well.

Visualizations are great because they allow us to quickly spot trends and relationships between data. However, when working with machine learning models, they might be insufficient. You might need to use more rigorous, statistical approaches.

2.3 Statistical approaches

Statistical approaches allow you to quantify what you see in visualizations and data. For example, you can determine whether a data point that looks like an outlier (“more than 10 new customers in a day”) is actually an outlier and will affect your further analysis.

Statisticians and data scientists use multiple approaches to sanitize their data during data exploration:

Detection and handling of missing values
Outlier identification
Normality checks (standard deviations, kurtosis, skew, etc.)
Correlations between variables
And many more.
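Two of the checks listed above, correlation and skew, can be sketched with plain Python. The data is invented, and `pearson` and `skewness` are hypothetical helper names that implement the textbook population formulas.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data: ad spend per day and new customers per day.
ad_spend = [10, 20, 30, 40, 50]
new_customers = [2, 4, 5, 4, 10]

print(f"correlation = {pearson(ad_spend, new_customers):.2f}")
print(f"skewness    = {skewness(new_customers):.2f}")
```

A correlation close to 1 or -1 indicates a strong linear relationship. A skewness far from 0 suggests the data is not symmetric, which matters for models that assume normality.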

With so many different data exploration techniques, data practitioners turn to specialized tools to help them speed up data analytics.

3. What are the best tools for data exploration?

Broadly, there are two sets of data exploration tools:

Custom scripting. Python, R, and SQL are the most common languages used in data exploration. They offer technical individuals specialized statistical and visualization libraries, which can be fully customized for any task. Freedom and power come at a cost, though. Programming languages are not novice-friendly. They require you to learn the language, which can be a steep learning curve for enterprises that would simply like to democratize their data.
Dedicated third-party data visualization tools. Multiple software vendors offer data visualization or business intelligence software. From Microsoft Excel (or Google Sheets) to Power BI and Tableau, the range of offerings is wide and encompasses anything from free tools to high-ticket items capable of automatically running machine learning for you. They are usually more user-friendly for novices, but often lack the statistical power of their scripting counterparts.

There is a third option between the world of custom scripting and automated visualizations.

Turn raw data into visualizations in a couple of clicks

Keboola and ThoughtSpot can help you unlock the best of both worlds.

Keboola centralizes and automates your entire stack of data tools within a single solution, consolidating your data ingestion, exploration, cleaning, storage, outputting, analysis, and data science apps into a single platform.

Keboola gets your data house in order. It brings all your disparate tools under a single, centralized roof.

With Keboola you can automate the entire data pipeline:

Configure your data extractors to collect raw data from their data sources.
Transform and clean your extracted data.
Write the cleaned data to a database, data warehouse, or data lake of your choice.
Use automation to set-it-and-forget-it.

Once you automate your data, you send it to ThoughtSpot. ThoughtSpot helps you expose your data to anyone in the company so they can perform their own data explorations.

ThoughtSpot turns your static data into Liveboards that allow your users to easily build custom visualizations and drill down or up in the data model at scale. It also offers AI-powered features that turn natural language queries into intelligent database queries.

Here's a tutorial for how to set up Keboola and ThoughtSpot:

Let’s jump on a quick call to discuss how to make your data exploration easier.
