A comprehensive overview for beginners and advanced data scientists.
Cluster analysis is a process used in artificial intelligence and data mining to discover the hidden structure in your data. There is no single cluster analysis algorithm. Instead, data practitioners choose the algorithm which best fits their needs for structure discovery.
Here, we present a comprehensive overview of cluster analysis, which can be used as a guide for both beginners and advanced data scientists.
Cluster analysis is a popular machine learning approach used in data mining and exploratory data analysis. Its purpose is to discover groups in seemingly unstructured data.
Let’s begin with a real business situation to illustrate what cluster analysis actually does.
Your company sells a variety of products online, from high-end TVs to sports equipment and kitchen utensils. It’s a one-stop e-commerce shop. The Head of Marketing comes to you and says that they would like to tailor their marketing messages. Your budget is not big enough to go fully personalized by creating a message that is specific to each customer, but you do have the resources and manpower to create 3 - 5 different campaigns.
How do you solve this problem?
This is where cluster analysis can help you. Cluster analysis is an approach in machine learning which uses different clustering algorithms to discover clusters (or groups) within seemingly disorganized data.
For the situation above, you would start by looking at purchase records:
Using clustering algorithms, you could determine that customers fall into three different groups based on what they purchase:
Clustering analysis helps you to identify three groups of customers based on their purchasing patterns. Now you can go to the Head of Marketing and tell them, so that they can create three different campaigns (one for each cluster):
Clustering algorithms will not help you to name the clusters that are identified - that’s up to you. But they will help you find groups based on members who show similarities and differences.
Clustering is used whenever we need to segment our customers, users, partners, or products, in order to better understand their inner structure:
Use clustering as a stepping stone to other machine learning. For example, cluster your prospects by how likely they are to convert from leads to customers, then use those clusters with logistic regression to develop a lead qualification prediction algorithm. Or, cluster your products using past purchase records: clustering algorithms can identify which products are usually bought together. You can then use this information for a product recommendation engine.
In production, clustering algorithms are rarely used as a self-standing machine learning product.
Instead, clustering algorithms have two advantages, which often lead to them being used as an in-between step towards a final goal:
Download the entire modeling process with this Jupyter Notebook.
Clustering analysis is an umbrella term that encompasses a myriad of clustering algorithms, all of which solve unsupervised classification tasks. That is, they try to discern the underlying structure in the dataset without any guidance or labels (unsupervised machine learning) with the end goal of assigning each example to a discrete category or class (classification).
There are multiple clustering algorithms based on the statistic method used to classify examples:
Typical cluster models all employ a common method: they find an arrangement of homogenous groups, whereby the elements within each specific group are as similar to each other, and as dissimilar to members of the other groups, as possible.
The specific measure of similarity and dissimilarity depends on the choice of the clustering algorithm. Here, we will take a deeper look at one of the most popular clustering algorithms: K-means.
K-means is an elegant algorithm used for partitioning the dataset into K distinct and non-overlapping clusters.
The ‘k’ in K-means represents the top-down defined number of clusters that we expect the algorithm to find in the segmentation analysis.
The pseudocode for K-means is as follows:
1. Decide on the number of clusters, k, that the model needs to find.
2. Randomly assign each point to a cluster from 1 to k.
3. Iterate until the data points no longer change clusters:
a) For each cluster (1 to k), compute its centroid.
The centroid is the central point between all points of the same cluster.
b) Reassign each data point to the cluster that is closest to it.
i) Closeness is defined using Euclidean distance.
In layman’s terms, K-means assigns each point to a cluster at random. It then computes the distances between the cluster’s centroids (centers) and other data points. If a point is closer to another cluster’s centroid than the one it belongs to, it gets reassigned.
We can visualize the K-means iterative procedure for points on a 2D plane:
Unlike supervised learning, where we can check how well our machine learning model performed in comparison to some external truth (e.g. class labels), there is no ground truth for unsupervised learning models.
There are, however, two main methods for evaluating the success of our K-means solution:
The Elbow method is used to determine the optimal number of clusters. Sometimes, the ideal number of groups is imposed by our business constraints (e.g. we need to find exactly three customer personas because that is the maximum that we can handle in our campaigns).
But if we want to discover the optimal number of groups, the Elbow method can assist us.
How does it work?
We generated the image below with Python (check the entire Jupyter Notebook here). As you can see, the biggest decrease between SSE happened between clusters 1 and 2, and then it continued to decrease monotonically. So, the ideal number of clusters would be 2 (i.e. where the elbow is).
The Silhouette method assesses how well each point lies within each cluster. In other words, it determines whether the similarity of within-cluster points is stronger than the similarity of out-cluster points.
How does it work?
Finally, the silhouette s for each data point i is computed as:
How do we interpret this?
Silhouette analysis can also be viewed as a measure of how separated clusters are. Are they strongly distinct, or are they partially or significantly overlapping?
K-means has multiple strengths and weaknesses:
K-means will generally run well, so long as we follow the best practices of cleaning data as part of preprocessing.
The way in which you use clustering algorithms in practice depends on how much you know about the entire data science process.
We recommend that beginners start by modeling data on datasets that have already been collected and cleaned. Experienced data scientists, on the other hand, can scale their operations by choosing the right software for the task at hand.
There are over 70 datasets that you can use to familiarize yourself with clustering. For beginners, we recommend the following projects:
Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.
Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done using Julia, R, or Python.
Deep dive into the data science process with this Jupyter Notebook:
Want to take it a step further? Keboola can help you to instrumentalize your entire data operations pipeline.
Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare which ones perform best. Start building models today with our free trial.