How To

The Ultimate Guide to Cluster Analysis

Cluster analysis is a process used in artificial intelligence and data mining to discover the hidden structure in your data. There is no single cluster analysis algorithm. Instead, data practitioners choose the algorithm which best fits their needs for structure discovery.

Here, we present a comprehensive overview of cluster analysis, which can be used as a guide for both beginners and advanced data scientists.

Cluster analysis is a popular machine learning approach used in data mining and exploratory data analysis. Its purpose is to discover groups in seemingly unstructured data.

Let’s begin with a real business situation to illustrate what cluster analysis actually does.

Your company sells a variety of products online, from high-end TVs to sports equipment and kitchen utensils. It’s a one-stop e-commerce shop. The Head of Marketing comes to you and says that they would like to tailor their marketing messages. Your budget is not big enough to go fully personalized by creating a message that is specific to each customer, but you do have the resources and manpower to create 3 - 5 different campaigns.

How do you solve this problem?

This is where cluster analysis can help you. Cluster analysis is an approach in machine learning which uses different clustering algorithms to discover clusters (or groups) within seemingly disorganized data.

For the situation above, you would start by looking at purchase records:

Using clustering algorithms, you could determine that customers fall into three different groups based on what they purchase:

Clustering analysis helps you to identify three groups of customers based on their purchasing patterns. Now you can go to the Head of Marketing and tell them, so that they can create three different campaigns (one for each cluster):

- (A) Fashionable Geeks - customers who buy gaming tech and clothes
- (B) Pragmatists - customers who buy products that solve a concrete problem
- (C) Audio-Video aficionados - customers who are into high-tech gadgets

Clustering algorithms will not help you to name the clusters that are identified - that’s up to you. But they will help you find groups based on members who show similarities and differences.

Clustering is used whenever we need to segment our customers, users, partners, or products, in order to better understand their inner structure:

**Create marketing personas**. Find a manageable number of segments (3 - 5 is usually ideal) which represent your customer personas. These are idealized customer profiles, which can be used by marketing and sales to tailor their messaging and sales approaches.**Segment churned users based on reason for churning**. Cluster your churned users to identify the most common reasons for churning. This approach is especially useful for identifying commonalities in your products or services that are problematic for your customer user base. For example, some customers stop subscribing to your service because they - in their words - find the software too slow, while others churn because they need to manually enter their information multiple times due to a bug. Clustering can help you to discover that these customers experience the same problems (as in, those who complained about slowness also experienced manual entry issues).

**Use clustering as a stepping stone to other machine learning**. For example, cluster your prospects by how likely they are to convert from leads to customers, then use those clusters with logistic regression to develop a **lead qualification prediction algorithm**. Or, cluster your products using past purchase records: clustering algorithms can identify which products are usually bought together. You can then use this information for a **product recommendation engine**.

In production, clustering algorithms are rarely used as a self-standing machine learning product.

Instead, clustering algorithms have two advantages, which often lead to them being used as an in-between step towards a final goal:

**Surprising discoveries**. In many cases, clustering shows us how data points group together and which features are commonly shared among members of the same group. This discovery often changes how we think about those groups. For example, the e-commerce setting described above grouped customers by product choice. But clustering could reveal that there is no smart way of dividing customers based on the products that they have purchased. Instead, our customers could be grouped by the time at which their purchase was made, e.g. night owls buying large quantities of products, business-hour buyers who make impulsive low-value purchases, and considerate buyers who make purchases throughout the rest of the day. This would alter the course of our marketing efforts from a product-centric approach to one that is centered around lifestyle.**Supporting other algorithms**. Oftentimes, our data has not been cleaned and organized in the format that is required by other algorithms. In the example above, we sometimes don’t know which products are bought together, or how to qualify prospects. Even in cases where we do, manually labeling each entry can be extremely time-consuming. Clustering analysis helps us by labelling the examples and speeding up the process, so that other predictors and recommender algorithms can use the data for their work.

Download the entire modeling process with this Jupyter Notebook.

Clustering analysis is an umbrella term that encompasses a myriad of clustering algorithms, all of which solve unsupervised classification tasks. That is, they try to discern the underlying structure in the dataset without any guidance or labels (unsupervised machine learning) with the end goal of assigning each example to a discrete category or class (classification).

There are multiple clustering algorithms based on the statistic method used to classify examples:

- Partitioning models (K-means)
- Connectivity models or Agglomerative models (hierarchical clustering)
- Density models (DBSCAN or OPTICS)
- Graph-based models (HCS clustering)
- Fuzzy clustering models
- Dimensionality reduction models (Principal Component Analysis, Factor Analysis)

Typical cluster models all employ a common method: they find an arrangement of homogenous groups, whereby the elements within each specific group are as similar to each other, and as dissimilar to members of the other groups, as possible.

The specific measure of similarity and dissimilarity depends on the choice of the clustering algorithm. Here, we will take a deeper look at one of the most popular clustering algorithms: K-means.

K-means is an elegant algorithm used for partitioning the dataset into K distinct and non-overlapping clusters.

The ‘k’ in K-means represents the top-down defined number of clusters that we expect the algorithm to find in the segmentation analysis.

The pseudocode for K-means is as follows:

1. Decide on the number of clusters, k, that the model needs to find.

2. Randomly assign each point to a cluster from 1 to k.

3. Iterate until the data points no longer change clusters:

a) For each cluster (1 to k), compute its *centroid*.

The centroid is the central point between all points of the same cluster.

b) Reassign each data point to the cluster that is closest to it.

i) Closeness is defined using Euclidean distance.

In layman’s terms, K-means assigns each point to a cluster at random. It then computes the distances between the cluster’s centroids (centers) and other data points. If a point is closer to another cluster’s centroid than the one it belongs to, it gets reassigned.

We can visualize the K-means iterative procedure for points on a 2D plane:

Unlike supervised learning, where we can check how well our machine learning model performed in comparison to some external truth (e.g. class labels), there is no ground truth for unsupervised learning models.

There are, however, two main methods for evaluating the success of our K-means solution:

- The Elbow method
- Silhouette analysis

The Elbow method is used to determine the optimal number of clusters. Sometimes, the ideal number of groups is imposed by our business constraints (e.g. we need to find exactly three customer personas because that is the maximum that we can handle in our campaigns).

But if we want to discover the optimal number of groups, the Elbow method can assist us.

How does it work?

- Compute the Sum of Squared Distance (SSE) between data points and the centroids of their assigned clusters for different values of k.
- Visualize the data on a graph, where the x-axis represents the number of clusters (k) and the y-axis shows the SSE for each cluster number.
- The ‘elbow’ is the point of inflection at which SSE does not (significantly) increase anymore, even when you add additional clusters. The last point of inflection is the optimal number of clusters.

We generated the image below with Python (check the entire Jupyter Notebook here). As you can see, the biggest decrease between SSE happened between clusters 1 and 2, and then it continued to decrease monotonically. So, the ideal number of clusters would be 2 (i.e. where the elbow is).

The Silhouette method assesses how well each point lies within each cluster. In other words, it determines whether the similarity of within-cluster points is stronger than the similarity of out-cluster points.

How does it work?

- For each point i within the cluster C, compute its average distance to every other point within the same cluster. Let’s call this “distance a(i)”. This distance measures how well a point is assigned to its cluster: the lower the distance, the more homogenous the cluster.
- For each point i within cluster C, compute the average distance to every other point which is
*not*in cluster C. Let’s call this “distance b(i)”. This distance measures dissimilarity: how different is point i from the points of other clusters?

Finally, the silhouette s for each data point i is computed as:

How do we interpret this?

- Values of s(i) close to 1 mean that the data point is far away from neighboring clusters and usually well-clustered.
- Values of s(i) close to 0 mean that the data point is close to neighboring clusters and we could improve clustering.
- Values of s(i) close to -1 mean that the data point is assigned to the wrong cluster and we should redo our clustering analysis.

Silhouette analysis can also be viewed as a measure of how separated clusters are. Are they strongly distinct, or are they partially or significantly overlapping?

K-means has multiple strengths and weaknesses:

**Efficient vs. NP-hard**. The official formulation of K-means is what is technically called an NP-hard problem. This means that it gets progressively computationally harder to train the model as the number of examples you have increases. Python libraries like Scikit-learn have designed certain optimizations under the hood to make the algorithm run effectively so that it can scale to larger datasets.**Sensitive to outliers**. K-means is extremely sensitive to outliers. If some points lie outside of the expected range, they are going to disproportionately affect the centroid position (which is based on average distance) and therefore skew cluster assignment. A common technique is to remove outliers or use K-medoids clustering - a variant of K-means that is more robust to noise and outliers.**Non-deterministic**. The end result of K-means depends very strongly on the initial random assignment of points to clusters. This is especially true if data points are grouped close together and the real clusters are (partially) overlapping. In practice, we would rerun the K-means algorithm multiple times until we found a stable solution (or one that appears most often).**Sensitive to missing values**. K-means computes Euclidean distances between points. If a point has missing values in one of its features (independent variables), the algorithm will fail to compute the difference. We either need to drop that point from the analysis entirely, impute the best alternative for that missing value, or remove the feature from the set of features used in the analysis.

K-means will generally run well, so long as we follow the best practices of cleaning data as part of preprocessing.

The way in which you use clustering algorithms in practice depends on how much you know about the entire data science process.

We recommend that beginners start by modeling data on datasets that have already been collected and cleaned. Experienced data scientists, on the other hand, can scale their operations by choosing the right software for the task at hand.

There are over 70 datasets that you can use to familiarize yourself with clustering. For beginners, we recommend the following projects:

- The Iris flower dataset is the baseline for all clustering. The biological structures in the petals of the flowers lead to nicely separated clusters of different flower species.
- Look at over 50 customer attributes to predict whether clients of a major telecommunications firm will churn. This dataset is especially suited for beginners because you can check your clustering algorithm against the actual churning behavior (the examples are labeled).
- Cluster stocks into different segment types to discover if there is a structure behind dividend yields and stock returns.

Production data science means spending more than 80% of your time on data collection and cleaning. If you want to speed up the entire data pipeline, use software that automates tasks to give you more time for data modeling.

Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done using Julia, R, or Python.

Deep dive into the data science process with this Jupyter Notebook:

- Collect the relevant data.
- Explore and clean the data to discover patterns.
- Train your cluster analysis model.
- Evaluate the model with a variety of metrics.

Want to take it a step further? Keboola can help you to instrumentalize your entire data operations pipeline.

Being a data-centric platform, Keboola also allows you to build your ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare which ones perform best. Start building models today with our free trial.