A Brief Overview of Clustering Algorithms

I previously covered one of the main types of machine learning, linear regression. Another major group is clustering algorithms, which, instead of predicting a numeric value, group data points together. Clustering is a way of grouping objects based on definable and distinct traits. For example, animals are grouped into species and families based on physical traits. Algorithmically it's a fairly simple process, and there are two main ways it's done: hierarchical clustering and K-Means clustering. Both methods have their pros and cons, and this is an overview of both. First, let's touch on K-Means, which will make hierarchical clustering easier to understand.

K-Means clustering, which is the form of clustering used in the featured image of this article, is the method to reach for when you can eyeball samples within a dataset and notice well-defined groups. The "K" in "K-Means" is the number of clusters you want to group the data into. If there happens to be a little ambiguity about that number, the elbow method can help: run K-Means for a range of values of K and plot the total distance from every point to its cluster's central point. The optimal number of clusters is shown by a sharp bend (the "elbow") in that graph, though reading the bend can still require a little intuition. Related refinements such as K-Means++ don't pick K for you, but they do choose better starting central points so the algorithm settles on good clusters more reliably.

[Figure: a graph showing the optimal number of clusters is 5]
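To make the elbow method concrete, here's a minimal sketch using scikit-learn and matplotlib. The synthetic make_blobs dataset and the range of K values are assumptions for illustration, not the data behind the figure above.

```python
# A minimal elbow-method sketch: fit K-Means for several values of K and
# plot the total within-cluster distance; the sharp bend suggests the best K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative synthetic data with five obvious groups
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

ks = range(1, 11)
inertias = []
for k in ks:
    # inertia_ is the total squared distance from each point to its centroid
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Total within-cluster distance")
plt.show()  # look for the "elbow" in this curve
```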

Once you've identified the number of clusters you expect to exist in the dataset, K-Means places K central points (centroids) and assigns each data point to the nearest one using the Euclidean distance, which is just the Pythagorean theorem applied to the points' coordinates. It then recomputes each centroid as the mean of its assigned points and repeats the assignment until the clusters stop changing. Hierarchical clustering uses similar math, but the methodology is a little different.
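If you want to see those mechanics rather than rely on a library, here is a bare-bones sketch of that assign-and-update loop in NumPy. The function name, the fixed number of iterations, and the random initialization are my own simplifications for illustration, and the sketch assumes no cluster ever ends up empty.

```python
# A bare-bones K-Means loop: assign each point to its nearest centroid by
# Euclidean distance, then move each centroid to the mean of its points.
import numpy as np

def simple_kmeans(X, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)  # index of the nearest centroid
        # recompute each centroid as the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```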

The primary difference between hierarchical clustering and K-Means is the way the clusters are represented and formed. Hierarchical clustering seeks to form a hierarchy, as the name implies. There are two ways the hierarchy can be formed: agglomerative and divisive. In simpler terms, you can start with every data point as its own cluster and repeatedly merge the closest ones together (agglomerative), or start with all the data in one cluster and repeatedly split it apart (divisive). This method is a bit more computationally and theoretically heavy, so it's often worth avoiding when the clusters are obvious. Part of the reason is that while K-Means only tracks and adjusts the central point of each cluster as it goes, hierarchical clustering builds a dendrogram. As you can see below, a dendrogram is a tree-like diagram that records which data points were merged together and the relative distance at which each merge happened.

[Figure: a dendrogram showing the optimal number of clusters is five]
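For a sense of how a dendrogram like the one above is produced, here's a short sketch with SciPy. The make_blobs data and the "ward" linkage rule are assumptions for illustration, not necessarily what generated the original figure.

```python
# Agglomerative clustering with SciPy: linkage() records, for each merge,
# which clusters were joined and at what distance; dendrogram() draws it.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=5, random_state=42)

Z = linkage(X, method="ward")  # "ward" merges the pair that adds the least variance
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```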

When clusters are not quite as easy to eyeball, hierarchical clustering is a better choice than K-Means. You can visualize the difference below: the same dataset was clustered with both methods, but the image below used the dendrogram above, rather than the central point of each cluster, to determine the groups.

[Figure: clusters without a defined center]
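Getting flat cluster labels like the ones pictured above means cutting the hierarchy at a chosen point. Here's a hedged sketch using SciPy's fcluster; asking for five clusters simply mirrors the earlier figures and is not a rule.

```python
# Cut the hierarchy into a fixed number of clusters. fcluster() walks the
# linkage matrix and returns a flat cluster label for each point; no
# centroids are involved, only the recorded merge distances.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=5, random_state=42)

Z = linkage(X, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")  # ask for at most 5 clusters
print(labels[:10])  # cluster assignment (1-5) for the first ten points
```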

There are several applications of clustering algorithms in the real world. One easy-to-explain example is marketing. Clustering can help explain how someone may react to an item based on their age and other features. This is important because recommender systems that rely on the Apriori algorithm can produce strange results on their own. For example, if a recommender system suggests that a person who buys beer will also buy diapers, that may not be helpful by itself. Combining that information with the shopper's age, marital status, and so on may provide more context. Understanding how certain groups of people behave, and the characteristics of those groups, can be powerful. Clustering can also provide more insight into A/B testing results. Assuming more than just the raw results were captured, clustering can help you choose how best to target different audiences based on several definitive features.

At the end of the day, where linear regression can expose numeric trends, clustering uses definitive traits to define groups. This can be done through a dendrogram or through Euclidean distances to a defined set of cluster centroids. It can provide insight into marketing experiments, be used by biologists to group species and families, and help in many other situations where understanding the traits of a group is valuable. And since no single algorithm gives full insight into a group on its own, it is important to consider the most computationally appropriate approach. Thankfully, choosing one can be as simple as asking a few questions about the dataset or just eyeballing the data points. As always, you can find my source code for clustering algorithms on GitHub.