Clustering is a common unsupervised machine learning technique. Used to detect homogeneous groupings in data, clustering frequently plays a role in applications as diverse as recommender systems, social network analysis and market segmentation.
In this article, we’ll cover clustering algorithms and explain how you can use them to add value to your data analyses.
What Are Clustering Algorithms?
Clustering, also known as cluster analysis, is an unsupervised machine learning task of assigning data into groups. These groups (or clusters) are created by uncovering hidden patterns in the data so that data points with similar patterns end up in the same cluster. The main advantage of clustering lies in its ability to make sense of unlabeled data.
If you’re just starting to study programming or machine learning, terms like “unsupervised” and “unlabeled data” might sound arcane. Below, we’ll explain what they mean.
Unlabeled data is the type of data that’s abundant and relatively easy to obtain. It may be a collection of photos scraped from the internet, a corpus of tweets or any other collection of data points that do not have a label. Labeled data, on the other hand, comes with a label. This could be a corpus of texts with each text labeled as either “slasher novel” or “magical realism,” or a dataset of images with labels such as “cat” and “dog.” Labeled data is more informative but scarce, since labeling is an arduous process often requiring human annotators to manually provide a label for each data point.
All machine learning algorithms require data to learn, but it’s the form of the data — namely, whether it’s labeled or unlabeled — that dictates which algorithms we may apply to it. Based on this, we can distinguish between two main approaches to machine learning: supervised and unsupervised learning.
As its name indicates, supervised machine learning learns under supervision provided in the form of data labels. The results of learning for these algorithms are mappings of data points to labels. By contrast, unsupervised learning lacks the signal provided by the data’s labels. Instead, unsupervised learning algorithms employ various statistical methods to derive the labels themselves.
For example, a supervised learning algorithm applied to a dataset of emails could require every email to have a label like “professional email” or “personal email.” The learning objective of such an algorithm would be to learn the language patterns that distinguish the two types of emails.
On the other hand, a clustering algorithm applied to the same corpus might very well create clusters that correspond to “professional emails” and “personal emails.” Additionally, if the dataset contained emails that were sent as advertisements, a clustering algorithm could discover these as well.
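To make the contrast concrete, here is a minimal sketch of the email scenario described above, assuming scikit-learn is available. The toy corpus, the vectorizer settings and the choice of two clusters are all invented for illustration; no labels are given to the algorithm.

```python
# A hedged sketch: grouping a tiny, invented "email" corpus without labels,
# using TF-IDF features and k-means (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "Please review the attached quarterly report before the meeting",
    "The project deadline has been moved to Friday, see the agenda",
    "Reminder: submit your timesheet before the end of the month",
    "Happy birthday! Are you coming to the barbecue this weekend?",
    "Thanks for dinner last night, let's do it again soon",
    "The kids loved the beach trip, sending you the photos",
]

# Turn raw text into numeric vectors the algorithm can compare.
vectors = TfidfVectorizer(stop_words="english").fit_transform(emails)

# Ask for two clusters; ideally they align with "professional" vs. "personal".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

Note that the algorithm only returns cluster indices such as 0 and 1; interpreting a cluster as “professional” or “personal” is still up to the analyst.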
Types of Clustering Algorithms
Although there’s a rich body of knowledge on clustering algorithms, there’s little consensus on how they should be taxonomized. Different sources will classify them based on varying criteria. In our view, there are two categorizations that are useful for practical application: (i) based on the number of clusters that a data point may belong to; and, (ii) based on the shapes of the resulting clusters.
The first distinction is essentially one between “hard” and “soft” clustering algorithms. More specifically, a data point may be a member of just one cluster (hard clustering) or it may belong to multiple clusters with varying degrees (soft clustering). Depending on your application, you may want your clusters to be rigid or you may want them to overlap, so this is an important consideration.
The other categorization is based on the shape and types of clusters that an algorithm produces. This division contains many different cluster types with some of the most established being hierarchical, centroid-based and density-based algorithms. We’ll cover these in more detail in the next section.
Note that these two categorizations are not mutually exclusive. For example, a popular clustering algorithm called k-means clustering is both hard and centroid-based. However, these classifications should be taken with a grain of salt because many clustering algorithms do not belong strictly to one clustering approach. As such, the categorizations rather serve as a guideline for choosing the most suitable algorithm for your application.
The main idea behind centroid clustering is to identify the centroids in the dataset. Centroids constitute the centers of the clusters whereas the clusters themselves are created by assigning every data point to the centroid that’s closest to it. Unlike other clustering methods that find the optimal number of clusters automatically, centroid-based algorithms require the analyst to define this number beforehand. Therefore, it’s important to have an idea of the clusters that might exist in your data prior to applying the algorithm.
Hierarchical clustering is used to build a cluster ranking system. This type of clustering detects the main, separate clusters in addition to detecting subclusters, which are clusters that exist within other clusters. Compared to centroid-based methods, hierarchical clustering is more useful if you want to discover hidden substructures in your data.
Whereas the previous two approaches consider all data points when creating a cluster, density-based clustering considers only areas with a high concentration of data points. Data points outside a manually chosen radius are considered to be outliers and are discarded from analysis.
K-Means Algorithm Walkthrough
K-means is a popular centroid-based, hard clustering algorithm. Its ubiquity is due to the algorithm’s sheer power despite being simple and intuitive to grasp. In fact, many other clustering algorithms build on top of k-means or are a slight variation of it. Below, we provide a step-by-step overview of the algorithm’s learning process:
- The algorithm creates a fixed (“k”) number of centroids which represent the cluster centers. The centroids are initialized by randomly sampling a “k” number of data points from the dataset and using them as the centroids.
- Once the centroids are selected, every data point is assigned to the cluster with the closest center. The “closeness” is often a metric like Cosine distance or something similar; the choice depends on what the algorithm is being applied to.
- Once the algorithm assigns every data point to a cluster, it’s time to recalibrate the clusters’ centroids.The algorithm does this by calculating the mean of all data points assigned to a cluster and reassigning the centroid to this mean, hence the algorithm’s name. With new centroid values in place, the algorithm goes back to the second step and each data point is once again assigned to the closest cluster. However, this time the clusters will be assigned differently because of the new centroid locations.
The last two steps are crucial to k-means clustering. They ensure that the algorithm learns by iteratively updating the centroids so that they better reflect the actual groups present in the data. The algorithm performs these updates for as long as a shift in the cluster’s centroid locations causes data to shift from one cluster to another. When an update in the centroids’ locations causes data to be assigned to the same clusters as in the previous iteration, the algorithm finishes computation.
It falls upon the analyst to set the hyperparameter k — the number of clusters that the algorithm creates. You should have a rough idea of how many distinct clusters might exist in the data and choose that number accordingly. However, if you’re unable to make an estimate, you can use algorithms like the elbow method and the silhouette method to determine the optimal number of clusters.
When To Use Clustering Algorithms?
Clustering by itself will not answer all of your questions. Rather, it brings value only when used as a part of a broader analysis strategy. Here are a few common scenarios in which you can employ clustering to generate value.
Exploratory Data Analysis
The first and most crucial step to any data analysis is exploring and getting to know your data. Clustering is often performed as part of this process to give a better sense of the data and inform the trajectory of future analyses.
Additionally, if you’re working with a large, unlabeled and unstructured dataset, clustering can be helpful to trim the dataset down to a size that’s more manageable for analysis. For example, you may use clustering to discard all data belonging to clusters of no interest to you. But clustering may prove useful even when used within a supervised learning scenario. For example, when applied to a labeled dataset, clustering can reveal if the dataset contains more groups than the ones specified by labels.
You can also use clustering to detect and discard outliers. Doing so results in a cleaner dataset with less noise, making it easier for machine learning algorithms to learn a better model of your data. This is important because models trained on a dataset with many outliers often end up having poor predictive power.
Some clustering algorithms such as those based on regions of density are especially good at finding outliers. These are often used in anomaly detection applications such as detecting fraud, system malfunctions or network intrusion.
Clustering User Demographics
Lastly, you can also leverage clustering to get an idea of an idea of your customer demographic.
For example, an online retailer can create distinct clusters based on its customers’ purchase history. By doing so, it might identify a cluster belonging to environmentally conscious Generation Z shoppers as a group that was previously not on their radar. The retailer might then use this information to develop better marketing strategies, like creating targeted ads.
Good Practices For Clustering Algorithms
Like much of data science, clustering relies on the process of trial and error.
Often, the output of a clustering algorithm may not immediately make much sense. It’s up to the analyst to determine the significance of the resulting clusters, before deciding whether to try another algorithm.
Seeing as clustering is one of the oldest and most researched machine learning techniques in theory, there are a lot of algorithms to choose from. To minimize time spent on the trial-and-error process, it’s a good idea to learn the advantages of the various algorithms, including the type of data and tasks they’re best suited for and the type of clusters they output.
Some clustering algorithms are sensitive to their initializing values. In such cases, you may need to simply run the algorithm a few more times. In other cases, your data simply cannot be clustered meaningfully by the chosen method, leading you to opt for another.
Another important consideration is the size of your data. For instance, hierarchical clustering algorithms have a cubic time complexity, meaning that they fare badly with large datasets. In such scenarios, k-means would be a better choice — if a shift from a hierarchical to a centroid-based method is acceptable — since the runtime of k-means is a magnitude of order smaller. Dataset size can thus have a major effect on the algorithm’s performance.
Finally, consider the type of your data. Many clustering algorithms use a distance-based measure for calculating clusters, which means that your dataset’s features need to be numeric. Although categorical values can be one-hot-encoded into binary values, calculating distances between them does not make a lot of sense. As an alternative, you may try out k-modes clustering — which is made for handling both numeric and categorical data — or a different approach altogether.
Learn Programming Fundamentals Online
In this article, we provided an overview of clustering algorithms and explained how you can incorporate them into your data analysis process. We hope you’ll keep building on this knowledge.
As part of our Introduction to Programming Nanodegree program, we start from scratch to teach you the programming skills you’ll need to jumpstart your career as a data specialist.