A Simplified Guide to Hierarchical Clustering in R: Made Easy!

In cluster analysis, the question that often comes up is: "What's next?" Whether you want to find some order in messy data or simply aren't working from a pre-set hypothesis, you have plenty of techniques at your disposal. These include the traditional k-means, density-based methods...


Cluster analysis is a powerful way to find structure in data without a pre-defined hypothesis. Hierarchical clustering, the focus of this discussion, is used here not just to identify which samples cluster together, but also to understand why they do so.

In R, a new package called "hclustR" makes it easy to build appealing hierarchical clusterings. It also allows easy manipulation of the resulting dendrogram, the tree diagram that shows the clusters and how they relate to one another.

The article provides the code to generate the figures, making it accessible for readers to replicate the analysis. The mtcars dataset, a classic in data analysis, was used to run a hierarchical cluster analysis. Continuous variables were cut by their quartiles to assign discrete colours, and the resulting dendrogram was transformed using Tal Galili's dendextend package.
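The steps described above can be sketched in base R as follows. This is a minimal sketch, not the article's original code: scaling the data, the Euclidean distance, the complete linkage, the grey palette, and the helper name quartile_colours are all assumptions made for illustration.

```r
# Standardise the columns so large-unit variables (disp, hp) do not dominate.
cars_scaled <- scale(mtcars)

# Euclidean distance and complete linkage are assumptions (base R defaults).
hc <- hclust(dist(cars_scaled, method = "euclidean"), method = "complete")
dend <- as.dendrogram(hc)

# Hypothetical helper: bin a continuous variable at its quartiles and
# return one colour per observation.
quartile_colours <- function(x, shades = c("grey90", "grey65", "grey40", "grey10")) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
  bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  shades[bins]
}

mpg_colours <- quartile_colours(mtcars$mpg)
names(mpg_colours) <- rownames(mtcars)

plot(dend, main = "mtcars, complete linkage on scaled data")
```

Note that cut() stops with an error when two quartile breaks coincide, which happens for near-discrete columns such as cyl, so the helper is meant for the continuous columns only (mpg, disp, hp, drat, wt, qsec).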

Classical methods such as k-means, DBSCAN, and hierarchical clustering were employed in the analysis. The resulting dendrogram revealed some interesting findings. For instance, the Honda Civic's combination of rear axle ratio (drat) and fuel economy (mpg) is so different from that of the Merc 450 group that it clusters with the Toyota and the Fiat instead, even though its other parameters are similar. Conversely, the group in the middle containing the Merc 450 series shares similarities in weight class (wt), rear axle ratio (drat), number of gears (gear), cylinders (cyl), transmission type (am), and engine shape (vs), but differs in displacement (disp) and horsepower (hp).
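One way to check observations like these numerically is to cut the tree into groups and compare the feature averages of each group. The sketch below assumes scaled data, complete linkage, and an arbitrary k = 4. The exact car names ("Toyota Corolla", "Fiat 128") are guesses at which Toyota and Fiat the text means, so the groupings it prints may not match the figure discussed here.

```r
# Same assumed setup: scaled mtcars, Euclidean distance, complete linkage.
hc <- hclust(dist(scale(mtcars)), method = "complete")

# Cut the tree into k groups; k = 4 is an arbitrary choice for illustration.
clusters <- cutree(hc, k = 4)

# Cluster membership of the cars mentioned in the text (names are assumptions).
cars_of_interest <- c("Honda Civic", "Toyota Corolla", "Fiat 128",
                      "Merc 450SE", "Merc 450SL", "Merc 450SLC")
print(clusters[cars_of_interest])

# Mean of every feature per cluster, to see what the members have in common.
cluster_profile <- aggregate(mtcars, by = list(cluster = clusters), FUN = mean)
print(round(cluster_profile, 2))
```

Features whose means differ sharply between two groups are the ones driving them apart. Features with similar means are what the groups share.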

The question often asked after a cluster analysis is: "Now what?" The value of the analysis comes from understanding which features bring samples together and which drive them apart. To show this, colored bars are added beneath the dendrogram to represent each sample's feature levels, and the dendrogram lines and leaf labels can be colored by engine type, giving a more detailed visual representation.
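A rough sketch of this kind of annotation with dendextend is shown below. It is not the article's original code: the colour choices, the selection of features for the bars, the margin size, and the helper quartile_colours are assumptions. The dendextend calls used are set(), branches_attr_by_labels(), and colored_bars(), and they run only if the package is installed.

```r
cars_scaled <- scale(mtcars)
dend <- as.dendrogram(hclust(dist(cars_scaled)))

# Engine type comes from the vs column: 0 = V-shaped, 1 = straight.
engine_col <- ifelse(mtcars$vs == 1, "steelblue", "firebrick")

# Hypothetical helper: one shade per quartile of a continuous feature.
quartile_colours <- function(x) {
  shades <- c("grey90", "grey65", "grey40", "grey10")
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
  shades[cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)]
}

# Character matrix: one row per car (data order), one column per colour bar.
bar_cols <- sapply(mtcars[, c("mpg", "disp", "hp", "drat", "wt")], quartile_colours)

if (requireNamespace("dendextend", quietly = TRUE)) {
  # Label colours are assigned in dendrogram order, so reorder them first.
  dend <- dendextend::set(dend, "labels_colors",
                          engine_col[order.dendrogram(dend)])

  # Colour branches whose leaves are all straight-engine cars (4 = blue).
  straight <- rownames(mtcars)[mtcars$vs == 1]
  dend <- dendextend::branches_attr_by_labels(
    dend, labels = straight, TF_values = c(4, Inf), attr = "col"
  )

  op <- par(mar = c(14, 4, 1, 1))  # leave room for labels and bars
  plot(dend)
  # colored_bars() expects colours in data order and reorders them itself.
  dendextend::colored_bars(colors = bar_cols, dend = dend,
                           rowLabels = colnames(bar_cols))
  par(op)
}
```

Keeping the bars in greyscale and reserving color for engine type is one way to keep the annotations readable without letting them dominate the tree.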

The aim is for these visual additions to make the insights self-explanatory without stealing the show. As for where a Tesla would be grouped if the dataset were updated, that remains to be seen.

In conclusion, cluster analysis, particularly hierarchical cluster analysis, offers a valuable approach to data analysis. By using tools like hclustR and dendextend, analysts can gain insights into the relationships between data points and make informed decisions based on those insights.