Hierarchical Clustering
An unsupervised algorithm that builds a tree of clusters (dendrogram) without needing to specify the number of clusters upfront.
What It Is
Hierarchical clustering groups similar data points into clusters while building a hierarchy (tree-like structure). Unlike K-Means, you do not need to specify the number of clusters in advance. The output is a dendrogram that you cut at the desired level.
Two approaches exist: Agglomerative (bottom-up, most common) starts with each point as its own cluster and merges upward. Divisive (top-down) starts with one cluster and splits downward.
Agglomerative Clustering (Bottom-Up)
Algorithm Steps
- Initialize — Each data point starts as its own cluster. For points A, B, C, D, E you get {A}, {B}, {C}, {D}, {E}.
- Compute distance matrix — Calculate pairwise distances between all clusters (Euclidean is common).
- Merge closest — Find the two clusters with the smallest distance and merge them. E.g., if A and B are closest, merge into {A, B}.
- Update distances — Recompute distances using a linkage method (see below).
- Repeat — Keep merging until all points form a single cluster.
- Cut the dendrogram — Choose a distance threshold to get the desired number of clusters.
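The merge loop described above can be sketched in plain Python. This is a toy illustration of the steps, not scipy's optimized implementation: single linkage on five 1-D points named A through E, as in step 1.

```python
# Toy sketch of the steps above: naive agglomerative clustering with
# single linkage on five 1-D points. Illustrative only -- scipy's
# linkage() is the optimized version used in practice.
points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 5.5, "E": 10.0}
clusters = [[name] for name in points]  # step 1: each point is its own cluster

def single_link(c1, c2):
    # Single linkage: minimum pairwise distance between the two clusters.
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

merge_history = []
while len(clusters) > 1:
    # Steps 2-3: find the closest pair of clusters and merge them.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    merge_history.append((clusters[i], clusters[j]))
    # Step 4: replace the pair with the merged cluster; distances are
    # recomputed on the fly by single_link in the next iteration.
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for left, right in merge_history:
    print(f"merge {left} + {right}")
```

The merge history is exactly what a dendrogram draws: each merge is a junction, and the distance at which it happened is the junction's height.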
Linkage Methods
| Method | Distance Between Clusters | Behavior |
| --- | --- | --- |
| Single Linkage | Minimum distance between any two points | Can create elongated, chain-like clusters |
| Complete Linkage | Maximum distance between any two points | Produces compact, spherical clusters |
| Average Linkage | Average of all pairwise distances | Balanced approach |
| Ward's Method | Minimizes total within-cluster variance | Tends to create equal-sized clusters (most popular) |
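One quick way to compare the linkage methods is to run the same data through each and inspect the resulting assignments. A sketch on synthetic, well-separated blobs (the data here is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two synthetic, well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```

On clean blobs like these all four methods agree; the behavioral differences in the table show up on elongated, overlapping, or noisy data.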
Code: Agglomerative Clustering
import pandas as pd
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
data = {
"Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
"Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35],
}
df = pd.DataFrame(data)
# Select and scale features
X = df[["Age", "Income_LPA"]]
X_scaled = StandardScaler().fit_transform(X)
# Perform Agglomerative Clustering (Ward's method)
Z = linkage(X_scaled, method="ward")
# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Agglomerative Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
Cut the Dendrogram into Clusters
# Cut dendrogram into 2 clusters
clusters = fcluster(Z, t=2, criterion="maxclust")
df["Cluster"] = clusters
print(df)
import seaborn as sns
# Visualize clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="Age", y="Income_LPA", hue="Cluster", palette="Set1", s=100)
plt.title("Agglomerative Clustering on Age vs Income")
plt.show()
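Besides `criterion="maxclust"`, `fcluster` can cut the tree at an explicit distance threshold, i.e. the height at which you slice the dendrogram. A self-contained sketch on synthetic two-blob data (the right threshold value depends on your data and scaling):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic two-blob data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])
Z = linkage(X, method="ward")

# criterion="distance" cuts the dendrogram at height t instead of
# forcing an exact cluster count: merges above t are not applied.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```

This is often the more natural cut when the dendrogram shows an obvious gap between low merges (within clusters) and high merges (between clusters).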
Divisive Clustering (Top-Down)
The opposite approach: start with all points in one cluster, then recursively split clusters (typically the one with the largest variance) using a method such as 2-means. The code below illustrates only the first split.
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
data = {
"Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
"Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35]
}
df = pd.DataFrame(data)
# Start with all points in one cluster
df['Cluster'] = 0
# First divisive split: bisect the single cluster with 2-means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[['Age', 'Income_LPA']])
plt.figure(figsize=(8, 6))
for cluster in df['Cluster'].unique():
subset = df[df['Cluster'] == cluster]
plt.scatter(subset['Age'], subset['Income_LPA'], label=f'Cluster {cluster}')
plt.xlabel("Age")
plt.ylabel("Income (LPA)")
plt.title("Divisive Clustering (Top-Down Example)")
plt.legend()
plt.show()
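The snippet above stops after the first split. A fuller divisive pass repeats that bisection until a target number of clusters is reached; here is a minimal recursive sketch (the `divisive()` helper is my own illustration, not a library routine):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, random_state=42):
    # Start with every row in a single cluster (stored as index arrays).
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest total variance to split next;
        # singletons get -1 so they are never chosen.
        idx = max(
            range(len(clusters)),
            key=lambda i: X[clusters[i]].var(axis=0).sum() if len(clusters[i]) > 1 else -1,
        )
        target = clusters.pop(idx)
        # Bisect the chosen cluster with 2-means.
        labels = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

# Same toy Age/Income values as above.
X = np.array([[22, 16], [25, 8], [28, 10], [30, 9], [32, 15],
              [35, 12], [40, 22], [45, 26], [50, 10], [55, 35]], dtype=float)
for i, members in enumerate(divisive(X, 3)):
    print(f"Cluster {i}: rows {members.tolist()}")
```

Note that unlike the agglomerative version, this greedy top-down scheme does not produce a full dendrogram unless you also record each split.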
When to Use Hierarchical Clustering
| Good For | Not Ideal For |
| --- | --- |
| You do not know the number of clusters | Very large datasets (O(n^3) time, O(n^2) memory) |
| You want to explore cluster structure visually | High-dimensional data without reduction |
| Small to medium datasets | Real-time applications |
| Taxonomies, gene expression analysis | When speed matters more than interpretability |
Hierarchical clustering is computationally expensive. For datasets larger than ~10,000 points, consider K-Means or DBSCAN instead. If you must use it, Ward's linkage is the best default choice.
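For completeness, scikit-learn's `AgglomerativeClustering` wraps the same Ward procedure behind a fit/predict API, returning labels directly without an explicit dendrogram; a sketch on the same toy Age/Income values used above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Same toy Age/Income data as in the scipy example above.
X = np.array([
    [22, 16], [25, 8], [28, 10], [30, 9], [32, 15],
    [35, 12], [40, 22], [45, 26], [50, 10], [55, 35],
], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage with a fixed cluster count -- no dendrogram step needed.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X_scaled)
print(labels)
```

This is convenient in pipelines, but when you want to choose the cut by eye, the scipy `linkage` + `dendrogram` route shown earlier remains the better tool.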
Tags: Unsupervised, Clustering, Dendrogram, scipy