Hierarchical Clustering
An unsupervised algorithm that builds a tree of clusters (dendrogram) without needing to specify the number of clusters upfront.
What It Is
Hierarchical clustering groups similar data points into clusters while building a hierarchy (tree-like structure). Unlike K-Means, you do not need to specify the number of clusters in advance. The output is a dendrogram that you cut at the desired level.
Two approaches exist: Agglomerative (bottom-up, most common) starts with each point as its own cluster and merges upward. Divisive (top-down) starts with one cluster and splits downward.
Agglomerative Clustering (Bottom-Up)
Algorithm Steps
- Initialize — Each data point starts as its own cluster. For points A, B, C, D, E you get {A}, {B}, {C}, {D}, {E}.
- Compute distance matrix — Calculate pairwise distances between all clusters (Euclidean is common).
- Merge closest — Find the two clusters with the smallest distance and merge them. E.g., if A and B are closest, merge into {A, B}.
- Update distances — Recompute distances using a linkage method (see below).
- Repeat — Keep merging until all points form a single cluster.
- Cut the dendrogram — Choose a distance threshold to get the desired number of clusters.
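The merge loop described above can be sketched in plain Python. This is a toy illustration of the steps, not scipy's optimized implementation: single linkage on five 1-D points named A through E, as in step 1.

```python
# Toy sketch of the steps above: naive agglomerative clustering with
# single linkage on five 1-D points. Illustrative only -- scipy's
# linkage() is the optimized version used in practice.
points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 5.5, "E": 10.0}
clusters = [[name] for name in points]  # step 1: each point is its own cluster

def single_link(c1, c2):
    # Single linkage: minimum pairwise distance between the two clusters.
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

merge_history = []
while len(clusters) > 1:
    # Steps 2-3: find the closest pair of clusters and merge them.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    merge_history.append((clusters[i], clusters[j]))
    # Step 4: replace the pair with the merged cluster; distances are
    # recomputed on the fly by single_link in the next iteration.
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for left, right in merge_history:
    print(f"merge {left} + {right}")
```

The merge history is exactly what a dendrogram draws: each merge is a junction, and the distance at which it happened is the junction's height.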
Linkage Methods
| Method | Distance Between Clusters | Behavior |
| --- | --- | --- |
| Single Linkage | Minimum distance between any two points | Can create elongated, chain-like clusters |
| Complete Linkage | Maximum distance between any two points | Produces compact, spherical clusters |
| Average Linkage | Average of all pairwise distances | Balanced approach |
| Ward's Method | Minimizes total within-cluster variance | Tends to create equal-sized clusters (most popular) |
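One quick way to compare the linkage methods is to run the same data through each and inspect the resulting assignments. A sketch on synthetic, well-separated blobs (the data here is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two synthetic, well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```

On clean blobs like these all four methods agree; the behavioral differences in the table show up on elongated, overlapping, or noisy data.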
Code: Agglomerative Clustering
import pandas as pd
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
data = {
"Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
"Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35],
}
df = pd.DataFrame(data)
# Select and scale features
X = df[["Age", "Income_LPA"]]
X_scaled = StandardScaler().fit_transform(X)
# Perform Agglomerative Clustering (Ward's method)
Z = linkage(X_scaled, method="ward")
# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Agglomerative Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
Cut the Dendrogram into Clusters
# Cut dendrogram into 2 clusters
clusters = fcluster(Z, t=2, criterion="maxclust")
df["Cluster"] = clusters
print(df)
import seaborn as sns
# Visualize clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="Age", y="Income_LPA", hue="Cluster", palette="Set1", s=100)
plt.title("Agglomerative Clustering on Age vs Income")
plt.show()
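Besides `criterion="maxclust"`, `fcluster` can cut the tree at an explicit distance threshold, i.e. the height at which you slice the dendrogram. A self-contained sketch on synthetic two-blob data (the right threshold value depends on your data and scaling):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic two-blob data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])
Z = linkage(X, method="ward")

# criterion="distance" cuts the dendrogram at height t instead of
# forcing an exact cluster count: merges above t are not applied.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```

This is often the more natural cut when the dendrogram shows an obvious gap between low merges (within clusters) and high merges (between clusters).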
Divisive Clustering (Top-Down)
The opposite approach: start with all points in one cluster, then recursively split clusters (typically the one with the largest variance) using a method such as 2-means. The code below illustrates only the first split.
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
data = {
"Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
"Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35]
}
df = pd.DataFrame(data)
# Start with all points in one cluster
df['Cluster'] = 0
# First divisive split: bisect the single cluster with 2-means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[['Age', 'Income_LPA']])
plt.figure(figsize=(8, 6))
for cluster in df['Cluster'].unique():
subset = df[df['Cluster'] == cluster]
plt.scatter(subset['Age'], subset['Income_LPA'], label=f'Cluster {cluster}')
plt.xlabel("Age")
plt.ylabel("Income (LPA)")
plt.title("Divisive Clustering (Top-Down Example)")
plt.legend()
plt.show()
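The snippet above stops after the first split. A fuller divisive pass repeats that bisection until a target number of clusters is reached; here is a minimal recursive sketch (the `divisive()` helper is my own illustration, not a library routine):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, random_state=42):
    # Start with every row in a single cluster (stored as index arrays).
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest total variance to split next;
        # singletons get -1 so they are never chosen.
        idx = max(
            range(len(clusters)),
            key=lambda i: X[clusters[i]].var(axis=0).sum() if len(clusters[i]) > 1 else -1,
        )
        target = clusters.pop(idx)
        # Bisect the chosen cluster with 2-means.
        labels = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

# Same toy Age/Income values as above.
X = np.array([[22, 16], [25, 8], [28, 10], [30, 9], [32, 15],
              [35, 12], [40, 22], [45, 26], [50, 10], [55, 35]], dtype=float)
for i, members in enumerate(divisive(X, 3)):
    print(f"Cluster {i}: rows {members.tolist()}")
```

Note that unlike the agglomerative version, this greedy top-down scheme does not produce a full dendrogram unless you also record each split.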
When to Use Hierarchical Clustering
| Good For | Not Ideal For |
| --- | --- |
| You do not know the number of clusters | Very large datasets (O(n^3) time, O(n^2) memory) |
| You want to explore cluster structure visually | High-dimensional data without reduction |
| Small to medium datasets | Real-time applications |
| Taxonomies, gene expression analysis | When speed matters more than interpretability |
Hierarchical clustering is computationally expensive. For datasets larger than ~10,000 points, consider K-Means or DBSCAN instead. If you must use it, Ward's linkage is the best default choice.
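For completeness, scikit-learn's `AgglomerativeClustering` wraps the same Ward procedure behind a fit/predict API, returning labels directly without an explicit dendrogram; a sketch on the same toy Age/Income values used above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Same toy Age/Income data as in the scipy example above.
X = np.array([
    [22, 16], [25, 8], [28, 10], [30, 9], [32, 15],
    [35, 12], [40, 22], [45, 26], [50, 10], [55, 35],
], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage with a fixed cluster count -- no dendrogram step needed.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X_scaled)
print(labels)
```

This is convenient in pipelines, but when you want to choose the cut by eye, the scipy `linkage` + `dendrogram` route shown earlier remains the better tool.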
Tags: Unsupervised, Clustering, Dendrogram, scipy