Mean Shift Clustering

A density-based algorithm that finds cluster centers by iteratively shifting toward regions of highest data density. No need to specify the number of clusters.

What It Is

Mean Shift finds clusters by locating the peaks (modes) of the data's density function. Each data point iteratively moves toward the mean of points within a window (bandwidth), converging at density peaks. Points that converge to the same peak belong to the same cluster.
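For reference, the shift applied at each step can be written as a kernel-weighted mean. This is the usual textbook form, not something stated on this page; the symbols x (current window center), x_i (data points), K (kernel) and h (bandwidth) are notation introduced here for illustration.

```latex
m(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) x_i}
            {\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)},
\qquad x \leftarrow m(x)
```

With a flat kernel (K(u) = 1 when \|u\| \le 1, otherwise 0), m(x) reduces to the plain average of the points within distance h of x, which is the "mean of points within a window" described above.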

Mean Shift automatically determines the number of clusters. You only need to set the bandwidth (window size), which controls the granularity of the clustering.

Key Concept: Bandwidth

The bandwidth defines the radius of the window used to compute the local mean. Small bandwidth = many small clusters. Large bandwidth = fewer, larger clusters.
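A quick way to see this effect is to fit the same data with several bandwidths and count the resulting clusters. The sketch below assumes scikit-learn is installed; the three bandwidth values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Toy data: three Gaussian blobs (same settings as the example on this page)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Bandwidth values picked only to illustrate the trend
cluster_counts = {}
for bw in (0.5, 2.0, 5.0):
    ms = MeanShift(bandwidth=bw).fit(X)
    cluster_counts[bw] = len(ms.cluster_centers_)
    print(f"bandwidth={bw}: {cluster_counts[bw]} clusters")
```

The exact counts depend on the data, but the smallest bandwidth should fragment the blobs into many clusters while the largest merges them.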

How It Works

Algorithm Steps

1. Place a window (a circle of radius = bandwidth in 2D) around each data point
2. Compute the mean of all points inside the window
3. Shift the window center to that mean position
4. Repeat steps 2 and 3 until the center stops moving (convergence)
5. Group points that converge to the same peak into one cluster
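The steps above can be sketched directly in NumPy. This is a minimal illustration with a flat (uniform) window, not how scikit-learn implements Mean Shift. The function name mean_shift_flat, the max_iter and tol parameters, and the bandwidth / 2 merge radius are all assumptions made for this sketch.

```python
import numpy as np

def mean_shift_flat(X, bandwidth, max_iter=300, tol=1e-4):
    """Toy Mean Shift with a flat window; hypothetical helper for illustration."""
    X = np.asarray(X, dtype=float)
    centers = X.copy()  # Step 1: one window per data point
    for _ in range(max_iter):
        # Distance from every window center to every data point, shape (n, n)
        dists = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
        inside = (dists <= bandwidth).astype(float)
        # Step 2: mean of the points inside each window
        new_centers = (inside @ X) / inside.sum(axis=1, keepdims=True)
        # Step 3: shift every window; Step 4: stop once no window moves more than tol
        max_shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if max_shift < tol:
            break
    # Step 5: windows that ended up close together share a cluster
    # (the bandwidth / 2 merge radius is an assumption of this sketch)
    modes = []
    labels = np.empty(len(X), dtype=int)
    for i, c in enumerate(centers):
        for k, m in enumerate(modes):
            if np.linalg.norm(c - m) < bandwidth / 2:
                labels[i] = k
                break
        else:
            modes.append(c)
            labels[i] = len(modes) - 1
    return np.array(modes), labels

# Two well-separated synthetic blobs (made-up data)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])
modes, labels = mean_shift_flat(X_demo, bandwidth=1.5)
print(len(modes), "clusters found")
```

The full pairwise distance matrix makes the O(n^2) cost per iteration visible: every window is compared against every data point on each pass.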

Code: Mean Shift Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Apply Mean Shift
ms = MeanShift(bandwidth=2)  # bandwidth = window size
labels = ms.fit_predict(X)
centers = ms.cluster_centers_

print("Cluster centers:")
print(centers)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="plasma", s=30)
plt.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=200)
plt.title("Mean Shift Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Bandwidth Selection

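scikit-learn provides estimate_bandwidth, which estimates a bandwidth from the data. Its quantile parameter (between 0 and 1) controls how many nearest neighbors are considered; higher values give a larger bandwidth and therefore fewer clusters: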

from sklearn.cluster import MeanShift, estimate_bandwidth

# X is the data array from the previous example

# Automatically estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2)
print(f"Estimated bandwidth: {bandwidth}")

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
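To get a feel for how the quantile argument affects the estimate, it can help to print the bandwidth for a few values. The sketch below regenerates the same toy data so it runs on its own; the quantile values are arbitrary.

```python
from sklearn.cluster import estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Larger quantile -> more neighbors considered -> larger estimated bandwidth
estimates = {}
for q in (0.1, 0.3, 0.5):
    estimates[q] = estimate_bandwidth(X, quantile=q)
    print(f"quantile={q}: bandwidth={estimates[q]:.2f}")
```

Treat the estimate as a starting point and check the resulting clusters, since a quantile that works for one dataset may over-merge or over-split another.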

Mean Shift vs K-Means

Feature	K-Means	Mean Shift
Number of clusters	Must specify K	Determined automatically
Cluster shape	Roughly spherical (convex)	Arbitrary shape
Parameters	K (number of clusters)	Bandwidth (window size)
Speed	Fast (O(nKt) for t iterations)	Slow (O(n^2) per iteration)
Outlier handling	Assigns to nearest cluster	Forms tiny clusters for outliers
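The outlier row of the table can be checked with a small experiment. This is a sketch on made-up data: the blob centers, the outlier position, and the bandwidth of 2 are all assumptions chosen so that the outlier sits well outside every window.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

# Three compact blobs plus one far-away point (all coordinates are made up)
blob_centers = [[0, 0], [6, 6], [-6, 6]]
X, _ = make_blobs(n_samples=300, centers=blob_centers, cluster_std=0.8, random_state=0)
X_out = np.vstack([X, [[0.0, -8.0]]])  # the last row is the outlier

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_out)
ms_labels = MeanShift(bandwidth=2).fit_predict(X_out)

# Size of the cluster that contains the outlier under each method
km_size = int(np.sum(km_labels == km_labels[-1]))
ms_size = int(np.sum(ms_labels == ms_labels[-1]))
print(f"K-Means: outlier shares a cluster with {km_size - 1} other points")
print(f"Mean Shift: outlier's cluster has {ms_size} point(s)")
```

K-Means has to place the outlier in one of its K clusters, while Mean Shift leaves it as its own tiny cluster because no other point falls inside its window.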

When to Use Mean Shift

Good For	Not Ideal For
Unknown number of clusters	Large datasets (slow, O(n^2))
Non-spherical cluster shapes	High-dimensional data
Image segmentation, object tracking	When speed is critical
Small to medium datasets	Very different cluster densities

Mean Shift is computationally expensive (O(n^2) per iteration). For large datasets, use K-Means or DBSCAN instead. The bandwidth parameter heavily influences results; use estimate_bandwidth() as a starting point.
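If Mean Shift is still the right fit for a somewhat larger dataset, scikit-learn offers two options that may reduce the cost: bin_seeding=True on MeanShift starts from a coarse grid of seeds instead of one seed per point, and n_samples on estimate_bandwidth estimates the bandwidth from a subsample. The sketch below shows the calls only; how much they help depends on the data, and the sample sizes here are arbitrary.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=0.8, random_state=42)

# Estimate the bandwidth from a 200-point subsample rather than all points
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=200)

# bin_seeding=True seeds windows from a grid of occupied bins, not every point
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)
print(f"{len(ms.cluster_centers_)} clusters from {len(X)} points")
```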

Tags: Unsupervised, Clustering, Density-based, Non-parametric