
DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape and automatically detects outliers.

What It Is

DBSCAN groups together points that are tightly packed (dense regions) and labels points in low-density areas as noise (outliers). Unlike K-Means, it does not need the number of clusters specified upfront and can find clusters of any shape.
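To make the contrast with K-Means concrete, here is a minimal sketch on scikit-learn's synthetic two-moons dataset. The dataset, the eps and min_samples values, and the use of the adjusted Rand index as a score are all illustrative assumptions, not recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a simple non-spherical shape (synthetic data).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means must be told the number of clusters; DBSCAN is not.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # assumed parameters

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

On data like this, K-Means tends to cut each moon in half with a straight boundary, while DBSCAN can follow the curved dense regions.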

DBSCAN has two key advantages: it automatically determines the number of clusters and it identifies outliers as noise points (label = -1).

Key Parameters

epsilon (eps)

The radius of the neighborhood around each point. Points within this radius are considered neighbors.

min_samples (MinPts)

Minimum number of points required within the eps radius for a point to count as a core point, that is, to form a dense region. In scikit-learn this count includes the point itself.
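The two parameters interact through a simple count: how many points fall inside each point's eps-ball. The sketch below does that count with plain NumPy on six illustrative points; the values of eps and min_samples are assumptions chosen for the example.

```python
import numpy as np

# Six illustrative 2-D points; the last one is far from the rest.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
eps = 2.0          # neighborhood radius (assumed value)
min_samples = 2    # density threshold (assumed value)

# Pairwise Euclidean distances, shape (n_points, n_points).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Count the points inside each eps-ball. The diagonal (distance 0) means
# every point counts itself, matching scikit-learn's convention.
neighbor_counts = (dists <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

print(neighbor_counts)  # [3 3 3 2 2 1]
print(is_core)          # [ True  True  True  True  True False]
```

Raising min_samples to 3 here would leave only the first three points as core points, which shows how the threshold controls what counts as "dense".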

Point Types

Type	Definition
Core point	Has at least MinPts neighbors within eps radius
Border point	Within eps of a core point, but has fewer than MinPts neighbors
Noise point	Neither a core point nor within eps of any core point. Labeled as -1
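The three point types can be recovered from a fitted scikit-learn model: core points are listed in `core_sample_indices_`, noise points have label -1, and everything else is a border point. A minimal sketch, with made-up data and parameters chosen so that each type appears:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a tight square, one point on its fringe, one far away.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.4, 1], [10, 10]], dtype=float)
db = DBSCAN(eps=1.5, min_samples=4).fit(X)

# Core points are reported directly by the fitted estimator.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise has label -1; a border point is in a cluster but is not core.
noise_mask = db.labels_ == -1
point_type = np.where(core_mask, "core", np.where(noise_mask, "noise", "border"))

for point, kind in zip(X, point_type):
    print(point, kind)
```

Here the four corners of the square are core points, [2.4, 1] is a border point (it is within eps of the core point [1, 1] but has only two points in its own neighborhood), and [10, 10] is noise.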

How It Works

Algorithm Steps

Pick an arbitrary unvisited point
Find all of its neighbors within radius eps
If it has at least MinPts neighbors, mark it as a core point and start a new cluster
Expand the cluster: add every neighbor, and keep visiting the neighbors of any neighbor that is itself a core point
If it has fewer than MinPts neighbors, mark it as noise for now (it becomes a border point if a core point later reaches it)
Repeat until all points are visited
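The steps above can be sketched in a few lines of NumPy. This is a teaching sketch under simplifying assumptions, not scikit-learn's implementation: it visits points in index order rather than at random, builds a full distance matrix (so memory grows quadratically with the number of points), and the function name `dbscan_sketch` is hypothetical.

```python
from collections import deque

import numpy as np

NOISE = -1
UNVISITED = -2


def dbscan_sketch(X, eps, min_samples):
    """Return one label per row of X; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Each neighborhood includes the point itself (distance 0).
    neighbors = [np.flatnonzero(row <= eps) for row in dists]

    labels = np.full(len(X), UNVISITED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        # i is a core point: start a new cluster and grow it breadth-first.
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reached from a core point: border
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is core too: keep expanding
        cluster_id += 1
    return labels


X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = dbscan_sketch(X, eps=2, min_samples=2)
print(labels)  # [ 0  0  0  1  1 -1]
```

Using a queue instead of recursion keeps the expansion step from hitting Python's recursion limit on large clusters; the visiting order changes, but every point reachable from a core point still ends up in that core point's cluster.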

Code: DBSCAN Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Example data: coordinates of points
X = np.array([
    [1, 2], [2, 2], [2, 3],
    [8, 7], [8, 8],
    [25, 80]  # last one is far away
])

# Run DBSCAN
db = DBSCAN(eps=2, min_samples=2).fit(X)
labels = db.labels_

# Plot clusters, colored by label
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="rainbow", s=100)
plt.title("DBSCAN Clustering Example")
plt.show()

print(labels)
# Output: [ 0  0  0  1  1 -1]
# Cluster 0: first three points
# Cluster 1: points [8, 7] and [8, 8]
# Noise (-1): point [25, 80] is too far from everything

Understanding the Output

Points with the same label belong to the same cluster
Label -1 means noise (outlier)
In the example above, [25, 80] is labeled -1 because it has no neighbors within eps=2, so it is neither a core point nor within reach of one
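A label array like this is usually summarized before anything else is done with it. The sketch below counts clusters, noise points, and cluster sizes, starting from the labels printed in the example above; the variable names are illustrative.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, -1])  # labels from the example above

# -1 is the noise label, so it is excluded from the cluster count.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
cluster_sizes = {
    int(k): int(np.sum(labels == k)) for k in np.unique(labels) if k != -1
}

print(n_clusters)     # 2
print(n_noise)        # 1
print(cluster_sizes)  # {0: 3, 1: 2}
```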

When to Use DBSCAN

Good For	Not Ideal For
Clusters of arbitrary shape (non-spherical)	Clusters with very different densities
Automatic outlier detection	High-dimensional data (distances between points become nearly uniform, so a single eps separates poorly)
Unknown number of clusters	When you need every point assigned to a cluster
Geospatial data, anomaly detection	Very large datasets without spatial indexing

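Choosing eps and min_samples is critical. A common approach is a k-distance plot: compute each point's distance to its k-th nearest neighbor (with k typically tied to min_samples), sort those distances, and pick eps near the "elbow" where the curve bends sharply upward. An eps that is too large merges everything into one giant cluster; an eps that is too small labels most points as noise.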
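A k-distance curve can be computed with scikit-learn's `NearestNeighbors`. The sketch below is one way to do it, assuming synthetic blob data and k = 5; the helper names `k_distances` and `plot_k_distances` are hypothetical, and reading the elbow off the plot is still a judgment call.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Querying with the training data returns each point as its own nearest
    # neighbor (distance 0), so the count includes the point itself, in line
    # with how scikit-learn counts min_samples.
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


def plot_k_distances(kd, k):
    """Plot the sorted k-distances; look for the sharp upward bend."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    plt.plot(kd)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}-th nearest neighbor")
    plt.title("k-distance plot")
    plt.show()


# Synthetic data for illustration only.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
kd = k_distances(X, k=5)
print(kd[:3], kd[-3:])  # smallest and largest k-distances

# plot_k_distances(kd, k=5)  # uncomment to view the curve
```

Most of the curve is flat (points inside dense regions); the distance at which it turns upward is a reasonable first candidate for eps, to be checked against the resulting clusters.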
