
DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape and automatically detects outliers.

What It Is

DBSCAN groups together points that are tightly packed (dense regions) and labels points in low-density areas as noise (outliers). Unlike K-Means, it does not need the number of clusters specified upfront and can find clusters of any shape.

DBSCAN has two key advantages: it automatically determines the number of clusters and it identifies outliers as noise points (label = -1).
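To make the contrast with K-Means concrete, here is a minimal sketch on two concentric rings, a shape that is not in the original notebook and is chosen purely for illustration. The helper names `ring` and `separates_rings`, and the values `eps=0.5` and `min_samples=3`, are assumptions picked to suit this synthetic data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def ring(radius, n_points=100):
    """Evenly spaced points on a circle (hypothetical helper for this demo)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# Inner ring (radius 1) stacked on top of outer ring (radius 5)
X = np.vstack([ring(1.0), ring(5.0)])

def separates_rings(labels):
    """True if each ring gets exactly one label and the two labels differ."""
    inner, outer = labels[:100], labels[100:]
    return (
        len(set(inner.tolist())) == 1
        and len(set(outer.tolist())) == 1
        and inner[0] != outer[0]
    )

# DBSCAN is not told how many clusters to find
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# K-Means must be given the number of clusters up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN separates the rings: ", separates_rings(db_labels))
print("K-Means separates the rings:", separates_rings(km_labels))
```

With two centroids, K-Means divides the plane along a straight boundary, so it cannot put the whole outer ring in one cluster and the whole inner ring in the other. DBSCAN follows the chain of dense neighbors around each ring instead.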

Key Parameters

epsilon (eps)

The radius of the neighborhood around each point. Points within this radius are considered neighbors.

min_samples (MinPts)

Minimum number of points required within the eps radius for a point to count as a core point, that is, to form a dense region. In scikit-learn, this count includes the point itself.
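As a rough illustration of how the two parameters interact, the sketch below reruns scikit-learn's `DBSCAN` on a small six-point array with several settings. The helper `summarize` and the specific eps values are assumptions chosen for this demo, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Six points: two tight groups and one distant outlier
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

def summarize(eps, min_samples=2):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    label_set = set(labels.tolist())
    n_clusters = len(label_set - {-1})   # -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {eps: summarize(eps) for eps in (0.5, 2, 10, 100)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)")

# Raising min_samples makes the core-point test stricter at the same eps
print("eps=2, min_samples=3:", summarize(2, min_samples=3))
```

On this data, eps=0.5 leaves every point as noise, eps=2 gives two clusters plus one outlier, eps=10 merges the two groups, and eps=100 absorbs the outlier as well. With eps=2 and min_samples=3, the two-point group no longer qualifies as dense and becomes noise.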

Point Types

| Type | Definition |
|---|---|
| Core point | Has at least MinPts points within its eps radius |
| Border point | Within eps of a core point, but has fewer than MinPts points in its own neighborhood |
| Noise point | Neither a core point nor within eps of any core point. Labeled as -1 |
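The three point types can be recovered from a fitted scikit-learn model, which exposes the core points through `core_sample_indices_` and the cluster labels through `labels_`. The sketch below uses a made-up line of points so that each type appears at least once. The data and the settings `eps=1.1`, `min_samples=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points in a row plus one far-away point (illustrative data)
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 10]], dtype=float)

db = DBSCAN(eps=1.1, min_samples=3).fit(X)

# Core points are listed explicitly by the fitted model
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points carry the label -1
is_noise = db.labels_ == -1

# Everything that is clustered but not core is a border point
point_types = np.where(is_core, "core", np.where(is_noise, "noise", "border"))

for point, kind in zip(X.tolist(), point_types.tolist()):
    print(point, kind)
```

Here the two middle points each have three points (including themselves) within eps and are core. The two end points have only two, so they join the cluster as border points, and the distant point is noise.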

How It Works

Algorithm Steps
  1. Pick an arbitrary unvisited point
  2. Find all neighbors within radius eps
  3. If neighbors ≥ MinPts, mark it as a core point and start a new cluster
  4. Expand the cluster by visiting the neighbors of each core point in turn; non-core neighbors join the cluster as border points but are not expanded further
  5. If neighbors < MinPts, mark it as noise (it may later become a border point if it lies within eps of a core point)
  6. Repeat until all points are visited
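The steps above can be sketched in plain Python as follows. This is a teaching sketch under stated assumptions, not scikit-learn's implementation: it scans points in index order rather than picking at random, uses a brute-force neighbor search, and counts a point as its own neighbor. The names `region_query` and `dbscan` are hypothetical.

```python
from math import dist  # Euclidean distance (Python 3.8+)

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):              # step 1: next unvisited point
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)   # step 2
        if len(neighbors) < min_pts:
            labels[i] = NOISE                 # step 5: provisional noise
            continue
        cluster_id += 1                       # step 3: core point, new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:                          # step 4: expand the cluster
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # earlier noise becomes a border point
                continue
            if labels[j] is not UNVISITED:
                continue                      # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # only core points keep expanding
                queue.extend(j_neighbors)
    return labels                             # step 6: every point visited

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
print(dbscan(points, eps=2, min_pts=2))
```

The brute-force `region_query` makes the whole run quadratic in the number of points, which is fine for a demo. Practical implementations typically replace it with a spatial index.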

Code: DBSCAN Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Example data: coordinates of points
X = np.array([
    [1, 2], [2, 2], [2, 3],
    [8, 7], [8, 8],
    [25, 80]  # last one is far away
])

# Run DBSCAN
db = DBSCAN(eps=2, min_samples=2).fit(X)
labels = db.labels_

# Plot clusters, colored by label
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="rainbow", s=100)
plt.title("DBSCAN Clustering Example")
plt.show()

print(labels)
# Output: [ 0  0  0  1  1 -1]
# Cluster 0: first three points
# Cluster 1: points [8,7] and [8,8]
# Noise (-1): point [25,80] is too far from everything

Understanding the Output

When to Use DBSCAN

| Good For | Not Ideal For |
|---|---|
| Clusters of arbitrary shape (non-spherical) | Clusters with very different densities |
| Automatic outlier detection | High-dimensional data (distances between points become less informative) |
| Unknown number of clusters | When you need every point assigned to a cluster |
| Geospatial data, anomaly detection | Very large datasets without spatial indexing |

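Choosing eps and min_samples is critical. A common approach is a k-distance plot: compute each point's distance to its k-th nearest neighbor (with k typically set to min_samples), sort those distances, and pick eps near the "elbow" where the curve bends sharply upward. An eps that is too large merges everything into one giant cluster; an eps that is too small marks most points as noise.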
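A minimal sketch of the k-distance computation, using scikit-learn's `NearestNeighbors` on a small six-point array. Setting `k` equal to the intended min_samples is an assumption of this sketch, and the plotting step is left out so the snippet runs without a display.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
k = 2  # assumed to match the min_samples you plan to use

# Querying the fitted data returns each point itself at distance 0 in
# column 0, so the last column is the distance to the k-th closest point
# counting the point itself (the same convention as min_samples).
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted k-distances: this is the curve you would plot to look for the elbow
k_distances = np.sort(distances[:, -1])
print(np.round(k_distances, 2))
```

For this data the first five values are 1.0 and the last is roughly 73.98, so the jump is obvious and any eps comfortably above 1 and well below 74 isolates the outlier. On real data the curve is smoother and reading the elbow is a judgment call; passing `k_distances` to `plt.plot` gives the usual visual.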
