
Isolation Forest

Detect anomalies by isolating outliers with random partitioning. Anomalies are "few and different" — they're easier to isolate.

Core Idea

Normal points are clustered together and need many splits to isolate. Anomalies are far from the cluster and can be isolated with very few splits. Isolation Forest builds random trees and measures how quickly each point gets isolated — shorter paths = more anomalous.

Traditional methods model what's "normal" and flag whatever doesn't fit. Isolation Forest inverts this: it directly finds what is easy to isolate, which is cheaper because it never has to build a distance or density model of the data.

How It Works

Algorithm Steps
  1. Build trees: For each tree, randomly select a feature and a random split value within the feature's range
  2. Split recursively until each point is isolated (alone in its partition) or max depth is reached
  3. Measure path length: Count how many splits it took to isolate each point
  4. Average across trees: Points with short average path lengths are anomalies
  5. Compute anomaly score: Normalize path lengths to a score between 0 and 1
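The steps above can be sketched from scratch. This is a toy illustration (not scikit-learn's actual implementation): it isolates a single point with random axis-aligned splits and reports the average path length over many random trees.

```python
import numpy as np

def isolation_path_length(x, X, rng, max_depth=20, depth=0):
    """Recursively partition X with random splits until x is alone."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])           # random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                 # constant feature, cannot split
        return depth
    split = rng.uniform(lo, hi)                  # random split value in range
    side = X[:, feature] < split
    subset = X[side] if x[feature] < split else X[~side]
    return isolation_path_length(x, subset, rng, max_depth, depth + 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # cluster + one outlier

# Average path length over 50 random trees
inlier_depth = np.mean([isolation_path_length(X[0], X, rng) for _ in range(50)])
outlier_depth = np.mean([isolation_path_length(X[-1], X, rng) for _ in range(50)])
print(f"inlier avg depth:  {inlier_depth:.1f}")
print(f"outlier avg depth: {outlier_depth:.1f}")  # expect a much shorter path
```

The outlier at (6, 6) sits outside the cluster, so a single lucky split often separates it; points inside the cluster survive many more splits.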
Anomaly Score

  s(x, n) = 2^(-E(h(x)) / c(n))

Where:
  h(x)    = path length of point x in one tree
  E(h(x)) = expected path length (mean across all trees)
  c(n)    = average path length in a binary tree built on n samples

  Score ≈ 1   → Anomaly (short path, easy to isolate)
  Score ≈ 0.5 → Normal (average path length)
  Score ≈ 0   → Very normal (deep in the cluster)
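Plugging numbers into this formula makes the score ranges concrete. A minimal sketch, using the standard approximation c(n) ≈ 2(ln(n-1) + γ) - 2(n-1)/n from the original paper:

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful search in a binary tree of n samples."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(n-1) ≈ ln(n-1) + Euler's constant
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    return 2 ** (-avg_path_length / c(n))

n = 256
print(f"{anomaly_score(c(n), n):.2f}")  # average path        -> 0.50
print(f"{anomaly_score(3.0, n):.2f}")   # short path (3)      -> 0.82, anomalous
print(f"{anomaly_score(15.0, n):.2f}")  # long path (15)      -> 0.36, normal
```

By construction, a point whose average path length equals c(n) scores exactly 0.5, which is why 0.5 marks the "average" case.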

Code Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Generate data: normal cluster + outliers
np.random.seed(42)
X_normal = np.random.randn(300, 2)                   # Normal data
X_outliers = np.random.uniform(-4, 4, size=(20, 2))  # Outliers
X = np.vstack([X_normal, X_outliers])

# Train Isolation Forest
model = IsolationForest(
    n_estimators=100,    # Number of trees
    contamination=0.06,  # Expected fraction of outliers (~20/320)
    random_state=42,
)
predictions = model.fit_predict(X)   # 1 = normal, -1 = anomaly
scores = model.decision_function(X)  # Lower = more anomalous

# Results
n_anomalies = (predictions == -1).sum()
print(f"Detected {n_anomalies} anomalies out of {len(X)} points")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1],
            c='blue', s=20, label='Normal')
plt.scatter(X[predictions == -1, 0], X[predictions == -1, 1],
            c='red', s=50, marker='x', label='Anomaly')
plt.legend()
plt.title("Isolation Forest Anomaly Detection")
plt.show()
```
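A fitted forest can also score points it has never seen. A short sketch (reusing a similar synthetic setup) of scikit-learn's three scoring methods:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
model = IsolationForest(n_estimators=100, contamination=0.06, random_state=42)
model.fit(rng.randn(300, 2))  # train on the normal cluster

# New points: one inside the cluster, one far outside
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(model.predict(new_points))            # 1 = normal, -1 = anomaly
print(model.score_samples(new_points))      # raw scores; lower = more anomalous
print(model.decision_function(new_points))  # score_samples shifted by the threshold
```

In production this is the typical pattern: fit on historical data, then score each new observation as it arrives.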

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `n_estimators` | 100 | Number of trees. More = more stable results |
| `contamination` | `'auto'` | Expected proportion of outliers (0 to 0.5). Affects the decision threshold |
| `max_samples` | `'auto'` (min(256, n)) | Samples per tree. Smaller = faster, more randomness |
| `max_features` | 1.0 | Features per tree. Fewer = more diversity between trees |
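Among these, contamination only moves the flagging threshold; the underlying scores are unchanged. A quick sketch of its effect on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2), rng.uniform(-4, 4, size=(20, 2))])

# Same data, different contamination: roughly that fraction gets flagged
counts = []
for contamination in (0.01, 0.06, 0.15):
    model = IsolationForest(contamination=contamination, random_state=0).fit(X)
    n_flagged = int((model.predict(X) == -1).sum())
    counts.append(n_flagged)
    print(f"contamination={contamination:<4} -> {n_flagged} points flagged")
```

If you have no estimate of the true outlier fraction, leave contamination at `'auto'` and inspect the raw scores from `score_samples` instead.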

Real-World Applications

Isolation Forest vs Other Methods

| Method | Approach | Scalability |
|---|---|---|
| Isolation Forest | Isolation-based (how easy to separate) | Excellent (linear time) |
| One-Class SVM | Find boundary around normal data | Poor (quadratic) |
| LOF | Local density comparison | Moderate |
| DBSCAN | Density-based clustering | Good |
| Autoencoder | Reconstruction error | Good (needs GPU) |
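As a rough behavioral comparison, Isolation Forest and LOF can be run side by side on the same data. A sketch (synthetic data, parameters chosen arbitrarily):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(300, 2), rng.uniform(-4, 4, size=(20, 2))])

# Both return 1 = normal, -1 = anomaly
iso = IsolationForest(contamination=0.06, random_state=42).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06).fit_predict(X)

agreement = (iso == lof).mean()
print(f"IsolationForest flagged: {(iso == -1).sum()}")
print(f"LOF flagged:             {(lof == -1).sum()}")
print(f"Label agreement:         {agreement:.0%}")
```

On a single well-separated cluster like this, the two methods largely agree; they diverge more when anomalies are defined by local density rather than global position.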

Isolation Forest works best with low-dimensional data (fewer than ~20 features) and a clear separation between normal and anomalous points. For high-dimensional data, consider reducing dimensionality with PCA first.