
Isolation Forest

Detect anomalies by isolating outliers with random partitioning. Anomalies are "few and different" — they're easier to isolate.

Core Idea

Normal points are clustered together and need many splits to isolate. Anomalies are far from the cluster and can be isolated with very few splits. Isolation Forest builds random trees and measures how quickly each point gets isolated — shorter paths = more anomalous.

Traditional methods model what's "normal" and flag whatever doesn't fit. Isolation Forest inverts this: it directly finds what is easy to isolate, which is cheaper because it never has to build a distance or density model of the data.

How It Works

Algorithm Steps
  1. Build trees: For each tree, randomly select a feature and a random split value within the feature's range
  2. Split recursively until each point is isolated (alone in its partition) or max depth is reached
  3. Measure path length: Count how many splits it took to isolate each point
  4. Average across trees: Points with short average path lengths are anomalies
  5. Compute anomaly score: Normalize path lengths to a score between 0 and 1
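The steps above can be sketched from scratch. This is a toy illustration (not scikit-learn's actual implementation): it isolates a single point with random axis-aligned splits and reports the average path length over many random trees.

```python
import numpy as np

def isolation_path_length(x, X, rng, max_depth=20, depth=0):
    """Recursively partition X with random splits until x is alone."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])           # random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                 # constant feature, cannot split
        return depth
    split = rng.uniform(lo, hi)                  # random split value in range
    side = X[:, feature] < split
    subset = X[side] if x[feature] < split else X[~side]
    return isolation_path_length(x, subset, rng, max_depth, depth + 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # cluster + one outlier

# Average path length over 50 random trees
inlier_depth = np.mean([isolation_path_length(X[0], X, rng) for _ in range(50)])
outlier_depth = np.mean([isolation_path_length(X[-1], X, rng) for _ in range(50)])
print(f"inlier avg depth:  {inlier_depth:.1f}")
print(f"outlier avg depth: {outlier_depth:.1f}")  # expect a much shorter path
```

The outlier at (6, 6) sits outside the cluster, so a single lucky split often separates it; points inside the cluster survive many more splits.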
Anomaly Score

  s(x, n) = 2^(-E(h(x)) / c(n))

Where:
  h(x)    = path length of point x in one tree
  E(h(x)) = expected path length (mean across all trees)
  c(n)    = average path length in a binary tree built on n samples

  Score ≈ 1   → Anomaly (short path, easy to isolate)
  Score ≈ 0.5 → Normal (average path length)
  Score ≈ 0   → Very normal (deep in the cluster)
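Plugging numbers into this formula makes the score ranges concrete. A minimal sketch, using the standard approximation c(n) ≈ 2(ln(n-1) + γ) - 2(n-1)/n from the original paper:

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful search in a binary tree of n samples."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(n-1) ≈ ln(n-1) + Euler's constant
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    return 2 ** (-avg_path_length / c(n))

n = 256
print(f"{anomaly_score(c(n), n):.2f}")  # average path        -> 0.50
print(f"{anomaly_score(3.0, n):.2f}")   # short path (3)      -> 0.82, anomalous
print(f"{anomaly_score(15.0, n):.2f}")  # long path (15)      -> 0.36, normal
```

By construction, a point whose average path length equals c(n) scores exactly 0.5, which is why 0.5 marks the "average" case.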

Code Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Generate data: normal cluster + outliers
np.random.seed(42)
X_normal = np.random.randn(300, 2)                   # Normal data
X_outliers = np.random.uniform(-4, 4, size=(20, 2))  # Outliers
X = np.vstack([X_normal, X_outliers])

# Train Isolation Forest
model = IsolationForest(
    n_estimators=100,    # Number of trees
    contamination=0.06,  # Expected fraction of outliers (~20/320)
    random_state=42,
)
predictions = model.fit_predict(X)   # 1 = normal, -1 = anomaly
scores = model.decision_function(X)  # Lower = more anomalous

# Results
n_anomalies = (predictions == -1).sum()
print(f"Detected {n_anomalies} anomalies out of {len(X)} points")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1],
            c='blue', s=20, label='Normal')
plt.scatter(X[predictions == -1, 0], X[predictions == -1, 1],
            c='red', s=50, marker='x', label='Anomaly')
plt.legend()
plt.title("Isolation Forest Anomaly Detection")
plt.show()
```
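A fitted forest can also score points it has never seen. A short sketch (reusing a similar synthetic setup) of scikit-learn's three scoring methods:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
model = IsolationForest(n_estimators=100, contamination=0.06, random_state=42)
model.fit(rng.randn(300, 2))  # train on the normal cluster

# New points: one inside the cluster, one far outside
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(model.predict(new_points))            # 1 = normal, -1 = anomaly
print(model.score_samples(new_points))      # raw scores; lower = more anomalous
print(model.decision_function(new_points))  # score_samples shifted by the threshold
```

In production this is the typical pattern: fit on historical data, then score each new observation as it arrives.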

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `n_estimators` | 100 | Number of trees. More = more stable results |
| `contamination` | `'auto'` | Expected proportion of outliers (0 to 0.5). Affects the decision threshold |
| `max_samples` | `'auto'` (min(256, n)) | Samples per tree. Smaller = faster, more randomness |
| `max_features` | 1.0 | Features per tree. Fewer = more diversity between trees |
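Among these, contamination only moves the flagging threshold; the underlying scores are unchanged. A quick sketch of its effect on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2), rng.uniform(-4, 4, size=(20, 2))])

# Same data, different contamination: roughly that fraction gets flagged
counts = []
for contamination in (0.01, 0.06, 0.15):
    model = IsolationForest(contamination=contamination, random_state=0).fit(X)
    n_flagged = int((model.predict(X) == -1).sum())
    counts.append(n_flagged)
    print(f"contamination={contamination:<4} -> {n_flagged} points flagged")
```

If you have no estimate of the true outlier fraction, leave contamination at `'auto'` and inspect the raw scores from `score_samples` instead.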

Real-World Applications

Isolation Forest vs Other Methods

| Method | Approach | Scalability |
|---|---|---|
| Isolation Forest | Isolation-based (how easy to separate) | Excellent (linear time) |
| One-Class SVM | Find boundary around normal data | Poor (quadratic) |
| LOF | Local density comparison | Moderate |
| DBSCAN | Density-based clustering | Good |
| Autoencoder | Reconstruction error | Good (needs GPU) |
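As a rough behavioral comparison, Isolation Forest and LOF can be run side by side on the same data. A sketch (synthetic data, parameters chosen arbitrarily):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(300, 2), rng.uniform(-4, 4, size=(20, 2))])

# Both return 1 = normal, -1 = anomaly
iso = IsolationForest(contamination=0.06, random_state=42).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06).fit_predict(X)

agreement = (iso == lof).mean()
print(f"IsolationForest flagged: {(iso == -1).sum()}")
print(f"LOF flagged:             {(lof == -1).sum()}")
print(f"Label agreement:         {agreement:.0%}")
```

On a single well-separated cluster like this, the two methods largely agree; they diverge more when anomalies are defined by local density rather than global position.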

Isolation Forest works best with low-dimensional data (fewer than ~20 features) and a clear separation between normal and anomalous points. For high-dimensional data, consider reducing dimensionality with PCA first.