Cross-Validation
Properly estimate model performance by training and testing on different subsets of your data.
Why Cross-Validation?
A single train-test split can be misleading. If your test set happens to be "easy", you'll overestimate performance. Cross-validation uses multiple splits to get a reliable, unbiased estimate of how your model will perform on unseen data.
A single 80/20 split gives you ONE accuracy number. 5-fold cross-validation gives you FIVE, plus a mean and standard deviation — much more trustworthy.
Types of Cross-Validation
K-Fold Cross-Validation
Data split into K equal folds:
Fold 1: [TEST] [Train] [Train] [Train] [Train] → Score 1
Fold 2: [Train] [TEST] [Train] [Train] [Train] → Score 2
Fold 3: [Train] [Train] [TEST] [Train] [Train] → Score 3
Fold 4: [Train] [Train] [Train] [TEST] [Train] → Score 4
Fold 5: [Train] [Train] [Train] [Train] [TEST] → Score 5
Final Score = Mean(Score 1..5) ± Std(Score 1..5)
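The rotation in the diagram above can be seen directly by printing the fold indices. A minimal sketch on a 10-sample toy array (unshuffled, so folds are contiguous blocks):

```python
from sklearn.model_selection import KFold
import numpy as np

X_toy = np.arange(10)          # ten samples, indices 0..9
kf = KFold(n_splits=5)         # shuffle=False: folds are contiguous blocks

for i, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # Each fold holds out a different 2-sample block as the test set
    print(f"Fold {i}: test={list(test_idx)}, train={list(train_idx)}")
```

In practice you rarely iterate the folds yourself; `cross_val_score` (shown below) does this loop internally.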
Stratified K-Fold
Same as K-Fold but ensures each fold has the same class distribution as the full dataset. Essential for imbalanced datasets where one class is rare.
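The class-preserving behavior can be verified by counting labels in each test fold. A small sketch with deliberately imbalanced toy labels (90% class 0, 10% class 1):

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))     # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Every 20-sample test fold keeps the 90/10 ratio: 18 vs 2
    print(np.bincount(y_toy[test_idx]))  # → [18  2]
```

With plain `KFold` on the same data, some folds could contain zero minority-class samples, making their scores meaningless.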
Leave-One-Out (LOO)
K = N (the number of samples): each sample serves as the test set exactly once. Very thorough but computationally expensive; best suited to very small datasets.
Time Series Split
For time-ordered data where future data can't be used to predict the past. Training set grows with each fold:
Fold 1: [Train] [Test] [----] [----] [----]
Fold 2: [Train] [Train] [Test] [----] [----]
Fold 3: [Train] [Train] [Train] [Test] [----]
Fold 4: [Train] [Train] [Train] [Train] [Test]
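The growing training window can be seen by printing the split indices. A minimal sketch with 12 time-ordered toy samples:

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X_toy = np.arange(12).reshape(-1, 1)   # 12 samples in time order
tscv = TimeSeriesSplit(n_splits=4)

for i, (train_idx, test_idx) in enumerate(tscv.split(X_toy), start=1):
    # Training always covers a prefix; the test block always comes after it
    print(f"Fold {i}: train=0..{train_idx[-1]}, test={list(test_idx)}")
```

Note that, unlike K-Fold, every test index is strictly later than every training index, so the model never "peeks" at the future.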
Code Implementation
from sklearn.model_selection import (cross_val_score, KFold, StratifiedKFold,
LeaveOneOut, TimeSeriesSplit)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# --- K-Fold (5 folds) ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold: {scores.mean():.4f} ± {scores.std():.4f}")
print(f" Per fold: {scores}")
# --- Stratified K-Fold (preserves class distribution) ---
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified: {scores.mean():.4f} ± {scores.std():.4f}")
# --- Leave-One-Out (one fold per sample; slow on larger datasets) ---
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"LOO: {scores.mean():.4f} ({len(scores)} folds)")
# --- Time Series Split (API demo only: iris is not time-ordered and its
#     rows are sorted by class, so expect poor scores on this dataset) ---
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring='accuracy')
print(f"TimeSeries: {scores.mean():.4f} ± {scores.std():.4f}")
# --- Shorthand (just pass k; for classifiers, an integer cv
#     defaults to StratifiedKFold) ---
scores = cross_val_score(model, X, y, cv=10)  # 10-fold
print(f"10-Fold: {scores.mean():.4f} ± {scores.std():.4f}")
Which to Use?
| Method | When to Use | K Value |
| --- | --- | --- |
| K-Fold | General purpose, balanced classes | 5 or 10 |
| Stratified K-Fold | Imbalanced classes (always prefer this) | 5 or 10 |
| Leave-One-Out | Very small datasets (<50 samples) | N |
| Repeated K-Fold | Need very stable estimates | 5×10 repeats |
| Time Series Split | Time-ordered data (stock prices, weather) | 5-10 |
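Repeated K-Fold appears in the table but not in the code section above. A minimal sketch using `RepeatedStratifiedKFold`, reusing the iris model from earlier:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5 folds x 10 repeats = 50 scores; each repeat reshuffles the data,
# which averages out the luck of any single partitioning
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rkf, scoring='accuracy')
print(f"Repeated 5x10: {scores.mean():.4f} ± {scores.std():.4f} "
      f"({len(scores)} scores)")
```

The cost is 10x the training time of a single 5-fold run, so this is best reserved for small-to-medium datasets where the estimate's stability matters.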
Cross-Validation with Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None]
}
# GridSearchCV uses cross-validation internally
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=StratifiedKFold(n_splits=5),
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
Never use cross-validation scores as final test scores. Always keep a completely separate holdout test set that is never used during training or hyperparameter tuning.
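One way to follow that rule, sketched with `train_test_split` (the 0.2 holdout fraction and the small grid are arbitrary choices for illustration):

```python
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     StratifiedKFold)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Carve off the holdout set FIRST; it never participates in tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100], 'max_depth': [3, None]},
    cv=StratifiedKFold(n_splits=5),
)
grid.fit(X_train, y_train)   # cross-validation runs only inside X_train

# Report the holdout score, not best_score_, as the final number
print(f"CV score:      {grid.best_score_:.4f}")
print(f"Holdout score: {grid.score(X_test, y_test):.4f}")
```

The CV score guided the hyperparameter choice, so it is optimistically biased; the holdout score is the honest estimate.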