Cross-Validation
Properly estimate model performance by training and testing on different subsets of your data.
Why Cross-Validation?
A single train-test split can be misleading. If your test set happens to be "easy", you'll overestimate performance. Cross-validation uses multiple splits to get a reliable, unbiased estimate of how your model will perform on unseen data.
A single 80/20 split gives you ONE accuracy number. 5-fold cross-validation gives you FIVE, plus a mean and standard deviation — much more trustworthy.
Types of Cross-Validation
K-Fold Cross-Validation
Data split into K equal folds:
Fold 1: [TEST] [Train] [Train] [Train] [Train] → Score 1
Fold 2: [Train] [TEST] [Train] [Train] [Train] → Score 2
Fold 3: [Train] [Train] [TEST] [Train] [Train] → Score 3
Fold 4: [Train] [Train] [Train] [TEST] [Train] → Score 4
Fold 5: [Train] [Train] [Train] [Train] [TEST] → Score 5
Final Score = Mean(Score 1..5) ± Std(Score 1..5)
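The rotation in the diagram above can be seen directly by printing the fold indices. A minimal sketch on a 10-sample toy array (unshuffled, so folds are contiguous blocks):

```python
from sklearn.model_selection import KFold
import numpy as np

X_toy = np.arange(10)          # ten samples, indices 0..9
kf = KFold(n_splits=5)         # shuffle=False: folds are contiguous blocks

for i, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # Each fold holds out a different 2-sample block as the test set
    print(f"Fold {i}: test={list(test_idx)}, train={list(train_idx)}")
```

In practice you rarely iterate the folds yourself; `cross_val_score` (shown below) does this loop internally.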
Stratified K-Fold
Same as K-Fold but ensures each fold has the same class distribution as the full dataset. Essential for imbalanced datasets where one class is rare.
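The class-preserving behavior can be verified by counting labels in each test fold. A small sketch with deliberately imbalanced toy labels (90% class 0, 10% class 1):

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))     # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Every 20-sample test fold keeps the 90/10 ratio: 18 vs 2
    print(np.bincount(y_toy[test_idx]))  # → [18  2]
```

With plain `KFold` on the same data, some folds could contain zero minority-class samples, making their scores meaningless.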
Leave-One-Out (LOO)
K = N (the number of samples): each sample serves as the test set exactly once. Very thorough but computationally expensive; best suited to very small datasets.
Time Series Split
For time-ordered data where future data can't be used to predict the past. Training set grows with each fold:
Fold 1: [Train] [Test] [----] [----] [----]
Fold 2: [Train] [Train] [Test] [----] [----]
Fold 3: [Train] [Train] [Train] [Test] [----]
Fold 4: [Train] [Train] [Train] [Train] [Test]
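The growing training window can be seen by printing the split indices. A minimal sketch with 12 time-ordered toy samples:

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X_toy = np.arange(12).reshape(-1, 1)   # 12 samples in time order
tscv = TimeSeriesSplit(n_splits=4)

for i, (train_idx, test_idx) in enumerate(tscv.split(X_toy), start=1):
    # Training always covers a prefix; the test block always comes after it
    print(f"Fold {i}: train=0..{train_idx[-1]}, test={list(test_idx)}")
```

Note that, unlike K-Fold, every test index is strictly later than every training index, so the model never "peeks" at the future.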
Code Implementation
from sklearn.model_selection import (cross_val_score, KFold, StratifiedKFold,
LeaveOneOut, TimeSeriesSplit)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# --- K-Fold (5 folds) ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold: {scores.mean():.4f} ± {scores.std():.4f}")
print(f" Per fold: {scores}")
# --- Stratified K-Fold (preserves class distribution) ---
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified: {scores.mean():.4f} ± {scores.std():.4f}")
# --- Leave-One-Out (one fold per sample; slow on larger datasets) ---
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"LOO: {scores.mean():.4f} ({len(scores)} folds)")
# --- Time Series Split (API demo only: iris is not time-ordered and its
#     rows are sorted by class, so expect poor scores on this dataset) ---
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring='accuracy')
print(f"TimeSeries: {scores.mean():.4f} ± {scores.std():.4f}")
# --- Shorthand (just pass k; for classifiers, an integer cv
#     defaults to StratifiedKFold) ---
scores = cross_val_score(model, X, y, cv=10)  # 10-fold
print(f"10-Fold: {scores.mean():.4f} ± {scores.std():.4f}")
Which to Use?
| Method | When to Use | K Value |
| --- | --- | --- |
| K-Fold | General purpose, balanced classes | 5 or 10 |
| Stratified K-Fold | Imbalanced classes (always prefer this) | 5 or 10 |
| Leave-One-Out | Very small datasets (<50 samples) | N |
| Repeated K-Fold | Need very stable estimates | 5×10 repeats |
| Time Series Split | Time-ordered data (stock prices, weather) | 5-10 |
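Repeated K-Fold appears in the table but not in the code section above. A minimal sketch using `RepeatedStratifiedKFold`, reusing the iris model from earlier:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5 folds x 10 repeats = 50 scores; each repeat reshuffles the data,
# which averages out the luck of any single partitioning
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rkf, scoring='accuracy')
print(f"Repeated 5x10: {scores.mean():.4f} ± {scores.std():.4f} "
      f"({len(scores)} scores)")
```

The cost is 10x the training time of a single 5-fold run, so this is best reserved for small-to-medium datasets where the estimate's stability matters.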
Cross-Validation with Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None]
}
# GridSearchCV uses cross-validation internally
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=StratifiedKFold(n_splits=5),
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
Never use cross-validation scores as final test scores. Always keep a completely separate holdout test set that is never used during training or hyperparameter tuning.
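One way to follow that rule, sketched with `train_test_split` (the 0.2 holdout fraction and the small grid are arbitrary choices for illustration):

```python
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     StratifiedKFold)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Carve off the holdout set FIRST; it never participates in tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100], 'max_depth': [3, None]},
    cv=StratifiedKFold(n_splits=5),
)
grid.fit(X_train, y_train)   # cross-validation runs only inside X_train

# Report the holdout score, not best_score_, as the final number
print(f"CV score:      {grid.best_score_:.4f}")
print(f"Holdout score: {grid.score(X_test, y_test):.4f}")
```

The CV score guided the hyperparameter choice, so it is optimistically biased; the holdout score is the honest estimate.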