
XGBoost

Extreme Gradient Boosting -- an optimized, regularized version of gradient boosting that is fast and highly accurate.

What is XGBoost?

XGBoost (Extreme Gradient Boosting) is a gradient boosting algorithm with several optimizations that make it faster and more accurate than standard gradient boosting. It builds decision trees sequentially, where each new tree corrects the errors of the previous ones.

XGBoost is among the most successful algorithms in Kaggle competitions for structured/tabular data. It combines the power of gradient boosting with regularization, parallelization, and smart handling of missing values.

How XGBoost Improves on Gradient Boosting

XGBoost Optimizations
  1. Regularization (L1 + L2) — Adds penalties on tree complexity to prevent overfitting
  2. Parallelization — Builds trees faster using multiple CPU cores
  3. Missing Value Handling — Learns the optimal way to handle missing data automatically
  4. Tree Pruning — Grows trees to max_depth, then prunes back splits whose gain falls below the gamma threshold
  5. Sparsity-aware — Efficient with sparse data (like TF-IDF matrices)
  6. Cache optimization — Optimized memory access patterns for speed

Regularization in XGBoost

Regularization penalizes overly complex models to improve generalization:

L1 (Lasso) - reg_alpha

Adds absolute value of weights as penalty. Forces some weights to exactly 0, effectively removing unimportant features. Creates a sparse model.

L2 (Ridge) - reg_lambda

Adds squared value of weights as penalty. Shrinks large weights but keeps all features. Distributes importance smoothly across features.

```
L1 (Lasso): penalty = alpha  * sum(|weights|)  -->  some weights become exactly 0 (feature selection)
L2 (Ridge): penalty = lambda * sum(weights^2)  -->  all weights shrink toward 0 but none reach exactly 0
```
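To make the two penalty formulas concrete, here is a small numeric check on a hypothetical weight vector (the weights and penalty strengths below are made up for illustration):

```python
import numpy as np

# Hypothetical leaf weights from a fitted tree
weights = np.array([0.0, -0.5, 2.0, 0.0, 1.5])
alpha, lam = 1.0, 1.0  # stand-ins for reg_alpha and reg_lambda

l1_penalty = alpha * np.sum(np.abs(weights))  # 0 + 0.5 + 2.0 + 0 + 1.5 = 4.0
l2_penalty = lam * np.sum(weights ** 2)       # 0 + 0.25 + 4.0 + 0 + 2.25 = 6.5

print(l1_penalty, l2_penalty)  # 4.0 6.5
```

Note how L2 punishes the large weight (2.0 contributes 4.0 to the penalty versus 2.0 under L1), which is why L2 shrinks big weights aggressively while L1 pushes small weights all the way to zero.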

Code: XGBoost with Regularization Comparison

```python
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Dataset
data = {
    "House Size": [500, 800, 1000, 1500, 1800, 2000, 2500, 3000, 3500, 4000],
    "Price": [2000000, 3000000, 4000000, 6000000, 7200000, 8000000,
              10000000, 12000000, 14000000, 16000000]
}
df = pd.DataFrame(data)
X = df[["House Size"]]
y = df["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model with L1 regularization
model_l1 = xgb.XGBRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3,
    reg_alpha=10, reg_lambda=0, random_state=42
)
model_l1.fit(X_train, y_train)
y_pred_l1 = model_l1.predict(X_test)
print("L1 MSE:", mean_squared_error(y_test, y_pred_l1))

# Model with L2 regularization
model_l2 = xgb.XGBRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3,
    reg_alpha=0, reg_lambda=10, random_state=42
)
model_l2.fit(X_train, y_train)
y_pred_l2 = model_l2.predict(X_test)
print("L2 MSE:", mean_squared_error(y_test, y_pred_l2))

# Model with no regularization (baseline): many rounds plus a large
# learning rate make overfitting more likely
model_none = xgb.XGBRegressor(
    n_estimators=1000, learning_rate=0.5, max_depth=3,
    reg_alpha=0, reg_lambda=0, random_state=42
)
model_none.fit(X_train, y_train)
y_pred_none = model_none.predict(X_test)
print("No Reg MSE:", mean_squared_error(y_test, y_pred_none))
```

Key Parameters

| Parameter | Description |
| --- | --- |
| `n_estimators` | Number of boosting rounds. Many rounds with a small learning rate is usually preferred |
| `learning_rate` (`eta`) | Step size shrinkage. Typical: 0.01-0.3 |
| `max_depth` | Maximum tree depth. Typical: 3-8 |
| `reg_alpha` | L1 regularization. Higher = more feature selection |
| `reg_lambda` | L2 regularization. Higher = smoother weights |
| `subsample` | Fraction of samples per tree (stochastic boosting). Typical: 0.6-0.9 |
| `colsample_bytree` | Fraction of features per tree. Typical: 0.6-0.9 |

When to Use XGBoost

| Good For | Not Ideal For |
| --- | --- |
| Structured/tabular data | Image, audio, or text (use deep learning) |
| Competition-level accuracy | Very small datasets |
| Datasets with missing values | When simple interpretability is needed |
| Both classification and regression | Streaming/online learning |

XGBoost with no regularization can overfit. Always consider reg_alpha and/or reg_lambda; reg_lambda=1 (the library default for L2) is a sensible baseline to tune from.
