XGBoost
Extreme Gradient Boosting -- an optimized, regularized version of gradient boosting that is fast and highly accurate.
What is XGBoost?
XGBoost (Extreme Gradient Boosting) is a gradient boosting algorithm with several engineering and algorithmic optimizations that make it faster and typically more accurate than standard gradient boosting. It builds decision trees sequentially, where each new tree corrects the residual errors of the trees before it.
XGBoost is one of the most consistently winning algorithms in Kaggle competitions on structured/tabular data. It combines gradient boosting with built-in regularization, parallelized split finding, and automatic handling of missing values.
How XGBoost Improves on Gradient Boosting
XGBoost Optimizations
- Regularization (L1 + L2) — Adds penalties on leaf weights and tree complexity to prevent overfitting
- Parallelization — Parallelizes split finding across CPU cores (the trees themselves are still built sequentially)
- Missing Value Handling — Learns a default split direction for missing values automatically
- Tree Pruning — Grows trees to max_depth, then prunes back splits whose gain falls below a threshold (gamma)
- Sparsity-aware — Efficient with sparse inputs (like one-hot or TF-IDF matrices)
- Cache optimization — Optimized memory access patterns for speed
Regularization in XGBoost
Regularization penalizes overly complex models to improve generalization:
L1 (Lasso) - reg_alpha
Adds the absolute value of the leaf weights as a penalty. Pushes some weights to exactly 0, effectively removing low-value leaves and splits. Creates a sparser model.
L2 (Ridge) - reg_lambda
Adds the squared value of the leaf weights as a penalty. Shrinks large weights smoothly but rarely makes any of them exactly 0, distributing the adjustment across all leaves.
L1 (Lasso): penalty = alpha * sum(|weights|)
--> Some weights become exactly 0 (feature selection)
L2 (Ridge): penalty = lambda * sum(weights^2)
--> All weights shrink toward 0 but none reach exactly 0
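The zeroing-vs-shrinking contrast is easiest to see on a linear model, where the penalized weights are directly inspectable. This sketch uses scikit-learn's Lasso and Ridge (linear models, not XGBoost, but the penalty behavior is the same idea); the data is synthetic with only one genuinely informative feature:

```python
# L1 vs L2 on a linear model: L1 zeroes weights, L2 only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only feature 0 matters; the rest are noise
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: alpha * sum(|weights|)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: lambda * sum(weights^2)

print("L1 coefficients:", lasso.coef_)   # noise features driven to exactly 0
print("L2 coefficients:", ridge.coef_)   # all nonzero, just smaller
```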
Code: XGBoost with Regularization Comparison
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Dataset
data = {
"House Size": [500, 800, 1000, 1500, 1800, 2000, 2500, 3000, 3500, 4000],
"Price": [2000000, 3000000, 4000000, 6000000, 7200000, 8000000,
10000000, 12000000, 14000000, 16000000]
}
df = pd.DataFrame(data)
X = df[["House Size"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model with L1 regularization
model_l1 = xgb.XGBRegressor(
n_estimators=100, learning_rate=0.1, max_depth=3,
reg_alpha=10, reg_lambda=0, random_state=42
)
model_l1.fit(X_train, y_train)
y_pred_l1 = model_l1.predict(X_test)
print("L1 MSE:", mean_squared_error(y_test, y_pred_l1))
# Model with L2 regularization
model_l2 = xgb.XGBRegressor(
n_estimators=100, learning_rate=0.1, max_depth=3,
reg_alpha=0, reg_lambda=10, random_state=42
)
model_l2.fit(X_train, y_train)
y_pred_l2 = model_l2.predict(X_test)
print("L2 MSE:", mean_squared_error(y_test, y_pred_l2))
# Model with no regularization (same settings as above for a fair comparison)
model_none = xgb.XGBRegressor(
n_estimators=100, learning_rate=0.1, max_depth=3,
reg_alpha=0, reg_lambda=0, random_state=42
)
model_none.fit(X_train, y_train)
y_pred_none = model_none.predict(X_test)
print("No Reg MSE:", mean_squared_error(y_test, y_pred_none))
Key Parameters
| Parameter | Description |
| --- | --- |
| n_estimators | Number of boosting rounds. More rounds with a smaller learning rate usually generalizes better |
| learning_rate (eta) | Step-size shrinkage applied to each tree's contribution. Typical: 0.01-0.3 |
| max_depth | Maximum tree depth. Typical: 3-8 |
| reg_alpha | L1 regularization on leaf weights. Higher = sparser weights |
| reg_lambda | L2 regularization on leaf weights. Higher = smoother weights |
| subsample | Fraction of training rows sampled per tree (stochastic boosting). Typical: 0.6-0.9 |
| colsample_bytree | Fraction of features sampled per tree. Typical: 0.6-0.9 |
When to Use XGBoost
| Good For | Not Ideal For |
| --- | --- |
| Structured/tabular data | Image, audio, or text (use deep learning) |
| Competition-level accuracy | Very small datasets |
| Datasets with missing values | When simple interpretability is needed |
| Both classification and regression | Streaming/online learning |
XGBoost with no regularization can overfit. Keep reg_alpha and/or reg_lambda in your tuning; note that reg_lambda already defaults to 1 (L2), which is a sensible baseline.