Gradient Boosting
A sequential ensemble method that builds models one after another, each one correcting the errors of the ensemble built so far by taking a gradient-descent step on a loss function.
What is Gradient Boosting?
Gradient Boosting is an ensemble technique where weak models (typically small decision trees) are trained sequentially. Each new tree focuses on correcting the residual errors of the combined predictions so far. The method uses gradient descent to minimize a loss function.
Unlike Random Forest (bagging) where trees are independent, Gradient Boosting trains trees sequentially. Each tree learns from the mistakes of all previous trees combined.
How Boosting Works
Algorithm Steps
- Start with a simple prediction (e.g., the mean of the targets for regression, or the log-odds of the positive class for classification)
- Calculate residuals — the errors between actual and predicted values
- Train a new weak tree to predict these residuals
- Add the new tree's predictions (scaled by learning rate) to the ensemble
- Repeat for N iterations, each time reducing the remaining error
For regression:
Prediction = initial_prediction + lr * tree_1(x) + lr * tree_2(x) + ... + lr * tree_n(x)
For classification:
Each tree is fit to the gradient of the classification loss (e.g., log loss) with respect to the current ensemble's predictions.
The final prediction passes the summed tree outputs through a sigmoid (binary) or softmax (multiclass) to obtain class probabilities.
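The regression recipe above can be sketched from scratch in a few lines. This is an illustrative implementation for squared-error loss only (where the negative gradient is exactly the residual); the helper names `fit_gbm` and `predict_gbm` are made up for this example, not a standard API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=50, lr=0.1, max_depth=3):
    init = y.mean()                               # step 1: start from the mean
    pred = np.full_like(y, init, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                      # step 2: errors of ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # step 3: weak tree fits residuals
        pred += lr * tree.predict(X)              # step 4: add scaled contribution
        trees.append(tree)
    return init, trees

def predict_gbm(model, X, lr=0.1):                # lr must match the one used in fit
    init, trees = model
    return init + lr * sum(t.predict(X) for t in trees)

# Toy data: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

model = fit_gbm(X, y)
mse = np.mean((predict_gbm(model, X) - y) ** 2)
print(f"Training MSE: {mse:.4f}")
```

Each iteration shrinks the remaining error, so the training MSE ends up far below the variance of `y` (the error of predicting the mean alone).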
Code: Spam Detection with Gradient Boosting
import pandas as pd
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load SMS spam dataset
dataset_url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(dataset_url, sep='\t', header=None, names=['label', 'message'])
# Encode labels: ham=0, spam=1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
# Text preprocessing
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)
stop_words = set(stopwords.words('english'))  # build the set once, not per word
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    words = [w for w in words if w not in stop_words]
    return ' '.join(words)
df['cleaned_message'] = df['message'].apply(preprocess_text)
# TF-IDF feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cleaned_message'])
y = df['label']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Gradient Boosting Classifier
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbc.fit(X_train, y_train)
# Evaluate
y_pred = gbc.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
# Predict custom message
custom = "Free entry in 2 a wkly comp to win FA Cup final tkts"
cleaned = preprocess_text(custom)
vector = vectorizer.transform([cleaned])
result = "spam" if gbc.predict(vector)[0] == 1 else "ham"
print(f"Prediction: {result}")
Key Parameters
| Parameter | Description |
| --- | --- |
| n_estimators | Number of boosting rounds (trees). More = better fit but slower, with a risk of overfitting |
| learning_rate | Shrinks the contribution of each tree. Lower = needs more trees but generalizes better |
| max_depth | Depth of each tree. Typically 3-5 for boosting (shallow trees are "weak learners") |
| subsample | Fraction of samples used per tree. <1.0 adds randomness (stochastic gradient boosting) |
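A quick way to see the learning_rate / n_estimators trade-off is to train a few configurations on a synthetic task. This is a hedged sketch: the data comes from scikit-learn's make_classification helper, and the specific (lr, n) pairs are arbitrary illustrations, not recommended defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A large learning rate with few trees vs. a small rate with many trees
for lr, n in [(0.5, 20), (0.1, 100), (0.02, 500)]:
    clf = GradientBoostingClassifier(
        learning_rate=lr,
        n_estimators=n,
        max_depth=3,
        subsample=0.8,      # stochastic gradient boosting
        random_state=42,
    )
    clf.fit(X_tr, y_tr)
    print(f"lr={lr:<5} n_estimators={n:<4} test accuracy={clf.score(X_te, y_te):.3f}")
```

On most datasets the smaller-rate/more-trees configurations take longer to train but tend to generalize at least as well, which is why the two parameters should be tuned together rather than independently.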
When to Use Gradient Boosting
| Good For | Not Ideal For |
| --- | --- |
| Structured/tabular data | Image or sequence data |
| High prediction accuracy | When training speed is critical |
| Both classification and regression | Very small datasets |
| Kaggle competitions (top performer) | When interpretability is required |
The learning_rate and n_estimators are coupled: a smaller learning rate needs more estimators. A common strategy is to set a small learning_rate (0.01-0.1) and use early stopping to find the right number of trees.
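scikit-learn supports this early-stopping strategy directly through the n_iter_no_change and validation_fraction parameters of GradientBoostingClassifier: training stops once the score on an internal validation split fails to improve for the given number of rounds. A minimal sketch, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=42)

gbc = GradientBoostingClassifier(
    n_estimators=1000,          # generous upper bound on the number of trees
    learning_rate=0.05,         # small rate, so more trees may be needed
    n_iter_no_change=10,        # stop after 10 rounds with no improvement
    validation_fraction=0.1,    # fraction held out for the stopping check
    random_state=42,
)
gbc.fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping
print(f"Trees fitted: {gbc.n_estimators_} of {gbc.n_estimators} allowed")
```

If the validation score plateaus early, the model keeps only the trees fitted so far, which saves training time and guards against overfitting without hand-tuning n_estimators.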
Ensemble Boosting Classification Regression Supervised