Random Forest
An ensemble of decision trees that combines their predictions for better accuracy and reduced overfitting.
What is Random Forest?
Random Forest is an ensemble learning technique that builds multiple decision trees and combines their outputs. Instead of relying on a single tree, it takes the majority vote (classification) or average prediction (regression) from many trees. This reduces overfitting and improves accuracy.
Random Forest = Bagging (Bootstrap Aggregating) + Random Feature Selection. Each tree is trained on a random subset of data AND a random subset of features, making every tree unique.
How It Works
Algorithm Steps
- Create N bootstrap samples (random subsets of training data, with replacement)
- Train a decision tree on each sample, but at each split only consider a random subset of features
- For classification: each tree votes, majority wins
- For regression: take the average of all tree predictions
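The steps above can be sketched from scratch using single decision trees as base learners. This is a minimal illustration on synthetic data (dataset and tree count are arbitrary), not the optimized implementation scikit-learn ships:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (draw rows with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train a tree that considers only a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: each tree votes; the majority class wins (labels here are 0/1)
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = (majority == y_test).mean()
print(f"Ensemble accuracy: {accuracy:.2f}")
```

In practice you would use `RandomForestClassifier`, which does exactly this (plus optimizations) in one call.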
Key Concepts
- Bagging: Each tree trains on a different random subset of data (reduces variance)
- Feature Randomness: Each tree only considers a random subset of features at each split (reduces correlation between trees)
- Aggregation: Combine all trees for the final prediction
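The aggregation step is easy to verify: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and the forest's prediction is simply their average. A small check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree predictions by hand ...
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_average = per_tree.mean(axis=0)

# ... and confirm it matches the forest's own output
print(np.allclose(manual_average, rf.predict(X)))
```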
Code: Predicting House Rent (Simple)
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)

# Features and target
X = df[["Size_sqft", "Bedrooms"]]
y = df["Rent"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest (100 trees is plenty for a dataset this small;
# random_state makes the run reproducible)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")
```
Code: With Categorical Features
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset with more features, including a categorical one
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Bathrooms": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "Floor": [1, 2, 3, 1, 5, 3, 6, 10, 12, 15],
    "City": ["Chennai", "Bangalore", "Mumbai", "Chennai", "Delhi",
             "Mumbai", "Bangalore", "Delhi", "Chennai", "Mumbai"],
    "Furnished": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Parking_Spots": [0, 1, 1, 2, 2, 1, 2, 3, 2, 3],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)
X = df.drop("Rent", axis=1)
y = df["Rent"]

# One-hot encode categorical columns
# (sparse_output=False returns a dense array, so the DataFrame below
# always works; this parameter needs scikit-learn >= 1.2)
categorical_features = ["City"]
encoder = ColumnTransformer(
    transformers=[("encoded", OneHotEncoder(drop="first", sparse_output=False),
                   categorical_features)],
    remainder="passthrough"
)
X_encoded = encoder.fit_transform(X)
feature_names = encoder.get_feature_names_out()
X_encoded_df = pd.DataFrame(X_encoded, columns=feature_names)

# Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X_encoded_df, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
```
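A fitted forest also exposes `feature_importances_` (the mean impurity decrease per feature across all trees, normalized to sum to 1), which pairs naturally with the encoder's output names. A minimal, self-contained sketch on synthetic data with illustrative column names:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with made-up column names, for illustration only
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["size", "bedrooms", "floor", "parking"])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores sum to 1.0
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```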
Key Parameters
| Parameter | Description |
| --- | --- |
| n_estimators | Number of trees in the forest. More trees = better accuracy, slower training |
| max_depth | Maximum depth of each tree. Limits overfitting |
| max_features | Number of features to consider at each split ("sqrt", "log2", or int) |
| min_samples_split | Minimum samples needed to split a node |
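These parameters are commonly tuned together. A minimal sketch with scikit-learn's `GridSearchCV` on synthetic data (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# One or two values per parameter keeps the grid small: 2*2*2*2 = 16 fits per fold
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 1.0],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(f"Best CV R2: {search.best_score_:.3f}")
```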
When to Use Random Forest
| Good For | Not Ideal For |
| --- | --- |
| Better accuracy than a single decision tree | When interpretability of individual predictions is critical |
| Large and complex datasets | Very high-dimensional sparse data |
| Reducing overfitting | Real-time predictions (slower than a single tree) |
| Both classification and regression | Memory-constrained deployments (the model stores many trees) |
More trees (n_estimators) generally means better performance, but with diminishing returns. Beyond ~100-500 trees, improvement is minimal while training time increases linearly.
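The diminishing-returns effect is easy to check empirically. A small sketch on synthetic data (exact scores will vary with the dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

# Cross-validated R2 for increasing forest sizes: big gains early, tiny gains later
scores = []
for n in [10, 50, 100, 300]:
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    score = cross_val_score(rf, X, y, cv=3).mean()
    scores.append(score)
    print(f"n_estimators={n:>3}: mean CV R2 = {score:.3f}")
```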
Ensemble Bagging Classification Regression Supervised