Random Forest
An ensemble of decision trees that combines their predictions for better accuracy and reduced overfitting.
What is Random Forest?
Random Forest is an ensemble learning technique that builds multiple decision trees and combines their outputs. Instead of relying on a single tree, it takes the majority vote (classification) or average prediction (regression) from many trees. This reduces overfitting and improves accuracy.
Random Forest = Bagging (Bootstrap Aggregating) + Random Feature Selection. Each tree is trained on a random subset of data AND a random subset of features, making every tree unique.
How It Works
Algorithm Steps
- Create N bootstrap samples (random subsets of training data, with replacement)
- Train a decision tree on each sample, but at each split only consider a random subset of features
- For classification: each tree votes, majority wins
- For regression: take the average of all tree predictions
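The steps above can be sketched from scratch using single decision trees as base learners. This is a minimal illustration on synthetic data (dataset and tree count are arbitrary), not the optimized implementation scikit-learn ships:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (draw rows with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train a tree that considers only a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: each tree votes; the majority class wins (labels here are 0/1)
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = (majority == y_test).mean()
print(f"Ensemble accuracy: {accuracy:.2f}")
```

In practice you would use `RandomForestClassifier`, which does exactly this (plus optimizations) in one call.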
Key Concepts
- Bagging: Each tree trains on a different random subset of data (reduces variance)
- Feature Randomness: Each tree only considers a random subset of features at each split (reduces correlation between trees)
- Aggregation: Combine all trees for the final prediction
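The aggregation step is easy to verify: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and the forest's prediction is simply their average. A small check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree predictions by hand ...
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
manual_average = per_tree.mean(axis=0)

# ... and confirm it matches the forest's own output
print(np.allclose(manual_average, rf.predict(X)))
```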
Code: Predicting House Rent (Simple)
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)

# Features and target
X = df[["Size_sqft", "Bedrooms"]]
y = df["Rent"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest (100 trees is plenty for a dataset this small;
# random_state makes the run reproducible)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")
```
Code: With Categorical Features
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset with more features, including a categorical one
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Bathrooms": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "Floor": [1, 2, 3, 1, 5, 3, 6, 10, 12, 15],
    "City": ["Chennai", "Bangalore", "Mumbai", "Chennai", "Delhi",
             "Mumbai", "Bangalore", "Delhi", "Chennai", "Mumbai"],
    "Furnished": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Parking_Spots": [0, 1, 1, 2, 2, 1, 2, 3, 2, 3],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)
X = df.drop("Rent", axis=1)
y = df["Rent"]

# One-hot encode categorical columns
# (sparse_output=False returns a dense array, so the DataFrame below
# always works; this parameter needs scikit-learn >= 1.2)
categorical_features = ["City"]
encoder = ColumnTransformer(
    transformers=[("encoded", OneHotEncoder(drop="first", sparse_output=False),
                   categorical_features)],
    remainder="passthrough"
)
X_encoded = encoder.fit_transform(X)
feature_names = encoder.get_feature_names_out()
X_encoded_df = pd.DataFrame(X_encoded, columns=feature_names)

# Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X_encoded_df, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
```
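A fitted forest also exposes `feature_importances_` (the mean impurity decrease per feature across all trees, normalized to sum to 1), which pairs naturally with the encoder's output names. A minimal, self-contained sketch on synthetic data with illustrative column names:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with made-up column names, for illustration only
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["size", "bedrooms", "floor", "parking"])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores sum to 1.0
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```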
Key Parameters
| Parameter | Description |
| --- | --- |
| n_estimators | Number of trees in the forest. More trees = better accuracy, slower training |
| max_depth | Maximum depth of each tree. Limits overfitting |
| max_features | Number of features to consider at each split ("sqrt", "log2", or int) |
| min_samples_split | Minimum samples needed to split a node |
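These parameters are commonly tuned together. A minimal sketch with scikit-learn's `GridSearchCV` on synthetic data (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# One or two values per parameter keeps the grid small: 2*2*2*2 = 16 fits per fold
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 1.0],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(f"Best CV R2: {search.best_score_:.3f}")
```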
When to Use Random Forest
| Good For | Not Ideal For |
| --- | --- |
| Better accuracy than a single decision tree | When interpretability of individual predictions is critical |
| Large and complex datasets | Very high-dimensional sparse data |
| Reducing overfitting | Real-time predictions (slower than a single tree) |
| Both classification and regression | Memory-constrained deployments (the model stores many trees) |
More trees (n_estimators) generally means better performance, but with diminishing returns. Beyond ~100-500 trees, improvement is minimal while training time increases linearly.
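The diminishing-returns effect is easy to check empirically. A small sketch on synthetic data (exact scores will vary with the dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

# Cross-validated R2 for increasing forest sizes: big gains early, tiny gains later
scores = []
for n in [10, 50, 100, 300]:
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    score = cross_val_score(rf, X, y, cv=3).mean()
    scores.append(score)
    print(f"n_estimators={n:>3}: mean CV R2 = {score:.3f}")
```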
Ensemble Bagging Classification Regression Supervised