
Random Forest

An ensemble of decision trees that combines their predictions for better accuracy and reduced overfitting.

What is Random Forest?

Random Forest is an ensemble learning technique that builds multiple decision trees and combines their outputs. Instead of relying on a single tree, it takes the majority vote (classification) or average prediction (regression) from many trees. This reduces overfitting and improves accuracy.

Random Forest = Bagging (Bootstrap Aggregating) + Random Feature Selection. Each tree is trained on a random subset of data AND a random subset of features, making every tree unique.

How It Works

Algorithm Steps
  1. Create N bootstrap samples (random subsets of training data, with replacement)
  2. Train a decision tree on each sample, but at each split only consider a random subset of features
  3. For classification: each tree votes, majority wins
  4. For regression: take the average of all tree predictions
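The four steps above can be sketched from scratch, using scikit-learn's DecisionTreeClassifier as the base learner. This is a minimal illustration, not how RandomForestClassifier is implemented internally; the choice of 25 trees and the variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_iris(return_X_y=True)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (same size as the data, drawn with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: per-split feature randomness via max_features="sqrt"
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 10**6))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: each tree votes; the majority class wins
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("Ensemble training accuracy:", (majority == y).mean())
```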

Key Concepts

Bootstrap sampling (bagging): each tree is trained on a random sample of the rows, drawn with replacement, so every tree sees a slightly different view of the data.
Random feature selection: at each split, only a random subset of features is considered, which decorrelates the trees and makes the ensemble stronger than any single tree.
Out-of-bag (OOB) error: each tree's bootstrap sample leaves out roughly a third of the rows; those held-out rows act as a built-in validation set.
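One concept worth demonstrating is the out-of-bag (OOB) estimate: because each bootstrap sample omits roughly a third of the rows, those rows give a free validation score with no separate hold-out set. A minimal sketch on synthetic data (make_regression is used purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# oob_score=True asks the forest to score each row using only the
# trees that did NOT see that row during training
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is an R2 estimate computed from out-of-bag predictions
print(f"OOB R2: {rf.oob_score_:.3f}")
```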

Code: Predicting House Rent (Simple)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)

# Features and target
X = df[["Size_sqft", "Bedrooms"]]
y = df["Rent"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest (100 trees is plenty for a dataset this small)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")

Code: With Categorical Features

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset with more features, including a categorical column
data = {
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Bathrooms": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "Floor": [1, 2, 3, 1, 5, 3, 6, 10, 12, 15],
    "City": ["Chennai", "Bangalore", "Mumbai", "Chennai", "Delhi", "Mumbai",
             "Bangalore", "Delhi", "Chennai", "Mumbai"],
    "Furnished": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Parking_Spots": [0, 1, 1, 2, 2, 1, 2, 3, 2, 3],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000]
}
df = pd.DataFrame(data)

X = df.drop("Rent", axis=1)
y = df["Rent"]

# One-hot encode categorical columns
categorical_features = ["City"]
encoder = ColumnTransformer(
    transformers=[("encoded", OneHotEncoder(drop="first"), categorical_features)],
    remainder="passthrough"
)
X_encoded = encoder.fit_transform(X)
feature_names = encoder.get_feature_names_out()
X_encoded_df = pd.DataFrame(X_encoded, columns=feature_names)

# Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded_df, y, test_size=0.2, random_state=42
)
rf = RandomForestRegressor(n_estimators=50)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
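Although a forest is harder to interpret than a single tree, a fitted RandomForestRegressor does expose impurity-based importances through its feature_importances_ attribute. A short self-contained sketch, reusing the same illustrative rent numbers as the simple example above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Same illustrative rent data as the simple example
df = pd.DataFrame({
    "Size_sqft": [500, 700, 800, 900, 1200, 1500, 1800, 2000, 2200, 2500],
    "Bedrooms": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    "Rent": [5000, 7000, 8500, 9000, 12000, 15000, 18000, 20000, 22000, 25000],
})
X, y = df[["Size_sqft", "Bedrooms"]], df["Rent"]

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1.0; a higher value means the feature drove
# more impurity reduction across the forest's splits
for name, imp in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```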

Key Parameters

n_estimators: Number of trees in the forest. More trees generally improve accuracy but slow training
max_depth: Maximum depth of each tree. Limits overfitting
max_features: Number of features to consider at each split ("sqrt", "log2", or an int)
min_samples_split: Minimum number of samples required to split a node
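To see how these parameters plug in, here is a sketch comparing a few max_depth settings with cross-validation. The data is synthetic (make_regression) and every value here is illustrative, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=1)

scores = {}
for depth in [2, 5, None]:
    rf = RandomForestRegressor(
        n_estimators=100,      # number of trees
        max_depth=depth,       # None lets each tree grow fully
        max_features="sqrt",   # features considered at each split
        min_samples_split=2,   # minimum samples to split a node
        random_state=1,
    )
    scores[depth] = cross_val_score(rf, X, y, cv=5, scoring="r2").mean()
    print(f"max_depth={depth}: mean CV R2 = {scores[depth]:.3f}")
```

Very shallow trees underfit this data, so deeper settings score higher here; on noisier real data a finite max_depth often generalizes better.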

When to Use Random Forest

Good for:
  Better accuracy than a single decision tree
  Large and complex datasets
  Reducing overfitting
  Both classification and regression

Not ideal for:
  When interpretability is critical
  Very high-dimensional sparse data
  Real-time predictions (slower than a single tree)
  Extrapolating beyond the range of the training data (regression)

More trees (n_estimators) generally means better performance, but with diminishing returns. Beyond ~100-500 trees, improvement is minimal while training time increases linearly.
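The diminishing returns can be seen directly by growing the forest in stages and tracking test-set R2. A small sketch on synthetic data (all sizes and values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit forests of increasing size; the jump from 10 to 50 trees is usually
# much larger than the jump from 100 to 300
r2s = []
for n in [10, 50, 100, 300]:
    rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_train, y_train)
    r2s.append(r2_score(y_test, rf.predict(X_test)))
    print(f"{n:>4} trees: test R2 = {r2s[-1]:.4f}")
```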

Ensemble Bagging Classification Regression Supervised