Feature Selection
Pick the most important features and drop the noise. Fewer features = faster training, less overfitting, better interpretability.
Why Feature Selection?
Not all features contribute equally to predictions. Some are noisy, redundant, or irrelevant. Feature selection identifies and keeps only the useful ones, improving model performance and reducing complexity.
With 100 features, your model might overfit noise. With the right 15 features, it could perform better and train 10x faster.
Three Categories
Filter Methods
Score each feature independently using statistical tests. Fast but ignores feature interactions. Examples: correlation, chi-squared, mutual information.
Wrapper Methods
Train models with different feature subsets and pick the best. Accurate but slow. Examples: RFE, forward/backward selection.
Embedded Methods
Feature selection is built into the model training. Best of both worlds. Examples: Lasso (L1), tree-based feature importance.
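Forward/backward selection is listed above but gets no dedicated section below; scikit-learn ships it as `SequentialFeatureSelector` (available since 0.24). A minimal forward-selection sketch on iris — the KNN estimator and the choice of 2 features are arbitrary for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features, greedily add the one that
# most improves cross-validated accuracy, stop at n_features_to_select.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",  # "backward" starts full and removes instead
    cv=5,
)
sfs.fit(X, y)
print(f"Selected mask: {sfs.get_support()}")
```

Because every candidate subset is evaluated by retraining the model, this is accurate but much slower than filter methods — the wrapper trade-off described above.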
Method 1: Correlation Analysis
import pandas as pd
import numpy as np
# Remove highly correlated features (redundant)
df = pd.DataFrame(np.random.randn(100, 5), columns=['A', 'B', 'C', 'D', 'E'])
df['B'] = df['A'] * 0.95 + np.random.randn(100) * 0.1 # B ≈ A (highly correlated)
corr_matrix = df.corr().abs()
# Find pairs with correlation > 0.9
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.9)]
print(f"Drop: {to_drop}") # ['B'] - redundant with A
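The 0.9 cutoff is a judgment call, so it can help to wrap the logic above in a reusable helper — `drop_correlated` is a hypothetical name for this sketch, not a library function:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Upper triangle only, so each pair is examined exactly once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "C"])
df["B"] = df["A"]  # make B an exact duplicate of A
print(drop_correlated(df).columns.tolist())  # ['A', 'C']
```

Note that which column of a correlated pair gets dropped is arbitrary (here, always the later one); domain knowledge may favor keeping a specific one.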
Method 2: SelectKBest (Filter)
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, chi2
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Select top 2 features using F-statistic (ANOVA)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(f"Original: {X.shape}") # (150, 4)
print(f"Selected: {X_selected.shape}") # (150, 2)
print(f"Scores: {selector.scores_}")
print(f"Selected features: {selector.get_support()}") # [False, False, True, True] (petal length & width)
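The other score functions imported above work the same way. A quick sketch of both — mutual information can capture nonlinear dependence, while chi-squared requires non-negative features (which the iris measurements are):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Mutual information: captures nonlinear feature-target dependence
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi = mi_selector.fit_transform(X, y)

# Chi-squared: only valid for non-negative features (counts, frequencies,
# or physical measurements like these)
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_chi2 = chi2_selector.fit_transform(X, y)

print(X_mi.shape, X_chi2.shape)  # (150, 2) (150, 2)
```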
Method 3: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Select top 2 features by recursively removing the least important
rfe = RFE(estimator=model, n_features_to_select=2, step=1)
rfe.fit(X, y)
print(f"Selected: {rfe.support_}") # [False, False, True, True]
print(f"Ranking: {rfe.ranking_}") # e.g. [2, 3, 1, 1] (1 = selected; higher = eliminated earlier)
print(f"Feature names: {np.array(load_iris().feature_names)[rfe.support_]}")
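If you'd rather not hard-code `n_features_to_select`, scikit-learn's `RFECV` lets cross-validation choose the count. A sketch using the same random forest:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = load_iris(return_X_y=True)

# Same elimination loop as RFE, but 5-fold CV scores each feature count
# and keeps the best-performing one
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X, y)
print(f"Optimal feature count: {rfecv.n_features_}")
print(f"Selected mask: {rfecv.support_}")
```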
Method 4: Tree-Based Importance (Embedded)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = model.feature_importances_
feature_names = load_iris().feature_names
# Sort by importance
idx = np.argsort(importances)[::-1]
for i in idx:
    print(f"  {feature_names[i]:20s}: {importances[i]:.4f}")
# Threshold-based selection
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(model, threshold="mean") # Keep above-average importance
X_selected = selector.fit_transform(X, y)
print(f"Selected {X_selected.shape[1]} features out of {X.shape[1]}")
Method 5: L1 Regularization (Lasso)
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Lasso drives unimportant feature weights to exactly 0.
# Note: Lasso is a regression model; here the iris class labels (0, 1, 2)
# are treated as a continuous target, which is fine for illustration.
lasso = LassoCV(cv=5).fit(X_scaled, y)
print(f"Coefficients: {lasso.coef_}")
# Non-zero coefficients = selected features
selected = np.where(lasso.coef_ != 0)[0]
print(f"Selected feature indices: {selected}")
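For a genuine classification target like iris, the embedded-selection analogue of Lasso is L1-penalized logistic regression, which zeroes out coefficients the same way. A sketch — the regularization strength `C=0.1` is an arbitrary choice for this example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 penalty drives per-class coefficients to exactly 0,
# just as Lasso does for regression weights
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_scaled, y)

# A feature counts as selected if any class assigns it a non-zero weight
selected = np.where(np.any(clf.coef_ != 0, axis=0))[0]
print(f"Selected feature indices: {selected}")
```

Smaller `C` means stronger regularization and fewer surviving features; tune it with cross-validation in practice.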
Comparison
| Method | Speed | Considers Interactions | Best For |
|---|---|---|---|
| Correlation | Very fast | Pairwise only | Removing redundant features |
| SelectKBest | Fast | No | Quick baseline, high-dimensional data |
| RFE | Slow | Yes | When accuracy matters more than speed |
| Tree importance | Medium | Yes | General purpose, interpretable |
| Lasso (L1) | Medium | Partially | Linear models, automatic selection |
Always fit feature selection inside cross-validation, not on the full dataset beforehand. Otherwise information from the held-out folds leaks into the selection process, and your scores come out optimistically biased.
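In scikit-learn, the cleanest way to follow this rule is to put the selector in a `Pipeline`, so it is refit from scratch inside every fold. A sketch combining `SelectKBest` with the random forest from earlier:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# The selector is refit on each fold's training split, so the held-out
# fold never influences which features are kept -- no leakage
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running `selector.fit_transform(X, y)` on the full dataset and then cross-validating the model on the reduced `X` is the leaky version of the same computation — the pipeline form costs nothing extra and gives honest estimates.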