PCA — Principal Component Analysis
A dimensionality reduction technique that transforms high-dimensional data into fewer dimensions while preserving the most important variance.
What It Is
PCA reduces the number of features (columns) in your data while keeping the most important information. It finds new axes called Principal Components that capture the maximum variance in the data, then projects data onto the top few components.
PCA is the most widely used dimensionality reduction technique. Use it to visualize high-dimensional data in 2D/3D, speed up ML models, remove noise, and handle multicollinearity.
Key Idea
PCA finds directions (principal components) along which your data varies the most. The first component captures maximum variance, the second captures the next most (orthogonal to the first), and so on. You keep only the top components and discard the rest.
Original: 64 features (8x8 pixel images)
After PCA: 2 features (PC1, PC2)
Each principal component is a linear combination of original features.
PC1 = w1*x1 + w2*x2 + ... + wn*xn
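This weight relationship can be checked directly against scikit-learn. A small sketch on synthetic data (the array shapes and random seed are arbitrary): projecting a centered sample onto the first component's weight vector reproduces scikit-learn's PC1 coordinate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
pca = PCA(n_components=2).fit(X)

# Each PC1 coordinate is the dot product of the centered sample
# with the first component's weight vector (w1..wn).
manual_pc1 = (X - pca.mean_) @ pca.components_[0]
sklearn_pc1 = pca.transform(X)[:, 0]
print(np.allclose(manual_pc1, sklearn_pc1))  # True
```

The weight vectors live in `pca.components_`, one row per principal component.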
How It Works
Algorithm Steps
- Standardize the data — Make all features have mean=0, std=1 (essential when features have different scales)
- Compute covariance matrix — Measures how features vary together
- Find eigenvectors and eigenvalues — Eigenvectors = new directions (principal components). Eigenvalues = importance (variance captured by each)
- Sort by eigenvalues (largest first)
- Project data onto the top k components to get the reduced dataset
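The steps above can be sketched in plain NumPy and compared against scikit-learn. This is a didactic sketch, not how the library computes PCA internally (it uses SVD); the two agree up to an arbitrary sign flip per component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# 1. Center the data (full standardization omitted: these synthetic
#    features already share a common scale)
Xc = X - X.mean(axis=0)
# 2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)
# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort eigenvectors by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the top k components
k = 2
X_manual = Xc @ eigvecs[:, :k]

# Matches sklearn up to a sign flip per component
X_sklearn = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))  # True
```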
Code: PCA on Handwritten Digits
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Load sample dataset (handwritten digits 0-9)
digits = load_digits()
X = digits.data # 64 features (8x8 pixels)
y = digits.target
# Show first 10 digit images
fig, axes = plt.subplots(1, 10, figsize=(10, 3))
for i in range(10):
    axes[i].imshow(digits.images[i], cmap="gray")
    axes[i].set_title(f"Label: {y[i]}")
    axes[i].axis("off")
plt.show()
Apply PCA and Visualize
# Apply PCA to reduce 64 dimensions to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Original shape:", X.shape) # (1797, 64)
print("Reduced shape:", X_reduced.shape) # (1797, 2)
# Plot results
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="tab10", s=30)
plt.title("PCA - Digits dataset reduced to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar()
plt.show()
Compare Original vs Reduced
# Side-by-side visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# LEFT: Show some original digit images
# LEFT: show the first 10 digit images side by side in one panel
# (drawing them in a loop would overwrite the same axes each iteration)
axes[0].imshow(np.hstack(digits.images[:10]), cmap="gray")
axes[0].set_title("Original Digits (8x8 images)")
axes[0].axis("off")
# RIGHT: Show PCA-reduced 2D scatter plot
scatter = axes[1].scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="tab10", s=15)
axes[1].set_title("PCA Projection (64D -> 2D)")
axes[1].set_xlabel("PC1")
axes[1].set_ylabel("PC2")
legend = axes[1].legend(*scatter.legend_elements(), title="Digits")
axes[1].add_artist(legend)
plt.tight_layout()
plt.show()
How Many Components to Keep?
Use the explained variance ratio to decide. Plot cumulative variance and pick the number of components that explain 90-95% of the total variance.
pca_full = PCA().fit(X)
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_var) + 1), cumulative_var)  # x-axis = component count, starting at 1
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.title("Explained Variance vs Components")
plt.legend()
plt.show()
When to Use PCA
| Good For | Not Ideal For |
| --- | --- |
| Reducing features before training ML models | When all features are equally important |
| Visualizing high-dimensional data in 2D/3D | Non-linear relationships (use t-SNE or UMAP instead) |
| Removing noise and redundant features | Categorical data (PCA needs numeric features) |
| Speeding up model training | When interpretability of features is critical |
Always standardize your data before PCA; otherwise features on larger scales will dominate the principal components. (The digits example gets away without scaling only because all 64 pixel features share the same 0-16 intensity scale.) Also remember that PCA captures linear relationships only; for non-linear structure, consider kernel PCA, t-SNE, or UMAP.
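A minimal sketch of why scaling matters, using synthetic data with two independent features on very different scales (the scales and sample size here are arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two independent features on wildly different scales
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

# Without scaling, the large-scale feature swallows nearly all the variance
raw = PCA(n_components=2).fit(X)
print(raw.explained_variance_ratio_)  # first component ~1.0

# With scaling, variance is shared roughly evenly
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
print(scaled.named_steps["pca"].explained_variance_ratio_)  # roughly [0.5, 0.5]
```

Wrapping the scaler and PCA in a pipeline also ensures the same scaling is applied to any new data you transform later.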
Key Parameters
- n_components — Number of components to keep. Can be an integer (exact count) or a float between 0 and 1 (e.g., 0.95 means "keep enough to explain 95% of variance")
- svd_solver — Algorithm to use ('auto', 'full', 'arpack', 'randomized')
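The float form of `n_components` can be sketched on the digits dataset (the `svd_solver="full"` choice is explicit here because the float form requires a full SVD):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # (1797, 64)

# Float n_components: keep however many components are needed
# to explain at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])                    # far fewer than 64
print(pca.explained_variance_ratio_.sum())   # at least 0.95
```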
Tags: Unsupervised · Dimensionality Reduction · Feature Engineering · sklearn