PCA — Principal Component Analysis
A dimensionality reduction technique that transforms high-dimensional data into fewer dimensions while preserving the most important variance.
What It Is
PCA reduces the number of features (columns) in your data while keeping the most important information. It finds new axes called Principal Components that capture the maximum variance in the data, then projects data onto the top few components.
PCA is the most widely used dimensionality reduction technique. Use it to visualize high-dimensional data in 2D/3D, speed up ML models, remove noise, and handle multicollinearity.
Key Idea
PCA finds directions (principal components) along which your data varies the most. The first component captures maximum variance, the second captures the next most (orthogonal to the first), and so on. You keep only the top components and discard the rest.
Original: 64 features (8x8 pixel images)
After PCA: 2 features (PC1, PC2)
Each principal component is a linear combination of original features.
PC1 = w1*x1 + w2*x2 + ... + wn*xn
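This weight relationship can be checked directly against scikit-learn. A small sketch on synthetic data (the array shapes and random seed are arbitrary): projecting a centered sample onto the first component's weight vector reproduces scikit-learn's PC1 coordinate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
pca = PCA(n_components=2).fit(X)

# Each PC1 coordinate is the dot product of the centered sample
# with the first component's weight vector (w1..wn).
manual_pc1 = (X - pca.mean_) @ pca.components_[0]
sklearn_pc1 = pca.transform(X)[:, 0]
print(np.allclose(manual_pc1, sklearn_pc1))  # True
```

The weight vectors live in `pca.components_`, one row per principal component.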
How It Works
Algorithm Steps
- Standardize the data — Make all features have mean=0, std=1 (essential when features have different scales)
- Compute covariance matrix — Measures how features vary together
- Find eigenvectors and eigenvalues — Eigenvectors = new directions (principal components). Eigenvalues = importance (variance captured by each)
- Sort by eigenvalues (largest first)
- Project data onto the top k components to get the reduced dataset
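The steps above can be sketched in plain NumPy and compared against scikit-learn. This is a didactic sketch, not how the library computes PCA internally (it uses SVD); the two agree up to an arbitrary sign flip per component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# 1. Center the data (full standardization omitted: these synthetic
#    features already share a common scale)
Xc = X - X.mean(axis=0)
# 2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)
# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort eigenvectors by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the top k components
k = 2
X_manual = Xc @ eigvecs[:, :k]

# Matches sklearn up to a sign flip per component
X_sklearn = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))  # True
```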
Code: PCA on Handwritten Digits
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Load sample dataset (handwritten digits 0-9)
digits = load_digits()
X = digits.data # 64 features (8x8 pixels)
y = digits.target
# Show first 10 digit images
fig, axes = plt.subplots(1, 10, figsize=(10, 3))
for i in range(10):
    axes[i].imshow(digits.images[i], cmap="gray")
    axes[i].set_title(f"Label: {y[i]}")
    axes[i].axis("off")
plt.show()
Apply PCA and Visualize
# Apply PCA to reduce 64 dimensions to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Original shape:", X.shape) # (1797, 64)
print("Reduced shape:", X_reduced.shape) # (1797, 2)
# Plot results
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="tab10", s=30)
plt.title("PCA - Digits dataset reduced to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar()
plt.show()
Compare Original vs Reduced
# Side-by-side visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# LEFT: Show some original digit images
# LEFT: show the first 10 digit images side by side in one panel
# (drawing them in a loop would overwrite the same axes each iteration)
axes[0].imshow(np.hstack(digits.images[:10]), cmap="gray")
axes[0].set_title("Original Digits (8x8 images)")
axes[0].axis("off")
# RIGHT: Show PCA-reduced 2D scatter plot
scatter = axes[1].scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="tab10", s=15)
axes[1].set_title("PCA Projection (64D -> 2D)")
axes[1].set_xlabel("PC1")
axes[1].set_ylabel("PC2")
legend = axes[1].legend(*scatter.legend_elements(), title="Digits")
axes[1].add_artist(legend)
plt.tight_layout()
plt.show()
How Many Components to Keep?
Use the explained variance ratio to decide. Plot cumulative variance and pick the number of components that explain 90-95% of the total variance.
pca_full = PCA().fit(X)
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_var) + 1), cumulative_var)  # x-axis = component count, starting at 1
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.title("Explained Variance vs Components")
plt.legend()
plt.show()
When to Use PCA
| Good For | Not Ideal For |
| --- | --- |
| Reducing features before training ML models | When all features are equally important |
| Visualizing high-dimensional data in 2D/3D | Non-linear relationships (use t-SNE or UMAP instead) |
| Removing noise and redundant features | Categorical data (PCA needs numeric features) |
| Speeding up model training | When interpretability of features is critical |
Always standardize your data before PCA; otherwise features on larger scales will dominate the principal components. (The digits example gets away without scaling only because all 64 pixel features share the same 0-16 intensity scale.) Also remember that PCA captures linear relationships only; for non-linear structure, consider kernel PCA, t-SNE, or UMAP.
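A minimal sketch of why scaling matters, using synthetic data with two independent features on very different scales (the scales and sample size here are arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two independent features on wildly different scales
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

# Without scaling, the large-scale feature swallows nearly all the variance
raw = PCA(n_components=2).fit(X)
print(raw.explained_variance_ratio_)  # first component ~1.0

# With scaling, variance is shared roughly evenly
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
print(scaled.named_steps["pca"].explained_variance_ratio_)  # roughly [0.5, 0.5]
```

Wrapping the scaler and PCA in a pipeline also ensures the same scaling is applied to any new data you transform later.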
Key Parameters
- n_components — Number of components to keep. Can be an integer (exact count) or a float between 0 and 1 (e.g., 0.95 means "keep enough to explain 95% of variance")
- svd_solver — Algorithm to use ('auto', 'full', 'arpack', 'randomized')
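The float form of `n_components` can be sketched on the digits dataset (the `svd_solver="full"` choice is explicit here because the float form requires a full SVD):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # (1797, 64)

# Float n_components: keep however many components are needed
# to explain at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])                    # far fewer than 64
print(pca.explained_variance_ratio_.sum())   # at least 0.95
```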
Tags: Unsupervised · Dimensionality Reduction · Feature Engineering · sklearn