Regularization Techniques
Prevent overfitting and build models that generalize to unseen data. The art of adding the right constraints.
What is Overfitting?
A model overfits when it memorizes the training data (including noise) instead of learning the underlying pattern. It performs great on training data but poorly on new data. Regularization adds constraints to prevent this.
Training accuracy: 99% | Test accuracy: 72% → Overfitting!
Training accuracy: 92% | Test accuracy: 89% → Good generalization
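The train/test gap can be reproduced in miniature with a NumPy polynomial-fitting sketch (toy data, not the accuracies quoted above): a degree-9 polynomial has enough parameters to pass through all 10 noisy training points exactly, so its training error is near zero while its test error is far larger; a straight line generalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: y = x + noise
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(0, 0.1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test + rng.normal(0, 0.1, 10)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line_fit = np.polyfit(x_train, y_train, deg=1)  # simple model: generalizes
deg9_fit = np.polyfit(x_train, y_train, deg=9)  # 10 params, 10 points: memorizes

# The flexible model nails the training set but falls apart between the points
print(mse(deg9_fit, x_train, y_train), mse(deg9_fit, x_test, y_test))
print(mse(line_fit, x_train, y_train), mse(line_fit, x_test, y_test))
```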
Techniques Overview
1. L1 Regularization (Lasso)
Adds the sum of absolute weights to the loss. Drives some weights to exactly zero → automatic feature selection.
Loss = Original Loss + λ * Σ|w_i|
Effect: Sparse weights. Some features get weight = 0 (removed).
Use when: You suspect many features are irrelevant.
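A minimal sketch of why L1 yields exact zeros, using proximal gradient descent (ISTA) with a soft-threshold step on synthetic data. The helper names (`soft_threshold`, `lasso_ista`) and the data are illustrative, not from any library:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal step for t*|w|: shrink toward zero, snap small values to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)            # gradient of mean squared loss
        w = soft_threshold(w - lr * grad, lr * lam)  # gradient step, then L1 prox
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 0.0])  # only features 0 and 2 matter
y = X @ true_w + rng.normal(0, 0.1, 200)

w = lasso_ista(X, y, lam=0.5)
print(w)  # the irrelevant features (1, 3, 4) end up at exactly 0.0
```

The soft-threshold step is what makes sparsity possible: plain gradient descent would leave the irrelevant weights hovering near zero, but the threshold clamps them to zero exactly.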
2. L2 Regularization (Ridge)
Adds the sum of squared weights to the loss. Shrinks all weights towards zero but never exactly zero.
Loss = Original Loss + λ * Σ(w_i²)
Effect: Smaller, more distributed weights. No feature removed entirely.
Use when: All features might be relevant but you want to prevent large weights.
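For contrast, a sketch of ridge regression via its closed-form solution on the same kind of synthetic data: a larger λ shrinks every weight toward zero, but none lands exactly on zero. The setup is illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed form: w = (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_w + rng.normal(0, 0.1, 200)

w_mild = ridge_fit(X, y, lam=1.0)
w_strong = ridge_fit(X, y, lam=1000.0)
# Stronger lambda shrinks all weights toward 0, but none becomes exactly 0
print(w_mild)
print(w_strong)
```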
3. Dropout
During training, randomly set a fraction of neuron outputs to zero on each forward pass, so the network cannot rely on any single neuron.
import tensorflow as tf
from tensorflow.keras import layers, Sequential

model = Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dropout(0.5),   # randomly drop 50% of neurons during training
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),   # drop 30% here
    layers.Dense(10, activation='softmax')
])

# Dropout is ONLY active during training, not during prediction:
# model.predict() automatically disables dropout
A dropout rate of 0.5 for hidden layers and 0.2 for the input layer is a good starting point; higher rates mean stronger regularization.
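Under the hood, frameworks use inverted dropout: surviving activations are scaled by 1/(1 - rate) during training so the expected activation matches at inference, where dropout is a no-op. A NumPy sketch (the `dropout` helper is illustrative):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero out ~`rate` of units and rescale the survivors
    so the expected activation is unchanged between training and inference."""
    if not training:
        return x  # dropout does nothing at inference time
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, rate=0.5, rng=rng)
print((y == 0).mean())  # ~0.5 of units dropped
print(y.mean())         # ~1.0: expectation preserved by the 1/(1-rate) rescale
```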
4. Batch Normalization
Normalizes a layer's activations (per feature, across each mini-batch) to zero mean and unit variance. Stabilizes training, acts as mild regularization, and allows higher learning rates.
model = Sequential([
    layers.Dense(256, input_shape=(784,)),
    layers.BatchNormalization(),   # normalize pre-activations
    layers.Activation('relu'),
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])

# BatchNorm has learnable parameters (gamma, beta) for scaling and shifting
# It tracks a running mean/variance for use at inference time
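A NumPy sketch of the training-mode forward pass (before the running statistics come into play; `batch_norm` here is illustrative, not the Keras layer):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale/shift with learnables
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # shifted, scaled activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```

With gamma = 1 and beta = 0 this is pure normalization; training adjusts gamma and beta so the network can undo the normalization where that helps.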
5. Early Stopping
Monitor validation loss during training. Stop when it starts increasing (the model is beginning to overfit).
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',         # watch validation loss
    patience=5,                 # wait 5 epochs for improvement
    restore_best_weights=True   # roll back to the best weights seen
)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=100,           # set high; early stopping will cut it short
          callbacks=[early_stop])

# Typically stops around epoch 15-30 instead of running all 100
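The patience logic behind the callback fits in a few lines of plain Python; `train_with_early_stopping` and the simulated loss curve below are illustrative:

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """run_epoch(e) trains one epoch and returns its validation loss;
    stop after `patience` epochs without improvement."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs: stop
                break
    return best_epoch, best_loss

# Simulated validation curve: improves until epoch 19, then rises (overfitting)
losses = [1.0 / (e + 1) if e < 20 else 0.05 + 0.01 * (e - 20) for e in range(100)]
calls = []

def run_epoch(e):
    calls.append(e)    # stand-in for one epoch of training
    return losses[e]   # validation loss after this epoch

best_epoch, best_loss = train_with_early_stopping(run_epoch)
print(best_epoch, len(calls))  # best was epoch 19; training stopped after epoch 24
```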
6. Data Augmentation
Artificially increase training data by applying random transformations. More data = less overfitting.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotation ±20 degrees
    width_shift_range=0.2,   # horizontal shift ±20%
    height_shift_range=0.2,  # vertical shift ±20%
    horizontal_flip=True,    # random horizontal flip
    zoom_range=0.2,          # random zoom ±20%
    shear_range=0.1          # random shearing
)

# Train on augmented batches
model.fit(datagen.flow(X_train, y_train, batch_size=32),
          epochs=50, validation_data=(X_test, y_test))
7. Weight Decay
Equivalent to L2 regularization for plain SGD, but applied directly in the optimizer's update step: each step shrinks the weights by a small factor. For adaptive optimizers like Adam the two are not equivalent, which is why AdamW decouples the decay from the gradient.
# In TensorFlow/Keras (the weight_decay argument applies decoupled decay)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, weight_decay=1e-4)

# In PyTorch, prefer AdamW: Adam's weight_decay is L2-style, AdamW's is decoupled
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
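The distinction shows up in the update rules. For plain SGD the two formulations coincide exactly, as this NumPy sketch demonstrates (illustrative helper names); for Adam they diverge, because the L2 gradient term gets rescaled by Adam's per-parameter statistics while decoupled decay does not:

```python
import numpy as np

def sgd_decoupled_decay_step(w, grad, lr, decay):
    # Decoupled weight decay: shrink the weights directly each step
    return w - lr * grad - lr * decay * w

def sgd_l2_step(w, grad, lr, lam):
    # L2-in-the-loss: the penalty arrives through the gradient (grad + lam*w)
    return w - lr * (grad + lam * w)

w = np.array([1.0, -2.0])
g = np.array([0.1, 0.3])
print(sgd_decoupled_decay_step(w, g, lr=0.1, decay=0.01))
print(sgd_l2_step(w, g, lr=0.1, lam=0.01))  # identical for plain SGD
```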
When to Use What
| Technique | Best For | Strength |
|---|---|---|
| L1 (Lasso) | Feature selection in linear models | Removes irrelevant features |
| L2 (Ridge) | General-purpose regularization | Prevents large weights |
| Dropout | Deep neural networks | Prevents co-adaptation of neurons |
| Batch Norm | Deep networks (especially CNNs) | Stabilizes + regularizes |
| Early Stopping | Any iteratively trained model | Simple; only patience to tune |
| Data Augmentation | Image classification | More data = less overfitting |
| Weight Decay | Transformers, large models | Cleaner than L2 in the loss |
Combining Techniques
In practice, you combine multiple regularization techniques:
# A well-regularized CNN
model = Sequential([
    layers.Conv2D(32, 3, input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    layers.Conv2D(64, 3),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    layers.Flatten(),
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# + early stopping and data augmentation during training
Too much regularization = underfitting (model is too constrained). Start with mild regularization and increase until validation performance peaks.
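One way to find that peak is a simple validation sweep over λ, sketched here with ridge regression on synthetic data (the setup and helper names are illustrative): too little regularization overfits, far too much underfits, and the validation error tells you where to stop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 1.5]   # most features are irrelevant
y = X @ true_w + rng.normal(0, 1.0, 60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def val_mse(lam):
    w = ridge_fit(X_tr, y_tr, lam)
    return float(np.mean((X_val @ w - y_val) ** 2))

# Sweep from no regularization to far too much
errors = {lam: val_mse(lam) for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]}
best = min(errors, key=errors.get)
print(best, errors)  # lam=1000 shrinks w toward zero and underfits badly
```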