Regularization Techniques
Prevent overfitting and build models that generalize to unseen data. The art of adding the right constraints.
What is Overfitting?
A model overfits when it memorizes the training data (including noise) instead of learning the underlying pattern. It performs great on training data but poorly on new data. Regularization adds constraints to prevent this.
Training accuracy: 99% | Test accuracy: 72% → Overfitting!
Training accuracy: 92% | Test accuracy: 89% → Good generalization
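The train/test gap can be reproduced in miniature with a NumPy polynomial-fitting sketch (toy data, not the accuracies quoted above): a degree-9 polynomial has enough parameters to pass through all 10 noisy training points exactly, so its training error is near zero while its test error is far larger; a straight line generalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: y = x + noise
x_train = np.linspace(0, 1, 10)
y_train = x_train + rng.normal(0, 0.1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test + rng.normal(0, 0.1, 10)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line_fit = np.polyfit(x_train, y_train, deg=1)  # simple model: generalizes
deg9_fit = np.polyfit(x_train, y_train, deg=9)  # 10 params, 10 points: memorizes

# The flexible model nails the training set but falls apart between the points
print(mse(deg9_fit, x_train, y_train), mse(deg9_fit, x_test, y_test))
print(mse(line_fit, x_train, y_train), mse(line_fit, x_test, y_test))
```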
Techniques Overview
1. L1 Regularization (Lasso)
Adds the sum of absolute weights to the loss. Drives some weights to exactly zero → automatic feature selection.
Loss = Original Loss + λ * Σ|w_i|
Effect: Sparse weights. Some features get weight = 0 (removed).
Use when: You suspect many features are irrelevant.
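A minimal sketch of why L1 yields exact zeros, using proximal gradient descent (ISTA) with a soft-threshold step on synthetic data. The helper names (`soft_threshold`, `lasso_ista`) and the data are illustrative, not from any library:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal step for t*|w|: shrink toward zero, snap small values to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)            # gradient of mean squared loss
        w = soft_threshold(w - lr * grad, lr * lam)  # gradient step, then L1 prox
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 0.0])  # only features 0 and 2 matter
y = X @ true_w + rng.normal(0, 0.1, 200)

w = lasso_ista(X, y, lam=0.5)
print(w)  # the irrelevant features (1, 3, 4) end up at exactly 0.0
```

The soft-threshold step is what makes sparsity possible: plain gradient descent would leave the irrelevant weights hovering near zero, but the threshold clamps them to zero exactly.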
2. L2 Regularization (Ridge)
Adds the sum of squared weights to the loss. Shrinks all weights towards zero but never exactly zero.
Loss = Original Loss + λ * Σ(w_i²)
Effect: Smaller, more distributed weights. No feature removed entirely.
Use when: All features might be relevant but you want to prevent large weights.
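For contrast, a sketch of ridge regression via its closed-form solution on the same kind of synthetic data: a larger λ shrinks every weight toward zero, but none lands exactly on zero. The setup is illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed form: w = (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_w + rng.normal(0, 0.1, 200)

w_mild = ridge_fit(X, y, lam=1.0)
w_strong = ridge_fit(X, y, lam=1000.0)
# Stronger lambda shrinks all weights toward 0, but none becomes exactly 0
print(w_mild)
print(w_strong)
```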
3. Dropout
During training, randomly set a fraction of neuron outputs to zero on each forward pass, so the network cannot rely on any single neuron.
import tensorflow as tf
from tensorflow.keras import layers, Sequential

model = Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dropout(0.5),   # randomly drop 50% of neurons during training
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),   # drop 30% here
    layers.Dense(10, activation='softmax')
])

# Dropout is ONLY active during training, not during prediction:
# model.predict() automatically disables dropout
A dropout rate of 0.5 for hidden layers and 0.2 for the input layer is a good starting point; higher rates mean stronger regularization.
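Under the hood, frameworks use inverted dropout: surviving activations are scaled by 1/(1 - rate) during training so the expected activation matches at inference, where dropout is a no-op. A NumPy sketch (the `dropout` helper is illustrative):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero out ~`rate` of units and rescale the survivors
    so the expected activation is unchanged between training and inference."""
    if not training:
        return x  # dropout does nothing at inference time
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, rate=0.5, rng=rng)
print((y == 0).mean())  # ~0.5 of units dropped
print(y.mean())         # ~1.0: expectation preserved by the 1/(1-rate) rescale
```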
4. Batch Normalization
Normalizes a layer's activations (per feature, across each mini-batch) to zero mean and unit variance. Stabilizes training, acts as mild regularization, and allows higher learning rates.
model = Sequential([
    layers.Dense(256, input_shape=(784,)),
    layers.BatchNormalization(),   # normalize pre-activations
    layers.Activation('relu'),
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])

# BatchNorm has learnable parameters (gamma, beta) for scaling and shifting
# It tracks a running mean/variance for use at inference time
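A NumPy sketch of the training-mode forward pass (before the running statistics come into play; `batch_norm` here is illustrative, not the Keras layer):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then scale/shift with learnables
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # shifted, scaled activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```

With gamma = 1 and beta = 0 this is pure normalization; training adjusts gamma and beta so the network can undo the normalization where that helps.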
5. Early Stopping
Monitor validation loss during training. Stop when it starts increasing (the model is beginning to overfit).
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',         # watch validation loss
    patience=5,                 # wait 5 epochs for improvement
    restore_best_weights=True   # roll back to the best weights seen
)

model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=100,           # set high; early stopping will cut it short
          callbacks=[early_stop])

# Typically stops around epoch 15-30 instead of running all 100
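The patience logic behind the callback fits in a few lines of plain Python; `train_with_early_stopping` and the simulated loss curve below are illustrative:

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """run_epoch(e) trains one epoch and returns its validation loss;
    stop after `patience` epochs without improvement."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs: stop
                break
    return best_epoch, best_loss

# Simulated validation curve: improves until epoch 19, then rises (overfitting)
losses = [1.0 / (e + 1) if e < 20 else 0.05 + 0.01 * (e - 20) for e in range(100)]
calls = []

def run_epoch(e):
    calls.append(e)    # stand-in for one epoch of training
    return losses[e]   # validation loss after this epoch

best_epoch, best_loss = train_with_early_stopping(run_epoch)
print(best_epoch, len(calls))  # best was epoch 19; training stopped after epoch 24
```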
6. Data Augmentation
Artificially increase training data by applying random transformations. More data = less overfitting.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotation ±20 degrees
    width_shift_range=0.2,   # horizontal shift ±20%
    height_shift_range=0.2,  # vertical shift ±20%
    horizontal_flip=True,    # random horizontal flip
    zoom_range=0.2,          # random zoom ±20%
    shear_range=0.1          # random shearing
)

# Train on augmented batches
model.fit(datagen.flow(X_train, y_train, batch_size=32),
          epochs=50, validation_data=(X_test, y_test))
7. Weight Decay
Equivalent to L2 regularization for plain SGD, but applied directly in the optimizer's update step: each step shrinks the weights by a small factor. For adaptive optimizers like Adam the two are not equivalent, which is why AdamW decouples the decay from the gradient.
# In TensorFlow/Keras (the weight_decay argument applies decoupled decay)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, weight_decay=1e-4)

# In PyTorch, prefer AdamW: Adam's weight_decay is L2-style, AdamW's is decoupled
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
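The distinction shows up in the update rules. For plain SGD the two formulations coincide exactly, as this NumPy sketch demonstrates (illustrative helper names); for Adam they diverge, because the L2 gradient term gets rescaled by Adam's per-parameter statistics while decoupled decay does not:

```python
import numpy as np

def sgd_decoupled_decay_step(w, grad, lr, decay):
    # Decoupled weight decay: shrink the weights directly each step
    return w - lr * grad - lr * decay * w

def sgd_l2_step(w, grad, lr, lam):
    # L2-in-the-loss: the penalty arrives through the gradient (grad + lam*w)
    return w - lr * (grad + lam * w)

w = np.array([1.0, -2.0])
g = np.array([0.1, 0.3])
print(sgd_decoupled_decay_step(w, g, lr=0.1, decay=0.01))
print(sgd_l2_step(w, g, lr=0.1, lam=0.01))  # identical for plain SGD
```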
When to Use What
| Technique | Best For | Strength |
|---|---|---|
| L1 (Lasso) | Feature selection in linear models | Removes irrelevant features |
| L2 (Ridge) | General-purpose regularization | Prevents large weights |
| Dropout | Deep neural networks | Prevents co-adaptation of neurons |
| Batch Norm | Deep networks (especially CNNs) | Stabilizes + regularizes |
| Early Stopping | Any iteratively trained model | Simple; only patience to tune |
| Data Augmentation | Image classification | More data = less overfitting |
| Weight Decay | Transformers, large models | Cleaner than L2 in the loss |
Combining Techniques
In practice, you combine multiple regularization techniques:
# A well-regularized CNN
model = Sequential([
    layers.Conv2D(32, 3, input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    layers.Conv2D(64, 3),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),

    layers.Flatten(),
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# + early stopping and data augmentation during training
Too much regularization = underfitting (model is too constrained). Start with mild regularization and increase until validation performance peaks.
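One way to find that peak is a simple validation sweep over λ, sketched here with ridge regression on synthetic data (the setup and helper names are illustrative): too little regularization overfits, far too much underfits, and the validation error tells you where to stop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 1.5]   # most features are irrelevant
y = X @ true_w + rng.normal(0, 1.0, 60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def val_mse(lam):
    w = ridge_fit(X_tr, y_tr, lam)
    return float(np.mean((X_val @ w - y_val) ** 2))

# Sweep from no regularization to far too much
errors = {lam: val_mse(lam) for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]}
best = min(errors, key=errors.get)
print(best, errors)  # lam=1000 shrinks w toward zero and underfits badly
```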