Transfer Learning

Leverage knowledge from pre-trained models instead of training from scratch. Fine-tune massive models for your specific task.

What is Transfer Learning?

Transfer learning takes a model trained on a large dataset (like ImageNet with 14 million images) and adapts it for a different but related task. Instead of learning from scratch, you start with learned features and fine-tune for your specific problem. This dramatically reduces training time and data requirements.

A model trained on ImageNet already knows edges, textures, shapes, and objects. You just teach it the last step — your specific classification task.

Why Transfer Learning Works

```
Layers 1-3:  Universal features (edges, corners, textures)   ← Keep frozen
Layers 4-6:  Mid-level features (patterns, parts, shapes)    ← Optionally fine-tune
Layers 7+:   Task-specific features (faces, cars, diseases)  ← Replace and retrain
```

Early layers learn generic features useful across all image tasks. Only the deeper layers learn task-specific patterns. This is why you can reuse most of the network.
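The intuition that early layers compute generic edge detectors can be seen with a hand-written filter: learned first-layer CNN filters often resemble a Sobel kernel, which responds only where intensity changes. A minimal NumPy sketch (the image and helper are invented for illustration):

```python
import numpy as np

# A Sobel kernel for vertical edges -- a stand-in for a typical
# first-layer convolutional filter learned on ImageNet.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half (a vertical edge).
img = np.zeros((5, 5))
img[:, 3:] = 1.0

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, no padding."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edges = conv2d_valid(img, sobel_x)
print(edges)  # nonzero only in windows that overlap the edge
```

The filter fires regardless of whether the task is classifying faces, cars, or diseases, which is exactly why such layers transfer.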

Two Strategies

Feature Extraction

Freeze the entire pre-trained model. Remove the last classification layer and add your own. Only train the new layer. Fast and works with small datasets.

Fine-Tuning

Unfreeze some or all pre-trained layers and retrain with a very low learning rate. Better accuracy but needs more data and compute.
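A framework-free sketch of the feature-extraction strategy: a fixed random projection plus ReLU stands in for the frozen pre-trained backbone, and a small logistic-regression head is the only part that trains. All names and shapes here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-trained" feature extractor: in practice a deep CNN's frozen
# convolutional stack; here a fixed random projection + ReLU stands in.
W_frozen = rng.normal(size=(4, 16))

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen -- never updated

# Toy binary classification data
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Feature extraction: train ONLY the new head on frozen features.
F = extract_features(X)
w_head = np.zeros(16)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w_head)))       # sigmoid
    w_head -= 0.1 * F.T @ (p - y) / len(y)        # gradient step on head only

acc = ((F @ w_head > 0) == (y == 1)).mean()
print(f"head-only accuracy: {acc:.2f}")
```

Only the 16 head weights are ever updated, which is why this strategy trains fast and avoids overfitting on small datasets.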

Popular Pre-Trained Models

| Model        | Params      | Top-5 Accuracy | Best For                             |
|--------------|-------------|----------------|--------------------------------------|
| VGG16/19     | 138M / 144M | 92.7%          | Simple, easy to understand           |
| ResNet50/101 | 25M / 44M   | 93.3%          | Deeper without vanishing gradients   |
| InceptionV3  | 24M         | 93.7%          | Multi-scale feature extraction       |
| MobileNetV2  | 3.4M        | 90.1%          | Mobile/edge deployment (lightweight) |
| EfficientNet | 5M-66M      | 97.1%          | Best accuracy-efficiency tradeoff    |
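As a rough illustration of reading the table, here is a hypothetical helper (function name and dictionary are invented; the figures come from the table, with EfficientNet listed at its largest variant) that picks the smallest backbone clearing a top-5 accuracy floor:

```python
# (params in millions, top-5 accuracy in %) -- values from the table above.
PRETRAINED = {
    "VGG16":        (138.0, 92.7),
    "ResNet50":     (25.0,  93.3),
    "InceptionV3":  (24.0,  93.7),
    "MobileNetV2":  (3.4,   90.1),
    "EfficientNet": (66.0,  97.1),  # largest variant; smaller ones exist
}

def smallest_backbone(min_top5: float) -> str:
    """Smallest model (by parameter count) meeting the accuracy floor."""
    candidates = {name: pa for name, pa in PRETRAINED.items()
                  if pa[1] >= min_top5}
    return min(candidates, key=lambda name: candidates[name][0])

print(smallest_backbone(90.0))  # MobileNetV2
print(smallest_backbone(93.0))  # InceptionV3 (24M edges out ResNet50's 25M)
print(smallest_backbone(95.0))  # EfficientNet
```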

Code: Feature Extraction

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50

# Load pre-trained ResNet50 (without the top classification layer)
base_model = ResNet50(weights='imagenet', include_top=False,
                      input_shape=(224, 224, 3))

# Freeze all layers
base_model.trainable = False

# Add custom classification head
model = tf.keras.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax')  # 5 classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

# Train only the new layers (base_model is frozen)
# model.fit(train_data, epochs=10)
```

Code: Fine-Tuning

```python
# After feature extraction training, unfreeze top layers for fine-tuning
base_model.trainable = True

# Freeze all layers except the last 20
for layer in base_model.layers[:-20]:
    layer.trainable = False

# Recompile with a VERY low learning rate (important!)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # 10x-100x smaller
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Continue training with fine-tuning
# model.fit(train_data, epochs=5)
```

When to Use Which Strategy

| Scenario                               | Strategy                              | Why                                      |
|----------------------------------------|---------------------------------------|------------------------------------------|
| Small dataset, similar to ImageNet     | Feature Extraction                    | Enough shared features, avoid overfitting |
| Large dataset, similar to ImageNet     | Fine-tune top layers                  | Enough data to adapt features            |
| Small dataset, different from ImageNet | Feature extraction from earlier layers | Later features too specific              |
| Large dataset, very different          | Fine-tune entire model                | Enough data to learn new features        |
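The decision table can be encoded directly. A hypothetical helper (the function name and string labels are invented for illustration):

```python
def choose_strategy(dataset_size: str, similarity: str) -> str:
    """dataset_size: 'small' | 'large';
    similarity to ImageNet: 'similar' | 'different'."""
    table = {
        ("small", "similar"):   "feature extraction",
        ("large", "similar"):   "fine-tune top layers",
        ("small", "different"): "feature extraction from earlier layers",
        ("large", "different"): "fine-tune entire model",
    }
    return table[(dataset_size, similarity)]

print(choose_strategy("small", "similar"))  # feature extraction
```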

Transfer Learning for NLP

The same concept applies to text. Pre-trained language models such as BERT and GPT have learned grammar, facts, and reasoning from billions of words; just as with images, you keep most of the network frozen and fine-tune it for your task, such as sentiment classification or question answering.

Always use a low learning rate (1e-5 to 1e-4) when fine-tuning. High learning rates destroy the pre-trained features — this is called "catastrophic forgetting".
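The effect of the learning rate on forgetting can be demonstrated numerically. In this sketch (toy two-parameter linear models; all names and data are invented), a model "pre-trained" on task A is fine-tuned on a related task B with a high versus a low learning rate, and task-A error is measured afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-training": the true weights for task A.
w_A = np.array([1.0, 1.0])
X_A = rng.normal(size=(200, 2)); y_A = X_A @ w_A

# Fine-tuning data for a related task B with different true weights.
w_B = np.array([1.0, -1.0])
X_B = rng.normal(size=(200, 2)); y_B = X_B @ w_B

def fine_tune(lr, steps=20):
    w = w_A.copy()  # start from the "pre-trained" weights
    for _ in range(steps):
        grad = 2.0 * X_B.T @ (X_B @ w - y_B) / len(y_B)  # MSE gradient on B
        w -= lr * grad
    return w

def task_A_error(w):
    return np.mean((X_A @ w - y_A) ** 2)

err_high = task_A_error(fine_tune(lr=1e-1))  # big steps overwrite w_A
err_low = task_A_error(fine_tune(lr=1e-3))   # small steps stay close to w_A
print(f"task-A error: high lr {err_high:.3f}, low lr {err_low:.4f}")
```

With the high rate, 20 steps pull the weights almost entirely onto task B, so performance on task A collapses; with the low rate, the weights stay near the pre-trained solution. The same dynamic, at far larger scale, is why 1e-5 is a much safer fine-tuning rate than 1e-2.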