Leverage knowledge from pre-trained models instead of training from scratch. Fine-tune massive models for your specific task.
Transfer learning takes a model trained on a large dataset (like ImageNet with 14 million images) and adapts it for a different but related task. Instead of learning from scratch, you start with learned features and fine-tune for your specific problem. This dramatically reduces training time and data requirements.
A model trained on ImageNet already knows edges, textures, shapes, and objects. You just teach it the last step — your specific classification task.
Early layers learn generic features useful across all image tasks. Only the deeper layers learn task-specific patterns. This is why you can reuse most of the network.
Feature extraction: freeze the entire pre-trained model, remove its final classification layer, and add your own. Only the new layer is trained. Fast, and works with small datasets.
Fine-tuning: unfreeze some or all of the pre-trained layers and retrain them with a very low learning rate. Higher accuracy, but it needs more data and compute.
| Model | Params | Top-5 Accuracy | Best For |
|---|---|---|---|
| VGG16/19 | 138M / 144M | 92.7% | Simple, easy to understand |
| ResNet50/101 | 25M / 44M | 93.3% | Deeper without vanishing gradients |
| InceptionV3 | 24M | 93.7% | Multi-scale feature extraction |
| MobileNetV2 | 3.4M | 90.1% | Mobile/edge deployment (lightweight) |
| EfficientNet | 5-66M | 97.1% | Best accuracy-efficiency tradeoff |

| Scenario | Strategy | Why |
|---|---|---|
| Small dataset, similar to ImageNet | Feature extraction | Enough shared features; avoids overfitting |
| Large dataset, similar to ImageNet | Fine-tune top layers | Enough data to adapt features |
| Small dataset, different from ImageNet | Feature extraction from earlier layers | Later features too task-specific |
| Large dataset, very different | Fine-tune entire model | Enough data to learn new features |
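
The decision table above can be sketched as a small helper; the function name and the string labels are illustrative, not a standard API:

```python
def pick_strategy(dataset_size: str, similar_to_imagenet: bool) -> str:
    """Map the (dataset size, domain similarity) quadrant to a strategy.

    dataset_size is "small" or "large"; the return values mirror the
    four rows of the decision table.
    """
    if similar_to_imagenet:
        # Shared features transfer well; how much you adapt depends on data.
        return "feature_extraction" if dataset_size == "small" else "fine_tune_top"
    # Different domain: later layers are too ImageNet-specific.
    return ("feature_extract_early_layers" if dataset_size == "small"
            else "fine_tune_all")
```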
The same concept applies to text. Pre-trained language models have learned grammar, facts, and reasoning from billions of words, so you fine-tune them for your task just as you would an image model.
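The image recipe carries over directly: freeze the pre-trained body, train a new head. The sketch below uses a toy stand-in built from PyTorch primitives (all sizes are illustrative; in practice you would load a real checkpoint, e.g. via the Hugging Face `transformers` library):

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Toy stand-in for a pre-trained language model plus a new task head."""
    def __init__(self, vocab_size=1000, d_model=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # "pre-trained"
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # "pre-trained"
        self.head = nn.Linear(d_model, num_classes)                # new, task-specific

model = TextClassifier()

# Freeze the "pre-trained" body; only the new head is trained.
for p in model.embed.parameters():
    p.requires_grad = False
for p in model.encoder.parameters():
    p.requires_grad = False

tokens = torch.randint(0, 1000, (8, 16))  # batch of 8 sequences, 16 tokens each
hidden = model.encoder(model.embed(tokens))        # (8, 16, 64)
logits = model.head(hidden.mean(dim=1))            # mean-pool, then classify: (8, 2)
```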
Always use a low learning rate (1e-5 to 1e-4) when fine-tuning. High learning rates destroy the pre-trained features — this is called "catastrophic forgetting".