Leverage knowledge from pre-trained models instead of training from scratch. Fine-tune massive models for your specific task.
Transfer learning takes a model trained on a large dataset (like ImageNet with 14 million images) and adapts it for a different but related task. Instead of learning from scratch, you start with learned features and fine-tune for your specific problem. This dramatically reduces training time and data requirements.
A model trained on ImageNet already knows edges, textures, shapes, and objects. You just teach it the last step — your specific classification task.
Early layers learn generic features useful across all image tasks. Only the deeper layers learn task-specific patterns. This is why you can reuse most of the network.
Feature extraction: freeze the entire pre-trained model, remove its final classification layer, and add your own. Only the new layer is trained. Fast, and works with small datasets.
Fine-tuning: unfreeze some or all of the pre-trained layers and retrain them with a very low learning rate. Higher accuracy, but it needs more data and compute.
| Model | Params | Top-5 Accuracy | Best For |
|---|---|---|---|
| VGG16/19 | 138M / 144M | 92.7% | Simple, easy to understand |
| ResNet50/101 | 25M / 44M | 93.3% | Deeper without vanishing gradients |
| InceptionV3 | 24M | 93.7% | Multi-scale feature extraction |
| MobileNetV2 | 3.4M | 90.1% | Mobile/edge deployment (lightweight) |
| EfficientNet | 5-66M | 97.1% | Best accuracy-efficiency tradeoff |

| Scenario | Strategy | Why |
|---|---|---|
| Small dataset, similar to ImageNet | Feature extraction | Enough shared features; avoids overfitting |
| Large dataset, similar to ImageNet | Fine-tune top layers | Enough data to adapt features |
| Small dataset, different from ImageNet | Feature extraction from earlier layers | Later features too task-specific |
| Large dataset, very different | Fine-tune entire model | Enough data to learn new features |
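
The decision table above can be sketched as a small helper; the function name and the string labels are illustrative, not a standard API:

```python
def pick_strategy(dataset_size: str, similar_to_imagenet: bool) -> str:
    """Map the (dataset size, domain similarity) quadrant to a strategy.

    dataset_size is "small" or "large"; the return values mirror the
    four rows of the decision table.
    """
    if similar_to_imagenet:
        # Shared features transfer well; how much you adapt depends on data.
        return "feature_extraction" if dataset_size == "small" else "fine_tune_top"
    # Different domain: later layers are too ImageNet-specific.
    return ("feature_extract_early_layers" if dataset_size == "small"
            else "fine_tune_all")
```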
The same concept applies to text. Pre-trained language models have learned grammar, facts, and reasoning from billions of words, so you fine-tune them for your task just as you would an image model.
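The image recipe carries over directly: freeze the pre-trained body, train a new head. The sketch below uses a toy stand-in built from PyTorch primitives (all sizes are illustrative; in practice you would load a real checkpoint, e.g. via the Hugging Face `transformers` library):

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Toy stand-in for a pre-trained language model plus a new task head."""
    def __init__(self, vocab_size=1000, d_model=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # "pre-trained"
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # "pre-trained"
        self.head = nn.Linear(d_model, num_classes)                # new, task-specific

model = TextClassifier()

# Freeze the "pre-trained" body; only the new head is trained.
for p in model.embed.parameters():
    p.requires_grad = False
for p in model.encoder.parameters():
    p.requires_grad = False

tokens = torch.randint(0, 1000, (8, 16))  # batch of 8 sequences, 16 tokens each
hidden = model.encoder(model.embed(tokens))        # (8, 16, 64)
logits = model.head(hidden.mean(dim=1))            # mean-pool, then classify: (8, 2)
```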
Always use a low learning rate (1e-5 to 1e-4) when fine-tuning. High learning rates destroy the pre-trained features — this is called "catastrophic forgetting".