BERT / GPT Fine-Tuning
Take pre-trained language models and adapt them for your specific NLP tasks using HuggingFace Transformers.
What is Fine-Tuning?
Fine-tuning takes a pre-trained language model (BERT, GPT, RoBERTa) that has already learned language patterns from billions of words, and further trains it on your specific task with your labeled data. The model keeps its language understanding and just learns the new task on top.
BERT was pre-trained on all of English Wikipedia + BookCorpus (3.3 billion words). Fine-tuning lets you leverage all that knowledge with just a few hundred labeled examples.
Pre-Training vs Fine-Tuning
PRE-TRAINING (done by Google/OpenAI, takes weeks on 100s of GPUs):
Task: Masked Language Model (BERT) or Next Token Prediction (GPT)
Data: Billions of words from the internet
Result: General language understanding
FINE-TUNING (done by you, takes minutes-hours on 1 GPU):
Task: Your specific task (sentiment, NER, QA, etc.)
Data: Your labeled dataset (hundreds to thousands of examples)
Result: Task-specific model with pre-trained knowledge
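The masked-language-model objective used for BERT's pre-training can be sketched in plain Python: roughly 15% of input tokens are hidden and the model must predict them. This is a simplified illustration (whitespace tokens stand in for WordPiece, and real BERT replaces only 80% of chosen tokens with [MASK], swapping 10% for random tokens and leaving 10% unchanged):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Replace ~15% of tokens with [MASK]; return masked sequence and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict this original token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
# with seed=1, tokens 0 and 8 get masked: {0: 'the', 8: 'dog'}
```

Predicting the hidden tokens forces the model to learn grammar, word meaning, and context, which is exactly the knowledge fine-tuning reuses.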
HuggingFace Ecosystem
transformers
Library with 100,000+ pre-trained models. Simple API for loading models and tokenizers.
datasets
Library with 1000s of ready-to-use datasets. Easy loading, processing, and caching.
Trainer
High-level training API. Handles training loops, evaluation, saving, logging automatically.
Pipeline
One-line inference. pipeline("sentiment-analysis")("I love this!") → Positive 0.99
Code: Sentiment Classification with BERT
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score
# Load dataset
dataset = load_dataset("imdb") # 25K train, 25K test movie reviews
# Load pre-trained BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tokenize data
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)
tokenized = dataset.map(tokenize, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,  # Low learning rate for fine-tuning!
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Metrics
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": accuracy_score(eval_pred.label_ids, preds)}
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# Inference
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("This movie was absolutely fantastic!"))
# → [{'label': 'POSITIVE', 'score': 0.998}]
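The pipeline's score is simply a softmax over the model's two output logits. A minimal sketch with made-up logit values (illustrative, not taken from the model above):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.9, 3.3]  # hypothetical [NEGATIVE, POSITIVE] logits
probs = softmax(logits)
label = ["NEGATIVE", "POSITIVE"][probs.index(max(probs))]
print(label, round(max(probs), 3))
# → POSITIVE 0.998
```

The pipeline reports the label with the highest probability and that probability as the score.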
Code: Named Entity Recognition (NER)
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load pre-trained NER model
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
entities = ner(text)
for entity in entities:
    print(f"  {entity['word']:20s} → {entity['entity_group']:5s} (score: {entity['score']:.3f})")
# Output:
# Elon Musk → PER (score: 0.998)
# SpaceX → ORG (score: 0.997)
# Hawthorne → LOC (score: 0.993)
# California → LOC (score: 0.999)
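The aggregation_strategy="simple" option groups the model's per-token B-/I- tags into entity spans. The grouping logic can be sketched as follows (the token tags here are made up to mirror the output above; the real pipeline also merges WordPiece subwords and averages their scores):

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (text, entity_group) spans."""
    groups, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity starts here
            if current:
                groups.append(current)
            current = (tok, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + tok, current[1])  # continue entity
        else:  # "O" (outside) tag ends any open entity
            if current:
                groups.append(current)
            current = None
    if current:
        groups.append(current)
    return groups

tokens = "Elon Musk founded SpaceX in Hawthorne".split()
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(group_entities(tokens, tags))
# → [('Elon Musk', 'PER'), ('SpaceX', 'ORG'), ('Hawthorne', 'LOC')]
```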
Common NLP Tasks
| Task | Model Class | Example |
| --- | --- | --- |
| Text Classification | AutoModelForSequenceClassification | Sentiment, spam, topic |
| Named Entity Recognition | AutoModelForTokenClassification | Find names, places, dates |
| Question Answering | AutoModelForQuestionAnswering | Extract answers from context |
| Summarization | AutoModelForSeq2SeqLM | Summarize articles |
| Translation | AutoModelForSeq2SeqLM | English → French |
| Text Generation | AutoModelForCausalLM | Complete text, chat |
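For extractive question answering, AutoModelForQuestionAnswering outputs a start logit and an end logit per context token, and the answer is the highest-scoring valid span. A sketch of the span-selection step with made-up logits (the context and values are illustrative):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Question: "When was SpaceX founded?"
context = "SpaceX was founded in 2002 by Elon Musk".split()
start_logits = [0.1, 0.0, 0.2, 0.1, 6.0, 0.3, 5.5, 0.2]
end_logits   = [0.0, 0.1, 0.2, 0.0, 6.2, 0.1, 0.3, 5.8]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))
# → 2002
```

The max_len cap and the s <= e constraint rule out degenerate spans; real implementations also exclude spans that fall inside the question.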
Fine-Tuning Tips
- Learning rate: 2e-5 to 5e-5 (much lower than training from scratch)
- Epochs: 2-4 is usually enough (more can overfit)
- Batch size: 16-32 (limited by GPU memory)
- Max length: BERT max = 512 tokens. Truncate or use a long-context model
- Warmup: Use linear warmup for first 10% of steps
- Weight decay: 0.01 helps prevent overfitting
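For the IMDB run above, the 10% warmup tip works out as follows (the Trainer handles this automatically via the warmup_ratio argument of TrainingArguments; the arithmetic is shown explicitly here):

```python
import math

train_examples = 25_000  # IMDB train split
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)  # 1563
total_steps = steps_per_epoch * epochs                    # 4689
warmup_steps = int(0.10 * total_steps)                    # 468
print(steps_per_epoch, total_steps, warmup_steps)
# → 1563 4689 468
```

So setting warmup_ratio=0.1 means the learning rate ramps up linearly over the first ~468 optimizer steps before decaying.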
Fine-tuning BERT-base needs roughly 11 GB of GPU memory with the settings above. If you're limited, use DistilBERT (40% smaller, retains ~97% of BERT's performance) or use Google Colab's free GPU.
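If memory is tight, mixed precision and gradient accumulation also help: fp16 roughly halves most tensors, and accumulating gradients lets a small per-step batch behave like a larger one. A config sketch using the same TrainingArguments API as above (fp16=True requires a CUDA GPU):

```python
from transformers import TrainingArguments

# Effective batch size stays 16 (4 per step x 4 accumulation steps),
# but per-step activation memory drops roughly 4x.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,  # mixed precision; requires a CUDA GPU
    learning_rate=2e-5,
    num_train_epochs=3,
)
```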