BERT / GPT Fine-Tuning
Take pre-trained language models and adapt them for your specific NLP tasks using HuggingFace Transformers.
What is Fine-Tuning?
Fine-tuning takes a pre-trained language model (BERT, GPT, RoBERTa) that has already learned language patterns from billions of words, and further trains it on your specific task with your labeled data. The model keeps its language understanding and just learns the new task on top.
BERT was pre-trained on all of English Wikipedia + BookCorpus (3.3 billion words). Fine-tuning lets you leverage all that knowledge with just a few hundred labeled examples.
Pre-Training vs Fine-Tuning
PRE-TRAINING (done by Google/OpenAI, takes weeks on 100s of GPUs):
Task: Masked Language Model (BERT) or Next Token Prediction (GPT)
Data: Billions of words from the internet
Result: General language understanding
FINE-TUNING (done by you, takes minutes-hours on 1 GPU):
Task: Your specific task (sentiment, NER, QA, etc.)
Data: Your labeled dataset (hundreds to thousands of examples)
Result: Task-specific model with pre-trained knowledge
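The masked-language-model objective used for BERT's pre-training can be sketched in plain Python: roughly 15% of input tokens are hidden and the model must predict them. This is a simplified illustration (whitespace tokens stand in for WordPiece, and real BERT replaces only 80% of chosen tokens with [MASK], swapping 10% for random tokens and leaving 10% unchanged):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Replace ~15% of tokens with [MASK]; return masked sequence and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model must predict this original token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
# with seed=1, tokens 0 and 8 get masked: {0: 'the', 8: 'dog'}
```

Predicting the hidden tokens forces the model to learn grammar, word meaning, and context, which is exactly the knowledge fine-tuning reuses.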
HuggingFace Ecosystem
transformers
Library with 100,000+ pre-trained models. Simple API for loading models and tokenizers.
datasets
Library with 1000s of ready-to-use datasets. Easy loading, processing, and caching.
Trainer
High-level training API. Handles training loops, evaluation, saving, logging automatically.
Pipeline
One-line inference. pipeline("sentiment-analysis")("I love this!") → Positive 0.99
Code: Sentiment Classification with BERT
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score
# Load dataset
dataset = load_dataset("imdb") # 25K train, 25K test movie reviews
# Load pre-trained BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tokenize data
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)
tokenized = dataset.map(tokenize, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,  # Low learning rate for fine-tuning!
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Metrics
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": accuracy_score(eval_pred.label_ids, preds)}
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# Inference
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("This movie was absolutely fantastic!"))
# → [{'label': 'POSITIVE', 'score': 0.998}]
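The pipeline's score is simply a softmax over the model's two output logits. A minimal sketch with made-up logit values (illustrative, not taken from the model above):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.9, 3.3]  # hypothetical [NEGATIVE, POSITIVE] logits
probs = softmax(logits)
label = ["NEGATIVE", "POSITIVE"][probs.index(max(probs))]
print(label, round(max(probs), 3))
# → POSITIVE 0.998
```

The pipeline reports the label with the highest probability and that probability as the score.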
Code: Named Entity Recognition (NER)
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load pre-trained NER model
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
entities = ner(text)
for entity in entities:
    print(f"  {entity['word']:20s} → {entity['entity_group']:5s} (score: {entity['score']:.3f})")
# Output:
# Elon Musk → PER (score: 0.998)
# SpaceX → ORG (score: 0.997)
# Hawthorne → LOC (score: 0.993)
# California → LOC (score: 0.999)
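The aggregation_strategy="simple" option groups the model's per-token B-/I- tags into entity spans. The grouping logic can be sketched as follows (the token tags here are made up to mirror the output above; the real pipeline also merges WordPiece subwords and averages their scores):

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (text, entity_group) spans."""
    groups, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity starts here
            if current:
                groups.append(current)
            current = (tok, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + tok, current[1])  # continue entity
        else:  # "O" (outside) tag ends any open entity
            if current:
                groups.append(current)
            current = None
    if current:
        groups.append(current)
    return groups

tokens = "Elon Musk founded SpaceX in Hawthorne".split()
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
print(group_entities(tokens, tags))
# → [('Elon Musk', 'PER'), ('SpaceX', 'ORG'), ('Hawthorne', 'LOC')]
```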
Common NLP Tasks
| Task | Model Class | Example |
| --- | --- | --- |
| Text Classification | AutoModelForSequenceClassification | Sentiment, spam, topic |
| Named Entity Recognition | AutoModelForTokenClassification | Find names, places, dates |
| Question Answering | AutoModelForQuestionAnswering | Extract answers from context |
| Summarization | AutoModelForSeq2SeqLM | Summarize articles |
| Translation | AutoModelForSeq2SeqLM | English → French |
| Text Generation | AutoModelForCausalLM | Complete text, chat |
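For extractive question answering, AutoModelForQuestionAnswering outputs a start logit and an end logit per context token, and the answer is the highest-scoring valid span. A sketch of the span-selection step with made-up logits (the context and values are illustrative):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Question: "When was SpaceX founded?"
context = "SpaceX was founded in 2002 by Elon Musk".split()
start_logits = [0.1, 0.0, 0.2, 0.1, 6.0, 0.3, 5.5, 0.2]
end_logits   = [0.0, 0.1, 0.2, 0.0, 6.2, 0.1, 0.3, 5.8]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))
# → 2002
```

The max_len cap and the s <= e constraint rule out degenerate spans; real implementations also exclude spans that fall inside the question.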
Fine-Tuning Tips
- Learning rate: 2e-5 to 5e-5 (much lower than training from scratch)
- Epochs: 2-4 is usually enough (more can overfit)
- Batch size: 16-32 (limited by GPU memory)
- Max length: BERT max = 512 tokens. Truncate or use a long-context model
- Warmup: Use linear warmup for first 10% of steps
- Weight decay: 0.01 helps prevent overfitting
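For the IMDB run above, the 10% warmup tip works out as follows (the Trainer handles this automatically via the warmup_ratio argument of TrainingArguments; the arithmetic is shown explicitly here):

```python
import math

train_examples = 25_000  # IMDB train split
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)  # 1563
total_steps = steps_per_epoch * epochs                    # 4689
warmup_steps = int(0.10 * total_steps)                    # 468
print(steps_per_epoch, total_steps, warmup_steps)
# → 1563 4689 468
```

So setting warmup_ratio=0.1 means the learning rate ramps up linearly over the first ~468 optimizer steps before decaying.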
Fine-tuning BERT-base needs roughly 11 GB of GPU memory with the settings above. If you're limited, use DistilBERT (40% smaller, retains ~97% of BERT's performance) or use Google Colab's free GPU.
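If memory is tight, mixed precision and gradient accumulation also help: fp16 roughly halves most tensors, and accumulating gradients lets a small per-step batch behave like a larger one. A config sketch using the same TrainingArguments API as above (fp16=True requires a CUDA GPU):

```python
from transformers import TrainingArguments

# Effective batch size stays 16 (4 per step x 4 accumulation steps),
# but per-step activation memory drops roughly 4x.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,  # mixed precision; requires a CUDA GPU
    learning_rate=2e-5,
    num_train_epochs=3,
)
```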