Transformer Architecture
The architecture behind GPT, BERT, and virtually every modern LLM. Self-attention is all you need.
What is a Transformer?
Introduced in the 2017 paper "Attention Is All You Need", the Transformer is a neural network architecture that processes sequences using self-attention instead of recurrence (RNN/LSTM). It can look at all positions in a sequence simultaneously, making it parallelizable and much faster to train than RNNs.
Before Transformers, we processed words one at a time (RNN). Transformers see the entire sequence at once and learn which words to "pay attention to" for each word.
Key Innovation: Self-Attention
Self-attention lets each word look at every other word in the sequence and decide how much to "attend to" each one. For the sentence "The cat sat on the mat because it was tired", the word "it" needs to attend strongly to "cat" to understand the reference.
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Where:
Q (Query) = "What am I looking for?"
K (Key) = "What do I contain?"
V (Value) = "What information do I provide?"
d_k = dimension of keys (for scaling)
Each word generates its own Q, K, V vectors from learned weight matrices.
Step-by-Step Attention
Self-Attention Calculation
- Create Q, K, V — Multiply each word embedding by learned weight matrices W_Q, W_K, W_V
- Compute scores — Dot product of Q with every K: score = Q · K^T
- Scale — Divide by √d_k so the dot products don't grow large, which would saturate the softmax and produce near-zero gradients
- Softmax — Normalize scores to get attention weights (sum to 1)
- Weighted sum — Multiply each V by its attention weight and sum up
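The five steps above can be sketched directly in NumPy. Note the weight matrices here are random stand-ins for parameters that would be learned during training:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
d_model, d_k = 8, 8
X = np.random.randn(3, d_model)           # embeddings for 3 words

# Step 1: project embeddings with (here: random) weight matrices
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T                          # Step 2: dot-product scores
scores = scores / np.sqrt(d_k)            # Step 3: scale by sqrt(d_k)
weights = softmax(scores)                 # Step 4: normalize to attention weights
output = weights @ V                      # Step 5: weighted sum of values

print(weights.shape, output.shape)        # (3, 3) (3, 8)
```

Each row of `weights` sums to 1 and says how much that word attends to every word in the sequence.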
Multi-Head Attention
Instead of one attention function, run multiple attention "heads" in parallel. Each head can focus on different types of relationships:
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) × W_O
head_i = Attention(Q × W_Q_i, K × W_K_i, V × W_V_i)
Example: 8 heads, each with d_k = 64, total dimension = 512
Head 1 might focus on: syntactic relationships (subject-verb)
Head 2 might focus on: coreference (pronoun-noun)
Head 3 might focus on: positional proximity
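Using the 8-head, d_k = 64 configuration above, multi-head attention can be sketched as follows; the per-head projections and W_O are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

np.random.seed(0)
seq_len, d_model, h = 3, 512, 8
d_k = d_model // h                          # 64 dimensions per head
X = np.random.randn(seq_len, d_model)

heads = []
for i in range(h):
    # each head has its own projections (random stand-ins for learned weights)
    W_Q = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    W_K = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    W_V = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# concatenate the h outputs of size d_k back to d_model, then project with W_O
W_O = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)                            # (3, 512)
```

Because each head works in its own 64-dimensional subspace, the heads are free to specialize in different relationships, as in the examples above.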
Full Transformer Architecture
ENCODER (processes input):
Input Embedding + Positional Encoding
→ [Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm] × N layers
DECODER (generates output):
Output Embedding + Positional Encoding
→ [Masked Multi-Head Self-Attention → Add & Norm
→ Cross-Attention (attend to encoder) → Add & Norm
→ Feed-Forward → Add & Norm] × N layers
→ Linear → Softmax → Output Probabilities
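The decoder's masked self-attention can be sketched by setting the scores for future positions to -inf before the softmax, so each token attends only to itself and earlier tokens:

```python
import numpy as np

np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

scores = Q @ K.T / np.sqrt(d_k)
# causal mask: position i may not attend to any position j > i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                     # exp(-inf) = 0 after softmax

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))                # upper triangle is exactly 0
```

This mask is what makes GPT-style generation possible: at training time, every position predicts its next token without "cheating" by looking ahead.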
Positional Encoding
Since Transformers process all words in parallel (no recurrence), they have no sense of word order. Positional encoding adds position information to each word embedding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This gives each position a unique pattern that the model can learn to use.
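The sinusoidal formulas above translate to a few lines of NumPy (array shapes chosen for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0 alternates sin(0)=0 and cos(0)=1
```

The resulting matrix is simply added to the word embeddings before the first layer.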
Code: Simple Self-Attention
import numpy as np
def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)  # Scaled dot product
    scores = scores - scores.max(axis=-1, keepdims=True)  # Stabilize softmax
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)  # Softmax
    output = np.matmul(weights, V)  # Weighted sum of values
    return output, weights
# Example: 3 words, embedding dim = 4
np.random.seed(42)
embeddings = np.random.randn(3, 4) # 3 words, 4 dimensions
# In practice, Q/K/V come from learned projections
# Here we simplify: Q = K = V = embeddings
output, attention_weights = self_attention(embeddings, embeddings, embeddings)
print("Attention weights (which words attend to which):")
print(np.round(attention_weights, 3))
print("\nOutput (context-aware representations):")
print(np.round(output, 3))
BERT vs GPT
BERT (Encoder-only)
Bidirectional — sees full context on both sides of each token. Pre-trained with masked language modeling (predicting masked-out tokens). Best for: classification, NER, QA, embeddings.
GPT (Decoder-only)
Autoregressive — left-to-right only. Pre-trained to predict next token. Best for: text generation, conversation, code.
Why Transformers Won
| Feature | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range dependencies | Vanishing gradients | Direct attention to any position |
| Training speed | Slow on long sequences | Fast with GPU parallelism |
| Scalability | Limited | Scales to billions of parameters |
Self-attention has O(n²) time and memory complexity in the sequence length. For very long sequences (>4096 tokens), sparse-attention variants like Longformer reduce the asymptotic cost, while kernels like FlashAttention compute exact attention much more efficiently on GPUs.
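A quick back-of-the-envelope on why O(n²) hurts: the attention weight matrix alone, stored in float32 (4 bytes per entry), grows quadratically with sequence length, and a real model needs one per head per layer:

```python
# Memory for a single n-by-n float32 attention matrix
for n in (512, 4096, 32768):
    mib = n * n * 4 / 2**20
    print(f"n = {n:>6}: {mib:8.1f} MiB")
# n =    512:      1.0 MiB
# n =   4096:     64.0 MiB
# n =  32768:   4096.0 MiB
```

Going from 4096 to 32768 tokens (8×) multiplies the attention memory by 64, which is why long-context work focuses on avoiding materializing this matrix.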