Transformer Architecture
The architecture behind GPT, BERT, and virtually every modern LLM. Self-attention is all you need.
What is a Transformer?
Introduced in the 2017 paper "Attention Is All You Need", the Transformer is a neural network architecture that processes sequences using self-attention instead of recurrence (RNN/LSTM). It can look at all positions in a sequence simultaneously, making it parallelizable and much faster to train than RNNs.
Before Transformers, we processed words one at a time (RNN). Transformers see the entire sequence at once and learn which words to "pay attention to" for each word.
Key Innovation: Self-Attention
Self-attention lets each word look at every other word in the sequence and decide how much to "attend to" each one. For the sentence "The cat sat on the mat because it was tired", the word "it" needs to attend strongly to "cat" to understand the reference.
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Where:
Q (Query) = "What am I looking for?"
K (Key) = "What do I contain?"
V (Value) = "What information do I provide?"
d_k = dimension of keys (for scaling)
Each word generates its own Q, K, V vectors from learned weight matrices.
Step-by-Step Attention
Self-Attention Calculation
- Create Q, K, V — Multiply each word embedding by learned weight matrices W_Q, W_K, W_V
- Compute scores — Dot product of Q with every K: score = Q · K^T
- Scale — Divide by √d_k so the dot products don't grow large, which would saturate the softmax and produce near-zero gradients
- Softmax — Normalize scores to get attention weights (sum to 1)
- Weighted sum — Multiply each V by its attention weight and sum up
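The five steps above can be sketched directly in NumPy. Note the weight matrices here are random stand-ins for parameters that would be learned during training:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
d_model, d_k = 8, 8
X = np.random.randn(3, d_model)           # embeddings for 3 words

# Step 1: project embeddings with (here: random) weight matrices
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T                          # Step 2: dot-product scores
scores = scores / np.sqrt(d_k)            # Step 3: scale by sqrt(d_k)
weights = softmax(scores)                 # Step 4: normalize to attention weights
output = weights @ V                      # Step 5: weighted sum of values

print(weights.shape, output.shape)        # (3, 3) (3, 8)
```

Each row of `weights` sums to 1 and says how much that word attends to every word in the sequence.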
Multi-Head Attention
Instead of one attention function, run multiple attention "heads" in parallel. Each head can focus on different types of relationships:
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) × W_O
head_i = Attention(Q × W_Q_i, K × W_K_i, V × W_V_i)
Example: 8 heads, each with d_k = 64, total dimension = 512
Head 1 might focus on: syntactic relationships (subject-verb)
Head 2 might focus on: coreference (pronoun-noun)
Head 3 might focus on: positional proximity
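Using the 8-head, d_k = 64 configuration above, multi-head attention can be sketched as follows; the per-head projections and W_O are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

np.random.seed(0)
seq_len, d_model, h = 3, 512, 8
d_k = d_model // h                          # 64 dimensions per head
X = np.random.randn(seq_len, d_model)

heads = []
for i in range(h):
    # each head has its own projections (random stand-ins for learned weights)
    W_Q = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    W_K = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    W_V = np.random.randn(d_model, d_k) / np.sqrt(d_model)
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# concatenate the h outputs of size d_k back to d_model, then project with W_O
W_O = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)                            # (3, 512)
```

Because each head works in its own 64-dimensional subspace, the heads are free to specialize in different relationships, as in the examples above.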
Full Transformer Architecture
ENCODER (processes input):
Input Embedding + Positional Encoding
→ [Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm] × N layers
DECODER (generates output):
Output Embedding + Positional Encoding
→ [Masked Multi-Head Self-Attention → Add & Norm
→ Cross-Attention (attend to encoder) → Add & Norm
→ Feed-Forward → Add & Norm] × N layers
→ Linear → Softmax → Output Probabilities
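The decoder's masked self-attention can be sketched by setting the scores for future positions to -inf before the softmax, so each token attends only to itself and earlier tokens:

```python
import numpy as np

np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

scores = Q @ K.T / np.sqrt(d_k)
# causal mask: position i may not attend to any position j > i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                     # exp(-inf) = 0 after softmax

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))                # upper triangle is exactly 0
```

This mask is what makes GPT-style generation possible: at training time, every position predicts its next token without "cheating" by looking ahead.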
Positional Encoding
Since Transformers process all words in parallel (no recurrence), they have no sense of word order. Positional encoding adds position information to each word embedding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This gives each position a unique pattern that the model can learn to use.
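The sinusoidal formulas above translate to a few lines of NumPy (array shapes chosen for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0 alternates sin(0)=0 and cos(0)=1
```

The resulting matrix is simply added to the word embeddings before the first layer.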
Code: Simple Self-Attention
import numpy as np
def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)  # Scaled dot product
    scores = scores - scores.max(axis=-1, keepdims=True)  # Stabilize softmax
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)  # Softmax
    output = np.matmul(weights, V)  # Weighted sum of values
    return output, weights
# Example: 3 words, embedding dim = 4
np.random.seed(42)
embeddings = np.random.randn(3, 4) # 3 words, 4 dimensions
# In practice, Q/K/V come from learned projections
# Here we simplify: Q = K = V = embeddings
output, attention_weights = self_attention(embeddings, embeddings, embeddings)
print("Attention weights (which words attend to which):")
print(np.round(attention_weights, 3))
print("\nOutput (context-aware representations):")
print(np.round(output, 3))
BERT vs GPT
BERT (Encoder-only)
Bidirectional — sees full context on both sides of each token. Pre-trained with masked language modeling (predicting masked-out tokens). Best for: classification, NER, QA, embeddings.
GPT (Decoder-only)
Autoregressive — left-to-right only. Pre-trained to predict next token. Best for: text generation, conversation, code.
Why Transformers Won
| Feature | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (slow) | Fully parallel (fast) |
| Long-range dependencies | Vanishing gradients | Direct attention to any position |
| Training speed | Slow on long sequences | Fast with GPU parallelism |
| Scalability | Limited | Scales to billions of parameters |
Self-attention has O(n²) time and memory complexity in the sequence length. For very long sequences (>4096 tokens), sparse-attention variants like Longformer reduce the asymptotic cost, while kernels like FlashAttention compute exact attention much more efficiently on GPUs.
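A quick back-of-the-envelope on why O(n²) hurts: the attention weight matrix alone, stored in float32 (4 bytes per entry), grows quadratically with sequence length, and a real model needs one per head per layer:

```python
# Memory for a single n-by-n float32 attention matrix
for n in (512, 4096, 32768):
    mib = n * n * 4 / 2**20
    print(f"n = {n:>6}: {mib:8.1f} MiB")
# n =    512:      1.0 MiB
# n =   4096:     64.0 MiB
# n =  32768:   4096.0 MiB
```

Going from 4096 to 32768 tokens (8×) multiplies the attention memory by 64, which is why long-context work focuses on avoiding materializing this matrix.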