Transformer Architecture

The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., replaced the recurrence used in earlier sequence models with a design built entirely on attention, and it now underlies most large language models. Here is a breakdown of how it works:

Core Idea: Self-Attention

Instead of processing sequences step-by-step (like RNNs), Transformers process all tokens in parallel and learn relationships between every pair of tokens simultaneously. This is done via self-attention.

For each token, three vectors are computed from its embedding through learned linear projections:

Query (Q) — what this token is looking for
Key (K) — what this token offers
Value (V) — the actual content to pass forward

Attention is computed as: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

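The scaling by √d_k keeps the dot products from growing with the dimension d_k. Without it, large scores push the softmax into regions where its gradients are extremely small, which slows learning.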
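The formula above can be sketched in a few lines of NumPy. This is a minimal, single-head, unbatched illustration rather than a reference implementation; the function names and the random test inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights

# Illustrative inputs: 4 tokens, d_k = 8, d_v = 16 (arbitrary sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each row of weights sums to 1, so every output row is a weighted average of the rows of V.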

Multi-Head Attention

Rather than computing attention once, the model runs several attention heads in parallel, each with its own learned projections of Q, K, and V, so different heads can capture different kinds of relationships (e.g., syntactic, semantic, co-reference). The head outputs are concatenated and passed through a final linear projection.
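A rough sketch of the split, attend, concatenate, project sequence described above, assuming square (d_model × d_model) projection matrices and a d_model that divides evenly by the number of heads. The names W_q, W_k, W_v, and W_o are chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Illustrative sizes and random weights (assumptions for the example).
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 16, 4, 5
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(y.shape)  # (5, 16)
```

With num_heads set to 1, this reduces to ordinary single-head attention followed by the output projection.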

Positional Encoding

Since there’s no recurrence, the model has no inherent sense of order. Positional encodings (sinusoidal functions or learned embeddings) are added to token embeddings to inject sequence position information.
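As an illustration of the sinusoidal option, the sketch below follows the scheme from the original paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the all-zero placeholder embeddings are assumptions for the example, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Returns a (seq_len, d_model) array; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
token_embeddings = np.zeros((10, 16))  # placeholder for real token embeddings
x = token_embeddings + pe              # position information is simply added
print(pe.shape)  # (10, 16)
```

At position 0 every sine term is 0 and every cosine term is 1, and all values stay within [-1, 1].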

The Encoder-Decoder Structure

The original Transformer had two stacks:

Encoder (used in models like BERT):

Input embeddings + positional encoding
Multi-head self-attention
Feed-forward network (FFN)
Layer norm + residual connections around each sub-layer
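The "layer norm + residual" pattern can be sketched as a small wrapper applied around each sub-layer. The version below follows the original paper's ordering, LayerNorm(x + Sublayer(x)). The learned gain and bias of layer norm are omitted, and toy_attention and toy_ffn are simplified stand-ins invented for the example, not real trained sub-layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer norm.
    return layer_norm(x + sublayer(x))

def encoder_layer(x, self_attention, feed_forward):
    x = add_and_norm(x, self_attention)  # sub-layer 1
    x = add_and_norm(x, feed_forward)    # sub-layer 2
    return x

# Hypothetical stand-in sub-layers with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def toy_attention(h):
    # Unprojected single-head self-attention.
    scores = h @ h.T / np.sqrt(h.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

def toy_ffn(h):
    return np.maximum(0, h @ W1) @ W2

x = rng.normal(size=(seq_len, d_model))
out = encoder_layer(x, toy_attention, toy_ffn)
print(out.shape)  # (6, 8)
```

Many later models apply the normalization before each sub-layer instead of after it; the sketch shows only the original arrangement.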

Decoder (the basis for models like GPT):

Masked self-attention in place of the encoder’s unmasked self-attention (each token can attend only to itself and earlier tokens)
Cross-attention layer that attends to the encoder’s output
FFN, with layer norm + residual connections around each sub-layer
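Masking in the decoder's self-attention is usually implemented by setting the scores for future positions to negative infinity before the softmax, so those positions receive zero weight. A minimal single-head sketch, with function names and random inputs chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Single-head attention in which position i sees only positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # exp(-inf) = 0, so masked weights vanish
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The first token can attend only to itself, so its output is exactly the first row of V.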

Modern LLMs like GPT are decoder-only and drop the cross-attention layer, since there is no encoder output to attend to; models like BERT are encoder-only.

Feed-Forward Network (FFN)

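After attention, each position passes independently through the same two-layer MLP: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

In the original paper the hidden layer is four times wider than the model dimension (2048 vs. 512).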

This adds non-linearity and expands the model’s representational capacity.
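The FFN formula translates almost directly into code. In this sketch the sizes (d_model = 8, d_ff = 32) and the random weights are illustrative assumptions; the 4× hidden-width ratio mirrors the original paper's configuration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)

# Position-wise: running one token alone gives the same result for that token.
single = feed_forward(x[2:3], W1, b1, W2, b2)
print(np.allclose(single[0], y[2]))  # True
```

Because no information moves between positions here, all mixing across tokens happens in the attention sub-layer.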

Why Transformers Won

Property	RNNs/LSTMs	Transformers
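Parallelism	Sequential	Fully parallel across positions during training
Long-range dependencies	Struggles (signal passes through many steps)	Direct (any two tokens are one attention step apart)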
Training speed	Slow	Fast (on GPUs/TPUs)
Scalability	Limited	Scales to billions of params

Key Variants

BERT — Encoder-only, trained with masked language modeling; great for classification and understanding tasks.
GPT — Decoder-only, trained autoregressively; great for generation.
T5 / BART — Full encoder-decoder; great for seq2seq tasks like translation and summarization.
Vision Transformer (ViT) — Applies the same architecture to image patches instead of text tokens.

The Transformer’s combination of parallelism, expressiveness, and scalability is what enabled the modern LLM era.
