Module 13: Transformers

A transformer block’s cost profile has two regimes that fight for the same HBM: attention at O(N²) memory in sequence length, and MLPs at O(N · d²) compute in hidden width. Stack twelve blocks and the attention matrices alone can exceed the weight matrices once N crosses a few thousand tokens, which is why every production LLM ships with KV caching, activation checkpointing, and attention kernels tuned for the SRAM hierarchy. This module wires LayerNorm, MLPs, and causal self-attention into a working GPT so you can see exactly which components hit which wall first.

Note: Module Info

ARCHITECTURE TIER | Difficulty: ●●●● | Time: 8-10 hours | Prerequisites: 01-08, 10-12

You need tensors, layers, training loops, tokenization, embeddings, and attention already in place. If you can explain how multi-head attention turns queries, keys, and values into a weighted representation, you are ready for this chapter.


Overview

This is the chapter where everything snaps together. You have tensors, autograd, layers, a training loop, embeddings, and attention. In this module you wire them into a transformer block, stack the blocks, and end up with a working GPT — the same architecture that powers GPT, Claude, and LLaMA. By the end you can run model.generate(prompt) on something you wrote yourself.

A transformer block is a small recipe: layer-normalize, run multi-head attention, add a residual; layer-normalize, run an MLP, add a residual. That is it. Stack twelve of those between an embedding table and a language head, train on next-token prediction, and you have a 60M-parameter language model that produces coherent text.

The patterns you implement here — pre-norm, residual streams, causal masking, 4× MLPs — are exactly what runs in production at billion-token scale. The optimizations differ; the architecture does not.

Learning Objectives

Tip: By completing this module, you will:
  • Implement layer normalization to stabilize training across deep networks with learnable scale and shift parameters
  • Design complete transformer blocks combining self-attention, feed-forward networks, and residual connections using pre-norm architecture
  • Build a full GPT model with token embeddings, positional encoding, stacked transformer blocks, and autoregressive generation
  • Analyze parameter scaling and memory requirements, understanding why attention memory grows quadratically with sequence length
  • Master causal masking to enable autoregressive generation while preventing information leakage from future tokens

What You’ll Build

Figure 1 shows the full GPT stack you will assemble, from token IDs at the bottom to logits at the top.

Figure 1: Complete GPT architecture as a vertical stack. Token IDs -> Embeddings (token + position) -> Transformer Block 1 -> Transformer Block 2 -> ... -> Block N (repeated) -> Final LayerNorm -> Language Head.

Implementation roadmap:

Table 1 lays out the implementation in order, one part at a time.

Table 1: Implementation roadmap for LayerNorm, MLP, TransformerBlock, and GPT.
| Step | What You'll Implement | Key Concept |
| --- | --- | --- |
| 1 | LayerNorm with learnable gamma/beta | Stabilizes training by normalizing activations |
| 2 | MLP with 4x expansion and GELU | Provides non-linear transformation capacity |
| 3 | TransformerBlock with pre-norm architecture | Combines attention and MLP with residual connections |
| 4 | GPT model with embeddings and blocks | Complete autoregressive language model |
| 5 | Autoregressive generation with temperature | Text generation with controllable randomness |

The pattern you’ll enable:

# Building and using a complete language model
model = GPT(vocab_size=50000, embed_dim=768, num_layers=12, num_heads=12)
logits = model.forward(tokens)  # Process input sequence
generated = model.generate(prompt, max_new_tokens=50)  # Generate text

What You’re NOT Building (Yet)

To keep this module focused, you will not implement:

  • KV caching for efficient generation (production systems cache keys/values to avoid recomputation)
  • FlashAttention or other memory-efficient attention (PyTorch uses specialized CUDA kernels)
  • Mixture of Experts or sparse transformers (advanced scaling techniques)
  • Multi-query or grouped-query attention (used in modern LLMs for efficiency)

You are building the canonical transformer architecture. Optimizations come later.

API Reference

This section documents the transformer components you’ll implement. Each class builds on the previous, culminating in a complete language model.

Helper Functions

create_causal_mask

create_causal_mask(seq_len: int) -> Tensor

Creates a causal (autoregressive) attention mask that prevents positions from attending to future positions. Returns a lower triangular matrix where position i can only attend to positions j ≤ i.

Returns: Tensor of shape (1, seq_len, seq_len) with 1.0 for allowed positions, 0.0 for masked positions.

LayerNorm

LayerNorm(normalized_shape: int, eps: float = 1e-5) -> LayerNorm

Normalizes activations across features for each sample independently. Essential for stable training of deep transformer networks.

Core Methods:

Table 2 lists the methods on this class.

Table 2: Core methods on the LayerNorm class.
| Method | Signature | Description |
| --- | --- | --- |
| forward | forward(x: Tensor) -> Tensor | Normalize across last dimension with learnable scale/shift |
| parameters | parameters() -> List[Tensor] | Returns [gamma, beta] learnable parameters |

MLP (Multi-Layer Perceptron)

MLP(embed_dim: int, hidden_dim: int = None, dropout_prob: float = 0.1) -> MLP

Feed-forward network with 4x expansion, GELU activation, and projection back to original dimension.

Core Methods:

Table 3 lists the methods on this class.

Table 3: Core methods on the MLP feed-forward block.
| Method | Signature | Description |
| --- | --- | --- |
| forward | forward(x: Tensor) -> Tensor | Apply Linear → GELU → Linear transformation |
| parameters | parameters() -> List[Tensor] | Returns weights and biases from both layers |

TransformerBlock

TransformerBlock(embed_dim: int, num_heads: int, mlp_ratio: int = 4, ff_dim: int = None, dropout_prob: float = 0.1) -> TransformerBlock

Complete transformer block with self-attention, MLP, layer normalization, and residual connections using pre-norm architecture.

Core Methods:

Table 4 lists the methods on this class.

Table 4: Core methods on the TransformerBlock class.
| Method | Signature | Description |
| --- | --- | --- |
| forward | forward(x: Tensor, mask: Tensor = None) -> Tensor | Process sequence through attention and MLP sub-layers |
| parameters | parameters() -> List[Tensor] | Returns all parameters from attention, norms, and MLP |

GPT

GPT(vocab_size: int, embed_dim: int, num_layers: int, num_heads: int, max_seq_len: int = 1024) -> GPT

Complete GPT model for autoregressive language modeling with token embeddings, positional encoding, stacked transformer blocks, and generation capability. The architecture combines token and positional embeddings, processes through multiple transformer blocks with causal masking, applies final layer normalization, and projects to vocabulary logits.
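
To make that data flow concrete, here is a shape-level NumPy sketch. All names are illustrative stand-ins (the real module uses TinyTorch Tensors and the blocks you build below), and the transformer blocks themselves are elided:

import numpy as np

# Shape-level sketch of the GPT forward pass described above (illustrative only)
vocab, embed_dim, seq, batch = 100, 16, 8, 2
tok_emb = np.random.randn(vocab, embed_dim) * 0.02   # token embedding table
pos_emb = np.random.randn(1024, embed_dim) * 0.02    # position embedding table
lm_head = tok_emb.T                                  # weight tying: reuse embeddings

tokens = np.random.randint(0, vocab, size=(batch, seq))
x = tok_emb[tokens] + pos_emb[:seq]                  # (batch, seq, embed_dim)
# ... N transformer blocks with causal masking would transform x here ...
logits = x @ lm_head                                 # (batch, seq, vocab)
print(logits.shape)                                  # (2, 8, 100)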

Core Methods:

Table 5 lists the methods on this class.

Table 5: Core methods on the complete GPT model class.
| Method | Signature | Description |
| --- | --- | --- |
| forward | forward(tokens: Tensor) -> Tensor | Compute vocabulary logits for each position with causal masking |
| generate | generate(prompt_tokens: Tensor, max_new_tokens: int = 50, temperature: float = 1.0) -> Tensor | Autoregressively generate text using temperature-controlled sampling |
| parameters | parameters() -> List[Tensor] | Returns all model parameters from embeddings, blocks, and output head |
| _create_causal_mask | _create_causal_mask(seq_len: int) -> Tensor | Internal method creating the lower triangular causal mask for autoregressive attention |
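
The generation loop reduces to a few lines. The sketch below is NumPy-only and hedged: model_forward is a hypothetical stand-in for GPT.forward, not the module's actual API.

import numpy as np

def sample_next_token(logits, temperature, rng):
    # Temperature < 1 sharpens the distribution; > 1 flattens it toward uniform
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())            # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate_sketch(model_forward, prompt_tokens, max_new_tokens=50, temperature=1.0):
    # Autoregressive loop: feed the growing sequence back in at every step
    tokens = list(prompt_tokens)
    rng = np.random.default_rng(0)
    for _ in range(max_new_tokens):
        logits = model_forward(np.array(tokens)[None, :])   # (1, seq, vocab)
        tokens.append(sample_next_token(logits[0, -1], temperature, rng))
    return tokens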

Core Concepts

This section explores the architectural innovations that make transformers the dominant deep learning architecture. Understanding these concepts deeply will prepare you for both implementing transformers and designing novel architectures.

Layer Normalization: The Stability Foundation

Without normalization, training a network with dozens of layers becomes nearly impossible: activation distributions drift between layers and gradients explode or vanish. Layer norm pins the distribution back to zero mean and unit variance at every step.

Unlike batch normalization, which mixes statistics across the batch dimension, layer norm normalizes each sample independently across its features. That independence is what makes it work for variable-length sequences: a batch containing a 10-token tweet and a 500-token paragraph would give batch norm meaningless mixed statistics, but layer norm treats each position on its own.

Here's the complete implementation (Listing 13.1), showing how a few targeted statistical operations keep the network's forward and backward passes well scaled:

class LayerNorm:
    def __init__(self, normalized_shape, eps=1e-5):
        self.normalized_shape = normalized_shape
        self.eps = eps

        # Learnable parameters initialized to identity transform
        self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)
        self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)

    def forward(self, x):
        # Compute statistics across last dimension (features)
        mean = x.mean(axis=-1, keepdims=True)
        diff = x - mean
        variance = (diff * diff).mean(axis=-1, keepdims=True)

        # Normalize to zero mean, unit variance
        std = Tensor(np.sqrt(variance.data + self.eps))
        normalized = diff / std  # reuse diff computed above

        # Apply learnable transformation
        return normalized * self.gamma + self.beta

Listing 13.1: LayerNorm implementation. Computes per-sample mean and variance over the feature axis, then applies a learnable affine transform.

The formula is small: output = (x - μ) / σ * γ + β. The effect is large. Forcing every activation back to a consistent scale keeps gradients in a sane range, which is the difference between a 12-layer model that trains and a 12-layer model that diverges in the first 100 steps. The learnable gamma and beta let the model recover any distribution it needs, so the normalization step costs you no expressiveness.

The eps = 1e-5 guards against the rare case where every feature in a row is identical and variance hits zero. Without it, you divide by zero on the first batch.
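
A quick NumPy sanity check (standalone, mirroring the forward pass above) shows both the normalization and the eps guard at work:

import numpy as np

# After normalization, each row should have mean ~0 and std ~1; a constant
# row maps to zeros instead of NaN thanks to eps
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 5.0, 5.0, 5.0]])    # second row has zero variance
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
normed = (x - mean) / np.sqrt(var + 1e-5)
print(normed.mean(axis=-1))             # ~[0, 0]
print(normed.std(axis=-1))              # ~[1, 0]: no division by zero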

Pre-Norm Architecture and Residual Connections

The original transformer (Vaswani et al., 2017) put layer norm after each sub-layer. Modern transformers put it before. The fix sounds trivial — swap the order — but it is what lets you train 24, 48, or 100 layers without warmup tricks. The pattern is: normalize, transform, add residual.

Residual connections are the gradient highways. When you write x + f(x), backpropagation gets two paths home: through the transformation f, and straight through the + x shortcut. Even if ∂f/∂x is tiny, the shortcut contributes a clean 1 to the gradient, so signal still reaches the early layers in a 100-layer stack.

Here's how the transformer block implements pre-norm with residuals (Listing 13.2):

def forward(self, x, mask=None):
    # First sub-layer: attention with pre-norm
    normed1 = self.ln1.forward(x)
    attention_out = self.attention.forward(normed1, mask)
    x = x + attention_out  # Residual connection

    # Second sub-layer: MLP with pre-norm
    normed2 = self.ln2.forward(x)
    mlp_out = self.mlp.forward(normed2)
    output = x + mlp_out  # Residual connection

    return output

Listing 13.2: Pre-norm transformer block forward pass. LayerNorm runs before each sub-layer while residual additions write back into the unnormalized stream.

Notice the asymmetry: each sub-layer reads a normalized copy of the input but writes its contribution back into the unnormalized residual stream. The normalized path keeps the sub-layer well-behaved; the residual path keeps information flowing intact across the depth of the network.

The MLP: Computational Capacity Through Expansion

Attention handles relationships between tokens. The MLP handles transformation within each token, position by position, independently. Together they cover both axes of the sequence.

The standard recipe is wide-then-narrow: expand the embedding to 4× its width, apply GELU, project back. The expansion gives the model a high-dimensional scratch space to disentangle features; the projection forces it to compress what matters back into the residual stream. The 4× ratio is empirical, not principled — it is what worked in the original paper and has stuck ever since.

Listing 13.3 makes this concrete.

class MLP:
    def __init__(self, embed_dim, hidden_dim=None):
        if hidden_dim is None:
            hidden_dim = 4 * embed_dim  # Standard 4x expansion

        self.linear1 = Linear(embed_dim, hidden_dim)
        self.gelu = GELU()
        self.linear2 = Linear(hidden_dim, embed_dim)

    def forward(self, x):
        hidden = self.linear1.forward(x)
        hidden = self.gelu.forward(hidden)
        output = self.linear2.forward(hidden)
        return output

Listing 13.3: Position-wise MLP. Expands to 4x hidden width, applies GELU, and projects back. Holds most of the block's parameters.

GELU replaced ReLU in transformer models because it gates smoothly instead of with a hard cutoff at zero, which gives cleaner gradients for language modeling. The choice matters less than the width: most modern variants (GELU, SwiGLU, GeGLU) train to similar loss curves at this scale.

The MLP dominates the parameter count. For embed_dim = 512, the first projection has 512 × 2048 + 2048 = 1,050,624 (~1.05M) parameters, the second has 2048 × 512 + 512 = 1,049,088 (~1.05M), for ~2.1M per block. In a 12-layer model that is ~25.2M parameters from MLPs alone — more than attention and embeddings combined.
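
You can verify that arithmetic in a few lines of plain Python:

# Verify the MLP parameter counts above
embed_dim, hidden_dim = 512, 4 * 512
up = embed_dim * hidden_dim + hidden_dim      # 1,050,624 weights + biases
down = hidden_dim * embed_dim + embed_dim     # 1,049,088
per_block_mlp = up + down                     # ~2.1M
print(per_block_mlp, 12 * per_block_mlp)      # 2099712  25196544 (~25.2M over 12 blocks)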

Causal Masking for Autoregressive Generation

GPT is an autoregressive model: it predicts each token based only on previous tokens. During training, the model sees the entire sequence, but causal masking ensures position i cannot attend to positions j > i. This prevents information leakage from the future.

The causal mask is a lower triangular matrix of ones; the zeros above the diagonal mark the positions that attention will later fill with negative infinity:

def create_causal_mask(seq_len: int) -> Tensor:
    # Lower triangle = 1 (can attend), upper triangle = 0 (cannot attend)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.float32))
    return Tensor(mask[np.newaxis, :, :])

For a 4-token sequence, this creates:

[[1, 0, 0, 0],   # Position 0 only sees itself
 [1, 1, 0, 0],   # Position 1 sees 0, 1
 [1, 1, 1, 0],   # Position 2 sees 0, 1, 2
 [1, 1, 1, 1]]   # Position 3 sees everything

Inside attention, these zeros become -inf in the logits before softmax. After softmax, -inf collapses to exactly 0 probability, so future positions contribute nothing to the weighted sum. The mask is what lets you train on a 2048-token sequence in a single parallel pass while still computing every prediction as if you only knew the past.
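
Here is the whole mask-to-probability path as a standalone NumPy sketch; in the module itself this happens inside your attention implementation:

import numpy as np

# How the 0/1 mask becomes -inf logits and then zero probabilities
seq_len = 4
scores = np.random.randn(seq_len, seq_len)            # raw attention logits
mask = np.tril(np.ones((seq_len, seq_len)))           # 1 = allowed, 0 = masked
scores = np.where(mask == 1, scores, -np.inf)         # future positions -> -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
print(np.triu(weights, k=1).sum())                    # 0.0: no future leakage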

Complete Transformer Block Architecture

The transformer block is where all components unite into a coherent processing unit. Each block transforms the input sequence through two sub-layers: multi-head self-attention and MLP, each wrapped with layer normalization and residual connections.

Listing 13.4 makes this concrete.

class TransformerBlock:
    def __init__(self, embed_dim, num_heads, mlp_ratio=4):
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ln1 = LayerNorm(embed_dim)  # Before attention
        self.ln2 = LayerNorm(embed_dim)  # Before MLP
        hidden_dim = int(embed_dim * mlp_ratio)
        self.mlp = MLP(embed_dim, hidden_dim)

    def forward(self, x, mask=None):
        # First sub-layer: attention with residual
        normed1 = self.ln1.forward(x)
        attention_out = self.attention.forward(normed1, mask)
        x = x + attention_out  # Residual connection

        # Second sub-layer: MLP with residual
        normed2 = self.ln2.forward(x)
        mlp_out = self.mlp.forward(normed2)
        output = x + mlp_out  # Residual connection

        return output

Listing 13.4: Complete TransformerBlock class. Wires MultiHeadAttention, two LayerNorms, and an MLP into a single pre-norm block with residual connections.

Think of the data flow as a residual stream: the input embeddings enter, and every sub-layer adds its contribution on top without overwriting what came before. By the final block, the stream is the original embeddings plus contributions from every attention and MLP sub-layer in the stack — like a stack of transparencies, each adding detail to the same underlying image.

This is why transformers scale to hundreds of layers while plain MLPs choke at ten. Each layer’s job is to adjust the stream, not replace it. Backprop pushes gradients through these small additive corrections, and the residual shortcuts keep them from decaying as they travel.

Parameter Scaling and Memory Requirements

Two scaling laws govern transformer cost. Parameters scale roughly with embed_dim² (because attention and MLP weights are square-ish matrices in that dimension). Attention activations scale with seq_len² (every token attends to every other). Either one can dominate your hardware budget; both will, eventually.

For a single transformer block with embed_dim = 512 and num_heads = 8:

Table 6 breaks down the parameter count for a single transformer block.

Table 6: Parameter count breakdown for a single 512-dim, 8-head transformer block.
| Component | Parameters | Calculation |
| --- | --- | --- |
| Multi-Head Attention | ~1.05M | 4 × (512 × 512) for Q, K, V, O projections |
| Layer Norm 1 | 1K | 2 × 512 for gamma, beta |
| MLP | ~2.1M | (512 × 2048 + 2048) + (2048 × 512 + 512) |
| Layer Norm 2 | 1K | 2 × 512 for gamma, beta |
| Total per block | ~3.2M | Dominated by MLP, then attention |

For a complete GPT model, add embeddings and output projection:

  • Token embeddings: vocab_size × embed_dim = 50000 × 512 = 25.6M
  • Position embeddings: max_seq_len × embed_dim = 2048 × 512 = 1.0M
  • Transformer blocks: 12 × ~3.15M = ~37.8M
  • Output projection: embed_dim × vocab_size (typically tied to token embeddings, so 0 extra)

Grand total: ~64M parameters for this configuration. GPT-2 small was ~117M; GPT-3 is 175B. The arithmetic does not change — you just multiply the same blocks by larger dimensions and more layers.
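
The grand total comes from a short calculation; this plain-Python sketch reproduces it under the stated assumptions (attention biases ignored, output head tied to the token embeddings):

# Reproduce the grand total above
def gpt_params(vocab, d, layers, max_seq=2048, tied_head=True):
    attn = 4 * d * d                       # Q, K, V, O projections (biases ignored)
    mlp = (d * 4*d + 4*d) + (4*d * d + d)  # up + down projections with biases
    norms = 2 * 2 * d                      # two LayerNorms (gamma, beta) per block
    block = attn + mlp + norms
    emb = vocab * d + max_seq * d          # token + position embeddings
    head = 0 if tied_head else d * vocab
    return emb + layers * block + head

print(f"{gpt_params(50_000, 512, 12) / 1e6:.1f}M")   # ~64.5M, matching the ~64M above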

Memory at training time has three components:

  1. Parameter memory — linear with model size, stored once
  2. Activation memory — needed for backprop, grows with batch size and sequence length
  3. Attention memory — quadratic in sequence length, the bottleneck that bites first

The attention bottleneck is why long context is expensive. For a batch of 4 sequences with 8 heads in float32:

Table 7 shows how attention memory grows with sequence length.

Table 7: Attention memory growth with sequence length for a fixed-size transformer.
| Sequence Length | Attention Matrix Size | Memory (MB) |
| --- | --- | --- |
| 512 | 4 × 8 × 512 × 512 | 32.0 |
| 1024 | 4 × 8 × 1024 × 1024 | 128.0 |
| 2048 | 4 × 8 × 2048 × 2048 | 512.0 |
| 4096 | 4 × 8 × 4096 × 4096 | 2048.0 |

Double the sequence length, quadruple the memory — per layer. A 12-layer model at 4K context burns 24 GB just on attention matrices. This is the wall that drove FlashAttention, sparse attention, and linear attention into existence.
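
Table 7 comes from a single formula; a two-line helper reproduces it:

# Reproduce Table 7 (float32 = 4 bytes per element)
def attn_matrix_mb(batch, heads, seq_len, bytes_per=4):
    return batch * heads * seq_len**2 * bytes_per / 2**20

for n in (512, 1024, 2048, 4096):
    print(n, attn_matrix_mb(4, 8, n))   # 32.0, 128.0, 512.0, 2048.0 MB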

Per transformer block, the cost profile decomposes cleanly into two regimes:

  • Attention: compute is O(N² · d), memory is O(N²) — quadratic in sequence length N.
  • MLP: compute is O(N · d²), memory is O(N · d) — linear in N but quadratic in the hidden width d.

Attention overtakes the MLP once N exceeds d. Modern serving pushes context lengths into the thousands and tens of thousands of tokens, routinely past that crossover, and this is why the attention matrix — not the weight matrices — is the first thing that blows up your HBM budget.

Note: Systems Implication — KV-Cache Memory Growth During Autoregressive Decoding

The O(N²) scaling story above is about training. Autoregressive inference has a different — and equally brutal — memory problem: the KV cache. When generating tokens one at a time, naive attention would recompute the keys and values for every previous token at every step (the 66× overhead shown in the generation efficiency question below). Production inference instead caches K and V for every token ever seen, trading compute for memory.

The cache size per request is 2 × num_layers × num_heads × head_dim × seq_len × bytes — two copies (K and V), one per layer, per head. For a 70 B Llama-style model at 32K context in FP16, that is roughly 20 GB per request, often larger than the model’s own weight footprint. At scale this is the dominant cost: inference servers are memory-bound on the KV cache, not FLOP-bound on the matmuls. This is why recent architectures use multi-query attention (MQA) and grouped-query attention (GQA) — they share keys and values across heads to shrink the cache by 4–8×, trading a small amount of model quality for a huge cut in serving memory. Every optimization in modern LLM inference stacks — paged attention, prefix caching, speculative decoding — is ultimately a strategy for managing the KV cache.
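
Plugging numbers into that formula shows both the scale of the problem and what GQA buys. The 80-layer, 128-dim-head configuration below is an illustrative assumption for a 70B-class model, not any specific model's published spec; exact figures vary with the actual KV-head count.

# KV-cache size per request, from the formula above (FP16 = 2 bytes)
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 2**30

print(kv_cache_gb(80, 64, 128, 32_768))  # ~80 GB with full multi-head attention
print(kv_cache_gb(80, 8, 128, 32_768))   # ~10 GB with 8 grouped KV heads (GQA)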

Production Context

Your Implementation vs. PyTorch

Your transformer implementation and PyTorch’s production transformers share the same architectural principles. The differences lie in optimization: PyTorch uses fused CUDA kernels, memory-efficient attention, and various tricks for speed and scale.

Table 8 places your implementation side by side with the production reference for direct comparison.

Table 8: Feature comparison between TinyTorch transformers and production PyTorch.
| Feature | Your Implementation | PyTorch |
| --- | --- | --- |
| Architecture | Pre-norm transformer blocks | Pre-norm (modern) or post-norm (legacy) |
| Attention | Standard scaled dot-product | FlashAttention, sparse attention |
| Memory | Full attention matrices | KV caching, memory-efficient attention |
| Precision | Float32 | Mixed precision (FP16/BF16) |
| Parallelism | Single device | Model parallel, pipeline parallel |
| Efficiency | Educational clarity | Production optimization |

Code Comparison

The following comparison shows equivalent transformer usage in TinyTorch and PyTorch. The API patterns are nearly identical because your implementation follows production design principles.

from tinytorch.core.transformers import TransformerBlock, GPT

# Create transformer block
block = TransformerBlock(embed_dim=512, num_heads=8)
output = block.forward(x)

# Create complete GPT model
model = GPT(vocab_size=50000, embed_dim=768, num_layers=12, num_heads=12)
logits = model.forward(tokens)
generated = model.generate(prompt, max_new_tokens=50, temperature=0.8)

import torch.nn as nn

# PyTorch transformer block (norm_first=True selects the pre-norm variant)
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, norm_first=True)
output = block(x)

# Complete model (using HuggingFace transformers)
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
outputs = model.generate(input_ids, max_new_tokens=50, temperature=0.8, do_sample=True)  # sampling enables temperature

Let’s walk through the key similarities and differences:

  • Block creation: Both create transformer blocks with identical hyperparameters. PyTorch provides TransformerEncoderLayer; you built TransformerBlock from scratch.
  • Forward pass: Both process sequences with the same semantics. Your implementation spells out attention and MLP explicitly; PyTorch fuses the same computation into optimized kernels.
  • Model creation: Both produce complete language models. PyTorch users typically load pre-trained weights via HuggingFace; you build and train from scratch.
  • Generation: Both support autoregressive generation with temperature control. HuggingFace adds beam search, top-k/top-p sampling, and other decoding strategies.
Tip: What's Identical

The core architecture, pre-norm pattern, residual connections, and causal masking are identical. When you debug transformer models in PyTorch, you’ll understand exactly what’s happening because you built it yourself.

Why Transformers Matter at Scale

The architecture you just built is the same one that runs at the frontier. The difference is the dimensions.

  • GPT-3 (175B parameters) — 350 GB just to store weights in float16, 700 GB for mixed-precision training state.
  • Training cost — roughly $4.6M in compute, ~10,000 GPUs for weeks.
  • Inference latency — 100–200 ms to process 2048 tokens on optimized hardware.
  • Context scaling — going from 2K to 32K context costs 256× more attention memory per layer.

Those numbers exist because the architecture you built has no free lunch baked in: every doubling of context squares the attention cost, every doubling of width squares the parameter count. The next chapter is where you start measuring those costs in your own model — and the chapters after that are where you start beating them down.

Tip: Check Your Understanding — Transformers

Before moving on, verify you can articulate each of the following: why pre-norm with residual connections keeps deep stacks trainable, why attention memory grows quadratically with sequence length, how causal masking prevents future-token leakage, and where a transformer block's parameters actually live.

If any of these feels fuzzy, revisit the Core Concepts section (especially Pre-Norm Architecture and Residual Connections and Parameter Scaling and Memory Requirements) before moving on.

Self-Check Questions

Test your understanding of transformer architecture and scaling with these systems thinking questions.

Q1: Attention Memory Calculation

A transformer with batch_size=8, num_heads=16, seq_len=2048 computes attention matrices at each layer. How much memory does one layer’s attention matrices consume? How does this scale if you double the sequence length to 4096?

Attention matrix size: batch_size × num_heads × seq_len × seq_len = 8 × 16 × 2048 × 2048 = 536,870,912 elements

Memory: 536,870,912 × 4 bytes (float32) = 2,147,483,648 bytes = 2.0 GB

Doubling sequence length to 4096: 8 × 16 × 4096 × 4096 = 2,147,483,648 elements × 4 bytes = 8.0 GB

Scaling: doubling sequence length quadruples memory (4× increase). This quadratic scaling is why long context is expensive — and why sparse attention, FlashAttention, and linear-attention variants exist.

Q2: Parameter Distribution Analysis

For a GPT model with vocab_size=50000, embed_dim=768, num_layers=12, num_heads=12, calculate approximate total parameters. Which component dominates the parameter count: embeddings or transformer blocks?

Token Embeddings: 50000 × 768 = 38.4M

Position Embeddings: 2048 × 768 = 1.6M (assuming max_seq_len=2048)

Transformer Blocks: each block has ~7.1M parameters with embed_dim=768

  • Attention: 4 × (768 × 768) = ~2.4M
  • MLP: (768 × 3072 + 3072) + (3072 × 768 + 768) = ~4.7M
  • Layer norms: negligible
  • Per block: ~7.1M
  • Total blocks: 12 × 7.1M = 85M

Output Projection: usually tied to embeddings (0 additional)

Total: 38.4M + 1.6M + 85M ≈ 125M parameters, close to GPT-2 small.

Dominant component: transformer blocks (85M) outweigh embeddings (40M). As models scale, blocks pull further ahead because they grow with embed_dim² while embeddings grow only linearly with vocab size.

Q3: Residual Connection Benefits

Why do transformers use residual connections (x + f(x)) rather than just f(x)? How do residual connections affect gradient flow during backpropagation in a 24-layer transformer?

Without residual connections (y = f(x)):

  • Gradients must flow through every transformation layer.
  • Each layer’s Jacobian can shrink gradients (vanishing) or amplify them (exploding).
  • Across 24 layers, gradients can collapse to ~0 or blow up to ~∞.

With residual connections (y = x + f(x)):

  • Backprop gives ∂y/∂x = 1 + ∂f/∂x.
  • The +1 is a direct gradient path that does not depend on the sub-layer’s weights.
  • Even if ∂f/∂x is small, the +1 keeps the signal alive.
  • The result is a stack of “gradient highways” running the depth of the network.

24-layer impact: without residuals, a 0.9 per-layer attenuation compounds to 0.9²⁴ ≈ 0.08. With residuals, the +1 shortcut delivers gradients to early layers at full strength. This is why transformers scale to 100+ layers while plain feed-forward nets struggle past 10.
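
Two lines of Python show the compounding:

# The compounding in numbers
attenuation = 0.9
print(attenuation ** 24)   # ~0.0798: signal nearly gone without residuals
# With residuals, each layer's local Jacobian is 1 + df/dx, so the "+1"
# shortcut preserves a full-strength gradient path at any depth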

Q4: Autoregressive Generation Efficiency

Your generate() method processes the entire sequence for each new token. For generating 100 tokens with prompt length 50, how many total forward passes occur? Why is this inefficient?

Current implementation: for each of 100 new tokens, reprocess the entire sequence.

  • Token 1: process 50 tokens (prompt)
  • Token 2: process 51 tokens (prompt + 1)
  • Token 3: process 52 tokens
  • Token 100: process 149 tokens

Total forward passes: 50 + 51 + 52 + ... + 149 = 9,950 token processings

Why inefficient: attention recomputes key/value projections for every previous token at every step, even though those projections do not change. The key/value for position 50 is recomputed 100 times.

KV Caching optimization: store computed key/value projections for previous tokens.

  • Each new token only computes its own key/value.
  • Attention uses cached keys/values from previous tokens.
  • Total computation: 50 (prompt) + 100 (new tokens) = 150 token processings.

Speedup: 9,950 / 150 ≈ 66× faster for this example. The ratio grows with generation length, which is why KV caching is non-negotiable in production inference stacks.
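
The counts reduce to a one-line sum, sketched here in plain Python:

# The token-processing counts above
prompt, new = 50, 100
naive = sum(range(prompt, prompt + new))         # 50 + 51 + ... + 149 = 9,950
cached = prompt + new                            # 150 with KV caching
print(naive, cached, round(naive / cached, 1))   # 9950 150 66.3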

Q5: Layer Normalization vs Batch Normalization

Why do transformers use layer normalization instead of batch normalization? Consider a batch with sequences of varying lengths: [10 tokens, 50 tokens, 100 tokens].

Batch Normalization normalizes across the batch dimension:

  • For position 5, statistics would mix all three sequences.
  • But sequence 1 has no position 50, sequence 2 has no position 100.
  • With padding, statistics get contaminated by pad tokens.
  • Depends on batch composition: different batches give different statistics.

Layer Normalization normalizes across features for each sample:

  • Each position is normalized independently: (x - mean(x)) / std(x).
  • Position 5 of sequence 1 has no influence on position 50 of sequence 2.
  • No dependency on batch composition.
  • Works naturally with variable-length sequences.

Example: for a tensor of shape (batch=3, seq=10, features=768):

  • Batch norm computes 10 × 768 statistics across the batch dimension (problematic).
  • Layer norm computes 3 × 10 statistics across the feature dimension (independent).

Why it matters: transformers process variable-length sequences. Layer norm treats each position on its own, which makes it robust to length variation and batch composition. Batch norm depends on its batch; layer norm does not.
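
A short NumPy demonstration of that independence: the first row's layer-norm output is identical no matter what else shares the batch (the layer_norm helper here is a standalone re-implementation for the demo).

import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-row normalization: statistics never cross the batch dimension
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

row = np.array([1.0, 2.0, 3.0, 4.0])
batch_a = np.stack([row, np.zeros(4)])
batch_b = np.stack([row, 100 * np.ones(4)])
print(np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0]))  # True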

Key Takeaways

  • A transformer block is a recipe, not a mystery: pre-norm → attention → residual, pre-norm → MLP → residual. Stack twelve of those between embeddings and a language head and you have GPT.
  • Residuals are the gradient highway: x + f(x) gives backprop a +1 shortcut that keeps signal alive through arbitrarily deep stacks — without them, a 24-layer network’s gradients decay to zero.
  • Attention’s O(N²) memory is the first wall you hit: doubling sequence length quadruples attention memory per layer; long context is expensive because of geometry, not implementation.
  • The KV cache dominates inference memory: at serving time, 2 × layers × heads × head_dim × seq_len × bytes often eclipses the model weights themselves, which is why MQA, GQA, and paged attention now ship in every production LLM stack.

Coming next: Module 14 opens the Optimization Tier with measurement — profiling your transformer’s forward pass to see exactly where time and memory are spent before you try to cut either.

Further Reading

The transition from a theoretical transformer diagram to a planet-scale language model is governed by ruthless hardware constraints and brilliant systems engineering. For students seeking to understand the architectural turning points and the hardware-software co-design that made massive models possible, study these foundational texts.

Seminal Papers

  • Attention Is All You Need - Vaswani et al. (2017). The paper that introduced the transformer architecture, revolutionizing sequence modeling. Describes multi-head attention, positional encoding, and the encoder-decoder structure. Systems Implication: Eradicated the sequential compute bottleneck inherent to RNNs. By processing entire sequences simultaneously, the transformer perfectly aligns with the SIMD architecture of modern GPUs, enabling massive parallelization and achieving near-peak hardware utilization. arXiv:1706.03762

  • Language Models are Few-Shot Learners (GPT-3) - Brown et al. (2020). Demonstrates scaling laws and emergent capabilities of large language models. Shows how transformer performance improves predictably with immense scale. Systems Implication: At 175 billion parameters, the model shattered the memory capacity of a single GPU. This forced the systems engineering community to rapidly mature and deploy complex 3D parallelism strategies (data, tensor, and pipeline parallelism) across clusters of thousands of distinct compute nodes. arXiv:2005.14165

  • FlashAttention: Fast and Memory-Efficient Exact Attention - Dao et al. (2022). Overcomes the O(n²) memory scaling limit of attention using IO-aware algorithms, enabling practical long-context processing. Systems Implication: Leveraged precise, hardware-aware SRAM tiling to completely bypass the materialization of the gigantic quadratic attention matrix in HBM. This critical optimization shifted attention from being a catastrophic memory-bound operation back to a highly efficient compute-bound workload. arXiv:2205.14135

  • On Layer Normalization in the Transformer Architecture - Xiong et al. (2020). Mathematically analyzes pre-norm versus post-norm architectures, revealing why pre-norm enables the training of much deeper networks. Systems Implication: Pre-norm architectures prevent catastrophic gradient vanishing or exploding during backpropagation across hundreds of layers. By guaranteeing gradient stability, these architectures ensure that massive, distributed training runs remain convergent, saving millions of dollars in idle or wasted GPU compute. arXiv:2002.04745

What’s Next

You finished the Architecture Tier. You have a transformer that trains, generates text, and matches the structural blueprint of GPT. Everything from here is about making it fast, small, and deployable.

Before that, the next two chapters take the Architecture Tier on a historical test drive. The Architecture Milestones — LeNet-5 (1998) on CIFAR-10 and the 2017 Transformer attention test — run your Conv2d, MaxPool2d, multi-head attention, and TransformerBlock on the exact problems those architectures were invented to solve. Convolutional networks hit 70%+ on natural images; attention clears sequence-reversal tasks RNNs can't. Both are proof that the layers you just wrote behave the way the original papers claimed.

Note: Coming Up — Architecture Milestones, then Module 14: Profiling

First: two Architecture Milestones exercise your Conv/Pool layers (CIFAR-10) and your attention stack (sequence tasks) on their landmark problems. Then the Optimization Tier opens with the only honest place to start: measurement. Before you optimize anything, you need to know where the time goes. In Module 14 you instrument your transformer’s forward pass and answer concrete questions — how much of a step is spent in attention versus the MLP, how memory grows with sequence length on your machine, and which layer is the actual bottleneck. Every optimization in the chapters that follow (quantization, kernel fusion, KV caching) targets a number you measured in Module 14.

Where your transformer goes from here:

Table 9 traces how this module is reused by later parts of the curriculum.

Table 9: How transformers feed into profiling, quantization, and capstone modules.
| Module | What It Does | Your Transformer In Action |
| --- | --- | --- |
| 14: Profiling | Measure performance bottlenecks | profiler.analyze(model.forward(x)) reveals where time and memory go |
| 15: Quantization | Reduce precision to 8-bit / 4-bit | Shrink the 64M model to a fraction of its size with minimal accuracy loss |
| 20: Capstone | Production deployment | Serve your transformer end-to-end with the systems you built |

Get Started

Warning: Save Your Progress

Binder sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.
