Milestone 05: The Transformer Era (2017)

Note: Milestone Info

Architecture Milestone | Difficulty: ●●●○ | Time: 30–60 min | Prerequisites: Modules 01–08, 10–13

Tip: What You’ll Learn
  • Why attention beats sequential processing (direct access, no bottleneck)
  • How transformers handle long-range dependencies that defeat RNNs
  • The architecture that powers GPT, BERT, and every modern LLM

Overview

Imagine compressing an entire book into a single fixed-size vector.

That’s what an RNN does. The input — a sentence, a paragraph, an entire novel — squeezes through one fixed-size hidden state. Information from the beginning fades as new tokens arrive. By token 1000, token 1 is nearly forgotten.

In 2017, Vaswani et al. asked: what if we just… didn’t? Their paper “Attention Is All You Need” proved that attention alone — no recurrence, no convolution — could match the state of the art. No sequential bottleneck. No information loss. Any token can directly attend to any other token.

Nearly a decade later, every frontier LLM is a transformer: GPT-4, Claude, Gemini, Llama. The architecture you finished in Modules 12–13 is the architecture behind the AI boom. This milestone proves your version works on tasks RNNs cannot solve.

What You’ll Build

  1. Attention proof (01_vaswani_attention.py) — three synthetic tasks (sequence reversal, copying, mixed prefixes) that working self-attention can solve and broken attention cannot.
  2. Optional corpus — TinyTalks in datasets/tinytalks/ for Q&A-style transformer experiments, independent of tito milestone run 05.
Tokens --> Embeddings --> [Attention --> FFN] x N --> Output

Prerequisites

Table 1 lists the modules you need to have completed before starting.

Table 1: Prerequisite modules for the Transformer milestone.
Module | Component | What It Provides
01–08 | Foundation + Training | Tensor, Layers, DataLoader, Training
10 | Tokenization | YOUR CharTokenizer
11 | Embeddings | YOUR token + positional embeddings
12 | Attention | YOUR multi-head self-attention
13 | Transformers | YOUR LayerNorm + TransformerBlock

Running the Milestone

Before running, ensure you have completed Modules 01–13. Check your progress, then launch the milestone:

tito module status
tito milestone run 05

Or:

cd milestones/05_2017_transformer
python 01_vaswani_attention.py
Note: Synthetic vs TinyTalks

The milestone entrypoint uses synthetic sequences so attention can be verified quickly with no extra files. TinyTalks remains available under datasets/tinytalks/ for teaching and follow-on work.

Expected Results

Table 2 records the accuracy and runtime you should expect to see.

Table 2: Expected success criteria and runtime for the Transformer milestone.
Script | Task | Success Criteria | Time
01_vaswani_attention.py | Reversal / copy / mixed | See script thresholds (~95% / ~95% / ~90%) | Minutes

The Aha Moment: Direct Access Everywhere

Sequence reversal demands every output position read every input position — a stress test for cross-position attention. Copying forces an identity-like alignment pattern. Mixed prefixes mirror the context-conditioned behavior of decoder-style language models.

RNNs choke on these tasks; the dependencies are too long for a single hidden state. Attention makes them learnable:

RNN:       h[t]   = f(h[t-1], x[t])         # Sequential, lossy
Attention: out[i] = sum(attn[i,j] * v[j])   # Parallel, direct

Your Code Powers This

Table 3 names the TinyTorch components that power this milestone.

Table 3: TinyTorch components that power the Transformer milestone.
Component | Your Module | What It Does
CharTokenizer | Module 10 | YOUR character-level tokenization
TokenEmbedding | Module 11 | YOUR learned token representations
PositionalEmbedding | Module 11 | YOUR position encodings
MultiHeadAttention | Module 12 | YOUR self-attention mechanism
TransformerBlock | Module 13 | YOUR attention + feedforward blocks
LayerNorm | Module 13 | YOUR normalization layers

No PyTorch. No HuggingFace. Just YOUR code.

Historical Context

The 2017 paper trained a 65M-parameter encoder-decoder on English↔German translation. The same blocks — same attention, same residual stream, same LayerNorm — scaled to GPT-3’s 175B parameters in 2020 with the core block essentially unchanged. Bigger model, more data, longer training. That scaling story is what made transformers a paradigm rather than a paper. This milestone uses small synthetic tasks for fast feedback; TinyTalks in datasets/tinytalks/ provides Q&A-style character-level text when you want a real corpus.

Systems Insights

  • Memory: The attention matrix is O(n²). At 8K tokens, that’s 64M scores per head per layer — the cost that drives every long-context optimization (FlashAttention, sliding windows, KV cache eviction). A back-of-the-envelope check follows this list.
  • Compute: Embarrassingly parallel across positions. RNNs serialize on the hidden state; transformers fill a GPU.
  • Order: Attention is permutation-invariant. Position embeddings are how the model learns “first” from “last” — shuffle them and the model loses all sense of sequence.

What’s Next

You can build a transformer. The harder question is whether you can run one. A 1B-parameter model needs 4 GB just for fp32 weights, and that’s before activations, gradients, or the KV cache. Every frontier deployment is a fight against memory, latency, and energy. Milestone 06 (MLPerf) is the optimization arc — profiling, compression, and acceleration — that turns the architecture you just validated into something that actually ships.
