Milestone 05: The Transformer Era (2017)
Architecture Milestone | Difficulty: ●●●○ | Time: 30–60 min | Prerequisites: Modules 01–08, 10–13
- Why attention beats sequential processing (direct access, no bottleneck)
- How transformers handle long-range dependencies that defeat RNNs
- The architecture that powers GPT, BERT, and every modern LLM
Overview
Imagine compressing an entire book into a single number.
That’s what an RNN does. The input — a sentence, a paragraph, an entire novel — squeezes through one fixed-size hidden state. Information from the beginning fades as new tokens arrive. By token 1000, token 1 is nearly forgotten.
In 2017, Vaswani et al. asked: what if we just… didn’t? Their paper “Attention Is All You Need” proved that attention alone — no recurrence, no convolution — could match the state of the art. No sequential bottleneck. No information loss. Any token can directly attend to any other token.
Nearly a decade later, every frontier LLM is a transformer: GPT-4, Claude, Gemini, Llama. The architecture you finished in Modules 12–13 is the architecture behind the AI boom. This milestone proves your version works on tasks RNNs cannot solve.
What You’ll Build
- Attention proof (`01_vaswani_attention.py`) — three synthetic tasks (sequence reversal, copying, mixed prefixes) that working self-attention can solve and broken attention cannot.
- Optional corpus — TinyTalks in `datasets/tinytalks/` for Q&A-style transformer experiments, independent of `tito milestone run 05`.
Tokens --> Embeddings --> [Attention --> FFN] x N --> Output
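As a rough illustration of that data flow, here is a minimal single-head NumPy sketch with random weights. It is independent of the TinyTorch classes used in the milestone; the sizes (`vocab_size`, `d_model`, `d_ff`, `n_blocks`) are arbitrary assumptions chosen only to keep the shapes small.

```python
import numpy as np

# Illustrative sizes (assumptions, not the milestone's real configuration)
vocab_size, seq_len, d_model, d_ff, n_blocks = 32, 8, 16, 64, 2
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Tokens --> Embeddings (token + positional)
tokens = rng.integers(0, vocab_size, size=seq_len)
tok_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(seq_len, d_model))
x = tok_emb[tokens] + pos_emb                        # (seq_len, d_model)

# [Attention --> FFN] x N, each sublayer with a residual connection
for _ in range(n_blocks):
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))       # every position sees every position
    x = layer_norm(x + attn @ v @ Wo)                # attention sublayer + residual
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    x = layer_norm(x + np.maximum(x @ W1, 0) @ W2)   # FFN sublayer + residual

# Output: project back to vocabulary logits
W_out = rng.normal(size=(d_model, vocab_size))
logits = x @ W_out                                   # (seq_len, vocab_size)
print(logits.shape)
```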
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–08 | Foundation + Training | Tensor, Layers, DataLoader, Training |
| 10 | Tokenization | YOUR CharTokenizer |
| 11 | Embeddings | YOUR token + positional embeddings |
| 12 | Attention | YOUR multi-head self-attention |
| 13 | Transformers | YOUR LayerNorm + TransformerBlock |
Running the Milestone
Before running, ensure you have completed Modules 01–13. You can check your progress with `tito module status`. Then run the milestone:

tito milestone run 05

Or run the script directly:

cd milestones/05_2017_transformer
python 01_vaswani_attention.py

The milestone entrypoint uses synthetic sequences so attention can be verified quickly with no extra files. TinyTalks remains available under `datasets/tinytalks/` for teaching and follow-on work.
Expected Results
Table 2 records the accuracy and runtime you should expect to see.
| Script | Task | Success Criteria | Time |
|---|---|---|---|
| `01_vaswani_attention.py` | Reversal / copy / mixed | See script thresholds (~95% / ~95% / ~90%) | Minutes |
The Aha Moment: Direct Access Everywhere
Sequence reversal demands every output position read every input position — a stress test for cross-position attention. Copying forces an identity-like alignment pattern. Mixed prefixes mirror the context-conditioned behavior of decoder-style language models.
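To make the reversal task concrete, here is a hypothetical input/target pair; the actual script's vocabulary, sequence length, and token format may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 16, 10

# One illustrative training pair for the reversal task:
# the model must emit the input sequence in reverse order.
src = rng.integers(0, vocab_size, size=seq_len)
tgt = src[::-1]

print("input :", src.tolist())
print("target:", tgt.tolist())
# Output position i depends on input position seq_len-1-i,
# so every output must be able to read a distant input position.
```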
RNNs choke on these tasks; the dependencies are too long for a single hidden state. Attention makes them learnable:
RNN: h[t] = f(h[t-1], x[t]) # Sequential, lossy
Attention: out[i] = sum(attn[i,j] * v[j]) # Parallel, direct
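Those two lines are pseudocode. A minimal runnable NumPy version of the same contrast, with random (untrained) weights just to show the dependency structure, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))          # input token vectors

# RNN: each step folds the entire past into one fixed-size hidden state.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(seq_len):                   # inherently sequential
    h = np.tanh(h @ Wh + x[t] @ Wx)        # old information gets overwritten

# Attention: every output reads every input directly, in parallel.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)              # (seq_len, seq_len) pairwise scores
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)        # softmax over input positions
out = attn @ v                             # out[i] = sum_j attn[i, j] * v[j]
```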
Your Code Powers This
Table 3 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| `CharTokenizer` | Module 10 | YOUR character-level tokenization |
| `TokenEmbedding` | Module 11 | YOUR learned token representations |
| `PositionalEmbedding` | Module 11 | YOUR position encodings |
| `MultiHeadAttention` | Module 12 | YOUR self-attention mechanism |
| `TransformerBlock` | Module 13 | YOUR attention + feedforward blocks |
| `LayerNorm` | Module 13 | YOUR normalization layers |
No PyTorch. No HuggingFace. Just YOUR code.
Historical Context
The 2017 paper trained a 65M-parameter encoder-decoder on English↔German translation. The same blocks — same attention, same residual stream, same LayerNorm — scaled to GPT-3’s 175B parameters in 2020 with no architectural change. Bigger model, more data, longer training. That scaling story is what made transformers a paradigm rather than a paper. This milestone uses small synthetic tasks for fast feedback; TinyTalks in `datasets/tinytalks/` provides Q&A-style character-level text when you want a real corpus.
Systems Insights
- Memory: The attention matrix is O(n²). At 8K tokens, that’s 64M scores per head per layer — the cost that drives every long-context optimization (FlashAttention, sliding windows, KV cache eviction). See the sketch after this list for the arithmetic.
- Compute: Embarrassingly parallel across positions. RNNs serialize on the hidden state; transformers fill a GPU.
- Order: Attention is permutation-invariant. Position embeddings are how the model learns “first” from “last” — shuffle them and the model loses all sense of sequence.
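To spell out the memory bullet’s arithmetic, here is a back-of-the-envelope sketch; the head and layer counts are illustrative assumptions, not numbers taken from the paper or the milestone.

```python
# Attention score memory at long context (fp32, illustrative model shape)
seq_len = 8 * 1024                 # 8K tokens
n_heads, n_layers = 8, 12          # assumed shape, for illustration only
bytes_per_score = 4                # fp32

scores_per_head = seq_len ** 2     # 67,108,864 (~64M) scores per head per layer
total_bytes = scores_per_head * bytes_per_score * n_heads * n_layers
print(f"{scores_per_head:,} scores per head per layer")
print(f"{total_bytes / 2**30:.1f} GiB of attention scores across the model")
```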
What’s Next
You can build a transformer. The harder question is whether you can run one. A 1B-parameter model needs 4 GB just for fp32 weights, and that’s before activations, gradients, or the KV cache. Every frontier deployment is a fight against memory, latency, and energy. Milestone 06 (MLPerf) is the optimization arc — profiling, compression, and acceleration — that turns the architecture you just validated into something that actually ships.
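The 4 GB figure is just parameter count times bytes per parameter; a quick sketch of that arithmetic (the training-time multiplier assumes Adam-style optimizer state, as an illustration):

```python
# Rough memory for a 1B-parameter model (weights only, before activations or KV cache)
params = 1_000_000_000
fp32_bytes = params * 4
print(f"fp32 weights: {fp32_bytes / 1e9:.1f} GB")      # ~4.0 GB

# Training is heavier: gradients plus Adam's two moment buffers roughly quadruple it.
training_bytes = fp32_bytes * 4                         # weights + grads + 2 optimizer states
print(f"approx fp32 training state: {training_bytes / 1e9:.1f} GB")
```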