Milestone 05: The Transformer Era (2017)

Note: Milestone Info

Architecture Milestone | Difficulty: ●●●○ | Time: 30–60 min | Prerequisites: Modules 01–08, 10–13

Tip: What You’ll Learn
  • Why attention beats sequential processing (direct access, no bottleneck)
  • How transformers handle long-range dependencies that defeat RNNs
  • The architecture that powers GPT, BERT, and every modern LLM

Overview

Imagine compressing an entire book into a single fixed-size vector.

That’s what an RNN does. The input — a sentence, a paragraph, an entire novel — squeezes through one fixed-size hidden state. Information from the beginning fades as new tokens arrive. By token 1000, token 1 is nearly forgotten.

In 2017, Vaswani et al. asked: what if we just… didn’t? Their paper “Attention Is All You Need” proved that attention alone — no recurrence, no convolution — could match the state of the art. No sequential bottleneck. No information loss. Any token can directly attend to any other token.

Nearly a decade later, every frontier LLM is a transformer: GPT-4, Claude, Gemini, Llama. The architecture you finished in Modules 12–13 is the architecture behind the AI boom. This milestone proves your version works on tasks RNNs cannot solve.

What You’ll Build

  1. Attention proof (01_vaswani_attention.py) — three synthetic tasks (sequence reversal, copying, mixed prefixes) that working self-attention can solve and broken attention cannot.
  2. Optional corpus — TinyTalks in datasets/tinytalks/ for Q&A-style transformer experiments, independent of tito milestone run 05.
Tokens --> Embeddings --> [Attention --> FFN] x N --> Output

Prerequisites

Table 1 lists the modules you need to have completed before starting.

Table 1: Prerequisite modules for the Transformer milestone.
Module | Component | What It Provides
01–08 | Foundation + Training | Tensor, Layers, DataLoader, Training
10 | Tokenization | YOUR CharTokenizer
11 | Embeddings | YOUR token + positional embeddings
12 | Attention | YOUR multi-head self-attention
13 | Transformers | YOUR LayerNorm + TransformerBlock

Running the Milestone

Before running, ensure you have completed Modules 01–13. Check your progress, then launch the milestone:

tito module status
tito milestone run 05

Or:

cd milestones/05_2017_transformer
python 01_vaswani_attention.py
Note: Synthetic vs TinyTalks

The milestone entrypoint uses synthetic sequences so attention can be verified quickly with no extra files. TinyTalks remains available under datasets/tinytalks/ for teaching and follow-on work.

Expected Results

Table 2 records the accuracy and runtime you should expect to see.

Table 2: Expected success criteria and runtime for the Transformer milestone.
Script | Task | Success Criteria | Time
01_vaswani_attention.py | Reversal / copy / mixed | See script thresholds (~95% / ~95% / ~90%) | Minutes

The Aha Moment: Direct Access Everywhere

Sequence reversal demands every output position read every input position — a stress test for cross-position attention. Copying forces an identity-like alignment pattern. Mixed prefixes mirror the context-conditioned behavior of decoder-style language models.

RNNs choke on these tasks; the dependencies are too long for a single hidden state. Attention makes them learnable:

RNN:       h[t]   = f(h[t-1], x[t])         # Sequential, lossy
Attention: out[i] = sum(attn[i,j] * v[j])   # Parallel, direct

Your Code Powers This

Table 3 names the TinyTorch components that power this milestone.

Table 3: TinyTorch components that power the Transformer milestone.
Component | Your Module | What It Does
CharTokenizer | Module 10 | YOUR character-level tokenization
TokenEmbedding | Module 11 | YOUR learned token representations
PositionalEmbedding | Module 11 | YOUR position encodings
MultiHeadAttention | Module 12 | YOUR self-attention mechanism
TransformerBlock | Module 13 | YOUR attention + feedforward blocks
LayerNorm | Module 13 | YOUR normalization layers

No PyTorch. No HuggingFace. Just YOUR code.

Historical Context

The 2017 paper trained a 65M-parameter encoder-decoder on English↔German translation. The same blocks — same attention, same residual stream, same LayerNorm — scaled to GPT-3’s 175B parameters in 2020 with the core block essentially unchanged. Bigger model, more data, longer training. That scaling story is what made transformers a paradigm rather than a paper. This milestone uses small synthetic tasks for fast feedback; TinyTalks in datasets/tinytalks/ provides Q&A-style character-level text when you want a real corpus.

Systems Insights

  • Memory: The attention matrix is O(n²). At 8K tokens, that’s 64M scores per head per layer — the cost that drives every long-context optimization (FlashAttention, sliding windows, KV cache eviction). A back-of-the-envelope check follows this list.
  • Compute: Embarrassingly parallel across positions. RNNs serialize on the hidden state; transformers fill a GPU.
  • Order: Attention is permutation-invariant. Position embeddings are how the model learns “first” from “last” — shuffle them and the model loses all sense of sequence.

What’s Next

You can build a transformer. The harder question is whether you can run one. A 1B-parameter model needs 4 GB just for fp32 weights, and that’s before activations, gradients, or the KV cache. Every frontier deployment is a fight against memory, latency, and energy. Milestone 06 (MLPerf) is the optimization arc — profiling, compression, and acceleration — that turns the architecture you just validated into something that actually ships.
