Milestone 03: The MLP Revival (1986)
Foundation Milestone | Difficulty: ●●○○ | Time: 1–2 hours (incl. training) | Prerequisites: Modules 01–08
In this milestone you will learn:
- How a multilayer network discovers its own features (edges, strokes) with no hand-coding
- Why representation learning replaced manual feature engineering
- That YOUR ~100 lines of TinyTorch can hit 95%+ accuracy on MNIST
Overview
For 17 years, neural networks were dead.
Minsky and Papert’s XOR proof (Milestone 02) showed that a single-layer perceptron cannot even separate four points on a plane. Funding evaporated. Researchers moved on. “Neural network” became a dirty word.
Then in 1986, Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors.” Their argument was structural: stack two layers with a nonlinearity between them, train every weight with the chain rule, and the network discovers its own features. No hand-crafted rules. No domain experts. Data in, patterns out.
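In symbols, a sketch of that idea (generic notation, not the paper's exact derivation): for a two-layer network with hidden activations $h = \sigma(W_1 x)$, outputs $\hat{y} = W_2 h$, and loss $L(\hat{y}, y)$, the chain rule produces a gradient for every weight matrix:

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\, h^{\top},
\qquad
\frac{\partial L}{\partial W_1} = \left( W_2^{\top}\, \frac{\partial L}{\partial \hat{y}} \odot \sigma'(W_1 x) \right) x^{\top}
$$

Biases follow the same pattern, and stacking more layers just repeats the inner step.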
This milestone recreates that result. You wire your Linear and ReLU layers into a stack, hand it MNIST, and watch your autograd engine end the first AI winter.
What You’ll Build
Multi-layer perceptrons (MLPs) for digit recognition:
- TinyDigits — quick proof-of-concept on 8×8 images
- MNIST — the classic benchmark (95%+ accuracy)
Images --> Flatten --> Linear --> ReLU --> Linear --> ReLU --> Linear --> Classes
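The sketch below mirrors that stack in plain NumPy. It is not your TinyTorch API, only the shape of the computation, and the 128/64 hidden sizes are an illustrative choice:

```python
# Minimal NumPy mirror of: Flatten -> Linear -> ReLU -> Linear -> ReLU -> Linear
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
weights = [rng.normal(0.0, 0.01, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(images):
    x = images.reshape(len(images), -1)        # Flatten: (batch, 28, 28) -> (batch, 784)
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b                          # Linear
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)             # ReLU between hidden layers
    return x                                   # logits, shape (batch, 10)

logits = forward(rng.normal(size=(32, 28, 28)))  # fake batch, just to check shapes
print(logits.shape)                              # (32, 10)
```

In the milestone scripts, the same composition is built from YOUR Linear and ReLU modules instead of raw NumPy arrays.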
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–04 | Foundation | Tensor, Activations, Layers, Losses |
| 05 | DataLoader | YOUR batching and data pipeline |
| 06–08 | Training Infrastructure | Autograd, Optimizers, Training loops |
Running the Milestone
Confirm Modules 01–08 are complete:
```bash
tito module status
cd milestones/03_1986_mlp

# Part 1: Quick validation
python 01_rumelhart_tinydigits.py
# Expected: 75-85% accuracy

# Part 2: Full MNIST benchmark
python 02_rumelhart_mnist.py
# Expected: 94-97% accuracy
```

Expected Results
Table 2 records the accuracy and runtime you should expect to see.
| Script | Dataset | Parameters | Accuracy | Training Time |
|---|---|---|---|---|
| 01 (TinyDigits) | 1K train, 8×8 | ~2.4K | 75–85% | 3–5 min |
| 02 (MNIST) | 60K train, 28×28 | ~100K | 94–97% | 10–15 min |
The Aha Moment: Automatic Feature Discovery
Watch YOUR network learn something you never taught it.
After training, reshape the first hidden layer’s weights into image-sized patches and visualize them. You will see edge detectors — horizontal, vertical, diagonal strokes. Nobody wrote those filters. The network discovered them because edges happen to be useful for telling digits apart.
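A minimal sketch of that visualization, assuming the first Linear layer's weight matrix has shape (784, hidden) and can be read out as a NumPy array; the random matrix below is only a stand-in for your trained weights:

```python
import numpy as np
import matplotlib.pyplot as plt

# W1: incoming weights of the first hidden layer, shape (784, hidden).
# Replace this random stand-in with the trained weights from your first Linear layer.
hidden = 64
W1 = np.random.default_rng(0).normal(size=(784, hidden))

fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for unit, ax in enumerate(axes.flat):
    patch = W1[:, unit].reshape(28, 28)   # one hidden unit's weights, viewed as an image
    ax.imshow(patch, cmap="gray")
    ax.axis("off")
plt.suptitle("First-layer weights reshaped to 28x28 patches")
plt.show()
```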
This is representation learning: the model invents its own features from data instead of waiting for an expert to hand-design them. Combined with the universal approximation result (a single hidden layer with a nonlinearity can approximate any continuous function on a bounded domain to arbitrary accuracy), this is why a stack of Linear and ReLU layers is enough to attack MNIST in the first place.
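Stated a little more precisely (this is Cybenko's sigmoidal version from Further Reading; later results extend it to ReLU and other non-polynomial activations): for any continuous $f$ on a compact domain and any tolerance $\varepsilon > 0$, there exist a width $N$ and weights such that

$$
\left|\, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon
\qquad \text{for every } x \text{ in the domain.}
$$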
Your ~100 lines of TinyTorch just replicated the breakthrough that ended the first AI winter.
Your Code Powers This
Every component comes from YOUR implementations.
Table 3 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| Tensor | Module 01 | Stores images and weights |
| Linear | Module 03 | YOUR fully-connected layers |
| ReLU | Module 02 | YOUR activation functions |
| CrossEntropyLoss | Module 04 | YOUR loss computation |
| DataLoader | Module 05 | YOUR batching pipeline |
| backward() | Module 06 | YOUR autograd engine |
| SGD | Module 07 | YOUR optimizer |
No PyTorch. No TensorFlow. Just YOUR code learning to read handwritten digits.
Historical Context
MNIST (1998) became the standard benchmark for handwritten digit recognition. MLPs reaching 95%+ accuracy moved neural networks from curiosity to credible tool — the result that pulled funding and researchers back into the field. Rumelhart, Hinton, and Williams’ backprop paper has since been cited over 50,000 times; every deep learning system you have used descends from it.
Systems Insights
- Memory: ~100K parameters × 4 bytes ≈ 400 KB of weights, small enough to fit in a 1986 workstation's RAM, which is partly why this experiment was even possible (see the parameter-count sketch after this list).
- Compute: Dense matrix multiplies dominate training time. Every forward pass through a fully-connected layer is one big GEMM (general matrix-matrix multiply).
- Architecture: Each hidden layer composes features from the layer below, building progressively more abstract representations as you stack depth.
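The parameter-count sketch referenced above, assuming a 784-128-64-10 MLP (your exact hidden sizes may differ, hence the ~100K):

```python
# Each Linear layer contributes in*out weights plus out biases.
sizes = [784, 128, 64, 10]
params = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
print(params)                      # 109386, i.e. on the order of 100K parameters
print(params * 4 / 1000, "KB")     # ~437 KB at 4 bytes (float32) per parameter
# The forward pass per layer is a (batch x in) @ (in x out) GEMM, e.g. (64 x 784) @ (784 x 128).
```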
What’s Next
MLPs treat images as flat vectors. To your network, pixel (0,0) and pixel (0,1) are no more related than pixel (0,0) and pixel (27,27) — spatial structure is thrown away the moment you call flatten. Milestone 04 (CNN) puts locality back in with convolutional layers, and asks what that costs and what it buys.
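To see what flattening throws away, here is a tiny standalone NumPy illustration (not part of the milestone scripts):

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)    # stand-in for a 28x28 MNIST image
flat = image.flatten()                        # what the MLP actually sees: a 784-vector

# Pixel (0, 1), the horizontal neighbour of (0, 0), stays right next to it at index 1...
print(int(np.argmax(flat == image[0, 1])))    # 1
# ...but pixel (1, 0), the vertical neighbour, lands 28 positions away at index 28.
print(int(np.argmax(flat == image[1, 0])))    # 28
```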
Further Reading
- The Backprop Paper: Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors”
- MNIST Dataset: LeCun et al. (1998). “Gradient-based learning applied to document recognition”
- Universal Approximation: Cybenko (1989). “Approximation by superpositions of a sigmoidal function”