Milestone 03: The MLP Revival (1986)
Foundation Milestone | Difficulty: ●●○○ | Time: 1–2 hours (incl. training) | Prerequisites: Modules 01–08
- How a multilayer network discovers its own features (edges, strokes) with no hand-coding
- Why representation learning replaced manual feature engineering
- That YOUR ~100 lines of TinyTorch can solve XOR and train on TinyDigits
Overview
For 17 years, neural networks were dead.
Minsky and Papert’s XOR argument (Milestone 02) showed that a single-layer perceptron cannot separate even the four XOR points on a plane with one line. Funding evaporated. Researchers moved on. “Neural network” became a dirty word.
Then in 1986, Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors.” Their argument was structural: stack two layers with a nonlinearity between them, train every weight with the chain rule, and the network discovers its own features. No hand-crafted rules. No domain experts. Data in, patterns out.
This milestone recreates that result in two release-friendly steps: first solve XOR with a hidden layer, then train on the shipped TinyDigits dataset.
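The chain-rule recipe described above can be sketched on XOR in a few lines. This is a NumPy stand-in, not YOUR TinyTorch code and not the paper's exact setup: the 2 → 8 → 1 layout, the tanh and sigmoid activations, the mean-squared-error loss, the learning rate, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

# XOR truth table: four points that no single line can separate.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

def init_params(hidden=8, seed=0):
    """Hypothetical 2 -> hidden -> 1 network; sizes and seed are assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0, (2, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0, (hidden, 1)), "b2": np.zeros(1),
    }

def forward(p, X):
    h = np.tanh(X @ p["W1"] + p["b1"])                    # hidden features
    out = 1.0 / (1.0 + np.exp(-(h @ p["W2"] + p["b2"])))  # sigmoid output
    return h, out

def loss_fn(p, X, y):
    _, out = forward(p, X)
    return float(np.mean((out - y) ** 2))

def backward(p, X, y):
    """Chain rule, applied layer by layer from the loss back to W1."""
    h, out = forward(p, X)
    d_out = 2.0 * (out - y) / y.size     # dL/d(out) for mean squared error
    d_z2 = d_out * out * (1.0 - out)     # back through the sigmoid
    d_h = d_z2 @ p["W2"].T               # into the hidden layer
    d_z1 = d_h * (1.0 - h ** 2)          # back through tanh
    return {"W1": X.T @ d_z1, "b1": d_z1.sum(axis=0),
            "W2": h.T @ d_z2, "b2": d_z2.sum(axis=0)}

def train(p, X, y, lr=0.5, steps=5000):
    """Plain full-batch gradient descent on all four XOR points."""
    for _ in range(steps):
        grads = backward(p, X, y)
        for name in p:
            p[name] -= lr * grads[name]
    return p

params = init_params()
loss_before = loss_fn(params, X, y)
params = train(params, X, y)
loss_after = loss_fn(params, X, y)
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
print(forward(params, X)[1].round(2).ravel())
```

Whether a given run lands on a clean 0/1/1/0 pattern depends on the initialization, so treat the printed predictions as something to inspect rather than a guaranteed result.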
What You’ll Build
Multi-layer perceptrons (MLPs) for non-linear learning and digit recognition:
- XOR Solved — hidden layers plus backpropagation solve the 1969 crisis
- TinyDigits — quick proof-of-concept on 8×8 images
Images --> Flatten --> Linear --> ReLU --> Linear --> ReLU --> Linear --> Classes
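The pipeline above can be traced shape by shape with a small NumPy sketch. The hidden sizes (32 and 16), the 10 output classes, and the He-style initialization are assumptions for illustration; YOUR milestone script may use different sizes, and YOUR Linear and ReLU classes replace the helper functions here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(sizes, seed=0):
    """Return a list of (W, b) pairs; `sizes` such as [64, 32, 16, 10] is a placeholder."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(layers, images):
    x = images.reshape(images.shape[0], -1)   # Flatten: (batch, 8, 8) -> (batch, 64)
    for i, (W, b) in enumerate(layers):
        x = x @ W + b                          # Linear
        if i < len(layers) - 1:
            x = relu(x)                        # ReLU after every layer except the last
    return x                                   # raw class scores (logits)

batch = np.random.default_rng(1).random((5, 8, 8))   # stand-in for TinyDigits images
logits = mlp_forward(init_mlp([64, 32, 16, 10]), batch)
print(logits.shape)   # (5, 10): one score per class for each image
```

Taking `logits.argmax(axis=1)` would turn each row of scores into a predicted class.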
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–04 | Foundation | Tensor, Activations, Layers, Losses |
| 05 | DataLoader | YOUR batching and data pipeline |
| 06–08 | Training Infrastructure | Autograd, Optimizers, Training loops |
Running the Milestone
Confirm Modules 01–08 are complete:
tito module status
cd milestones/03_1986_mlp
# Quick validation on TinyDigits
python 01_rumelhart_tinydigits.py
# Expected: 85%+ accuracy
# Or run the complete milestone from the TinyTorch project root
tito milestone run 03
# Run individual parts
tito milestone run 03 --part 1 # XOR Solved
tito milestone run 03 --part 2 # TinyDigits
Expected Results
Table 2 records the accuracy and runtime you should expect to see.
| Script | Dataset | Parameters | Accuracy | Training Time |
|---|---|---|---|---|
| 01 (XOR Solved) | 4 examples | small | 100% | <1 min |
| 02 (TinyDigits) | 1K train, 8×8 | ~2.4K | 85%+ | 3–5 min |
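The 100% figure for XOR is reachable in principle, and a small pure-Python check shows why. The weights below are hand-picked for illustration rather than learned, and the grid search over single threshold units is a coarse sanity check under an assumed weight range, not a proof.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def single_unit(w1, w2, b, x):
    """One linear threshold unit: fires when w1*x1 + w2*x2 + b > 0."""
    return int(w1 * x[0] + w2 * x[1] + b > 0)

def relu(z):
    return max(z, 0.0)

def two_layer(x):
    """Hand-picked weights (illustrative, not learned): two ReLU hidden units."""
    h1 = relu(x[0] + x[1])          # counts active inputs
    h2 = relu(x[0] + x[1] - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2            # subtract the (1, 1) case back out

# Coarse grid search: the best single unit on this grid still misses one point.
grid = [i / 2 for i in range(-8, 9)]
best_single = max(
    sum(single_unit(w1, w2, b, x) == t for x, t in XOR.items())
    for w1, w2, b in product(grid, repeat=3)
)
hidden_correct = sum(two_layer(x) == t for x, t in XOR.items())
print(best_single, hidden_correct)   # 3 4
```

Training in the milestone has to find weights like these on its own; this sketch only shows that a solution exists once a hidden layer and a nonlinearity are available.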
The Aha Moment: Automatic Feature Discovery
Watch YOUR network learn something you never taught it.
After training, reshape the first hidden layer’s weights into image-sized patches and visualize them. You will see edge detectors — horizontal, vertical, diagonal strokes. Nobody wrote those filters. The network discovered them because edges happen to be useful for telling digits apart.
This is representation learning: the model invents its own features from data instead of waiting for an expert to hand-design them. Combined with the universal approximation result (one hidden layer with a suitable nonlinearity and enough hidden units can approximate any continuous function on a bounded input domain to arbitrary accuracy), this is why a stack of Linear and ReLU layers can move from XOR to image classification.
Your ~100 lines of TinyTorch just replicated the breakthrough that ended the first AI winter.
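The reshape step described above might look like the following NumPy sketch. It assumes the first layer's weights are stored as an (inputs, hidden) matrix, which may not match how YOUR Linear layer lays them out, and it uses a synthetic vertical-stroke column in place of trained weights so the example runs on its own. The helper names are hypothetical.

```python
import numpy as np

def weights_to_patches(W1, side=8):
    """Turn a (side*side, hidden) weight matrix into (hidden, side, side) patches.

    Assumes each COLUMN of W1 holds one hidden unit's incoming weights; if your
    layer stores weights as (hidden, inputs), drop the transpose.
    """
    n_inputs, n_hidden = W1.shape
    assert n_inputs == side * side, "first layer must read a flattened side x side image"
    return W1.T.reshape(n_hidden, side, side)

def ascii_patch(patch):
    """Crude text rendering: '#' for strongly positive weights, '.' otherwise."""
    threshold = patch.mean() + patch.std()
    return "\n".join("".join("#" if v > threshold else "." for v in row)
                     for row in patch)

# Stand-in weights: a synthetic vertical-stroke detector, NOT trained output.
vertical = np.zeros((8, 8))
vertical[:, 3] = 1.0
W1 = np.zeros((64, 4))
W1[:, 0] = vertical.ravel()

patches = weights_to_patches(W1)
print(patches.shape)            # (4, 8, 8)
print(ascii_patch(patches[0]))  # a column of '#' where the stroke sits
```

With real trained weights you would pass each patch to an image plotting function instead of the text renderer.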
Your Code Powers This
Every component comes from YOUR implementations:
Table 3 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| Tensor | Module 01 | Stores images and weights |
| Linear | Module 03 | YOUR fully-connected layers |
| ReLU | Module 02 | YOUR activation functions |
| CrossEntropyLoss | Module 04 | YOUR loss computation |
| DataLoader | Module 05 | YOUR batching pipeline |
| backward() | Module 06 | YOUR autograd engine |
| SGD | Module 07 | YOUR optimizer |
No PyTorch. No TensorFlow. Just YOUR code learning to read handwritten digits.
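To see how the batching, loss, gradient, and optimizer pieces fit together in one training step, here is a NumPy stand-in. It deliberately avoids guessing at TinyTorch's actual class signatures: the function names, the synthetic two-class data, the single linear layer, and the learning rate are all assumptions for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), labels].mean()
    grad = np.exp(log_probs)                # softmax probabilities
    grad[np.arange(n), labels] -= 1.0       # minus the one-hot targets
    return float(loss), grad / n

def batches(X, y, batch_size, rng):
    """Yield shuffled mini-batches (a stand-in for a DataLoader)."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.random((200, 64))                   # synthetic "images", not TinyDigits
y = (X[:, :32].sum(axis=1) > X[:, 32:].sum(axis=1)).astype(int)  # synthetic labels
W, b = np.zeros((64, 2)), np.zeros(2)
lr = 0.05                                   # assumed learning rate

first_loss, _ = softmax_cross_entropy(X @ W + b, y)
for epoch in range(20):
    for xb, yb in batches(X, y, batch_size=32, rng=rng):
        loss, d_logits = softmax_cross_entropy(xb @ W + b, yb)  # forward + loss
        W -= lr * (xb.T @ d_logits)                              # gradient + SGD step
        b -= lr * d_logits.sum(axis=0)
last_loss, _ = softmax_cross_entropy(X @ W + b, y)
print(f"{first_loss:.3f} -> {last_loss:.3f}")
```

In the milestone itself, the gradient lines are what YOUR `backward()` call computes automatically for every layer.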
Historical Context
Backpropagation made multi-layer networks practical after the XOR crisis. The same idea that solves four non-linear points also scales to handwritten digit recognition: every deep learning system you have used descends from that chain-rule machinery.
Systems Insights
- Memory: ~100K parameters × 4 bytes (float32) = 400 KB of weights, and the ~2.4K-parameter TinyDigits model in Table 2 needs under 10 KB. Either is small enough to fit in 1986 workstation RAM, which is partly why this experiment was even possible.
- Compute: Dense matrix multiplies dominate training time. Every forward pass through a fully-connected layer is one big GEMM.
- Architecture: Each hidden layer composes features from the layer below, building progressively more abstract representations as you stack depth.
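The memory and compute arithmetic above can be reproduced with a short helper. The layer sizes are hypothetical, picked only because they land near the ~2.4K parameter count in Table 2, and the FLOP formula is the usual rough estimate of two operations (multiply and add) per weight per example.

```python
def count_parameters(sizes):
    """Weights plus biases for a fully-connected stack, e.g. [64, 32, 10]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def weight_bytes(n_params, bytes_per_param=4):
    """float32 storage: 4 bytes per parameter."""
    return n_params * bytes_per_param

def forward_flops(sizes, batch_size):
    """Approximate GEMM cost: 2 * batch * n_in * n_out per Linear layer."""
    return sum(2 * batch_size * n_in * n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# Hypothetical layer sizes, chosen only to illustrate the arithmetic.
sizes = [64, 32, 10]
n_params = count_parameters(sizes)
print(n_params, weight_bytes(n_params))      # 2410 9640
print(weight_bytes(100_000))                 # 400000 bytes, the 400 KB figure above
print(forward_flops(sizes, batch_size=32))   # 151552
```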
What’s Next
MLPs treat images as flat vectors. To your network, pixel (0,0) and pixel (0,1) are no more related than pixel (0,0) and pixel (7,7) on an 8×8 TinyDigits image: spatial structure is thrown away the moment you call flatten. Milestone 04 (CNN) puts locality back in with convolutional layers, and asks what that costs and what it buys.
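One way to check the claim that flattening discards spatial structure: scramble the pixel positions with a fixed permutation, scramble the weight rows the same way, and the layer's output does not change. The sketch below uses NumPy with random stand-in images and placeholder weights, so it says nothing about trained accuracy, only about what a fully-connected layer can and cannot see.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((3, 8, 8))            # stand-in 8x8 images
W = rng.normal(size=(64, 10))             # one Linear layer, placeholder weights

perm = rng.permutation(64)                # one fixed scrambling of pixel positions
flat = images.reshape(3, -1)              # flatten: (3, 8, 8) -> (3, 64)

original = flat @ W
scrambled = flat[:, perm] @ W[perm, :]    # scramble pixels and weight rows together

print(np.allclose(original, scrambled))   # True
```

Because a scrambled image is exactly as learnable as the original, nothing in the layer depends on which pixels are neighbors.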
Further Reading
- The Backprop Paper: Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors.” Nature 323, 533–536.
- Universal Approximation: Cybenko (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2(4), 303–314.