Milestone 03: The MLP Revival (1986)
Foundation Milestone | Difficulty: ●●○○ | Time: 1–2 hours (incl. training) | Prerequisites: Modules 01–08
In this milestone you will learn:
- How a multilayer network discovers its own features (edges, strokes) with no hand-coding
- Why representation learning replaced manual feature engineering
- That YOUR ~100 lines of TinyTorch can hit 95%+ accuracy on MNIST
Overview
For 17 years, neural networks were dead.
Minsky and Papert’s XOR proof (Milestone 02) showed that a single-layer perceptron cannot even separate four points on a plane. Funding evaporated. Researchers moved on. “Neural network” became a dirty word.
Then in 1986, Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors.” Their argument was structural: stack two layers with a nonlinearity between them, train every weight with the chain rule, and the network discovers its own features. No hand-crafted rules. No domain experts. Data in, patterns out.
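In symbols, a sketch of that idea (generic notation, not the paper's exact derivation): for a two-layer network with hidden activations $h = \sigma(W_1 x)$, outputs $\hat{y} = W_2 h$, and loss $L(\hat{y}, y)$, the chain rule produces a gradient for every weight matrix:

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\, h^{\top},
\qquad
\frac{\partial L}{\partial W_1} = \left( W_2^{\top}\, \frac{\partial L}{\partial \hat{y}} \odot \sigma'(W_1 x) \right) x^{\top}
$$

Biases follow the same pattern, and stacking more layers just repeats the inner step.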
This milestone recreates that result. You wire your Linear and ReLU layers into a stack, hand it MNIST, and watch your autograd engine end the first AI winter.
What You’ll Build
Multi-layer perceptrons (MLPs) for digit recognition:
- TinyDigits — quick proof-of-concept on 8×8 images
- MNIST — the classic benchmark (95%+ accuracy)
Images --> Flatten --> Linear --> ReLU --> Linear --> ReLU --> Linear --> Classes
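The sketch below mirrors that stack in plain NumPy. It is not your TinyTorch API, only the shape of the computation, and the 128/64 hidden sizes are an illustrative choice:

```python
# Minimal NumPy mirror of: Flatten -> Linear -> ReLU -> Linear -> ReLU -> Linear
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
weights = [rng.normal(0.0, 0.01, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(images):
    x = images.reshape(len(images), -1)        # Flatten: (batch, 28, 28) -> (batch, 784)
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b                          # Linear
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)             # ReLU between hidden layers
    return x                                   # logits, shape (batch, 10)

logits = forward(rng.normal(size=(32, 28, 28)))  # fake batch, just to check shapes
print(logits.shape)                              # (32, 10)
```

In the milestone scripts, the same composition is built from YOUR Linear and ReLU modules instead of raw NumPy arrays.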
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–04 | Foundation | Tensor, Activations, Layers, Losses |
| 05 | DataLoader | YOUR batching and data pipeline |
| 06–08 | Training Infrastructure | Autograd, Optimizers, Training loops |
Running the Milestone
Confirm Modules 01–08 are complete:
```bash
tito module status
cd milestones/03_1986_mlp

# Part 1: Quick validation
python 01_rumelhart_tinydigits.py
# Expected: 75-85% accuracy

# Part 2: Full MNIST benchmark
python 02_rumelhart_mnist.py
# Expected: 94-97% accuracy
```

Expected Results
Table 2 records the accuracy and runtime you should expect to see.
| Script | Dataset | Parameters | Accuracy | Training Time |
|---|---|---|---|---|
| 01 (TinyDigits) | 1K train, 8×8 | ~2.4K | 75–85% | 3–5 min |
| 02 (MNIST) | 60K train, 28×28 | ~100K | 94–97% | 10–15 min |
The Aha Moment: Automatic Feature Discovery
Watch YOUR network learn something you never taught it.
After training, reshape the first hidden layer’s weights into image-sized patches and visualize them. You will see edge detectors — horizontal, vertical, diagonal strokes. Nobody wrote those filters. The network discovered them because edges happen to be useful for telling digits apart.
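A minimal sketch of that visualization, assuming the first Linear layer's weight matrix has shape (784, hidden) and can be read out as a NumPy array; the random matrix below is only a stand-in for your trained weights:

```python
import numpy as np
import matplotlib.pyplot as plt

# W1: incoming weights of the first hidden layer, shape (784, hidden).
# Replace this random stand-in with the trained weights from your first Linear layer.
hidden = 64
W1 = np.random.default_rng(0).normal(size=(784, hidden))

fig, axes = plt.subplots(4, 8, figsize=(8, 4))
for unit, ax in enumerate(axes.flat):
    patch = W1[:, unit].reshape(28, 28)   # one hidden unit's weights, viewed as an image
    ax.imshow(patch, cmap="gray")
    ax.axis("off")
plt.suptitle("First-layer weights reshaped to 28x28 patches")
plt.show()
```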
This is representation learning: the model invents its own features from data instead of waiting for an expert to hand-design them. Combined with the universal approximation result (a single hidden layer with a nonlinearity can approximate any continuous function on a bounded domain to arbitrary accuracy), this is why a stack of Linear and ReLU layers is enough to attack MNIST in the first place.
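Stated a little more precisely (this is Cybenko's sigmoidal version from Further Reading; later results extend it to ReLU and other non-polynomial activations): for any continuous $f$ on a compact domain and any tolerance $\varepsilon > 0$, there exist a width $N$ and weights such that

$$
\left|\, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon
\qquad \text{for every } x \text{ in the domain.}
$$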
Your ~100 lines of TinyTorch just replicated the breakthrough that ended the first AI winter.
Your Code Powers This
Every component comes from YOUR implementations.
Table 3 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| Tensor | Module 01 | Stores images and weights |
| Linear | Module 03 | YOUR fully-connected layers |
| ReLU | Module 02 | YOUR activation functions |
| CrossEntropyLoss | Module 04 | YOUR loss computation |
| DataLoader | Module 05 | YOUR batching pipeline |
| backward() | Module 06 | YOUR autograd engine |
| SGD | Module 07 | YOUR optimizer |
No PyTorch. No TensorFlow. Just YOUR code learning to read handwritten digits.
Historical Context
MNIST (1998) became the standard benchmark for handwritten digit recognition. MLPs reaching 95%+ accuracy moved neural networks from curiosity to credible tool — the result that pulled funding and researchers back into the field. Rumelhart, Hinton, and Williams’ backprop paper has since been cited over 50,000 times; every deep learning system you have used descends from it.
Systems Insights
- Memory: ~100K parameters × 4 bytes ≈ 400 KB of weights, small enough to fit in a 1986 workstation's RAM, which is partly why this experiment was even possible (see the parameter-count sketch after this list).
- Compute: Dense matrix multiplies dominate training time. Every forward pass through a fully-connected layer is one big GEMM (general matrix-matrix multiply).
- Architecture: Each hidden layer composes features from the layer below, building progressively more abstract representations as you stack depth.
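The parameter-count sketch referenced above, assuming a 784-128-64-10 MLP (your exact hidden sizes may differ, hence the ~100K):

```python
# Each Linear layer contributes in*out weights plus out biases.
sizes = [784, 128, 64, 10]
params = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
print(params)                      # 109386, i.e. on the order of 100K parameters
print(params * 4 / 1000, "KB")     # ~437 KB at 4 bytes (float32) per parameter
# The forward pass per layer is a (batch x in) @ (in x out) GEMM, e.g. (64 x 784) @ (784 x 128).
```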
What’s Next
MLPs treat images as flat vectors. To your network, pixel (0,0) and pixel (0,1) are no more related than pixel (0,0) and pixel (27,27) — spatial structure is thrown away the moment you call flatten. Milestone 04 (CNN) puts locality back in with convolutional layers, and asks what that costs and what it buys.
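To see what flattening throws away, here is a tiny standalone NumPy illustration (not part of the milestone scripts):

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)    # stand-in for a 28x28 MNIST image
flat = image.flatten()                        # what the MLP actually sees: a 784-vector

# Pixel (0, 1), the horizontal neighbour of (0, 0), stays right next to it at index 1...
print(int(np.argmax(flat == image[0, 1])))    # 1
# ...but pixel (1, 0), the vertical neighbour, lands 28 positions away at index 28.
print(int(np.argmax(flat == image[1, 0])))    # 28
```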
Further Reading
- The Backprop Paper: Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors”
- MNIST Dataset: LeCun et al. (1998). “Gradient-based learning applied to document recognition”
- Universal Approximation: Cybenko (1989). “Approximation by superpositions of a sigmoidal function”