Milestone 04: The CNN Revolution (1998)
Architecture Milestone | Difficulty: ●●●○ | Time: 1–2 hours (incl. training) | Prerequisites: Modules 01–09
What You'll Learn
- Why spatial structure matters: 100× fewer parameters, 50% better accuracy
- How weight sharing enables translation invariance
- The hierarchical feature learning that powers all computer vision
Overview
This is the first Architecture Milestone. The Foundation Milestones proved your training loop learns; the next two prove your architectures do the work they were invented for. Here you pick up the Conv2d and MaxPool2d layers you just built in Module 09 and point them at natural images — the exact problem that made convolutions famous.
In 1998, Yann LeCun deploys LeNet-5: a convolutional neural network that reads handwritten zip codes for the US Postal Service and dollar amounts on bank checks for NCR. It is not a research demo. It is production software, sorting mail and clearing checks at industrial scale — the first commercial success of deep learning.
The breakthrough is structural. Images are not bags of pixels; nearby pixels matter more than distant ones, and the same edge detector works whether you put it in the corner or the center. Exploit those two facts — local connectivity and weight sharing — and a convolutional network matches a dense one’s accuracy with two orders of magnitude fewer parameters.
You are about to reproduce those same principles on real natural images, using your own Conv2d and MaxPool2d from Module 09. Hit 75% on CIFAR-10 and you will have built — end-to-end, with no pretrained weights — the kind of computer vision system that defined the field for more than a decade after LeNet shipped.
What You’ll Build
CNNs that exploit image structure:
- TinyDigits — prove convolution beats MLPs on 8×8 images
- CIFAR-10 — scale to natural color images (32×32 RGB)
Images --> Conv --> ReLU --> Pool --> Conv --> ReLU --> Pool --> Flatten --> Linear --> Classes
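Below is a minimal sketch of that pipeline in the style of your TinyTorch code. The `tinytorch` import path, the constructor signatures, callable layers, and the `reshape` method are assumptions for illustration; adapt them to whatever your Modules 01–09 actually expose.

```python
# A sketch of Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> Flatten -> Linear.
# NOTE: import path and signatures are assumptions -- match your own modules.
from tinytorch import Conv2d, MaxPool2d, ReLU, Linear

class SimpleCNN:
    def __init__(self, in_channels=3, num_classes=10):
        # (N, 3, 32, 32) -> conv keeps 32x32, pool halves it -> (N, 16, 16, 16)
        self.conv1 = Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.pool1 = MaxPool2d(kernel_size=2)
        # (N, 16, 16, 16) -> (N, 32, 8, 8)
        self.conv2 = Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool2 = MaxPool2d(kernel_size=2)
        self.relu = ReLU()
        # 32 * 8 * 8 = 2048 flattened features -> 10 class scores
        self.fc = Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.pool1(self.relu(self.conv1(x)))
        x = self.pool2(self.relu(self.conv2(x)))
        x = x.reshape(x.shape[0], -1)   # flatten to (batch, 2048)
        return self.fc(x)
```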
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–08 | Foundation + Training | Complete training pipeline |
| 09 | Convolutions | Your Conv2d + MaxPool2d |
Running the Milestone
Before running, ensure you have completed Modules 01–09. You can check your progress:
```bash
tito module status
```

```bash
cd milestones/04_1998_cnn

# Part 1: TinyDigits (works offline)
python 01_lecun_tinydigits.py
# Expected: ~90% accuracy (vs ~80% MLP)

# Part 2: CIFAR-10 (requires download)
python 02_lecun_cifar10.py
# Expected: 65-75% accuracy
```

Expected Results
Table 2 records the accuracy and runtime you should expect to see.
| Script | Dataset | Architecture | Accuracy | vs MLP |
|---|---|---|---|---|
| 01 (TinyDigits) | 1K train, 8×8 | Simple CNN | ~90% | +10 pts (vs ~80% MLP) |
| 02 (CIFAR-10) | 50K train, 32×32 RGB | Deeper CNN | 65–75% | MLPs struggle here |
The Aha Moment: Structure Matches Reality
An MLP sees an image as 3,072 unrelated numbers. It does not know that pixel (0,0) is next to pixel (0,1). It learns brittle correlations like “if pixel 1,234 is bright and pixel 2,891 is dark…” — patterns tied to absolute positions, which fall apart the moment the cat shifts a few pixels to the left.
A CNN bakes spatial structure into the architecture itself (see the demo after this list):
- Local connectivity — each neuron only looks at a small neighborhood (3×3 or 5×5). Edges, corners, and textures are local patterns; the network does not need a global view to detect them.
- Weight sharing — one filter scans the entire image. “Cat in the top-left” and “cat in the bottom-right” trigger the same feature detector, so the network learns the concept once instead of 1,024 times.
- Translation invariance — pooling makes the output insensitive to small shifts. The network learns that a cat is present, not where the pixels happened to land.
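To watch weight sharing do its job, here is a tiny self-contained numpy demo (deliberately not your Module 09 code): one 3×3 vertical-edge filter slid over two images whose edge sits in different places. The same nine weights detect both; only the location of the response moves.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive valid cross-correlation: ONE shared kernel slides over the image."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Sobel-style vertical-edge detector: fires where brightness jumps left -> right.
edge = np.array([[-1., 0., 1.],
                 [-2., 0., 2.],
                 [-1., 0., 1.]])

img_a = np.zeros((8, 8)); img_a[:, 2:] = 1.0   # edge starts at column 2
img_b = np.zeros((8, 8)); img_b[:, 5:] = 1.0   # same edge, shifted to column 5

resp_a = conv2d_valid(img_a, edge)
resp_b = conv2d_valid(img_b, edge)

# Prints "0 3": the response peak shifts by exactly the 3 columns the edge moved.
print(np.argmax(resp_a[0]), np.argmax(resp_b[0]))
```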
The numbers:
- MLP on CIFAR-10: ~100M parameters, ~50% accuracy (memorizes pixel positions, fails to generalize)
- Your CNN: ~1M parameters, 75%+ accuracy (learns reusable features that compose)
100× fewer parameters. 25 percentage points more accuracy. That is what happens when the architecture matches the data.
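The back-of-envelope arithmetic below makes the gap concrete. The exact layer widths are illustrative assumptions (the real scripts may differ); the point is the order of magnitude, which survives any reasonable choice of widths.

```python
# Weight counts only; biases omitted. Layer widths are assumptions.

# Big MLP on the flattened 32x32x3 input: 3072 -> 10000 -> 10000 -> 10
mlp = 3072 * 10_000 + 10_000 * 10_000 + 10_000 * 10
print(f"MLP params: {mlp:,}")      # 130,820,000  (~131M)

# CNN: three 3x3 conv blocks (3 -> 32 -> 64 -> 128, each followed by 2x2 pool),
# then fc(4*4*128 -> 512) and fc(512 -> 10)
cnn = (3*32*9) + (32*64*9) + (64*128*9) + (4*4*128*512) + (512*10)
print(f"CNN params: {cnn:,}")      # 1,146,720  (~1.1M)

print(f"ratio: {mlp / cnn:.0f}x")  # ~114x -- two orders of magnitude
```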
Part 1 validates your implementations on TinyDigits. Part 2 scales them: 50,000 natural color images, 32×32×3 = 3,072 dimensions per image, 10 categories (airplanes, cars, birds, cats, ships…). This is the hard problem.
Your DataLoader streams batches from disk. Your Conv2d layers extract features hierarchically — first layer finds edges, second finds textures, third finds object parts. Your MaxPool2d shrinks the spatial map while preserving what matters.
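Stripped to its skeleton, the loop each script runs looks like the sketch below. The import path, the `SGD` optimizer name, and the `parameters`/`zero_grad`/`step` methods are assumptions; `train_dataset` stands in for whatever your Module 05 dataset yields. Substitute your actual APIs.

```python
# One training epoch -- a hedged sketch, not the scripts' exact code.
from tinytorch import DataLoader, CrossEntropyLoss, SGD

model = SimpleCNN()                    # the sketch from "What You'll Build"
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
loss_fn = CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

for images, labels in loader:          # Module 05 streams batches from disk
    logits = model.forward(images)     # Conv2d -> MaxPool2d -> Linear forward
    loss = loss_fn(logits, labels)     # Module 04 cross-entropy
    optimizer.zero_grad()              # clear the previous step's gradients
    loss.backward()                    # Module 06 autograd through every op
    optimizer.step()                   # update every shared conv filter
```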
When the run prints Test Accuracy: 72%, sit with it for a second. You did not download a pretrained model. Every tensor op, every gradient, every parameter update is code you wrote — running on your laptop, training a model that would have made headlines twenty years ago. That is systems engineering.
Your Code Powers This
Table 3 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| `Tensor` | Module 01 | Stores images and feature maps |
| `Conv2d` | Module 09 | Your convolutional layers |
| `MaxPool2d` | Module 09 | Your pooling layers |
| `ReLU` | Module 02 | Your activation functions |
| `Linear` | Module 03 | Your classifier head |
| `CrossEntropyLoss` | Module 04 | Your loss computation |
| `DataLoader` | Module 05 | Your batching pipeline |
| `backward()` | Module 06 | Your autograd engine |
Historical Context
LeNet-5 was deployed for zip code recognition at the US Postal Service and check-amount reading at NCR — the first neural networks to ship in production at meaningful scale.
CIFAR-10 (2009) became the standard pre-ImageNet benchmark. Reaching 70%+ on it was the signal the field was ready for the next jump in scale.
The 2012 “ImageNet moment” — AlexNet — applied the same CNN principles to 1.2 million images on GPUs. The blueprint was already in LeCun’s 1998 paper. The hardware just had to catch up.
Systems Insights
- Memory: ~1M parameters (weight sharing dramatically reduces vs dense)
- Compute: Convolution is compute-intensive but highly parallelizable (see the arithmetic below)
- Architecture: Hierarchical feature learning (edges → textures → objects)
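A quick multiply-accumulate count shows both halves of the compute story; the layer sizes below are taken from the earlier sketch and are assumptions, not measurements.

```python
# MACs vs parameters for one conv layer (16x16 input, 32 -> 64 channels, 3x3).
H, W, Cin, Cout, k = 16, 16, 32, 64, 3

params = Cin * Cout * k * k             # 18,432 shared weights
macs = H * W * Cout * (Cin * k * k)     # 4,718,592 MACs per image
print(f"{params:,} weights, each reused {macs // params}x per image")  # 256x
# Every output pixel is computed independently, so the H * W * Cout loop
# parallelizes trivially -- exactly what GPUs exploited in 2012.
```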
What’s Next
CNNs are the right inductive bias for grid-structured data, but most of the world’s interesting signals are sequential — text, audio, time series. Milestone 05 introduces Transformers, the architecture that ate sequence modeling first and, eventually, vision itself.
Further Reading
- LeNet Paper: LeCun et al. (1998). “Gradient-based learning applied to document recognition”
- CIFAR-10: Krizhevsky (2009). “Learning Multiple Layers of Features from Tiny Images”
- AlexNet: Krizhevsky et al. (2012). “ImageNet Classification with Deep Convolutional Neural Networks”