Milestone 06: MLPerf — The Optimization Era (2018)
Optimization Milestone | Difficulty: ●●●● | Time: 1–2 hours | Prerequisites: Modules 01–18
- The systematic optimization workflow: measure, optimize, validate, repeat
- Why profiling before optimizing beats heroic rewrites
- How to achieve 8× compression and 10× speedup with minimal accuracy loss
Overview
This is the Optimization Milestone — the third and final act of the historical arc. The Foundation Milestones proved your training loop learns. The Architecture Milestones proved your layers match real data. This one proves your optimization stack — the Profiler (Module 14), Quantization (15), Compression (16), Acceleration (17), and KV-Cache (18) you just finished — can turn a working model into a shippable one.
ML research is sprinting; deployment is crawling. BERT-Large lands at 340M parameters and won’t fit on a phone. ResNet-152 inference blows past production latency budgets. Teams ship two-year-old models because the new ones are too slow, too big, too expensive — and every vendor’s benchmark numbers use a different dataset, batch size, and accuracy floor, so you can’t even tell whose hardware is faster.
MLPerf changes that. One protocol, one accuracy floor, one set of reference models — apples-to-apples across CPUs, GPUs, TPUs, and accelerators that didn’t exist last year. Optimization stops being an afterthought and becomes the discipline that decides who ships.
You compress YOUR models 8× and accelerate YOUR transformer generation 10× using the same measure → optimize → validate loop production teams run. That’s the gap between a research demo and a shipped product, closed in two scripts.
What You’ll Build
A complete MLPerf-style optimization pipeline:
- Static model optimization — profile, quantize, and prune MLP/CNN
- Generation speedup — KV-cache acceleration for transformers
Measure → Optimize → Validate → Repeat
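In code, that loop is small enough to fit on one screen. The sketch below is illustrative only: `measure`, `optimize_step`, and `evaluate` are toy stand-ins for YOUR Module 14–16 tools, not the milestone's actual API.

```python
import numpy as np

def measure(model):
    """MEASURE: stand-in profiler. Reports what is left to cut."""
    return {"nonzero_weights": int(np.count_nonzero(model["w"]))}

def optimize_step(model, sparsity):
    """OPTIMIZE: stand-in magnitude pruning (Module 16 in the real scripts)."""
    w = model["w"]
    thresh = np.quantile(np.abs(w), sparsity)
    return {"w": np.where(np.abs(w) >= thresh, w, 0.0)}

def evaluate(model, x, y):
    """VALIDATE: accuracy of a toy linear classifier."""
    return float(np.mean(np.sign(x @ model["w"]) == y))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32))
model = {"w": rng.normal(size=32)}
y = np.sign(x @ model["w"])

baseline = evaluate(model, x, y)
for sparsity in (0.3, 0.5, 0.7):               # REPEAT with harsher settings
    print("profile:", measure(model))           # MEASURE before touching anything
    candidate = optimize_step(model, sparsity)  # OPTIMIZE what the profile points at
    acc = evaluate(candidate, x, y)             # VALIDATE against the accuracy floor
    if baseline - acc > 0.02:                   # stop before leaving the budget
        break
    model = candidate
print(f"baseline {baseline:.3f} -> final {evaluate(model, x, y):.3f}")
```

The real scripts run the same shape of loop; the only difference is that YOUR profiler output decides which optimization to apply next.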
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–13 | Foundation + Architectures | Models to optimize |
| 14 | Profiling | YOUR measurement tools |
| 15 | Quantization | YOUR INT8/FP16 implementations |
| 16 | Compression | YOUR pruning techniques |
| 17 | Acceleration | YOUR vectorized operations |
| 18 | Memoization | YOUR KV-cache for generation |
Running the Milestone
Before running, ensure you have completed Modules 01–18. You can check your progress:
```bash
tito module status
```

```bash
cd milestones/06_2018_mlperf

# Part 1: Optimize MLP/CNN (profiling + quantization + pruning)
python 01_optimization_olympics.py
# Expected: 4-8x compression with <2% accuracy loss

# Part 2: Speed up Transformer generation (KV caching)
python 02_generation_speedup.py
# Expected: 6-10x faster generation
```

Expected Results
Static model optimization (script 01)
Table 2 tracks model size and accuracy through each static optimization stage.
| Optimization | Size | Accuracy | Notes |
|---|---|---|---|
| Baseline (FP32) | 100% | 85–90% | Full precision |
| + Quantization (INT8) | 25% | 84–89% | 4× smaller |
| + Pruning (50%) | 12.5% | 82–87% | 8× smaller total |
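The first jump in Table 2 comes from symmetric INT8 quantization: 4 bytes per weight become 1 byte. Below is a minimal round-trip in NumPy; it is a sketch of the idea, and YOUR Module 15 implementation will differ in detail.

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)

scale = np.abs(w).max() / 127.0                    # one scale for the whole tensor
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = w_q.astype(np.float32) * scale             # dequantize to check the damage

print(f"size: {w.nbytes} B -> {w_q.nbytes} B ({w.nbytes / w_q.nbytes:.0f}x smaller)")
print(f"max abs error: {np.abs(w - w_hat).max():.5f} (bounded by scale/2 = {scale / 2:.5f})")
```

Rounding error is bounded by half the scale, which is why the accuracy column in Table 2 barely moves while the size column drops to 25%.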
Generation speedup (script 02)
Table 3 shows per-token generation speed with and without the KV cache.
| Mode | Time/Token | Speedup |
|---|---|---|
| Without KV-Cache | ~10 ms | 1× |
| With KV-Cache | ~1 ms | 6–10× |
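Why the cache wins: without it, every generation step re-projects the entire prefix through the K and V weight matrices; with it, only the newest token is projected and appended. The toy single-head sketch below makes that concrete. This `KVCache` is illustrative, not YOUR Module 18 class.

```python
import numpy as np

# Toy single-head attention decode, with and without a KV cache.
# Without the cache, step t re-projects all t prefix tokens: O(t * d^2) work.
# With the cache, step t projects only the newest token: O(d^2) work.
d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    s = q @ K.T
    a = np.exp(s - s.max())          # numerically stable softmax
    return (a / a.sum()) @ V

def decode_no_cache(tokens):
    """Recompute K and V for the whole prefix on every step."""
    return attend(tokens[-1] @ Wq, tokens @ Wk, tokens @ Wv)

class KVCache:
    """Toy cache: append one row of K and V per generated token."""
    def __init__(self):
        self.K, self.V = [], []
    def step(self, token):
        self.K.append(token @ Wk)    # only the new token is projected
        self.V.append(token @ Wv)
        return attend(token @ Wq, np.stack(self.K), np.stack(self.V))

tokens = rng.normal(size=(5, d))
cache = KVCache()
for t in range(1, 6):                # both paths produce identical outputs
    assert np.allclose(cache.step(tokens[t - 1]), decode_no_cache(tokens[:t]))
```

Both paths compute identical attention outputs; the cache only removes redundant work, which is why latency drops while accuracy is untouched.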
The Aha Moment: Systematic Beats Heroic
The wrong way (heroic optimization):
"It's too slow! Let me rewrite everything in C++!"
"Memory is too high! Let me redesign the architecture!"
"KV-cache sounds complex! Let me try CUDA kernels first!"
Result: weeks of work, marginal gains, introduced bugs.
The right way (systematic optimization):
1. MEASURE: Profile shows 70% of time is in attention,
80% of memory is Linear layers
2. OPTIMIZE: Add KV-cache (targets the 70%),
quantize Linear layers (targets the 80%)
3. VALIDATE: Accuracy drops 1.5% (acceptable),
8x faster (huge win)
4. REPEAT: Profile again, find next bottleneck
Result: 10× faster, 8× smaller, 2% accuracy cost — achieved in days.
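The MEASURE step needs nothing exotic. Here is a minimal wall-clock profiler built from the standard library; YOUR Module 14 Profiler does the same job with more polish. The timings below are simulated with `sleep` calls, echoing the 70% attention share from the steps above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def timed(name):
    """Accumulate wall-clock time for a named region."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

# Wrap suspected hot spots, then read the actual shares before optimizing.
with timed("attention"):
    time.sleep(0.07)                 # stand-in for the attention forward pass
with timed("linear"):
    time.sleep(0.02)                 # stand-in for the Linear layers
with timed("other"):
    time.sleep(0.01)

total = sum(timings.values())
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10}: {t / total:5.1%}")   # optimize the biggest share first
```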
This is what separates ML researchers from ML engineers:
- YOUR Profiler (Module 14) identifies real bottlenecks (not assumed ones)
- YOUR Quantization (Module 15) reduces memory 4×
- YOUR Pruning (Module 16) reduces parameters 50%+
- YOUR KV-Cache (Module 18) speeds up generation 10×
The full loop — measure, optimize, validate — runs on YOUR tools, not someone else’s library.
Your Code Powers This
This milestone closes the historical arc. Every optimization tool you exercise here comes from YOUR implementations:
Table 4 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| Profiler | Module 14 | YOUR measurement and bottleneck identification |
| quantize() | Module 15 | YOUR INT8/FP16 conversion |
| prune() | Module 16 | YOUR weight pruning |
| vectorize() | Module 17 | YOUR accelerated operations |
| KVCache | Module 18 | YOUR key-value caching for generation |
No external optimization libraries. Every speedup, every byte saved, traces back to code you wrote.
Historical Context
Before MLPerf, comparing ML systems was guesswork. Vendors picked their own datasets, batch sizes, and accuracy targets, then claimed wins. MLPerf forced a common protocol — same models, same data, same accuracy floor — so a “2× faster” claim could finally be checked instead of believed.
That protocol marks the moment ML engineering became as load-bearing as ML research. Building a model is step one. Shipping it inside a latency budget, on hardware your users actually own, is where production value lives — and where careers are made.
Systems Insights
- Memory — 4–16× compression with under 2% accuracy loss is routine, not heroic (see the worked numbers after this list).
- Latency — 10–40× speedup from caching and batching alone, before touching the model.
- Trade-offs — every win costs accuracy, latency, or memory. Profiling tells you which one to spend; intuition will lie to you.
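As a worked example of the memory bullet, here is the arithmetic applied to BERT-Large's 340M parameters from the overview. Sparse-index overhead is ignored to keep the ratios clean.

```python
# Memory arithmetic behind the "4-16x compression" claim, for BERT-Large.
params = 340_000_000
gib = 2**30
fp32 = params * 4                  # 4 bytes per weight: ~1.27 GiB, no phone fits this
int8 = params * 1                  # quantized: 4x smaller
int8_pruned = int8 // 2            # +50% pruning: 8x smaller overall
print(f"FP32: {fp32 / gib:.2f} GiB | INT8: {int8 / gib:.2f} GiB "
      f"| INT8+prune: {int8_pruned / gib:.2f} GiB ({fp32 / int8_pruned:.0f}x)")
```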
What’s Next
With Milestone 06 the historical arc closes. Six recreations, eighteen modules, one framework you wrote yourself end-to-end. You’ve:
- Built every core component (Modules 01–13)
- Optimized for production deployment (Modules 14–18)
- Proven mastery by recreating six landmark systems (Milestones 01–06)
What’s left is the test no recreation can give you. The Capstone (Module 20) is the Torch Olympics — open-ended problems, fixed budgets, your framework, your call. The history lessons end here. The competition starts next.
Further Reading
- MLPerf: mlcommons.org
- Deep Compression: Han et al. (2015). “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”
- Efficient Transformers: Tay et al. (2020). “Efficient Transformers: A Survey”