Milestone 06: MLPerf — The Optimization Era (2018)
Optimization Milestone | Difficulty: ●●●● | Time: 1–2 hours | Prerequisites: Modules 01–18
- The systematic optimization workflow: measure, optimize, validate, repeat
- Why profiling before optimizing beats heroic rewrites
- How to achieve 8× compression and 10× speedup with minimal accuracy loss
Overview
This is the Optimization Milestone — the third and final act of the historical arc. The Foundation Milestones proved your training loop learns. The Architecture Milestones proved your layers match real data. This one proves your optimization stack — the Profiler (Module 14), Quantization (15), Compression (16), Acceleration (17), and KV-Cache (18) you just finished — can turn a working model into a shippable one.
ML research is sprinting; deployment is crawling. BERT-Large lands at 340M parameters and won’t fit on a phone. ResNet-152 inference blows past production latency budgets. Teams ship two-year-old models because the new ones are too slow, too big, too expensive — and every vendor’s benchmark numbers use a different dataset, batch size, and accuracy floor, so you can’t even tell whose hardware is faster.
MLPerf changes that. One protocol, one accuracy floor, one set of reference models — apples-to-apples across CPUs, GPUs, TPUs, and accelerators that didn’t exist last year. Optimization stops being an afterthought and becomes the discipline that decides who ships.
You compress YOUR models 8× and accelerate YOUR transformer generation 10× using the same measure → optimize → validate loop production teams run. That’s the gap between a research demo and a shipped product, closed in two scripts.
What You’ll Build
A complete MLPerf-style optimization pipeline:
- Static model optimization — profile, quantize, and prune MLP/CNN
- Generation speedup — KV-cache acceleration for transformers
Measure → Optimize → Validate → Repeat
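In code, that loop is small enough to fit on one screen. The sketch below is illustrative only: `measure`, `optimize_step`, and `evaluate` are toy stand-ins for YOUR Module 14–16 tools, not the milestone's actual API.

```python
import numpy as np

def measure(model):
    """MEASURE: stand-in profiler. Reports what is left to cut."""
    return {"nonzero_weights": int(np.count_nonzero(model["w"]))}

def optimize_step(model, sparsity):
    """OPTIMIZE: stand-in magnitude pruning (Module 16 in the real scripts)."""
    w = model["w"]
    thresh = np.quantile(np.abs(w), sparsity)
    return {"w": np.where(np.abs(w) >= thresh, w, 0.0)}

def evaluate(model, x, y):
    """VALIDATE: accuracy of a toy linear classifier."""
    return float(np.mean(np.sign(x @ model["w"]) == y))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32))
model = {"w": rng.normal(size=32)}
y = np.sign(x @ model["w"])

baseline = evaluate(model, x, y)
for sparsity in (0.3, 0.5, 0.7):               # REPEAT with harsher settings
    print("profile:", measure(model))           # MEASURE before touching anything
    candidate = optimize_step(model, sparsity)  # OPTIMIZE what the profile points at
    acc = evaluate(candidate, x, y)             # VALIDATE against the accuracy floor
    if baseline - acc > 0.02:                   # stop before leaving the budget
        break
    model = candidate
print(f"baseline {baseline:.3f} -> final {evaluate(model, x, y):.3f}")
```

The real scripts run the same shape of loop; the only difference is that YOUR profiler output decides which optimization to apply next.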
Prerequisites
Table 1 lists the modules you need to have completed before starting.
| Module | Component | What It Provides |
|---|---|---|
| 01–13 | Foundation + Architectures | Models to optimize |
| 14 | Profiling | YOUR measurement tools |
| 15 | Quantization | YOUR INT8/FP16 implementations |
| 16 | Compression | YOUR pruning techniques |
| 17 | Acceleration | YOUR vectorized operations |
| 18 | Memoization | YOUR KV-cache for generation |
Running the Milestone
Before running, ensure you have completed Modules 01–18. You can check your progress:
```bash
tito module status
```

```bash
cd milestones/06_2018_mlperf

# Part 1: Optimize MLP/CNN (profiling + quantization + pruning)
python 01_optimization_olympics.py
# Expected: 4-8x compression with <2% accuracy loss

# Part 2: Speed up Transformer generation (KV caching)
python 02_generation_speedup.py
# Expected: 6-10x faster generation
```

Expected Results
Static model optimization (script 01)
Table 2 tracks model size and accuracy through each static optimization stage.
| Optimization | Size | Accuracy | Notes |
|---|---|---|---|
| Baseline (FP32) | 100% | 85–90% | Full precision |
| + Quantization (INT8) | 25% | 84–89% | 4× smaller |
| + Pruning (50%) | 12.5% | 82–87% | 8× smaller total |
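The first jump in Table 2 comes from symmetric INT8 quantization: 4 bytes per weight become 1 byte. Below is a minimal round-trip in NumPy; it is a sketch of the idea, and YOUR Module 15 implementation will differ in detail.

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(512, 512)).astype(np.float32)

scale = np.abs(w).max() / 127.0                    # one scale for the whole tensor
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = w_q.astype(np.float32) * scale             # dequantize to check the damage

print(f"size: {w.nbytes} B -> {w_q.nbytes} B ({w.nbytes / w_q.nbytes:.0f}x smaller)")
print(f"max abs error: {np.abs(w - w_hat).max():.5f} (bounded by scale/2 = {scale / 2:.5f})")
```

Rounding error is bounded by half the scale, which is why the accuracy column in Table 2 barely moves while the size column drops to 25%.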
Generation speedup (script 02)
Table 3 shows per-token generation speed with and without the KV cache.
| Mode | Time/Token | Speedup |
|---|---|---|
| Without KV-Cache | ~10 ms | 1× |
| With KV-Cache | ~1 ms | 6–10× |
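Why the cache wins: without it, every generation step re-projects the entire prefix through the K and V weight matrices; with it, only the newest token is projected and appended. The toy single-head sketch below makes that concrete. This `KVCache` is illustrative, not YOUR Module 18 class.

```python
import numpy as np

# Toy single-head attention decode, with and without a KV cache.
# Without the cache, step t re-projects all t prefix tokens: O(t * d^2) work.
# With the cache, step t projects only the newest token: O(d^2) work.
d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    s = q @ K.T
    a = np.exp(s - s.max())          # numerically stable softmax
    return (a / a.sum()) @ V

def decode_no_cache(tokens):
    """Recompute K and V for the whole prefix on every step."""
    return attend(tokens[-1] @ Wq, tokens @ Wk, tokens @ Wv)

class KVCache:
    """Toy cache: append one row of K and V per generated token."""
    def __init__(self):
        self.K, self.V = [], []
    def step(self, token):
        self.K.append(token @ Wk)    # only the new token is projected
        self.V.append(token @ Wv)
        return attend(token @ Wq, np.stack(self.K), np.stack(self.V))

tokens = rng.normal(size=(5, d))
cache = KVCache()
for t in range(1, 6):                # both paths produce identical outputs
    assert np.allclose(cache.step(tokens[t - 1]), decode_no_cache(tokens[:t]))
```

Both paths compute identical attention outputs; the cache only removes redundant work, which is why latency drops while accuracy is untouched.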
The Aha Moment: Systematic Beats Heroic
The wrong way (heroic optimization):
"It's too slow! Let me rewrite everything in C++!"
"Memory is too high! Let me redesign the architecture!"
"KV-cache sounds complex! Let me try CUDA kernels first!"
Result: weeks of work, marginal gains, introduced bugs.
The right way (systematic optimization):
1. MEASURE: Profile shows 70% of time is in attention,
80% of memory is Linear layers
2. OPTIMIZE: Add KV-cache (targets the 70%),
quantize Linear layers (targets the 80%)
3. VALIDATE: Accuracy drops 1.5% (acceptable),
8x faster (huge win)
4. REPEAT: Profile again, find next bottleneck
Result: 10× faster, 8× smaller, 2% accuracy cost — achieved in days.
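The MEASURE step needs nothing exotic. Here is a minimal wall-clock profiler built from the standard library; YOUR Module 14 Profiler does the same job with more polish. The timings below are simulated with `sleep` calls, echoing the 70% attention share from the steps above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def timed(name):
    """Accumulate wall-clock time for a named region."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

# Wrap suspected hot spots, then read the actual shares before optimizing.
with timed("attention"):
    time.sleep(0.07)                 # stand-in for the attention forward pass
with timed("linear"):
    time.sleep(0.02)                 # stand-in for the Linear layers
with timed("other"):
    time.sleep(0.01)

total = sum(timings.values())
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10}: {t / total:5.1%}")   # optimize the biggest share first
```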
This is what separates ML researchers from ML engineers:
- YOUR Profiler (Module 14) identifies real bottlenecks (not assumed ones)
- YOUR Quantization (Module 15) reduces memory 4×
- YOUR Pruning (Module 16) reduces parameters 50%+
- YOUR KV-Cache (Module 18) speeds up generation 10×
The full loop — measure, optimize, validate — runs on YOUR tools, not someone else’s library.
Your Code Powers This
This milestone closes the historical arc. Every optimization tool you exercise here comes from YOUR implementations:
Table 4 names the TinyTorch components that power this milestone.
| Component | Your Module | What It Does |
|---|---|---|
| Profiler | Module 14 | YOUR measurement and bottleneck identification |
| quantize() | Module 15 | YOUR INT8/FP16 conversion |
| prune() | Module 16 | YOUR weight pruning |
| vectorize() | Module 17 | YOUR accelerated operations |
| KVCache | Module 18 | YOUR key-value caching for generation |
No external optimization libraries. Every speedup, every byte saved, traces back to code you wrote.
Historical Context
Before MLPerf, comparing ML systems was guesswork. Vendors picked their own datasets, batch sizes, and accuracy targets, then claimed wins. MLPerf forced a common protocol — same models, same data, same accuracy floor — so a “2× faster” claim could finally be checked instead of believed.
That protocol marks the moment ML engineering became as load-bearing as ML research. Building a model is step one. Shipping it inside a latency budget, on hardware your users actually own, is where production value lives — and where careers are made.
Systems Insights
- Memory — 4–16× compression with under 2% accuracy loss is routine, not heroic (see the worked numbers after this list).
- Latency — 10–40× speedup from caching and batching alone, before touching the model.
- Trade-offs — every win costs accuracy, latency, or memory. Profiling tells you which one to spend; intuition will lie to you.
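As a worked example of the memory bullet, here is the arithmetic applied to BERT-Large's 340M parameters from the overview. Sparse-index overhead is ignored to keep the ratios clean.

```python
# Memory arithmetic behind the "4-16x compression" claim, for BERT-Large.
params = 340_000_000
gib = 2**30
fp32 = params * 4                  # 4 bytes per weight: ~1.27 GiB, no phone fits this
int8 = params * 1                  # quantized: 4x smaller
int8_pruned = int8 // 2            # +50% pruning: 8x smaller overall
print(f"FP32: {fp32 / gib:.2f} GiB | INT8: {int8 / gib:.2f} GiB "
      f"| INT8+prune: {int8_pruned / gib:.2f} GiB ({fp32 / int8_pruned:.0f}x)")
```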
What’s Next
With Milestone 06 the historical arc closes. Six recreations, eighteen modules, one framework you wrote yourself end-to-end. You’ve:
- Built every core component (Modules 01–13)
- Optimized for production deployment (Modules 14–18)
- Proven mastery by recreating six landmark systems (Milestones 01–06)
What’s left is the test no recreation can give you. The Capstone (Module 20) is the Torch Olympics — open-ended problems, fixed budgets, your framework, your call. The history lessons end here. The competition starts next.
Further Reading
- MLPerf: mlcommons.org
- Deep Compression: Han et al. (2015). “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”
- Efficient Transformers: Tay et al. (2020). “Efficient Transformers: A Survey”