Milestone 06: MLPerf — The Optimization Era (2018)

Note: Milestone Info

Optimization Milestone | Difficulty: ●●●● | Time: 1–2 hours | Prerequisites: Modules 01–18

Tip: What You’ll Learn
  • The systematic optimization workflow: measure, optimize, validate, repeat
  • Why profiling before optimizing beats heroic rewrites
  • How to achieve 8× compression and 10× speedup with minimal accuracy loss

Overview

This is the Optimization Milestone — the third and final act of the historical arc. The Foundation Milestones proved your training loop learns. The Architecture Milestones proved your layers match real data. This one proves your optimization stack — the Profiler (Module 14), Quantization (15), Compression (16), Acceleration (17), and KV-Cache (18) you just finished — can turn a working model into a shippable one.

ML research is sprinting; deployment is crawling. BERT-Large lands at 340M parameters and won’t fit on a phone. ResNet-152 inference blows past production latency budgets. Teams ship two-year-old models because the new ones are too slow, too big, too expensive. And every vendor’s benchmark numbers use a different dataset, batch size, and accuracy floor, so you can’t even tell whose hardware is faster.

MLPerf changes that. One protocol, one accuracy floor, one set of reference models — apples-to-apples across CPUs, GPUs, TPUs, and accelerators that didn’t exist last year. Optimization stops being an afterthought and becomes the discipline that decides who ships.

You compress YOUR models 8× and accelerate YOUR transformer generation 10× using the same measure → optimize → validate loop production teams run. That’s the gap between a research demo and a shipped product, closed in two scripts.

What You’ll Build

A complete MLPerf-style optimization pipeline:

  1. Static model optimization — profile, quantize, and prune MLP/CNN
  2. Generation speedup — KV-cache acceleration for transformers
Measure → Optimize → Validate → Repeat
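
That cycle is the whole milestone in miniature. The sketch below shows it in plain Python; the candidates triples, eval_fn, and the accuracy-budget gate are illustrative stand-ins, not TinyTorch APIs. The milestone scripts run the real version with YOUR modules.

def optimize(model, candidates, eval_fn, budget=0.02):
    """Keep each candidate optimization only if total accuracy loss stays in budget."""
    baseline = eval_fn(model)                      # MEASURE the starting point
    for name, apply_opt, undo_opt in candidates:   # one optimization at a time
        apply_opt(model)                           # OPTIMIZE
        acc = eval_fn(model)                       # VALIDATE against the floor
        if baseline - acc > budget:
            undo_opt(model)                        # reject: costs too much accuracy
            print(f"rejected {name} (accuracy {acc:.3f})")
        else:
            print(f"kept {name} (accuracy {acc:.3f})")
    return model                                   # REPEAT: profile again from here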

Prerequisites

Table 1 lists the modules you need to have completed before starting.

Table 1: Prerequisite modules for the MLPerf milestone.
Module | Component | What It Provides
01–13 | Foundation + Architectures | Models to optimize
14 | Profiling | YOUR measurement tools
15 | Quantization | YOUR INT8/FP16 implementations
16 | Compression | YOUR pruning techniques
17 | Acceleration | YOUR vectorized operations
18 | Memoization | YOUR KV-cache for generation

Running the Milestone

Before running, ensure you have completed Modules 01–18. Check your progress, then change into the milestone directory:

tito module status
cd milestones/06_2018_mlperf

# Part 1: Optimize MLP/CNN (profiling + quantization + pruning)
python 01_optimization_olympics.py
# Expected: 4–8× compression with <2% accuracy loss

# Part 2: Speed up Transformer generation (KV caching)
python 02_generation_speedup.py
# Expected: 6–10× faster generation

Expected Results

Static model optimization (script 01)

Table 2 tracks model size and accuracy through each static optimization stage.

Table 2: Expected size and accuracy at each static optimization stage.
Optimization | Size | Accuracy | Notes
Baseline (FP32) | 100% | 85–90% | Full precision
+ Quantization (INT8) | 25% | 84–89% | 4× smaller
+ Pruning (50%) | 12.5% | 82–87% | 8× smaller total
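
The 25% row is a storage argument: FP32 spends 4 bytes per weight, INT8 spends 1, so dense INT8 is 4× smaller before pruning. Below is a minimal sketch of one common scheme, symmetric per-tensor INT8 quantization in NumPy; YOUR Module 15 implementation may differ in its details:

import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_int8(W)
print(f"size: {W.nbytes} -> {q.nbytes} bytes (4x smaller)")
print(f"mean abs error: {np.abs(W - dequantize(q, s)).mean():.4f}")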

Generation speedup (script 02)

Table 3 shows per-token generation speed with and without the KV cache.

Table 3: Per-token generation speed with and without the KV cache.
Mode | Time/Token | Speedup
Without KV-Cache | ~10 ms | 1× (baseline)
With KV-Cache | ~1 ms | 6–10×
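
Why per-token time collapses: without a cache, step t re-projects keys and values for the entire t-token prefix, so each new token costs O(t) work; with a cache, each step appends one key/value row and reuses the rest. A minimal single-head NumPy sketch of the cached path follows; the names (W_q, W_k, W_v, attend) are illustrative, not YOUR KVCache API:

import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d, d)) / np.sqrt(d)   # toy projection weights
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

def attend(q, K, V):
    """Scaled dot-product attention for one query over the cached rows."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())            # numerically stable softmax
    return (w / w.sum()) @ V

K_cache, V_cache = [], []
for step in range(5):
    x = rng.standard_normal(d)                   # current token's hidden state
    K_cache.append(x @ W_k)                      # O(1) per step: append one row
    V_cache.append(x @ W_v)                      # instead of re-projecting all
    out = attend(x @ W_q, np.stack(K_cache), np.stack(V_cache))
print(out.shape)                                 # (64,)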

The Aha Moment: Systematic Beats Heroic

The wrong way (heroic optimization):

"It's too slow! Let me rewrite everything in C++!"
"Memory is too high! Let me redesign the architecture!"
"KV-cache sounds complex! Let me try CUDA kernels first!"

Result: weeks of work, marginal gains, and new bugs.

The right way (systematic optimization):

1. MEASURE:   Profile shows 70% of time is in attention,
              80% of memory is Linear layers
2. OPTIMIZE:  Add KV-cache (targets the 70%),
              quantize Linear layers (targets the 80%)
3. VALIDATE:  Accuracy drops 1.5% (acceptable),
              8x faster (huge win)
4. REPEAT:    Profile again, find next bottleneck

Result: 10× faster, 8× smaller, 2% accuracy cost — achieved in days.
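
Step 1 is the one teams skip. A minimal sketch of the MEASURE step: wall-clock time charged to named components, so percentages like the 70% above are read off rather than guessed. The timed helper and the dummy stages are illustrative, not YOUR Profiler's API.

import time
from collections import defaultdict

timings = defaultdict(float)

def timed(name, fn, *args):
    """Run fn and charge its wall-clock time to `name`."""
    start = time.perf_counter()
    out = fn(*args)
    timings[name] += time.perf_counter() - start
    return out

# Dummy stages standing in for a model's attention and linear layers.
attn = lambda n: sum(i * i for i in range(n))
linear = lambda n: sum(range(n))

for _ in range(100):
    timed("attention", attn, 10_000)
    timed("linear", linear, 2_000)

total = sum(timings.values())
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {100 * t / total:5.1f}% of runtime")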

This is what separates ML researchers from ML engineers:

  • YOUR Profiler (Module 14) identifies real bottlenecks (not assumed ones)
  • YOUR Quantization (Module 15) reduces memory 4×
  • YOUR Pruning (Module 16) reduces parameters 50%+
  • YOUR KV-Cache (Module 18) speeds up generation 10×

The full loop — measure, optimize, validate — runs on YOUR tools, not someone else’s library.
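
For concreteness, the simplest version of the pruning step above is magnitude pruning: zero the smallest-magnitude weights and keep a mask. A NumPy sketch under that assumption (unstructured pruning; not necessarily how YOUR Module 16 does it):

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest-magnitude fraction of weights (unstructured)."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

W = np.random.randn(256, 256)
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print(f"kept {mask.mean():.0%} of weights")      # ~50%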

Your Code Powers This

This milestone closes the historical arc. Every optimization tool you exercise here comes from YOUR implementations:

Table 4 names the TinyTorch components that power this milestone.

Table 4: TinyTorch components that power the MLPerf milestone.
Component | Your Module | What It Does
Profiler | Module 14 | YOUR measurement and bottleneck identification
quantize() | Module 15 | YOUR INT8/FP16 conversion
prune() | Module 16 | YOUR weight pruning
vectorize() | Module 17 | YOUR accelerated operations
KVCache | Module 18 | YOUR key-value caching for generation

No external optimization libraries. Every speedup, every byte saved, traces back to code you wrote.

Historical Context

Before MLPerf, comparing ML systems was guesswork. Vendors picked their own datasets, batch sizes, and accuracy targets, then claimed wins. MLPerf forced a common protocol — same models, same data, same accuracy floor — so a “2× faster” claim could finally be checked instead of believed.

That protocol marks the moment ML engineering became as load-bearing as ML research. Building a model is step one. Shipping it inside a latency budget, on hardware your users actually own, is where production value lives — and where careers are made.

Systems Insights

  • Memory — 4–16× compression with under 2% accuracy loss is routine, not heroic.
  • Latency — 10–40× speedup from caching and batching alone, before touching the model.
  • Trade-offs — every win costs accuracy, latency, or memory. Profiling tells you which one to spend; intuition will lie to you.

What’s Next

With Milestone 06 the historical arc closes. Six recreations, eighteen modules, one framework you wrote yourself end-to-end. You’ve:

  • Built every core component (Modules 01–13)
  • Optimized for production deployment (Modules 14–18)
  • Proven mastery by recreating six landmark systems (Milestones 01–06)

What’s left is the test no recreation can give you. The Capstone (Module 20) is the Torch Olympics — open-ended problems, fixed budgets, your framework, your call. The history lessons end here. The competition starts next.
