Optimization Tier (Modules 14-19)#

Transform research prototypes into production-ready systems.

What You’ll Learn#

The Optimization tier teaches you how to make ML systems fast, small, and deployable. You’ll learn systematic profiling, model compression through quantization and pruning, inference acceleration with caching and batching, and comprehensive benchmarking methodologies.

By the end of this tier, you’ll understand:

  • How to identify performance bottlenecks through profiling

  • Why quantization reduces model size by 4-16× with minimal accuracy loss

  • How pruning removes unnecessary parameters to compress models

  • What KV-caching does to accelerate transformer inference

  • How batching and other optimizations achieve production speed

Module Progression#

        graph TB
 A[ Architecture<br/>CNNs + Transformers]

 A --> M14[14. Profiling<br/>Find bottlenecks]

 M14 --> M15[15. Quantization<br/>INT8 compression]
 M14 --> M16[16. Compression<br/>Structured pruning]

 M15 --> SMALL[ Smaller Models<br/>4-16× size reduction]
 M16 --> SMALL

 M14 --> M17[17. Acceleration<br/>Vectorization + fusion]
 M17 --> M18[18. Memoization<br/>KV-cache for inference]

 M18 --> FAST[ Faster Inference<br/>12-40× speedup]

 SMALL --> M19[19. Benchmarking<br/>Systematic measurement]
 FAST --> M19

 M19 --> OLYMPICS[ MLPerf Torch Olympics<br/>Production-ready systems]

 style A fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
 style M14 fill:#fff3e0,stroke:#f57c00,stroke-width:3px
 style M15 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
 style M16 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
 style M17 fill:#ffcc80,stroke:#e65100,stroke-width:3px
 style M18 fill:#ffb74d,stroke:#e65100,stroke-width:3px
 style M19 fill:#ffa726,stroke:#e65100,stroke-width:4px
 style SMALL fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
 style FAST fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
 style OLYMPICS fill:#fef3c7,stroke:#f59e0b,stroke-width:4px
    

Fig. 21 Optimization Module Flow. Starting from profiling, two parallel tracks: model-level (quantization, compression) and runtime (acceleration, memoization), converging at benchmarking.#

Module Details#

14. Profiling - Measure Before Optimizing#

What it is: Tools and techniques to identify computational bottlenecks in ML systems.

Why it matters: “Premature optimization is the root of all evil.” Profiling tells you WHERE to optimize—which operations consume the most time, memory, or energy. Without profiling, you’re guessing.

What you’ll build: Memory profilers, timing utilities, and FLOPs counters to analyze model performance.

Systems focus: Time complexity, space complexity, computational graphs, hotspot identification

Key insight: Don’t optimize blindly. Profile first, then optimize the bottlenecks.

15. Quantization - Smaller Models, Similar Accuracy#

What it is: Converting FP32 weights to INT8 to reduce model size and speed up inference.

Why it matters: Quantization achieves 4× size reduction and faster computation with minimal accuracy loss (often <1%). Essential for deploying models on edge devices or reducing cloud costs.

What you’ll build: Post-training quantization (PTQ) for weights and activations with calibration.

Systems focus: Numerical precision, scale/zero-point calculation, quantization-aware operations

Impact: Models shrink from 100MB → 25MB while maintaining 95%+ of original accuracy.

16. Compression - Pruning Unnecessary Parameters#

What it is: Removing unimportant weights and neurons through structured pruning.

Why it matters: Neural networks are often over-parameterized. Pruning removes 50-90% of parameters with minimal accuracy loss, reducing memory and computation.

What you’ll build: Magnitude-based pruning, structured pruning (entire channels/layers), and fine-tuning after pruning.

Systems focus: Sparsity patterns, memory layout, retraining strategies

Impact: Combined with quantization, achieve 8-16× compression (quantize + prune).

17. Acceleration - Vectorization and Fusion#

What it is: Vectorized operations, SIMD utilization, and kernel fusion for maximum hardware efficiency.

Why it matters: Naive implementations waste 90%+ of hardware capability. Vectorization and fusion eliminate redundant memory traffic and exploit hardware parallelism for 10-100× speedups.

What you’ll build: Vectorized matrix multiplication, fused operations (like fused_gelu), and cache-optimal memory access patterns.

Systems focus: Cache locality, SIMD utilization, memory bandwidth optimization, loop tiling

Impact: Convolution speedup from naive loops to near-BLAS performance.

18. Memoization - KV-Cache for Fast Generation#

What it is: Caching key-value pairs in transformers to avoid recomputing attention for previously generated tokens.

Why it matters: Without KV-cache, generating each new token requires O(nÂČ) recomputation of all previous tokens. With KV-cache, generation becomes O(n), achieving 10-100× speedups for long sequences.

What you’ll build: KV-cache implementation for transformer inference with proper memory management.

Systems focus: Cache management, memory vs speed trade-offs, incremental computation

Impact: Text generation goes from 0.5 tokens/sec → 50+ tokens/sec.

19. Benchmarking - Systematic Measurement#

What it is: Rigorous methodology for measuring model performance across multiple dimensions.

Why it matters: “What gets measured gets managed.” Benchmarking provides apples-to-apples comparisons of accuracy, speed, memory, and energy—essential for production decisions.

What you’ll build: Comprehensive benchmarking suite measuring accuracy, latency, throughput, memory, and FLOPs.

Systems focus: Measurement methodology, statistical significance, performance metrics

Historical context: MLCommons’ MLPerf (founded 2018) established systematic benchmarking as AI systems grew too complex for ad-hoc evaluation.

What You Can Build After This Tier#

        timeline
 title Production-Ready Systems
 Baseline : 100MB model, 0.5 tokens/sec, 95% accuracy
 Quantization : 25MB model (4× smaller), same accuracy
 Pruning : 12MB model (8× smaller), 94% accuracy
 KV-Cache : 50 tokens/sec (100× faster generation)
 Batching : 500 tokens/sec (1000× throughput)
 MLPerf Olympics : Production-ready transformer deployment
    

Fig. 22 Production Optimization Timeline. Progressive improvements from baseline to production-ready: 8-16× smaller models and 12-40× faster inference.#

After completing the Optimization tier, you’ll be able to:

  • Milestone 06 (2018): Achieve production-ready optimization:

  • 8-16× smaller models (quantization + pruning)

  • 12-40× faster inference (KV-cache + batching)

  • Systematic profiling and benchmarking workflows

  • Deploy models that run on:

  • Edge devices (Raspberry Pi, mobile phones)

  • Cloud infrastructure (cost-effective serving)

  • Real-time applications (low-latency requirements)

Prerequisites#

Required:

  • ** Architecture Tier** (Modules 09-13) completed

  • Understanding of CNNs and/or transformers

  • Experience training models on real datasets

  • Basic understanding of systems concepts (memory, CPU/GPU, throughput)

Helpful but not required:

  • Production ML experience

  • Systems programming background

  • Understanding of hardware constraints

Time Commitment#

Per module: 4-6 hours (implementation + profiling + benchmarking)

Total tier: ~30-40 hours for complete mastery

Recommended pace: 1 module per week (this tier is dense!)

Learning Approach#

Each module follows Measure → Optimize → Validate:

  1. Measure: Profile baseline performance (time, memory, accuracy)

  2. Optimize: Implement optimization technique (quantize, prune, cache)

  3. Validate: Benchmark improvements and understand trade-offs

This mirrors production ML workflows where optimization is an iterative, data-driven process.

Key Achievement: MLPerf Torch Olympics#

After Module 19, you’ll complete the MLPerf Torch Olympics Milestone (2018):

cd milestones/06_2018_mlperf
python 01_baseline_profile.py # Identify bottlenecks
python 02_compression.py # Quantize + prune (8-16× smaller)
python 03_generation_opts.py # KV-cache + batching (12-40× faster)

What makes this special: You’ll have built the entire optimization pipeline from scratch—profiling tools, quantization engine, pruning algorithms, caching systems, and benchmarking infrastructure.

Two Optimization Categories#

The Optimization tier follows a deliberate pedagogical structure: Measure → Model-Level → Runtime → Validate.

Model-Level Optimizations (Modules 15-16)#

These techniques change the model itself—they permanently modify weights and architecture:

  • Quantization (15): Convert FP32 weights to INT8 for 4× compression

  • Compression (16): Prune unnecessary parameters for 2-10× size reduction

After model-level optimization, you have a different (smaller, faster) model.

Runtime Optimizations (Modules 17-18)#

These techniques change how execution happens without modifying model weights:

  • Acceleration (17): Vectorization exploits SIMD instructions; kernel fusion eliminates memory traffic. These apply to ANY numerical computation.

  • Memoization (18): KV-cache is domain-specific optimization for transformer inference, trading memory for O(nÂČ) → O(n) speedup.

Note the progression: general-purpose optimization (acceleration) comes before domain-specific optimization (KV-cache). This matches the pedagogical principle of teaching foundational concepts before specialized applications.

Why This Order?#

The Optimization tier follows Measure → Transform → Execute → Validate:

Profile (14) → Model-Level (15-16) → Runtime (17-18) → Benchmark (19)
   ↓              ↓                      ↓                ↓
 "What's slow?"  "Shrink the model"   "Speed up execution"  "Did it work?"

Why Model-Level before Runtime?

Model-level optimizations (quantization, pruning) are one-time transformations you apply before deployment. Runtime optimizations (acceleration, memoization) are ongoing techniques applied during every inference. It makes sense to first prepare the model, then optimize how it runs.

Why Acceleration (17) before Memoization (18)?

  • Acceleration teaches general-purpose optimization: vectorization, cache locality, kernel fusion. These techniques apply to any numerical computation—matrix multiplication, convolutions, attention, everything.

  • Memoization (KV-cache) is domain-specific optimization for transformer autoregressive generation. It only applies to transformers generating sequences.

The pedagogical principle: general before specific. Once you understand how to make any code fast (acceleration), you can appreciate the specialized optimization that makes LLM inference economically viable (KV-cache).

Both tracks start from Module 14 (Profiling) and converge at Module 19 (Benchmarking).

Recommendation: Complete modules in order (14→15→16→17→18→19) to build a complete understanding of the optimization landscape.

Real-World Impact#

The techniques in this tier are used by every production ML system:

  • Quantization: TensorFlow Lite, ONNX Runtime, Apple Neural Engine

  • Pruning: Mobile ML, edge AI, efficient transformers

  • KV-Cache: All transformer inference engines (vLLM, TGI, llama.cpp)

  • Batching: Cloud serving (AWS SageMaker, GCP Vertex AI)

  • Benchmarking: MLPerf industry standard for AI performance

After this tier, you’ll understand how real ML systems achieve production performance.

Next Steps#

Ready to optimize?

# Start the Optimization tier
tito module start 14_profiling

# Follow the measure → optimize → validate cycle

Or explore other tiers:

← Back to Home ‱ Milestone System