Optimization Tier (Modules 14-19)#
Transform research prototypes into production-ready systems.
What You'll Learn#
The Optimization tier teaches you how to make ML systems fast, small, and deployable. You'll learn systematic profiling, model compression through quantization and pruning, inference acceleration with caching and batching, and comprehensive benchmarking methodologies.
By the end of this tier, you'll understand:
How to identify performance bottlenecks through profiling
Why quantization reduces model size by 4-16× with minimal accuracy loss
How pruning removes unnecessary parameters to compress models
What KV-caching does to accelerate transformer inference
How batching and other optimizations achieve production speed
Module Progression#
graph TB
A[Architecture<br/>CNNs + Transformers]
A --> M14[14. Profiling<br/>Find bottlenecks]
M14 --> M15[15. Quantization<br/>INT8 compression]
M14 --> M16[16. Compression<br/>Structured pruning]
M15 --> SMALL[Smaller Models<br/>4-16× size reduction]
M16 --> SMALL
M14 --> M17[17. Memoization<br/>KV-cache for inference]
M17 --> M18[18. Acceleration<br/>Batching + optimizations]
M18 --> FAST[Faster Inference<br/>12-40× speedup]
SMALL --> M19[19. Benchmarking<br/>Systematic measurement]
FAST --> M19
M19 --> OLYMPICS[MLPerf Torch Olympics<br/>Production-ready systems]
style A fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
style M14 fill:#fff3e0,stroke:#f57c00,stroke-width:3px
style M15 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
style M16 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
style M17 fill:#ffcc80,stroke:#e65100,stroke-width:3px
style M18 fill:#ffb74d,stroke:#e65100,stroke-width:3px
style M19 fill:#ffa726,stroke:#e65100,stroke-width:4px
style SMALL fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
style FAST fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
style OLYMPICS fill:#fef3c7,stroke:#f59e0b,stroke-width:4px
Fig. 21 Optimization Module Flow. Starting from profiling, two parallel tracks address size reduction (quantization, compression) and speed improvement (memoization, acceleration), converging at benchmarking.#
Module Details#
14. Profiling - Measure Before Optimizing#
What it is: Tools and techniques to identify computational bottlenecks in ML systems.
Why it matters: "Premature optimization is the root of all evil." Profiling tells you WHERE to optimize: which operations consume the most time, memory, or energy. Without profiling, you're guessing.
What you'll build: Memory profilers, timing utilities, and FLOPs counters to analyze model performance (a minimal sketch follows below).
Systems focus: Time complexity, space complexity, computational graphs, hotspot identification
Key insight: Donât optimize blindly. Profile first, then optimize the bottlenecks.
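To make the idea concrete, here is a minimal sketch of a timing-and-memory helper using only the Python standard library. The `profile` function and the toy workload are illustrative, not the module's actual API.

```python
import time
import tracemalloc

def profile(fn, *args, repeats=10):
    """Time a callable and report peak memory for a single call."""
    fn(*args)  # warm-up so one-time setup costs don't skew the timing

    # Wall-clock timing: take the best of several repeats to reduce noise.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)

    # Peak memory of one call, measured with the stdlib allocation tracer.
    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"best_sec": min(timings), "peak_bytes": peak}

if __name__ == "__main__":
    # Toy "hotspot": a pure-Python reduction standing in for a model layer.
    print(profile(lambda n: sum(i * i for i in range(n)), 1_000_000))
```

Real profilers add per-operation breakdowns and FLOPs counting on top of this pattern, but the measure-then-compare discipline is the same.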
15. Quantization - Smaller Models, Similar Accuracy#
What it is: Converting FP32 weights to INT8 to reduce model size and speed up inference.
Why it matters: Quantization achieves 4× size reduction and faster computation with minimal accuracy loss (often <1%). Essential for deploying models on edge devices or reducing cloud costs.
What you'll build: Post-training quantization (PTQ) for weights and activations with calibration, as sketched below.
Systems focus: Numerical precision, scale/zero-point calculation, quantization-aware operations
Impact: Models shrink from 100MB → 25MB while maintaining 95%+ of original accuracy.
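As a rough illustration of where the 4× comes from, the sketch below quantizes a single FP32 tensor to unsigned INT8 with a per-tensor scale and zero-point. It assumes NumPy and is a simplification of post-training quantization, not the module's exact implementation.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Map a float tensor to unsigned INT8 with a per-tensor scale and zero-point."""
    qmin, qmax = 0, 2**num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())

    # Scale stretches the float range over [qmin, qmax]; zero-point aligns 0.0.
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))

    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    q, s, z = quantize(w)
    err = np.abs(w - dequantize(q, s, z)).max()
    print(f"{w.nbytes} -> {q.nbytes} bytes (4x), max abs error {err:.4f}")
```

In real PTQ, calibration chooses the min/max (or percentile) ranges from representative activation data rather than a single weight tensor.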
16. Compression - Pruning Unnecessary Parameters#
What it is: Removing unimportant weights and neurons through structured pruning.
Why it matters: Neural networks are often over-parameterized. Pruning removes 50-90% of parameters with minimal accuracy loss, reducing memory and computation.
What you'll build: Magnitude-based pruning, structured pruning (entire channels/layers), and fine-tuning after pruning; a minimal sketch follows below.
Systems focus: Sparsity patterns, memory layout, retraining strategies
Impact: Combined with quantization, achieve 8-16× compression (quantize + prune).
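Here is a minimal sketch of magnitude-based (unstructured) pruning, assuming NumPy. Structured pruning removes whole channels or heads instead of individual weights, which is what actually shrinks dense-tensor memory and compute; the function name here is illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = the k-th smallest absolute value; weights at or below it are cut.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0).astype(weights.dtype)

if __name__ == "__main__":
    w = np.random.randn(512, 512).astype(np.float32)
    pruned = magnitude_prune(w, sparsity=0.7)
    print(f"kept {np.count_nonzero(pruned) / pruned.size:.1%} of weights")  # roughly 30%
```

After pruning, a short fine-tuning pass typically recovers most of the lost accuracy.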
17. Memoization - KV-Cache for Fast Generation#
What it is: Caching key-value pairs in transformers to avoid recomputing attention for previously generated tokens.
Why it matters: Without KV-cache, generating each new token requires O(n²) recomputation of all previous tokens. With KV-cache, generation becomes O(n), achieving 10-100× speedups for long sequences.
What you'll build: KV-cache implementation for transformer inference with proper memory management (see the sketch below).
Systems focus: Cache management, memory vs speed trade-offs, incremental computation
Impact: Text generation goes from 0.5 tokens/sec → 50+ tokens/sec.
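The sketch below shows the core idea for a single attention head, assuming NumPy. The class and method names are illustrative; a real transformer keeps one cache per layer and per head and also handles batching and masking.

```python
import numpy as np

class KVCache:
    """Grows by one key/value per generated token; old entries are reused, not recomputed."""

    def __init__(self):
        self.keys = []    # one (d_k,) vector per past position
        self.values = []  # one (d_v,) vector per past position

    def step(self, q, k, v):
        """Append the new token's K/V, then attend over the whole cache."""
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)    # (t, d_k) -- cached keys
        V = np.stack(self.values)  # (t, d_v) -- cached values

        scores = K @ q / np.sqrt(q.shape[-1])   # (t,) attention logits
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                       # (d_v,) output for this step

if __name__ == "__main__":
    d, cache = 64, KVCache()
    for _ in range(5):  # each step costs O(t) instead of re-running O(t^2) attention
        q, k, v = np.random.randn(d), np.random.randn(d), np.random.randn(d)
        out = cache.step(q, k, v)
    print(out.shape)  # (64,)
```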
18. Acceleration - Batching and Beyond#
What it is: Batching multiple requests, operation fusion, and other inference optimizations.
Why it matters: Production systems serve multiple users simultaneously. Batching amortizes overhead across requests, achieving near-linear throughput scaling.
What you'll build: Dynamic batching, operation fusion, and inference server patterns; a batching sketch follows below.
Systems focus: Throughput vs latency, memory pooling, request scheduling
Impact: Combined with KV-cache, achieve 12-40× faster inference than naive implementations.
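The sketch below shows the dynamic-batching pattern in plain Python: collect requests from a queue until the batch is full or a small time budget expires, run one forward pass, and route each result back to its caller. Names like `serve` and `model_fn` are illustrative placeholders, not part of any real serving framework.

```python
import time
from queue import Empty, Queue

def serve(model_fn, requests: Queue, max_batch=8, max_wait_ms=5.0):
    """Pull (input, callback) pairs off the queue and process them in batches."""
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.perf_counter() + max_wait_ms / 1000.0

        # Keep collecting until the batch is full or the time budget runs out.
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except Empty:
                break

        inputs = [inp for inp, _ in batch]
        outputs = model_fn(inputs)  # one forward pass amortized over the whole batch
        for (_, callback), out in zip(batch, outputs):
            callback(out)  # hand each result back to the request that asked for it
```

The `max_wait_ms` budget is the throughput-versus-latency knob: waiting longer fills bigger batches but delays the first request in each batch.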
19. Benchmarking - Systematic Measurement#
What it is: Rigorous methodology for measuring model performance across multiple dimensions.
Why it matters: "What gets measured gets managed." Benchmarking provides apples-to-apples comparisons of accuracy, speed, memory, and energy, which are essential for production decisions.
What you'll build: Comprehensive benchmarking suite measuring accuracy, latency, throughput, memory, and FLOPs (a minimal latency sketch follows below).
Systems focus: Measurement methodology, statistical significance, performance metrics
Historical context: MLPerf (launched in 2018, now run by MLCommons) established systematic benchmarking as AI systems grew too complex for ad-hoc evaluation.
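A minimal sketch of the measurement discipline, assuming NumPy: warm up, collect many latency samples, and report percentiles rather than a single number, since tail latency is what production systems care about. The helper name is illustrative.

```python
import time
import numpy as np

def benchmark_latency(fn, *args, warmup=5, iters=100):
    """Report mean and tail latency (ms) plus derived throughput."""
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        fn(*args)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)

    samples = np.array(samples)
    return {
        "mean_ms": float(samples.mean()),
        "p50_ms": float(np.percentile(samples, 50)),
        "p95_ms": float(np.percentile(samples, 95)),  # tail latency
        "throughput_per_sec": 1000.0 / float(samples.mean()),
    }

if __name__ == "__main__":
    x = np.random.randn(512, 512).astype(np.float32)
    print(benchmark_latency(lambda a: a @ a, x))
```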
What You Can Build After This Tier#
timeline
title Production-Ready Systems
Baseline : 100MB model, 0.5 tokens/sec, 95% accuracy
Quantization : 25MB model (4× smaller), same accuracy
Pruning : 12MB model (8× smaller), 94% accuracy
KV-Cache : 50 tokens/sec (100× faster generation)
Batching : 500 tokens/sec (1000× throughput)
MLPerf Olympics : Production-ready transformer deployment
Fig. 22 Production Optimization Timeline. Progressive improvements from baseline to production-ready: 8-16× smaller models and 12-40× faster inference.#
After completing the Optimization tier, you'll be able to:
Milestone 06 (2018): Achieve production-ready optimization:
8-16× smaller models (quantization + pruning)
12-40× faster inference (KV-cache + batching)
Systematic profiling and benchmarking workflows
Deploy models that run on:
Edge devices (Raspberry Pi, mobile phones)
Cloud infrastructure (cost-effective serving)
Real-time applications (low-latency requirements)
Prerequisites#
Required:
**Architecture Tier** (Modules 08-13) completed
Understanding of CNNs and/or transformers
Experience training models on real datasets
Basic understanding of systems concepts (memory, CPU/GPU, throughput)
Helpful but not required:
Production ML experience
Systems programming background
Understanding of hardware constraints
Time Commitment#
Per module: 4-6 hours (implementation + profiling + benchmarking)
Total tier: ~30-40 hours for complete mastery
Recommended pace: 1 module per week (this tier is dense!)
Learning Approach#
Each module follows Measure → Optimize → Validate:
Measure: Profile baseline performance (time, memory, accuracy)
Optimize: Implement optimization technique (quantize, prune, cache)
Validate: Benchmark improvements and understand trade-offs
This mirrors production ML workflows where optimization is an iterative, data-driven process.
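As a toy end-to-end illustration of that cycle, the sketch below measures a baseline FP32 matrix-vector "model", applies a deliberately simple optimization (casting the weights to FP16, purely illustrative; whether it is faster depends on hardware and BLAS support), and validates both the gain and the accuracy cost. It assumes NumPy.

```python
import time
import numpy as np

def mean_latency(fn, x, iters=50):
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

# 1. Measure: baseline latency and reference output of an FP32 matvec "model".
w32 = np.random.randn(1024, 1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)
baseline = mean_latency(lambda a: w32 @ a, x)
reference = w32 @ x

# 2. Optimize: store the weights in FP16 (half the memory; speed is hardware-dependent).
w16 = w32.astype(np.float16)
optimized = mean_latency(lambda a: w16 @ a.astype(np.float16), x)

# 3. Validate: quantify the size change, the latency change, and the accuracy cost.
error = np.abs(reference - (w16 @ x.astype(np.float16)).astype(np.float32)).max()
print(f"latency {baseline*1e3:.3f} ms -> {optimized*1e3:.3f} ms, "
      f"weights {w32.nbytes} -> {w16.nbytes} bytes, max abs error {error:.4f}")
```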
Key Achievement: MLPerf Torch Olympics#
After Module 19, you'll complete the MLPerf Torch Olympics Milestone (2018):
cd milestones/06_2018_mlperf
python 01_baseline_profile.py # Identify bottlenecks
python 02_compression.py # Quantize + prune (8-16× smaller)
python 03_generation_opts.py # KV-cache + batching (12-40× faster)
What makes this special: You'll have built the entire optimization pipeline from scratch: profiling tools, quantization engine, pruning algorithms, caching systems, and benchmarking infrastructure.
Two Optimization Tracks#
The Optimization tier has two parallel focuses:
Size Optimization (Modules 15-16):
Quantization (INT8 compression)
Pruning (removing parameters)
Goal: Smaller models for deployment
Speed Optimization (Modules 17-18):
Memoization (KV-cache)
Acceleration (batching, fusion)
Goal: Faster inference for production
Both tracks start from Module 14 (Profiling) and converge at Module 19 (Benchmarking).
Recommendation: Complete modules in order (14 → 15 → 16 → 17 → 18 → 19) to build a complete understanding of the optimization landscape.
Real-World Impact#
The techniques in this tier are used by every production ML system:
Quantization: TensorFlow Lite, ONNX Runtime, Apple Neural Engine
Pruning: Mobile ML, edge AI, efficient transformers
KV-Cache: All transformer inference engines (vLLM, TGI, llama.cpp)
Batching: Cloud serving (AWS SageMaker, GCP Vertex AI)
Benchmarking: MLPerf industry standard for AI performance
After this tier, you'll understand how real ML systems achieve production performance.
Next Steps#
Ready to optimize?
# Start the Optimization tier
tito module start 14_profiling
# Follow the measure → optimize → validate cycle
Or explore other tiers:
Foundation Tier (Modules 01-07): Mathematical foundations
Architecture Tier (Modules 08-13): CNNs and transformers
Torch Olympics (Module 20): Final integration challenge