#| fig-cap: "**Optimization Module Flow.** Starting from profiling, two parallel tracks: model-level (quantization, compression) and runtime (acceleration, memoization), converging at benchmarking." graph TB A[️ Architecture<br/>CNNs + Transformers] A --> M14[14. Profiling<br/>Find bottlenecks] M14 --> M15[15. Quantization<br/>INT8 compression] M14 --> M16[16. Compression<br/>Structured pruning] M15 --> SMALL[ Smaller Models<br/>4-16× size reduction] M16 --> SMALL M14 --> M17[17. Acceleration<br/>Vectorization + fusion] M17 --> M18[18. Memoization<br/>KV-cache for inference] M18 --> FAST[ Faster Inference<br/>12-40× speedup] SMALL --> M19[19. Benchmarking<br/>Systematic measurement] FAST --> M19 M19 --> OLYMPICS[ MLPerf Torch Olympics<br/>Production-ready systems] style A fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px style M14 fill:#fff3e0,stroke:#f57c00,stroke-width:3px style M15 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px style M16 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px style M17 fill:#ffcc80,stroke:#e65100,stroke-width:3px style M18 fill:#ffb74d,stroke:#e65100,stroke-width:3px style M19 fill:#ffa726,stroke:#e65100,stroke-width:4px style SMALL fill:#c8e6c9,stroke:#388e3c,stroke-width:3px style FAST fill:#c8e6c9,stroke:#388e3c,stroke-width:3px style OLYMPICS fill:#fef3c7,stroke:#f59e0b,stroke-width:4px
Optimization Tier (Modules 14-19)
Transform research prototypes into production-ready systems.
What You’ll Learn
The Optimization tier teaches you how to make ML systems fast, small, and deployable. You’ll learn systematic profiling, model compression through quantization and pruning, inference acceleration with caching and batching, and comprehensive benchmarking methodologies.
By the end of this tier, you’ll understand:
- How to identify performance bottlenecks through profiling
- Why quantization reduces model size by 4-16× with minimal accuracy loss
- How pruning removes unnecessary parameters to compress models
- What KV-caching does to accelerate transformer inference
- How batching and other optimizations achieve production speed
Module Progression
Module Details
14. Profiling - Measure Before Optimizing
What it is: Tools and techniques to identify computational bottlenecks in ML systems.
Why it matters: “Premature optimization is the root of all evil.” Profiling tells you WHERE to optimize—which operations consume the most time, memory, or energy. Without profiling, you’re guessing.
What you’ll build: Memory profilers, timing utilities, and FLOPs counters to analyze model performance.
Systems focus: Time complexity, space complexity, computational graphs, hotspot identification
Key insight: Don’t optimize blindly. Profile first, then optimize the bottlenecks.
15. Quantization - Smaller Models, Similar Accuracy
What it is: Converting FP32 weights to INT8 to reduce model size and speed up inference.
Why it matters: Quantization achieves 4× size reduction and faster computation with minimal accuracy loss (often <1%). Essential for deploying models on edge devices or reducing cloud costs.
What you’ll build: Post-training quantization (PTQ) for weights and activations with calibration.
Systems focus: Numerical precision, scale/zero-point calculation, quantization-aware operations
Impact: Models shrink from 100MB → 25MB while maintaining 95%+ of original accuracy.
16. Compression - Pruning Unnecessary Parameters
What it is: Removing unimportant weights and neurons through structured pruning.
Why it matters: Neural networks are often over-parameterized. Pruning removes 50-90% of parameters with minimal accuracy loss, reducing memory and computation.
What you’ll build: Magnitude-based pruning, structured pruning (entire channels/layers), and fine-tuning after pruning.
Systems focus: Sparsity patterns, memory layout, retraining strategies
Impact: Combined with quantization, achieve 8-16× compression (quantize + prune).
17. Acceleration - Vectorization and Fusion
What it is: Vectorized operations, SIMD utilization, and kernel fusion for maximum hardware efficiency.
Why it matters: Naive implementations waste 90%+ of hardware capability. Vectorization and fusion eliminate redundant memory traffic and exploit hardware parallelism for 10-100× speedups.
What you’ll build: Vectorized matrix multiplication, fused operations (like fused_gelu), and cache-optimal memory access patterns.
Systems focus: Cache locality, SIMD utilization, memory bandwidth optimization, loop tiling
Impact: Convolution speedup from naive loops to near-BLAS performance.
18. Memoization - KV-Cache for Fast Generation
What it is: Caching key-value pairs in transformers to avoid recomputing attention for previously generated tokens.
Why it matters: Without KV-cache, generating each new token requires O(n²) recomputation of all previous tokens. With KV-cache, generation becomes O(n), achieving 10-100× speedups for long sequences.
What you’ll build: KV-cache implementation for transformer inference with proper memory management.
Systems focus: Cache management, memory vs speed trade-offs, incremental computation
Impact: Text generation goes from 0.5 tokens/sec → 50+ tokens/sec.
19. Benchmarking - Systematic Measurement
What it is: Rigorous methodology for measuring model performance across multiple dimensions.
Why it matters: “What gets measured gets managed.” Benchmarking provides apples-to-apples comparisons of accuracy, speed, memory, and energy—essential for production decisions.
What you’ll build: Comprehensive benchmarking suite measuring accuracy, latency, throughput, memory, and FLOPs.
Systems focus: Measurement methodology, statistical significance, performance metrics
Historical context: MLCommons’ MLPerf (founded 2018) established systematic benchmarking as AI systems grew too complex for ad-hoc evaluation.
What You Can Build After This Tier
#| fig-cap: "**Production Optimization Timeline.** Progressive improvements from baseline to production-ready: 8-16× smaller models and 12-40× faster inference." timeline title Production-Ready Systems Baseline : 100MB model, 0.5 tokens/sec, 95% accuracy Quantization : 25MB model (4× smaller), same accuracy Pruning : 12MB model (8× smaller), 94% accuracy KV-Cache : 50 tokens/sec (100× faster generation) Batching : 500 tokens/sec (1000× throughput) MLPerf Olympics : Production-ready transformer deployment
After completing the Optimization tier, you’ll be able to:
Milestone 06 (2018): Achieve production-ready optimization:
8-16× smaller models (quantization + pruning)
12-40× faster inference (KV-cache + batching)
Systematic profiling and benchmarking workflows
Deploy models that run on:
Edge devices (Raspberry Pi, mobile phones)
Cloud infrastructure (cost-effective serving)
Real-time applications (low-latency requirements)
Prerequisites
Required: - ** Architecture Tier** (Modules 09-13) completed - Understanding of CNNs and/or transformers - Experience training models on real datasets - Basic understanding of systems concepts (memory, CPU/GPU, throughput)
Helpful but not required: - Production ML experience - Systems programming background - Understanding of hardware constraints
Time Commitment
Per module: 4-6 hours (implementation + profiling + benchmarking)
Total tier: ~30-40 hours for complete mastery
Recommended pace: 1 module per week (this tier is dense!)
Learning Approach
Each module follows Measure → Optimize → Validate:
- Measure: Profile baseline performance (time, memory, accuracy)
- Optimize: Implement optimization technique (quantize, prune, cache)
- Validate: Benchmark improvements and understand trade-offs
This mirrors production ML workflows where optimization is an iterative, data-driven process.
Key Achievement: MLPerf Torch Olympics
After Module 19, you’ll complete the MLPerf Torch Olympics Milestone (2018):
cd milestones/06_2018_mlperf
python 01_baseline_profile.py # Identify bottlenecks
python 02_compression.py # Quantize + prune (8-16× smaller)
python 03_generation_opts.py # KV-cache + batching (12-40× faster)What makes this special: You’ll have built the entire optimization pipeline from scratch—profiling tools, quantization engine, pruning algorithms, caching systems, and benchmarking infrastructure.
Two Optimization Categories
The Optimization tier follows a deliberate pedagogical structure: Measure → Model-Level → Runtime → Validate.
Model-Level Optimizations (Modules 15-16)
These techniques change the model itself—they permanently modify weights and architecture:
- Quantization (15): Convert FP32 weights to INT8 for 4× compression
- Compression (16): Prune unnecessary parameters for 2-10× size reduction
After model-level optimization, you have a different (smaller, faster) model.
Runtime Optimizations (Modules 17-18)
These techniques change how execution happens without modifying model weights:
- Acceleration (17): Vectorization exploits SIMD instructions; kernel fusion eliminates memory traffic. These apply to ANY numerical computation.
- Memoization (18): KV-cache is domain-specific optimization for transformer inference, trading memory for O(n²) → O(n) speedup.
Note the progression: general-purpose optimization (acceleration) comes before domain-specific optimization (KV-cache). This matches the pedagogical principle of teaching foundational concepts before specialized applications.
Why This Order?
The Optimization tier follows Measure → Transform → Execute → Validate:
Profile (14) → Model-Level (15-16) → Runtime (17-18) → Benchmark (19)
↓ ↓ ↓ ↓
"What's slow?" "Shrink the model" "Speed up execution" "Did it work?"
Why Model-Level before Runtime?
Model-level optimizations (quantization, pruning) are one-time transformations you apply before deployment. Runtime optimizations (acceleration, memoization) are ongoing techniques applied during every inference. It makes sense to first prepare the model, then optimize how it runs.
Why Acceleration (17) before Memoization (18)?
- Acceleration teaches general-purpose optimization: vectorization, cache locality, kernel fusion. These techniques apply to any numerical computation—matrix multiplication, convolutions, attention, everything.
- Memoization (KV-cache) is domain-specific optimization for transformer autoregressive generation. It only applies to transformers generating sequences.
The pedagogical principle: general before specific. Once you understand how to make any code fast (acceleration), you can appreciate the specialized optimization that makes LLM inference economically viable (KV-cache).
Both tracks start from Module 14 (Profiling) and converge at Module 19 (Benchmarking).
Recommendation: Complete modules in order (14→15→16→17→18→19) to build a complete understanding of the optimization landscape.
Real-World Impact
The techniques in this tier are used by every production ML system:
- Quantization: TensorFlow Lite, ONNX Runtime, Apple Neural Engine
- Pruning: Mobile ML, edge AI, efficient transformers
- KV-Cache: All transformer inference engines (vLLM, TGI, llama.cpp)
- Batching: Cloud serving (AWS SageMaker, GCP Vertex AI)
- Benchmarking: MLPerf industry standard for AI performance
After this tier, you’ll understand how real ML systems achieve production performance.
Next Steps
Ready to optimize?
# Start the Optimization tier
tito module start 14_profiling
# Follow the measure → optimize → validate cycleOr explore other tiers:
- Foundation Tier (Modules 01-08): Mathematical foundations
- Architecture Tier (Modules 09-13): CNNs and transformers
- Torch Olympics (Module 20): Final integration challenge