Optimization Tier (Modules 14-19)

Optimization Tier (Modules 14-19)#

Transform research prototypes into production-ready systems.

What You’ll Learn#

The Optimization tier teaches you how to make ML systems fast, small, and deployable. You’ll learn systematic profiling, model compression through quantization and pruning, inference acceleration with caching and batching, and comprehensive benchmarking methodologies.

By the end of this tier, you’ll understand:

How to identify performance bottlenecks through profiling
Why quantization reduces model size by 4-16× with minimal accuracy loss
How pruning removes unnecessary parameters to compress models
What KV-caching does to accelerate transformer inference
How batching and other optimizations achieve production speed

Module Progression#

        graph TB
 A[️ Architecture<br/>CNNs + Transformers]

 A --> M14[14. Profiling<br/>Find bottlenecks]

 M14 --> M15[15. Quantization<br/>INT8 compression]
 M14 --> M16[16. Compression<br/>Structured pruning]

 M15 --> SMALL[ Smaller Models<br/>4-16× size reduction]
 M16 --> SMALL

 M14 --> M17[17. Memoization<br/>KV-cache for inference]
 M17 --> M18[18. Acceleration<br/>Batching + optimizations]

 M18 --> FAST[ Faster Inference<br/>12-40× speedup]

 SMALL --> M19[19. Benchmarking<br/>Systematic measurement]
 FAST --> M19

 M19 --> OLYMPICS[ MLPerf Torch Olympics<br/>Production-ready systems]

 style A fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
 style M14 fill:#fff3e0,stroke:#f57c00,stroke-width:3px
 style M15 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
 style M16 fill:#ffe0b2,stroke:#ef6c00,stroke-width:3px
 style M17 fill:#ffcc80,stroke:#e65100,stroke-width:3px
 style M18 fill:#ffb74d,stroke:#e65100,stroke-width:3px
 style M19 fill:#ffa726,stroke:#e65100,stroke-width:4px
 style SMALL fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
 style FAST fill:#c8e6c9,stroke:#388e3c,stroke-width:3px
 style OLYMPICS fill:#fef3c7,stroke:#f59e0b,stroke-width:4px

Fig. 21 Optimization Module Flow. Starting from profiling, two parallel tracks address size reduction (quantization, compression) and speed improvement (memoization, acceleration), converging at benchmarking.#

Module Details#

14. Profiling - Measure Before Optimizing#

What it is: Tools and techniques to identify computational bottlenecks in ML systems.

Why it matters: “Premature optimization is the root of all evil.” Profiling tells you WHERE to optimize—which operations consume the most time, memory, or energy. Without profiling, you’re guessing.

What you’ll build: Memory profilers, timing utilities, and FLOPs counters to analyze model performance.

Systems focus: Time complexity, space complexity, computational graphs, hotspot identification

Key insight: Don’t optimize blindly. Profile first, then optimize the bottlenecks.

15. Quantization - Smaller Models, Similar Accuracy#

What it is: Converting FP32 weights to INT8 to reduce model size and speed up inference.

Why it matters: Quantization achieves 4× size reduction and faster computation with minimal accuracy loss (often <1%). Essential for deploying models on edge devices or reducing cloud costs.

What you’ll build: Post-training quantization (PTQ) for weights and activations with calibration.

Systems focus: Numerical precision, scale/zero-point calculation, quantization-aware operations

Impact: Models shrink from 100MB → 25MB while maintaining 95%+ of original accuracy.

16. Compression - Pruning Unnecessary Parameters#

What it is: Removing unimportant weights and neurons through structured pruning.

Why it matters: Neural networks are often over-parameterized. Pruning removes 50-90% of parameters with minimal accuracy loss, reducing memory and computation.

What you’ll build: Magnitude-based pruning, structured pruning (entire channels/layers), and fine-tuning after pruning.

Systems focus: Sparsity patterns, memory layout, retraining strategies

Impact: Combined with quantization, achieve 8-16× compression (quantize + prune).

17. Memoization - KV-Cache for Fast Generation#

What it is: Caching key-value pairs in transformers to avoid recomputing attention for previously generated tokens.

Why it matters: Without KV-cache, generating each new token requires O(n²) recomputation of all previous tokens. With KV-cache, generation becomes O(n), achieving 10-100× speedups for long sequences.

What you’ll build: KV-cache implementation for transformer inference with proper memory management.

Systems focus: Cache management, memory vs speed trade-offs, incremental computation

Impact: Text generation goes from 0.5 tokens/sec → 50+ tokens/sec.

18. Acceleration - Batching and Beyond#

What it is: Batching multiple requests, operation fusion, and other inference optimizations.

Why it matters: Production systems serve multiple users simultaneously. Batching amortizes overhead across requests, achieving near-linear throughput scaling.

What you’ll build: Dynamic batching, operation fusion, and inference server patterns.

Systems focus: Throughput vs latency, memory pooling, request scheduling

Impact: Combined with KV-cache, achieve 12-40× faster inference than naive implementations.

19. Benchmarking - Systematic Measurement#

What it is: Rigorous methodology for measuring model performance across multiple dimensions.

Why it matters: “What gets measured gets managed.” Benchmarking provides apples-to-apples comparisons of accuracy, speed, memory, and energy—essential for production decisions.

What you’ll build: Comprehensive benchmarking suite measuring accuracy, latency, throughput, memory, and FLOPs.

Systems focus: Measurement methodology, statistical significance, performance metrics

Historical context: MLCommons’ MLPerf (founded 2018) established systematic benchmarking as AI systems grew too complex for ad-hoc evaluation.

What You Can Build After This Tier#

        timeline
 title Production-Ready Systems
 Baseline : 100MB model, 0.5 tokens/sec, 95% accuracy
 Quantization : 25MB model (4× smaller), same accuracy
 Pruning : 12MB model (8× smaller), 94% accuracy
 KV-Cache : 50 tokens/sec (100× faster generation)
 Batching : 500 tokens/sec (1000× throughput)
 MLPerf Olympics : Production-ready transformer deployment

Fig. 22 Production Optimization Timeline. Progressive improvements from baseline to production-ready: 8-16× smaller models and 12-40× faster inference.#

After completing the Optimization tier, you’ll be able to:

Milestone 06 (2018): Achieve production-ready optimization:
8-16× smaller models (quantization + pruning)
12-40× faster inference (KV-cache + batching)
Systematic profiling and benchmarking workflows
Deploy models that run on:
Edge devices (Raspberry Pi, mobile phones)
Cloud infrastructure (cost-effective serving)
Real-time applications (low-latency requirements)

Prerequisites#

Required:

** Architecture Tier** (Modules 08-13) completed
Understanding of CNNs and/or transformers
Experience training models on real datasets
Basic understanding of systems concepts (memory, CPU/GPU, throughput)

Helpful but not required:

Production ML experience
Systems programming background
Understanding of hardware constraints

Time Commitment#

Per module: 4-6 hours (implementation + profiling + benchmarking)

Total tier: ~30-40 hours for complete mastery

Recommended pace: 1 module per week (this tier is dense!)

Learning Approach#

Each module follows Measure → Optimize → Validate:

Measure: Profile baseline performance (time, memory, accuracy)
Optimize: Implement optimization technique (quantize, prune, cache)
Validate: Benchmark improvements and understand trade-offs

This mirrors production ML workflows where optimization is an iterative, data-driven process.

Key Achievement: MLPerf Torch Olympics#

After Module 19, you’ll complete the MLPerf Torch Olympics Milestone (2018):

cd milestones/06_2018_mlperf
python 01_baseline_profile.py # Identify bottlenecks
python 02_compression.py # Quantize + prune (8-16× smaller)
python 03_generation_opts.py # KV-cache + batching (12-40× faster)

What makes this special: You’ll have built the entire optimization pipeline from scratch—profiling tools, quantization engine, pruning algorithms, caching systems, and benchmarking infrastructure.

Two Optimization Tracks#

The Optimization tier has two parallel focuses:

Size Optimization (Modules 15-16):

Quantization (INT8 compression)
Pruning (removing parameters)
Goal: Smaller models for deployment

Speed Optimization (Modules 17-18):

Memoization (KV-cache)
Acceleration (batching, fusion)
Goal: Faster inference for production

Both tracks start from Module 14 (Profiling) and converge at Module 19 (Benchmarking).

Recommendation: Complete modules in order (14→15→16→17→18→19) to build a complete understanding of the optimization landscape.

Real-World Impact#

The techniques in this tier are used by every production ML system:

Quantization: TensorFlow Lite, ONNX Runtime, Apple Neural Engine
Pruning: Mobile ML, edge AI, efficient transformers
KV-Cache: All transformer inference engines (vLLM, TGI, llama.cpp)
Batching: Cloud serving (AWS SageMaker, GCP Vertex AI)
Benchmarking: MLPerf industry standard for AI performance

After this tier, you’ll understand how real ML systems achieve production performance.

Next Steps#

Ready to optimize?

# Start the Optimization tier
tito module start 14_profiling

# Follow the measure → optimize → validate cycle

Or explore other tiers:

Foundation Tier (Modules 01-07): Mathematical foundations
Architecture Tier (Modules 08-13): CNNs and transformers
Torch Olympics (Module 20): Final integration challenge

← Back to Home • Milestone System