Tutorials

Learn ML systems reasoning through the 22 Systems Walls.

These tutorials teach you to reason quantitatively about ML infrastructure using the MLSys·im analytical framework. They are organized by the six domains of the 22 Systems Walls taxonomy — the same framework used in the Machine Learning Systems textbook.

Each tutorial answers one question. Start at the beginning for a guided path, or jump to any domain that matches your interest.

Tip: How to Use These Tutorials
  • Every tutorial runs on a laptop in under 30 seconds. No GPU required.
  • Code cells are executable. Clone the repo and run them, or follow along on the website.
  • Exercises ask you to predict first, then verify. This builds intuition faster than reading alone.
  • Tutorials within a cluster build on each other, but clusters are largely independent.
  • Time estimates are for reading + running code. Add 15–20 min if you do all exercises.

Start Here

Before diving into any domain, complete this introduction to the roofline model.

Beginner

0 · Hello, Roofline

Question: How do I predict whether my model is memory-bound or compute-bound?

Five lines of code, one answer. The foundation for everything that follows. ⏱ ~10 min
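The whole check fits in a few lines. A minimal sketch, assuming H100-class peak numbers (the peak FLOPS, bandwidth, and intensity values below are illustrative, not the tutorial's exact inputs):

```python
# Roofline: a kernel is compute-bound iff its arithmetic intensity
# (FLOPs per byte moved) exceeds the hardware's ridge point.
peak_flops = 989e12   # assumed peak, FLOP/s (H100-class, dense BF16)
peak_bw = 3.35e12     # assumed HBM bandwidth, bytes/s

ridge = peak_flops / peak_bw   # FLOP/byte at the ridge point
intensity = 2.0                # assumed, e.g. batch-1 LLM decode
regime = "compute-bound" if intensity > ridge else "memory-bound"
print(f"ridge = {ridge:.0f} FLOP/byte -> {regime}")
```

Everything that follows is variations on this one comparison.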

Start Tutorial →


Cluster 1: Node — Walls 1–3

One accelerator, one model. Where is the ceiling?

These tutorials explore the three walls that constrain a single accelerator: compute throughput (Wall 1), memory capacity (Wall 2), and memory bandwidth (Wall 3). Understanding which wall binds — and why — is the most fundamental skill in ML systems reasoning.

Beginner

1 · The Memory Wall

Question: Why doesn’t 3.2× more FLOPS give 3.2× speedup?

Compare A100 → H100 and discover that for LLM inference, bandwidth — not compute — is the binding constraint. The most important fallacy in ML systems. ⏱ ~15 min
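The core comparison can be sketched directly; the spec numbers below are approximate datasheet values (assumptions, not the tutorial's exact inputs):

```python
# A100 -> H100, approximate datasheet numbers (assumed; check vendor specs).
a100 = {"flops": 312e12, "bw": 2.0e12}    # dense BF16 FLOP/s, HBM bytes/s
h100 = {"flops": 989e12, "bw": 3.35e12}

flops_ratio = h100["flops"] / a100["flops"]   # ~3.2x more compute
bw_ratio = h100["bw"] / a100["bw"]            # ~1.7x more bandwidth

# Memory-bound decode time scales with bytes moved, so the speedup
# tracks the bandwidth ratio, not the FLOPS ratio.
print(f"FLOPS ratio ~{flops_ratio:.1f}x, but decode speedup ~{bw_ratio:.1f}x")
```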

Start Tutorial →

Intermediate

2 · Two Phases, One Request

Question: Why is LLM serving fundamentally different from CNN inference?

The same model on the same GPU hits two different ceilings: prefill is compute-bound, decode is memory-bound. This is why LLM serving requires its own analysis. ⏱ ~15 min
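A rough sketch of why the two phases land in different regimes: for a weight matrix read once, arithmetic intensity grows with the number of tokens that pass through it. The ridge-point value is an assumed H100-class figure:

```python
# Arithmetic intensity of a matmul layer, counting FLOPs per byte of
# weights moved: 2 FLOPs (multiply+add) per weight element per token,
# amortized over one dtype_bytes weight load.
def intensity(tokens, dtype_bytes=2):
    return 2 * tokens / dtype_bytes

ridge = 295                         # assumed FLOP/byte ridge (H100-class)
prefill = intensity(tokens=2048)    # whole prompt per weight load
decode = intensity(tokens=1)        # one new token per weight load
print(prefill > ridge)   # prefill: compute-bound
print(decode < ridge)    # decode: memory-bound
```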

Start Tutorial →

Intermediate

3 · KV-Cache: The Hidden Tax

Question: What actually limits how many users I can serve concurrently?

At 128K context length, the KV-cache alone fills an 80 GB GPU. Explore how sequence length, batch size, and paged attention interact to constrain serving capacity. ⏱ ~20 min
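The back-of-envelope arithmetic is simple; the shapes below are assumed LLaMA-70B-like values (80 layers, 8 grouped KV heads of dimension 128, fp16), not the tutorial's exact configuration:

```python
# KV-cache footprint: 2 tensors (K and V) per layer, per token.
def kv_cache_gib(seq_len, layers=80, kv_heads=8, head_dim=128,
                 dtype_bytes=2, batch=1):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token / 2**30

print(f"{kv_cache_gib(128 * 1024):.1f} GiB per 128K-token sequence")
```

Under these assumptions one 128K sequence costs ~40 GiB of cache, so even two concurrent sequences plus model weights overflow an 80 GB device.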

Start Tutorial →


Cluster 2: Data — Walls 8–10

The GPU is fast. Is the pipeline faster?

Even the fastest accelerator sits idle if the data pipeline cannot keep up. These walls cover ingestion (Wall 8), transformation (Wall 9), and storage bandwidth (Wall 10).

Intermediate

4 · Starving the GPU

Question: Why is my GPU utilization only 40%?

A100 compute takes 48 ms per step, but the CPU augmentation pipeline is the true bottleneck. The binding constraint is JPEG decoding, not silicon. ⏱ ~15 min
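The utilization math is just "the slowest overlapped stage wins." The GPU time comes from the blurb above; the CPU-side number is an assumption chosen to illustrate the 40% figure:

```python
# Pipeline throughput is set by the slowest stage, not by the GPU.
gpu_step_ms = 48.0      # A100 compute per step (from the tutorial)
cpu_decode_ms = 120.0   # assumed CPU JPEG-decode + augment per step

step_ms = max(gpu_step_ms, cpu_decode_ms)   # stages overlap; slowest wins
utilization = gpu_step_ms / step_ms
print(f"GPU utilization ~{utilization:.0%}")
```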

Start Tutorial →


Cluster 3: Algorithm — Walls 11–13

Can I make the model smaller or the training cheaper?

These walls govern scaling laws (Wall 11), compression and quantization (Wall 12), and architecture efficiency (Wall 13).

Intermediate

5 · Quantization: Not a Free Lunch

Question: Does INT4 always give 4× speedup?

For memory-bound decode: nearly 4×. For compute-bound training: 0×. The regime determines whether quantization helps — and the roofline tells you which regime you’re in. ⏱ ~20 min
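A first-order sketch of the two regimes, assuming weight-only INT4 quantization and standard byte widths:

```python
# Quantization speedup depends on which resource binds.
def decode_speedup(bytes_fp16=2, bytes_int4=0.5):
    # Memory-bound: time ~ bytes moved, so speedup = byte-width ratio.
    return bytes_fp16 / bytes_int4

def training_speedup():
    # Compute-bound: weight-only INT4 does not cut FLOPs, so to first
    # order there is no speedup at all.
    return 1.0

print(decode_speedup())    # 4.0 -> ~4x for memory-bound decode
print(training_speedup())  # 1.0 -> no gain for compute-bound training
```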

Start Tutorial →


Cluster 4: Fleet — Walls 14–16

Scaling past one machine. Where does efficiency go?

Distributed training introduces three new walls: communication overhead (Wall 14), synchronization cost (Wall 15), and reliability (Wall 16).

Advanced

6 · Scaling to 1000 GPUs

Question: Where does my training efficiency disappear at scale?

At 1024 GPUs, AllReduce communication overhead erodes scaling efficiency. But the real hidden cost is reliability: cluster MTBF drops to ~20 hours, forcing frequent checkpoints that consume more wall-clock time than communication itself. ⏱ ~20 min
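The reliability cliff follows from treating failures as independent, so cluster MTBF shrinks as 1/N. The per-GPU MTBF below is an assumed round number:

```python
# With independent node failures, cluster MTBF = node MTBF / N.
node_mtbf_h = 20_000   # assumed per-GPU MTBF, hours
n_gpus = 1024

cluster_mtbf_h = node_mtbf_h / n_gpus
print(f"cluster MTBF ~{cluster_mtbf_h:.1f} h")   # ~20 h at 1024 GPUs
```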

Start Tutorial →


Cluster 5: Ops — Walls 17–20

What does it cost — in dollars, carbon, and water?

Operational walls cover energy (Wall 17), sustainability (Wall 18), economic cost (Wall 19), and safety (Wall 20). These are the walls that determine whether a technically feasible system is actually deployable.

Intermediate

7 · Geography is a Systems Variable

Question: Does it matter where I train?

Same 256-GPU cluster, same model, same duration: 412 tonnes CO₂ in Iowa vs. 10 tonnes in Québec. A 40× difference from geography alone. ⏱ ~15 min
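The emissions arithmetic is one multiplication: energy drawn times grid carbon intensity. Every number below (power draw, duration, PUE, grid intensities) is an illustrative assumption, so the ratio here (~15×) is smaller than the tutorial's 40×, but the mechanism is the same:

```python
# Training emissions = energy drawn x grid carbon intensity.
def tonnes_co2(n_gpus, kw_per_gpu, hours, pue, grid_g_per_kwh):
    kwh = n_gpus * kw_per_gpu * hours * pue
    return kwh * grid_g_per_kwh / 1e6   # grams -> tonnes

run = dict(n_gpus=256, kw_per_gpu=0.7, hours=24 * 30, pue=1.2)
heavy = tonnes_co2(**run, grid_g_per_kwh=450)  # assumed carbon-heavy grid
light = tonnes_co2(**run, grid_g_per_kwh=30)   # assumed hydro-heavy grid
print(f"{heavy:.0f} t vs {light:.1f} t -> {heavy / light:.0f}x")
```

Note that the ratio is exactly the ratio of grid intensities, since everything else cancels.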

Start Tutorial →

Intermediate

8 · The $9M Question

Question: How much does chain-of-thought reasoning actually cost?

K=8 reasoning steps multiply your serving bill by 7.6× — from $1.2M to $9.1M per year. A seemingly simple algorithmic choice becomes a capital expenditure decision. ⏱ ~20 min
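A sketch of the multiplier under a simple token-volume cost model. The prefill share is an assumed parameter, tuned here to roughly reproduce the tutorial's 7.6×, not taken from the tutorial itself:

```python
# Chain-of-thought multiplies decode-token volume roughly by K steps.
# Cost model (assumption): bill ~ fixed prefill share + scaled decode share.
base_annual = 1.2e6    # $/yr baseline serving bill (from the tutorial)
prefill_share = 0.05   # assumed fraction of cost CoT does not scale
K = 8

multiplier = prefill_share + (1 - prefill_share) * K
print(f"~{multiplier:.1f}x -> ${base_annual * multiplier / 1e6:.1f}M/yr")
```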

Start Tutorial →


Cluster 6: Analysis — Walls 21–22

Cross-cutting diagnostics. Which knob matters most?

These walls provide the tools for sensitivity analysis (Wall 21) and synthesis (Wall 22) — the ability to ask “what if?” and “what must be true?”

Advanced

9 · Where to Invest: Sensitivity Analysis

Question: Should I buy more FLOPS or more bandwidth?

∂T/∂BW = −0.88 vs. ∂T/∂FLOPS = −0.06. For LLM inference, a 10% bandwidth increase yields 15× more improvement than a 10% compute increase. Then invert the roofline model to derive the minimum hardware spec from an SLA. ⏱ ~20 min
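The same question can be asked by finite differences: perturb each resource by 10% and compare the gains. The additive time model and workload numbers below are illustrative assumptions, so the elasticities will not match the tutorial's exactly, but the ~15× gap does:

```python
# T = memory time + compute time; perturb each resource by +10%.
def step_time(bytes_moved, flops, bw, peak_flops):
    return bytes_moved / bw + flops / peak_flops

work = dict(bytes_moved=140e9, flops=20 * 140e9)   # assumed decode step
bw, pf = 3.35e12, 989e12                           # assumed H100-class peaks

base = step_time(**work, bw=bw, peak_flops=pf)
gain_bw = 1 - step_time(**work, bw=1.1 * bw, peak_flops=pf) / base
gain_pf = 1 - step_time(**work, bw=bw, peak_flops=1.1 * pf) / base
print(f"+10% BW: {gain_bw:.1%}, +10% FLOPS: {gain_pf:.1%}, "
      f"ratio ~{gain_bw / gain_pf:.0f}x")
```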

Start Tutorial →

Advanced

10 · GPU vs. Wafer-Scale

Question: Can a fundamentally different architecture change which wall binds?

Cerebras eliminates the HBM memory wall entirely — but the binding constraint shifts to injection bandwidth. A qualitative regime change, not just a speedup. ⏱ ~20 min

Start Tutorial →


Cluster 7: Capstone

Compose everything. All 22 walls, one analysis.

Advanced

11 · Full-Stack Audit: LLaMA-70B Training

Question: What does a complete systems analysis look like?

Trace LLaMA-70B training through all six domains: Node → Data → Algorithm → Fleet → Ops → Analysis. Twelve of the 22 walls exercised in one coherent analysis. ⏱ ~30 min

Start Tutorial →


Extending MLSys·im

Developer

Custom Solvers & Hardware

Learn to contribute new hardware specifications to the Silicon Zoo or build your own analytical solvers using the 5-layer architecture. ⏱ ~15 min

Start Guide →


Learning Paths

Choose a path based on your role:

Path A: First-Time Learner (~ 2 hours)

“I’m new to ML systems and want to build intuition from scratch.”

0 → 1 → 2 → 3 → 4 → 5 → 7 → 11

Why this order: Start with single-node physics (roofline, memory wall, two phases), understand the KV-cache memory constraint that dominates LLM serving, see the data pipeline bottleneck, understand quantization regimes, learn that geography matters, then compose it all in the capstone. Tutorial 6 (distributed reliability) is deferred — come back to it after the capstone.

Path B: ML Engineer (~ 2.5 hours)

“I deploy models in production and need to make hardware decisions.”

0 → 1 → 2 → 3 → 8 → 9 → 11

Path C: Researcher (~ 2 hours)

“I evaluate hardware architectures and need quantitative tools.”

0 → 1 → 5 → 9 → 10 → 6 → 11

Path D: Conference Tutorial (90 min)

“I’m attending a live tutorial at ISCA / MLSys / ASPLOS.”

| Time | Tutorial | Core Message | Format |
|---|---|---|---|
| 0–15 min | 0. Hello, Roofline | The equation, the tool, the regime | Live coding, audience predicts |
| 15–35 min | 1. The Memory Wall | Why 3.2× FLOPS ≠ 3.2× speedup | Live coding, audience verifies |
| 35–45 min | Break + Q&A | | |
| 45–60 min | 9. Sensitivity | Bandwidth is 15× more valuable | Live coding + derivation |
| 60–75 min | 6. Scaling to 1000 GPUs | Reliability dominates communication | Pre-computed results, discussion |
| 75–85 min | 11. Full-Stack Audit | Composing all six domains | Summary table walkthrough |
| 85–90 min | | Where to learn more | Take-home exercises |

Why 5 tutorials, not 8: Each tutorial needs enough time for the predict-compute-reflect cycle to land. Tutorials 2, 5, and 7 are available as take-home exercises for attendees who want to continue after the session.

Path D+: Half-Day Workshop (3 hours)

“I’m attending a half-day tutorial at ISCA / ASPLOS.”

A ten-tutorial sequence with hands-on exercises: all 8 tutorials from the conference path (the 5 live sessions plus the 3 take-home exercises) together with Tutorials 3 (KV-Cache) and 10 (Wafer-Scale), with 15-minute breaks between clusters.


All tutorials are Quarto-compatible. Run them locally after pip install mlsysim, or browse the rendered versions on this website.

Back to top