For Students
Build intuition for ML systems – without needing GPU hardware.
Whether you are taking your first ML systems course or preparing for industry interviews, MLSYSIM lets you experiment with real hardware specifications and see exactly why systems behave the way they do. Every number comes from a real datasheet. Every equation is grounded in peer-reviewed literature.
What You Will Learn
By working through the MLSYSIM tutorials and exercises, you will:
- Identify bottlenecks – Determine whether a workload is memory-bound or compute-bound on any hardware, and understand why
- Reason quantitatively – Use real datasheet numbers (not made-up examples) to calculate latency, throughput, and cost
- Build systems intuition – See how batch size, precision, parallelism strategy, and datacenter location each affect performance
- Think across the stack – Connect workload characteristics to hardware specs to infrastructure constraints
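The first skill, bottleneck identification, comes down to one comparison: a workload's arithmetic intensity (FLOPs per byte of memory traffic) versus the hardware's compute-to-bandwidth ratio. Here is a minimal sketch in plain Python using ballpark A100 and ResNet-50 numbers; the FLOP and byte counts are illustrative approximations, not solver output:

```python
# Roofline-style bottleneck check: compare arithmetic intensity
# (FLOPs per byte moved) against the hardware's balance point.

# Ballpark A100 datasheet numbers (illustrative)
peak_flops = 312e12        # FP16 tensor-core FLOP/s
mem_bandwidth = 1.555e12   # HBM2 bytes/s (40 GB model)
machine_balance = peak_flops / mem_bandwidth   # ~200 FLOPs/byte

# Rough ResNet-50 inference at batch size 1 (illustrative)
flops = 8.2e9              # ~4.1 GMACs x 2 FLOPs per MAC
bytes_moved = 51.2e6       # ~25.6M weights at 2 bytes each (fp16)
intensity = flops / bytes_moved                # ~160 FLOPs/byte

bottleneck = "memory-bound" if intensity < machine_balance else "compute-bound"
print(f"intensity={intensity:.0f} FLOPs/B, "
      f"balance={machine_balance:.0f} FLOPs/B -> {bottleneck}")
```

With these numbers the workload lands below the balance point, so it is memory-bound: the GPU spends more time moving weights than computing on them.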
Prerequisites
- Python: Comfortable with functions, loops, and f-strings
- Math: Basic algebra (no calculus required – all solver equations are arithmetic)
- ML: Familiarity with terms like “model parameters,” “inference,” and “training” (the Glossary defines everything else)
No GPU, no cloud account, no special hardware required. Just:
```bash
pip install mlsysim
```

See the Getting Started guide for development installs and Colab/Binder options.
Quick Start
```python
import mlsysim
from mlsysim import Engine

# Load a model and hardware from the vetted registry
model = mlsysim.Models.ResNet50
gpu = mlsysim.Hardware.Cloud.A100

# Solve: is this workload memory-bound or compute-bound?
profile = Engine.solve(model=model, hardware=gpu, batch_size=1, precision="fp16")

print(f"Bottleneck: {profile.bottleneck}")  # → Memory Bound
print(f"Latency: {profile.latency.to('ms'):~.2f}")
```

Your Learning Path
Start at the top and work through in order. Each tutorial builds on the one before it. The Companion Slides column links directly to the lecture deck that covers the same material – use them for visual explanations, worked examples, and active learning exercises.
| Step | Tutorial | You Will Learn | Time | Companion Slides |
|---|---|---|---|---|
| 1 | Hello World | The roofline model, memory-bound vs. compute-bound, batch size sweeps | 15 min | Hardware Acceleration (Vol I, Ch 11) |
| 2 | Sustainability Lab | Energy, carbon footprint, regional grid effects | 20 min | Sustainable AI (Vol II, Ch 15) |
| 3 | LLM Serving | TTFT vs. ITL, KV-cache pressure, the two phases of LLM inference | 25 min | Model Serving (Vol I, Ch 13) and Inference at Scale (Vol II, Ch 9) |
| 4 | Distributed Training | Data/tensor/pipeline parallelism, communication overhead, scaling efficiency | 30 min | Distributed Training (Vol II, Ch 5) and Collective Communication (Vol II, Ch 6) |
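The two metrics in Tutorial 3 combine into a single end-to-end number: total generation latency is the time to first token (prefill) plus one inter-token interval per remaining output token. A quick sketch with illustrative timings (not measured values):

```python
# Total LLM generation latency:
#   latency = TTFT (prefill) + (output_tokens - 1) * ITL (decode steps)
ttft_ms = 180.0        # time to first token (illustrative)
itl_ms = 25.0          # inter-token latency (illustrative)
output_tokens = 256

total_ms = ttft_ms + (output_tokens - 1) * itl_ms
print(f"Total: {total_ms / 1000:.2f} s")  # → Total: 6.56 s
```

Notice that for long outputs the decode phase dominates, which is why ITL, not TTFT, usually drives throughput optimization.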
Every tutorial includes “predict first” exercises. Before running code, write down what you expect. This practice builds the mental models that make you effective at systems reasoning. The companion slide decks include the same predict-first methodology with 8–11 active learning moments per deck.
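Before starting Tutorial 2, try predicting the outputs of its core relationship, which is plain arithmetic: energy is power times time, and carbon scales with the local grid's intensity. A hedged sketch follows; all figures are illustrative, and real regional grid intensities vary by year and data source:

```python
# Energy and carbon for one training run:
#   E = P * t,  CO2e = E * grid_intensity
gpu_power_w = 400      # sustained draw per GPU in watts (illustrative)
num_gpus = 8
hours = 24.0

energy_kwh = gpu_power_w * num_gpus * hours / 1000  # 76.8 kWh

# Illustrative grid carbon intensities, kg CO2e per kWh
grid = {"coal-heavy": 0.8, "mixed": 0.4, "hydro-heavy": 0.05}
for region, intensity in grid.items():
    print(f"{region}: {energy_kwh * intensity:.1f} kg CO2e")
```

The same run emits over an order of magnitude more carbon on a coal-heavy grid than a hydro-heavy one, which is the "regional grid effects" lesson in miniature.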
How MLSYSIM Maps to the Textbook and Slides
MLSYSIM is the companion framework for the Machine Learning Systems textbook. Each solver maps to specific chapters and slide decks. Use the slide links below to review the theory before (or after) running the solver.
| MLSYSIM Solver | What It Models | Textbook Topic | Slide Deck |
|---|---|---|---|
| SingleNodeModel | Roofline analysis, compute vs. memory bottleneck | Hardware Acceleration | Vol I, Ch 11 |
| ServingModel | TTFT, ITL, KV-cache memory | Model Serving | Vol I, Ch 13 |
| DistributedModel | 3D parallelism, all-reduce, pipeline bubbles | Distributed Training | Vol II, Ch 5 |
| EconomicsModel | CapEx, OpEx, TCO | Compute Infrastructure | Vol II, Ch 2 |
| SustainabilityModel | Energy, carbon, water usage | Sustainable AI | Vol II, Ch 15 |
| ReliabilityModel | MTBF, checkpoint interval | Fault Tolerance | Vol II, Ch 7 |
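To make one row of the mapping concrete: the checkpoint-interval quantity behind the ReliabilityModel row has a classic closed form, Young's approximation, which places the interval near the square root of twice the checkpoint cost times the MTBF. A sketch with illustrative numbers (the solver's exact formulation may differ):

```python
import math

# Young's approximation for checkpoint interval: t_opt ≈ sqrt(2 * C * MTBF)
mtbf_s = 86_400          # one failure per day on average (illustrative)
checkpoint_cost_s = 300  # time to write one checkpoint (illustrative)

t_opt = math.sqrt(2 * checkpoint_cost_s * mtbf_s)
print(f"Checkpoint every {t_opt / 3600:.1f} hours")  # → Checkpoint every 2.0 hours
```

The square root captures the trade-off: checkpoint too often and you pay constant overhead, too rarely and each failure loses more work.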
Not using the textbook? No problem – MLSYSIM is self-contained. The Math Foundations page documents every equation, and each slide deck stands on its own with full speaker notes.
Recommended Study Workflow
Whether you are self-studying or following a course, this workflow maximizes retention:
- Read the textbook chapter (or skim the slide deck) to get the conceptual framework
- Predict what will happen before running any code – write it down
- Simulate using MLSYSIM to test your prediction against real hardware specs
- Explore by changing one parameter at a time (batch size, precision, hardware) and observing the effect
- Reflect on where your prediction was wrong – that gap is where learning happens
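Step 4 in action: sweeping only batch size in the roofline arithmetic shows the bottleneck flip from memory- to compute-bound. This sketch uses illustrative A100-like numbers and assumes weight traffic dominates memory movement (activations ignored for simplicity):

```python
# Vary one parameter (batch size): FLOPs scale with batch,
# weight traffic does not, so arithmetic intensity grows linearly.
machine_balance = 200.0   # FLOPs/byte, A100-like (illustrative)
flops_per_sample = 8.2e9  # ResNet-50-ish forward pass (illustrative)
weight_bytes = 51.2e6     # fp16 weights, assumed to dominate traffic

for batch in (1, 2, 4, 8):
    intensity = batch * flops_per_sample / weight_bytes
    label = "memory-bound" if intensity < machine_balance else "compute-bound"
    print(f"batch={batch}: {intensity:6.0f} FLOPs/B -> {label}")
```

Under these assumptions the crossover happens between batch 1 and batch 2; predict where it lands before running the sweep, then check yourself.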
If you are self-studying, the slide decks include speaker notes with timing guidance, teaching tips, and common misconceptions – they are written to be useful even without an instructor. If you are in a course, your instructor may assign specific tutorials as homework; check the Instructor Guide for the recommended pairing.
Slides at a Glance
The full slide collection covers both volumes of the textbook. Every deck includes speaker notes, active learning exercises, and original SVG diagrams.
Volume I: Foundations (17 decks, 570 slides)
Covers the single-machine ML stack: data engineering, neural computation, architectures, frameworks, training, compression, hardware acceleration, serving, and operations.
Volume II: At Scale (18 decks, 529 slides)
Covers distributed infrastructure: compute clusters, network fabrics, distributed training, fault tolerance, fleet orchestration, inference at scale, and governance.
Next Steps
- Getting Started – Install MLSYSIM and run your first analysis
- Hello World Tutorial – Your first roofline analysis
- Solver Guide – Deep dive into each solver’s capabilities
- Glossary – Look up any unfamiliar term
- Math Foundations – The equations behind every solver
- All Slide Decks – 35 Beamer decks with speaker notes and active learning exercises