For Students

Build intuition for ML systems – without needing GPU hardware.

Whether you are taking your first ML systems course or preparing for industry interviews, MLSYSIM lets you experiment with real hardware specifications and see exactly why systems behave the way they do. Every number comes from a real datasheet. Every equation is grounded in peer-reviewed literature.


What You Will Learn

By working through the MLSYSIM tutorials and exercises, you will:

  • Identify bottlenecks – Determine whether a workload is memory-bound or compute-bound on any hardware, and understand why
  • Reason quantitatively – Use real datasheet numbers (not made-up examples) to calculate latency, throughput, and cost
  • Build systems intuition – See how batch size, precision, parallelism strategy, and datacenter location each affect performance
  • Think across the stack – Connect workload characteristics to hardware specs to infrastructure constraints
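The bottleneck reasoning above boils down to one comparison: a workload's arithmetic intensity (FLOPs per byte moved) versus the hardware's compute-to-bandwidth ratio. Here is a minimal sketch in plain Python – the helper function is illustrative, not part of the MLSYSIM API, and the A100 figures are public datasheet numbers:

```python
# Roofline check: a workload is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) falls below the hardware's
# compute/bandwidth ratio (the "ridge point"), compute-bound otherwise.
def bottleneck(flops: float, bytes_moved: float,
               peak_flops: float, peak_bw: float) -> str:
    intensity = flops / bytes_moved   # FLOPs per byte
    ridge = peak_flops / peak_bw      # hardware balance point
    return "memory-bound" if intensity < ridge else "compute-bound"

# Illustrative A100 datasheet figures (FP16 tensor-core peak, HBM bandwidth)
A100_PEAK_FLOPS = 312e12   # 312 TFLOP/s
A100_PEAK_BW = 2.0e12      # ~2 TB/s

# A batch-1, GEMV-like layer: few FLOPs per byte of weights moved
print(bottleneck(flops=2e9, bytes_moved=1e9,
                 peak_flops=A100_PEAK_FLOPS, peak_bw=A100_PEAK_BW))
# → memory-bound (intensity 2 FLOPs/byte vs. a ridge of 156)
```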

Prerequisites

  • Python: Comfortable with functions, loops, and f-strings
  • Math: Basic algebra (no calculus required – all solver equations are arithmetic)
  • ML: Familiarity with terms like “model parameters,” “inference,” and “training” (the Glossary defines everything else)

No GPU, no cloud account, no special hardware required. Just:

pip install mlsysim

See the Getting Started guide for development installs and Colab/Binder options.


Quick Start

import mlsysim
from mlsysim import Engine

# Load a model and hardware from the vetted registry
model = mlsysim.Models.ResNet50
gpu   = mlsysim.Hardware.Cloud.A100

# Solve: is this workload memory-bound or compute-bound?
profile = Engine.solve(model=model, hardware=gpu, batch_size=1, precision="fp16")

print(f"Bottleneck: {profile.bottleneck}")   # → Memory Bound
print(f"Latency:    {profile.latency.to('ms'):~.2f}")
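Under the hood, a latency estimate like the one above reduces to taking the slower of two times: doing the FLOPs and moving the bytes. The sketch below is back-of-envelope plain Python, not the MLSYSIM internals; the A100 peaks and the ResNet-50-scale numbers are assumptions for illustration:

```python
# Latency lower bound: execution takes at least as long as the slower
# of (a) performing the FLOPs and (b) moving the bytes.
def latency_s(flops: float, bytes_moved: float,
              peak_flops: float = 312e12, peak_bw: float = 2.0e12) -> float:
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Sweep batch size: FLOPs scale with the batch, but weight traffic is
# paid once per batch (simplified: treat traffic as roughly constant),
# so small batches are dominated by memory time.
traffic_bytes = 100e6     # ~50 MB fp16 weights plus activation traffic
flops_per_sample = 8e9    # ~4 GMACs per image, ResNet-50 scale
for b in (1, 8, 64):
    t = latency_s(b * flops_per_sample, traffic_bytes)
    print(f"batch={b:3d}  latency={t * 1e3:.3f} ms")
```

At batch 1 the memory term dominates; by batch 64 the compute term does – the same transition the solver reports as the bottleneck flipping.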

Your Learning Path

Start at the top and work through in order. Each tutorial builds on the one before it. The Companion Slides column links directly to the lecture deck that covers the same material – use them for visual explanations, worked examples, and active learning exercises.

| Step | Tutorial | You Will Learn | Time | Companion Slides |
|------|----------|----------------|------|------------------|
| 1 | Hello World | The roofline model, memory-bound vs. compute-bound, batch size sweeps | 15 min | Hardware Acceleration (Vol I, Ch 11) |
| 2 | Sustainability Lab | Energy, carbon footprint, regional grid effects | 20 min | Sustainable AI (Vol II, Ch 15) |
| 3 | LLM Serving | TTFT vs. ITL, KV-cache pressure, the two phases of LLM inference | 25 min | Model Serving (Vol I, Ch 13) and Inference at Scale (Vol II, Ch 9) |
| 4 | Distributed Training | Data/tensor/pipeline parallelism, communication overhead, scaling efficiency | 30 min | Distributed Training (Vol II, Ch 5) and Collective Communication (Vol II, Ch 6) |
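The KV-cache pressure covered in the LLM Serving tutorial is simple arithmetic: keys and values for every layer, head, and cached token. A hedged sketch in plain Python (the 7B-class configuration is an assumed example, not a registry entry):

```python
# KV-cache footprint per sequence: keys and values (the factor of 2)
# for every layer, every KV head, and every cached token.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class configuration, fp16 cache
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                         seq_len=4096, bytes_per_elem=2)
print(f"KV cache per 4k-token sequence: {per_seq / 2**30:.2f} GiB")
# → KV cache per 4k-token sequence: 2.00 GiB
```

A batch of concurrent sequences multiplies this per-sequence cost, which is why KV-cache memory caps serving batch size long before the weights do.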
Tip: Predict Before You Compute

Every tutorial includes “predict first” exercises. Before running code, write down what you expect. This practice builds the mental models that make you effective at systems reasoning. The companion slide decks include the same predict-first methodology with 8–11 active learning moments per deck.


How MLSYSIM Maps to the Textbook and Slides

MLSYSIM is the companion framework for the Machine Learning Systems textbook. Each solver maps to specific chapters and slide decks. Use the slide links below to review the theory before (or after) running the solver.

| MLSYSIM Solver | What It Models | Textbook Topic | Slide Deck |
|----------------|----------------|----------------|------------|
| SingleNodeModel | Roofline analysis, compute vs. memory bottleneck | Hardware Acceleration | Vol I, Ch 11 |
| ServingModel | TTFT, ITL, KV-cache memory | Model Serving | Vol I, Ch 13 |
| DistributedModel | 3D parallelism, all-reduce, pipeline bubbles | Distributed Training | Vol II, Ch 5 |
| EconomicsModel | CapEx, OpEx, TCO | Compute Infrastructure | Vol II, Ch 2 |
| SustainabilityModel | Energy, carbon, water usage | Sustainable AI | Vol II, Ch 15 |
| ReliabilityModel | MTBF, checkpoint interval | Fault Tolerance | Vol II, Ch 7 |
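To give a flavor of the checkpoint-interval reasoning behind the ReliabilityModel row, the classic back-of-envelope result is Young's approximation, τ ≈ √(2 · δ · MTBF), where δ is the cost of writing one checkpoint. This sketch is the textbook formula in plain Python, not necessarily what MLSYSIM implements:

```python
import math

# Young's approximation for the near-optimal checkpoint interval:
# tau = sqrt(2 * checkpoint_cost * MTBF). Checkpoint too often and you
# pay constant overhead; too rarely and each failure wastes more
# recomputed work.
def optimal_checkpoint_interval_s(checkpoint_cost_s: float,
                                  mtbf_s: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: 60 s to write a checkpoint, one failure per 24 h of cluster time
tau = optimal_checkpoint_interval_s(checkpoint_cost_s=60, mtbf_s=24 * 3600)
print(f"Checkpoint every {tau / 60:.1f} minutes")
# → Checkpoint every 53.7 minutes
```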

Not using the textbook? No problem – MLSYSIM is self-contained. The Math Foundations page documents every equation, and each slide deck stands on its own with full speaker notes.


Slides at a Glance

The full slide collection covers both volumes of the textbook. Every deck includes speaker notes, active learning exercises, and original SVG diagrams.

Volume I: Foundations (17 decks, 570 slides)

Covers the single-machine ML stack: data engineering, neural computation, architectures, frameworks, training, compression, hardware acceleration, serving, and operations.

Browse Vol I Decks | Download All (PDF)

Volume II: At Scale (18 decks, 529 slides)

Covers distributed infrastructure: compute clusters, network fabrics, distributed training, fault tolerance, fleet orchestration, inference at scale, and governance.

Browse Vol II Decks | Download All (PDF)


Next Steps
