For Instructors
Reproducible, hardware-independent exercises — paired with 35 lecture decks and 266 diagrams.
MLSYSIM provides a framework for assigning analytically grounded problem sets where every answer is deterministic and reproducible — regardless of what hardware your students have access to. Combined with the companion lecture slides, it forms a complete teaching toolkit for ML systems courses.
Why MLSYSIM for Teaching?
| Challenge | How MLSYSIM Helps |
|---|---|
| Students lack GPU access | All analysis runs on a laptop — no cloud credits needed |
| Homework answers vary by hardware | Vetted registry specs produce identical results everywhere |
| Hard to grade open-ended systems questions | Analytical solvers give deterministic, verifiable outputs |
| Specifications become stale | Registry updated from official datasheets; one update propagates everywhere |
| Students memorize without understanding | “Predict first” exercises build genuine intuition |
| No time to build slides from scratch | 35 Beamer decks with speaker notes, active learning, and SVG diagrams ready to use |
The Teaching Ecosystem
MLSYSIM is one component of a larger open teaching toolkit:
| Resource | What It Provides | Link |
|---|---|---|
| Textbook | Two-volume open textbook — foundations (Vol I) and scale (Vol II) | mlsysbook.ai |
| Lecture Slides | 35 Beamer decks, 1,099 slides, 266 SVG diagrams, speaker notes on every slide | Slides Portal |
| MLSYSIM | 6 analytical solvers, typed hardware registry, deterministic assignments | Getting Started |
| TinyML Courseware | 4-course sequence with 178 slide decks for embedded ML | TinyML Slides |
| Teaching Guide | 16-week semester plans, active learning taxonomy, customization guide | Teaching Guide |
Course Integration Patterns
Pattern 1 — Textbook Companion (Full Semester)
Map MLSYSIM tutorials and assignments directly to textbook chapters and lecture decks. The table below shows selected weeks from one possible 16-week arrangement using Volume I slides.
| Week | Lecture Slides | Textbook Topic | MLSYSIM Assignment |
|---|---|---|---|
| 2 | Introduction | The Iron Law of ML Systems | Read Hello World warmup — identify bottleneck equation |
| 5 | NN Computation | FLOPs, memory footprint | Hello World — roofline analysis, batch size sweep |
| 8 | Model Training | Training memory budget | Solver Guide — TrainingStateSolver, ZeRO stages |
| 11 | HW Acceleration | Roofline model, accelerator comparison | Hardware comparison assignment (see below) |
| 13 | Model Serving | TTFT, ITL, KV-cache | LLM Serving — serving latency analysis |
For a Volume II course on distributed systems:
| Week | Lecture Slides | Textbook Topic | MLSYSIM Assignment |
|---|---|---|---|
| 3 | Compute Infrastructure | GPU clusters, interconnects | TCO analysis with EconomicsModel |
| 5 | Distributed Training | 3D parallelism, scaling | Distributed Training — parallelism strategies |
| 7 | Fault Tolerance | Checkpointing, MTBF | ReliabilityModel — Young-Daly checkpoint interval |
| 10 | Performance Engineering | Profiling, optimization | Multi-solver composition (see capstone ideas below) |
| 15 | Sustainable AI | Energy, carbon, water | Sustainability Lab — carbon footprint |
The Teaching Guide provides complete 16-week schedules for Volume I, Volume II, and a combined 32-week sequence — with timing estimates for every deck.
Pattern 2 — Standalone Labs
Use individual tutorials as self-contained lab assignments in any systems course. Each tutorial includes exercises with clear expected outputs:
| Tutorial | Duration | Key Concepts | Pairs With Slides |
|---|---|---|---|
| Hello World | 15 min | Roofline model, memory vs. compute bound | HW Acceleration |
| Sustainability Lab | 20 min | Energy, carbon footprint, regional grids | Sustainable AI |
| LLM Serving | 25 min | TTFT vs. ITL, KV-cache pressure | Model Serving |
| Distributed Training | 30 min | Data/tensor/pipeline parallelism | Distributed Training |
Pattern 3 — Capstone Projects
Advanced students compose multiple solvers to answer research-style questions. See Extending MLSYSIM for the custom solver API.
Assignment Ideas
Homework: Hardware Comparison (30 min)
Using `Engine.solve()`, compare ResNet-50 inference latency on the A100, H100, and Jetson AGX at batch sizes 1, 32, and 256. For each configuration, state whether the workload is memory-bound or compute-bound and explain why the bottleneck shifts with batch size.
Pairs with: HW Acceleration slides (roofline model, ridge point) and Benchmarking slides (measurement methodology).
Homework: Training Memory Budget (30 min)
Using the TrainingStateSolver, calculate the memory required to train GPT-2 (1.5B parameters) in FP16 with the Adam optimizer under ZeRO Stage 0, Stage 1, and Stage 3. Explain why each stage reduces memory and what trade-off it introduces.
Pairs with: Model Training slides and Distributed Training slides.
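When preparing an answer key, it can help to cross-check the solver output by hand. The sketch below does not call MLSYSIM; it follows the per-parameter accounting commonly quoted from the ZeRO paper for mixed-precision Adam (2 bytes of FP16 weights, 2 bytes of FP16 gradients, 12 bytes of FP32 optimizer state). The function name `zero_model_state_bytes` and the data-parallel degree of 8 are illustrative assumptions, and activations are excluded, so the TrainingStateSolver may report different totals.

```python
def zero_model_state_bytes(num_params: float, stage: int, dp_degree: int) -> float:
    """Per-GPU bytes for weights, gradients, and Adam state under a ZeRO stage.

    Hypothetical helper for hand-checking; not part of MLSYSIM.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")

    weights = 2.0 * num_params   # FP16 parameters
    grads = 2.0 * num_params     # FP16 gradients
    optim = 12.0 * num_params    # FP32 master weights + Adam momentum + variance

    if stage >= 1:               # Stage 1 partitions optimizer state
        optim /= dp_degree
    if stage >= 2:               # Stage 2 also partitions gradients
        grads /= dp_degree
    if stage >= 3:               # Stage 3 also partitions parameters
        weights /= dp_degree
    return weights + grads + optim


GPT2_PARAMS = 1.5e9
DP_DEGREE = 8  # assumption: the assignment text does not fix a GPU count

for stage in (0, 1, 3):
    gb = zero_model_state_bytes(GPT2_PARAMS, stage, DP_DEGREE) / 1e9
    print(f"ZeRO stage {stage}: {gb:.2f} GB of model state per GPU")
```

Because Stages 1 and 3 depend on the number of data-parallel ranks, the assignment instructions should state the GPU count students are expected to use.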
Lab: Carbon-Aware Training (45 min)
Using the SustainabilityModel, calculate the carbon footprint of training GPT-3 on a 256-GPU H100 cluster in Quebec vs. US Average vs. Poland. Produce a table and a 2-paragraph analysis of why datacenter location matters more than hardware choice for carbon.
Pairs with: Sustainable AI slides (grid carbon intensity, PUE).
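The location effect in this lab comes from one multiplication chain: IT energy, scaled by PUE, scaled by grid carbon intensity. The sketch below shows that chain without calling MLSYSIM. The function name, the training duration, the PUE, and all three grid intensities are placeholder assumptions chosen only to illustrate the arithmetic; they are not registry values and do not represent Quebec, the US average, or Poland.

```python
def training_carbon_kg(num_gpus: int, gpu_power_w: float, hours: float,
                       pue: float, grid_g_per_kwh: float) -> float:
    """Operational carbon in kg CO2e for a fixed-power training run.

    Hypothetical helper for illustration; not part of MLSYSIM.
    """
    it_energy_kwh = num_gpus * gpu_power_w * hours / 1000.0
    facility_energy_kwh = it_energy_kwh * pue          # cooling and power overhead
    return facility_energy_kwh * grid_g_per_kwh / 1000.0  # grams to kilograms


# Placeholder inputs (assumptions, not measured or registry data).
NUM_GPUS = 256
GPU_POWER_W = 700.0
HOURS = 1000.0
PUE = 1.2
PLACEHOLDER_GRIDS = {          # g CO2e per kWh, illustrative only
    "low-carbon grid": 20.0,
    "mixed grid": 400.0,
    "coal-heavy grid": 700.0,
}

for name, intensity in PLACEHOLDER_GRIDS.items():
    kg = training_carbon_kg(NUM_GPUS, GPU_POWER_W, HOURS, PUE, intensity)
    print(f"{name}: {kg / 1000.0:.1f} t CO2e")
```

Since hardware and duration are held constant, the ratio between any two locations equals the ratio of their grid intensities, which is the point students should surface in their analysis.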
Lab: LLM Serving Capacity Planning (45 min)
Using the ServingModel, determine the maximum sequence length at which Llama-3.1-70B can serve a single request on an 8-GPU H100 node without exceeding memory. Then calculate TTFT and ITL at sequence lengths of 1K, 4K, and 16K tokens. At what point does KV-cache pressure dominate?
Pairs with: Model Serving slides and Inference at Scale slides.
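A hand estimate of KV-cache size gives students a prediction to compare against the solver. The sketch below is a rough approximation that does not call MLSYSIM. It assumes the publicly documented Llama-3.1-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), FP16 weights and cache, and 80 GB per H100; it ignores activations, framework overhead, and the model's context-window limit, so the ServingModel result may differ. The helper names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_context_tokens(total_mem_bytes: float, weight_bytes: float,
                       per_token_bytes: int) -> int:
    """Tokens that fit in the memory left over after loading the weights."""
    free = total_mem_bytes - weight_bytes
    return max(0, int(free // per_token_bytes))


# Assumed configuration values (see the lead-in).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
WEIGHT_BYTES = 70e9 * 2          # 70B parameters in FP16
NODE_MEM_BYTES = 8 * 80e9        # assumption: 8 GPUs with 80 GB each

per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

for seq_len in (1024, 4096, 16384):
    print(f"{seq_len:>6} tokens: {seq_len * per_token / 1e9:.2f} GB of KV cache")

limit = max_context_tokens(NODE_MEM_BYTES, WEIGHT_BYTES, per_token)
print(f"Memory-only upper bound for one request: {limit:,} tokens")
```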
Exam Question: Back-of-Envelope
A GPU has 1,979 TFLOP/s peak compute (FP16) and 3.35 TB/s memory bandwidth. (a) What is the ridge point in FLOP/Byte? (b) A model layer has arithmetic intensity of 50 FLOP/Byte — is it compute-bound or memory-bound? (c) Another layer has arithmetic intensity of 400 FLOP/Byte — which regime is it in, and what does that imply about the benefit of moving to a GPU with 2x the bandwidth? Show your work.
Pairs with: HW Acceleration slides (roofline model, ridge point derivation).
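For the answer key, the roofline arithmetic behind this question can be sketched in a few lines. The code below uses only the numbers given in the question and the standard roofline definitions (ridge point = peak compute / memory bandwidth; attainable throughput = min(peak, intensity × bandwidth)). The function names are illustrative and are not MLSYSIM APIs.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/Byte) where the two roofline segments meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound on throughput for a given arithmetic intensity."""
    return min(peak_flops, intensity * bandwidth)


def regime(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Classify a workload relative to the ridge point."""
    if intensity >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


PEAK = 1979e12   # FLOP/s, from the question
BW = 3.35e12     # Byte/s, from the question

print(f"(a) ridge point: {ridge_point(PEAK, BW):.1f} FLOP/Byte")
print(f"(b) intensity 50:  {regime(50, PEAK, BW)}")
print(f"(c) intensity 400: {regime(400, PEAK, BW)}")

# Part (c) follow-up: same peak compute, twice the bandwidth.
speedup = attainable_flops(400, PEAK, 2 * BW) / attainable_flops(400, PEAK, BW)
print(f"    with 2x bandwidth: {regime(400, PEAK, 2 * BW)}, speedup {speedup:.2f}x")
```

With these numbers the ridge point is about 591 FLOP/Byte, so both layers are memory-bound. Doubling bandwidth halves the ridge point to about 295 FLOP/Byte, so the 400 FLOP/Byte layer becomes compute-bound and gains roughly 1.48x rather than the full 2x.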
Capstone: Multi-Solver Design Study (1 week)
Design a training cluster for a 70B-parameter model. Use the DistributedModel to select a parallelism strategy, the EconomicsModel for TCO over 6 months, the SustainabilityModel to compare three datacenter locations, and the ReliabilityModel to determine checkpoint frequency. Present your analysis as a 3-page technical memo with quantitative justification for each decision.
Pairs with: the full Volume II slide set — infrastructure, training, fault tolerance, and sustainability.
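For the checkpoint-frequency part of the memo, students can sanity-check their result against the first-order Young-Daly approximation, where the optimal interval is the square root of twice the checkpoint cost times the system MTBF. The sketch below is a standalone illustration, not the ReliabilityModel API; the helper names, the independent-failure assumption used to scale per-node MTBF, and the example inputs are all assumptions.

```python
import math


def system_mtbf_s(node_mtbf_s: float, num_nodes: int) -> float:
    """System MTBF assuming independent, identically distributed node failures."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_s / num_nodes


def young_daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Placeholder inputs (assumptions for illustration only).
NODE_MTBF_S = 5 * 365 * 24 * 3600.0   # five years per node
NUM_NODES = 64
CHECKPOINT_COST_S = 60.0

mtbf = system_mtbf_s(NODE_MTBF_S, NUM_NODES)
interval = young_daly_interval_s(CHECKPOINT_COST_S, mtbf)
print(f"System MTBF: {mtbf / 3600.0:.1f} h")
print(f"Checkpoint every {interval / 60.0:.1f} min")
```

The approximation is most accurate when the checkpoint cost is small relative to the MTBF, which is a useful caveat for students to state in their justification.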
Grading Notes
Because MLSYSIM produces deterministic output from vetted specifications:
- Answer keys are stable: the same `mlsysim` version produces identical numbers for every student, every semester
- Partial credit is straightforward: grade the reasoning (which solver, which inputs, which bottleneck explanation), not just the number
- “Predict first” questions are easy to assess: students submit their prediction before running code; compare the two for a conceptual understanding score
Pin the version in your assignment instructions (`pip install mlsysim==0.1.0`) so answer keys remain valid even after new releases update specifications.
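One way to enforce the pin is a guard cell at the top of each assignment notebook. The sketch below uses only the standard-library `importlib.metadata` module to read the installed distribution version; the function name `check_mlsysim_pin` and its messages are hypothetical, and the required version simply mirrors the pin shown above.

```python
from importlib import metadata

REQUIRED_VERSION = "0.1.0"  # mirrors the pin in the assignment instructions


def check_mlsysim_pin(required: str = REQUIRED_VERSION) -> tuple[bool, str]:
    """Return (ok, message) describing whether the pinned version is installed."""
    try:
        installed = metadata.version("mlsysim")
    except metadata.PackageNotFoundError:
        return False, f"mlsysim is not installed; run: pip install mlsysim=={required}"
    if installed != required:
        return False, (f"mlsysim {installed} found, but this assignment expects "
                       f"{required}; run: pip install mlsysim=={required}")
    return True, f"mlsysim {installed} matches the pinned version"


ok, message = check_mlsysim_pin()
print(message)
```

Students see a clear instruction instead of silently producing numbers that disagree with the answer key.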
Reproducibility Guarantee
All specifications in the MLSys Zoo are:
- Sourced from official manufacturer datasheets and published benchmarks
- Typed with `pint.Quantity` for dimensional correctness: unit errors are caught at runtime
- Frozen per release: `mlsysim==0.1.0` always produces the same answers
This means your answer key works for every student, every semester.
Jupyter & Quarto Compatibility
All tutorials run in:
- Jupyter Notebooks: standard `.ipynb` workflow
- Quarto documents: render to HTML, PDF, or slides with `quarto render`
- Google Colab: `pip install mlsysim` in the first cell, then go
No GPU runtime required. CPU-only environments work perfectly because MLSYSIM computes from equations, not empirical profiling.
Getting Started
- Point students to the Getting Started guide for installation
- Assign the Hello World tutorial as a warmup
- Browse the Solver Guide to select solvers for your course topics
- Pair each assignment with the relevant lecture slides for classroom context
- Use the MLSys Zoo for available hardware, model, and infrastructure specifications