Getting Started
Install MLSYSIM and run your first analysis in under 5 minutes.
MLSYSIM assumes basic Python familiarity (variables, functions, pip install). No prior ML or hardware knowledge is required. Key concepts like roofline analysis, memory-bound vs. compute-bound, and FLOP/s are explained in context throughout the tutorials. For a full reference of terms, see the Glossary.
Installation
MLSYSIM requires Python 3.9+ and installs cleanly with pip:

```bash
pip install mlsysim
```

For development or to follow along with tutorials locally:

```bash
git clone https://github.com/harvard-edge/cs249r_book
cd cs249r_book/mlsysim
pip install -e ".[dev]"
```

Verify the installation:

```bash
python -c "import mlsysim; print(mlsysim.__version__)"
```

All tutorials in this documentation can also be run on Google Colab or Binder without any local installation. Look for the launch buttons at the top of each tutorial.
Your First Analysis
Once installed, you can run a complete roofline analysis in five lines. The roofline model is the foundation of ML systems performance reasoning – it determines whether your workload is limited by compute (arithmetic units) or memory (data movement). For a visual walkthrough, see the Hardware Acceleration slide deck (Vol I, Ch 11).
```python
import mlsysim
from mlsysim import Engine

# 1. Load a model and hardware from the vetted Zoo
model = mlsysim.Models.ResNet50
hardware = mlsysim.Hardware.Cloud.A100

# 2. Solve -- the Engine applies the roofline model
profile = Engine.solve(model=model, hardware=hardware, batch_size=1, precision="fp16")

# 3. Read the results
print(f"Bottleneck: {profile.bottleneck}")    # → 'Memory'
print(f"Latency: {profile.latency}")          # → 0.34 ms
print(f"Throughput: {profile.throughput}")    # → 2941 samples/sec
```

MLSYSIM uses the Pint library for physical units. All quantities carry attached units (ms, GB, TFLOP/s, etc.). Use `.to('ms')` to convert between units. Use `.magnitude` to extract the raw number when you need it for calculations or plotting.
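The max() logic the Engine applies can be sketched in a few lines of plain Python. This is an illustrative sketch, not MLSYSIM's implementation; the FLOP, byte-traffic, and hardware numbers are rough approximations rather than the Zoo's datasheet values:

```python
# Roofline sketch: latency is set by the slower of compute and memory.
# All numbers are illustrative approximations, not MLSYSIM's datasheet values.
flops = 8.2e9            # ResNet-50 inference, roughly 8.2 GFLOPs per image
bytes_moved = 0.2e9      # rough fp16 weights + activations traffic

peak_flops = 312e12      # A100 fp16 Tensor Core peak, FLOP/s
bandwidth = 2.0e12       # A100 HBM bandwidth, bytes/s

latency_compute = flops / peak_flops       # if compute were the only limit
latency_memory = bytes_moved / bandwidth   # if bandwidth were the only limit
latency = max(latency_compute, latency_memory)

bottleneck = "Memory" if latency_memory > latency_compute else "Compute"
print(bottleneck, f"{latency * 1e3:.3f} ms")
```

Even with these rough numbers the data-movement term dominates, which is why the A100 comes out memory-bound at batch size 1.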
Understanding the Output
Engine.solve() returns a PerformanceProfile – a structured result containing everything the roofline model can tell you about your workload.
Core fields
| Field | What it means |
|---|---|
| `bottleneck` | `'Memory'` or `'Compute'` – which resource limits performance |
| `latency` | Time to process one batch, derived from the roofline ceiling |
| `throughput` | Samples per second = `batch_size / latency` |
| `latency_compute` | Time if only compute were the constraint |
| `latency_memory` | Time if only memory bandwidth were the constraint |
| `arithmetic_intensity` | Operations per byte – the x-axis of the roofline plot |
Extended fields
| Field | What it means |
|---|---|
| `energy` | Estimated energy consumption (Joules) |
| `memory_footprint` | Total memory required for the workload |
| `mfu` | Model FLOPs Utilization – fraction of peak compute achieved |
| `feasible` | Whether the workload fits in device memory |
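Of these, `mfu` is worth unpacking: it is the FLOP rate you actually achieve divided by the hardware's theoretical peak. A minimal sketch with illustrative numbers (not values computed by MLSYSIM):

```python
# MFU = achieved FLOP/s divided by theoretical peak FLOP/s.
# Illustrative numbers for a batch of 32 images; not MLSYSIM output.
flops_per_batch = 32 * 8.2e9   # total workload FLOPs for the batch
latency_s = 1.68e-3            # estimated batch latency, seconds
peak_flops = 312e12            # fp16 peak, FLOP/s

mfu = (flops_per_batch / latency_s) / peak_flops
print(f"MFU: {mfu:.2f}")   # → 0.50
```

An MFU of 0.5 means half the chip's arithmetic capacity was put to work – already at the high end of what well-tuned workloads achieve.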
If latency_memory > latency_compute, you are memory-bound: faster arithmetic units will not help. You need to increase batch size, use a more compute-dense operation (e.g., fused attention), or reduce data movement. If you are compute-bound, that is when parallelism and quantization pay off.
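This decision rule can be checked numerically. The crossover sits at the roofline's ridge point – peak FLOP/s divided by memory bandwidth. A minimal sketch with illustrative numbers (the batch-32 byte count is a hypothetical figure, not a MLSYSIM result):

```python
# Ridge point: the arithmetic intensity above which a workload stops
# being memory-bound. Illustrative A100 fp16 numbers, not datasheet values.
peak_flops = 312e12   # FLOP/s
bandwidth = 2.0e12    # bytes/s
ridge = peak_flops / bandwidth   # ≈ 156 FLOPs per byte

def is_memory_bound(flops, bytes_moved):
    # Memory-bound iff arithmetic intensity falls below the ridge point.
    return flops / bytes_moved < ridge

# Batch size 1: little weight reuse, low intensity -> memory-bound.
print(is_memory_bound(flops=8.2e9, bytes_moved=0.2e9))       # intensity ≈ 41
# A larger batch amortizes weight traffic, pushing intensity past the ridge.
print(is_memory_bound(flops=32 * 8.2e9, bytes_moved=1.2e9))  # intensity ≈ 219
```

Increasing batch size raises FLOPs proportionally while weight traffic is paid once, which is exactly why it is the first lever for memory-bound workloads.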
This is the same insight taught in the Neural Network Computation slides (Vol I, Ch 5) and the Performance Engineering slides (Vol II, Ch 10).
Exploring the Zoo
MLSYSIM ships with vetted registries of hardware, models, infrastructure, and systems – all sourced from real datasheets. Use tab-completion to explore.
Hardware
Five tiers spanning the full deployment spectrum:
```python
# Cloud accelerators
mlsysim.Hardware.Cloud.A100
mlsysim.Hardware.Cloud.H100
mlsysim.Hardware.Cloud.H200

# Workstation / desktop GPUs
mlsysim.Hardware.Workstation.DGXSpark

# Mobile processors
mlsysim.Hardware.Mobile.iPhone
mlsysim.Hardware.Mobile.Snapdragon

# Edge devices
mlsysim.Hardware.Edge.Jetson

# Tiny / microcontroller targets
mlsysim.Hardware.Tiny.ESP32
mlsysim.Hardware.Tiny.Himax
```

For the theory behind this hardware spectrum, see the Compute Infrastructure slides (Vol II, Ch 2).
Models
Organized by application domain:
```python
# Language models
mlsysim.Models.Language.GPT2
mlsysim.Models.Language.Llama3_8B
mlsysim.Models.Language.Llama3_70B

# Vision models
mlsysim.Models.Vision.ResNet50
mlsysim.Models.Vision.MobileNetV2
mlsysim.Models.Vision.AlexNet

# Tiny / edge models
mlsysim.Models.Tiny.DS_CNN
mlsysim.Models.Tiny.WakeVision
```

Infrastructure
Regional grids and datacenter configurations for sustainability analysis:
```python
# Regional power grids -- carbon intensity varies by energy source
mlsysim.Infra.Grids.Quebec   # hydro: ~20 gCO2/kWh
mlsysim.Infra.Grids.US_Avg   # mixed: ~390 gCO2/kWh
mlsysim.Infra.Grids.Poland   # coal: ~820 gCO2/kWh
```

The Sustainable AI slides (Vol II, Ch 15) explain why datacenter location is a first-class engineering decision.
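The arithmetic behind that decision is direct: emissions are energy used times the grid's carbon intensity. A minimal plain-Python sketch using the intensities quoted above (the 500 MJ job is a hypothetical workload; MLSYSIM's own solvers carry attached units):

```python
# Carbon footprint = energy consumed x grid carbon intensity.
# Grid intensities from the registry comments above; job energy is hypothetical.
J_PER_KWH = 3.6e6  # joules per kilowatt-hour

def carbon_grams(energy_joules, grid_g_per_kwh):
    return energy_joules / J_PER_KWH * grid_g_per_kwh

energy = 500e6  # a hypothetical 500 MJ training job
print(carbon_grams(energy, 20))    # Quebec (hydro)  ≈ 2778 gCO2
print(carbon_grams(energy, 820))   # Poland (coal)   ≈ 113889 gCO2
```

The same job emits roughly 40x more carbon on a coal-heavy grid, which is the entire argument for treating location as an engineering parameter.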
Systems
Cluster definitions for distributed analysis:
```python
# Network fabrics
mlsysim.Systems.Fabrics.InfiniBand_NDR
mlsysim.Systems.Fabrics.Ethernet_100G

# Pre-configured clusters
mlsysim.Systems.Clusters.Frontier_8K
mlsysim.Systems.Clusters.Research_256
```

For the full topology and cluster modeling, see the Distributed Training slides (Vol II, Ch 5) and Network Fabrics slides (Vol II, Ch 3).
Complete registry listings are available in the Zoo reference pages.
Adjusting the Efficiency Parameter
The efficiency parameter (η) is the single most important tuning knob in MLSYSIM. It represents the fraction of theoretical peak hardware performance that is actually achieved in practice. Most GPUs run at 2–5% of peak without optimization; well-tuned workloads reach 35–55%.
```python
# Default: well-optimized training (η = 0.5)
profile_default = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.5
)

# Conservative: typical inference workload (η = 0.35)
profile_inference = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.35
)

print(f"Training estimate: {profile_default.latency}")
print(f"Inference estimate: {profile_inference.latency}")
```

Typical efficiency ranges:
| Scenario | η range | Notes |
|---|---|---|
| Well-optimized training (fp16) | 0.35–0.55 | Megatron-LM, DeepSpeed |
| Inference (fp16) | 0.25–0.45 | vLLM, TensorRT-LLM |
| Inference (int8) | 0.20–0.40 | Quantized serving |
See the Accuracy & Validation page for guidance on choosing η for different scenarios. The gap between theoretical peak and achieved throughput is covered in detail in the Performance Engineering slides (Vol II, Ch 10).
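Because η simply rescales the compute ceiling, its effect on a compute-bound estimate is a straight division. A minimal sketch with illustrative workload and peak numbers (not MLSYSIM output):

```python
# Effective compute ceiling = η x theoretical peak, so compute-bound
# latency scales as 1/η. Illustrative numbers, not datasheet values.
flops = 32 * 8.2e9       # a hypothetical batch-32 workload, FLOPs
peak_flops = 312e12      # fp16 peak, FLOP/s

for eta in (0.05, 0.35, 0.5):
    latency_ms = flops / (eta * peak_flops) * 1e3
    print(f"η={eta}: {latency_ms:.2f} ms")
```

Halving η doubles the compute-bound latency estimate, so an honest choice of η matters more than any other single input.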
Defining Custom Models
You are not limited to the Zoo. Define any model by specifying its parameters and FLOPs:
```python
from mlsysim import TransformerWorkload
from mlsysim import ureg

my_model = TransformerWorkload(
    name="My-Custom-LLM",
    architecture="Transformer",
    parameters=13e9 * ureg.param,
    layers=40,
    hidden_dim=5120,
    heads=40,
    kv_heads=8,
    inference_flops=2 * 13e9 * ureg.flop  # Rule of thumb: ~2 FLOPs per parameter
)

profile = Engine.solve(model=my_model, hardware=hardware, batch_size=1)
print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency}")
print(f"Feasible: {profile.feasible}")  # Does the model fit in device memory?
```

The Model Compression slides (Vol I, Ch 10) explain why parameter count and precision together determine both the memory footprint and the arithmetic intensity of a workload.
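The feasibility check itself reduces to simple arithmetic: weight memory is parameter count times bytes per element. A hedged sketch using the Zoo's 70B-class model size and an 80 GB device (illustrative; activations and KV-cache are ignored here, so this is a lower bound):

```python
# Weight memory = parameter count x bytes per element.
# Activations and KV-cache are ignored, so these are lower bounds.
params = 70e9  # a 70B-parameter model, like Llama3_70B in the Zoo
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

device_memory_gb = 80  # e.g. an 80 GB accelerator

for precision, nbytes in bytes_per_param.items():
    footprint_gb = params * nbytes / 1e9
    fits = footprint_gb <= device_memory_gb
    print(f"{precision}: {footprint_gb:.0f} GB, fits: {fits}")
```

At fp16 the weights alone (140 GB) overflow a single 80 GB device, while int8 quantization (70 GB) just squeezes in – precisely the precision/footprint coupling the compression slides cover.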
Companion Slide Decks
MLSYSIM is the hands-on companion to the Machine Learning Systems textbook. The concepts you model with MLSYSIM are taught visually in 35 Beamer slide decks (1,099 slides total) with speaker notes and active learning exercises.
| Concept in MLSYSIM | Slide Deck | Key Topics |
|---|---|---|
| `Engine.solve()` and the roofline model | Hardware Acceleration (Vol I, Ch 11) | Roofline model, arithmetic intensity, systolic arrays, memory wall |
| FLOPs, MACs, and compute cost | Neural Network Computation (Vol I, Ch 5) | Forward/backward pass cost, training memory breakdown |
| Training memory and mixed precision | Model Training (Vol I, Ch 8) | Iron Law of Training, gradient checkpointing, mixed precision |
| Quantization and compression | Model Compression (Vol I, Ch 10) | Pruning, quantization, knowledge distillation |
| Hardware Zoo tiers | Compute Infrastructure (Vol II, Ch 2) | Accelerator spectrum, HBM architecture, TCO |
| DistributedModel | Distributed Training (Vol II, Ch 5) | 3D parallelism, scaling efficiency, communication overhead |
| ServingModel and LLM inference | Model Serving (Vol I, Ch 13) | TTFT, ITL, KV-cache, batching strategies |
| SustainabilityModel | Sustainable AI (Vol II, Ch 15) | Energy wall, carbon geography, PUE |
| Efficiency parameter (η) | Performance Engineering (Vol II, Ch 10) | Operator fusion, FlashAttention, precision engineering |
| Benchmarking and validation | Benchmarking (Vol I, Ch 12) | MLPerf, measurement methodology, latency percentiles |
Volume I: Foundations – 17 decks, 570 slides
Volume II: At Scale – 18 decks, 529 slides
Next Steps
Follow the structured learning path on the Tutorials page, starting with the Hello World Tutorial. Each tutorial pairs with a companion slide deck for visual explanations and active learning exercises.
For a complete reference of which solver to use for different questions, see the Solver Guide.