Getting Started
Install MLSYSIM and run your first analysis in under 5 minutes.
MLSYSIM assumes basic Python familiarity (variables, functions, pip install). No prior ML or hardware knowledge is required. Key concepts like roofline analysis, memory-bound vs. compute-bound, and FLOP/s are explained in context throughout the tutorials. For a full reference of terms, see the Glossary.
Installation
MLSYSIM requires Python 3.9+ and installs cleanly with pip:

```bash
pip install mlsysim
```

For development or to follow along with tutorials locally:

```bash
git clone https://github.com/harvard-edge/cs249r_book
cd cs249r_book/mlsysim
pip install -e ".[dev]"
```

Verify the installation:

```bash
python -c "import mlsysim; print(mlsysim.__version__)"
```

All tutorials in this documentation can also be run on Google Colab or Binder without any local installation. Look for the launch buttons at the top of each tutorial.
Your First Analysis
Once installed, you can run a complete roofline analysis in five lines. The roofline model is the foundation of ML systems performance reasoning – it determines whether your workload is limited by compute (arithmetic units) or memory (data movement). For a visual walkthrough, see the Hardware Acceleration slide deck (Vol I, Ch 11).
```python
import mlsysim
from mlsysim import Engine

# 1. Load a model and hardware from the vetted Zoo
model = mlsysim.Models.ResNet50
hardware = mlsysim.Hardware.Cloud.A100

# 2. Solve -- the Engine applies the roofline model
profile = Engine.solve(model=model, hardware=hardware, batch_size=1, precision="fp16")

# 3. Read the results
print(f"Bottleneck: {profile.bottleneck}")    # → 'Memory'
print(f"Latency: {profile.latency}")          # → 0.34 ms
print(f"Throughput: {profile.throughput}")    # → 2941 samples/sec
```

MLSYSIM uses the Pint library for physical units. All quantities carry attached units (ms, GB, TFLOP/s, etc.). Use `.to('ms')` to convert between units. Use `.magnitude` to extract the raw number when you need it for calculations or plotting.
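The max() logic the Engine applies can be sketched in a few lines of plain Python. This is an illustrative sketch, not MLSYSIM's implementation; the FLOP, byte-traffic, and hardware numbers are rough approximations rather than the Zoo's datasheet values:

```python
# Roofline sketch: latency is set by the slower of compute and memory.
# All numbers are illustrative approximations, not MLSYSIM's datasheet values.
flops = 8.2e9            # ResNet-50 inference, roughly 8.2 GFLOPs per image
bytes_moved = 0.2e9      # rough fp16 weights + activations traffic

peak_flops = 312e12      # A100 fp16 Tensor Core peak, FLOP/s
bandwidth = 2.0e12       # A100 HBM bandwidth, bytes/s

latency_compute = flops / peak_flops       # if compute were the only limit
latency_memory = bytes_moved / bandwidth   # if bandwidth were the only limit
latency = max(latency_compute, latency_memory)

bottleneck = "Memory" if latency_memory > latency_compute else "Compute"
print(bottleneck, f"{latency * 1e3:.3f} ms")
```

Even with these rough numbers the data-movement term dominates, which is why the A100 comes out memory-bound at batch size 1.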
Understanding the Output
Engine.solve() returns a PerformanceProfile – a structured result containing everything the roofline model can tell you about your workload.
Core fields
| Field | What it means |
|---|---|
| `bottleneck` | `'Memory'` or `'Compute'` – which resource limits performance |
| `latency` | Time to process one batch, derived from the roofline ceiling |
| `throughput` | Samples per second = `batch_size / latency` |
| `latency_compute` | Time if only compute were the constraint |
| `latency_memory` | Time if only memory bandwidth were the constraint |
| `arithmetic_intensity` | Operations per byte – the x-axis of the roofline plot |
Extended fields
| Field | What it means |
|---|---|
| `energy` | Estimated energy consumption (Joules) |
| `memory_footprint` | Total memory required for the workload |
| `mfu` | Model FLOPs Utilization – fraction of peak compute achieved |
| `feasible` | Whether the workload fits in device memory |
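Of these, `mfu` is worth unpacking: it is the FLOP rate you actually achieve divided by the hardware's theoretical peak. A minimal sketch with illustrative numbers (not values computed by MLSYSIM):

```python
# MFU = achieved FLOP/s divided by theoretical peak FLOP/s.
# Illustrative numbers for a batch of 32 images; not MLSYSIM output.
flops_per_batch = 32 * 8.2e9   # total workload FLOPs for the batch
latency_s = 1.68e-3            # estimated batch latency, seconds
peak_flops = 312e12            # fp16 peak, FLOP/s

mfu = (flops_per_batch / latency_s) / peak_flops
print(f"MFU: {mfu:.2f}")   # → 0.50
```

An MFU of 0.5 means half the chip's arithmetic capacity was put to work – already at the high end of what well-tuned workloads achieve.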
If latency_memory > latency_compute, you are memory-bound: faster arithmetic units will not help. You need to increase batch size, use a more compute-dense operation (e.g., fused attention), or reduce data movement. If you are compute-bound, that is when parallelism and quantization pay off.
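This decision rule can be checked numerically. The crossover sits at the roofline's ridge point – peak FLOP/s divided by memory bandwidth. A minimal sketch with illustrative numbers (the batch-32 byte count is a hypothetical figure, not a MLSYSIM result):

```python
# Ridge point: the arithmetic intensity above which a workload stops
# being memory-bound. Illustrative A100 fp16 numbers, not datasheet values.
peak_flops = 312e12   # FLOP/s
bandwidth = 2.0e12    # bytes/s
ridge = peak_flops / bandwidth   # ≈ 156 FLOPs per byte

def is_memory_bound(flops, bytes_moved):
    # Memory-bound iff arithmetic intensity falls below the ridge point.
    return flops / bytes_moved < ridge

# Batch size 1: little weight reuse, low intensity -> memory-bound.
print(is_memory_bound(flops=8.2e9, bytes_moved=0.2e9))       # intensity ≈ 41
# A larger batch amortizes weight traffic, pushing intensity past the ridge.
print(is_memory_bound(flops=32 * 8.2e9, bytes_moved=1.2e9))  # intensity ≈ 219
```

Increasing batch size raises FLOPs proportionally while weight traffic is paid once, which is exactly why it is the first lever for memory-bound workloads.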
This is the same insight taught in the Neural Network Computation slides (Vol I, Ch 5) and the Performance Engineering slides (Vol II, Ch 10).
Exploring the Zoo
MLSYSIM ships with vetted registries of hardware, models, infrastructure, and systems – all sourced from real datasheets. Use tab-completion to explore.
Hardware
Five tiers spanning the full deployment spectrum:
```python
# Cloud accelerators
mlsysim.Hardware.Cloud.A100
mlsysim.Hardware.Cloud.H100
mlsysim.Hardware.Cloud.H200

# Workstation / desktop GPUs
mlsysim.Hardware.Workstation.DGXSpark

# Mobile processors
mlsysim.Hardware.Mobile.iPhone
mlsysim.Hardware.Mobile.Snapdragon

# Edge devices
mlsysim.Hardware.Edge.Jetson

# Tiny / microcontroller targets
mlsysim.Hardware.Tiny.ESP32
mlsysim.Hardware.Tiny.Himax
```

For the theory behind this hardware spectrum, see the Compute Infrastructure slides (Vol II, Ch 2).
Models
Organized by application domain:
```python
# Language models
mlsysim.Models.Language.GPT2
mlsysim.Models.Language.Llama3_8B
mlsysim.Models.Language.Llama3_70B

# Vision models
mlsysim.Models.Vision.ResNet50
mlsysim.Models.Vision.MobileNetV2
mlsysim.Models.Vision.AlexNet

# Tiny / edge models
mlsysim.Models.Tiny.DS_CNN
mlsysim.Models.Tiny.WakeVision
```

Infrastructure
Regional grids and datacenter configurations for sustainability analysis:
```python
# Regional power grids -- carbon intensity varies by energy source
mlsysim.Infra.Grids.Quebec   # hydro: ~20 gCO2/kWh
mlsysim.Infra.Grids.US_Avg   # mixed: ~390 gCO2/kWh
mlsysim.Infra.Grids.Poland   # coal: ~820 gCO2/kWh
```

The Sustainable AI slides (Vol II, Ch 15) explain why datacenter location is a first-class engineering decision.
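The arithmetic behind that decision is direct: emissions are energy used times the grid's carbon intensity. A minimal plain-Python sketch using the intensities quoted above (the 500 MJ job is a hypothetical workload; MLSYSIM's own solvers carry attached units):

```python
# Carbon footprint = energy consumed x grid carbon intensity.
# Grid intensities from the registry comments above; job energy is hypothetical.
J_PER_KWH = 3.6e6  # joules per kilowatt-hour

def carbon_grams(energy_joules, grid_g_per_kwh):
    return energy_joules / J_PER_KWH * grid_g_per_kwh

energy = 500e6  # a hypothetical 500 MJ training job
print(carbon_grams(energy, 20))    # Quebec (hydro)  ≈ 2778 gCO2
print(carbon_grams(energy, 820))   # Poland (coal)   ≈ 113889 gCO2
```

The same job emits roughly 40x more carbon on a coal-heavy grid, which is the entire argument for treating location as an engineering parameter.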
Systems
Cluster definitions for distributed analysis:
```python
# Network fabrics
mlsysim.Systems.Fabrics.InfiniBand_NDR
mlsysim.Systems.Fabrics.Ethernet_100G

# Pre-configured clusters
mlsysim.Systems.Clusters.Frontier_8K
mlsysim.Systems.Clusters.Research_256
```

For the full topology and cluster modeling, see the Distributed Training slides (Vol II, Ch 5) and Network Fabrics slides (Vol II, Ch 3).
Complete registry listings are available in the Zoo reference pages.
Adjusting the Efficiency Parameter
The efficiency parameter (η) is the single most important tuning knob in MLSYSIM. It represents the fraction of theoretical peak hardware performance that is actually achieved in practice. Most GPUs run at 2–5% of peak without optimization; well-tuned workloads reach 35–55%.
```python
# Default: well-optimized training (η = 0.5)
profile_default = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.5
)

# Conservative: typical inference workload (η = 0.35)
profile_inference = Engine.solve(
    model=model, hardware=hardware,
    batch_size=32, precision="fp16", efficiency=0.35
)

print(f"Training estimate: {profile_default.latency}")
print(f"Inference estimate: {profile_inference.latency}")
```

Typical efficiency ranges:
| Scenario | η range | Notes |
|---|---|---|
| Well-optimized training (fp16) | 0.35–0.55 | Megatron-LM, DeepSpeed |
| Inference (fp16) | 0.25–0.45 | vLLM, TensorRT-LLM |
| Inference (int8) | 0.20–0.40 | Quantized serving |
See the Accuracy & Validation page for guidance on choosing η for different scenarios. The gap between theoretical peak and achieved throughput is covered in detail in the Performance Engineering slides (Vol II, Ch 10).
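Because η simply rescales the compute ceiling, its effect on a compute-bound estimate is a straight division. A minimal sketch with illustrative workload and peak numbers (not MLSYSIM output):

```python
# Effective compute ceiling = η x theoretical peak, so compute-bound
# latency scales as 1/η. Illustrative numbers, not datasheet values.
flops = 32 * 8.2e9       # a hypothetical batch-32 workload, FLOPs
peak_flops = 312e12      # fp16 peak, FLOP/s

for eta in (0.05, 0.35, 0.5):
    latency_ms = flops / (eta * peak_flops) * 1e3
    print(f"η={eta}: {latency_ms:.2f} ms")
```

Halving η doubles the compute-bound latency estimate, so an honest choice of η matters more than any other single input.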
Defining Custom Models
You are not limited to the Zoo. Define any model by specifying its parameters and FLOPs:
```python
from mlsysim import TransformerWorkload
from mlsysim import ureg

my_model = TransformerWorkload(
    name="My-Custom-LLM",
    architecture="Transformer",
    parameters=13e9 * ureg.param,
    layers=40,
    hidden_dim=5120,
    heads=40,
    kv_heads=8,
    inference_flops=2 * 13e9 * ureg.flop  # Rule of thumb: ~2 FLOPs per parameter
)

profile = Engine.solve(model=my_model, hardware=hardware, batch_size=1)
print(f"Bottleneck: {profile.bottleneck}")
print(f"Latency: {profile.latency}")
print(f"Feasible: {profile.feasible}")  # Does the model fit in device memory?
```

The Model Compression slides (Vol I, Ch 10) explain why parameter count and precision together determine both the memory footprint and the arithmetic intensity of a workload.
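The feasibility check itself reduces to simple arithmetic: weight memory is parameter count times bytes per element. A hedged sketch using the Zoo's 70B-class model size and an 80 GB device (illustrative; activations and KV-cache are ignored here, so this is a lower bound):

```python
# Weight memory = parameter count x bytes per element.
# Activations and KV-cache are ignored, so these are lower bounds.
params = 70e9  # a 70B-parameter model, like Llama3_70B in the Zoo
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

device_memory_gb = 80  # e.g. an 80 GB accelerator

for precision, nbytes in bytes_per_param.items():
    footprint_gb = params * nbytes / 1e9
    fits = footprint_gb <= device_memory_gb
    print(f"{precision}: {footprint_gb:.0f} GB, fits: {fits}")
```

At fp16 the weights alone (140 GB) overflow a single 80 GB device, while int8 quantization (70 GB) just squeezes in – precisely the precision/footprint coupling the compression slides cover.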
Companion Slide Decks
MLSYSIM is the hands-on companion to the Machine Learning Systems textbook. The concepts you model with MLSYSIM are taught visually in 35 Beamer slide decks (1,099 slides total) with speaker notes and active learning exercises.
| Concept in MLSYSIM | Slide Deck | Key Topics |
|---|---|---|
| `Engine.solve()` and the roofline model | Hardware Acceleration (Vol I, Ch 11) | Roofline model, arithmetic intensity, systolic arrays, memory wall |
| FLOPs, MACs, and compute cost | Neural Network Computation (Vol I, Ch 5) | Forward/backward pass cost, training memory breakdown |
| Training memory and mixed precision | Model Training (Vol I, Ch 8) | Iron Law of Training, gradient checkpointing, mixed precision |
| Quantization and compression | Model Compression (Vol I, Ch 10) | Pruning, quantization, knowledge distillation |
| Hardware Zoo tiers | Compute Infrastructure (Vol II, Ch 2) | Accelerator spectrum, HBM architecture, TCO |
| DistributedModel | Distributed Training (Vol II, Ch 5) | 3D parallelism, scaling efficiency, communication overhead |
| ServingModel and LLM inference | Model Serving (Vol I, Ch 13) | TTFT, ITL, KV-cache, batching strategies |
| SustainabilityModel | Sustainable AI (Vol II, Ch 15) | Energy wall, carbon geography, PUE |
| Efficiency parameter (η) | Performance Engineering (Vol II, Ch 10) | Operator fusion, FlashAttention, precision engineering |
| Benchmarking and validation | Benchmarking (Vol I, Ch 12) | MLPerf, measurement methodology, latency percentiles |
Volume I: Foundations – 17 decks, 570 slides
Volume II: At Scale – 18 decks, 529 slides
Next Steps
Follow the structured learning path on the Tutorials page, starting with the Hello World Tutorial. Each tutorial pairs with a companion slide deck for visual explanations and active learning exercises.
For a complete reference of which solver to use for different questions, see the Solver Guide.