
🚧 DEVELOPMENT PREVIEW - Built from dev@a46ce0b1 • 2026-05-24 09:44 EDT • Stable version →


MLSys·im

Open Source · Companion to mlsysbook.ai


Analytical modeling for ML system performance, cost, and carbon.
From first principles.

MLSys·im is a first-principles analytical modeling framework for ML systems, designed for education and early design-space reasoning before empirical benchmarking.

pip install mlsysim


Analytical framework for design-space reasoning · Dimensionally checked units · Vetted model, hardware, fleet, and infrastructure registries
Roofline Analysis
[Figure: roofline plot of FLOP/s against arithmetic intensity (FLOP/Byte), with memory-bound and compute-bound regions meeting at the ridge point.]
Identify whether your workload is memory-bound or compute-bound on any hardware.
LLM Serving
Llama-3.1-8B on H100: pre-fill 4.2 ms TTFT (compute-bound) → decode 0.8 ms ITL (memory-bound). KV-cache: 2.1 GB / 80 GB available.
Model the two phases of autoregressive inference and KV-cache memory pressure.
Distributed Training
256× H100, GPT-3 175B: data parallel 32×, tensor parallel 4×, pipeline parallel 2×. Scaling efficiency 74%, pipeline bubble 6.3%.
3D parallelism decomposition: data, tensor, and pipeline parallel scaling on GPU clusters.
Sustainability Analysis
Grid carbon intensity: Quebec 20 g CO₂/kWh · Norway 10 g CO₂/kWh · US Avg 390 g CO₂/kWh · Poland 820 g CO₂/kWh
Same workload, different region. Among the regions shown, Poland's grid is 41× more carbon-intensive than Quebec's and 82× more than Norway's.
Hardware Comparison
Peak compute: H100 990 TFLOP/s · A100 312 TFLOP/s · Jetson 25 TFLOP/s · ESP32 0.5 GFLOP/s
19 devices from cloud GPUs to microcontrollers, all with vetted datasheet specs.
Total Cost of Ownership
64× H100 cluster, 3-year TCO: CapEx $2.0M + energy $1.2M + maintenance $0.5M = $3.7M total.
Break down hardware, energy, and maintenance costs over any time horizon.
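The TCO figures above are a plain sum, and dividing that sum by the GPU-hours actually used gives a per-GPU-hour cost. The following is a minimal sketch using the figures quoted on this page; the helper names and the `utilization` parameter are illustrative assumptions, not part of the MLSys·im API.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def total_tco(capex: float, energy: float, maintenance: float) -> float:
    """Sum the three cost components (all in dollars)."""
    return capex + energy + maintenance


def tco_per_gpu_hour(tco: float, num_gpus: int, years: float,
                     utilization: float = 1.0) -> float:
    """Spread the total cost over the GPU-hours that are actually used.

    Hypothetical helper: `utilization` is the fraction of wall-clock
    time the cluster does useful work.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    gpu_hours = num_gpus * years * HOURS_PER_YEAR * utilization
    return tco / gpu_hours


# Figures from the 64x H100, 3-year example on this page.
tco = total_tco(capex=2_000_000, energy=1_200_000, maintenance=500_000)
full = tco_per_gpu_hour(tco, num_gpus=64, years=3)
half = tco_per_gpu_hour(tco, num_gpus=64, years=3, utilization=0.5)

print(f"Total TCO: ${tco / 1e6:.1f}M")
print(f"Per GPU-hour at 100% utilization: ${full:.2f}")
print(f"Per GPU-hour at 50% utilization:  ${half:.2f}")
```

Halving utilization doubles the effective cost per GPU-hour, which is why utilization is usually the first assumption worth checking in a cost model like this.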

Start with the system question

MLSys·im is meant for the stage before you benchmark or provision hardware. Use it to make first-order constraints explicit, then validate the sensitive parameters on the real stack when hardware is available.

Will it fit? Estimate weights, activations, optimizer state, KV cache, and communication buffers before a job fails at runtime.

What binds? Separate compute, memory bandwidth, network communication, data input, reliability, cost, and carbon constraints.

How much capacity? Size serving replicas for a QPS and p99 target, including batching and queueing assumptions.

What should I tune? Sweep batch size, precision, parallelism, efficiency, geography, and hardware choices without needing cluster access.
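The "Will it fit?" question can be approximated before any job is launched. Below is a rough sketch for mixed-precision Adam training. It assumes the common accounting of 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance) per parameter, sharded evenly across devices. It ignores activations, communication buffers, and fragmentation, and the function names are illustrative rather than MLSys·im APIs.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # fp16 weights + fp16 grads + fp32 Adam state


def training_state_bytes(num_params: int) -> int:
    """Bytes of weights, gradients, and optimizer state (no activations)."""
    return num_params * BYTES_PER_PARAM


def fits(num_params: int, num_devices: int, device_mem_bytes: float):
    """Return (fits?, bytes per device), assuming perfectly even sharding."""
    per_device = training_state_bytes(num_params) / num_devices
    return per_device <= device_mem_bytes, per_device


GPT3_PARAMS = 175_000_000_000   # GPT-3 175B
H100_MEM = 80e9                 # 80 GB per device

for n in (8, 256):
    ok, per_device = fits(GPT3_PARAMS, n, H100_MEM)
    print(f"{n:>3} devices: {per_device / 1e9:8.1f} GB each -> "
          f"{'fits' if ok else 'does not fit'}")
```

Because activations are left out, a "fits" result here is a necessary condition, not a sufficient one.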

Try it in 5 lines

import mlsysim
from mlsysim import Engine

profile = Engine.solve(
    model    = mlsysim.Models.ResNet50,
    hardware = mlsysim.Hardware.Cloud.A100,
    batch_size = 1,
    precision  = "fp16"
)

print(f"Bottleneck: {profile.bottleneck}")              # → Memory
print(f"Latency:    {profile.latency.to('ms'):~.2f}")   # → 0.54 ms
print(f"Throughput: {profile.throughput:.0f}")          # → 1843 / second

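At batch=1, ResNet-50 loads ~50 MB of weights but performs only ~8 GFLOPs, making it firmly memory-bound on any modern GPU. You can start from the curated Model Zoo and Silicon Zoo, or define your own workload and hardware objects when exploring a new design.

The solver identifies the bottleneck in microseconds using the Iron Law [1], which takes the slower of the compute time and the memory time, where η is the efficiency parameter (the achieved fraction of peak compute) and BW is the memory bandwidth: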

\[T = \max\!\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\ \frac{\text{Bytes}}{\text{BW}}\right)\]
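The formula above translates directly into a few lines of Python. This is a sketch of the equation only, not the MLSys·im solver: the `iron_law` function, the `RooflineResult` class, and the hardware numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RooflineResult:
    compute_time_s: float
    memory_time_s: float

    @property
    def latency_s(self) -> float:
        # T = max(compute time, memory time)
        return max(self.compute_time_s, self.memory_time_s)

    @property
    def bottleneck(self) -> str:
        return "Compute" if self.compute_time_s >= self.memory_time_s else "Memory"


def iron_law(flops: float, bytes_moved: float, peak_flops: float,
             bandwidth: float, efficiency: float = 1.0) -> RooflineResult:
    """Evaluate T = max(FLOPs / (Peak * eta), Bytes / BW)."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return RooflineResult(
        compute_time_s=flops / (peak_flops * efficiency),
        memory_time_s=bytes_moved / bandwidth,
    )


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s bandwidth, eta = 0.5.
PEAK, BW, ETA = 100e12, 1e12, 0.5
ridge = PEAK * ETA / BW  # effective ridge point: 50 FLOP/byte

low = iron_law(flops=1e9, bytes_moved=100e6, peak_flops=PEAK,
               bandwidth=BW, efficiency=ETA)     # 10 FLOP/byte
high = iron_law(flops=100e9, bytes_moved=100e6, peak_flops=PEAK,
                bandwidth=BW, efficiency=ETA)    # 1000 FLOP/byte

print(f"Ridge point: {ridge:.0f} FLOP/byte")
print(f"Low intensity:  {low.bottleneck}-bound, {low.latency_s * 1e6:.0f} us")
print(f"High intensity: {high.bottleneck}-bound, {high.latency_s * 1e6:.0f} us")
```

A workload whose arithmetic intensity falls below the effective ridge point is memory-bound, and one above it is compute-bound. Lowering η moves the ridge point left, so the classification near the ridge is sensitive to the efficiency assumption.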

Core workflows, one framework

Every solver takes typed registry objects and returns analytically grounded estimates. No benchmarking required for the first pass.

Roofline Analysis: Compute vs. memory bottleneck identification using the Iron Law. Single-node latency and throughput. Tutorial: Hello Roofline

3D Parallelism: Data, tensor, and pipeline parallel scaling efficiency. Ring all-reduce and pipeline bubble overhead. Tutorial: Scaling to 1000 GPUs
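Both overheads named here have standard first-order formulas. The sketch below uses the bandwidth term of ring all-reduce, 2(N-1)/N × bytes / bandwidth, and the GPipe-style bubble fraction, (p-1)/(m+p-1). Whether MLSys·im uses exactly these expressions is an assumption; the function names and example numbers are illustrative, and per-step link latency is ignored.

```python
def ring_allreduce_time_s(message_bytes: float, num_gpus: int,
                          link_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth term of ring all-reduce.

    Each GPU sends (and receives) 2 * (N - 1) / N of the message:
    (N - 1) / N in reduce-scatter and (N - 1) / N in all-gather.
    """
    if num_gpus < 2:
        return 0.0  # nothing to communicate
    factor = 2 * (num_gpus - 1) / num_gpus
    return factor * message_bytes / link_bandwidth_bytes_per_s


def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Hypothetical example: 1 GB of gradients, 32 data-parallel ranks, 50 GB/s links.
t_comm = ring_allreduce_time_s(1e9, 32, 50e9)
print(f"All-reduce time: {t_comm * 1e3:.2f} ms")

# 2 pipeline stages with 15 microbatches -> 1/16 of the schedule is idle.
bubble = pipeline_bubble_fraction(num_stages=2, num_microbatches=15)
print(f"Pipeline bubble: {bubble:.2%}")
```

The all-reduce factor approaches 2 as N grows, so per-GPU communication time is nearly independent of cluster size. The bubble shrinks as the number of microbatches grows relative to the number of stages.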

LLM Serving: Time-to-first-token (TTFT), inter-token latency (ITL), and KV-cache memory pressure. Tutorial: Two Phases of Inference
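KV-cache pressure follows from a simple count: one key and one value vector per layer, per KV head, per token. The sketch below applies that count to the published Llama-3.1-8B configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128) at fp16. The `kv_cache_bytes` helper is illustrative and not an MLSys·im function, and the 16,384-token context is an arbitrary example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Llama-3.1-8B: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
LLAMA_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**LLAMA_8B, seq_len=1)
full_context = kv_cache_bytes(**LLAMA_8B, seq_len=16_384)

print(f"Per token:     {per_token / 1024:.0f} KiB")       # 128 KiB
print(f"16,384 tokens: {full_context / 2**30:.0f} GiB")   # 2 GiB
```

The cache grows linearly in both sequence length and batch size, so doubling either doubles the memory that must stay resident next to the weights.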

Memory & Capacity: Training memory breakdown, serving replica sizing, and MoE routing imbalance as first-order design checks. Tutorial: Memory, Capacity, and MoE

Total Cost of Ownership: CapEx, OpEx, electricity, maintenance, and per-query economics over any time horizon. Tutorial: The $9M Question

Sustainability: Energy, carbon footprint (kg CO₂e), and water usage across datacenter regions. Tutorial: Geography Matters
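To first order, operational carbon is energy multiplied by grid carbon intensity, so region choice scales the footprint linearly. The sketch below uses the grid intensities quoted on this page. The 10,000 kWh job, the `pue` parameter, and the `carbon_kg` helper are hypothetical and not taken from the MLSys·im API.

```python
# Grid carbon intensity in g CO2 per kWh, as quoted on this page.
GRID_INTENSITY_G_PER_KWH = {
    "Quebec": 20,
    "Norway": 10,
    "US Avg": 390,
    "Poland": 820,
}


def carbon_kg(it_energy_kwh: float, region: str, pue: float = 1.0) -> float:
    """kg CO2 = IT energy (kWh) x PUE x grid intensity (g/kWh) / 1000."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_kwh * pue * GRID_INTENSITY_G_PER_KWH[region] / 1000.0


JOB_KWH = 10_000  # hypothetical training job

footprints = {region: carbon_kg(JOB_KWH, region)
              for region in GRID_INTENSITY_G_PER_KWH}
for region, kg in sorted(footprints.items(), key=lambda item: item[1]):
    print(f"{region:<7} {kg:8.0f} kg CO2")

print(f"Poland vs. Quebec: {footprints['Poland'] / footprints['Quebec']:.0f}x")
print(f"Poland vs. Norway: {footprints['Poland'] / footprints['Norway']:.0f}x")
```

Because the job's energy and PUE cancel in the ratio, the regional comparison depends only on grid intensity when every site is assumed to have the same PUE.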

Reliability: Fleet MTBF, failure probability, and Young-Daly optimal checkpoint interval. Tutorial: Sensitivity Analysis
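The three reliability quantities named here have compact closed forms under the usual assumption of independent, exponentially distributed failures. The sketch below uses the first-order Young-Daly optimum, √(2 × C × MTBF), where C is the time to write one checkpoint. The function names and the node MTBF, fleet size, and checkpoint cost are illustrative assumptions, not MLSys·im defaults.

```python
import math


def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF when any single node failure interrupts the job."""
    return node_mtbf_hours / num_nodes


def failure_probability(job_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure during the job."""
    return 1.0 - math.exp(-job_hours / mtbf_hours)


def young_daly_interval_hours(checkpoint_cost_hours: float,
                              mtbf_hours: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Hypothetical fleet: 1,000 nodes, each with a 50,000-hour MTBF.
mtbf = fleet_mtbf_hours(node_mtbf_hours=50_000, num_nodes=1_000)
p_fail = failure_probability(job_hours=24, mtbf_hours=mtbf)
interval = young_daly_interval_hours(checkpoint_cost_hours=0.01,  # 36 seconds
                                     mtbf_hours=mtbf)

print(f"Fleet MTBF:          {mtbf:.0f} h")
print(f"P(failure in 24 h):  {p_fail:.1%}")
print(f"Checkpoint interval: {interval:.2f} h")
```

The interval grows only with the square root of both checkpoint cost and MTBF, so a fleet that is four times larger should checkpoint about twice as often.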

Validate the approximation

The outputs are analytical estimates, not measurements. Treat them as a way to narrow the design space, expose assumptions, and identify the next experiment to run.

Check feasibility first. If memory, network, or data movement is infeasible analytically, benchmarking will not rescue the design.

Calibrate efficiency. Compute-heavy solvers expose an efficiency parameter. Defaults are literature-informed starting points; replace them with measured MFU or sustained throughput when you have it.

Compare regimes. The most robust conclusions are binding constraints, crossover points, and sensitivity rankings.

Escalate fidelity deliberately. Use MLSys·im to choose the few configurations worth validating with profiling, traces, or production load tests.

Read the accuracy and validation guide →

Learn by doing

Beginner

Hello Roofline

Memory-bound vs. compute-bound in 5 lines of Python. Sweep batch sizes and see the roofline crossover.

Beginner

The Memory Wall

Why most LLM inference is memory-bound, not compute-bound. Visualize the gap between peak FLOP/s and bandwidth.

Intermediate

Two Phases of Inference

Pre-fill is compute-bound, decode is memory-bound. Model both phases and diagnose KV-cache pressure.

Intermediate

Memory, Capacity, and MoE

Estimate training memory, serving replicas, and sparse expert routing imbalance with explicit assumptions.

Advanced

Scaling to 1000 GPUs

Ring all-reduce communication, pipeline bubbles, and scaling efficiency on distributed GPU clusters.

Advanced

Sensitivity Analysis

Perturb hardware parameters and identify which constraint is actually worth improving.

See all tutorials →

Companion learning material

MLSys·im is designed to pair with the Machine Learning Systems textbook and course materials. The links below point to the readable slide websites and tutorial overview; the MLSys·im tutorials remain the main path for using the tool.

17 Decks

Volume I: Foundations

The single-machine ML stack: data engineering, neural computation, training, compression, hardware acceleration, and serving. Browse Volume I →

18 Decks

Volume II: At Scale

Distributed infrastructure: compute clusters, network fabrics, distributed training, fault tolerance, fleet orchestration, inference at scale, and sustainability. Browse Volume II →

Tutorial

Quantitative ML Systems Tutorial

A tutorial-oriented path through the Iron Law, the 5-layer stack, and MLSys·im examples from single-node roofline to fleet-scale carbon analysis.

For course use, see the Teaching Guide for semester plans and customization instructions.

Built for

Students

Build intuition for why ML systems behave as they do. Run roofline analysis, see the memory wall, compute carbon footprints — all without needing GPU hardware. See learning path →

Instructors

Assign analytically grounded problem sets with deterministic, reproducible outputs. Pair MLSys·im exercises with 35 ready-to-teach Beamer slide decks, each with speaker notes and active learning prompts. See course integration →

Engineers & Researchers

Pre-deployment estimates for any architecture. Model distributed overheads, LLM serving latency, and multi-region sustainability before provisioning hardware. See quick API guide →

Citation

If you use the MLSys·im Python package in coursework, research, or infrastructure analysis, please cite:

@software{mlsysim2026,
  author       = {Janapa Reddi, Vijay},
  title        = {{MLSys$\cdot$im}: First-Principles Infrastructure Modeling for Machine Learning Systems},
  year         = {2026},
  url          = {../mlsysim},
  version      = {0.1.2},
  institution  = {Harvard University}
}

The slide decks, MLSys·im engine, tutorials, and textbook are all part of the same open-source ecosystem. Cite the textbook separately when you use the book or course materials. View all resources on GitHub.

References

[1]
S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009, doi: 10.1145/1498765.1498785.

© 2024-2026 Harvard University. Code: Apache-2.0 · Docs: CC-BY-NC-SA 4.0

Part of the Machine Learning Systems textbook