MLSYSIM: A First-Principles Analytical Engine for Teaching Machine Learning Systems

From Roofline Bounds to Datacenter Carbon: Bridging the Gap Between Textbook Theory and Systems Reality

Author

Vijay Janapa Reddi

Abstract

Machine learning systems education faces a practical gap: the hardware students need to reason about — H100 clusters, InfiniBand fabrics, multi-megawatt datacenters — is inaccessible for hands-on experimentation. We present MLSYSIM, a first-principles analytical engine designed as the companion framework to the Machine Learning Systems textbook [mlsysbook2024]. MLSYSIM provides six composable solvers covering single-node performance (Roofline), distributed training (4D parallelism including Expert Parallelism for MoE), LLM serving (pre-fill vs. decode), Total Cost of Ownership, carbon footprint, and cluster reliability. All quantities carry physical units via pint.Quantity types, enforcing dimensional correctness at runtime. A vetted registry of 19 hardware devices spanning five deployment tiers (cloud to sub-watt TinyML), 15 model architectures, and 4 regional grid profiles provides a single source of truth that keeps textbook exercises grounded in real-world specifications. A formula library of 20+ canonical equations, a persona-driven simulation layer, and built-in visualization round out the platform. MLSYSIM is open source and available at mlsysbook.ai.


1. Introduction

The “Iron Law” of machine learning performance states that inference latency is bounded by two ceilings: the time to execute all floating-point operations at peak throughput, and the time to transfer all model weights from memory at peak bandwidth. Whichever is slower determines the bottleneck. This principle, formalized in the Roofline model [1], is foundational to ML systems reasoning, yet teaching it effectively requires students to work with real hardware specifications that most universities cannot afford to provide.

This accessibility gap creates a pedagogical problem. Students learn that an NVIDIA H100 achieves 1,979 TFLOP/s at FP16 Tensor Core and has 3.35 TB/s of HBM bandwidth [2], but without a framework to apply these numbers, the specifications remain abstract. They study 3D parallelism (Data, Tensor, and Pipeline parallelism) but cannot experiment with different configurations to observe how pipeline bubbles grow or communication overhead scales. They discuss carbon-aware computing but lack the tools to quantify how training location affects emissions [3].

MLSYSIM addresses this gap by providing a dimensionally strict, first-principles analytical engine. It is not an empirical profiler (like PyTorch Profiler), nor a cycle-accurate simulator (like gem5). It is an analytical modeling platform that computes performance bounds from specifications and first-order equations. This design choice is deliberate: by working from equations rather than empirical traces, students build the mathematical intuition needed to reason about systems they will encounter in practice.

1.1 Pedagogical Motivation

MLSYSIM serves three user communities:

  1. Students learning ML systems for the first time. MLSYSIM lets them explore “what-if” questions: What happens to latency when I change precision from FP16 to INT8? How does pipeline parallelism degree affect the bubble fraction? Where should I train to minimize carbon?

  2. Instructors who need reproducible, hardware-independent exercises. MLSYSIM’s vetted registry ensures that homework problems produce consistent results regardless of the student’s local hardware. Companion lecture slides provide ready-made visual materials that align directly with the solver domains.

  3. Developers building ML infrastructure. MLSYSIM’s type-safe API provides quick back-of-the-envelope estimates for capacity planning, hardware selection, and cost modeling.

1.2 Design Principles

Three principles guide MLSYSIM’s architecture:

  • Eliminate Magic Numbers. Every constant in the framework (hardware FLOP/s, memory bandwidth, carbon intensity) is sourced from manufacturer datasheets or published benchmarks, with provenance metadata attached. Students never work with unexplained numbers.

  • Enforce Dimensional Correctness. All quantities carry physical units via the pint library. Attempting to add FLOP/s to GB/s raises a DimensionalityError at runtime, catching the class of unit-conversion bugs that plague back-of-the-envelope calculations.

  • Progressive Disclosure. The framework scales with the student. A first exercise uses Engine.solve() with two arguments. An advanced lab configures 4D parallelism with Expert Parallelism, sweeps grid regions, and chains multiple solvers to answer compound questions.

1.3 Contributions

This paper makes four contributions:

  1. A 5-layer analytical architecture (Section 2) that cleanly separates workload demand from hardware supply, enabling compositional analysis across the full ML systems stack.

  2. Six composable analytical solvers (Section 3) covering performance, serving, distributed scaling, economics, sustainability, and reliability, each grounded in established systems models (Roofline, Young-Daly, ring all-reduce).

  3. A vetted specification registry (Section 5) providing a single source of truth for hardware, model, infrastructure, and fleet specifications used throughout the Machine Learning Systems textbook.

  4. A formula library and simulation layer (Section 4) exposing 20+ canonical equations as individually importable, unit-aware functions, paired with persona-driven simulations for lab exercises.

1.4 Paper Organization

Section 2 presents the 5-layer stack architecture. Section 3 details the six analytical solvers and their mathematical foundations. Section 4 describes the formula library and simulation layer. Section 5 describes the MLSys Zoo registry. Section 6 discusses pedagogical integration with the textbook and lecture slides. Section 8 addresses accuracy and validation. Section 9 discusses limitations and future work. Section 10 concludes.


2. Architecture: The 5-Layer Stack

MLSYSIM organizes the ML systems domain into five composable layers, following a strategy we call Progressive Lowering: abstract workload demand is progressively mapped onto concrete hardware supply through intermediate representations. This mirrors the layered decomposition taught in the Hardware Acceleration and Compute Infrastructure lecture slides.

%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A: Workloads</b><br/>TransformerWorkload, CNNWorkload<br/><i>Parameters, FLOPs, Arithmetic Intensity</i>"]
    B["<b>Layer B: Hardware</b><br/>HardwareNode, ComputeCore, MemoryHierarchy<br/><i>Peak FLOP/s, Bandwidth, Capacity, TDP</i>"]
    C["<b>Layer C: Infrastructure</b><br/>GridProfile, Datacenter<br/><i>Carbon Intensity, PUE, WUE</i>"]
    D["<b>Layer D: Systems</b><br/>Node, Fleet, NetworkFabric<br/><i>Topology, Accelerators/Node, Fabric BW</i>"]
    E["<b>Layer E: Solvers</b><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
    F["<b>Results</b><br/>PerformanceProfile · SystemLedger"]

    A --> E
    B --> D
    C --> D
    D --> E
    E --> F

2.1 Layer A: Workloads (Demand)

A Workload is a hardware-agnostic description of computational demand. MLSYSIM provides two concrete workload types:

  • TransformerWorkload: Defines parameter count, layer count, hidden dimension, attention heads, and KV-head count. Supports KV-cache size calculation for serving analysis.
  • CNNWorkload: Defines parameter count and inference FLOPs.

Both workloads implement lower(precision) -> ComputationGraph, which produces a hardware-agnostic intermediate representation containing total operations, weight bytes, and arithmetic intensity (ops/byte). This lowering step is where precision format (FP32, FP16, INT8, INT4) affects the analysis: lower precision reduces weight bytes, increasing arithmetic intensity and potentially shifting the bottleneck from memory-bound to compute-bound.
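The precision effect on arithmetic intensity can be sketched in a few lines. This is a hypothetical stand-in for the lowering step, not MLSYSIM's actual API; the 7B-parameter model and the ~2-FLOPs-per-parameter-per-token decode estimate are illustrative:

```python
# Sketch of lowering: precision sets bytes per parameter, which sets
# arithmetic intensity (ops/byte). Names and numbers are illustrative.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def lower(params, total_ops, precision):
    """Return (total_ops, weight_bytes, arithmetic intensity in ops/byte)."""
    weight_bytes = params * BYTES_PER_PARAM[precision]
    return total_ops, weight_bytes, total_ops / weight_bytes

# A 7B-parameter decoder generating one token reads every weight once,
# spending roughly 2 FLOPs per parameter.
params = 7e9
for prec in ("fp16", "int8", "int4"):
    ops, byts, ai = lower(params, 2 * params, prec)
    print(f"{prec}: {ai:.1f} ops/byte")
```

Halving the weight bytes doubles the arithmetic intensity, which is exactly the mechanism by which quantization can move a workload across the ridge point.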

2.2 Layer B: Hardware (Supply)

A HardwareNode specifies the physical capabilities of a single accelerator:

| Field | Type | Meaning |
|---|---|---|
| compute.peak_flops | Quantity (TFLOP/s) | Theoretical peak throughput |
| compute.precision_flops | Dict | Peak throughput per precision format |
| memory.bandwidth | Quantity (TB/s) | HBM bandwidth |
| memory.capacity | Quantity (GB) | Total HBM capacity |
| tdp | Quantity (W) | Thermal Design Power |
| dispatch_tax | Quantity (ms) | Kernel launch overhead |

The ridge_point() method computes the Roofline inflection point: peak_flops / bandwidth, expressed in FLOP/byte. Workloads with arithmetic intensity below this threshold are memory-bound; those above are compute-bound. This concept is central to the Hardware Acceleration slides, which develop the Roofline model visually before students encounter it programmatically.
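Using the H100 figures quoted in the introduction, the ridge point works out as a one-line division. This is a back-of-the-envelope sketch, not MLSYSIM's ridge_point() implementation:

```python
# Ridge point = peak FLOP/s divided by memory bandwidth, in FLOP/byte.
# H100 figures from the introduction: 1,979 TFLOP/s (FP16 Tensor Core),
# 3.35 TB/s HBM bandwidth.
peak_flops = 1979e12   # FLOP/s
bandwidth = 3.35e12    # bytes/s
ridge = peak_flops / bandwidth
print(f"Ridge point: {ridge:.0f} FLOP/byte")

# A decode step with arithmetic intensity near 1 op/byte sits two orders
# of magnitude below this ridge, so it is firmly memory-bound.
```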

2.3 Layer C: Infrastructure (Environment)

A GridProfile captures the environmental context of computation:

  • Carbon Intensity (gCO2/kWh): Ranges from ~20 (Quebec hydro) to ~820 (Poland coal) in the registry, a ~41x difference for identical workloads.
  • Power Usage Effectiveness (PUE): The ratio of total facility energy to IT energy. Ranges from 1.03 (liquid-cooled) to 1.45 (legacy air-cooled).
  • Water Usage Effectiveness (WUE): Liters of water consumed per kWh of energy.

These metrics align with the carbon accounting framework presented in the Sustainable AI slides, where students learn the three-phase carbon lifecycle before computing operational emissions with MLSYSIM.

2.4 Layer D: Systems (Topology)

A Fleet composes hardware into a cluster:

  • Node: Groups accelerators within a server (e.g., 8x H100 with 900 GB/s NVLink).
  • NetworkFabric: Specifies inter-node connectivity (bandwidth, topology, oversubscription ratio).
  • Fleet: Defines node count and links to infrastructure context.

The total_accelerators property (nodes x accelerators_per_node) determines the scale available for parallelism decomposition. The Network Fabrics slides cover the topology trade-offs (fat-tree, rail-optimized) and the alpha-beta communication model that the DistributedModel builds upon.

2.5 Layer E: Solvers (Analysis)

Solvers bridge demand (Layer A) and supply (Layers B–D) to produce analytical results. Each solver implements a solve() method that accepts typed inputs and returns structured outputs. Section 4 details each solver’s mathematical model.

2.6 Results: PerformanceProfile and SystemLedger

Solver outputs are structured into two result types:

  • PerformanceProfile: The output of Engine.solve(), containing latency, throughput, bottleneck classification, arithmetic intensity, energy consumption, MFU (Model FLOP Utilization), HFU (Hardware FLOP Utilization), and memory feasibility.

  • SystemLedger: A unified result container produced by the simulation layer (Section 5), aggregating PerformanceMetrics, SustainabilityMetrics, EconomicMetrics, and ReliabilityMetrics into a single record for multi-dimensional analysis.


3. Analytical Solvers

MLSYSIM provides six solvers, each targeting a distinct class of systems question. All solvers share a common interface (BaseSolver.solve()) and can be composed to answer compound questions.

3.1 SingleNodeModel: The Roofline Model

The SingleNodeModel implements the Iron Law of ML performance:

\[T_{\text{latency}} = \max\!\left(\frac{\text{FLOPs}}{\text{Peak}_\text{FLOP/s} \times \eta},\;\frac{\text{Bytes}}{\text{BW}_\text{mem}}\right) + T_{\text{dispatch}}\]

where \(\eta\) is the hardware utilization efficiency (typically 0.25–0.55 for ML workloads).

Inputs: Workload, HardwareNode, batch size, precision, efficiency.

Outputs: A PerformanceProfile containing latency, throughput, bottleneck classification (“Memory Bound” or “Compute Bound”), arithmetic intensity, energy consumption, and memory feasibility.

The solver first maps precision to bytes-per-parameter and selects the appropriate peak FLOP/s (e.g., FP32 vs. FP16 Tensor Core throughput). It then computes both the compute-bound and memory-bound latencies, takes the maximum, adds the dispatch tax, and determines which ceiling binds. The Hardware Acceleration slides introduce the Roofline plot visually; MLSYSIM makes it interactive.

from mlsysim import Engine, Hardware, Models

profile = Engine.solve(
    model=Models.ResNet50,
    hardware=Hardware.Cloud.H100,
    batch_size=1,
    precision="fp16"
)
print(f"Bottleneck: {profile.bottleneck}")  # -> Memory Bound
print(f"Latency: {profile.latency}")        # -> 0.03 ms

3.2 ServingModel: LLM Inference Phases

LLM inference has two physically distinct phases, a dichotomy explored in both the Model Serving slides (Volume I) and the Inference at Scale slides (Volume II):

  1. Pre-fill (Compute-Bound): All prompt tokens are processed in parallel. Latency scales with 2 * params * seq_len * batch_size / (peak_flops * eta). This determines Time-To-First-Token (TTFT).

  2. Decode (Memory-Bound): Each token requires reading all model weights plus the KV-cache from HBM. Latency per token scales with (weight_bytes + kv_cache_bytes) / bandwidth. This determines Inter-Token Latency (ITL).

The solver also computes KV-cache memory [4]:

\[\text{KV-cache} = 2 \times n_\text{layers} \times n_\text{kv\_heads} \times d_\text{head} \times \text{seq\_len} \times \text{batch} \times \text{bytes/element}\]

and checks whether weights + KV-cache <= HBM capacity (the “Memory Wall”). When this constraint is violated, the model cannot be served on the target hardware without techniques such as quantization or tensor parallelism.
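The KV-cache formula and the Memory Wall check translate directly into code. The configuration below is a Llama-2-70B-style illustration (80 layers, GQA with 8 KV heads, head dimension 128) rather than a registry entry:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem):
    # Factor of 2: one K and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Llama-2-70B-style config at FP16, one 4,096-token sequence.
kv = kv_cache_bytes(80, 8, 128, 4096, 1, 2)
weights = 70e9 * 2          # FP16 weight bytes
hbm = 80e9                  # single H100's HBM capacity
print(f"KV-cache: {kv / 1e9:.2f} GB")
print("Fits on one H100:", weights + kv <= hbm)
```

Here the weights alone (~140 GB) exceed a single H100's 80 GB of HBM, so the Memory Wall binds before the KV-cache even matters; tensor parallelism or quantization is required.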

3.3 DistributedModel: 4D Parallelism

For fleet-scale training, the solver decomposes the workload using four parallelism dimensions. The Distributed Training slides introduce the “3D parallelism cube”; MLSYSIM extends this with a fourth axis for Mixture-of-Experts:

  • Data Parallelism (DP): Replicates the model across dp_size workers. Requires all-reduce of gradients after each step.
  • Tensor Parallelism (TP): Splits individual layers across tp_size GPUs within a node (over NVLink).
  • Pipeline Parallelism (PP): Chains model stages across pp_size nodes, introducing pipeline bubbles.
  • Expert Parallelism (EP): For Mixture-of-Experts architectures [5], distributes expert sub-networks across ep_size GPUs with All-to-All communication for token routing.

The total accelerator count constrains the decomposition: dp_size * tp_size * pp_size * ep_size = total_accelerators.

Communication overhead is modeled using the ring all-reduce formula for DP gradients [6] and the alpha-beta model taught in the Collective Communication slides:

\[T_{\text{ring}} = 2 \cdot \frac{N-1}{N} \cdot \frac{S}{\text{BW}} + 2(N-1) \cdot \alpha\]

where \(N\) is the number of workers, \(S\) is the message size (gradient tensor bytes), BW is the effective fabric bandwidth (accounting for oversubscription), and \(\alpha\) is the per-message latency.

Pipeline bubble fraction follows the interleaved pipeline model [7]:

\[\text{Bubble} = \frac{P - 1}{V \times M + P - 1}\]

where \(P\) is the pipeline depth, \(M\) is the number of microbatches, and \(V\) is the number of virtual stages per GPU.

Scaling efficiency is computed as:

\[\eta_{\text{scale}} = \frac{T_{\text{compute}}}{T_{\text{compute}} + T_{\text{comm}} + T_{\text{bubble}}}\]
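The communication and bubble formulas above compose into a quick scaling estimate. The sketch below is self-contained with hypothetical fabric and model numbers (8-way DP over a 50 GB/s effective fabric, 2-byte gradients for a 7B model, 5 µs per-message latency), not the DistributedModel itself:

```python
def ring_allreduce_time(n, msg_bytes, bw, alpha):
    # T_ring = 2*(N-1)/N * S/BW + 2*(N-1)*alpha
    return 2 * (n - 1) / n * msg_bytes / bw + 2 * (n - 1) * alpha

def bubble_fraction(p, m, v=1):
    # Interleaved pipeline: (P-1) / (V*M + P-1)
    return (p - 1) / (v * m + p - 1)

# Hypothetical: 8-way DP, 7B params at FP16 gradients, 50 GB/s, 5 us latency.
t_comm = ring_allreduce_time(8, 7e9 * 2, 50e9, 5e-6)
print(f"All-reduce per step: {t_comm * 1e3:.0f} ms")

# Bubble fraction shrinks as the microbatch count grows (P=8, V=1).
for m in (4, 16, 64):
    print(f"M={m}: bubble = {bubble_fraction(8, m):.1%}")
```

The bandwidth term dominates the latency term by several orders of magnitude at this message size, which is why gradient compression and overlap matter at DP scale.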

3.4 EconomicsModel: Total Cost of Ownership

The EconomicsModel computes TCO, grounding the cost models presented in the Compute Infrastructure slides:

\[\text{TCO} = \text{CapEx} + \text{OpEx}_\text{energy} + \text{OpEx}_\text{maintenance}\]

where:

  • CapEx = unit cost x total accelerators
  • OpEx (energy) = total energy (from SustainabilityModel) x electricity price
  • OpEx (maintenance) = 5% annual maintenance ratio x CapEx x (duration in days / 365)
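The TCO breakdown above is a direct sum; the sketch below transcribes it with hypothetical prices and energy figures (8 accelerators at $30k each, a 30-day run drawing 4,032 kWh at $0.10/kWh):

```python
# Direct transcription of the TCO breakdown; all numbers in the example
# call are hypothetical.
def tco(unit_cost, n_accel, energy_kwh, price_per_kwh, duration_days,
        maintenance_ratio=0.05):
    capex = unit_cost * n_accel                                  # hardware
    opex_energy = energy_kwh * price_per_kwh                     # electricity
    opex_maint = maintenance_ratio * capex * (duration_days / 365)
    return capex + opex_energy + opex_maint

print(f"TCO: ${tco(30_000, 8, 4_032, 0.10, 30):,.0f}")
```

At this scale CapEx dwarfs a month of energy, which is why short jobs are priced per GPU-hour while multi-year fleets amortize CapEx against energy and maintenance.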

3.5 SustainabilityModel: Carbon, Energy, Water

The SustainabilityModel chains three calculations, implementing the carbon accounting framework from the Sustainable AI slides:

\[E_{\text{IT}} = \text{TDP} \times N_\text{accel} \times T_\text{hours}\]
\[E_{\text{total}} = E_{\text{IT}} \times \text{PUE}\]
\[\text{Carbon} = E_{\text{total}} \times \text{CI}_\text{region}\]
\[\text{Water} = E_{\text{total}} \times \text{WUE}\]

This solver illustrates a key insight for sustainable ML: the ~41x difference in carbon intensity between Quebec (20 gCO2/kWh) and Poland (820 gCO2/kWh) means that where you train can matter as much as how you train [3].
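Chaining the equations makes the location effect concrete. The 1,024-GPU, 240-hour run and PUE of 1.1 below are hypothetical; the grid intensities are the registry values quoted in the text:

```python
def operational_carbon(tdp_w, n_accel, hours, pue, ci_g_per_kwh):
    e_it = tdp_w * n_accel * hours / 1000     # IT energy, kWh
    e_total = e_it * pue                      # facility energy, kWh
    return e_total * ci_g_per_kwh / 1000      # kg CO2

# Same hypothetical job, two grids.
run = dict(tdp_w=700, n_accel=1024, hours=240, pue=1.1)
quebec = operational_carbon(**run, ci_g_per_kwh=20)
poland = operational_carbon(**run, ci_g_per_kwh=820)
print(f"Quebec: {quebec / 1000:.1f} t CO2, Poland: {poland / 1000:.1f} t CO2, "
      f"ratio {poland / quebec:.0f}x")
```

The energy consumed is identical in both cases; only the grid changes, and the ratio is exactly the 41x quoted above.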

3.6 ReliabilityModel: MTBF and Checkpointing

At cluster scale, component failures become statistical certainties. The solver computes the quantities taught in the Fault Tolerance slides:

Fleet MTBF: For \(N\) independent nodes each with MTBF \(\mu\):

\[\text{MTBF}_\text{fleet} = \frac{\mu}{N}\]

Failure probability for a job of duration \(T\):

\[P(\text{failure}) = 1 - e^{-T / \text{MTBF}_\text{fleet}}\]

Optimal checkpoint interval using the Young-Daly formula [8], [9]:

\[\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}\]

where \(\delta\) is the time to save one checkpoint.
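The three reliability formulas compose into a quick estimate. The fleet parameters below (1,024 nodes, a 5-year node MTBF, a 30-day job, a 5-minute checkpoint) are hypothetical:

```python
import math

def fleet_mtbf(node_mtbf_h, n_nodes):
    # N independent nodes: fleet MTBF = mu / N
    return node_mtbf_h / n_nodes

def failure_prob(duration_h, mtbf_h):
    # Exponential failure model: P = 1 - exp(-T / MTBF)
    return 1 - math.exp(-duration_h / mtbf_h)

def young_daly_interval(checkpoint_h, mtbf_h):
    # tau_opt = sqrt(2 * delta * MTBF)
    return math.sqrt(2 * checkpoint_h * mtbf_h)

mtbf = fleet_mtbf(5 * 365 * 24, 1024)   # hours between fleet-level failures
p = failure_prob(30 * 24, mtbf)
tau = young_daly_interval(5 / 60, mtbf)
print(f"Fleet MTBF: {mtbf:.1f} h, P(failure in 30 d): {p:.6f}, "
      f"checkpoint every {tau:.1f} h")
```

Even with highly reliable nodes, the fleet fails roughly every couple of days at this scale, making a failure during the month-long job a near-certainty and frequent checkpointing mandatory.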

3.7 Composing Solvers

Real-world questions often require chaining multiple solvers. For example, answering “Can I serve Llama-70B on 4x H100s, and what will it cost?” requires the ServingModel (feasibility and latency) followed by the EconomicsModel (per-query cost). Similarly, “What is the most sustainable way to train GPT-3?” chains the DistributedModel (optimal parallelism) with the SustainabilityModel (carbon by region).

%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "Solver composition for compound questions. Each solver's output feeds the next, enabling multi-dimensional analysis."
%%| fig-width: 100%
flowchart LR
    Q1["Can I serve<br/>Llama-70B on<br/>4x H100s?"] --> S1["ServingModel"]
    S1 --> S2["EconomicsModel"]
    S2 --> A1["Feasible at<br/>$X/query"]

    Q2["Most sustainable<br/>way to train<br/>GPT-3?"] --> S3["DistributedModel"]
    S3 --> S4["SustainabilityModel"]
    S4 --> A2["Quebec saves<br/>~41x carbon"]


4. Formula Library and Simulation Layer

Beyond the six solvers, MLSYSIM exposes two additional subsystems that support deeper engagement.

4.1 Canonical Formula Library

The formulas module provides 20+ individually importable, unit-aware functions that implement the canonical equations of ML systems. Each function is documented with its source reference and enforces dimensional correctness:

| Formula | Domain | Source |
|---|---|---|
| calc_transformer_training_flops | Training | 6PD scaling law |
| calc_activation_memory | Training | Korthikanti et al. |
| calc_checkpoint_size | Reliability | Young-Daly model |
| calc_amdahls_speedup | Scaling | Amdahl's Law |
| calc_ring_allreduce_time | Communication | Alpha-beta model |
| calc_mtbf_node | Reliability | Exponential failure model |
| calc_availability_stacked | Reliability | Series reliability |
| calc_effective_flops | Performance | Roofline model |
| calc_kv_cache_bytes | Serving | Transformer architecture |

This library serves two purposes. First, it enables students to import and compose individual equations in Jupyter notebooks without instantiating full solver objects, supporting the exploratory style of the Performance Engineering slides. Second, it provides a single source of truth for the solvers themselves: every solver delegates to these functions rather than re-implementing the math inline.
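Two of the tabulated equations can be written out in a few lines each. MLSYSIM exposes them as calc_transformer_training_flops and calc_amdahls_speedup, but the signatures below are assumptions, so the sketch is self-contained:

```python
# Self-contained versions of two canonical formulas from the table above;
# function names and signatures are illustrative, not MLSYSIM's.
def training_flops(params, tokens):
    # 6PD scaling law: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's Law: serial fraction bounds the achievable speedup.
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n_workers)

print(f"GPT-3 (175B params, 300B tokens): "
      f"{training_flops(175e9, 300e9):.2e} FLOPs")
print(f"95% parallel on 1024 workers: "
      f"{amdahl_speedup(0.95, 1024):.1f}x speedup")
```

Composing such functions in a notebook, without instantiating solver objects, is precisely the exploratory workflow the library is meant to support.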

4.2 Persona-Driven Simulations

The simulation layer introduces four Personas that represent canonical deployment archetypes:

| Persona | Scale Factor | Archetype |
|---|---|---|
| CloudTitan | 100M x | Frontier training runs |
| EdgeGuardian | 1,000 x | Edge fleet deployments |
| MobileNomad | 10 x | On-device inference |
| TinyPioneer | 1 x | Sub-watt TinyML |

Each persona wraps a solver configuration and produces a SystemLedger result. This narrative layer transforms abstract solver outputs into relatable deployment stories for lab exercises: students do not just “run the DistributedModel”; they “plan a CloudTitan training run” and see how cost, carbon, and reliability interact at 8,192-GPU scale.

4.3 Visualization

MLSYSIM includes built-in visualization functions:

  • plot_roofline(): Generates annotated Roofline plots showing the compute and memory ceilings, the ridge point, and where specific workloads fall.
  • plot_evaluation_scorecard(): Produces multi-panel scorecards from scenario evaluations, showing feasibility, performance, cost, and sustainability in a unified view.

These visualizations complement the Roofline diagrams in the Hardware Acceleration slides and the diagnostic flowcharts in the Performance Engineering slides, allowing students to generate publication-quality figures from their own analyses.


5. The MLSys Zoo: A Centralized Specification Registry

A persistent challenge in ML systems education is the staleness of hardware specifications. Textbook exercises written with 2020-era A100 specs become misleading when students encounter H100s or B200s. MLSYSIM addresses this with a centralized, version-controlled registry.

5.1 Registry Design

Each registry entry is a Pydantic model with:

  • Typed fields enforced at construction time (a HardwareNode cannot be created without peak_flops)
  • Physical units attached to all quantities (80 * ureg.GB, not 80)
  • Provenance metadata linking to manufacturer datasheets

5.2 Hardware Zoo

The Hardware Zoo spans five deployment tiers, covering a power range from 0.005 W to 1,000 W and a memory range from 512 KiB to 192 GB:

| Tier | Devices | Characteristic |
|---|---|---|
| Cloud | V100, A100, H100, H200, B200, MI300X, TPUv5p, T4 | 300–1,000 W, TB/s bandwidth |
| Workstation | DGX Spark, MacBook M3 Max | 100–200 W, unified or HBM memory |
| Mobile | iPhone 15 Pro, Pixel 8, Snapdragon 8 Gen 3 | 5 W, battery-constrained |
| Edge | Jetson Orin NX, Coral, NUC+Movidius, Edge Server | 2–25 W, latency-constrained |
| Tiny | ESP32-S3, Himax WE-I Plus | Sub-watt, KB-scale memory |

This 200,000x power span enables students to reason about the full deployment spectrum in a single framework, from the sub-watt TinyML devices explored in the TinyML slides to the datacenter-scale accelerators covered in the Compute Infrastructure slides.

5.3 Model Zoo

The Model Zoo provides 15 pre-configured workload profiles across four categories:

| Category | Models |
|---|---|
| Language | GPT-2, GPT-3, GPT-4, BERT-Base, Llama-2-70B, Llama-3-8B, Llama-3-70B |
| Vision | AlexNet, ResNet-50, MobileNetV2, YOLOv8-Nano |
| Tiny | DS-CNN (keyword spotting), WakeVision, Anomaly Detector |
| Recommendation | DLRM |

5.4 Infrastructure and Systems Zoos

The Infrastructure Zoo provides four grid profiles with dramatically different carbon intensities (Quebec, Norway, US Average, Poland). The Systems Zoo provides pre-configured fleet topologies (256-GPU research cluster, 8,192-GPU frontier cluster) with appropriate networking fabrics.


6. Pedagogical Integration

MLSYSIM is designed as the computational companion to the Machine Learning Systems textbook [mlsysbook2024]. Each solver maps to specific textbook chapters and their corresponding lecture slide decks:

| Solver | Textbook Chapters | Slide Decks |
|---|---|---|
| SingleNodeModel | Training, Hardware Acceleration, Benchmarking | Training, HW Acceleration, Benchmarking |
| ServingModel | Model Serving, Inference at Scale | Model Serving, Inference at Scale |
| DistributedModel | Distributed Training, Collective Communication | Distributed Training, Collective Comm |
| EconomicsModel | Compute Infrastructure | Compute Infrastructure |
| SustainabilityModel | Sustainable AI | Sustainable AI |
| ReliabilityModel | Fault Tolerance | Fault Tolerance |

6.1 Progressive Complexity

The framework supports three levels of engagement:

Level 1 — Guided Exploration (Getting Started):

profile = Engine.solve(model=Models.ResNet50, hardware=Hardware.Cloud.A100)
print(profile.bottleneck)  # "Memory Bound"

Level 2 — Comparative Analysis (Tutorials):

for hw in [Hardware.Cloud.A100, Hardware.Cloud.H100, Hardware.Cloud.B200]:
    p = Engine.solve(model=Models.Language.Llama3_8B, hardware=hw)
    print(f"{hw.name}: {p.latency.to('ms'):~.2f}, {p.bottleneck}")

Level 3 — Systems Design (Scenarios):

scenario = Scenarios.FrontierTraining  # Llama-70B on 8192 GPUs
evaluation = scenario.evaluate(batch_size=2048, precision="fp16")
print(evaluation.scorecard())

6.2 Dimensional Safety as Pedagogy

The use of pint.Quantity throughout the framework serves a dual purpose: it prevents bugs in the framework itself, and it teaches students to think in units. When a student writes:

# This raises DimensionalityError -- you can't add FLOP/s to GB/s
result = gpu.compute.peak_flops + gpu.memory.bandwidth

the error message itself becomes a teaching moment about the distinction between compute throughput and memory bandwidth. This approach follows the precedent set by Hennessy and Patterson [10], where dimensional analysis is a core skill for computer architects.

6.3 Scenario-Based Assessment

MLSYSIM provides four “lighthouse” scenarios that serve as recurring case studies across the textbook:

  1. Smart Doorbell: WakeVision on ESP32, tests TinyML feasibility (200 ms SLA)
  2. Autonomous Vehicle: ResNet-50 on Jetson Orin NX, tests edge latency (10 ms SLA)
  3. Local Fine-tuning: Llama-3-8B on MacBook M3 Max, tests workstation limits (100 ms SLA)
  4. Frontier Training: Llama-3-70B on 8,192 H100s, tests fleet-scale economics (500 ms SLA)

Each scenario bundles a workload, hardware/fleet, and SLA constraints into a Scenario object. The evaluate() method produces a multi-level scorecard assessing feasibility (does it fit in memory?), performance (does it meet SLA?), and macro impact (what are the costs and carbon?).

6.4 Lecture Slide Integration

The lecture slide collection provides visual foundations that MLSYSIM makes interactive.

A full-day MLSYSIM Tutorial (presented at ISCA) walks instructors through integrating these materials into a course. The tutorial slide decks cover morning sessions (Roofline, serving, distributed training) and afternoon sessions (economics, sustainability, reliability, scenario design).


8. Accuracy and Validation

MLSYSIM is a first-order analytical model. Its estimates capture the dominant constraint (the Roofline ceiling that determines whether a workload is memory-bound or compute-bound) but deliberately omit second-order effects (cache hierarchy, kernel fusion, operator scheduling).

8.1 What MLSYSIM Models

| Factor | Modeled | Source |
|---|---|---|
| Peak FLOP/s per precision | Yes | Manufacturer datasheets |
| HBM bandwidth | Yes | Manufacturer datasheets |
| Precision-dependent weight sizing | Yes | Bytes-per-parameter x param count |
| Dispatch/launch overhead | Yes | Empirical constant per device |
| Ring/tree all-reduce communication | Yes | Standard network models |
| Pipeline bubble fraction | Yes | (P-1)/(VM+P-1) formula |
| KV-cache memory for transformers | Yes | Architectural formula |
| Expert Parallelism (MoE) communication | Yes | All-to-All model |
| Activation memory estimation | Yes | Per-layer formula |
| Checkpoint sizing and optimal interval | Yes | Young-Daly model |

8.2 What MLSYSIM Does Not Model

  • Cache hierarchy effects (L1/L2 hit rates)
  • Operator fusion and kernel optimization
  • CPU-GPU data transfer latency
  • Memory fragmentation
  • Dynamic batching in serving
  • Network congestion and contention
  • Thermal throttling under sustained load

8.3 Expected Accuracy Range

For well-characterized workloads (large batch sizes, standard architectures), MLSYSIM estimates are typically within 1.5–3x of measured performance, with the dominant source of error being the efficiency parameter \(\eta\). The framework is designed to identify the correct bottleneck rather than predict exact latency, a distinction that is pedagogically more valuable for systems reasoning.
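A quick sweep over the typical range of the efficiency parameter shows how much this single input moves a compute-bound latency estimate. The workload size is illustrative (10^12 FLOPs on an H100-class device):

```python
# Sweep eta across its typical 0.25-0.55 range for a compute-bound
# workload; numbers are illustrative.
flops, peak = 1e12, 1979e12   # workload FLOPs, device peak FLOP/s
for eta in (0.25, 0.40, 0.55):
    latency_ms = flops / (peak * eta) * 1e3
    print(f"eta={eta}: {latency_ms:.2f} ms")
```

The spread across the range is a little over 2x, consistent with the 1.5–3x accuracy band quoted above, and it leaves the bottleneck classification unchanged.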


9. Discussion and Future Work

9.1 Limitations

The Efficiency Parameter (\(\eta\)). The single most significant limitation is the reliance on an efficiency parameter that must be estimated by the user. Typical values range from 0.25 to 0.55 for ML workloads, but the optimal value depends on software stack maturity, workload characteristics, and hardware-software co-design — factors that cannot be captured analytically.

Static Analysis. MLSYSIM models steady-state performance. It does not capture transient effects (warmup, JIT compilation), dynamic scheduling decisions, or workload-dependent memory access patterns.

Registry Staleness. Hardware specifications evolve rapidly. The registry requires ongoing maintenance to remain a trusted source of truth. We mitigate this through provenance metadata and version control.

9.2 Future Directions

Empirical Calibration. Systematic validation against MLPerf results and published benchmarks would strengthen confidence in the analytical models and help calibrate default efficiency parameters.

Extended Solver Suite. Planned solvers include a QuantizationSolver (accuracy-latency-size trade-offs) and a NetworkSolver (detailed modeling of collective communication patterns beyond ring all-reduce).

Interactive Browser-Based Labs. Marimo-based WASM notebooks would allow students to run MLSYSIM entirely in the browser, eliminating setup friction.

Community Registry. A contribution pipeline for hardware specifications would allow the registry to grow beyond the core team’s bandwidth.

Data Pipeline Modeling. A DataModel for modeling data loading, preprocessing, and I/O bottlenecks would complete the picture of end-to-end ML system performance.


10. Conclusion

MLSYSIM provides a rigorous, accessible, and dimensionally correct analytical platform for reasoning about machine learning systems. By codifying the Roofline model, 4D parallelism, LLM serving phases, and sustainability metrics into a typed Python framework, it enables students to develop the quantitative intuition that the Machine Learning Systems textbook aims to teach.

The framework’s design — first-principles equations over empirical traces, dimensional correctness over convenience, vetted specifications over magic numbers — reflects a pedagogical commitment: students who understand why a system behaves as it does are better equipped to build the next generation of ML infrastructure than those who only know how to use today’s tools.

MLSYSIM is open source and available as part of the Machine Learning Systems textbook project at mlsysbook.ai. The complete lecture slide collection and teaching guide provide instructors with everything needed to integrate MLSYSIM into their courses.


References

[1]
S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009, doi: 10.1145/1498765.1498785.
[2]
NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU datasheet.” https://www.nvidia.com/en-us/data-center/h100/, 2023.
[3]
D. Patterson, J. Gonzalez, Q. Le, et al., “Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2022.
[4]
W. Kwon, Z. Li, S. Zhuang, et al., “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th ACM symposium on operating systems principles (SOSP), ACM, 2023. doi: 10.1145/3600006.3613165.
[5]
N. Shazeer, A. Mirhoseini, K. Maziarz, et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
[6]
J. Dean, G. S. Corrado, R. Monga, et al., “Large scale distributed deep networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
[7]
D. Narayanan, M. Shoeybi, J. Casper, et al., “Efficient large-scale language model training on GPU clusters using megatron-LM,” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
[8]
J. W. Young, “A first order approximation to the optimum checkpoint interval,” Communications of the ACM, vol. 17, no. 9, pp. 530–531, 1974, doi: 10.1145/361147.361115.
[9]
J. T. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Generation Computer Systems, vol. 22, no. 3, pp. 303–312, 2006, doi: 10.1016/j.future.2004.11.016.
[10]
J. L. Hennessy and D. A. Patterson, Computer architecture: A quantitative approach, 6th ed. Morgan Kaufmann, 2019.
[11]
P. Mattson, C. Cheng, G. Diamos, et al., “MLPerf: An industry standard benchmark suite for machine learning performance,” in IEEE/ACM international symposium on microarchitecture (MICRO), IEEE, 2020. doi: 10.1109/MICRO50266.2020.00045.
[12]
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in ACM SIGKDD international conference on knowledge discovery and data mining, ACM, 2020. doi: 10.1145/3394486.3406703.
[13]
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
[14]
W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale,” in IEEE international symposium on performance analysis of systems and software (ISPASS), IEEE, 2023. doi: 10.1109/ISPASS57527.2023.00035.
[15]
M. Isaev, N. McDonald, L. Dennison, and R. Vuduc, “Calculon: A methodology and tool for high-level co-design of systems and large language models,” in Proceedings of the international conference for high performance computing, networking, storage and analysis (SC), ACM, 2023. doi: 10.1145/3581784.3607102.
[16]
A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, J. Emer, et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” in IEEE international symposium on performance analysis of systems and software (ISPASS), IEEE, 2019. doi: 10.1109/ISPASS.2019.00042.
[17]
Y. N. Wu, J. S. Emer, and V. Sze, “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” in IEEE/ACM international conference on computer-aided design (ICCAD), IEEE, 2019. doi: 10.1109/ICCAD45719.2019.8942149.
[18]
J. Kaplan, S. McCandlish, T. Henighan, et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
[19]
D. Amodei and D. Hernandez, “AI and compute,” in OpenAI blog, OpenAI, 2018. Available: https://openai.com/blog/ai-and-compute

Cite This Work

If you use MLSYSIM in your research or course materials, please cite the Machine Learning Systems textbook:

@book{mlsysbook2024,
  title     = {Machine Learning Systems: Principles and Practices of
               Engineering Artificially Intelligent Systems},
  author    = {Reddi, Vijay Janapa and others},
  year      = {2024},
  publisher = {Harvard University},
  url       = {https://mlsysbook.ai}
}
Note

MLSYSIM is the companion framework for the textbook. For the most current citation format, see the textbook website.
