```{mermaid}
%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
A["<b>Layer A · Workloads</b> — <i>Demand</i><br/>TransformerWorkload · CNNWorkload<br/>Parameters · FLOPs · Arithmetic Intensity"]
B["<b>Layer B · Hardware</b> — <i>Silicon</i><br/>HardwareNode · ComputeCore · MemoryHierarchy<br/>Peak FLOP/s · Bandwidth · Capacity · TDP"]
C["<b>Layer C · Infrastructure</b> — <i>Environment</i><br/>GridProfile · Datacenter<br/>Carbon Intensity · PUE · WUE"]
D["<b>Layer D · Systems</b> — <i>Topology</i><br/>Node · Fleet · NetworkFabric<br/>Topology · Accelerators/Node · Fabric BW"]
E["<b>Layer E · Solvers</b> — <i>Analysis</i><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
F["<b>Results</b><br/>PerformanceProfile · Dict[str, Quantity]"]
A --> E
B --> D
C --> D
D --> E
E --> F
```
# The 5-Layer Architecture

*Progressive Lowering: From Abstract Demand to Concrete Supply*
MLSYSIM organizes the full ML systems domain into five composable layers. Abstract workload demand (Layer A) is progressively mapped onto concrete hardware supply (Layers B–D) through analytical solvers (Layer E). Each layer corresponds directly to chapters in the Machine Learning Systems textbook and its companion lecture slides.
Understanding this stack is the key to mastering both this library and the textbook it accompanies.
## The Stack at a Glance
The flow reads top-to-bottom: a Workload (what you want to compute) and a System (the hardware, network, and environment you have) feed into a Solver (the analytical engine), which returns quantitative results you can act on.
## Layer A: Workloads (Demand)
A Workload is a hardware-agnostic description of computational demand. The question is not “How fast is Llama-3?” but rather “How many FLOPs and memory bytes does Llama-3 require?”
MLSYSIM provides two workload types:
- `TransformerWorkload` — LLMs and attention-based models (GPT, Llama, BERT). Defined by parameter count, layer count, hidden dimension, attention heads, and sequence length. Supports KV-cache size estimation for autoregressive inference.
- `CNNWorkload` — Vision and convolutional models (ResNet, MobileNet, YOLO). Defined by parameter count and total FLOPs per forward pass.
The crucial step happens when a workload is “lowered” at a specific numerical precision (FP32, FP16, INT8, INT4). This precision lowering determines the Arithmetic Intensity (FLOPs/Byte) — the single ratio that decides whether a model will be compute-bound or memory-bound on any given hardware target.
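The effect of precision lowering on arithmetic intensity can be seen with a back-of-envelope sketch (illustrative only, not the MLSYSIM API). It assumes a single-token decode step in which each parameter is read once from memory and contributes roughly 2 FLOPs:

```python
# Illustrative sketch (not the MLSYSIM API): arithmetic intensity of a
# single-token decode step, assuming ~2 FLOPs per parameter per token and
# that each weight is read from memory once at the chosen precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def arithmetic_intensity(precision: str) -> float:
    """FLOPs per byte moved for a single-token decode pass."""
    flops_per_param = 2.0  # one multiply + one add per parameter
    return flops_per_param / BYTES_PER_PARAM[precision]

for p in ("fp32", "fp16", "int8", "int4"):
    print(f"{p}: {arithmetic_intensity(p):.1f} FLOPs/byte")
```

Lowering precision moves fewer bytes per FLOP, so arithmetic intensity doubles at each step from FP32 (0.5 FLOPs/byte) down to INT4 (4.0 FLOPs/byte), pushing the same model toward the compute-bound regime.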
Layer A concepts are covered in the following lecture decks:
| Deck | Topic | Key Concepts |
|---|---|---|
| NN Computation (Vol I, Ch 5) | The math behind the model | FLOPs, parameter counting, activation memory, forward/backward pass cost |
| NN Architectures (Vol I, Ch 6) | Architecture as infrastructure | Computational signatures of CNNs, RNNs, Transformers; quadratic attention scaling |
| Model Compression (Vol I, Ch 10) | From benchmark winner to production model | Pruning, distillation, quantization (FP32 to INT4); precision as a workload knob |
| Model Training (Vol I, Ch 8) | The physics of learning | Iron Law of training, memory breakdown (activations, gradients, optimizer states) |
See the Model Zoo for vetted workloads spanning language, vision, recommendation, and TinyML models.
## Layer B: Hardware (Silicon)
A HardwareNode represents a single physical accelerator — an H100 GPU, an Apple M3 chip, a Jetson Orin, or even an ESP32 microcontroller. It provides the raw physical supply:
- Compute: Peak throughput (TFLOP/s) across precisions (FP32, FP16, INT8), modeled via `ComputeCore`.
- Memory: HBM or SRAM capacity and transfer bandwidth (TB/s), modeled via `MemoryHierarchy`.
- Power: Thermal Design Power (TDP) in watts.
Every piece of silicon has a ridge point — the ratio of peak compute throughput to memory bandwidth (\(I^* = \text{Peak FLOP/s} / \text{BW}\)). When a workload’s arithmetic intensity falls below this ridge point, the workload is memory-bound: the compute units starve waiting for data. When it falls above, the workload is compute-bound: memory delivers data faster than the cores can consume it.
This is the Roofline model — the diagnostic framework at the heart of Layer E’s SingleNodeModel.
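The ridge-point test is a one-line calculation. A minimal sketch, using the H100 figures cited elsewhere in this guide (990 TFLOP/s FP16, 3.35 TB/s HBM3) — the example intensities are illustrative:

```python
# Ridge point of an H100-class accelerator (numbers as cited in this guide).
peak_flops = 990e12   # FLOP/s at FP16
mem_bw = 3.35e12      # bytes/s of HBM3 bandwidth

ridge = peak_flops / mem_bw  # I* in FLOPs/byte, ~295 for these figures

def classify(arithmetic_intensity: float) -> str:
    """Roofline diagnosis: which resource limits this workload?"""
    return "compute-bound" if arithmetic_intensity >= ridge else "memory-bound"

print(f"ridge point: {ridge:.1f} FLOPs/byte")
print(classify(1.0))    # FP16 single-token decode -> memory-bound
print(classify(400.0))  # large-batch GEMM -> compute-bound
```

Any FP16 decode workload at ~1 FLOP/byte sits far below a ~295 FLOPs/byte ridge, which is why LLM decoding is memory-bound on virtually every modern accelerator.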
Layer B concepts are covered in the following lecture decks:
| Deck | Topic | Key Concepts |
|---|---|---|
| Hardware Acceleration (Vol I, Ch 11) | Moving data costs more than computing it | Systolic arrays, tensor cores, Roofline model, accelerator spectrum (GPU to ASIC) |
| Compute Infrastructure (Vol II, Ch 2) | The physics of the ML fleet | HBM generations, bandwidth hierarchy (HBM to NVLink to InfiniBand), TCO analysis |
| Benchmarking (Vol I, Ch 12) | Measuring what matters | MLPerf, roofline profiling, statistical rigor, the lab-to-production gap |
See the Silicon Zoo for 18+ vetted hardware specs from cloud GPUs to microcontrollers.
## Layer C: Infrastructure (Environment)
Hardware does not run in a vacuum. It runs in datacenters plugged into regional power grids, cooled by air or liquid systems, and constrained by physical energy budgets.
A GridProfile captures this physical context:
- Carbon intensity (g CO₂/kWh) — varies by nearly two orders of magnitude across regions: Quebec’s hydroelectric grid produces ~8 g CO₂/kWh; Poland’s coal-dominated grid produces ~700 g CO₂/kWh.
- Power Usage Effectiveness (PUE) — the ratio of total facility power to IT equipment power. A PUE of 1.1 means 10% overhead for cooling and infrastructure; 1.6 means 60%.
- Water Usage Effectiveness (WUE) — liters of water consumed per kWh, determined by cooling technology (air, liquid, evaporative).
A 1000-watt GPU running identical computations in Quebec versus Poland produces vastly different carbon footprints. This layer makes that difference quantifiable.
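That difference can be quantified with a short sketch (illustrative arithmetic, not the MLSYSIM API), using the PUE and grid figures above for a 1000 W GPU running for one hour:

```python
# Illustrative sketch: facility energy and carbon for an IT load,
# scaled by PUE and multiplied by regional grid carbon intensity.
def carbon_kg(it_power_w: float, hours: float, pue: float,
              grid_g_per_kwh: float) -> float:
    facility_kwh = it_power_w / 1000 * hours * pue  # PUE: IT -> facility energy
    return facility_kwh * grid_g_per_kwh / 1000     # grams -> kilograms

quebec = carbon_kg(1000, 1.0, pue=1.1, grid_g_per_kwh=8)    # ~0.0088 kg CO2e
poland = carbon_kg(1000, 1.0, pue=1.1, grid_g_per_kwh=700)  # ~0.77 kg CO2e
print(f"Quebec: {quebec:.4f} kg, Poland: {poland:.2f} kg")
```

Identical computation, identical hardware — roughly 90x more carbon on the coal-heavy grid. This is the lever that carbon-aware scheduling pulls.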
Layer C concepts are covered in the following lecture deck:
| Deck | Topic | Key Concepts |
|---|---|---|
| Sustainable AI (Vol II, Ch 15) | Energy as a first-class engineering constraint | Lifecycle carbon accounting, PUE modeling, carbon-aware scheduling, the 4 Ms framework |
See the Infrastructure Zoo for regional grid profiles and datacenter configurations.
## Layer D: Systems (Topology)
A single GPU cannot train a 100-billion-parameter model. Layer D composes individual HardwareNodes from Layer B into distributed clusters, connected by the network fabrics that determine communication overhead.
Three types define a system:
- `Node` — A single compute server grouping accelerators within a physical chassis (e.g., 8x H100 GPUs connected by 900 GB/s NVLink). Includes intra-node bandwidth and NIC count.
- `NetworkFabric` — The inter-node interconnect: topology (fat-tree, rail-optimized), bandwidth (e.g., 400 Gbps InfiniBand NDR), latency, and oversubscription ratio.
- `Fleet` — A collection of `Node`s connected by a `NetworkFabric`, deployed in a specific `Datacenter` (Layer C). This is the complete system that solvers operate on.
The way you structure this system — how many GPUs per node, what fabric connects them, which parallelism strategy you apply — determines your communication overhead and scaling efficiency.
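A minimal sketch of why fabric bandwidth matters, using the standard ring all-reduce traffic formula (this ignores the latency (α) term and computation-communication overlap, and is not the MLSYSIM API):

```python
# Illustrative sketch: ring all-reduce moves 2*(N-1)/N of the gradient
# volume over each GPU's link, so communication time is set by gradient
# size and the per-link bandwidth.
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int,
                           link_bw_bytes_per_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes  # bytes per GPU
    return traffic / link_bw_bytes_per_s

# ~140 GB of FP16 gradients over 8 GPUs on 900 GB/s NVLink:
t = ring_allreduce_seconds(140e9, 8, 900e9)
print(f"{t * 1000:.0f} ms per all-reduce")  # ~272 ms
```

Swap the 900 GB/s NVLink link for a 400 Gbps (50 GB/s) inter-node fabric and the same all-reduce takes 18x longer — the communication overhead that parallelism strategies are designed to hide.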
Layer D concepts span four lecture decks covering the full distributed systems stack:
| Deck | Topic | Key Concepts |
|---|---|---|
| Network Fabrics (Vol II, Ch 3) | The synchronization backbone | \(\alpha\)-\(\beta\) model, topology (fat-tree, dragonfly), RDMA, congestion control, bisection bandwidth |
| Distributed Training (Vol II, Ch 5) | The physics of scaling | Data/tensor/pipeline parallelism, ZeRO/FSDP, Communication-Computation Ratio (CCR), 3D hybrid strategies |
| Collective Communication (Vol II, Ch 6) | The traffic patterns of distributed ML | AllReduce algorithms (ring, tree), gradient compression, hierarchical communication, computation-communication overlap |
| Fleet Orchestration (Vol II, Ch 8) | Extracting useful work from shared infrastructure | Slurm vs. Kubernetes, topology-aware scheduling, elastic training, multi-tenancy |
See the Fleet Zoo for production cluster topologies from 256-GPU research clusters to 8192-GPU frontier fleets.
## Layer E: Solvers (Analysis)
The previous four layers are static definitions — nouns. Solvers are the analytical engines — verbs — that bridge demand and supply to answer specific questions.
Each solver implements closed-form equations from peer-reviewed systems literature. No simulation, no benchmarking, no hardware required.
| Solver | Bridges | Core Equation | Key Output |
|---|---|---|---|
| SingleNodeModel | A \(\to\) B | Roofline / Iron Law (Williams et al., 2009) | Latency, throughput, bottleneck classification |
| DistributedModel | A \(\to\) D | Ring All-Reduce + Pipeline schedules | Scaling efficiency, communication overhead, bubble fraction |
| ServingModel | A \(\to\) B | Pre-fill / Decode phase decomposition | TTFT, inter-token latency, KV-cache memory |
| EconomicsModel | D \(\to\) cost | CapEx + OpEx over time horizon | TCO in USD, cost per query |
| SustainabilityModel | D \(\to\) C | PUE \(\times\) grid carbon intensity | Energy (kWh), carbon (kg CO₂e), water (L) |
| ReliabilityModel | D \(\to\) uptime | Young-Daly optimal checkpointing | Fleet MTBF, failure probability, checkpoint interval |
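The `ReliabilityModel` row's Young-Daly formula is compact enough to sketch directly (illustrative numbers, not MLSYSIM output): the optimal checkpoint interval grows with the square root of checkpoint cost times fleet MTBF.

```python
import math

# Young-Daly first-order optimum: checkpoint every sqrt(2 * delta * MTBF)
# seconds, where delta is the time to write one checkpoint.
def young_daly_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# A 5-minute checkpoint on a fleet with a 4-hour aggregate MTBF:
tau = young_daly_interval_s(300, 4 * 3600)
print(f"checkpoint every {tau / 60:.0f} minutes")  # ~49 minutes
```

Note the square-root scaling: quadrupling fleet size (and thus quartering MTBF) only halves the optimal interval, which is why frontier-scale training remains feasible at all.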
Solvers are composable. To answer “What is the most sustainable way to serve Llama-70B?”, chain ServingModel (feasibility and latency) into EconomicsModel (cost) into SustainabilityModel (carbon). Each solver’s typed output feeds naturally into the next.
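The chaining pattern can be sketched with plain functions standing in for the solver classes (all names, numbers, and formulas here are simplified stand-ins, not the MLSYSIM API): each stage reads the shared profile and adds its own outputs.

```python
# Illustrative sketch of solver chaining: each stage enriches a profile
# dict that the next stage consumes.
def serving(profile: dict) -> dict:
    # Stand-in: memory-bound inter-token latency = weights / aggregate BW.
    profile["itl_s"] = profile["model_bytes"] / profile["agg_bw"]
    return profile

def economics(profile: dict) -> dict:
    profile["usd_per_hour"] = profile["n_gpus"] * profile["gpu_usd_per_hour"]
    return profile

def sustainability(profile: dict) -> dict:
    kwh = profile["n_gpus"] * profile["gpu_watts"] / 1000 * profile["pue"]
    profile["kg_co2_per_hour"] = kwh * profile["grid_g_per_kwh"] / 1000
    return profile

result = sustainability(economics(serving({
    "model_bytes": 140e9, "agg_bw": 4 * 3.35e12, "n_gpus": 4,
    "gpu_usd_per_hour": 2.0, "gpu_watts": 700, "pue": 1.1,
    "grid_g_per_kwh": 8,  # hypothetical $/hr, Quebec-like grid
})))
print(result["itl_s"], result["usd_per_hour"], result["kg_co2_per_hour"])
```

The design point is that every solver's output is typed data, not a report — so feasibility, cost, and carbon answers compose without glue code.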
Layer E concepts draw from multiple lecture decks that teach the diagnostic and optimization frameworks:
| Deck | Topic | Key Concepts |
|---|---|---|
| Hardware Acceleration (Vol I, Ch 11) | The Roofline model | Arithmetic intensity, ridge points, memory-bound vs. compute-bound diagnosis |
| Model Serving (Vol I, Ch 13) | Inverting every training priority | Latency budgets, queuing theory (Little’s Law), continuous batching, training-serving skew |
| Inference at Scale (Vol II, Ch 9) | Where ML systems live or die economically | Serving economics, KV-cache bottleneck, batching strategies, autoscaling |
| Performance Engineering (Vol II, Ch 10) | Match the software to the silicon | Operator fusion, FlashAttention, mixed precision, systematic profiling workflow |
| Distributed Training (Vol II, Ch 5) | Scaling efficiency analysis | Communication-Computation Ratio, parallelism strategy selection, scaling laws |
See the Solver Guide for a decision guide on choosing the right solver, and Math Foundations for the complete equations.
## Progressive Lowering in Action
The architecture is best understood through a concrete example. Consider the question: “Can I serve Llama-3-70B on a cluster of 4 H100s within a $50K/year budget while minimizing carbon?”
This single question touches all five layers:
1. Layer A — Llama-3-70B workload: 70B parameters, GQA with 8 KV heads, ~140 GB at FP16 precision
2. Layer B — H100 hardware: 990 TFLOP/s (FP16), 3.35 TB/s HBM3, 80 GB capacity, 700 W TDP
3. Layer C — Infrastructure: choose between Quebec (8 g CO₂/kWh) and US Average (385 g CO₂/kWh)
4. Layer D — System: 1 node × 4 H100s, NVLink 900 GB/s intra-node, tensor parallelism across 4 GPUs
5. Layer E — Chain three solvers:
   - `ServingModel` → TTFT, ITL, KV-cache feasibility
   - `EconomicsModel` → TCO over 1 year
   - `SustainabilityModel` → carbon footprint by region
Each layer contributes a different piece of the answer. No single layer is sufficient alone. This is why MLSYSIM separates concerns into five composable layers rather than offering a monolithic “predict performance” function.
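A back-of-envelope version of the serving half of this example (a sketch under roofline assumptions, not MLSYSIM output — it ignores KV-cache reads and inter-GPU communication): prefill is compute-bound, decode is memory-bound, so the two phases hit different rooflines on the same 4-GPU system.

```python
# Back-of-envelope latency for Llama-3-70B on 4 H100s, using the figures
# quoted in the example above.
params = 70e9
model_bytes = 140e9                       # FP16 weights
n_gpus, peak_flops, hbm_bw = 4, 990e12, 3.35e12

# Prefill (compute-bound): ~2 FLOPs per parameter per prompt token,
# spread across all 4 GPUs' FP16 throughput. Prompt length is an
# assumed example value.
prompt_tokens = 2048
ttft = 2 * params * prompt_tokens / (n_gpus * peak_flops)

# Decode (memory-bound): each token re-reads every GPU's weight shard
# from HBM, so inter-token latency = shard bytes / HBM bandwidth.
itl = (model_bytes / n_gpus) / hbm_bw

print(f"TTFT ~{ttft * 1000:.0f} ms, ITL ~{itl * 1000:.1f} ms")
```

Even this crude estimate shows the shape of the answer — tens of milliseconds to first token, ~10 ms per token after — before economics and sustainability are layered on top.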
## Slide Deck Quick Reference
For instructors and students using the companion lecture slides, the table below maps every MLSYSIM layer to the relevant slide decks.
### Volume I: Foundations
| Ch | Deck | MLSYSIM Layer(s) | Download |
|---|---|---|---|
| 5 | NN Computation | A (Workloads) | |
| 6 | NN Architectures | A (Workloads) | |
| 8 | Model Training | A + B + E | |
| 10 | Model Compression | A (Workloads) | |
| 11 | Hardware Acceleration | B (Hardware) + E (Solvers) | |
| 12 | Benchmarking | B (Hardware) + E (Solvers) | |
| 13 | Model Serving | E (Solvers) |
### Volume II: At Scale
| Ch | Deck | MLSYSIM Layer(s) | Download |
|---|---|---|---|
| 2 | Compute Infrastructure | B (Hardware) | |
| 3 | Network Fabrics | D (Systems) | |
| 5 | Distributed Training | D (Systems) + E (Solvers) | |
| 6 | Collective Communication | D (Systems) | |
| 8 | Fleet Orchestration | D (Systems) | |
| 9 | Inference at Scale | E (Solvers) | |
| 10 | Performance Engineering | E (Solvers) | |
| 15 | Sustainable AI | C (Infrastructure) |
## Where to Go Next
- Getting Started — Install MLSYSIM and run your first analysis in 5 minutes.
- Solver Guide — Decision guide for choosing the right analytical engine.
- Math Foundations — The closed-form equations behind every solver.
- Tutorials — Hands-on notebooks for roofline analysis, distributed training, LLM serving, and sustainability.
- Zoo Overview — Browse the registries of vetted models, hardware, fleets, and infrastructure.