%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSys·im 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
A["<b>Layer A: Workloads (Demand)</b><br/>TransformerWorkload, CNNWorkload, SSMWorkload, DiffusionWorkload<br/><i>Parameters, FLOPs, Arithmetic Intensity</i>"]
B["<b>Layer B: Hardware (Silicon)</b><br/>HardwareNode, ComputeCore, MemoryHierarchy, StorageHierarchy<br/><i>Peak FLOP/s, BW, Capacity, TDP, IO BW</i>"]
C["<b>Layer C: Infrastructure (Environment)</b><br/>GridProfile, Datacenter<br/><i>Carbon Intensity, PUE, WUE</i>"]
D["<b>Layer D: Systems (Topology)</b><br/>Node, Fleet, NetworkFabric<br/><i>Topology, Accelerators/Node, Fabric BW</i>"]
E["<b>Layer E: Solvers (22 Walls, 21 Resolvers)</b><br/>Node (1–7): SingleNode · Efficiency · Serving · ContinuousBatching · WeightStreaming · TailLatency<br/>Data (8–10): Data · Transformation · Topology<br/>Algorithm (11–13): Scaling · InferenceScaling · Compression<br/>Fleet (14–16): Distributed · Reliability · Orchestration<br/>Ops (17–20): Economics · Sustainability · Checkpoint · ResponsibleEngineering<br/>Analysis (21–22): Sensitivity · Synthesis"]
F["<b>Typed Results</b><br/>PerformanceProfile · DistributedResult · ServingResult · ..."]
A --> B
B --> C
C --> D
D --> E
E --> F
The 5-Layer Architecture
Demand–Supply Separation and Progressive Lowering
Patterson and Hennessy did not give students cycle-accurate x86 simulators; they gave them a taxonomically complete instruction set that exposed every architectural concept through a model simple enough to reason about yet faithful enough to build real intuition. MLSys·im occupies the same niche for ML systems — sacrificing microarchitectural detail to achieve sub-second execution, enabling students and practitioners to sweep thousands of configurations in the time a cycle-accurate simulator requires for one.
The core philosophy of MLSys·im is Demand–Supply Separation with Progressive Lowering. Rather than treating machine learning systems as black boxes, MLSys·im cleanly decouples what a model computes (demand) from where it runs (supply) and why constraints emerge (analysis).
Abstract workload demand (Layer A) is progressively mapped onto concrete hardware supply (Layers B, C, D) through analytical solvers (Layer E) that enforce dimensional strictness at runtime — every physical quantity carries SI units via the pint library, making unit mismatches structurally impossible. Understanding this stack is the key to mastering both this library and the textbook it accompanies.
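The unit-checking idea can be illustrated with a toy `Quantity` class. This is a hypothetical stand-in for exposition only; the library itself relies on pint for real dimensional analysis. Every value carries its dimensions, and incompatible operations fail loudly instead of silently corrupting a result:

```python
# Toy illustration of runtime dimensional strictness (hypothetical class,
# not the library's): values carry dimension exponents for (byte, second).
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of (byte, second)

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Dividing quantities subtracts dimension exponents.
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dims, other.dims)))

    def __add__(self, other: "Quantity") -> "Quantity":
        # Adding quantities with different dimensions is a hard error.
        if self.dims != other.dims:
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

weights = Quantity(140e9, (1, 0))    # 70B params at FP16 -> 140 GB of bytes
hbm_bw = Quantity(900e9, (1, -1))    # 900 GB/s -> bytes per second

t_load = weights / hbm_bw            # dims (0, 1): a time, ~0.156 s
print(t_load)
```

With pint, the same guarantee comes for free: bytes divided by bytes-per-second yields seconds, and adding bytes to a bandwidth raises a `DimensionalityError`.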
The Stack Diagram
1. Layer A: Workloads (Demand)
A Workload is a hardware-agnostic description of computational demand. You don’t ask “How fast is Llama-3?”; you ask “How many FLOPs and memory bytes does Llama-3 require?”
In MLSys·im, TransformerWorkload, CNNWorkload, SSMWorkload, and DiffusionWorkload define these intrinsic properties (parameter count, layer count, sequence length). You can also import models directly from HuggingFace Hub using import_hf_model(). The crucial step happens when a workload is “lowered” at a specific numerical precision (e.g., FP16 vs INT8). This lowering step determines the Arithmetic Intensity (ops/byte) — the ratio that decides whether a model will be compute-bound or memory-bound on physical hardware.
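A minimal sketch of lowering, under the common approximations of roughly 2 FLOPs per parameter per token and one full weight read per decode step (the `lower` helper below is illustrative, not the library's method):

```python
# Hedged sketch of workload "lowering": the same model yields different
# arithmetic intensities at different precisions.
def lower(params: float, batch: int, bytes_per_param: int):
    flops = 2 * params * batch              # ~2 FLOPs per parameter per token
    bytes_moved = params * bytes_per_param  # weights read once per step
    return flops, bytes_moved, flops / bytes_moved

P = 70e9  # 70B-parameter model, single decode token per sequence
for name, b in [("FP16", 2), ("INT8", 1)]:
    flops, bts, ai = lower(P, batch=1, bytes_per_param=b)
    print(f"{name}: {ai:.2f} ops/byte")
# INT8 halves the bytes moved, doubling arithmetic intensity.
```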
See the Model Zoo for vetted workloads.
2. Layer B: Hardware (Supply)
A HardwareNode represents a single physical accelerator (like an H100 GPU or an Apple M3 chip). It provides the raw physical supply:
- Compute: Theoretical peak throughput (TFLOP/s) across different precisions (FP32, FP16, INT8).
- Memory: High Bandwidth Memory (HBM) capacity and transfer speed (TB/s).
- Storage & IO: Persistent storage capacity/bandwidth and IO interconnect (e.g., PCIe Gen5) speeds.
- Power: Thermal Design Power (TDP).
Every piece of silicon has a “Ridge Point” (Peak FLOPs / Memory Bandwidth). If your Workload’s arithmetic intensity is lower than the hardware’s ridge point, you are memory-bound.
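The ridge-point test fits in a few lines. The peak-throughput and bandwidth numbers below are illustrative H100-class round figures, not Silicon Zoo entries:

```python
# Minimal roofline classification: a workload whose arithmetic intensity
# falls below the hardware ridge point is memory-bound.
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    return peak_flops / mem_bw  # ops per byte

def bound(ai: float, peak_flops: float, mem_bw: float) -> str:
    return "memory-bound" if ai < ridge_point(peak_flops, mem_bw) else "compute-bound"

peak = 989e12   # ~989 TFLOP/s dense FP16 (illustrative H100-class figure)
bw = 3.35e12    # ~3.35 TB/s HBM (illustrative)
print(ridge_point(peak, bw))   # ~295 ops/byte
print(bound(1.0, peak, bw))    # batch-1 decode -> "memory-bound"
```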
See the Silicon Zoo for vetted hardware specs.
3. Layer C: Infrastructure (Environment)
Hardware doesn’t run in a vacuum; it runs in datacenters plugged into regional power grids. The GridProfile captures this physical context.
A 1000-watt GPU running in Quebec (hydroelectric power) vs. Poland (coal power) produces vastly different carbon footprints, despite doing the exact same mathematical operations. This layer introduces Power Usage Effectiveness (PUE) and Carbon Intensity to the analytical model.
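Applying the relation CO₂ = E × PUE × CI to this example, with illustrative round-number carbon intensities rather than official grid data:

```python
# Carbon footprint from energy, PUE, and grid carbon intensity.
def carbon_kg(power_w: float, hours: float, pue: float, ci_g_per_kwh: float) -> float:
    energy_kwh = power_w / 1000 * hours
    return energy_kwh * pue * ci_g_per_kwh / 1000  # grams -> kilograms

# Same 1000 W GPU, 24 h, PUE 1.2; carbon intensities are rough assumptions:
quebec = carbon_kg(1000, 24, 1.2, 30)    # hydro-heavy grid, ~30 gCO2/kWh
poland = carbon_kg(1000, 24, 1.2, 700)   # coal-heavy grid, ~700 gCO2/kWh
print(quebec, poland)   # ~0.86 kg vs ~20.2 kg for identical FLOPs
```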
See the Infrastructure Zoo for regional grid profiles.
4. Layer D: Systems (Topology)
You cannot train a 100-billion-parameter model on a single GPU. A Fleet composes individual HardwareNodes into a distributed cluster.
- Node: Groups accelerators within a physical server chassis (e.g., 8x GPUs).
- NetworkFabric: Specifies how servers talk to each other (e.g., 400 Gbps InfiniBand NDR).
The way you structure this system determines your communication overhead and your scaling efficiency when you apply 3D/4D Parallelism.
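That communication overhead can be sketched with the standard Ring AllReduce cost model, T = 2((N−1)/N)(M/β) + 2(N−1)α. The link bandwidth and latency figures below are illustrative assumptions, not Fleet Zoo entries:

```python
# Ring AllReduce time: bandwidth term saturates near 2M/beta as N grows,
# while the latency term keeps growing linearly with N.
def allreduce_time(n: int, msg_bytes: float, bw: float, latency: float) -> float:
    return 2 * (n - 1) / n * msg_bytes / bw + 2 * (n - 1) * latency

# 140 GB of FP16 gradients over 50 GB/s links (400 Gb/s), 5 us per hop:
for n in (8, 64, 512):
    print(n, allreduce_time(n, 140e9, 50e9, 5e-6), "s")
```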
See the Fleet Zoo for production cluster topologies.
The 22 Systems Walls
Before introducing the solvers, it is essential to understand the 22 physical and logical constraints that bound ML system performance. Each wall represents a specific bottleneck, grounded in a published equation and resolved by a dedicated solver. The walls are organized into six domains that progress from local node resources to fleet-scale operations.
Walls 1 and 2 (Compute and Memory) are resolved by a single solver, the SingleNodeModel, yielding 21 distinct solvers across 22 walls. See the research paper for full citations.
| # | Wall | Domain | What’s Bounded | Core Equation | Key Reference |
|---|---|---|---|---|---|
| 1 | Compute | Node | Peak FLOP/s ceiling | \(T = \text{OPs} / (\text{Peak} \times \eta)\) | Williams et al. (2009) |
| 2 | Memory | Node | HBM BW + capacity | \(T = \|W\| / BW_{\text{HBM}}\) | Williams et al. (2009) |
| 3 | Software | Node | Achieved MFU | \(\eta = f(\text{kernel, fusion, occupancy})\) | Chowdhery et al. (2022) |
| 4 | Serving | Node | Prefill vs. decode | Two-phase roofline | Pope et al. (2023) |
| 5 | Batching | Node | KV-cache fragmentation | \(\text{KV} = 2LHD \lceil S/p \rceil pBb\) | Kwon et al. (2023) |
| 6 | Streaming | Node | Injection BW | \(T = \max(T_{\text{inject}}, T_{\text{compute}})\) | Cerebras (2024) |
| 7 | Tail Latency | Node | P99 queueing delay | Erlang-C M/M/\(c\) | Dean & Barroso (2013) |
| 8 | Ingestion | Data | Storage I/O throughput | \(\rho = BW_{\text{demand}} / BW_{\text{supply}}\) | Mohan et al. (2022) |
| 9 | Transformation | Data | CPU preprocessing | \(T = B \cdot S / C_{\text{tput}}\) | Murray et al. (2021) |
| 10 | Locality | Data | Network bisection BW | \(BW_{\text{eff}} = BW_{\text{link}} \cdot \beta / \text{oversub}\) | Leiserson (1985) |
| 11 | Complexity | Algorithm | Scaling law bounds | \(C = 6PD\); \(P^{*} = \sqrt{C/120}\) | Hoffmann et al. (2022) |
| 12 | Reasoning | Algorithm | Inference-time compute | \(T = K \times T_{\text{step}}\) | Brown et al. (2024) |
| 13 | Fidelity | Algorithm | Accuracy–efficiency | \(r = 32/b\); \(r = 1/(1{-}s)\) | Han et al. (2015) |
| 14 | Communication | Fleet | AllReduce overhead | \(T = 2\frac{N{-}1}{N}\frac{M}{\beta} + 2(N{-}1)\alpha\) | Shoeybi et al. (2019) |
| 15 | Fragility | Fleet | Cluster MTBF | \(\text{MTBF}_{\text{cl}} = \text{MTBF}_{\text{node}}/N\) | Daly (2006) |
| 16 | Multi-tenant | Fleet | Queue wait time | \(T_{\text{wait}} = \rho / [2\mu(1{-}\rho)]\) | Little (1961) |
| 17 | Capital | Ops | Total cost of ownership | \(\text{TCO} = \text{CapEx} + \text{OpEx}\) | Barroso et al. (2018) |
| 18 | Sustainability | Ops | Carbon + water | \(\text{CO}_2 = E \times \text{PUE} \times \text{CI}\) | Patterson et al. (2022) |
| 19 | Checkpoint | Ops | I/O burst MFU penalty | \(\text{penalty} = T_{\text{write}} / T_{\text{interval}}\) | Eisenman et al. (2022) |
| 20 | Safety | Ops | DP-SGD overhead | \(\sigma \propto 1/\epsilon\) | Abadi et al. (2016) |
| 21 | Sensitivity | Analysis | Binding constraint | \(\partial T / \partial x_i\) | Williams et al. (2009) |
| 22 | Synthesis | Analysis | Inverse spec derivation | \(BW_{\text{req}} = \|W\| / T_{\text{target}}\) | Kwon et al. (2023) |
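Each core equation in the table is simple enough to evaluate directly. As one worked example, here is wall 7's Erlang-C model in a minimal sketch, independent of the library:

```python
# Erlang-C: probability that a request must queue on a c-server node (M/M/c),
# and the resulting mean wait time.
from math import factorial

def erlang_c(c: int, lam: float, mu: float) -> float:
    """P(wait > 0) for arrival rate lam, per-server service rate mu."""
    a = lam / mu                  # offered load
    rho = a / c                   # utilization, must be < 1 for stability
    top = a**c / (factorial(c) * (1 - rho))
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(c: int, lam: float, mu: float) -> float:
    return erlang_c(c, lam, mu) / (c * mu - lam)

# Sanity check: with one server (c=1) this reduces to M/M/1, P(wait) = rho.
print(erlang_c(1, 0.5, 1.0))   # 0.5
print(mean_wait(1, 0.5, 1.0))  # 1.0 (in units of 1/mu)
```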
5. Layer E: Solvers (Analysis)
The previous four layers define what exists — hardware specs, model architectures, infrastructure configurations. Solvers are the engines that bridge demand and supply to answer specific questions: they take a (workload, hardware) pair and compute whether the system is feasible, how it performs, and what it costs.
Each solver resolves one or more of the 22 ML Systems Walls — physical or logical constraints that bound system performance. The walls are organized into six domains that progress from local node resources through data movement and algorithmic scaling to fleet coordination, operations, and cross-cutting analysis.
Walls 1–2 (Compute and Memory) share the SingleNodeModel, yielding 21 distinct solvers across 22 walls.
Domain 1 — Node (Single-Accelerator Resources, Walls 1–7):
- SingleNodeModel (Wall 1: Compute, Wall 2: Memory): Roofline model — peak FLOP/s ceiling and HBM bandwidth + capacity.
- EfficiencyModel (Wall 3: Software): MFU decomposition — kernel fusion, FlashAttention, SM occupancy.
- ServingModel (Wall 4: Serving): Two-phase LLM inference — compute-bound prefill vs. memory-bound decode.
- ContinuousBatchingModel (Wall 5: Batching): PagedAttention and iteration-level scheduling with non-contiguous KV-cache allocation.
- WeightStreamingModel (Wall 6: Streaming): Wafer-scale inference (e.g., Cerebras CS-3) — injection bandwidth bottleneck.
- TailLatencyModel (Wall 7: Tail Latency): M/M/c queueing (Erlang-C) for P50/P99 SLA analysis.
Domain 2 — Data (Movement & Pipelines, Walls 8–10):
- DataModel (Wall 8: Ingestion): Storage I/O demand–supply bandwidth ratio.
- TransformationModel (Wall 9: Transformation): CPU preprocessing stall detection (JPEG decode, tokenization, augmentation).
- TopologyModel (Wall 10: Locality): Network bisection bandwidth for fat-tree, dragonfly, torus, ring topologies.
Domain 3 — Algorithm (Scaling & Compression, Walls 11–13):
- ScalingModel (Wall 11: Complexity): Chinchilla scaling laws — compute-optimal model and dataset sizing.
- InferenceScalingModel (Wall 12: Reasoning): Inference-time compute scaling — chain-of-thought and tree-search cost.
- CompressionModel (Wall 13: Fidelity): Quantization/pruning accuracy–efficiency trade-offs.
Domain 4 — Fleet (Multi-Node Coordination, Walls 14–16):
- DistributedModel (Wall 14: Communication): 4D parallelism (DP × TP × PP × EP), Ring AllReduce, pipeline bubbles.
- ReliabilityModel (Wall 15: Fragility): Fleet MTBF and Young-Daly optimal checkpoint interval.
- OrchestrationModel (Wall 16: Multi-tenant): Queueing theory (M/D/1) for shared cluster wait times.
Domain 5 — Ops (Economics, Sustainability & Safety, Walls 17–20):
- EconomicsModel (Wall 17: Capital): Total Cost of Ownership (CapEx + OpEx).
- SustainabilityModel (Wall 18: Sustainability): Energy, carbon footprint (kg CO₂e), and water usage.
- CheckpointModel (Wall 19: Checkpoint): Checkpoint I/O burst penalties and MFU impact.
- ResponsibleEngineeringModel (Wall 20: Safety): DP-SGD slowdown and fairness data cost.
Domain 6 — Analysis (Cross-Cutting Diagnostics, Walls 21–22):
- SensitivitySolver (Wall 21: Sensitivity): Binding constraint identification via partial derivatives.
- SynthesisSolver (Wall 22: Synthesis): Inverse Roofline — derive hardware specs from an SLA.
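Two of these walls interact in a way worth seeing numerically: fleet MTBF shrinks as 1/N (wall 15), which in turn shortens the Young-Daly optimal checkpoint interval. A sketch with assumed node MTBF and checkpoint write time, not library calls:

```python
# Fleet reliability and checkpointing: tau_opt = sqrt(2 * delta * MTBF),
# where delta is the checkpoint write time.
from math import sqrt

def fleet_mtbf(node_mtbf_s: float, n_nodes: int) -> float:
    return node_mtbf_s / n_nodes          # wall 15: MTBF_cl = MTBF_node / N

def young_daly_interval(write_time_s: float, mtbf_s: float) -> float:
    return sqrt(2 * write_time_s * mtbf_s)

node_mtbf = 5 * 365 * 24 * 3600   # assume a 5-year node MTBF
delta = 120                        # assume a 120 s checkpoint write
for n in (64, 1024):
    tau = young_daly_interval(delta, fleet_mtbf(node_mtbf, n))
    print(n, round(tau / 60), "min between checkpoints")
```

At 64 nodes the optimal interval is several hours; at 1024 nodes it drops to well under two hours, which is why checkpoint I/O (wall 19) becomes a first-order cost at fleet scale.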
Solver Composition
Because every solver is a stateless pure function — \(s_i : (\text{Workload}, \text{Hardware}, \text{Infra}, \text{Systems}) \rightarrow \mathcal{R}_i\) — solvers compose naturally through chaining. The output of one solver feeds into the next, with dimensional correctness preserved at every step:
\[ \text{Scaling} \xrightarrow{\;\mathcal{R}_1\;} \text{Distributed} \xrightarrow{\;\mathcal{R}_2\;} \text{Economics} \xrightarrow{\;\mathcal{R}_3\;} \text{Sustainability} \]
This design yields three critical properties: reproducibility (identical inputs always produce identical outputs), testability (each solver validates against textbook equations in isolation), and transparency (students inspect exactly which equations produced a result).
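A toy version of this chaining pattern, with hypothetical result types and stage functions (illustrative only, not the library's signatures):

```python
# Pure-function chaining: each stage consumes the previous stage's typed
# result. Types and names here are illustrative stand-ins.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingResult:
    params: float
    tokens: float
    flops: float

@dataclass(frozen=True)
class DistributedResult:
    total_time_s: float

def scaling(compute_budget: float) -> ScalingResult:
    p = (compute_budget / 120) ** 0.5   # Chinchilla-optimal P* = sqrt(C/120)
    d = 20 * p                           # compute-optimal tokens D* ~ 20 P*
    return ScalingResult(p, d, 6 * p * d)

def distributed(r: ScalingResult, fleet_flops: float, mfu: float) -> DistributedResult:
    # Training time = total FLOPs / achieved fleet throughput.
    return DistributedResult(r.flops / (fleet_flops * mfu))

res = distributed(scaling(1e24), fleet_flops=1e18, mfu=0.4)
print(res.total_time_s / 86400, "days")   # ~29 days
```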
Extensibility
The layered architecture is designed for extension at every level. New workload types (e.g., a RetrievalAugmentedWorkload for RAG pipelines) require only implementing the lower() method to produce a ComputationGraph; all existing solvers apply without modification. New hardware entries are added to the Silicon Zoo as declarative HardwareNode specifications, with no solver changes needed. New solvers can be introduced for emerging constraints by implementing the solver interface: accept typed inputs, return dimensioned outputs. The type system enforces correctness at every boundary, so extensions compose safely with existing components.
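As a sketch of what such an extension might look like: the hypothetical workload below reduces to FLOPs and bytes via a lower() hook. Both the class and the ComputationGraph stand-in are invented for illustration and are not the library's types or signatures:

```python
# Hypothetical extension sketch: a RAG workload that lowers to compute
# demand (LLM decode) plus retrieval I/O demand.
from dataclasses import dataclass

@dataclass
class ComputationGraph:   # stand-in for the library's graph type
    flops: float
    bytes_moved: float

class RetrievalAugmentedWorkload:
    def __init__(self, llm_params: float, docs_per_query: int, doc_bytes: int):
        self.llm_params = llm_params
        self.docs = docs_per_query
        self.doc_bytes = doc_bytes

    def lower(self, bytes_per_param: int) -> ComputationGraph:
        flops = 2 * self.llm_params                       # decode compute
        bytes_moved = (self.llm_params * bytes_per_param  # weight reads
                       + self.docs * self.doc_bytes)      # retrieved docs
        return ComputationGraph(flops, bytes_moved)

g = RetrievalAugmentedWorkload(8e9, docs_per_query=5, doc_bytes=4096).lower(2)
print(g.flops / g.bytes_moved)   # arithmetic intensity of the RAG step
```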
See the Contributing guide for a hands-on walkthrough of adding custom solvers and hardware.
Three-Level Evaluation
The Scenario.evaluate() entry point orchestrates solver composition through a three-level evaluation:
- Level 1 — Feasibility. Does the model fit in memory? Can the data pipeline keep pace? Any wall where demand exceeds supply is flagged immediately.
- Level 2 — Performance. What are the achievable latency, throughput, and utilization? Roofline analysis, communication modeling, and pipeline-bubble accounting combine into an end-to-end step time.
- Level 3 — Macro. What does it cost, and what does it emit? TCO, carbon, water, and responsibility overhead are computed from the performance results.
A feasibility failure at Level 1 short-circuits the evaluation — there is no point optimizing AllReduce if the model does not fit in memory.
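The short-circuit behavior can be sketched as follows; the function and its parameters are illustrative and are not the Scenario.evaluate() signature:

```python
# Three-level evaluation with Level-1 gating: infeasible configurations
# never reach performance or cost analysis.
def evaluate(model_bytes, hbm_bytes, flops, peak_flops, mfu, price_per_s):
    # Level 1 -- feasibility: does the model fit in memory?
    if model_bytes > hbm_bytes:
        return {"feasible": False,
                "reason": f"needs {model_bytes/1e9:.0f} GB, have {hbm_bytes/1e9:.0f} GB"}
    # Level 2 -- performance: roofline step time.
    step_s = flops / (peak_flops * mfu)
    # Level 3 -- macro: cost computed from the performance result.
    return {"feasible": True, "step_s": step_s, "cost_usd": step_s * price_per_s}

print(evaluate(160e9, 80e9, 1e15, 1e15, 0.4, 2.0 / 3600))  # short-circuits
print(evaluate(40e9, 80e9, 1e15, 1e15, 0.4, 2.0 / 3600))   # full profile
```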
| Level | Purpose | Walls Evaluated |
|---|---|---|
| 1. Feasibility | Does it fit? | 1–2 (capacity), 5 (KV-cache), 8 (I/O) |
| 2. Performance | How fast? | 1–7 (node), 11–16 (algorithm + fleet) |
| 3. Macro | What does it cost? | 17–20 (operations), 21–22 (analysis) |
See the Resolver Guide to learn how to apply these solvers, and Math Foundations for the equations behind each.