The 5-Layer Architecture

Demand–Supply Separation and Progressive Lowering

Tip: The MIPS Analogy

Patterson and Hennessy did not give students cycle-accurate x86 simulators; they gave them a taxonomically complete instruction set that exposed every architectural concept through a model simple enough to reason about yet faithful enough to build real intuition. MLSys·im occupies the same niche for ML systems — sacrificing microarchitectural detail to achieve sub-second execution, enabling students and practitioners to sweep thousands of configurations in the time a cycle-accurate simulator requires for one.

The core philosophy of MLSys·im is Demand–Supply Separation with Progressive Lowering. Rather than treating machine learning systems as black boxes, MLSys·im cleanly decouples what a model computes (demand) from where it runs (supply) and why constraints emerge (analysis).

Abstract workload demand (Layer A) is progressively mapped onto concrete hardware supply (Layers B, C, D) through analytical solvers (Layer E) that enforce dimensional strictness at runtime — every physical quantity carries SI units via the pint library, making unit mismatches structurally impossible. Understanding this stack is the key to mastering both this library and the textbook it accompanies.

The Stack Diagram

%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSys·im 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A: Workloads (Demand)</b><br/>TransformerWorkload, CNNWorkload, SSMWorkload, DiffusionWorkload<br/><i>Parameters, FLOPs, Arithmetic Intensity</i>"]
    B["<b>Layer B: Hardware (Silicon)</b><br/>HardwareNode, ComputeCore, MemoryHierarchy, StorageHierarchy<br/><i>Peak FLOP/s, BW, Capacity, TDP, IO BW</i>"]
    C["<b>Layer C: Infrastructure (Environment)</b><br/>GridProfile, Datacenter<br/><i>Carbon Intensity, PUE, WUE</i>"]
    D["<b>Layer D: Systems (Topology)</b><br/>Node, Fleet, NetworkFabric<br/><i>Topology, Accelerators/Node, Fabric BW</i>"]
    E["<b>Layer E: Solvers (22 Walls, 22 Resolvers)</b><br/>Node (1–7): SingleNode · Efficiency · Serving · ContinuousBatching · WeightStreaming · TailLatency<br/>Data (8–10): Data · Transformation · Topology<br/>Algorithm (11–13): Scaling · InferenceScaling · Compression<br/>Fleet (14–16): Distributed · Reliability · Orchestration<br/>Ops (17–20): Economics · Sustainability · Checkpoint · ResponsibleEngineering<br/>Analysis (21–22): Sensitivity · Synthesis"]
    F["<b>Typed Results</b><br/>PerformanceProfile · DistributedResult · ServingResult · ..."]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F


1. Layer A: Workloads (Demand)

A Workload is a hardware-agnostic description of computational demand. You don’t ask “How fast is Llama-3?”; you ask “How many FLOPs and memory bytes does Llama-3 require?”

In MLSys·im, TransformerWorkload, CNNWorkload, SSMWorkload, and DiffusionWorkload define these intrinsic properties (parameter count, layer count, sequence length). You can also import models directly from HuggingFace Hub using import_hf_model(). The crucial step happens when a workload is “lowered” at a specific numerical precision (e.g., FP16 vs INT8). This lowering step determines the Arithmetic Intensity (ops/byte) — the ratio that decides whether a model will be compute-bound or memory-bound on physical hardware.
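The effect of lowering on arithmetic intensity can be sketched in a few lines of plain Python. The numbers below are illustrative (a hypothetical 7B-parameter model during single-token decode, where each weight is read once and contributes roughly 2 FLOPs); in MLSys·im these quantities would carry pint units, omitted here for brevity:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Ops/byte: the ratio fixed when a workload is lowered at a given precision."""
    return flops / bytes_moved

params = 7e9                    # hypothetical 7B-parameter model
flops_per_token = 2 * params    # ~2 FLOPs per parameter per decoded token

ai_fp16 = arithmetic_intensity(flops_per_token, params * 2)  # FP16: 2 bytes/param
ai_int8 = arithmetic_intensity(flops_per_token, params * 1)  # INT8: 1 byte/param
print(ai_fp16, ai_int8)   # 1.0 2.0: halving the precision doubles the intensity
```

Note that the model's FLOP count did not change between the two precisions; only the bytes moved did, which is why lowering (not the architecture alone) determines where the workload lands on the roofline.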

See the Model Zoo for vetted workloads.


2. Layer B: Hardware (Supply)

A HardwareNode represents a single physical accelerator (like an H100 GPU or an Apple M3 chip). It provides the raw physical supply:

  • Compute: Theoretical peak throughput (TFLOP/s) across different precisions (FP32, FP16, INT8).
  • Memory: High Bandwidth Memory (HBM) capacity and transfer speed (TB/s).
  • Storage & IO: Persistent storage capacity/bandwidth and IO interconnect (e.g., PCIe Gen5) speeds.
  • Power: Thermal Design Power (TDP).

Every piece of silicon has a “Ridge Point” (Peak FLOPs / Memory Bandwidth). If your Workload’s arithmetic intensity is lower than the hardware’s ridge point, you are memory-bound.
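A minimal sketch of the ridge-point test, using ballpark H100 SXM figures as assumed inputs (~989 TFLOP/s dense FP16, ~3.35 TB/s HBM3); the helper names are illustrative, not the library API:

```python
def ridge_point(peak_flops: float, mem_bw_bytes_s: float) -> float:
    """Ridge point in ops/byte: peak FLOP/s divided by memory bandwidth."""
    return peak_flops / mem_bw_bytes_s

def regime(ai: float, ridge: float) -> str:
    """Classify a workload's roofline regime on this hardware."""
    return "memory-bound" if ai < ridge else "compute-bound"

# Ballpark H100 SXM figures (assumed): ~989 TFLOP/s dense FP16, ~3.35 TB/s HBM3
ridge = ridge_point(989e12, 3.35e12)   # roughly 295 ops/byte
print(regime(1.0, ridge))              # memory-bound (decode-like workloads)
print(regime(400.0, ridge))            # compute-bound (large-batch GEMMs)
```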

See the Silicon Zoo for vetted hardware specs.


3. Layer C: Infrastructure (Environment)

Hardware doesn’t run in a vacuum; it runs in datacenters plugged into regional power grids. The GridProfile captures this physical context.

A 1000-watt GPU running in Quebec (hydroelectric power) vs. Poland (coal power) produces vastly different carbon footprints, despite doing the exact same mathematical operations. This layer introduces Power Usage Effectiveness (PUE) and Carbon Intensity to the analytical model.
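The Quebec-vs-Poland contrast follows directly from the Wall-18 equation. A minimal sketch, with assumed grid intensities (hydro ~30 gCO₂e/kWh, coal ~650 gCO₂e/kWh; the function name is illustrative, not the GridProfile API):

```python
def operational_carbon_kg(power_w: float, hours: float,
                          pue: float, ci_g_per_kwh: float) -> float:
    """Wall-18 style estimate: kg CO2e = IT energy (kWh) x PUE x carbon intensity."""
    it_energy_kwh = power_w / 1000 * hours
    return it_energy_kwh * pue * ci_g_per_kwh / 1000

# Assumed grid intensities: hydro ~30 gCO2e/kWh, coal ~650 gCO2e/kWh
quebec = operational_carbon_kg(1000, 24, 1.2, 30)    # ~0.86 kg CO2e per day
poland = operational_carbon_kg(1000, 24, 1.2, 650)   # ~18.7 kg CO2e per day
```

Under these assumptions the same GPU-day emits roughly 20x more carbon in the coal-heavy grid, with identical FLOPs performed.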

See the Infrastructure Zoo for regional grid profiles.


4. Layer D: Systems (Topology)

You cannot train a 100-billion-parameter model on a single GPU. A Fleet composes individual HardwareNodes into a distributed cluster.

  • Node: Groups accelerators within a physical server chassis (e.g., 8x GPUs).
  • NetworkFabric: Specifies how servers talk to each other (e.g., 400 Gbps InfiniBand NDR).

The way you structure this system determines your communication overhead and your scaling efficiency when you apply 3D/4D Parallelism.
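The communication overhead can be sketched with the standard alpha-beta cost model for Ring AllReduce (the same form as Wall 14). All numbers are assumed: 14 GB of FP16 gradients for a 7B model, 400 Gb/s links (~50 GB/s usable), 5 µs per-hop latency:

```python
def ring_allreduce_time(n: int, msg_bytes: float,
                        bw_bytes_s: float, alpha_s: float) -> float:
    """Alpha-beta cost of Ring AllReduce: bandwidth term plus per-hop latency term."""
    return 2 * (n - 1) / n * msg_bytes / bw_bytes_s + 2 * (n - 1) * alpha_s

# Assumed: 14 GB gradients, 50 GB/s effective link bandwidth, 5 us latency, 64 nodes
t = ring_allreduce_time(64, 14e9, 50e9, 5e-6)
print(round(t, 3))   # 0.552 seconds, dominated by the bandwidth term
```

At this message size the latency term contributes well under a millisecond, which is why fabric bandwidth, not hop count, dominates gradient synchronization for large dense models.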

See the Fleet Zoo for production cluster topologies.


The 22 Systems Walls

Before introducing the solvers, it is essential to understand the 22 physical and logical constraints that bound ML system performance. Each wall represents a specific bottleneck, grounded in a published equation and resolved by a dedicated solver. The walls are organized into six domains that progress from local node resources to fleet-scale operations.

Table 1: The 22 ML Systems Walls. Each wall represents a physical or logical constraint resolved by a dedicated solver. Walls 1–2 (Compute and Memory) share the SingleNodeModel, yielding 21 distinct solvers across 22 walls. See the research paper for full citations.
| # | Wall | Domain | What’s Bounded | Core Equation | Key Reference |
|---|------|--------|----------------|---------------|---------------|
| 1 | Compute | Node | Peak FLOP/s ceiling | \(T = \text{OPs} / (\text{Peak} \times \eta)\) | Williams et al. (2009) |
| 2 | Memory | Node | HBM BW + capacity | \(T = \lVert W \rVert / BW_{\text{HBM}}\) | Williams et al. (2009) |
| 3 | Software | Node | Achieved MFU | \(\eta = f(\text{kernel, fusion, occupancy})\) | Chowdhery et al. (2022) |
| 4 | Serving | Node | Prefill vs. decode | Two-phase roofline | Pope et al. (2023) |
| 5 | Batching | Node | KV-cache fragmentation | \(\text{KV} = 2LHD \lceil S/p \rceil pBb\) | Kwon et al. (2023) |
| 6 | Streaming | Node | Injection BW | \(T = \max(T_{\text{inject}}, T_{\text{compute}})\) | Cerebras (2024) |
| 7 | Tail Latency | Node | P99 queueing delay | Erlang-C M/M/\(c\) | Dean & Barroso (2013) |
| 8 | Ingestion | Data | Storage I/O throughput | \(\rho = BW_{\text{demand}} / BW_{\text{supply}}\) | Mohan et al. (2022) |
| 9 | Transformation | Data | CPU preprocessing | \(T = B \cdot S / C_{\text{tput}}\) | Murray et al. (2021) |
| 10 | Locality | Data | Network bisection BW | \(BW_{\text{eff}} = BW_{\text{link}} \cdot \beta / \text{osub}\) | Leiserson (1985) |
| 11 | Complexity | Algorithm | Scaling law bounds | \(C = 6PD\); \(P^{*} = \sqrt{C/120}\) | Hoffmann et al. (2022) |
| 12 | Reasoning | Algorithm | Inference-time compute | \(T = K \times T_{\text{step}}\) | Brown et al. (2024) |
| 13 | Fidelity | Algorithm | Accuracy–efficiency | \(r = 32/b\); \(r = 1/(1{-}s)\) | Han et al. (2015) |
| 14 | Communication | Fleet | AllReduce overhead | \(T = 2\frac{N{-}1}{N}\frac{M}{\beta} + 2(N{-}1)\alpha\) | Shoeybi et al. (2019) |
| 15 | Fragility | Fleet | Cluster MTBF | \(\text{MTBF}_{\text{cl}} = \text{MTBF}_{\text{node}}/N\) | Daly (2006) |
| 16 | Multi-tenant | Fleet | Queue wait time | \(T_{\text{wait}} = \rho / [2\mu(1{-}\rho)]\) | Little (1961) |
| 17 | Capital | Ops | Total cost of ownership | \(\text{TCO} = \text{CapEx} + \text{OpEx}\) | Barroso et al. (2018) |
| 18 | Sustainability | Ops | Carbon + water | \(\text{CO}_2 = E \times \text{PUE} \times \text{CI}\) | Patterson et al. (2022) |
| 19 | Checkpoint | Ops | I/O burst MFU penalty | \(\text{penalty} = T_{\text{write}} / T_{\text{interval}}\) | Eisenman et al. (2022) |
| 20 | Safety | Ops | DP-SGD overhead | \(\sigma \propto 1/\epsilon\) | Abadi et al. (2016) |
| 21 | Sensitivity | Analysis | Binding constraint | \(\partial T / \partial x_i\) | Williams et al. (2009) |
| 22 | Synthesis | Analysis | Inverse spec derivation | \(BW_{\text{req}} = \lVert W \rVert / T_{\text{target}}\) | Kwon et al. (2023) |

5. Layer E: Solvers (Analysis)

The previous four layers define what exists — hardware specs, model architectures, infrastructure configurations. Solvers are the engines that bridge demand and supply to answer specific questions: they take a (workload, hardware) pair and compute whether the system is feasible, how it performs, and what it costs.

Each solver resolves one or more of the 22 ML Systems Walls — physical or logical constraints that bound system performance. The walls are organized into six domains that progress from local node resources through data movement and algorithmic scaling to fleet coordination, operations, and cross-cutting analysis.

Note: The 22 Walls at a Glance

Walls 1–2 (Compute and Memory) share the SingleNodeModel, yielding 21 distinct solvers across 22 walls.

Domain 1 — Node (Single-Accelerator Resources, Walls 1–7):

  • SingleNodeModel (Wall 1: Compute, Wall 2: Memory): Roofline model — peak FLOP/s ceiling and HBM bandwidth + capacity.
  • EfficiencyModel (Wall 3: Software): MFU decomposition — kernel fusion, FlashAttention, SM occupancy.
  • ServingModel (Wall 4: Serving): Two-phase LLM inference — compute-bound prefill vs. memory-bound decode.
  • ContinuousBatchingModel (Wall 5: Batching): PagedAttention and iteration-level scheduling with non-contiguous KV-cache allocation.
  • WeightStreamingModel (Wall 6: Streaming): Wafer-scale inference (e.g., Cerebras CS-3) — injection bandwidth bottleneck.
  • TailLatencyModel (Wall 7: Tail Latency): M/M/c queueing (Erlang-C) for P50/P99 SLA analysis.
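Wall 7's queueing math is compact enough to sketch directly. This is a minimal Erlang-C implementation and its exponential tail inversion, not the TailLatencyModel API:

```python
from math import factorial, log

def erlang_c(c: int, a: float) -> float:
    """Probability an arriving request must queue in M/M/c (a = lambda/mu < c)."""
    top = a**c / factorial(c) * c / (c - a)
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def p99_wait(c: int, a: float, mu: float) -> float:
    """P99 queueing delay: invert P(W > t) = Pw * exp(-(c*mu - lambda) * t) at 0.01."""
    pw = erlang_c(c, a)
    if pw <= 0.01:               # fewer than 1% of requests wait at all
        return 0.0
    return log(pw / 0.01) / (c * mu - a * mu)

# Sanity check: for M/M/1 the waiting probability equals the utilization rho
print(erlang_c(1, 0.5))   # 0.5
```

A useful consequence visible in this model: as utilization \(a/c\) approaches 1, the P99 wait diverges, which is why serving fleets are provisioned well below full utilization.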

Domain 2 — Data (Movement & Pipelines, Walls 8–10):

  • DataModel (Wall 8: Ingestion): Storage I/O demand–supply bandwidth ratio.
  • TransformationModel (Wall 9: Transformation): CPU preprocessing stall detection (JPEG decode, tokenization, augmentation).
  • TopologyModel (Wall 10: Locality): Network bisection bandwidth for fat-tree, dragonfly, torus, ring topologies.

Domain 3 — Algorithm (Scaling & Compression, Walls 11–13):

  • ScalingModel (Wall 11: Complexity): Chinchilla scaling laws — compute-optimal model and dataset sizing.
  • InferenceScalingModel (Wall 12: Reasoning): Inference-time compute scaling — chain-of-thought and tree-search cost.
  • CompressionModel (Wall 13: Fidelity): Quantization/pruning accuracy–efficiency trade-offs.

Domain 4 — Fleet (Multi-Node Coordination, Walls 14–16):

  • DistributedModel (Wall 14: Communication): 4D parallelism (DP × TP × PP × EP), Ring AllReduce, pipeline bubbles.
  • ReliabilityModel (Wall 15: Fragility): Fleet MTBF and Young-Daly optimal checkpoint interval.
  • OrchestrationModel (Wall 16: Multi-tenant): Queueing theory (M/D/1) for shared cluster wait times.
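The Domain 4 reliability math (Walls 15 and 19 interact here) can be sketched in a few lines. All figures are assumed, and the helper names are illustrative, not the ReliabilityModel API:

```python
from math import sqrt

def fleet_mtbf_h(node_mtbf_h: float, n_nodes: int) -> float:
    """Wall 15: cluster MTBF shrinks linearly with node count."""
    return node_mtbf_h / n_nodes

def young_daly_interval_h(write_time_h: float, mtbf_h: float) -> float:
    """Young-Daly first-order optimum: tau = sqrt(2 * delta * MTBF)."""
    return sqrt(2 * write_time_h * mtbf_h)

# Assumed figures: 50,000 h node MTBF, 1,024 nodes, 5-minute checkpoint write
mtbf = fleet_mtbf_h(50_000, 1024)             # ~48.8 h between cluster failures
tau = young_daly_interval_h(5 / 60, mtbf)     # ~2.85 h between checkpoints
```

The punchline is the scaling behavior: a node that fails once every six years yields a 1,024-node cluster that fails roughly every two days, forcing checkpoints every few hours.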

Domain 5 — Ops (Economics, Sustainability & Safety, Walls 17–20):

  • EconomicsModel (Wall 17: Capital): Total Cost of Ownership (CapEx + OpEx).
  • SustainabilityModel (Wall 18: Sustainability): Energy, carbon footprint (kg CO₂e), and water usage.
  • CheckpointModel (Wall 19: Checkpoint): Checkpoint I/O burst penalties and MFU impact.
  • ResponsibleEngineeringModel (Wall 20: Safety): DP-SGD slowdown and fairness data cost.

Domain 6 — Analysis (Cross-Cutting Diagnostics, Walls 21–22):

  • SensitivitySolver (Wall 21: Sensitivity): Binding constraint identification via partial derivatives.
  • SynthesisSolver (Wall 22: Synthesis): Inverse Roofline — derive hardware specs from an SLA.

Solver Composition

Because every solver is a stateless pure function, \(s_i : (\text{Workload}, \text{Hardware}, \text{Infra}, \text{Systems}) \rightarrow \mathcal{R}_i\), solvers compose naturally through chaining. The output of one solver feeds into the next, with dimensional correctness preserved at every step:

\[ \text{Scaling} \xrightarrow{\;\mathcal{R}_1\;} \text{Distributed} \xrightarrow{\;\mathcal{R}_2\;} \text{Economics} \xrightarrow{\;\mathcal{R}_3\;} \text{Sustainability} \]

This design yields three critical properties: reproducibility (identical inputs always produce identical outputs), testability (each solver validates against textbook equations in isolation), and transparency (students inspect exactly which equations produced a result).
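The chaining pattern can be sketched with frozen dataclasses standing in for the library's typed results. Everything here is illustrative (the class names, the toy step model); only the Wall-11 equations are from the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingResult:            # stand-in for a typed result R1
    optimal_params: float
    optimal_tokens: float

@dataclass(frozen=True)
class DistributedResult:        # stand-in for a typed result R2
    step_flops: float

def scaling_solver(compute_budget: float) -> ScalingResult:
    # Chinchilla-style split from Wall 11: C = 6PD, P* = sqrt(C/120)
    p = (compute_budget / 120) ** 0.5
    return ScalingResult(optimal_params=p, optimal_tokens=compute_budget / (6 * p))

def distributed_solver(r: ScalingResult) -> DistributedResult:
    # Toy follow-on: forward + backward FLOPs per token is ~6P
    return DistributedResult(step_flops=6 * r.optimal_params)

result = distributed_solver(scaling_solver(1e24))   # pure functions chain cleanly
```

Freezing the dataclasses mirrors the statelessness property: a result cannot be mutated downstream, so identical inputs always reproduce identical chains.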

Extensibility

The layered architecture is designed for extension at every level. New workload types (e.g., a RetrievalAugmentedWorkload for RAG pipelines) require only implementing the lower() method to produce a ComputationGraph; all existing solvers apply without modification. New hardware entries are added to the Silicon Zoo as declarative HardwareNode specifications, with no solver changes needed. New solvers can be introduced for emerging constraints by implementing the solver interface: accept typed inputs, return dimensioned outputs. The type system enforces correctness at every boundary, so extensions compose safely with existing components.

See the Contributing guide for a hands-on walkthrough of adding custom solvers and hardware.

Three-Level Evaluation

The Scenario.evaluate() entry point orchestrates solver composition through a three-level evaluation:

  1. Level 1 — Feasibility. Does the model fit in memory? Can the data pipeline keep pace? Any wall where demand exceeds supply is flagged immediately.
  2. Level 2 — Performance. What are the achievable latency, throughput, and utilization? Roofline analysis, communication modeling, and pipeline-bubble accounting combine into an end-to-end step time.
  3. Level 3 — Macro. What does it cost, and what does it emit? TCO, carbon, water, and responsibility overhead are computed from the performance results.

A feasibility failure at Level 1 short-circuits the evaluation — there is no point optimizing AllReduce if the model does not fit in memory.
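The short-circuit logic can be sketched as a toy evaluator (not the Scenario.evaluate() implementation; the check name and numbers are assumptions):

```python
def evaluate(feasibility_checks, perf_fn, macro_fn):
    """Toy three-level evaluation: any Level-1 failure short-circuits Levels 2-3."""
    violated = [name for name, ok in feasibility_checks if not ok]
    if violated:
        return {"feasible": False, "violated_walls": violated}
    perf = perf_fn()                                  # Level 2 runs only if feasible
    return {"feasible": True, "performance": perf, "macro": macro_fn(perf)}

# Level-1 check: do 70B FP16 weights (140 GB) fit in an 80 GB device?
fits = 70e9 * 2 <= 80e9
print(evaluate([("memory_capacity", fits)], lambda: 1.0, lambda p: 2 * p))
# {'feasible': False, 'violated_walls': ['memory_capacity']}
```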

| Level | Purpose | Walls Evaluated |
|-------|---------|-----------------|
| 1. Feasibility | Does it fit? | 1–2 (capacity), 5 (KV-cache), 8 (I/O) |
| 2. Performance | How fast? | 1–7 (node), 11–16 (algorithm + fleet) |
| 3. Macro | What does it cost? | 17–20 (operations), 21–22 (analysis) |

See the Resolver Guide to learn how to apply these solvers, and Math Foundations for the equations behind each.
