The 5-Layer Architecture

Progressive Lowering: From Abstract Demand to Concrete Supply

MLSYSIM organizes the full ML systems domain into five composable layers. Abstract workload demand (Layer A) is progressively mapped onto concrete hardware supply (Layers B–D) through analytical solvers (Layer E). Each layer corresponds directly to chapters in the Machine Learning Systems textbook and its companion lecture slides.

Understanding this stack is the key to mastering both this library and the textbook it accompanies.

The Stack at a Glance

```{mermaid}
%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSYSIM 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A · Workloads</b> — <i>Demand</i><br/>TransformerWorkload · CNNWorkload<br/>Parameters · FLOPs · Arithmetic Intensity"]
    B["<b>Layer B · Hardware</b> — <i>Silicon</i><br/>HardwareNode · ComputeCore · MemoryHierarchy<br/>Peak FLOP/s · Bandwidth · Capacity · TDP"]
    C["<b>Layer C · Infrastructure</b> — <i>Environment</i><br/>GridProfile · Datacenter<br/>Carbon Intensity · PUE · WUE"]
    D["<b>Layer D · Systems</b> — <i>Topology</i><br/>Node · Fleet · NetworkFabric<br/>Topology · Accelerators/Node · Fabric BW"]
    E["<b>Layer E · Solvers</b> — <i>Analysis</i><br/>SingleNode · Distributed · Serving<br/>Economics · Sustainability · Reliability"]
    F["<b>Results</b><br/>PerformanceProfile · Dict[str, Quantity]"]

    A --> E
    B --> D
    C --> D
    D --> E
    E --> F
```

The flow reads top-to-bottom: a Workload (what you want to compute) and a System (the hardware, network, and environment you have) feed into a Solver (the analytical engine), which returns quantitative results you can act on.


Layer A: Workloads (Demand)

A Workload is a hardware-agnostic description of computational demand. The question is not “How fast is Llama-3?” but rather “How many FLOPs and memory bytes does Llama-3 require?”

MLSYSIM provides two workload types:

  • TransformerWorkload — LLMs and attention-based models (GPT, Llama, BERT). Defined by parameter count, layer count, hidden dimension, attention heads, and sequence length. Supports KV-cache size estimation for autoregressive inference.
  • CNNWorkload — Vision and convolutional models (ResNet, MobileNet, YOLO). Defined by parameter count and total FLOPs per forward pass.

The crucial step happens when a workload is “lowered” at a specific numerical precision (FP32, FP16, INT8, INT4). This precision lowering determines the Arithmetic Intensity (FLOPs/Byte) — the single ratio that decides whether a model will be compute-bound or memory-bound on any given hardware target.
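To build intuition for this, consider a back-of-envelope sketch (not the MLSYSIM API): in the weight-streaming decode case, each generated token performs roughly two FLOPs per parameter and reads every weight from memory once, so arithmetic intensity is set entirely by the byte width of the chosen precision.

```python
# Illustrative sketch, not the MLSYSIM API: arithmetic intensity of a
# weight-bound autoregressive decode step at different precisions.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def decode_arithmetic_intensity(n_params: float, precision: str) -> float:
    """FLOPs per byte for one decode token.

    A decode step performs ~2 * n_params FLOPs (one multiply-accumulate
    per weight) and streams each weight from memory once.
    """
    flops = 2 * n_params
    bytes_moved = n_params * BYTES_PER_ELEMENT[precision]
    return flops / bytes_moved

for p in ("FP32", "FP16", "INT8", "INT4"):
    print(p, decode_arithmetic_intensity(70e9, p), "FLOPs/Byte")
```

Halving the precision doubles the intensity: FP32 decode lands at 0.5 FLOPs/Byte, INT4 at 4 — all far below typical accelerator ridge points, which is why decode is memory-bound almost everywhere.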

Note: Textbook and slides

Layer A concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
| --- | --- | --- |
| NN Computation (Vol I, Ch 5) | The math behind the model | FLOPs, parameter counting, activation memory, forward/backward pass cost |
| NN Architectures (Vol I, Ch 6) | Architecture as infrastructure | Computational signatures of CNNs, RNNs, Transformers; quadratic attention scaling |
| Model Compression (Vol I, Ch 10) | From benchmark winner to production model | Pruning, distillation, quantization (FP32 to INT4); precision as a workload knob |
| Model Training (Vol I, Ch 8) | The physics of learning | Iron Law of training, memory breakdown (activations, gradients, optimizer states) |

See the Model Zoo for vetted workloads spanning language, vision, recommendation, and TinyML models.


Layer B: Hardware (Silicon)

A HardwareNode represents a single physical accelerator — an H100 GPU, an Apple M3 chip, a Jetson Orin, or even an ESP32 microcontroller. It provides the raw physical supply:

  • Compute: Peak throughput (TFLOP/s) across precisions (FP32, FP16, INT8), modeled via ComputeCore.
  • Memory: HBM or SRAM capacity and transfer bandwidth (TB/s), modeled via MemoryHierarchy.
  • Power: Thermal Design Power (TDP) in watts.

Every piece of silicon has a ridge point — the ratio of peak compute to memory bandwidth (\(I^* = \text{Peak\_FLOPs} / \text{BW}\)). When a workload’s arithmetic intensity falls below this ridge point, the workload is memory-bound: the compute units starve waiting for data. When above, the workload is compute-bound: memory delivers data faster than the cores can consume it.

This is the Roofline model — the diagnostic framework at the heart of Layer E’s SingleNodeModel.
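The diagnosis reduces to two lines of arithmetic. The sketch below uses round illustrative numbers, not vendor specs or the library's HardwareNode API:

```python
# Minimal roofline sketch: attainable throughput is the lesser of peak
# compute and (bandwidth * arithmetic intensity).
def roofline(peak_flops: float, bandwidth: float, intensity: float):
    """Return (attainable FLOP/s, bottleneck classification)."""
    ridge = peak_flops / bandwidth  # FLOPs/Byte where the two roofs meet
    attainable = min(peak_flops, bandwidth * intensity)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return attainable, bound

# Round numbers: 1000 TFLOP/s peak, 3 TB/s bandwidth -> ridge ~333 FLOPs/Byte.
perf, bound = roofline(peak_flops=1000e12, bandwidth=3e12, intensity=1.0)
print(f"{perf / 1e12:.0f} TFLOP/s attainable, {bound}")
```

At an intensity of 1 FLOP/Byte against a ridge near 333, the chip delivers only 3 of its 1000 peak TFLOP/s: the workload is memory-bound, and more compute would not help.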

Note: Textbook and slides

Layer B concepts are covered in the following lecture decks:

| Deck | Topic | Key Concepts |
| --- | --- | --- |
| Hardware Acceleration (Vol I, Ch 11) | Moving data costs more than computing it | Systolic arrays, tensor cores, Roofline model, accelerator spectrum (GPU to ASIC) |
| Compute Infrastructure (Vol II, Ch 2) | The physics of the ML fleet | HBM generations, bandwidth hierarchy (HBM to NVLink to InfiniBand), TCO analysis |
| Benchmarking (Vol I, Ch 12) | Measuring what matters | MLPerf, roofline profiling, statistical rigor, the lab-to-production gap |

See the Silicon Zoo for 18+ vetted hardware specs from cloud GPUs to microcontrollers.


Layer C: Infrastructure (Environment)

Hardware does not run in a vacuum. It runs in datacenters plugged into regional power grids, cooled by air or liquid systems, and constrained by physical energy budgets.

A GridProfile captures this physical context:

  • Carbon intensity (g CO₂/kWh) — varies by nearly 90x across regions: Quebec’s hydroelectric grid produces ~8 g CO₂/kWh; Poland’s coal-dominated grid produces ~700 g CO₂/kWh.
  • Power Usage Effectiveness (PUE) — the ratio of total facility power to IT equipment power. A PUE of 1.1 means 10% overhead for cooling and infrastructure; 1.6 means 60%.
  • Water Usage Effectiveness (WUE) — liters of water consumed per kWh, determined by cooling technology (air, liquid, evaporative).

A 1000-watt GPU running identical computations in Quebec versus Poland produces vastly different carbon footprints. This layer makes that difference quantifiable.
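The accounting behind that claim is a one-liner. The sketch below (not the GridProfile API) uses the grid intensities quoted above and an assumed PUE of 1.1:

```python
# Layer C accounting sketch: facility energy = IT energy * PUE;
# carbon = facility energy * grid carbon intensity.
def carbon_kg(it_power_w: float, hours: float,
              pue: float, g_co2_per_kwh: float) -> float:
    it_kwh = it_power_w / 1000 * hours
    facility_kwh = it_kwh * pue          # cooling/infrastructure overhead
    return facility_kwh * g_co2_per_kwh / 1000  # grams -> kilograms

HOURS_PER_YEAR = 24 * 365
quebec = carbon_kg(1000, HOURS_PER_YEAR, pue=1.1, g_co2_per_kwh=8)
poland = carbon_kg(1000, HOURS_PER_YEAR, pue=1.1, g_co2_per_kwh=700)
print(f"Quebec: {quebec:.0f} kg CO2/yr, Poland: {poland:.0f} kg CO2/yr")
```

Identical silicon, identical workload: roughly 77 kg CO₂e per year in Quebec versus about 6.7 tonnes in Poland, a factor of 87.5 set entirely by the grid.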

Note: Textbook and slides

Layer C concepts are covered in the following lecture deck:

| Deck | Topic | Key Concepts |
| --- | --- | --- |
| Sustainable AI (Vol II, Ch 15) | Energy as a first-class engineering constraint | Lifecycle carbon accounting, PUE modeling, carbon-aware scheduling, the 4 Ms framework |

See the Infrastructure Zoo for regional grid profiles and datacenter configurations.


Layer D: Systems (Topology)

A single GPU cannot train a 100-billion-parameter model. Layer D composes individual HardwareNodes from Layer B into distributed clusters, connected by the network fabrics that determine communication overhead.

Three types define a system:

  • Node — A single compute server grouping accelerators within a physical chassis (e.g., 8x H100 GPUs connected by 900 GB/s NVLink). Includes intra-node bandwidth and NIC count.
  • NetworkFabric — The inter-node interconnect: topology (fat-tree, rail-optimized), bandwidth (e.g., 400 Gbps InfiniBand NDR), latency, and oversubscription ratio.
  • Fleet — A collection of Nodes connected by a NetworkFabric, deployed in a specific Datacenter (Layer C). This is the complete system that solvers operate on.

The way you structure this system — how many GPUs per node, what fabric connects them, which parallelism strategy you apply — determines your communication overhead and scaling efficiency.
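For intuition about where that overhead comes from, the standard ring all-reduce cost under the \(\alpha\)-\(\beta\) model can be sketched as follows; the hop latency and link bandwidth below are illustrative assumptions, not MLSYSIM defaults:

```python
# Ring all-reduce cost sketch under the alpha-beta model: 2(p-1) latency
# steps, and 2(p-1)/p of the message crosses each link.
def ring_allreduce_seconds(message_bytes: float, p: int,
                           alpha_s: float, beta_s_per_byte: float) -> float:
    steps = 2 * (p - 1)                          # reduce-scatter + all-gather
    bytes_on_wire = 2 * (p - 1) / p * message_bytes
    return steps * alpha_s + bytes_on_wire * beta_s_per_byte

# 1 GB of gradients across 8 GPUs at 900 GB/s NVLink, 2 us per hop.
t = ring_allreduce_seconds(1e9, p=8, alpha_s=2e-6, beta_s_per_byte=1 / 900e9)
print(f"{t * 1e3:.2f} ms per all-reduce")
```

At this scale the bandwidth term dominates; the latency term only starts to matter for small messages or large p, which is why hierarchical and tree variants exist.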

Note: Textbook and slides

Layer D concepts span four lecture decks covering the full distributed systems stack:

| Deck | Topic | Key Concepts |
| --- | --- | --- |
| Network Fabrics (Vol II, Ch 3) | The synchronization backbone | \(\alpha\)-\(\beta\) model, topology (fat-tree, dragonfly), RDMA, congestion control, bisection bandwidth |
| Distributed Training (Vol II, Ch 5) | The physics of scaling | Data/tensor/pipeline parallelism, ZeRO/FSDP, Communication-Computation Ratio (CCR), 3D hybrid strategies |
| Collective Communication (Vol II, Ch 6) | The traffic patterns of distributed ML | AllReduce algorithms (ring, tree), gradient compression, hierarchical communication, computation-communication overlap |
| Fleet Orchestration (Vol II, Ch 8) | Extracting useful work from shared infrastructure | Slurm vs. Kubernetes, topology-aware scheduling, elastic training, multi-tenancy |

See the Fleet Zoo for production cluster topologies from 256-GPU research clusters to 8192-GPU frontier fleets.


Layer E: Solvers (Analysis)

The previous four layers are static definitions — nouns. Solvers are the analytical engines — verbs — that bridge demand and supply to answer specific questions.

Each solver implements closed-form equations from peer-reviewed systems literature. No simulation, no benchmarking, no hardware required.

| Solver | Bridges | Core Equation | Key Output |
| --- | --- | --- | --- |
| SingleNodeModel | A \(\to\) B | Roofline / Iron Law (Williams et al., 2009) | Latency, throughput, bottleneck classification |
| DistributedModel | A \(\to\) D | Ring All-Reduce + Pipeline schedules | Scaling efficiency, communication overhead, bubble fraction |
| ServingModel | A \(\to\) B | Pre-fill / Decode phase decomposition | TTFT, inter-token latency, KV-cache memory |
| EconomicsModel | D \(\to\) cost | CapEx + OpEx over time horizon | TCO in USD, cost per query |
| SustainabilityModel | D \(\to\) C | PUE \(\times\) grid carbon intensity | Energy (kWh), carbon (kg CO₂e), water (L) |
| ReliabilityModel | D \(\to\) uptime | Young-Daly optimal checkpointing | Fleet MTBF, failure probability, checkpoint interval |

Solvers are composable. To answer “What is the most sustainable way to serve Llama-70B?”, chain ServingModel (feasibility and latency) into EconomicsModel (cost) into SustainabilityModel (carbon). Each solver’s typed output feeds naturally into the next.
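The chaining pattern can be sketched with plain functions passing dictionaries — hypothetical names and toy numbers only; the real solvers return typed profile objects:

```python
# Solver-chaining sketch (hypothetical names, toy numbers): each stage
# consumes the previous stage's output and enriches it.
def serving_profile(gpus: int, tokens_per_gpu_s: float,
                    tokens_per_query: float) -> dict:
    """Feasibility/throughput stage: sustained queries per second."""
    return {"gpus": gpus,
            "queries_per_s": gpus * tokens_per_gpu_s / tokens_per_query}

def economics_profile(p: dict, usd_per_gpu_hour: float) -> dict:
    """Cost stage: adds cost per query to the serving profile."""
    usd_per_day = p["gpus"] * 24 * usd_per_gpu_hour
    return {**p, "usd_per_query": usd_per_day / (p["queries_per_s"] * 86400)}

def sustainability_profile(p: dict, watts_per_gpu: float,
                           pue: float, g_co2_per_kwh: float) -> dict:
    """Carbon stage: adds daily emissions to the cost profile."""
    kwh_per_day = p["gpus"] * watts_per_gpu / 1000 * 24 * pue
    return {**p, "kg_co2_per_day": kwh_per_day * g_co2_per_kwh / 1000}

result = sustainability_profile(
    economics_profile(
        serving_profile(gpus=4, tokens_per_gpu_s=1000, tokens_per_query=500),
        usd_per_gpu_hour=2.5),
    watts_per_gpu=700, pue=1.1, g_co2_per_kwh=8)
```

The point is the shape of the pipeline, not the numbers: latency feasibility gates cost, and cost gates carbon, so the stages only compose in that order.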

Note: Textbook and slides

Layer E concepts draw from multiple lecture decks that teach the diagnostic and optimization frameworks:

| Deck | Topic | Key Concepts |
| --- | --- | --- |
| Hardware Acceleration (Vol I, Ch 11) | The Roofline model | Arithmetic intensity, ridge points, memory-bound vs. compute-bound diagnosis |
| Model Serving (Vol I, Ch 13) | Inverting every training priority | Latency budgets, queuing theory (Little’s Law), continuous batching, training-serving skew |
| Inference at Scale (Vol II, Ch 9) | Where ML systems live or die economically | Serving economics, KV-cache bottleneck, batching strategies, autoscaling |
| Performance Engineering (Vol II, Ch 10) | Match the software to the silicon | Operator fusion, FlashAttention, mixed precision, systematic profiling workflow |
| Distributed Training (Vol II, Ch 5) | Scaling efficiency analysis | Communication-Computation Ratio, parallelism strategy selection, scaling laws |

See the Solver Guide for a decision guide on choosing the right solver, and Math Foundations for the complete equations.


Progressive Lowering in Action

The architecture is best understood through a concrete example. Consider the question: “Can I serve Llama-3-70B on a cluster of 4 H100s within a $50K/year budget while minimizing carbon?”

This single question touches all five layers:

1. Layer A — Llama-3-70B workload: 70B parameters, GQA with 8 KV heads,
             ~140 GB at FP16 precision

2. Layer B — H100 hardware: 990 TFLOP/s (FP16), 3.35 TB/s HBM3,
             80 GB capacity, 700W TDP

3. Layer C — Infrastructure: choose between Quebec (8 g CO₂/kWh)
             and US Average (385 g CO₂/kWh)

4. Layer D — System: 1 node × 4 H100s, NVLink 900 GB/s intra-node,
             tensor parallelism across 4 GPUs

5. Layer E — Chain three solvers:
             ServingModel  → TTFT, ITL, KV-cache feasibility
             EconomicsModel → TCO over 1 year
             SustainabilityModel → carbon footprint by region

Each layer contributes a different piece of the answer. No single layer is sufficient alone. This is why MLSYSIM separates concerns into five composable layers rather than offering a monolithic “predict performance” function.
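One slice of this lowering can be checked by hand. The KV-cache term from step 1 follows directly from the published Llama-3-70B shape (80 layers, 8 KV heads under GQA, head dimension 128); the helper below is an illustration, not the library's estimator:

```python
# KV-cache sizing sketch: keys and values (factor 2) are cached for every
# layer and KV head, per token of context. GQA shrinks this by caching only
# the 8 KV heads rather than all 64 query heads.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-70B at FP16, one 8192-token sequence.
per_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                         seq_len=8192, bytes_per_elem=2)
print(f"{per_seq / 2**30:.2f} GiB of KV cache per sequence")
```

At 2.5 GiB per full-length sequence on top of ~140 GB of weights, the KV cache, not the weights, is what caps the batch size on a 4x80 GB system — exactly the feasibility question ServingModel answers.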


Slide Deck Quick Reference

For instructors and students using the companion lecture slides, the table below maps every MLSYSIM layer to the relevant slide decks.

Volume I: Foundations

| Ch | Deck | MLSYSIM Layer(s) | Download |
| --- | --- | --- | --- |
| 5 | NN Computation | A (Workloads) | PDF |
| 6 | NN Architectures | A (Workloads) | PDF |
| 8 | Model Training | A + B + E | PDF |
| 10 | Model Compression | A (Workloads) | PDF |
| 11 | Hardware Acceleration | B (Hardware) + E (Solvers) | PDF |
| 12 | Benchmarking | B (Hardware) + E (Solvers) | PDF |
| 13 | Model Serving | E (Solvers) | PDF |

Volume II: At Scale

| Ch | Deck | MLSYSIM Layer(s) | Download |
| --- | --- | --- | --- |
| 2 | Compute Infrastructure | B (Hardware) | PDF |
| 3 | Network Fabrics | D (Systems) | PDF |
| 5 | Distributed Training | D (Systems) + E (Solvers) | PDF |
| 6 | Collective Communication | D (Systems) | PDF |
| 8 | Fleet Orchestration | D (Systems) | PDF |
| 9 | Inference at Scale | E (Solvers) | PDF |
| 10 | Performance Engineering | E (Solvers) | PDF |
| 15 | Sustainable AI | C (Infrastructure) | PDF |

Where to Go Next

  • Getting Started — Install MLSYSIM and run your first analysis in 5 minutes.
  • Solver Guide — Decision guide for choosing the right analytical engine.
  • Math Foundations — The closed-form equations behind every solver.
  • Tutorials — Hands-on notebooks for roofline analysis, distributed training, LLM serving, and sustainability.
  • Zoo Overview — Browse the registries of vetted models, hardware, fleets, and infrastructure.