The 5-Layer Architecture

Demand–Supply Separation and Progressive Lowering

Tip: The MIPS Analogy

Patterson and Hennessy did not give students cycle-accurate x86 simulators; they gave them a taxonomically complete instruction set that exposed every architectural concept through a model simple enough to reason about yet faithful enough to build real intuition. MLSys·im occupies the same niche for ML systems — sacrificing microarchitectural detail to achieve sub-second execution, enabling students and practitioners to sweep thousands of configurations in the time a cycle-accurate simulator requires for one.

The core philosophy of MLSys·im is Demand–Supply Separation with Progressive Lowering. Rather than treating machine learning systems as black boxes, MLSys·im cleanly decouples what a model computes (demand) from where it runs (supply) and why constraints emerge (analysis).

Abstract workload demand (Layer A) is progressively mapped onto concrete hardware supply (Layers B, C, D) through analytical solvers (Layer E) that enforce dimensional strictness at runtime — every physical quantity carries SI units via the pint library, making unit mismatches structurally impossible. Understanding this stack is the key to mastering both this library and the textbook it accompanies.

The Stack Diagram

%%{init: {'theme': 'neutral'}}%%
%%| fig-cap: "The MLSys·im 5-Layer Stack. Workloads (demand) are lowered onto Hardware (supply) through Infrastructure and Systems layers. Solvers bridge demand and supply to produce analytical profiles."
%%| fig-width: 100%
flowchart TB
    A["<b>Layer A: Workloads (Demand)</b><br/>TransformerWorkload, CNNWorkload, SSMWorkload, DiffusionWorkload<br/><i>Parameters, FLOPs, Arithmetic Intensity</i>"]
    B["<b>Layer B: Hardware (Silicon)</b><br/>HardwareNode, ComputeCore, MemoryHierarchy, StorageHierarchy<br/><i>Peak FLOP/s, BW, Capacity, TDP, IO BW</i>"]
    C["<b>Layer C: Infrastructure (Environment)</b><br/>GridProfile, Datacenter<br/><i>Carbon Intensity, PUE, WUE</i>"]
    D["<b>Layer D: Systems (Topology)</b><br/>Node, Fleet, NetworkFabric<br/><i>Topology, Accelerators/Node, Fabric BW</i>"]
    E["<b>Layer E: Solvers (22 Walls, 22 Resolvers)</b><br/>Node (1–7): SingleNode · Efficiency · Serving · ContinuousBatching · WeightStreaming · TailLatency<br/>Data (8–10): Data · Transformation · Topology<br/>Algorithm (11–13): Scaling · InferenceScaling · Compression<br/>Fleet (14–16): Distributed · Reliability · Orchestration<br/>Ops (17–20): Economics · Sustainability · Checkpoint · ResponsibleEngineering<br/>Analysis (21–22): Sensitivity · Synthesis"]
    F["<b>Typed Results</b><br/>PerformanceProfile · DistributedResult · ServingResult · ..."]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F


1. Layer A: Workloads (Demand)

A Workload is a hardware-agnostic description of computational demand. You don’t ask “How fast is Llama-3?”; you ask “How many FLOPs and memory bytes does Llama-3 require?”

In MLSys·im, TransformerWorkload, CNNWorkload, SSMWorkload, and DiffusionWorkload define these intrinsic properties (parameter count, layer count, sequence length). You can also import models directly from HuggingFace Hub using import_hf_model(). The crucial step happens when a workload is “lowered” at a specific numerical precision (e.g., FP16 vs INT8). This lowering step determines the Arithmetic Intensity (ops/byte) — the ratio that decides whether a model will be compute-bound or memory-bound on physical hardware.
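The effect of lowering on arithmetic intensity can be sketched in a few lines of plain Python. The numbers below are illustrative (a hypothetical 7B-parameter model during single-token decode, where each weight is read once and contributes roughly 2 FLOPs); in MLSys·im these quantities would carry pint units, omitted here for brevity:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Ops/byte: the ratio fixed when a workload is lowered at a given precision."""
    return flops / bytes_moved

params = 7e9                    # hypothetical 7B-parameter model
flops_per_token = 2 * params    # ~2 FLOPs per parameter per decoded token

ai_fp16 = arithmetic_intensity(flops_per_token, params * 2)  # FP16: 2 bytes/param
ai_int8 = arithmetic_intensity(flops_per_token, params * 1)  # INT8: 1 byte/param
print(ai_fp16, ai_int8)   # 1.0 2.0: halving the precision doubles the intensity
```

Note that the model's FLOP count did not change between the two precisions; only the bytes moved did, which is why lowering (not the architecture alone) determines where the workload lands on the roofline.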

See the Model Zoo for vetted workloads.


2. Layer B: Hardware (Supply)

A HardwareNode represents a single physical accelerator (like an H100 GPU or an Apple M3 chip). It provides the raw physical supply:

  • Compute: Theoretical peak throughput (TFLOP/s) across different precisions (FP32, FP16, INT8).
  • Memory: High Bandwidth Memory (HBM) capacity and transfer speed (TB/s).
  • Storage & IO: Persistent storage capacity/bandwidth and IO interconnect (e.g., PCIe Gen5) speeds.
  • Power: Thermal Design Power (TDP).

Every piece of silicon has a “Ridge Point” (Peak FLOPs / Memory Bandwidth). If your Workload’s arithmetic intensity is lower than the hardware’s ridge point, you are memory-bound.
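A minimal sketch of the ridge-point test, using ballpark H100 SXM figures as assumed inputs (~989 TFLOP/s dense FP16, ~3.35 TB/s HBM3); the helper names are illustrative, not the library API:

```python
def ridge_point(peak_flops: float, mem_bw_bytes_s: float) -> float:
    """Ridge point in ops/byte: peak FLOP/s divided by memory bandwidth."""
    return peak_flops / mem_bw_bytes_s

def regime(ai: float, ridge: float) -> str:
    """Classify a workload's roofline regime on this hardware."""
    return "memory-bound" if ai < ridge else "compute-bound"

# Ballpark H100 SXM figures (assumed): ~989 TFLOP/s dense FP16, ~3.35 TB/s HBM3
ridge = ridge_point(989e12, 3.35e12)   # roughly 295 ops/byte
print(regime(1.0, ridge))              # memory-bound (decode-like workloads)
print(regime(400.0, ridge))            # compute-bound (large-batch GEMMs)
```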

See the Silicon Zoo for vetted hardware specs.


3. Layer C: Infrastructure (Environment)

Hardware doesn’t run in a vacuum; it runs in datacenters plugged into regional power grids. The GridProfile captures this physical context.

A 1000-watt GPU running in Quebec (hydroelectric power) vs. Poland (coal power) produces vastly different carbon footprints, despite doing the exact same mathematical operations. This layer introduces Power Usage Effectiveness (PUE) and Carbon Intensity to the analytical model.
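The Quebec-vs-Poland contrast follows directly from the Wall-18 equation. A minimal sketch, with assumed grid intensities (hydro ~30 gCO₂e/kWh, coal ~650 gCO₂e/kWh; the function name is illustrative, not the GridProfile API):

```python
def operational_carbon_kg(power_w: float, hours: float,
                          pue: float, ci_g_per_kwh: float) -> float:
    """Wall-18 style estimate: kg CO2e = IT energy (kWh) x PUE x carbon intensity."""
    it_energy_kwh = power_w / 1000 * hours
    return it_energy_kwh * pue * ci_g_per_kwh / 1000

# Assumed grid intensities: hydro ~30 gCO2e/kWh, coal ~650 gCO2e/kWh
quebec = operational_carbon_kg(1000, 24, 1.2, 30)    # ~0.86 kg CO2e per day
poland = operational_carbon_kg(1000, 24, 1.2, 650)   # ~18.7 kg CO2e per day
```

Under these assumptions the same GPU-day emits roughly 20x more carbon in the coal-heavy grid, with identical FLOPs performed.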

See the Infrastructure Zoo for regional grid profiles.


4. Layer D: Systems (Topology)

You cannot train a 100-billion-parameter model on a single GPU. A Fleet composes individual HardwareNodes into a distributed cluster.

  • Node: Groups accelerators within a physical server chassis (e.g., 8x GPUs).
  • NetworkFabric: Specifies how servers talk to each other (e.g., 400 Gbps InfiniBand NDR).

The way you structure this system determines your communication overhead and your scaling efficiency when you apply 3D/4D Parallelism.
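The communication overhead can be sketched with the standard alpha-beta cost model for Ring AllReduce (the same form as Wall 14). All numbers are assumed: 14 GB of FP16 gradients for a 7B model, 400 Gb/s links (~50 GB/s usable), 5 µs per-hop latency:

```python
def ring_allreduce_time(n: int, msg_bytes: float,
                        bw_bytes_s: float, alpha_s: float) -> float:
    """Alpha-beta cost of Ring AllReduce: bandwidth term plus per-hop latency term."""
    return 2 * (n - 1) / n * msg_bytes / bw_bytes_s + 2 * (n - 1) * alpha_s

# Assumed: 14 GB gradients, 50 GB/s effective link bandwidth, 5 us latency, 64 nodes
t = ring_allreduce_time(64, 14e9, 50e9, 5e-6)
print(round(t, 3))   # 0.552 seconds, dominated by the bandwidth term
```

At this message size the latency term contributes well under a millisecond, which is why fabric bandwidth, not hop count, dominates gradient synchronization for large dense models.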

See the Fleet Zoo for production cluster topologies.


The 22 Systems Walls

Before introducing the solvers, it is essential to understand the 22 physical and logical constraints that bound ML system performance. Each wall represents a specific bottleneck, grounded in a published equation and resolved by a dedicated solver. The walls are organized into six domains that progress from local node resources to fleet-scale operations.

Table 1: The 22 ML Systems Walls. Each wall represents a physical or logical constraint resolved by a dedicated solver. Walls 1–2 (Compute and Memory) share the SingleNodeModel, yielding 21 distinct solvers across 22 walls. See the research paper for full citations.
| # | Wall | Domain | What’s Bounded | Core Equation | Key Reference |
|---|------|--------|----------------|---------------|---------------|
| 1 | Compute | Node | Peak FLOP/s ceiling | \(T = \text{OPs} / (\text{Peak} \times \eta)\) | Williams et al. (2009) |
| 2 | Memory | Node | HBM BW + capacity | \(T = \lVert W \rVert / BW_{\text{HBM}}\) | Williams et al. (2009) |
| 3 | Software | Node | Achieved MFU | \(\eta = f(\text{kernel, fusion, occupancy})\) | Chowdhery et al. (2022) |
| 4 | Serving | Node | Prefill vs. decode | Two-phase roofline | Pope et al. (2023) |
| 5 | Batching | Node | KV-cache fragmentation | \(\text{KV} = 2LHD \lceil S/p \rceil pBb\) | Kwon et al. (2023) |
| 6 | Streaming | Node | Injection BW | \(T = \max(T_{\text{inject}}, T_{\text{compute}})\) | Cerebras (2024) |
| 7 | Tail Latency | Node | P99 queueing delay | Erlang-C M/M/\(c\) | Dean & Barroso (2013) |
| 8 | Ingestion | Data | Storage I/O throughput | \(\rho = BW_{\text{demand}} / BW_{\text{supply}}\) | Mohan et al. (2022) |
| 9 | Transformation | Data | CPU preprocessing | \(T = B \cdot S / C_{\text{tput}}\) | Murray et al. (2021) |
| 10 | Locality | Data | Network bisection BW | \(BW_{\text{eff}} = BW_{\text{link}} \cdot \beta / \text{osub}\) | Leiserson (1985) |
| 11 | Complexity | Algorithm | Scaling law bounds | \(C = 6PD\); \(P^{*} = \sqrt{C/120}\) | Hoffmann et al. (2022) |
| 12 | Reasoning | Algorithm | Inference-time compute | \(T = K \times T_{\text{step}}\) | Brown et al. (2024) |
| 13 | Fidelity | Algorithm | Accuracy–efficiency | \(r = 32/b\); \(r = 1/(1{-}s)\) | Han et al. (2015) |
| 14 | Communication | Fleet | AllReduce overhead | \(T = 2\frac{N{-}1}{N}\frac{M}{\beta} + 2(N{-}1)\alpha\) | Shoeybi et al. (2019) |
| 15 | Fragility | Fleet | Cluster MTBF | \(\text{MTBF}_{\text{cl}} = \text{MTBF}_{\text{node}}/N\) | Daly (2006) |
| 16 | Multi-tenant | Fleet | Queue wait time | \(T_{\text{wait}} = \rho / [2\mu(1{-}\rho)]\) | Little (1961) |
| 17 | Capital | Ops | Total cost of ownership | \(\text{TCO} = \text{CapEx} + \text{OpEx}\) | Barroso et al. (2018) |
| 18 | Sustainability | Ops | Carbon + water | \(\text{CO}_2 = E \times \text{PUE} \times \text{CI}\) | Patterson et al. (2022) |
| 19 | Checkpoint | Ops | I/O burst MFU penalty | \(\text{penalty} = T_{\text{write}} / T_{\text{interval}}\) | Eisenman et al. (2022) |
| 20 | Safety | Ops | DP-SGD overhead | \(\sigma \propto 1/\epsilon\) | Abadi et al. (2016) |
| 21 | Sensitivity | Analysis | Binding constraint | \(\partial T / \partial x_i\) | Williams et al. (2009) |
| 22 | Synthesis | Analysis | Inverse spec derivation | \(BW_{\text{req}} = \lVert W \rVert / T_{\text{target}}\) | Kwon et al. (2023) |

5. Layer E: Solvers (Analysis)

The previous four layers define what exists — hardware specs, model architectures, infrastructure configurations. Solvers are the engines that bridge demand and supply to answer specific questions: they take a (workload, hardware) pair and compute whether the system is feasible, how it performs, and what it costs.

Each solver resolves one or more of the 22 ML Systems Walls — physical or logical constraints that bound system performance. The walls are organized into six domains that progress from local node resources through data movement and algorithmic scaling to fleet coordination, operations, and cross-cutting analysis.

Note: The 22 Walls at a Glance

Walls 1–2 (Compute and Memory) share the SingleNodeModel, yielding 21 distinct solvers across 22 walls.

Domain 1 — Node (Single-Accelerator Resources, Walls 1–7):

  • SingleNodeModel (Wall 1: Compute, Wall 2: Memory): Roofline model — peak FLOP/s ceiling and HBM bandwidth + capacity.
  • EfficiencyModel (Wall 3: Software): MFU decomposition — kernel fusion, FlashAttention, SM occupancy.
  • ServingModel (Wall 4: Serving): Two-phase LLM inference — compute-bound prefill vs. memory-bound decode.
  • ContinuousBatchingModel (Wall 5: Batching): PagedAttention and iteration-level scheduling with non-contiguous KV-cache allocation.
  • WeightStreamingModel (Wall 6: Streaming): Wafer-scale inference (e.g., Cerebras CS-3) — injection bandwidth bottleneck.
  • TailLatencyModel (Wall 7: Tail Latency): M/M/c queueing (Erlang-C) for P50/P99 SLA analysis.
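Wall 7's queueing math is compact enough to sketch directly. This is a minimal Erlang-C implementation and its exponential tail inversion, not the TailLatencyModel API:

```python
from math import factorial, log

def erlang_c(c: int, a: float) -> float:
    """Probability an arriving request must queue in M/M/c (a = lambda/mu < c)."""
    top = a**c / factorial(c) * c / (c - a)
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def p99_wait(c: int, a: float, mu: float) -> float:
    """P99 queueing delay: invert P(W > t) = Pw * exp(-(c*mu - lambda) * t) at 0.01."""
    pw = erlang_c(c, a)
    if pw <= 0.01:               # fewer than 1% of requests wait at all
        return 0.0
    return log(pw / 0.01) / (c * mu - a * mu)

# Sanity check: for M/M/1 the waiting probability equals the utilization rho
print(erlang_c(1, 0.5))   # 0.5
```

A useful consequence visible in this model: as utilization \(a/c\) approaches 1, the P99 wait diverges, which is why serving fleets are provisioned well below full utilization.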

Domain 2 — Data (Movement & Pipelines, Walls 8–10):

  • DataModel (Wall 8: Ingestion): Storage I/O demand–supply bandwidth ratio.
  • TransformationModel (Wall 9: Transformation): CPU preprocessing stall detection (JPEG decode, tokenization, augmentation).
  • TopologyModel (Wall 10: Locality): Network bisection bandwidth for fat-tree, dragonfly, torus, ring topologies.

Domain 3 — Algorithm (Scaling & Compression, Walls 11–13):

  • ScalingModel (Wall 11: Complexity): Chinchilla scaling laws — compute-optimal model and dataset sizing.
  • InferenceScalingModel (Wall 12: Reasoning): Inference-time compute scaling — chain-of-thought and tree-search cost.
  • CompressionModel (Wall 13: Fidelity): Quantization/pruning accuracy–efficiency trade-offs.

Domain 4 — Fleet (Multi-Node Coordination, Walls 14–16):

  • DistributedModel (Wall 14: Communication): 4D parallelism (DP × TP × PP × EP), Ring AllReduce, pipeline bubbles.
  • ReliabilityModel (Wall 15: Fragility): Fleet MTBF and Young-Daly optimal checkpoint interval.
  • OrchestrationModel (Wall 16: Multi-tenant): Queueing theory (M/D/1) for shared cluster wait times.
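The Domain 4 reliability math (Walls 15 and 19 interact here) can be sketched in a few lines. All figures are assumed, and the helper names are illustrative, not the ReliabilityModel API:

```python
from math import sqrt

def fleet_mtbf_h(node_mtbf_h: float, n_nodes: int) -> float:
    """Wall 15: cluster MTBF shrinks linearly with node count."""
    return node_mtbf_h / n_nodes

def young_daly_interval_h(write_time_h: float, mtbf_h: float) -> float:
    """Young-Daly first-order optimum: tau = sqrt(2 * delta * MTBF)."""
    return sqrt(2 * write_time_h * mtbf_h)

# Assumed figures: 50,000 h node MTBF, 1,024 nodes, 5-minute checkpoint write
mtbf = fleet_mtbf_h(50_000, 1024)             # ~48.8 h between cluster failures
tau = young_daly_interval_h(5 / 60, mtbf)     # ~2.85 h between checkpoints
```

The punchline is the scaling behavior: a node that fails once every six years yields a 1,024-node cluster that fails roughly every two days, forcing checkpoints every few hours.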

Domain 5 — Ops (Economics, Sustainability & Safety, Walls 17–20):

  • EconomicsModel (Wall 17: Capital): Total Cost of Ownership (CapEx + OpEx).
  • SustainabilityModel (Wall 18: Sustainability): Energy, carbon footprint (kg CO₂e), and water usage.
  • CheckpointModel (Wall 19: Checkpoint): Checkpoint I/O burst penalties and MFU impact.
  • ResponsibleEngineeringModel (Wall 20: Safety): DP-SGD slowdown and fairness data cost.

Domain 6 — Analysis (Cross-Cutting Diagnostics, Walls 21–22):

  • SensitivitySolver (Wall 21: Sensitivity): Binding constraint identification via partial derivatives.
  • SynthesisSolver (Wall 22: Synthesis): Inverse Roofline — derive hardware specs from an SLA.

Solver Composition

Because every solver is a stateless pure function, \(s_i : (\text{Workload}, \text{Hardware}, \text{Infra}, \text{Systems}) \rightarrow \mathcal{R}_i\), solvers compose naturally through chaining. The output of one solver feeds into the next, with dimensional correctness preserved at every step:

\[ \text{Scaling} \xrightarrow{\;\mathcal{R}_1\;} \text{Distributed} \xrightarrow{\;\mathcal{R}_2\;} \text{Economics} \xrightarrow{\;\mathcal{R}_3\;} \text{Sustainability} \]

This design yields three critical properties: reproducibility (identical inputs always produce identical outputs), testability (each solver validates against textbook equations in isolation), and transparency (students inspect exactly which equations produced a result).
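The chaining pattern can be sketched with frozen dataclasses standing in for the library's typed results. Everything here is illustrative (the class names, the toy step model); only the Wall-11 equations are from the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingResult:            # stand-in for a typed result R1
    optimal_params: float
    optimal_tokens: float

@dataclass(frozen=True)
class DistributedResult:        # stand-in for a typed result R2
    step_flops: float

def scaling_solver(compute_budget: float) -> ScalingResult:
    # Chinchilla-style split from Wall 11: C = 6PD, P* = sqrt(C/120)
    p = (compute_budget / 120) ** 0.5
    return ScalingResult(optimal_params=p, optimal_tokens=compute_budget / (6 * p))

def distributed_solver(r: ScalingResult) -> DistributedResult:
    # Toy follow-on: forward + backward FLOPs per token is ~6P
    return DistributedResult(step_flops=6 * r.optimal_params)

result = distributed_solver(scaling_solver(1e24))   # pure functions chain cleanly
```

Freezing the dataclasses mirrors the statelessness property: a result cannot be mutated downstream, so identical inputs always reproduce identical chains.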

Extensibility

The layered architecture is designed for extension at every level. New workload types (e.g., a RetrievalAugmentedWorkload for RAG pipelines) require only implementing the lower() method to produce a ComputationGraph; all existing solvers apply without modification. New hardware entries are added to the Silicon Zoo as declarative HardwareNode specifications, with no solver changes needed. New solvers can be introduced for emerging constraints by implementing the solver interface: accept typed inputs, return dimensioned outputs. The type system enforces correctness at every boundary, so extensions compose safely with existing components.

See the Contributing guide for a hands-on walkthrough of adding custom solvers and hardware.

Three-Level Evaluation

The Scenario.evaluate() entry point orchestrates solver composition through a three-level evaluation:

  1. Level 1 — Feasibility. Does the model fit in memory? Can the data pipeline keep pace? Any wall where demand exceeds supply is flagged immediately.
  2. Level 2 — Performance. What are the achievable latency, throughput, and utilization? Roofline analysis, communication modeling, and pipeline-bubble accounting combine into an end-to-end step time.
  3. Level 3 — Macro. What does it cost, and what does it emit? TCO, carbon, water, and responsibility overhead are computed from the performance results.

A feasibility failure at Level 1 short-circuits the evaluation — there is no point optimizing AllReduce if the model does not fit in memory.
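The short-circuit logic can be sketched as a toy evaluator (not the Scenario.evaluate() implementation; the check name and numbers are assumptions):

```python
def evaluate(feasibility_checks, perf_fn, macro_fn):
    """Toy three-level evaluation: any Level-1 failure short-circuits Levels 2-3."""
    violated = [name for name, ok in feasibility_checks if not ok]
    if violated:
        return {"feasible": False, "violated_walls": violated}
    perf = perf_fn()                                  # Level 2 runs only if feasible
    return {"feasible": True, "performance": perf, "macro": macro_fn(perf)}

# Level-1 check: do 70B FP16 weights (140 GB) fit in an 80 GB device?
fits = 70e9 * 2 <= 80e9
print(evaluate([("memory_capacity", fits)], lambda: 1.0, lambda p: 2 * p))
# {'feasible': False, 'violated_walls': ['memory_capacity']}
```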

| Level | Purpose | Walls Evaluated |
|-------|---------|-----------------|
| 1. Feasibility | Does it fit? | 1–2 (capacity), 5 (KV-cache), 8 (I/O) |
| 2. Performance | How fast? | 1–7 (node), 11–16 (algorithm + fleet) |
| 3. Macro | What does it cost? | 17–20 (operations), 21–22 (analysis) |

See the Resolver Guide to learn how to apply these solvers, and Math Foundations for the equations behind each.
