One model, six domains, twelve walls — a complete systems analysis in 60 seconds.
capstone
advanced
Compose 6+ solvers across all six taxonomy domains to produce a holistic training analysis. Discover that the binding constraint is compute, but checkpoint overhead is the hidden cost.
The Question
What does a complete systems analysis look like? No single solver captures the full picture. Training a 70B-parameter model on 512 H100 GPUs involves compute walls, memory walls, communication overhead, checkpoint I/O, energy costs, and carbon emissions — simultaneously. This tutorial traces all six taxonomy domains and exercises 12 of the 22 systems walls through a single workload.
- Compose six solver families across all taxonomy domains into a holistic analysis
- Identify which of the 22 systems walls bind for a real training workload
- Quantify the hidden costs: checkpoint overhead, carbon, water, and TCO
- Produce a summary table mapping domain -> solver -> binding wall
Tip: Solver Quick Reference
This capstone uses solvers from all six domains. If you arrived via an accelerated learning path, here is what each solver does:
| Solver | Domain | What It Computes |
|---|---|---|
| SingleNodeModel | Node | Roofline bottleneck, latency, throughput |
| DataModel | Data | Whether the data pipeline can sustain GPU demand |
| ScalingModel | Algorithm | Compute-optimal training budget (Chinchilla) |
| DistributedModel | Fleet | Communication overhead and scaling efficiency |
| ReliabilityModel | Fleet | Cluster MTBF and optimal checkpoint intervals |
| EconomicsModel | Ops | CapEx, OpEx, and total cost of ownership (TCO) |
| SustainabilityModel | Ops | Energy, carbon footprint, and water usage |
| SensitivitySolver | Analysis | Partial derivatives identifying the binding constraint |
| SynthesisSolver | Analysis | Minimum hardware specs from a latency target |
Tip: Background on the Six Taxonomy Domains
The MLSys wall taxonomy organizes 22 systems walls into six domains:
| Domain | Walls | What It Covers |
|---|---|---|
| Node | 1–3 | Compute, memory capacity, memory bandwidth |
| Data | 8–10 | Storage throughput, data pipeline stalls |
| Algorithm | 11–13 | Scaling laws, compute-optimal training |
| Fleet | 14–16 | Communication, synchronization, reliability |
| Ops | 17–20 | TCO, energy, carbon, water, safety |
| Analysis | 21–22 | Sensitivity, inverse synthesis |
No single solver spans all six. The insight emerges from composition.
1. Setup: Build the Fleet
We construct a 512-GPU training cluster: 64 DGX H100 nodes, 8 GPUs per node, NVLink intra-node, InfiniBand NDR inter-node, powered by Quebec’s hydroelectric grid.
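2. Node (Walls 1–3): Roofline at the Single GPU
The setup and node-level cells are not reproduced here; the sketch below shows the shape of that analysis. The constructors for `model` and `h100`, the `batch_size` argument, and the result attribute names are assumptions rather than confirmed mlsysim API (only `SingleNodeModel`, `banner`, and `info` appear elsewhere in this tutorial).

```python
from mlsysim import SingleNodeModel

# Hypothetical setup sketch: the tutorial builds the workload and hardware objects
# in an earlier cell; the constructor names below are assumed, not confirmed API.
# model = Model("llama3-70b")
# h100 = Hardware("H100-SXM")

node_solver = SingleNodeModel()
node_result = node_solver.solve(
    model=model,
    hardware=h100,
    batch_size=4,        # assumed parameter name; 4 samples per GPU, as in the text
    precision="fp16",
)

banner("Domain: Node (Walls 1-3)")
info(
    Bottleneck=node_result.bottleneck,       # result attribute names are assumed
    Per_GPU_latency=node_result.latency,
    Throughput=node_result.throughput,
)
```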
=== Domain: Node (Walls 1-3) ===
Bottleneck: Compute
Per GPU latency: 2,214.3 ms
Throughput: 2 samples/s
Training at batch size 4 per GPU puts us in the compute-bound regime — unlike inference, training has high arithmetic intensity due to the backward pass. Wall 1 (Compute) is the binding constraint at the node level.
Compute-bound is good news — it means the GPU is doing useful work, not waiting for data. But can the data pipeline actually keep up with 512 GPUs demanding training samples?
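As a rough sanity check on the compute-bound claim, the roofline ridge point of an H100 SXM can be computed from its public specs. These numbers are approximations and are not taken from the solver:

```python
# Approximate H100 SXM roofline ridge point (public spec values, assumptions):
peak_flops = 989e12   # ~989 TFLOP/s dense FP16/BF16
hbm_bw = 3.35e12      # ~3.35 TB/s HBM3 bandwidth

ridge = peak_flops / hbm_bw
print(f"Ridge point: {ridge:.0f} FLOP/byte")  # kernels above this intensity are compute-bound
```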
3. Data (Walls 8–10): Can the Pipeline Keep Up?
The roofline tells us each GPU can consume data at a certain rate. But can the storage and preprocessing pipeline actually deliver data that fast? If not, the GPUs stall — and “compute-bound” becomes a meaningless label.
```python
from mlsysim import DataModel

# Estimate data demand per step: 4 samples/GPU * 512 GPUs * 2048 tokens * 2 bytes ≈ 8 MB/step
# At ~1 step/sec, this is ~8 MB/s — tokenized text is compact
data_demand = Q_("8 MB/s")

data_solver = DataModel()
data_result = data_solver.solve(
    workload_data_rate=data_demand,
    hardware=h100,
)

banner("Domain: Data (Walls 8-10)")
info(
    Data_demand=data_result.demand_bw,
    Data_supply=data_result.supply_bw,
    Utilization=f"{data_result.utilization:.1%}",
    Stalled=data_result.is_stalled,
    Bottleneck=data_result.bottleneck,
)
```
=== Domain: Data (Walls 8-10) ===
Data demand: 8.00e-03 GB/s
Data supply: 7 GB/s
Utilization: 0.1%
Stalled: False
Bottleneck: Storage
For text-based training, the data pipeline is rarely the bottleneck — tokenized text is compact. But for image or video training, this wall can dominate.
The data pipeline can keep up. The GPUs are compute-bound and well-fed. But are we spending our compute budget wisely? A 30-day run on 512 GPUs is an enormous investment — the scaling laws tell us whether we are allocating it optimally.
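4. Algorithm (Walls 11–13): Is the Training Budget Compute-Optimal?
The scaling-law cell is not reproduced here, but the Chinchilla rule of thumb (roughly 20 training tokens per parameter, with total compute approximated by C ≈ 6ND FLOPs) is easy to check by hand for a 70B-parameter model:

```python
# Back-of-the-envelope Chinchilla check (plain arithmetic, no solver required).
N = 70e9          # parameters
D = 20 * N        # compute-optimal token count, ~1.4e12 tokens
C = 6 * N * D     # ~5.9e23 training FLOPs

print(f"Compute-optimal tokens: {D:.2e}")
print(f"Training FLOPs (6*N*D): {C:.2e}")
print(f"Tokens per parameter:   {D / N:.0f}")
```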
If the tokens-per-parameter ratio is significantly above or below 20, the training budget is not optimally allocated. Over-training wastes compute; under-training wastes model capacity.
So far, everything looks manageable: compute-bound GPUs, adequate data pipeline, reasonable training budget. If we throw 512 GPUs at this, we should scale linearly, right? The fleet-level analysis reveals what single-node reasoning misses.
5. Fleet (Walls 14–16): Communication and Reliability
The distributed solver models AllReduce overhead and pipeline bubbles. The reliability solver computes cluster MTBF and optimal checkpoint intervals.
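A sketch of the two fleet-level solves that produce the output below. The argument names (`num_gpus`, `duration`, `checkpoint_time`) and the result attributes are assumptions, not confirmed mlsysim API:

```python
from mlsysim import DistributedModel, ReliabilityModel

# Sketch under assumed parameter names: 512 GPUs across 64 DGX H100 nodes.
dist_solver = DistributedModel()
dist_result = dist_solver.solve(
    model=model,
    hardware=h100,
    num_gpus=512,        # assumed parameter name
    precision="fp16",
)

rel_solver = ReliabilityModel()
rel_result = rel_solver.solve(
    num_gpus=512,                  # assumed parameter name
    duration=Q_("30 days"),        # assumed: length of the training run
    checkpoint_time=Q_("2 min"),   # assumed: per-checkpoint write time
)

banner("Domain: Fleet (Walls 14-16)")
info(
    Fleet_MTBF=rel_result.fleet_mtbf,                     # attribute names assumed
    Failure_probability=rel_result.failure_probability,
    Expected_failures=rel_result.expected_failures,
    Optimal_ckpt_interval=rel_result.optimal_checkpoint_interval,
)
```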
Fleet MTBF: 83.71 h
Failure probability: 99.98%
Expected failures: 8.6
Optimal ckpt interval: 141.7 min
At 512 GPUs, the cluster MTBF shrinks significantly. Checkpoint overhead becomes a non-trivial fraction of wall-clock time — this is the “hidden cost” that single-node analysis misses entirely.
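The 141.7-minute interval above follows directly from the Young-Daly formula applied to the printed fleet MTBF, assuming a checkpoint write time of about 2 minutes (the write time itself is not shown in the output):

```python
import math

mtbf_s = 83.71 * 3600   # fleet MTBF from the reliability output, in seconds
delta_s = 2 * 60        # assumed checkpoint write time, ~2 minutes

tau_opt = math.sqrt(2 * delta_s * mtbf_s)   # Young-Daly optimal checkpoint interval
write_overhead = delta_s / tau_opt          # fraction of wall-clock time spent writing checkpoints

print(f"Optimal checkpoint interval: {tau_opt / 60:.1f} min")  # ~141.7 min
print(f"Checkpoint write overhead:   {write_overhead:.1%}")    # ~1.4%, excluding rework after failures
```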
The reliability analysis tells us HOW OFTEN the cluster fails. But failures cost money — and so does the energy to keep 512 GPUs running for 30 days. The operational domain quantifies these costs.
6. Ops (Walls 17–20): TCO, Energy, Carbon, Water
The economics solver combines CapEx, OpEx, and sustainability into a single financial model.
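A sketch of the operational solves behind the output below. The argument names, the region identifier, and the result attributes are assumptions, not confirmed mlsysim API:

```python
from mlsysim import EconomicsModel, SustainabilityModel

# Sketch under assumed parameter names: a 30-day, 512-GPU run in Quebec.
sus_solver = SustainabilityModel()
sus_result = sus_solver.solve(
    num_gpus=512,                # assumed parameter name
    duration=Q_("30 days"),      # assumed parameter name
    region="Quebec (Hydro)",     # assumed region identifier
)

econ_solver = EconomicsModel()
econ_result = econ_solver.solve(
    num_gpus=512,
    duration=Q_("30 days"),
)

banner("Domain: Ops (Walls 17-20)")
info(
    IT_Energy=sus_result.it_energy,         # attribute names assumed
    Total_Energy=sus_result.total_energy,
    Carbon_footprint=sus_result.carbon,
    Water_usage=sus_result.water,
    PUE=sus_result.pue,
    TCO=econ_result.tco,
)
```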
IT Energy: 149.7 MWh
Total Energy (incl. PUE): 158.6 MWh
Carbon footprint: 3173 kg CO2
Water usage: 0 liters
PUE: 1.06
Region: Quebec (Hydro)
Quebec’s hydroelectric grid makes this one of the lowest-carbon training locations in the world. The same run in Poland (coal-heavy grid) would produce dramatically more CO2 — infrastructure geography is a first-class engineering variable.
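The size of that gap is quick to estimate with the grid intensities quoted in Exercise 1 (about 20 g CO2/kWh for Quebec hydro versus about 800 g CO2/kWh for Poland's coal-heavy grid), ignoring Poland's higher PUE:

```python
# Rough Quebec-vs-Poland carbon comparison using the total energy from the Ops output.
total_energy_kwh = 158.6e3

quebec_co2_kg = total_energy_kwh * 20 / 1000    # ~3,172 kg, matches the solver output
poland_co2_kg = total_energy_kwh * 800 / 1000   # ~126,880 kg

print(f"Quebec: {quebec_co2_kg:,.0f} kg CO2")
print(f"Poland: {poland_co2_kg:,.0f} kg CO2 (~{poland_co2_kg / quebec_co2_kg:.0f}x more)")
```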
7. Analysis (Walls 21–22): Sensitivity and Synthesis
Finally, confirm the binding constraint and derive minimum hardware for a 14-day completion target.
```python
from mlsysim import SensitivitySolver, SynthesisSolver

# Sensitivity: confirm compute is the binding constraint for training
sens_solver = SensitivitySolver()
sens_result = sens_solver.solve(
    model=model,
    hardware=h100,
    precision="fp16",
)

banner("Domain: Analysis (Walls 21-22)")
info(Binding_constraint=sens_result.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in sens_result.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)
```
```python
# Synthesis: what per-GPU step latency is needed to finish in 14 days?
# Total training FLOPs / (N_GPUs * MFU * peak_FLOPS) = wall_clock_seconds
target_days = 14
target_seconds = target_days * 86400

# Per-GPU step target: total_steps * step_latency = target_seconds
# Approximate: we need each step to complete within a target latency
synth_solver = SynthesisSolver()
synth_result = synth_solver.solve(
    model=model,
    target_latency=Q_("200 ms"),  # per-GPU training step target
    precision="fp16",
)

info(
    "Synthesis (200ms per-GPU training step target)",
    Required_BW=synth_result.required_bw.to('TB/s'),
    Required_FLOPS=synth_result.required_flops.to('TFLOPs/s'),
    Required_memory=synth_result.required_memory.to('GB'),
)
```
We have now traced a single workload through all six domains. Each solver answered one question in isolation. But the systems engineer’s job is synthesis: seeing the complete picture at once. The table below is that picture — and its most important property is that no single row captures the full story.
No single solver captures the full picture — the systems view emerges from composition. This end-to-end trace exercises 12 of 22 walls through a single model. The per-GPU binding constraint is compute (Wall 1), but the hidden costs only appear at fleet scale: checkpoint overhead (Wall 19) consumes wall-clock time proportional to the MTBF-driven checkpoint frequency, and infrastructure geography (Quebec vs. Poland) can change the carbon footprint by 40x (as Tutorial 7 demonstrated). A complete systems analysis is not one solver run — it is the composition of all six domains.
Your Turn
Caution: Exercises
Exercise 1: Predict before you compute. What if you train in Poland instead of Quebec? Before running code, predict how the TCO and carbon footprint will change. (Hint: Poland’s grid is coal-heavy with ~800 g CO2/kWh vs. Quebec’s ~20 g CO2/kWh, and Poland has a higher PUE.) Then re-run the economics and sustainability solvers with Grids.Poland and compare. How close was your prediction?
Exercise 2: Double the cluster. Scale the fleet to 1024 GPUs (128 nodes). Re-run the distributed solver and reliability solver. Does scaling efficiency hold? How does the MTBF change? At what cluster size does the checkpoint overhead exceed 5% of wall-clock time?
Exercise 3: Minimum viable cluster. What is the minimum cluster size to complete Llama-3 70B training in 14 days? Use the scaling result to determine the required total FLOPS, then work backward to find the number of H100 GPUs needed at 40% MFU. Verify with the distributed solver that the communication overhead is acceptable at that scale.
Exercise 4: Propose a design change. Using the full-stack analysis, identify the single highest-leverage change — hardware upgrade, parallelism strategy, region change, or precision change — that would reduce TCO by at least 20%. Re-run the relevant solvers with your proposed change and compute the new TCO. Write one paragraph justifying why this change has the largest impact, referencing at least two domains from the summary table.
Self-check: If the fleet MTBF is 4 hours and each checkpoint takes 2 minutes, what fraction of wall-clock time is spent checkpointing? (Use the Young-Daly formula: optimal interval = sqrt(2 * delta * MTBF).)
Key Takeaways
Tip: Summary
- Composition is the method: no single solver spans all six taxonomy domains; the systems view emerges only from composing 6+ solvers
- Compute binds at the node level, but checkpoint overhead and communication are the hidden costs at fleet scale
- Infrastructure geography matters: Quebec vs. Poland can change carbon footprint by 40x and TCO by 20–30%
- The summary table is the deliverable: one row per domain, solver, key metric, and binding wall
- 12 of 22 walls are exercised through a single model-fleet pair — this is what a complete analysis looks like