Full-Stack Audit: LLaMA-70B Training

One model, six domains, twelve walls — a complete systems analysis in 60 seconds.

Level: capstone (advanced)

Compose 6+ solvers across all six taxonomy domains into a holistic training analysis, and discover that while compute is the binding constraint, checkpoint overhead is the hidden cost.

The Question

What does a complete systems analysis look like? No single solver captures the full picture. Training a 70B-parameter model on 512 H100 GPUs involves compute walls, memory walls, communication overhead, checkpoint I/O, energy costs, and carbon emissions — simultaneously. This tutorial traces all six taxonomy domains and exercises 12 of the 22 systems walls through a single workload.

Note: Prerequisites

Complete Tutorial 0: Hello, Roofline, Tutorial 1: The Memory Wall, Tutorial 6: Scaling to 1000 GPUs, and Tutorial 9: Sensitivity Analysis. You should understand roofline analysis, distributed training, and binding constraint identification.

Note: What You Will Learn
  • Compose six solver families across all taxonomy domains into a holistic analysis
  • Identify which of the 22 systems walls bind for a real training workload
  • Quantify the hidden costs: checkpoint overhead, carbon, water, and TCO
  • Produce a summary table mapping domain -> solver -> binding wall
Tip: Solver Quick Reference

This capstone uses solvers from all six domains. If you arrived via an accelerated learning path, here is what each solver does:

Solver               Domain     What It Computes
────────────────────────────────────────────────────────────────────────
SingleNodeModel      Node       Roofline bottleneck, latency, throughput
DataModel            Data       Whether the data pipeline can sustain GPU demand
ScalingModel         Algorithm  Compute-optimal training budget (Chinchilla)
DistributedModel     Fleet      Communication overhead and scaling efficiency
ReliabilityModel     Fleet      Cluster MTBF and optimal checkpoint intervals
EconomicsModel       Ops        CapEx, OpEx, and total cost of ownership (TCO)
SustainabilityModel  Ops        Energy, carbon footprint, and water usage
SensitivitySolver    Analysis   Partial derivatives identifying the binding constraint
SynthesisSolver      Analysis   Minimum hardware specs from a latency target
Tip: Background: The Six Taxonomy Domains

The MLSys wall taxonomy organizes 22 systems walls into six domains:

Domain     Walls  What It Covers
──────────────────────────────────────────────────────
Node       1–3    Compute, memory capacity, memory bandwidth
Data       8–10   Storage throughput, data pipeline stalls
Algorithm  11–13  Scaling laws, compute-optimal training
Fleet      14–16  Communication, synchronization, reliability
Ops        17–20  TCO, energy, carbon, water, safety
Analysis   21–22  Sensitivity, inverse synthesis

No single solver spans all six. The insight emerges from composition.


1. Setup: Build the Fleet

We construct a 512-GPU training cluster: 64 DGX H100 nodes, 8 GPUs per node, NVLink intra-node, InfiniBand NDR inter-node, powered by Quebec’s hydroelectric grid.

import mlsysim
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.infra.registry import Grids
from mlsysim.core.constants import Q_, NVLINK_H100_BW, INFINIBAND_NDR_BW

model = mlsysim.Models.Language.Llama3_70B
h100 = mlsysim.Hardware.Cloud.H100

# Build the DGX H100 node: 8 GPUs connected by NVLink 4.0
node = Node(
    name="DGX H100",
    accelerator=h100,
    accelerators_per_node=8,
    intra_node_bw=NVLINK_H100_BW
)

# Build the cluster fabric: InfiniBand NDR (400 Gbps)
fabric = NetworkFabric(
    name="InfiniBand NDR",
    topology="fat-tree",
    bandwidth=INFINIBAND_NDR_BW
)

# Build the fleet: 64 nodes = 512 GPUs, Quebec grid
fleet = Fleet(
    name="Training Cluster",
    node=node,
    count=64,
    fabric=fabric,
    region=Grids.Quebec
)

from mlsysim.show import table, info, banner

info("Fleet Configuration",
     Model=f"{model.name} ({model.parameters.to('Bparam'):.1f~})",
     Fleet=f"{fleet.count} nodes x {node.accelerators_per_node} GPUs = {fleet.total_accelerators} GPUs",
     Intra_node=f"NVLink 4.0 ({NVLINK_H100_BW.to('GB/s'):.0f~})",
     Inter_node=f"IB NDR ({INFINIBAND_NDR_BW.to('Gbps'):.0f~})",
     Region=Grids.Quebec.name)
── Fleet Configuration ─────────────────────
Model:       Llama-3.1-70B (70.6 Bparam)
Fleet:       64 nodes x 8 GPUs = 512 GPUs
Intra node:  NVLink 4.0 (900 GB / s)
Inter node:  IB NDR (400 Gbps)
Region:      Quebec (Hydro)

2. Node (Walls 1–3): Single-GPU Roofline

First, classify the per-GPU forward-backward pass. Is each GPU compute-bound or memory-bound during training?

from mlsysim import SingleNodeModel

node_solver = SingleNodeModel()
node_result = node_solver.solve(
    model=model, hardware=h100,
    batch_size=4, precision="fp16"
)

banner("Domain: Node (Walls 1-3)")
info(Bottleneck=node_result.bottleneck,
     Per_GPU_latency=node_result.latency.to('ms'),
     Throughput=f"{node_result.throughput.to('1/s').magnitude:.0f} samples/s")

=== Domain: Node (Walls 1-3) ===
Bottleneck:       Memory
Per GPU latency:  2,214.3 ms
Throughput:       2 samples/s

At batch size 4 per GPU, the solver reports a memory bottleneck for this isolated forward-backward pass: a small per-GPU batch keeps arithmetic intensity low. Under the full training configuration, the backward pass and the large effective batch raise arithmetic intensity well above inference, and the sensitivity analysis in Section 7 confirms Wall 1 (Compute) as the binding constraint at the node level.

Compute-bound is good news — it means the GPU is doing useful work, not waiting for data. But can the data pipeline actually keep up with 512 GPUs demanding training samples?
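Under the hood, that bottleneck label is a roofline comparison: the kernel's arithmetic intensity (FLOP per byte moved) against the hardware ridge point. A minimal standalone sketch, using assumed H100 SXM peak specs rather than values read from mlsysim:

```python
# Minimal roofline classifier (standalone sketch; peak numbers are assumed
# H100 SXM specs, not values from mlsysim's hardware database).
PEAK_FLOPS = 989e12   # dense FP16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12     # HBM3 bandwidth, byte/s

# A kernel saturates compute only if its FLOP-per-byte ratio clears this:
ridge_point = PEAK_FLOPS / PEAK_BW
print(f"Ridge point: {ridge_point:.0f} FLOP/byte")

def regime(arithmetic_intensity):
    """Classify a kernel by its FLOP/byte relative to the ridge point."""
    if arithmetic_intensity >= ridge_point:
        return "compute-bound"
    return "memory-bound"
```

Large-batch training GEMMs typically sit far to the right of the ridge point, which is why training tends toward compute-bound where single-request inference does not.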


3. Data (Walls 8–10): Can the Pipeline Keep Up?

The roofline tells us each GPU can consume data at a certain rate. But can the storage and preprocessing pipeline actually deliver data that fast? If not, the GPUs stall — and “compute-bound” becomes a meaningless label.

from mlsysim import DataModel

# Estimate data demand per step: 4 samples/GPU * 512 GPUs * 2048 tokens * 2 bytes ≈ 8 MB/step
# At ~1 step/sec, this is ~8 MB/s — tokenized text is compact
data_demand = Q_("8 MB/s")

data_solver = DataModel()
data_result = data_solver.solve(
    workload_data_rate=data_demand,
    hardware=h100
)

banner("Domain: Data (Walls 8-10)")
info(Data_demand=data_result.demand_bw,
     Data_supply=data_result.supply_bw,
     Utilization=f"{data_result.utilization:.1%}",
     Stalled=data_result.is_stalled,
     Bottleneck=data_result.bottleneck)

=== Domain: Data (Walls 8-10) ===
Data demand:  8.00e-03 GB/s
Data supply:  7 GB/s
Utilization:  0.1%
Stalled:      False
Bottleneck:   Storage

For text-based training, the data pipeline is rarely the bottleneck — tokenized text is compact. But for image or video training, this wall can dominate.

The data pipeline can keep up. The GPUs are compute-bound and well-fed. But are we spending our compute budget wisely? A 30-day run on 512 GPUs is an enormous investment — the scaling laws tell us whether we are allocating it optimally.
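The demand figure in the comment above is simple arithmetic and worth verifying by hand; a standalone sketch using only numbers from this tutorial's configuration:

```python
# Reproduce the back-of-envelope data demand (standalone sketch;
# all numbers come from this tutorial's configuration).
samples_per_gpu = 4
n_gpus = 512
seq_len = 2048         # tokens per sample
bytes_per_token = 2    # tokenized text stored as 2-byte IDs

bytes_per_step = samples_per_gpu * n_gpus * seq_len * bytes_per_token
steps_per_second = 1.0          # ~1 optimizer step per second (approximation)
demand_bytes_s = bytes_per_step * steps_per_second

supply_bytes_s = 7e9            # storage supply reported by the solver above
utilization = demand_bytes_s / supply_bytes_s
print(f"{demand_bytes_s/1e6:.1f} MB/s demand, {utilization:.1%} of supply")
```

At roughly 0.1% utilization, storage has two to three orders of magnitude of headroom for this text workload.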


4. Algorithm (Walls 11–13): Compute-Optimal Budget

Is our training budget compute-optimal? The Chinchilla scaling law says D = 20P (tokens = 20x parameters) for optimal allocation.

from mlsysim import ScalingModel

# MFU (Model FLOP Utilization): the fraction of peak hardware FLOP/s that goes
# to useful model computation (excluding communication, idle time, overhead).
# MFU = 0.4 means 40% of theoretical peak -- typical for large-scale LLM training.
# Published values: 0.30-0.45 (Llama-2/3), up to 0.50 (highly optimized runs).
# Compute budget: 512 GPUs * 989 TFLOPs * 30 days * 86400s * 0.4 MFU
gpu_flops = h100.compute.peak_flops.to("flop/s").magnitude
total_flops = 512 * gpu_flops * 30 * 86400 * 0.4
compute_budget = Q_(total_flops, "flop")

scaling_solver = ScalingModel()
scaling_result = scaling_solver.solve(
    compute_budget=compute_budget,
    target_model_size=model.parameters
)

banner("Domain: Algorithm (Walls 11-13)")
info(Compute_budget=compute_budget.to('EFLOP'),
     Optimal_tokens=f"{scaling_result.optimal_tokens.magnitude:.2e}",
     Tokens_per_parameter=f"{scaling_result.tokens_per_parameter:.1f}",
     Chinchilla_ratio=f"{'OVER' if scaling_result.tokens_per_parameter > 20 else 'UNDER'}-trained")

=== Domain: Algorithm (Walls 11-13) ===
Compute budget:        525,002.3 EFLOP
Optimal tokens:        1.24e+12
Tokens per parameter:  17.6
Chinchilla ratio:      UNDER-trained

If the tokens-per-parameter ratio is significantly above or below 20, the training budget is not optimally allocated. Over-training wastes compute; under-training wastes model capacity.
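The solver's numbers can be reproduced by hand from the standard C ≈ 6·N·D approximation for training FLOPs; a standalone sketch using the same budget as above:

```python
# Check the ScalingModel output by hand via C ~ 6*N*D
# (standalone sketch; same budget arithmetic as the cell above).
C = 512 * 989e12 * 30 * 86400 * 0.4   # total useful FLOPs at 40% MFU
N = 70.6e9                            # Llama-3.1-70B parameter count
D = C / (6 * N)                       # tokens this budget can afford
ratio = D / N                         # Chinchilla-optimal is ~20
print(f"D = {D:.2e} tokens, {ratio:.1f} tokens/param")
```

This lands on the solver's 1.24e12 tokens and 17.6 tokens per parameter, just below the Chinchilla ratio of 20, consistent with the UNDER-trained verdict above.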

So far, everything looks manageable: compute-bound GPUs, adequate data pipeline, reasonable training budget. If we throw 512 GPUs at this, we should scale linearly, right? The fleet-level analysis reveals what single-node reasoning misses.


5. Fleet (Walls 14–16): Communication and Reliability

The distributed solver models AllReduce overhead and pipeline bubbles. The reliability solver computes cluster MTBF and optimal checkpoint intervals.

from mlsysim import DistributedModel, ReliabilityModel

# 3D parallelism: TP=8 (within node), PP=1, DP=64
dist_solver = DistributedModel()
dist_result = dist_solver.solve(
    model=model, fleet=fleet,
    batch_size=2048, precision="fp16",
    tp_size=8, pp_size=1,
    overlap_comm=True, seq_len=2048
)

banner("Domain: Fleet (Walls 14-16)")
info(Scaling_efficiency=f"{dist_result.scaling_efficiency:.2%}",
     Step_latency=dist_result.step_latency_total.to('ms'),
     DP_comm_latency=dist_result.dp_communication_latency.to('ms'),
     TP_comm_latency=dist_result.tp_communication_latency.to('ms'),
     Bubble_fraction=f"{dist_result.bubble_fraction:.2%}")

=== Domain: Fleet (Walls 14-16) ===
Scaling efficiency:  98.19%
Step latency:        6,765.3 ms
DP comm latency:     686.5 ms
TP comm latency:     19.61 ms
Bubble fraction:     0.00%
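The DP communication latency can be sanity-checked with a first-order ring AllReduce model. This standalone sketch assumes an effective 400 GB/s of inter-node bandwidth per node (8 NICs at 400 Gbit/s), which is an assumption about the fabric model rather than a value read from mlsysim:

```python
# First-order ring-AllReduce estimate for the data-parallel gradient sync
# (standalone sketch; the 8-NICs-per-node effective bandwidth is an
# assumption, not a value from mlsysim).
params = 70.6e9
grad_bytes = params * 2          # fp16 gradients, bytes per replica
dp = 64                          # one data-parallel replica per node
node_bw_bytes = 8 * 400e9 / 8    # 8 NICs x 400 Gbit/s, over 8 bits/byte

# Ring AllReduce moves 2*(n-1)/n of the buffer across the slowest link.
t_allreduce = 2 * (dp - 1) / dp * grad_bytes / node_bw_bytes
print(f"~{t_allreduce*1e3:.0f} ms per gradient AllReduce")
```

This lands within a few percent of the solver's 686.5 ms DP latency; the residual gap presumably reflects the solver's more detailed fabric and overlap model.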
# Reliability: 30-day training job
rel_solver = ReliabilityModel()
rel_result = rel_solver.solve(
    fleet=fleet,
    job_duration_hours=30*24,
    checkpoint_time_s=120
)

info(Fleet_MTBF=rel_result.fleet_mtbf.to('hour'),
     Failure_probability=f"{rel_result.failure_probability:.2%}",
     Expected_failures=f"{rel_result.expected_failures:.1f}",
     Optimal_ckpt_interval=rel_result.optimal_checkpoint_interval.to('minute'))
Fleet MTBF:             83.71 h
Failure probability:    99.98%
Expected failures:      8.6
Optimal ckpt interval:  141.7 min

At 512 GPUs, the cluster MTBF shrinks significantly. Checkpoint overhead becomes a non-trivial fraction of wall-clock time — this is the “hidden cost” that single-node analysis misses entirely.
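The reliability numbers above follow from an exponential failure model plus the Young-Daly interval, and are easy to reproduce by hand; a standalone sketch using the solver's MTBF:

```python
import math

# Reliability arithmetic behind the output above (standalone sketch:
# exponential failure model plus the Young-Daly checkpoint interval).
mtbf_h = 83.71        # fleet MTBF reported by the solver, hours
job_h = 30 * 24       # job duration, hours
ckpt_s = 120          # time to write one checkpoint, seconds

expected_failures = job_h / mtbf_h
p_any_failure = 1 - math.exp(-job_h / mtbf_h)
interval_s = math.sqrt(2 * ckpt_s * mtbf_h * 3600)   # Young-Daly optimum
overhead = ckpt_s / interval_s   # fraction of wall-clock spent checkpointing
print(f"{expected_failures:.1f} failures expected, "
      f"checkpoint every {interval_s/60:.0f} min ({overhead:.1%} overhead)")
```

At ~142-minute intervals, checkpoint writes alone cost about 1.4% of wall-clock time here, and since the interval scales with the square root of MTBF, every doubling of the fleet pushes that overhead up.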

The reliability analysis tells us HOW OFTEN the cluster fails. But failures cost money — and so does the energy to keep 512 GPUs running for 30 days. The operational domain quantifies these costs.


6. Ops (Walls 17–20): TCO, Energy, Carbon, Water

The economics solver combines CapEx, OpEx, and sustainability into a single financial model.

from mlsysim import EconomicsModel, SustainabilityModel

# 30-day training run
econ_solver = EconomicsModel()
econ_result = econ_solver.solve(
    fleet=fleet,
    duration_days=30,
    grid=Grids.Quebec,
    mfu=0.4
)

banner("Domain: Ops (Walls 17-20)")
info(CapEx=f"${econ_result.capex_usd:,.0f}",
     OpEx_energy=f"${econ_result.opex_energy_usd:,.0f}",
     OpEx_maintenance=f"${econ_result.opex_maintenance_usd:,.0f}",
     Total_TCO=f"${econ_result.tco_usd:,.0f}")

=== Domain: Ops (Walls 17-20) ===
CapEx:             $15,360,000
OpEx energy:       $19,038
OpEx maintenance:  $63,123
Total TCO:         $15,442,161
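The TCO decomposes into three terms that can be approximated by hand. In this standalone sketch, the per-GPU price, electricity rate, and maintenance rate are assumptions chosen for illustration (they land close to the solver's output, but its internal rates may differ):

```python
# Rough decomposition of the TCO above (standalone sketch; per-GPU price,
# electricity rate, and maintenance rate are all assumptions).
n_gpus = 512
capex = n_gpus * 30_000                  # ~$30k per installed H100 (assumed)
total_kwh = 158.6e3                      # total energy incl. PUE, from above
opex_energy = total_kwh * 0.12           # ~$0.12/kWh blended rate (assumed)
opex_maint = capex * 0.05 * (30 / 365)   # ~5% of CapEx per year (assumed)
tco = capex + opex_energy + opex_maint
print(f"TCO ~ ${tco:,.0f}")
```

The structural point survives any rate assumptions: for a 30-day run, CapEx dominates OpEx by nearly 200x, so utilization and amortization, not electricity, drive training cost.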
sust_solver = SustainabilityModel()
sust_result = sust_solver.solve(
    fleet=fleet,
    duration_days=30,
    datacenter=Grids.Quebec,
    mfu=0.4
)

info(IT_Energy=sust_result.it_energy_kwh.to('MWh'),
     Total_Energy_PUE=sust_result.total_energy_kwh.to('MWh'),
     Carbon_footprint=f"{sust_result.carbon_footprint_kg:.0f} kg CO2",
     Water_usage=f"{sust_result.water_usage_liters:.0f} liters",
     PUE=sust_result.pue,
     Region=sust_result.region_name)
IT Energy:         149.7 MWh
Total Energy PUE:  158.6 MWh
Carbon footprint:  3173 kg CO2
Water usage:       0 liters
PUE:               1.06
Region:            Quebec (Hydro)

Quebec’s hydroelectric grid makes this one of the lowest-carbon training locations in the world. The same run in Poland (coal-heavy grid) would produce dramatically more CO2 — infrastructure geography is a first-class engineering variable.
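The carbon comparison is a one-line calculation once total energy is known. In this standalone sketch the grid intensities are approximate public figures, not mlsysim registry values:

```python
# Carbon arithmetic behind the geography claim (standalone sketch; the
# grid intensities are approximate public figures, not mlsysim values).
it_energy_kwh = 149.7e3
pue = 1.06
total_kwh = it_energy_kwh * pue          # facility energy incl. cooling

g_per_kwh = {"Quebec (hydro)": 20, "Poland (coal-heavy)": 800}
carbon_kg = {k: total_kwh * v / 1000 for k, v in g_per_kwh.items()}
ratio = carbon_kg["Poland (coal-heavy)"] / carbon_kg["Quebec (hydro)"]
print(carbon_kg, f"ratio {ratio:.0f}x")
```

Because the energy term cancels, the Quebec-to-Poland ratio is just the ratio of grid carbon intensities: the same run, relocated, emits roughly 40x more CO2.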


7. Analysis (Walls 21–22): Sensitivity and Synthesis

Finally, confirm the binding constraint and derive minimum hardware for a 14-day completion target.

from mlsysim import SensitivitySolver, SynthesisSolver

# Sensitivity: confirm compute is the binding constraint for training
sens_solver = SensitivitySolver()
sens_result = sens_solver.solve(
    model=model, hardware=h100, precision="fp16"
)

banner("Domain: Analysis (Walls 21-22)")
info(Binding_constraint=sens_result.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in sens_result.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)

=== Domain: Analysis (Walls 21-22) ===
Binding constraint:  peak_flops
Parameter         Sensitivity
─────────────────────────────
peak_flops            +0.0000
memory_bandwidth      +0.0000
memory_capacity       +0.0000
# Synthesis: what per-GPU hardware does a 14-day completion target imply?
# wall_clock_seconds = total_training_FLOPs / (N_GPUs * MFU * peak_FLOPs);
# dividing the 14-day budget by the step count yields a per-step latency
# target, approximated here as 200 ms.
target_days = 14
target_seconds = target_days * 86400
synth_solver = SynthesisSolver()
synth_result = synth_solver.solve(
    model=model,
    target_latency=Q_("200 ms"),   # per-GPU training step target
    precision="fp16"
)

info("Synthesis (200ms per-GPU training step target)",
     Required_BW=synth_result.required_bw.to('TB/s'),
     Required_FLOPS=synth_result.required_flops.to('TFLOPs/s'),
     Required_memory=synth_result.required_memory.to('GB'))
── Synthesis (200ms per-GPU training step target) ──────
Required BW:      0.706 TB/s
Required FLOPS:   1.41 TFLOPs/s
Required memory:  141.2 GB
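Two of the three requirements follow directly from model size and can be checked by hand; this standalone sketch omits the FLOP requirement, which depends on the solver's internal per-step FLOP model:

```python
# Invert the 200 ms latency target by hand for the memory and bandwidth
# requirements (standalone sketch; FLOP requirement omitted).
params = 70.6e9
target_s = 0.200                    # 200 ms per-step latency target

req_memory_bytes = params * 2       # fp16 weights must be resident
req_bw_bytes_s = req_memory_bytes / target_s   # stream all weights once/step
print(f"{req_memory_bytes/1e9:.1f} GB, {req_bw_bytes_s/1e12:.3f} TB/s")
```

Both match the solver: 141.2 GB of resident fp16 weights, and 0.706 TB/s to stream them once within the 200 ms budget.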

8. Summary Table: The Complete Picture

We have now traced a single workload through all six domains. Each solver answered one question in isolation. But the systems engineer’s job is synthesis: seeing the complete picture at once. The table below is that picture — and its most important property is that no single row captures the full story.

mtbf_hours = rel_result.fleet_mtbf.to('hour').magnitude
summary_rows = [
    ["Node",      "SingleNodeModel",      f"Bottleneck: {node_result.bottleneck}",              "Wall 1: Compute"],
    ["Data",      "DataModel",            f"Util: {data_result.utilization:.0%}",                "Not binding"],
    ["Algorithm", "ScalingModel",         f"Tok/param: {scaling_result.tokens_per_parameter:.0f}","Wall 11"],
    ["Fleet",     "DistributedModel",     f"Efficiency: {dist_result.scaling_efficiency:.0%}",   "Wall 14: Comm"],
    ["Fleet",     "ReliabilityModel",     f"MTBF: {mtbf_hours:.0f}h",                           "Wall 19: Ckpt"],
    ["Ops",       "EconomicsModel",       f"TCO: ${econ_result.tco_usd:,.0f}",                  "Wall 17: Cost"],
    ["Ops",       "SustainabilityModel",  f"CO2: {sust_result.carbon_footprint_kg:.0f} kg",     "Wall 18: Energy"],
    ["Analysis",  "SensitivitySolver",     f"Binding: {sens_result.binding_constraint}",         "Wall 21"],
]

table(["Domain", "Solver", "Key Metric", "Binding Wall"], summary_rows, "<<>>")
Domain     Solver                        Key Metric     Binding Wall
────────────────────────────────────────────────────────────────────
Node       SingleNodeModel       Bottleneck: Memory  Wall 1: Compute
Data       DataModel                       Util: 0%      Not binding
Algorithm  ScalingModel               Tok/param: 18          Wall 11
Fleet      DistributedModel         Efficiency: 98%    Wall 14: Comm
Fleet      ReliabilityModel               MTBF: 84h    Wall 19: Ckpt
Ops        EconomicsModel          TCO: $15,442,161    Wall 17: Cost
Ops        SustainabilityModel         CO2: 3173 kg  Wall 18: Energy
Analysis   SensitivitySolver    Binding: peak_flops          Wall 21
Important: Key Insight

No single solver captures the full picture — the systems view emerges from composition. This end-to-end trace exercises 12 of 22 walls through a single model. The per-GPU binding constraint is compute (Wall 1), but the hidden costs only appear at fleet scale: checkpoint overhead (Wall 19) consumes wall-clock time proportional to the MTBF-driven checkpoint frequency, and infrastructure geography (Quebec vs. Poland) can change the carbon footprint by 40x (as Tutorial 7 demonstrated). A complete systems analysis is not one solver run — it is the composition of all six domains.


Your Turn

Caution: Exercises

Exercise 1: Predict before you compute. What if you train in Poland instead of Quebec? Before running code, predict how the TCO and carbon footprint will change. (Hint: Poland’s grid is coal-heavy with ~800 g CO2/kWh vs. Quebec’s ~20 g CO2/kWh, and Poland has a higher PUE.) Then re-run the economics and sustainability solvers with Grids.Poland and compare. How close was your prediction?

Exercise 2: Double the cluster. Scale the fleet to 1024 GPUs (128 nodes). Re-run the distributed solver and reliability solver. Does scaling efficiency hold? How does the MTBF change? At what cluster size does the checkpoint overhead exceed 5% of wall-clock time?

Exercise 3: Minimum viable cluster. What is the minimum cluster size to complete Llama-3 70B training in 14 days? Use the scaling result to determine the required total FLOPS, then work backward to find the number of H100 GPUs needed at 40% MFU. Verify with the distributed solver that the communication overhead is acceptable at that scale.

Exercise 4: Propose a design change. Using the full-stack analysis, identify the single highest-leverage change — hardware upgrade, parallelism strategy, region change, or precision change — that would reduce TCO by at least 20%. Re-run the relevant solvers with your proposed change and compute the new TCO. Write one paragraph justifying why this change has the largest impact, referencing at least two domains from the summary table.

Self-check: If the fleet MTBF is 4 hours and each checkpoint takes 2 minutes, what fraction of wall-clock time is spent checkpointing? (Use the Young-Daly formula: optimal interval = sqrt(2 * delta * MTBF).)


Key Takeaways

Tip: Summary
  • Composition is the method: no single solver spans all six taxonomy domains; the systems view emerges only from composing 6+ solvers
  • Compute binds at the node level, but checkpoint overhead and communication are the hidden costs at fleet scale
  • Infrastructure geography matters: Quebec vs. Poland can change carbon footprint by 40x and TCO by 20–30%
  • The summary table is the deliverable: one row per domain, solver, key metric, and binding wall
  • 12 of 22 walls are exercised through a single model-fleet pair — this is what a complete analysis looks like
