System Assumptions
Purpose
What assumptions sit underneath the distributed systems calculations throughout this book?
Every quantitative example in this volume, from cluster failure rates to collective-communication costs to carbon footprint estimates, uses a specific set of assumed numbers: canonical cluster sizes, fabric bandwidths, accelerator reliability, power usage effectiveness, regional carbon intensity, model utilization ranges, and dozens more. This appendix collects those fleet-scale assumptions in one place so the book’s arithmetic can be audited, checked against chapter napkin math, or substituted with local values. The tables cover node-level accelerator specs, cluster tiers, network fabrics, reliability and recovery parameters, communication models, sustainability, cloud economics, capacity-planning utilization, and unit conventions. A provenance catalog ties each assumption back to vendor datasheets, fleet studies, grid reports, or book conventions. In C³ terms, the assumptions keep compute capacity, communication cost, and coordination overhead measured on the same scale across chapters.
How to Use This Appendix
Use this appendix when you want to verify a fleet-scale estimate, swap in an alternative assumption (for example, NDR vs. HDR fabric, Quebec vs. Poland carbon intensity, or a different MFU), or see which numbers a chapter’s calculation depends on. Find the relevant section, read the Value and Unit columns, and plug them into your own arithmetic.
A short napkin-math callout at the start shows how several assumptions combine (failure rate, AllReduce time, carbon footprint). Hardware and link bandwidths are ceilings unless a chapter states an effective-bandwidth discount. Treat the tables as this edition’s shared assumption sheet—not immutable physical laws.
Napkin Math 1.1: Quick calculations with these assumptions
Problem: Estimate how many GPU failures a cluster should expect per day.
Variables: Use an 8,192-GPU cluster and an individual GPU MTTF of 50,000 hours.
Math: The GPU failure rate is 8,192 GPUs/50,000 hours \(\approx\) 0.16 failures/hour, or roughly 3.9 failures/day.
Result: Expect about 3.9 GPU failures per day. This estimate excludes NIC, PSU, and cable failures. At 100,000 GPUs, the GPU-only rate is approximately 48 failures/day; adding NIC, PSU, and cable failures can roughly double the operational incident stream.
Systems insight: At fleet scale, failure becomes a continuous background process rather than an exceptional event.
Problem: Estimate how long an AllReduce takes for a 70B model.
Variables: Use BF16 parameters and InfiniBand NDR with 50 GB/s effective bandwidth per port.
Math: The gradient payload is 70 \(\times 10^9 \times\) 2 bytes = 140 GB. The ring AllReduce napkin estimate is 2 \(\times\) 140 GB/50 GB/s \(\approx\) 5.6 s.
Result: The collective can consume a significant fraction of a training step.
Systems insight: Pipeline and tensor parallelism exist partly to shrink the gradient volume each rank must communicate.
Problem: Estimate the carbon footprint of a 10,000-GPU training run.
Variables: Each H100 draws 700 W at TDP. Use PUE 1.12, Quebec grid intensity 20 g/kWh, and Poland grid intensity 820 g/kWh.
Math: Total facility power is 10,000 GPUs \(\times\) 700 W \(\times\) 1.12 = 7.84 MW. In Quebec, 7.84 MW \(\times\) 720 h/month \(\times\) 20 g/kWh = 112.9 t CO\(_2\) per month.
Result: The same run in Poland emits 4,628.7 t CO\(_2\) per month, a 41× difference.
Systems insight: Grid carbon intensity can dominate the emissions difference for the same hardware, model, and training duration.
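The three estimates in the callout above can be reproduced in a few lines of Python. This is a minimal sketch rather than code from the book: the function names are illustrative, the constants are copied from the tables in this appendix, and failures are assumed to be independent with a constant rate.

```python
# Napkin-math sketch. Constants come from this appendix's tables;
# function names are illustrative, not part of any library.
GPU_MTTF_HOURS = 50_000      # per-GPU MTTF
NDR_BYTES_PER_S = 50e9       # InfiniBand NDR byte rate
H100_TDP_W = 700             # per-GPU power at TDP
PUE_BEST_AIR = 1.12          # best air-cooled PUE
HOURS_PER_MONTH = 720        # 30-day month, as in the callout


def gpu_failures_per_day(num_gpus: int, mttf_hours: float = GPU_MTTF_HOURS) -> float:
    """Expected GPU-only failures per day (independent, constant-rate failures)."""
    return num_gpus / mttf_hours * 24


def allreduce_seconds(params: float, bytes_per_param: int = 2,
                      bandwidth: float = NDR_BYTES_PER_S) -> float:
    """Large-rank ring AllReduce estimate: 2 * payload / bandwidth."""
    return 2 * params * bytes_per_param / bandwidth


def monthly_tonnes_co2(num_gpus: int, grid_g_per_kwh: float,
                       pue: float = PUE_BEST_AIR) -> float:
    """One month of facility energy times grid intensity, in tonnes of CO2."""
    facility_kw = num_gpus * H100_TDP_W * pue / 1000
    grams = facility_kw * HOURS_PER_MONTH * grid_g_per_kwh
    return grams / 1e6


if __name__ == "__main__":
    print(f"8,192 GPUs:   {gpu_failures_per_day(8_192):.1f} failures/day")
    print(f"100,000 GPUs: {gpu_failures_per_day(100_000):.1f} failures/day")
    print(f"70B BF16 AllReduce: {allreduce_seconds(70e9):.1f} s")
    print(f"Quebec: {monthly_tonnes_co2(10_000, 20):.1f} t/month")
    print(f"Poland: {monthly_tonnes_co2(10_000, 820):.1f} t/month")
```

Swapping in a different MTTF, fabric bandwidth, PUE, or grid intensity is a matter of changing one argument, which is the kind of substitution this appendix is meant to support.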
Foundational Hardware Recap
Distributed-systems reasoning in this book builds on single-node performance bounds. Table 1 recaps the H100 assumptions used as the primary accelerator reference in this volume.
| Category | Assumption | Value |
|---|---|---|
| Compute | H100 FP16 peak throughput | 989 TFLOP/s |
| Compute | H100 FP8 peak throughput | 1,979 TFLOP/s |
| Memory | HBM3 bandwidth | 3.35 TB/s |
| Memory | HBM3 capacity | 80 GB |
| Thermal | TDP | 700 W |
Cluster Reference Configurations
This book uses four canonical cluster tiers (256, 2,048, 8,192, and 100,000 GPUs) to illustrate how system behavior changes across scale. They are editorial reference points aligned with common research-lab, production, large-training, and hyperscale fleets—not measurements from a single deployment. Table 2 defines GPU counts; chapters derive node counts, failure rates, and network requirements from these baselines and the assumptions in subsequent tables.
| Assumption | Value | Unit |
|---|---|---|
| Small cluster GPU count (256 tier) | 256 | GPUs |
| Medium cluster GPU count (2,048 tier) | 2048 | GPUs |
| Large cluster GPU count (8,192 tier) | 8192 | GPUs |
| Mega cluster GPU count (100,000 tier) | 100000 | GPUs |
The relationship between cluster size and system behavior is not linear. A 256-GPU cluster might go days between failures; a 100,000-GPU cluster experiences multiple failures per hour. This nonlinearity is why the chapters on fault tolerance (Fault Tolerance) and fleet orchestration (Fleet Orchestration) treat failure as a continuous background process rather than an exceptional event.
With cluster sizes established, the next question is: what connects the machines? The network fabric determines whether gradient synchronization takes milliseconds or seconds, and whether pipeline parallelism is viable across node boundaries.
Network and Interconnect Specifications
Distributed training and inference are bottlenecked by data movement between machines as often as by computation within them. Network assumptions fix bandwidth and latency for every interconnect tier—from NVLink/PCIe within a node (NVIDIA Corporation 2017, 2020; Choquette 2023), through InfiniBand, RDMA, and Ethernet fabrics (InfiniBand Trade Association 2000; Gangidi et al. 2024; NVIDIA 2026; Ethernet Alliance 2025), to the speed-of-light WAN floor (physics identity).
Intra-node interconnects
Within a single node, GPUs communicate over NVLink or PCIe. These bandwidths determine whether tensor parallelism (which requires high-bandwidth, low-latency communication) is viable within the node boundary. Table 3 collects the intra-node interconnect bandwidths and latencies reused throughout the fleet-scale chapters.
| Assumption | Value | Unit |
|---|---|---|
| NVLink bandwidth (V100) | 300 | GB/s |
| NVLink bandwidth (A100) | 600 | GB/s |
| NVLink bandwidth (H100) | 900 | GB/s |
| PCIe Gen3 bandwidth (V100 host) | 15.75 | GB/s |
| PCIe Gen4 bandwidth (A100 host) | 32 | GB/s |
| PCIe Gen5 bandwidth (H100 host) | 64 | GB/s |
| NVLink one-way latency | 500 | ns |
| PCIe Gen5 one-way latency | 1000 | ns |
Inter-node interconnects
Once data crosses the node boundary, bandwidth drops by an order of magnitude (for example, from 900 GB/s on H100 NVLink to 50 GB/s on InfiniBand NDR) and latency increases by 5–10\(\times\) (from 500–1,000 ns within the node to 5,000 ns across InfiniBand). This cliff shapes every decision about parallelism strategy: operations that require frequent, fine-grained communication (tensor parallelism) stay within the node, while operations with coarser communication patterns (data parallelism, pipeline parallelism) span nodes. Table 4 lists both the bit-rate and byte-rate forms of each fabric, since different chapters use different conventions.
| Assumption | Value | Unit |
|---|---|---|
| InfiniBand HDR link bandwidth (bit rate) | 200 | Gb/s |
| InfiniBand HDR bandwidth (byte rate) | 25 | GB/s |
| InfiniBand NDR link bandwidth (bit rate) | 400 | Gb/s |
| InfiniBand NDR bandwidth (byte rate) | 50 | GB/s |
| InfiniBand XDR bandwidth (byte rate) | 100 | GB/s |
| 400 GbE bandwidth (byte rate) | 50 | GB/s |
| 800 GbE bandwidth (byte rate) | 100 | GB/s |
| 100 GbE RoCE bandwidth (byte rate) | 12.5 | GB/s |
| InfiniBand one-way latency | 5000 | ns |
Wide-area network
Cross-region communication introduces latency floors set by the speed of light in optical fiber. For a 5000 km link (roughly New York to London), the minimum one-way latency is 5000 km/200,000 km/s = 25 ms—orders of magnitude higher than intra-data-center latency. This physical constraint is why geo-distributed training uses asynchronous methods or careful placement of pipeline stages, and why federated learning (Edge Intelligence) tolerates stale gradients.
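The latency floor above is a one-line division, sketched here so other distances can be substituted. The function name is illustrative; the 200,000 km/s fiber propagation speed and the 5,000 ns InfiniBand latency are the values used in this appendix.

```python
# Speed-of-light latency floor for a fiber link (sketch; names are illustrative).
FIBER_KM_PER_S = 200_000          # propagation speed in fiber, from the text
INFINIBAND_ONE_WAY_S = 5_000e-9   # InfiniBand one-way latency from table 4


def wan_one_way_seconds(distance_km: float,
                        fiber_km_per_s: float = FIBER_KM_PER_S) -> float:
    """Minimum one-way latency over fiber, ignoring switching and routing."""
    return distance_km / fiber_km_per_s


if __name__ == "__main__":
    floor_s = wan_one_way_seconds(5_000)
    print(f"one-way floor: {floor_s * 1e3:.1f} ms")
    print(f"ratio to InfiniBand: {floor_s / INFINIBAND_ONE_WAY_S:.0f}x")
```

For the 5,000 km example the floor is 25 ms, about 5,000 times the in-fabric InfiniBand latency assumed in table 4. Real paths are longer than the great-circle distance, so measured latency will be higher.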
Network specifications define how fast data moves, but they say nothing about how often the machines sending that data fail. The next section quantifies the reliability of individual components—the building blocks from which cluster-level failure rates are derived.
Reliability Assumptions
At fleet scale, component failures are not exceptional events but statistical certainties. Reliability assumptions here fix Mean Time To Failure (MTTF) for each major component class in a data-center GPU node and the time to detect and recover from failures. These values drive analysis in Fault Tolerance and scheduling in Fleet Orchestration; deeper failure modeling appears in Reliability Foundations.
Component MTTF
Table 5 lists the MTTF for each component class. These are steady-state values that exclude the “infant mortality” period (first 30–90 days) and the “wear-out” period (beyond rated lifetime). Sources include a large-GPU research-cluster analysis (Kokolis et al. 2025), a study of TPUv4 supercomputer resiliency and operations (Zu et al. 2024), and warehouse-scale machine design (Barroso et al. 2019).
| Assumption | Value | Unit |
|---|---|---|
| GPU MTTF | 50000 | hours |
| NIC MTTF | 150000 | hours |
| PSU MTTF | 100000 | hours |
| PCIe switch MTTF | 200000 | hours |
| HBM MTTF | 200000 | hours |
| Top-of-rack switch MTTF | 300000 | hours |
| Optical cable and transceiver MTTF | 50000 | hours |
Recovery time assumptions
Failure detection and recovery introduce downtime that compounds with cluster size. Table 6 lists the assumptions used in checkpoint interval optimization (the Young-Daly formula in The Young-Daly law: Optimal checkpointing) and effective throughput calculations throughout this book.
| Assumption | Value | Unit |
|---|---|---|
| Failure detection timeout (heartbeat) | 30 | seconds |
| Job reschedule delay | 60 | seconds |
| Checkpoint write bandwidth (aggregate) | 100 | GB/s |
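To show how these rows feed a checkpoint-interval estimate, the sketch below uses the first-order Young approximation, \(\tau \approx \sqrt{2 \delta M}\), where \(\delta\) is the checkpoint write time and \(M\) is the cluster mean time between failures. The exact form used in the chapter may differ. The 1 TB checkpoint size is a hypothetical input chosen only for illustration, and the function names are not from any library.

```python
import math

# Values from tables 5 and 6 of this appendix.
GPU_MTTF_HOURS = 50_000
CHECKPOINT_WRITE_BYTES_PER_S = 100e9   # aggregate write bandwidth


def cluster_mtbf_seconds(num_gpus: int, mttf_hours: float = GPU_MTTF_HOURS) -> float:
    """GPU-only mean time between failures for the whole cluster."""
    return mttf_hours * 3600 / num_gpus


def checkpoint_write_seconds(checkpoint_bytes: float,
                             bandwidth: float = CHECKPOINT_WRITE_BYTES_PER_S) -> float:
    """Time to write one checkpoint at the aggregate write bandwidth."""
    return checkpoint_bytes / bandwidth


def young_interval_seconds(write_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * write time * MTBF)."""
    return math.sqrt(2 * write_s * mtbf_s)


if __name__ == "__main__":
    checkpoint_bytes = 1e12            # HYPOTHETICAL 1 TB checkpoint
    mtbf_s = cluster_mtbf_seconds(8_192)
    write_s = checkpoint_write_seconds(checkpoint_bytes)
    interval_s = young_interval_seconds(write_s, mtbf_s)
    print(f"cluster MTBF: {mtbf_s / 3600:.1f} h")
    print(f"checkpoint write: {write_s:.0f} s")
    print(f"checkpoint interval: {interval_s / 60:.1f} min")
```

With these inputs the cluster MTBF is about 6.1 hours and the interval comes out near 11 minutes. Because the interval scales with the square root of both inputs, quadrupling the cluster size halves it.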
Reliability constants quantify how often failures occur and how long recovery takes. The cost of each failure, however, depends on how much communication must be repeated—which brings us to the parameters that model communication overhead.
Communication Model Parameters
This book models communication cost using the \(\alpha\)-\(\beta\) framework: \(T(n) = \alpha + n/\beta\), where \(\alpha\) is startup latency and \(\beta\) is sustained bandwidth (Hockney 1994). Table 7 lists values from fabric specifications, vendor product documentation, and large-cluster training studies (InfiniBand Trade Association 2000; Gangidi et al. 2024; NVIDIA 2026; Ethernet Alliance 2025; Jiang et al. 2024). They feed collective-communication models in Collective Communication and parallelism comparisons in Distributed Training.
| Assumption | Value | Unit |
|---|---|---|
| InfiniBand NDR startup latency (\(\alpha\)) | 5 | us |
| InfiniBand NDR sustained bandwidth (\(\beta\)) | 50 | GB/s |
| InfiniBand HDR startup latency (\(\alpha\)) | 7 | us |
| InfiniBand HDR sustained bandwidth (\(\beta\)) | 25 | GB/s |
| RoCE startup latency (\(\alpha\)) | 10 | us |
| RoCE sustained bandwidth (\(\beta\)) | 12.5 | GB/s |
| TCP startup latency (\(\alpha\)) | 50 | us |
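The \(\alpha\)-\(\beta\) model is simple enough to evaluate directly. The sketch below applies \(T(n) = \alpha + n/\beta\) to the three fabrics in table 7 that list both parameters. The crossover size \(n = \alpha\beta\), where the latency and bandwidth terms are equal, is a derived quantity added here for illustration, and the names are not from any library.

```python
# alpha-beta transfer-time sketch using table 7 values.
# Each entry is (alpha in seconds, beta in bytes/second).
FABRICS = {
    "InfiniBand NDR": (5e-6, 50e9),
    "InfiniBand HDR": (7e-6, 25e9),
    "RoCE": (10e-6, 12.5e9),
}


def transfer_seconds(n_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """T(n) = alpha + n / beta."""
    return alpha_s + n_bytes / beta_bytes_per_s


def crossover_bytes(alpha_s: float, beta_bytes_per_s: float) -> float:
    """Message size at which the latency term equals the bandwidth term."""
    return alpha_s * beta_bytes_per_s


if __name__ == "__main__":
    for name, (alpha, beta) in FABRICS.items():
        small = transfer_seconds(4e3, alpha, beta)    # 4 KB message
        large = transfer_seconds(1e9, alpha, beta)    # 1 GB message
        print(f"{name}: 4 KB {small * 1e6:.2f} us, 1 GB {large * 1e3:.2f} ms, "
              f"crossover {crossover_bytes(alpha, beta) / 1e3:.0f} KB")
```

Messages well below the crossover size are latency-bound, so a faster link barely helps; messages well above it are bandwidth-bound, so \(\alpha\) is negligible.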
AllReduce cost model
The ring AllReduce—the most common collective for gradient synchronization—transmits \(2(N_{\text{rank}}-1)M/N_{\text{rank}}\) bytes per rank, where \(N_{\text{rank}}\) is the number of ranks and \(M\) is the message size. For large \(N_{\text{rank}}\), this approaches \(2M\) bytes per rank. Table 8 lists the constants used in AllReduce cost estimation.
Systems.Nodes.DGX_H100 (8 GPUs per node) sets the intra-node versus inter-node boundary in hierarchical collectives.
| Assumption | Value | Unit |
|---|---|---|
| Ring AllReduce bandwidth factor | 2 | (dimensionless) |
| GPUs per node | 8 | GPUs/node |
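The per-rank traffic formula can be checked numerically. This sketch evaluates \(2(N_{\text{rank}}-1)M/N_{\text{rank}}\) for a 140 GB payload and divides by the NDR byte rate. It models the bandwidth term only and ignores per-step startup latency; the function names are illustrative.

```python
# Ring AllReduce traffic sketch (bandwidth term only; names are illustrative).
NDR_BYTES_PER_S = 50e9   # InfiniBand NDR byte rate from table 7


def ring_allreduce_bytes_per_rank(message_bytes: float, num_ranks: int) -> float:
    """Bytes each rank sends in a ring AllReduce: 2 * (N - 1) * M / N."""
    return 2 * (num_ranks - 1) * message_bytes / num_ranks


def ring_allreduce_seconds(message_bytes: float, num_ranks: int,
                           bandwidth: float = NDR_BYTES_PER_S) -> float:
    """Bandwidth-only time estimate; per-step latency is ignored."""
    return ring_allreduce_bytes_per_rank(message_bytes, num_ranks) / bandwidth


if __name__ == "__main__":
    payload = 140e9   # 70B parameters in BF16
    for ranks in (8, 256, 8_192):
        seconds = ring_allreduce_seconds(payload, ranks)
        print(f"{ranks:>5} ranks: {seconds:.3f} s")
```

At 8 ranks the estimate is 4.9 s, and it converges toward the \(2M/\beta = 5.6\) s limit as the rank count grows, which is why the napkin form drops the \((N_{\text{rank}}-1)/N_{\text{rank}}\) factor.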
Communication models quantify the overhead of coordination. The sustainability cost of that coordination—the power drawn, the water consumed, the carbon emitted—requires separate assumptions for the physical infrastructure surrounding the compute.
Sustainability Assumptions
The environmental impact of fleet-scale ML depends on three factors: how much power the facility draws beyond the IT equipment (PUE), how much water the cooling system consumes (WUE), and how much carbon the local grid emits per kilowatt-hour. Sustainability assumptions here support Sustainable AI and total-cost-of-ownership estimates in ML Operations at Scale.
Power usage effectiveness
Power Usage Effectiveness (PUE) is the ratio of total facility power to IT equipment power. A PUE of 1.0 would mean that every watt entering the facility reaches IT equipment, with zero overhead for cooling or power distribution; real data centers range from 1.06 (liquid-cooled) to 1.58 (legacy air-cooled). Table 9 lists the reference PUE values used throughout this book.
| Assumption | Value | Unit |
|---|---|---|
| PUE (liquid-cooled, best in class) | 1.06 | (ratio) |
| PUE (best air-cooled) | 1.12 | (ratio) |
| PUE (industry average) | 1.4 | (ratio) |
| PUE (legacy air-cooled) | 1.58 | (ratio) |
Water usage effectiveness
Water Usage Effectiveness (WUE) measures liters of water consumed per kilowatt-hour of IT energy (The Green Grid 2011). Evaporative cooling towers achieve excellent PUE but consume significant water; closed-loop liquid cooling uses near-zero water but requires higher capital investment. Table 10 lists the reference WUE values.
| Assumption | Value | Unit |
|---|---|---|
| WUE (air-cooled) | 1.8 | L/kWh |
| WUE (evaporative cooling) | 1.8 | L/kWh |
| WUE (closed-loop liquid) | 0 | L/kWh |
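Because WUE is defined per kilowatt-hour of IT energy, the water estimate multiplies IT energy, not facility energy, by the table value. The sketch below applies that to the 10,000-GPU month used earlier in this appendix. It counts GPU power only, as the carbon example does, and the function names are illustrative.

```python
# Water-use sketch: IT energy (kWh) times WUE (L/kWh). Names are illustrative.
H100_TDP_W = 700
HOURS_PER_MONTH = 720


def it_energy_kwh(num_gpus: int, watts_per_gpu: float, hours: float) -> float:
    """IT energy in kWh, counting GPU power only (no PUE multiplier)."""
    return num_gpus * watts_per_gpu / 1000 * hours


def water_liters(it_kwh: float, wue_l_per_kwh: float) -> float:
    """Water consumed for a given IT energy at a given WUE."""
    return it_kwh * wue_l_per_kwh


if __name__ == "__main__":
    energy = it_energy_kwh(10_000, H100_TDP_W, HOURS_PER_MONTH)
    print(f"IT energy: {energy:,.0f} kWh")
    print(f"evaporative (1.8 L/kWh): {water_liters(energy, 1.8):,.0f} L")
    print(f"closed-loop (0 L/kWh):   {water_liters(energy, 0.0):,.0f} L")
```

At 1.8 L/kWh the month consumes roughly 9 million liters, while the closed-loop row gives zero by construction.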
Regional carbon intensity
The same training run emits vastly different amounts of CO\(_2\) depending on the grid that supplies its electricity. Table 11 lists regional grid carbon intensities from IEA and Canadian government emissions-factor data (International Energy Agency 2024; Environment and Climate Change Canada 2026), rounded for napkin math. These values are central to location-aware scheduling in Sustainable AI.
| Assumption | Value | Unit |
|---|---|---|
| Grid carbon intensity (Norway) | 10 | gCO2/kWh |
| Grid carbon intensity (Quebec) | 20 | gCO2/kWh |
| Grid carbon intensity (France) | 50 | gCO2/kWh |
| Grid carbon intensity (EU average) | 270 | gCO2/kWh |
| Grid carbon intensity (US average) | 429 | gCO2/kWh |
| Grid carbon intensity (Poland) | 820 | gCO2/kWh |
Power density
AI accelerator racks draw 6–8\(\times\) more power than traditional data center racks, creating thermal density challenges that force the transition from air cooling to liquid cooling. Table 12 lists the reference rack power levels used in Compute Infrastructure and Sustainable AI, anchored to current AI power-and-cooling product disclosures (Dell Technologies 2026; Schneider Electric 2024).
| Assumption | Value | Unit |
|---|---|---|
| Rack power (traditional enterprise) | 12 | kW |
| Rack power (AI, typical) | 70 | kW |
| Rack power (AI, high density) | 100 | kW |
| Air-cooling practical limit per rack | 30 | kW |
Sustainability constants capture the physical and environmental costs of running a fleet. The economic constants in the next section translate those physical costs into dollar values, completing the total-cost-of-ownership picture.
Economic Assumptions
Fleet-scale cost estimates depend on GPU rental rates, electricity pricing, and data transfer charges. These are illustrative hyperscaler-order rates (2024–2025 magnitude) for ratio analysis, not quotes for a specific contract. Table 13 supports ML Operations at Scale and Inference at Scale TCO napkin math.
| Assumption | Value | Unit |
|---|---|---|
| Cloud GPU training rental rate | 4 | dollar/h |
| Cloud GPU inference rental rate | 2.5 | dollar/h |
| Cloud electricity price | 0.12 | dollar/kWh |
| Cloud egress price per GB | 0.09 | dollar/GB |
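A short sketch shows the ratio analysis these rates are meant for: comparing rental cost against electricity cost for the same run. The rates are the illustrative values from table 13, the 700 W and 1.12 PUE figures come from earlier tables, and the function names and the 8,192-GPU, 720-hour scenario are assumptions made for this example.

```python
# Cost-ratio sketch using the illustrative rates in table 13.
TRAIN_RATE_PER_GPU_HOUR = 4.0    # dollar/h
ELECTRICITY_PER_KWH = 0.12       # dollar/kWh
EGRESS_PER_GB = 0.09             # dollar/GB


def rental_cost(num_gpus: int, hours: float,
                rate: float = TRAIN_RATE_PER_GPU_HOUR) -> float:
    """Cloud rental cost: GPUs x hours x hourly rate."""
    return num_gpus * hours * rate


def electricity_cost(num_gpus: int, hours: float, watts_per_gpu: float = 700,
                     pue: float = 1.12, price: float = ELECTRICITY_PER_KWH) -> float:
    """Facility electricity cost: GPU power x PUE x hours x price per kWh."""
    facility_kwh = num_gpus * watts_per_gpu / 1000 * pue * hours
    return facility_kwh * price


def egress_cost(gigabytes: float, price: float = EGRESS_PER_GB) -> float:
    """Data transfer charge for moving data out of the cloud."""
    return gigabytes * price


if __name__ == "__main__":
    rent = rental_cost(8_192, 720)
    power = electricity_cost(8_192, 720)
    print(f"rental:      ${rent:,.0f}")
    print(f"electricity: ${power:,.0f}")
    print(f"rental / electricity: {rent / power:.1f}x")
```

Under these assumptions the rental bill is roughly 40 times the electricity bill, which is the kind of ratio, rather than the absolute dollar amount, that the table is intended to support.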
Economic constants set the price per unit of resource. The next question is: what fraction of each resource unit does useful work? Capacity planning constants quantify the gap between peak capability and achievable throughput.
Capacity Planning Assumptions
Capacity-planning assumptions quantify two related phenomena: (1) the fraction of peak hardware performance that real workloads achieve (Model FLOPs Utilization), and (2) how efficiently that performance scales as more machines are added (scaling efficiency). Together they determine the effective compute available for a given cluster size and budget—the number that matters for project planning.
The planning equation is multiplicative: \[ R_{\text{eff}} = N R_{\text{peak}} \times \text{MFU} \times \eta_{\text{scaling}} \times (1 - f_{\text{overhead}}) \]
Here \(R_{\text{eff}}\) is effective fleet throughput, \(N R_{\text{peak}}\) is nominal cluster peak throughput, MFU discounts peak hardware to realized model throughput, \(\eta_{\text{scaling}}\) discounts added machines for communication and coordination losses, and \(f_{\text{overhead}}\) removes wall time spent on pipeline bubbles, checkpoints, failure recovery, and maintenance. The three subsections that follow fix those factors in order.
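As a sketch of how the planning equation is evaluated, the code below plugs in values from this appendix: 8,192 H100s at 989 TFLOP/s peak, the high-end training MFU of 0.5, the 8,192-GPU scaling efficiency of 0.35, and the four overhead fractions summed into \(f_{\text{overhead}}\). The function name and the choice of the high-end MFU are assumptions made for this example.

```python
from typing import Iterable


def effective_throughput(num_gpus: int, peak_flops: float, mfu: float,
                         scaling_eff: float,
                         overhead_fractions: Iterable[float]) -> float:
    """R_eff = N * R_peak * MFU * eta_scaling * (1 - f_overhead)."""
    f_overhead = sum(overhead_fractions)   # overhead fractions are additive
    return num_gpus * peak_flops * mfu * scaling_eff * (1 - f_overhead)


if __name__ == "__main__":
    num_gpus = 8_192
    peak = 989e12                          # H100 FP16 peak, FLOP/s
    overheads = (0.05, 0.03, 0.10, 0.05)   # bubbles, checkpoints, recovery, maintenance
    r_eff = effective_throughput(num_gpus, peak, mfu=0.5, scaling_eff=0.35,
                                 overhead_fractions=overheads)
    nominal = num_gpus * peak
    print(f"nominal peak: {nominal / 1e18:.2f} EFLOP/s")
    print(f"effective:    {r_eff / 1e18:.2f} EFLOP/s")
    print(f"fraction of nominal: {r_eff / nominal:.1%}")
```

With these inputs only about 13 percent of nominal peak survives the three discounts, since 0.5 × 0.35 × 0.77 ≈ 0.135. Because the factors multiply, improving the smallest one yields the largest relative gain.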
Model FLOPs utilization
Model FLOPs Utilization (MFU) measures the ratio of actual compute throughput to the hardware’s peak theoretical throughput. An MFU of 0.50 means the workload achieves half the peak FLOP/s. Table 14 lists the reference MFU ranges used in training time and cost estimates throughout this book.
| Assumption | Value | Unit |
|---|---|---|
| Training MFU (low end) | 0.3 | (fraction) |
| Training MFU (high end) | 0.5 | (fraction) |
| Inference MFU (batch size 1) | 0.05 | (fraction) |
| Inference MFU (batched) | 0.4 | (fraction) |
Scaling efficiency
Scaling efficiency \(\eta_{\text{scaling}} = T_1/(N \times T_N)\), where \(T_1\) is the training time on one GPU and \(T_N\) is the training time on \(N\) GPUs, measures how much of the added compute actually reduces training time. Table 15 lists illustrative efficiency tiers aligned with published LLM training at scale (Chowdhery et al. 2022; Jiang et al. 2024); actual values depend on workload and fabric.
| Assumption | Value | Unit |
|---|---|---|
| Scaling efficiency (32 GPUs) | 0.9 | (fraction) |
| Scaling efficiency (256 GPUs) | 0.7 | (fraction) |
| Scaling efficiency (1,024 GPUs) | 0.5 | (fraction) |
| Scaling efficiency (8,192 GPUs) | 0.35 | (fraction) |
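Rearranging the definition gives the effective speedup \(T_1/T_N = N \times \eta_{\text{scaling}}\), which is often easier to reason about than the efficiency itself. The sketch below applies it to the tiers in table 15; the function names are illustrative.

```python
# Scaling-efficiency sketch using the illustrative tiers in table 15.
SCALING_TIERS = {32: 0.9, 256: 0.7, 1_024: 0.5, 8_192: 0.35}


def scaling_efficiency(t1: float, tn: float, num_gpus: int) -> float:
    """eta = T_1 / (N * T_N), from measured single-GPU and N-GPU times."""
    return t1 / (num_gpus * tn)


def effective_speedup(num_gpus: int, eta: float) -> float:
    """Speedup over one GPU implied by an efficiency: T_1 / T_N = N * eta."""
    return num_gpus * eta


if __name__ == "__main__":
    for gpus, eta in SCALING_TIERS.items():
        print(f"{gpus:>5} GPUs at eta={eta:.2f}: "
              f"{effective_speedup(gpus, eta):,.1f}x speedup")
```

Under these tiers, 8,192 GPUs deliver the speedup of roughly 2,870 perfectly scaled GPUs, so the speedup still grows with cluster size even as the efficiency falls.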
Overhead budgets
Real training jobs spend a fraction of wall time on noncompute activities: pipeline bubble idle time, checkpoint writes, failure recovery, and scheduled maintenance. Table 16 lists the overhead budgets assumed throughout this book. These fractions are additive: total overhead at fleet scale is approximately 5 percent + 3 percent + 10 percent + 5 percent = 23 percent, meaning only 77 percent of wall time produces useful training progress.
| Assumption | Value | Unit |
|---|---|---|
| Wall-time overhead (pipeline bubbles) | 0.05 | (fraction) |
| Wall-time overhead (checkpointing) | 0.03 | (fraction) |
| Wall-time overhead (failure recovery) | 0.1 | (fraction) |
| Wall-time overhead (scheduled maintenance) | 0.05 | (fraction) |
The tables above use the unit conventions in table 17: decimal data prefixes (KB \(= 10^3\) bytes), a distinction between work (FLOPs) and throughput (FLOP/s), and the aliases listed there.
Unit Conventions
Table 17 fixes the unit conventions used in every quantitative example in this volume. Each row gives the multiplier \(k\) in 1 alias \(=\) \(k\) base. Data prefixes use decimal SI (\(\mathrm{KB} = 10^3\) bytes, not 1024). Throughput quantities (FLOP/s, GB/s) combine these aliases with time; adding incompatible dimensions (bytes to FLOP/s) is a category error in napkin math, not a unit conversion.
| Alias | Multiplier | Base unit |
|---|---|---|
| byte | 1 | byte |
| KB | 1000 | byte |
| MB | 1e+06 | byte |
| GB | 1e+09 | byte |
| TB | 1e+12 | byte |
| PB | 1e+15 | byte |
| flop | 1 | flop |
| GFLOPs | 1e+09 | flop |
| TFLOPs | 1e+12 | flop |
| ZFLOPs | 1e+21 | flop |
| param | 1 | param |
| Mparam | 1e+06 | param |
| Gbps | 1e+09 | bit/s |
| NS | 1e-09 | second |
| US | 1e-06 | second |
| MS | 0.001 | second |
| second | 1 | second |
| hour | 3600 | second |
| day | 86400 | second |
| joule | 1 | joule |
| watt | 1 | watt |
| meter | 1 | meter |
Two fleet-scale conventions deserve emphasis beyond the table:
- Network bandwidth appears in both bit-rate (Gbps) and byte-rate (GB/s). Divide by 8 to convert Gbps to GB/s (for example, 400 Gbps = 50 GB/s for InfiniBand NDR). Chapters use whichever form fits the calculation; byte-rate is usually easier for transfer-time estimates.
- Carbon intensity uses gCO\(_2\)/kWh (grams of CO\(_2\) per kilowatt-hour). Multiply total energy (kWh) by regional carbon intensity to get emissions in grams, then divide by \(10^6\) for tonnes.
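Both conventions reduce to a single division, sketched below with illustrative helper names. The checks in the main block reuse rows from table 4 (HDR, NDR, and 100 GbE) to confirm that the bit-rate and byte-rate forms agree.

```python
# Unit-conversion sketch for the two conventions above (names are illustrative).
BITS_PER_BYTE = 8
GRAMS_PER_TONNE = 1e6


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link rate in Gb/s to GB/s (decimal prefixes)."""
    return gbps / BITS_PER_BYTE


def grams_to_tonnes(grams: float) -> float:
    """Convert grams of CO2 to tonnes."""
    return grams / GRAMS_PER_TONNE


if __name__ == "__main__":
    for name, gbps in (("HDR", 200), ("NDR", 400), ("100 GbE", 100)):
        print(f"{name}: {gbps} Gb/s = {gbps_to_gbytes_per_s(gbps)} GB/s")
    # 5,644,800 kWh at 20 gCO2/kWh, as in the Quebec example
    print(f"{grams_to_tonnes(5_644_800 * 20):.1f} t CO2")
```

Keeping the conversions in named helpers makes it harder to mix bit-rate and byte-rate values in the same expression, which is the most common source of factor-of-eight errors in transfer-time estimates.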
Assumption Provenance
The catalog below maps each appendix section to source class and primary references. The book cites with @citekey; mlsysim mirrors the same provenance as Provenance records on Sourced registry scalars and metadata (e.g. Systems.Reliability.Gpu.mttf_hours.provenance, Hardware.Cloud.H100.metadata.provenance) without shipping a .bib file. See table 18 for the full table.
| Appendix section | Source type | Primary references |
|---|---|---|
| H100 recap | Vendor datasheet peak | (Choquette 2023) |
| Cluster tiers | Book convention | (Kokolis et al. 2025) (scale context) |
| Network fabrics | Vendor specs; InfiniBand standard | (NVIDIA Corporation 2017, 2020; Choquette 2023; InfiniBand Trade Association 2000) |
| Reliability (MTTF) | Fleet studies | (Kokolis et al. 2025; Zu et al. 2024; Barroso et al. 2019) |
| Recovery times | Design assumptions | (Young 1974; Daly 2006) |
| Communication (\(\alpha\)–\(\beta\)) | Fabric specs; training at scale | (InfiniBand Trade Association 2000; Jiang et al. 2024) |
| AllReduce model | Algorithm identity | (Gibiansky 2017) |
| Sustainability (PUE/WUE/rack power) | Industry surveys | (Uptime Institute 2022; The Green Grid 2007) |
| Carbon intensity | Grid statistics | (International Energy Agency 2023) |
| Cloud economics | Illustrative rates | Editorial (2024–2025 order of magnitude) |
| MFU and scaling efficiency | Published LLM training | (Chowdhery et al. 2022; Narayanan et al. 2021; Pope et al. 2023; Jiang et al. 2024) |
| Overhead budgets | Book convention | Editorial engineering targets |
| Unit conventions | Decimal SI; book notation | Editorial |
Summary
This appendix is the book’s shared fleet-scale assumption sheet: every napkin-math estimate in the volume should be traceable to a row in the tables above—and, where it matters, to section 1.1 and table 18.
Key Takeaways: Using the fleet assumption tables
- One place for shared fleet numbers: Cluster tiers, fabric bandwidths, MTTF and recovery times, \(\alpha\)–\(\beta\) communication parameters, facility efficiency, water use, carbon intensity, cloud pricing, MFU, scaling efficiency, and overhead budgets used in worked examples are listed here so you can audit or replace them without re-deriving from prose.
- Reliability assumptions set the failure clock: MTTF and recovery-time rows determine how often failures interrupt training and how much wall time is lost. At 10,000+ GPUs, failure is steady state, not an exception; see Reliability Foundations for deeper modeling.
- Communication assumptions determine viability: \(\alpha\)–\(\beta\) values determine whether synchronous distributed training is viable for a given model size and fabric. The 10× startup-latency gap between InfiniBand NDR (5 us) and TCP (50 us) explains why training clusters favor InfiniBand.
- Sustainability assumptions expose location leverage: Facility efficiency, water use, and carbon intensity show that data-center location and cooling can outweigh many algorithmic tweaks. The 82× spread between Norway and Poland grid carbon intensity is often the largest lever.
- Capacity assumptions quantify usable throughput: MFU, scaling efficiency, and overhead budgets bridge peak hardware to useful throughput. At 8,192 GPUs, only about 35 percent of ideal scaling is realized, and combined overheads consume roughly 23 percent of wall time.
- Substitute your own assumptions: When your deployment differs (different fabric, region, MFU, or cluster size), swap the Value column and rerun the same formulas the chapters use.
- Check provenance before debating a number: Datasheet peaks, fleet studies, illustrative rates, and book conventions are sourced differently; table 18 and section captions say which is which.