System Assumptions

Purpose

What assumptions sit underneath the distributed systems calculations throughout this book?

Every quantitative example in this volume, from cluster failure rates to collective-communication costs to carbon footprint estimates, uses a specific set of assumed numbers: canonical cluster sizes, fabric bandwidths, accelerator reliability, power usage effectiveness, regional carbon intensity, model utilization ranges, and dozens more. This appendix collects those fleet-scale assumptions in one place so the book’s arithmetic can be audited, checked against chapter napkin math, or substituted with local values. The tables cover node-level accelerator specs, cluster tiers, network fabrics, reliability and recovery parameters, communication models, sustainability, cloud economics, capacity-planning utilization, and unit conventions, followed by the provenance catalog that ties assumptions back to vendor datasheets, fleet studies, grid reports, and book conventions. In C³ terms, the assumptions keep compute capacity, communication cost, and coordination overhead measured on the same scale across chapters.

How to Use This Appendix

Use this appendix when you want to verify a fleet-scale estimate, swap in an alternative assumption (for example, NDR vs. HDR fabric, Quebec vs. Poland carbon intensity, or a different MFU), or see which numbers a chapter’s calculation depends on. Find the relevant section, read the Value and Unit columns, and plug them into your own arithmetic.

A short napkin-math callout at the start shows how several assumptions combine (failure rate, AllReduce time, carbon footprint). Hardware and link bandwidths are ceilings unless a chapter states an effective-bandwidth discount. Treat the tables as this edition’s shared assumption sheet—not immutable physical laws.

Napkin Math 1.1: Quick calculations with these assumptions
These tables support quick distributed-systems napkin math. Three examples show how several assumptions combine.

Problem: Estimate how many GPU failures a cluster should expect per day.

Variables: Use an 8,192-GPU cluster and an individual GPU MTTF of 50,000 hours.

Math: The GPU failure rate is 8,192 GPUs / 50,000 hours \(\approx\) 0.16 failures/hour, or roughly 3.9 failures/day.

Result: This estimate excludes NIC, PSU, and cable failures. At 100,000 GPUs, the GPU-only rate is approximately 48 failures/day; adding NIC, PSU, and cable failures can roughly double the operational incident stream.

Systems insight: At fleet scale, failure becomes a continuous background process rather than an exceptional event.
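
The failure-rate estimate above can be reproduced in a few lines of Python. This is a minimal sketch that assumes independent failures at a constant rate; the function name `gpu_failures_per_day` is illustrative and not part of the book's tooling.

```python
HOURS_PER_DAY = 24

def gpu_failures_per_day(num_gpus: int, gpu_mttf_hours: float) -> float:
    """Expected GPU failures per day: fleet rate N / MTTF (per hour) times 24."""
    return num_gpus / gpu_mttf_hours * HOURS_PER_DAY

large_tier = gpu_failures_per_day(8_192, 50_000)     # 8,192-GPU tier
mega_tier = gpu_failures_per_day(100_000, 50_000)    # 100,000-GPU tier
print(f"{large_tier:.1f} and {mega_tier:.0f} GPU failures/day")
```

Swapping in a local MTTF or fleet size only changes the two arguments.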

Problem: Estimate how long an AllReduce takes for a 70B model.

Variables: Use BF16 parameters and InfiniBand NDR with 50 GB/s effective bandwidth per port.

Math: The gradient payload is 70 \(\times 10^9 \times\) 2 bytes = 140 GB. The ring AllReduce napkin estimate is (2 \(\times\) 140 GB) / (50 GB/s) \(\approx\) 5.6 s.

Result: The collective can consume a significant fraction of a training step.

Systems insight: Pipeline and tensor parallelism exist partly to shrink the gradient volume each rank must communicate.
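
The AllReduce estimate can be checked the same way. The sketch below uses the large-rank approximation in which each rank moves about twice the payload, and ignores startup latency; the function name is an illustrative assumption.

```python
def allreduce_napkin_seconds(num_params: float, bytes_per_param: int,
                             bandwidth_gb_per_s: float) -> float:
    """Large-rank ring AllReduce estimate: each rank moves about 2x the payload."""
    payload_gb = num_params * bytes_per_param / 1e9   # decimal GB, per book convention
    return 2 * payload_gb / bandwidth_gb_per_s

t_70b = allreduce_napkin_seconds(70e9, 2, 50)  # BF16 gradients over NDR at 50 GB/s
print(f"{t_70b:.1f} s")
```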

Problem: Estimate the carbon footprint of a 10,000-GPU training run.

Variables: Each H100 draws 700 W at TDP. Use PUE 1.12, Quebec grid intensity 20 g/kWh, and Poland grid intensity 820 g/kWh.

Math: Total facility power is 10,000 GPUs \(\times\) 700 W \(\times\) 1.12 = 7.84 MW. Over a 720-hour month that is 7.84 MW \(\times\) 720 h = 5,644.8 MWh. In Quebec, 5,644,800 kWh \(\times\) 20 g/kWh \(\approx\) 112.9 t CO\(_2\) per month.

Result: The same run in Poland emits 4,628.7 t CO\(_2\) per month, a 41× difference.

Systems insight: Grid carbon intensity can dominate the emissions difference for the same hardware, model, and training duration.
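
The carbon estimate chains three assumptions (TDP, PUE, grid intensity), so it is a natural candidate for a small helper. A minimal sketch, assuming every GPU runs at TDP for a 720-hour month; the helper name is hypothetical.

```python
HOURS_PER_MONTH = 720  # 30-day month, as in the worked example

def monthly_emissions_tonnes(num_gpus, watts_per_gpu, pue, grid_g_per_kwh):
    """Facility energy for one month times grid intensity, in tonnes of CO2."""
    facility_kw = num_gpus * watts_per_gpu * pue / 1e3   # IT power scaled by PUE
    energy_kwh = facility_kw * HOURS_PER_MONTH
    return energy_kwh * grid_g_per_kwh / 1e6              # grams -> tonnes

quebec_t = monthly_emissions_tonnes(10_000, 700, 1.12, 20)
poland_t = monthly_emissions_tonnes(10_000, 700, 1.12, 820)
print(f"{quebec_t:.1f} t vs {poland_t:.1f} t ({poland_t / quebec_t:.0f}x)")
```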

Foundational Hardware Recap

Distributed-systems reasoning in this book builds on single-node performance bounds. Table 1 recaps the H100 assumptions used as the primary accelerator reference in this volume.

Table 1: Single-node H100 assumptions: Peak specs from (Choquette 2023) and NVIDIA H100 documentation. These values provide the \(R_{\text{peak}}\) and \(\text{BW}\) baselines for iron-law calculations.
| Category | Assumption | Value |
|---|---|---|
| Compute | H100 FP16 peak throughput | 989 TFLOP/s |
| Compute | H100 FP8 peak throughput | 1,979 TFLOP/s |
| Memory | HBM3 bandwidth | 3.35 TB/s |
| Memory | HBM3 capacity | 80 GB |
| Thermal | TDP | 700 W |

Cluster Reference Configurations

This book uses four canonical cluster tiers (256, 2,048, 8,192, and 100,000 GPUs) to illustrate how system behavior changes across scale. They are editorial reference points aligned with common research-lab, production, large-training, and hyperscale fleets—not measurements from a single deployment. Table 2 defines GPU counts; chapters derive node counts, failure rates, and network requirements from these baselines and the assumptions in subsequent tables.

Table 2: Canonical Cluster Sizes: Book-convention tier sizes for fleet-scale examples. At 8 GPUs per node, the tiers correspond to 32, 256, 1,024, and 12,500 nodes. Failure handling shifts from exception to steady state between the 256- and 8,192-GPU tiers (Kokolis et al. 2025).
| Assumption | Value | Unit |
|---|---|---|
| Small cluster GPU count (256 tier) | 256 | GPUs |
| Medium cluster GPU count (2,048 tier) | 2048 | GPUs |
| Large cluster GPU count (8,192 tier) | 8192 | GPUs |
| Mega cluster GPU count (100,000 tier) | 100000 | GPUs |

Cluster size changes system behavior qualitatively, not just proportionally. With the 50,000-hour GPU MTTF used in this appendix, a 256-GPU cluster averages about eight days between GPU failures, while a 100,000-GPU cluster sees about two GPU failures per hour before NIC, PSU, and cable failures are counted. This shift is why the chapters on fault tolerance (Fault Tolerance) and fleet orchestration (Fleet Orchestration) treat failure as a continuous background process rather than an exceptional event.
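
To see how the tiers differ, the sketch below derives node counts and the mean time between GPU failures for each tier. It assumes 8 GPUs per node and the 50,000-hour GPU MTTF used elsewhere in this appendix; the dictionary keys and function name are illustrative.

```python
GPU_MTTF_HOURS = 50_000
GPUS_PER_NODE = 8
TIERS = {"small": 256, "medium": 2_048, "large": 8_192, "mega": 100_000}

def tier_summary(num_gpus: int) -> tuple[int, float]:
    """Return (node count, mean hours between GPU failures) for one tier."""
    nodes = num_gpus // GPUS_PER_NODE
    hours_between_failures = GPU_MTTF_HOURS / num_gpus
    return nodes, hours_between_failures

summary = {name: tier_summary(n) for name, n in TIERS.items()}
for name, (nodes, hours) in summary.items():
    print(f"{name:>6}: {nodes:>6} nodes, {hours:8.2f} h between GPU failures")
```

The small tier averages about 195 hours between GPU failures, while the mega tier averages half an hour.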

With cluster sizes established, the next question is: what connects the machines? The network fabric determines whether gradient synchronization takes milliseconds or seconds, and whether pipeline parallelism is viable across node boundaries.

Network and Interconnect Specifications

Distributed training and inference are bottlenecked by data movement between machines as often as by computation within them. Network assumptions fix bandwidth and latency for every interconnect tier—from NVLink/PCIe within a node (NVIDIA Corporation 2017, 2020; Choquette 2023), through InfiniBand, RDMA, and Ethernet fabrics (InfiniBand Trade Association 2000; Gangidi et al. 2024; NVIDIA 2026; Ethernet Alliance 2025), to the speed-of-light WAN floor (physics identity).

Intra-node interconnects

Within a single node, GPUs communicate over NVLink or PCIe. These bandwidths determine whether tensor parallelism (which requires high-bandwidth, low-latency communication) is viable within the node boundary. Table 3 lists the intra-node interconnect bandwidths and latencies reused throughout the fleet-scale chapters.

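Table 3: Intra-Node Interconnect Bandwidth and Latency: NVLink/PCIe peaks from vendor documentation (NVIDIA Corporation 2017, 2020; Choquette 2023). NVLink is 7–10× faster than same-generation PCIe when both are counted bidirectionally (the PCIe rows below are per-direction x16 rates, so the raw ratio of the table values is 14–19×), which is why tensor parallelism stays intra-node.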
| Assumption | Value | Unit |
|---|---|---|
| NVLink bandwidth (V100) | 300 | GB/s |
| NVLink bandwidth (A100) | 600 | GB/s |
| NVLink bandwidth (H100) | 900 | GB/s |
| PCIe Gen3 bandwidth (V100 host) | 15.75 | GB/s |
| PCIe Gen4 bandwidth (A100 host) | 32 | GB/s |
| PCIe Gen5 bandwidth (H100 host) | 64 | GB/s |
| NVLink one-way latency | 500 | ns |
| PCIe Gen5 one-way latency | 1000 | ns |

Inter-node interconnects

Once data crosses the node boundary, bandwidth drops by an order of magnitude and latency increases by 5–10\(\times\). This cliff shapes every decision about parallelism strategy: operations that require frequent, fine-grained communication (tensor parallelism) stay within the node, while operations with coarser communication patterns (data parallelism, pipeline parallelism) span nodes. Table 4 lists both the bit-rate and byte-rate forms of each fabric, since different chapters use different conventions.

Table 4: Inter-Node Network Bandwidth and Latency: Fabric link rates from (InfiniBand Trade Association 2000) and product-line specifications. The 9× per-direction NVLink-vs-NDR gap (450 GB/s vs. 50 GB/s; 18× against the 900 GB/s aggregate NVLink figure in table 3) explains topology-aware parallelism (Collective Communication, Distributed Training).
| Assumption | Value | Unit |
|---|---|---|
| InfiniBand HDR link bandwidth (bit rate) | 200 | Gb/s |
| InfiniBand HDR bandwidth (byte rate) | 25 | GB/s |
| InfiniBand NDR link bandwidth (bit rate) | 400 | Gb/s |
| InfiniBand NDR bandwidth (byte rate) | 50 | GB/s |
| InfiniBand XDR bandwidth (byte rate) | 100 | GB/s |
| 400 GbE bandwidth (byte rate) | 50 | GB/s |
| 800 GbE bandwidth (byte rate) | 100 | GB/s |
| 100 GbE RoCE bandwidth (byte rate) | 12.5 | GB/s |
| InfiniBand one-way latency | 5000 | ns |

Wide-area network

Cross-region communication introduces latency floors set by the speed of light in optical fiber. For a 5000 km link (roughly New York to London), the minimum one-way latency is 5000 km/200,000 km/s = 25 ms—orders of magnitude higher than intra-data-center latency. This physical constraint is why geo-distributed training uses asynchronous methods or careful placement of pipeline stages, and why federated learning (Edge Intelligence) tolerates stale gradients.
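
The latency floor is a one-line calculation. The sketch below assumes light travels at about 200,000 km/s in fiber, as in the example above, and ignores routing, switching, and non-great-circle paths; the names are illustrative.

```python
FIBER_KM_PER_S = 200_000  # approximate speed of light in optical fiber

def one_way_latency_ms(distance_km: float) -> float:
    """Physical lower bound on one-way latency over a fiber path, in milliseconds."""
    return distance_km / FIBER_KM_PER_S * 1e3

floor_ms = one_way_latency_ms(5_000)
print(f"{floor_ms:.0f} ms one-way, {2 * floor_ms:.0f} ms round trip")
```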

Network specifications define how fast data moves, but they say nothing about how often the machines sending that data fail. The next section quantifies the reliability of individual components—the building blocks from which cluster-level failure rates are derived.

Reliability Assumptions

At fleet scale, component failures are not exceptional events but statistical certainties. Reliability assumptions here fix Mean Time To Failure (MTTF) for each major component class in a data-center GPU node and the time to detect and recover from failures. These values drive analysis in Fault Tolerance and scheduling in Fleet Orchestration; deeper failure modeling appears in Reliability Foundations.

Component MTTF

Table 5 lists the MTTF for each component class. These are steady-state values that exclude the “infant mortality” period (first 30–90 days) and the “wear-out” period (beyond rated lifetime). Sources include large-GPU research-cluster analysis Kokolis et al. (2025), TPUv4 supercomputer resiliency and operations Zu et al. (2024), and warehouse-scale machine design Barroso et al. (2019).

Table 5: Component Mean Time To Failure: Steady-state MTTF anchors from (Kokolis et al. 2025; Zu et al. 2024; Barroso et al. 2019). Combined node MTTF \(\approx\) 3,592.8 h (149.7 d) at 8 GPUs per node.
| Assumption | Value | Unit |
|---|---|---|
| GPU MTTF | 50000 | hours |
| NIC MTTF | 150000 | hours |
| PSU MTTF | 100000 | hours |
| PCIe switch MTTF | 200000 | hours |
| HBM MTTF | 200000 | hours |
| Top-of-rack switch MTTF | 300000 | hours |
| Optical cable and transceiver MTTF | 50000 | hours |
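
The combined node MTTF in the caption follows from adding component failure rates, treating the node as a series system in which any component failure takes the node down. The per-node component counts below are an assumption, since the table does not state them; they are chosen because they reproduce the caption's figure of about 3,592.8 hours.

```python
# Steady-state MTTF values from table 5, in hours.
MTTF_HOURS = {"gpu": 50_000, "nic": 150_000, "psu": 100_000,
              "pcie_switch": 200_000, "hbm": 200_000}

# ASSUMPTION: per-node component counts (not given in the table).
COUNTS_PER_NODE = {"gpu": 8, "nic": 8, "hbm": 8, "psu": 2, "pcie_switch": 1}

def node_mttf_hours(mttf: dict, counts: dict) -> float:
    """Series system: failure rates add, so MTTF = 1 / sum(count / MTTF)."""
    total_rate = sum(n / mttf[name] for name, n in counts.items())
    return 1.0 / total_rate

node_mttf = node_mttf_hours(MTTF_HOURS, COUNTS_PER_NODE)
print(f"{node_mttf:.1f} h = {node_mttf / 24:.1f} d")
```

A node with a different NIC or PSU count only needs a new `COUNTS_PER_NODE`.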

Recovery time assumptions

Failure detection and recovery introduce downtime that compounds with cluster size. Table 6 lists the assumptions used in checkpoint interval optimization (the Young-Daly formula in The Young-Daly law: Optimal checkpointing) and effective throughput calculations throughout this book.

Table 6: Recovery Time Parameters: Design assumptions for heartbeat, reschedule, and checkpoint-write bandwidth, in the Young–Daly context (Young 1974; Daly 2006). Total recovery \(\approx\) detect + reschedule + checkpoint size / write bandwidth; for a 70B-parameter FP32 checkpoint (280 GB) this gives 30 + 60 + 2.8 \(\approx\) 92.8 s per failure.
| Assumption | Value | Unit |
|---|---|---|
| Failure detection timeout (heartbeat) | 30 | seconds |
| Job reschedule delay | 60 | seconds |
| Checkpoint write bandwidth (aggregate) | 100 | GB/s |
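
The recovery estimate in the caption, and a first-order checkpoint interval in the spirit of Young (1974), can be sketched as follows. The function names are hypothetical. Using GPU-only failures on the 8,192-GPU tier as the system MTBF is an assumption made here for illustration, as is the use of Young's first-order form \(\sqrt{2 \delta M}\) rather than Daly's higher-order estimate.

```python
import math

def recovery_seconds(detect_s, reschedule_s, checkpoint_gb, write_gb_per_s):
    """Detect + reschedule + checkpoint transfer time, as in the table caption."""
    return detect_s + reschedule_s + checkpoint_gb / write_gb_per_s

def young_interval_seconds(checkpoint_write_s, system_mtbf_s):
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2 * checkpoint_write_s * system_mtbf_s)

checkpoint_gb = 70e9 * 4 / 1e9                           # 70B params x 4 bytes (FP32)
recovery = recovery_seconds(30, 60, checkpoint_gb, 100)  # about 92.8 s

# ASSUMPTION: system MTBF from GPU failures only, 8,192-GPU tier.
cluster_mtbf_s = 50_000 / 8_192 * 3600
interval = young_interval_seconds(checkpoint_gb / 100, cluster_mtbf_s)
print(f"recovery {recovery:.1f} s, checkpoint every {interval / 60:.1f} min")
```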

Reliability constants quantify how often failures occur and how long recovery takes. The cost of each failure, however, depends on how much communication must be repeated—which brings us to the parameters that model communication overhead.

Communication Model Parameters

This book models communication cost using the \(\alpha\)-\(\beta\) framework: \(T(n) = \alpha + n/\beta\), where \(\alpha\) is startup latency and \(\beta\) is sustained bandwidth (Hockney 1994). Table 7 lists values from fabric specifications, vendor product documentation, and large-cluster training studies (InfiniBand Trade Association 2000; Gangidi et al. 2024; NVIDIA 2026; Ethernet Alliance 2025; Jiang et al. 2024). They feed collective-communication models in Collective Communication and parallelism comparisons in Distributed Training.

Hockney, Roger W. 1994. “The Communication Challenge for MPP: Intel Paragon and Meiko CS-2.” Parallel Computing 20 (3): 389–98. https://doi.org/10.1016/s0167-8191(06)80021-9.
Gangidi, Adi, Rui Miao, Sandeep Hebbani, Gaya Nagarajan, Omar Baldonado, Lixin Gao, Hany Morsy Goes, et al. 2024. “RDMA over Ethernet for Distributed AI Training at Meta Scale.” Proceedings of the ACM SIGCOMM 2024 Conference, 56–69. https://doi.org/10.1145/3651890.3672233.
NVIDIA. 2026. NVIDIA Quantum-X800 InfiniBand Platform. NVIDIA product documentation.
Ethernet Alliance. 2025. 2025 Ethernet Roadmap. Ethernet Alliance roadmap.
Table 7: Communication Model Parameters (\(\alpha\)-\(\beta\)): \(\alpha\) (startup) and \(\beta\) (bandwidth) from fabric specifications and measured cluster studies (InfiniBand Trade Association 2000; Jiang et al. 2024). The 10× NDR-vs-TCP startup-latency gap explains why RDMA fabrics dominate synchronous training.
| Assumption | Value | Unit |
|---|---|---|
| InfiniBand NDR startup latency (\(\alpha\)) | 5 | us |
| InfiniBand NDR sustained bandwidth (\(\beta\)) | 50 | GB/s |
| InfiniBand HDR startup latency (\(\alpha\)) | 7 | us |
| InfiniBand HDR sustained bandwidth (\(\beta\)) | 25 | GB/s |
| RoCE startup latency (\(\alpha\)) | 10 | us |
| RoCE sustained bandwidth (\(\beta\)) | 12.5 | GB/s |
| TCP startup latency (\(\alpha\)) | 50 | us |
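
A short sketch shows how the \(\alpha\)-\(\beta\) rows plug into \(T(n) = \alpha + n/\beta\). The dictionary layout and function names are illustrative assumptions. The crossover size \(\alpha \beta\), below which startup latency dominates, follows directly from the model.

```python
# (alpha in seconds, beta in bytes/second) from the table above.
FABRICS = {
    "ib_ndr": (5e-6, 50e9),
    "ib_hdr": (7e-6, 25e9),
    "roce":   (10e-6, 12.5e9),
}

def transfer_time_s(message_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """Alpha-beta model: startup latency plus size over sustained bandwidth."""
    return alpha_s + message_bytes / beta_bytes_per_s

def crossover_bytes(alpha_s: float, beta_bytes_per_s: float) -> float:
    """Message size at which the latency term equals the bandwidth term."""
    return alpha_s * beta_bytes_per_s

t_1mb_ndr = transfer_time_s(1e6, *FABRICS["ib_ndr"])   # 5 us + 20 us
ndr_crossover = crossover_bytes(*FABRICS["ib_ndr"])    # 250 KB
print(f"{t_1mb_ndr * 1e6:.0f} us for 1 MB; crossover at {ndr_crossover / 1e3:.0f} KB")
```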

AllReduce cost model

The ring AllReduce—the most common collective for gradient synchronization—transmits \(2(N_{\text{rank}}-1)M/N_{\text{rank}}\) bytes per rank, where \(N_{\text{rank}}\) is the number of ranks and \(M\) is the message size. For large \(N_{\text{rank}}\), this approaches \(2M\) bytes per rank. Table 8 lists the constants used in AllReduce cost estimation.

Table 8: AllReduce Cost Parameters: Ring AllReduce factor of 2 from the standard ring algorithm (Gibiansky 2017). Systems.Nodes.DGX_H100 (8 GPUs per node) sets the intra-node versus inter-node boundary in hierarchical collectives.
| Assumption | Value | Unit |
|---|---|---|
| Ring AllReduce bandwidth factor | 2 | (dimensionless) |
| GPUs per node | 8 | GPUs/node |
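
The per-rank volume formula above can be turned into a time estimate by dividing by \(\beta\). The sketch below also adds a latency term of \(2(N_{\text{rank}}-1)\alpha\), one startup cost per ring step; that term is a common textbook extension and an assumption here, not a row in the table.

```python
def ring_allreduce_bytes_per_rank(message_bytes: float, num_ranks: int) -> float:
    """Bytes each rank transmits in a ring AllReduce: 2(N-1)M/N."""
    return 2 * (num_ranks - 1) * message_bytes / num_ranks

def ring_allreduce_seconds(message_bytes, num_ranks, alpha_s, beta_bytes_per_s):
    """Bandwidth term plus an assumed per-step startup cost over 2(N-1) steps."""
    steps = 2 * (num_ranks - 1)
    bandwidth_term = ring_allreduce_bytes_per_rank(message_bytes, num_ranks) / beta_bytes_per_s
    return steps * alpha_s + bandwidth_term

bytes_8 = ring_allreduce_bytes_per_rank(140e9, 8)        # 245 GB per rank
t_1024 = ring_allreduce_seconds(140e9, 1_024, 5e-6, 50e9)
print(f"{bytes_8 / 1e9:.0f} GB per rank at 8 ranks; {t_1024:.2f} s at 1,024 ranks")
```

At 1,024 ranks the bandwidth term is already within 0.1 percent of the \(2M/\beta\) limit.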

Communication models quantify the overhead of coordination. The sustainability cost of that coordination—the power drawn, the water consumed, the carbon emitted—requires separate assumptions for the physical infrastructure surrounding the compute.

Sustainability Assumptions

The environmental impact of fleet-scale ML depends on three factors: how much power the facility draws beyond the IT equipment (PUE), how much water the cooling system consumes (WUE), and how much carbon the local grid emits per kilowatt-hour. Sustainability assumptions here support Sustainable AI and total-cost-of-ownership estimates in ML Operations at Scale.

Power usage effectiveness

Power Usage Effectiveness (PUE) is the ratio of total facility power to IT equipment power. A PUE of 1.0 would mean zero cooling overhead; real data centers range from 1.06 (liquid-cooled) to 1.58 (legacy air-cooled). Table 9 lists the reference PUE values used throughout this book.

Table 9: Power Usage Effectiveness (PUE): Illustrative PUE tiers from hyperscale surveys and Green Grid definitions (Uptime Institute 2022; The Green Grid 2007). Legacy vs. liquid-cooled gap: 49.1 percent more facility power for the same IT load.
| Assumption | Value | Unit |
|---|---|---|
| PUE (liquid-cooled, best in class) | 1.06 | (ratio) |
| PUE (best air-cooled) | 1.12 | (ratio) |
| PUE (industry average) | 1.4 | (ratio) |
| PUE (legacy air-cooled) | 1.58 | (ratio) |
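
Because PUE is a simple multiplier on IT power, the caption's 49.1 percent figure can be checked directly. A minimal sketch with an illustrative 1 MW IT load; the function name is hypothetical.

```python
def facility_kw(it_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE."""
    return it_kw * pue

legacy_kw = facility_kw(1_000, 1.58)   # legacy air-cooled
liquid_kw = facility_kw(1_000, 1.06)   # liquid-cooled, best in class
extra_fraction = legacy_kw / liquid_kw - 1
print(f"legacy draws {extra_fraction:.1%} more facility power")
```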

Water usage effectiveness

Water Usage Effectiveness (WUE) measures liters of water consumed per kilowatt-hour of IT energy (The Green Grid 2011). Evaporative cooling towers achieve excellent PUE but consume significant water; closed-loop liquid cooling uses near-zero water but requires higher capital investment. Table 10 lists the reference WUE values.

Table 10: Water Usage Effectiveness (WUE): Illustrative WUE tiers for cooling-class comparisons (The Green Grid 2011; Uptime Institute 2022). Evaporative example at 10 MW IT load: 432,000 L/day.
The Green Grid. 2011. Water Usage Effectiveness (WUE): A Green Grid Data Center Sustainability Metric. White Paper 35. The Green Grid.
| Assumption | Value | Unit |
|---|---|---|
| WUE (air-cooled) | 1.8 | L/kWh |
| WUE (evaporative cooling) | 1.8 | L/kWh |
| WUE (closed-loop liquid) | 0 | L/kWh |
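
The caption's daily water figure follows from IT energy times WUE. A minimal sketch, assuming a constant IT load over 24 hours; the function name is illustrative.

```python
def water_liters_per_day(it_load_mw: float, wue_l_per_kwh: float) -> float:
    """Daily water use: IT energy over 24 hours (kWh) times WUE (L/kWh)."""
    it_energy_kwh = it_load_mw * 1e3 * 24
    return it_energy_kwh * wue_l_per_kwh

evaporative = water_liters_per_day(10, 1.8)
closed_loop = water_liters_per_day(10, 0)
print(f"{evaporative:,.0f} L/day evaporative vs {closed_loop:,.0f} L/day closed-loop")
```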

Regional carbon intensity

The same training run emits vastly different amounts of CO\(_2\) depending on the grid that supplies its electricity. Table 11 lists regional grid carbon intensities from IEA and Canadian government emissions-factor data (International Energy Agency 2024; Environment and Climate Change Canada 2026), rounded for napkin math. These values are central to location-aware scheduling in Sustainable AI.

Table 11: Regional Grid Carbon Intensity: Grid carbon intensity in gCO\(_2\)/kWh, anchored to emissions-factor data (International Energy Agency 2024; Environment and Climate Change Canada 2026) and ordered from low (Norway) to high (Poland). Location spans an 82× carbon spread, often larger than the effect of algorithmic tweaks.
International Energy Agency. 2024. Emissions Factors 2024.
Environment and Climate Change Canada. 2026. Emission Factors and Reference Values.
| Assumption | Value | Unit |
|---|---|---|
| Grid carbon intensity (Norway) | 10 | gCO2/kWh |
| Grid carbon intensity (Quebec) | 20 | gCO2/kWh |
| Grid carbon intensity (France) | 50 | gCO2/kWh |
| Grid carbon intensity (EU average) | 270 | gCO2/kWh |
| Grid carbon intensity (US average) | 429 | gCO2/kWh |
| Grid carbon intensity (Poland) | 820 | gCO2/kWh |

Power density

AI accelerator racks draw 6–8\(\times\) more power than traditional data center racks, creating thermal density challenges that force the transition from air cooling to liquid cooling. Table 12 lists the reference rack power levels used in Compute Infrastructure and Sustainable AI, anchored to current AI power-and-cooling product disclosures (Dell Technologies 2026; Schneider Electric 2024).

Table 12: Rack Power Density: Illustrative rack-power tiers and air-cooling limit (Uptime Institute 2022; Dell Technologies 2026; Schneider Electric 2024). AI racks (70 kW–100 kW) exceed the 30 kW air-cooling practical limit.
Dell Technologies. 2026. Data Center Power and Cooling Solutions.
Schneider Electric. 2024. Schneider Electric Announces New Solutions to Address the Energy and Sustainability Challenges Spurred by AI.
| Assumption | Value | Unit |
|---|---|---|
| Rack power (traditional enterprise) | 12 | kW |
| Rack power (AI, typical) | 70 | kW |
| Rack power (AI, high density) | 100 | kW |
| Air-cooling practical limit per rack | 30 | kW |

Sustainability constants capture the physical and environmental costs of running a fleet. The economic constants in the next section translate those physical costs into dollar values, completing the total-cost-of-ownership picture.

Economic Assumptions

Fleet-scale cost estimates depend on GPU rental rates, electricity pricing, and data transfer charges. These are illustrative hyperscaler-order rates (2024–2025 magnitude) for ratio analysis, not quotes for a specific contract. Table 13 supports ML Operations at Scale and Inference at Scale TCO napkin math.

Table 13: Fleet-Scale Economic Parameters: Illustrative cloud GPU, electricity, and egress rates for TCO examples (substitute contract pricing for absolute budgets).
| Assumption | Value | Unit |
|---|---|---|
| Cloud GPU training rental rate | 4 | dollar/h |
| Cloud GPU inference rental rate | 2.5 | dollar/h |
| Cloud electricity price | 0.12 | dollar/kWh |
| Cloud egress price per GB | 0.09 | dollar/GB |

Economic constants set the price per unit of resource. The next question is: what fraction of each resource unit does useful work? Capacity planning constants quantify the gap between peak capability and achievable throughput.

Capacity Planning Assumptions

Capacity-planning assumptions quantify two related phenomena: (1) the fraction of peak hardware performance that real workloads achieve (Model FLOPs Utilization), and (2) how efficiently that performance scales as more machines are added (scaling efficiency). Together they determine the effective compute available for a given cluster size and budget—the number that matters for project planning.

The planning equation is multiplicative: \[ R_{\text{eff}} = N R_{\text{peak}} \times \text{MFU} \times \eta_{\text{scaling}} \times (1 - f_{\text{overhead}}) \]

Here \(R_{\text{eff}}\) is effective fleet throughput, \(N R_{\text{peak}}\) is nominal cluster peak throughput, MFU discounts peak hardware to realized model throughput, \(\eta_{\text{scaling}}\) discounts added machines for communication and coordination losses, and \(f_{\text{overhead}}\) removes wall time spent on pipeline bubbles, checkpoints, failure recovery, and maintenance. The three subsections that follow fix those factors in order.
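
As a worked instance of the planning equation, the sketch below evaluates it for the 8,192-GPU tier with the H100 FP16 peak. The MFU of 0.4 is an assumed midpoint of the training range in this appendix, not a tabulated value, and the function name is hypothetical.

```python
def effective_throughput(num_gpus, peak_per_gpu, mfu, scaling_eff, overhead_fraction):
    """R_eff = N * R_peak * MFU * eta_scaling * (1 - f_overhead)."""
    return num_gpus * peak_per_gpu * mfu * scaling_eff * (1 - overhead_fraction)

nominal_tflops = 8_192 * 989                      # N * R_peak, in TFLOP/s
r_eff_tflops = effective_throughput(
    8_192, 989,
    mfu=0.4,                 # ASSUMPTION: midpoint of the 0.3-0.5 training range
    scaling_eff=0.35,        # 8,192-GPU scaling-efficiency tier
    overhead_fraction=0.23,  # combined overhead budget
)
print(f"{r_eff_tflops / 1e3:,.0f} PFLOP/s effective "
      f"({r_eff_tflops / nominal_tflops:.1%} of nominal peak)")
```

Because the factors multiply, a modest loss in each one compounds into a large gap between nominal and effective throughput.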

Model FLOPs utilization

Model FLOPs Utilization (MFU) measures the ratio of actual compute throughput to the hardware’s peak theoretical throughput. An MFU of 0.50 means the workload achieves half the peak FLOP/s. Table 14 lists the reference MFU ranges used in training time and cost estimates throughout this book.

Table 14: Model FLOPs Utilization (MFU) Ranges: Training MFU 0.3–0.5 from large-model training reports (Chowdhery et al. 2022; Narayanan et al. 2021); inference batch-1 MFU 5 percent from (Pope et al. 2023) (memory-bound decode).
| Assumption | Value | Unit |
|---|---|---|
| Training MFU (low end) | 0.3 | (fraction) |
| Training MFU (high end) | 0.5 | (fraction) |
| Inference MFU (batch size 1) | 0.05 | (fraction) |
| Inference MFU (batched) | 0.4 | (fraction) |

Scaling efficiency

Scaling efficiency \(\eta_{\text{scaling}} = T_1/(N \times T_N)\) measures how much of the added compute actually reduces training time. Table 15 lists illustrative efficiency tiers aligned with published LLM training at scale (Chowdhery et al. 2022; Jiang et al. 2024)—workload and fabric dependent.

Table 15: Scaling Efficiency by Cluster Size: Illustrative \(\eta_{\text{scaling}}\) tiers from (Chowdhery et al. 2022; Jiang et al. 2024). At 8,192 GPUs, \(\eta_{\text{scaling}}\) = 0.35 leaves 2,867.2 effective GPUs out of 8,192 nominal.
| Assumption | Value | Unit |
|---|---|---|
| Scaling efficiency (32 GPUs) | 0.9 | (fraction) |
| Scaling efficiency (256 GPUs) | 0.7 | (fraction) |
| Scaling efficiency (1,024 GPUs) | 0.5 | (fraction) |
| Scaling efficiency (8,192 GPUs) | 0.35 | (fraction) |
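
The definition \(\eta_{\text{scaling}} = T_1/(N \times T_N)\) and the effective-GPU count in the caption can be sketched as below. The timings in the first call are made-up numbers for illustration only; the function names are hypothetical.

```python
def scaling_efficiency(t_single_s: float, num_gpus: int, t_parallel_s: float) -> float:
    """eta = T_1 / (N * T_N): fraction of ideal linear speedup achieved."""
    return t_single_s / (num_gpus * t_parallel_s)

def effective_gpus(num_gpus: int, eta: float) -> float:
    """Number of ideally scaling GPUs that would deliver the same throughput."""
    return num_gpus * eta

# Hypothetical timings: 100 s on 1 GPU, 31.25 s on 4 GPUs.
eta_example = scaling_efficiency(100.0, 4, 31.25)
large_effective = effective_gpus(8_192, 0.35)
print(f"eta = {eta_example:.2f}; {large_effective:,.1f} effective GPUs at 8,192")
```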

Overhead budgets

Real training jobs spend a fraction of wall time on noncompute activities: pipeline bubble idle time, checkpoint writes, failure recovery, and scheduled maintenance. Table 16 lists the overhead budgets assumed throughout this book. These fractions are additive: total overhead at fleet scale is approximately 5 percent + 3 percent + 10 percent + 5 percent = 23 percent, meaning only 77 percent of wall time produces useful training progress.

Table 16: Overhead Budgets (Fraction of Wall Time): Book-convention wall-time fractions for pipeline bubbles, checkpointing, failure recovery, and maintenance—engineering targets, not physical constants. Combined 23 percent overhead \(\Rightarrow\) 23.1 days of effective work in a 30-day run.
| Assumption | Value | Unit |
|---|---|---|
| Wall-time overhead (pipeline bubbles) | 0.05 | (fraction) |
| Wall-time overhead (checkpointing) | 0.03 | (fraction) |
| Wall-time overhead (failure recovery) | 0.1 | (fraction) |
| Wall-time overhead (scheduled maintenance) | 0.05 | (fraction) |
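
Because the overhead fractions are additive, the 23 percent total and the 23.1 effective days in the caption follow from a sum. A minimal sketch; the dictionary keys are illustrative names for the four table rows.

```python
OVERHEADS = {
    "pipeline_bubbles": 0.05,
    "checkpointing": 0.03,
    "failure_recovery": 0.10,
    "scheduled_maintenance": 0.05,
}

total_overhead = sum(OVERHEADS.values())       # additive wall-time fractions
effective_days = 30 * (1 - total_overhead)     # useful days in a 30-day run
print(f"{total_overhead:.0%} overhead -> {effective_days:.1f} effective days")
```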

The tables above use the unit conventions in table 17: decimal data prefixes (KB \(= 10^3\) bytes), a separation of work (FLOPs) from throughput (FLOP/s), and the aliases below.

Unit Conventions

Table 17 fixes the unit conventions used in every quantitative example in this volume. Each row gives the multiplier \(k\) in 1 alias \(=\) \(k\) base. Data prefixes use decimal SI (\(\mathrm{KB} = 10^3\) bytes, not 1024). Throughput quantities (FLOP/s, GB/s) combine these aliases with time; adding incompatible dimensions (bytes to FLOP/s) is a category error in napkin math, not a unit conversion.

Table 17: Unit conventions: Decimal SI aliases and scale factors (book convention). Throughput forms (FLOP/s, GB/s) divide work or data by time using the same conventions.
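| Alias | Multiplier | Base unit |
|---|---|---|
| byte | 1 | byte |
| KB | 1000 | byte |
| MB | 1e+06 | byte |
| GB | 1e+09 | byte |
| TB | 1e+12 | byte |
| PB | 1e+15 | byte |
| flop | 1 | flop |
| GFLOPs | 1e+09 | flop |
| TFLOPs | 1e+12 | flop |
| ZFLOPs | 1e+21 | flop |
| param | 1 | param |
| Mparam | 1e+06 | param |
| Gbps | 1e+09 | bit/s |
| NS | 1e-09 | second |
| US | 1e-06 | second |
| MS | 0.001 | second |
| second | 1 | second |
| hour | 3600 | second |
| day | 86400 | second |
| joule | 1 | joule |
| watt | 1 | watt |
| meter | 1 | meter |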

Two fleet-scale conventions deserve emphasis beyond the table:

  • Network bandwidth appears in both bit-rate (Gbps) and byte-rate (GB/s). Divide by 8 to convert Gbps to GB/s (for example, 400 Gbps = 50 GB/s for InfiniBand NDR). Chapters use whichever form fits the calculation; byte-rate is usually easier for transfer-time estimates.
  • Carbon intensity uses gCO\(_2\)/kWh (grams of CO\(_2\) per kilowatt-hour). Multiply total energy (kWh) by regional carbon intensity to get emissions in grams, then divide by \(10^6\) for tonnes.
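
Both conventions reduce to one-line conversions. The helpers below are a sketch with hypothetical names; the 1,000,000 kWh figure in the second call is an arbitrary illustrative energy total.

```python
BITS_PER_BYTE = 8
GRAMS_PER_TONNE = 1e6

def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert a link bit rate (Gbps) to a byte rate (GB/s)."""
    return gbps / BITS_PER_BYTE

def emissions_tonnes(energy_kwh: float, grid_g_per_kwh: float) -> float:
    """Energy (kWh) times carbon intensity (gCO2/kWh), converted to tonnes."""
    return energy_kwh * grid_g_per_kwh / GRAMS_PER_TONNE

ndr_gb_s = gbps_to_gb_per_s(400)               # InfiniBand NDR
us_avg_t = emissions_tonnes(1_000_000, 429)    # 1 GWh on the US-average grid
print(f"{ndr_gb_s:.0f} GB/s; {us_avg_t:.0f} t CO2")
```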

Assumption Provenance

The catalog below maps each appendix section to its source class and primary references. The book cites with @citekey; mlsysim mirrors the same provenance as Provenance records on Sourced registry scalars and metadata (e.g. Systems.Reliability.Gpu.mttf_hours.provenance, Hardware.Cloud.H100.metadata.provenance) without shipping a .bib file. Table 18 gives the full catalog.

Table 18: Fleet assumption provenance catalog: Quick map from appendix section to source class and bibliography.
| Appendix section | Source type | Primary references |
|---|---|---|
| H100 recap | Vendor datasheet peak | (Choquette 2023) |
| Cluster tiers | Book convention | (Kokolis et al. 2025) (scale context) |
| Network fabrics | Vendor specs; InfiniBand standard | (NVIDIA Corporation 2017, 2020; Choquette 2023; InfiniBand Trade Association 2000) |
| Reliability (MTTF) | Fleet studies | (Kokolis et al. 2025; Zu et al. 2024; Barroso et al. 2019) |
| Recovery times | Design assumptions | (Young 1974; Daly 2006) |
| Communication (\(\alpha\)-\(\beta\)) | Fabric specs; training at scale | (InfiniBand Trade Association 2000; Jiang et al. 2024) |
| AllReduce model | Algorithm identity | (Gibiansky 2017) |
| Sustainability (PUE/WUE/rack power) | Industry surveys | (Uptime Institute 2022; The Green Grid 2007) |
| Carbon intensity | Grid statistics | (International Energy Agency 2023) |
| Cloud economics | Illustrative rates | Editorial (2024–2025 order of magnitude) |
| MFU and scaling efficiency | Published LLM training | (Chowdhery et al. 2022; Narayanan et al. 2021; Pope et al. 2023; Jiang et al. 2024) |
| Overhead budgets | Book convention | Editorial engineering targets |
| Unit conventions | Decimal SI; book notation | Editorial |
Choquette, Jack. 2023. “NVIDIA Hopper H100 GPU: Scaling Performance.” IEEE Micro 43 (3): 9–17. https://doi.org/10.1109/mm.2023.3256796.
Kokolis, Apostolos, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zach DeVito, Shubho Sengupta, Kalyan Saladi, and Carole-Jean Wu. 2025. “Revisiting Reliability in Large-Scale Machine Learning Research Clusters.” 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 1259–74. https://doi.org/10.1109/hpca61900.2025.00096.
NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. NVIDIA Whitepaper.
NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Whitepaper, V1.0.
InfiniBand Trade Association. 2000. InfiniBand Architecture Specification Volume 1. InfiniBand Trade Association.
Zu, Y., A. Ghaffarkhah, H.-V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, et al. 2024. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 761–74.
Barroso, Luiz André, Urs Hölzle, and Parthasarathy Ranganathan. 2019. The Datacenter as a Computer: Designing Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Springer International Publishing. https://doi.org/10.1007/978-3-031-01761-2.
Young, John W. 1974. “A First Order Approximation to the Optimum Checkpoint Interval.” Communications of the ACM 17 (9): 530–31. https://doi.org/10.1145/361147.361115.
Daly, J. T. 2006. “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps.” Future Generation Computer Systems 22 (3): 303–12. https://doi.org/10.1016/j.future.2004.11.016.
Jiang, Ziheng, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, et al. 2024. “MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.” 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 745–60.
Gibiansky, A. 2017. Bringing HPC Techniques to Deep Learning. Baidu Research Technical Blog.
Uptime Institute. 2022. Uptime Institute Global Data Center Survey 2022. Uptime Institute.
The Green Grid. 2007. Green Grid Data Center Power Efficiency Metrics: PUE and DCIE. The Green Grid.
International Energy Agency. 2023. World Energy Outlook 2023. IEA. https://doi.org/10.1787/827374a6-en.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311.
Narayanan, Deepak, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, et al. 2021. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–15. https://doi.org/10.1145/3458817.3476209.
Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. “Efficiently Scaling Transformer Inference.” Proceedings of Machine Learning and Systems (MLSys) 5: 606–24.

Summary

This appendix is the book’s shared fleet-scale assumption sheet: every napkin-math estimate in the volume should be traceable to a row in the tables above—and, where it matters, to section 1.1 and table 18.

Key Takeaways: Using the fleet assumption tables
  • One place for shared fleet numbers: Cluster tiers, fabric bandwidths, MTTF and recovery times, \(\alpha\)-\(\beta\) communication parameters, facility efficiency, water use, carbon intensity, cloud pricing, MFU, scaling efficiency, and overhead budgets used in worked examples are listed here so you can audit or replace them without re-deriving from prose.
  • Reliability assumptions set the failure clock: MTTF and recovery-time rows determine how often failures interrupt training and how much wall time is lost. At 10,000+ GPUs, failure is steady state, not an exception; see Reliability Foundations for deeper modeling.
  • Communication assumptions determine viability: \(\alpha\)-\(\beta\) values determine whether synchronous distributed training is viable for a given model size and fabric. The 10× startup-latency gap between InfiniBand NDR (5 us) and TCP (50 us) explains why training clusters favor RDMA fabrics such as InfiniBand.
  • Sustainability assumptions expose location leverage: Facility efficiency, water use, and carbon intensity show that data-center location and cooling can outweigh many algorithmic tweaks. The 82× spread between Norway and Poland grid carbon intensity is often the largest lever.
  • Capacity assumptions quantify usable throughput: MFU, scaling efficiency, and overhead budgets bridge peak hardware to useful throughput. At 8,192 GPUs, only about 35 percent of ideal scaling is realized, and combined overheads consume roughly 23 percent of wall time.
  • Substitute your own assumptions: When your deployment differs—different fabric, region, MFU, or cluster size—swap the Value column and rerun the same formulas the chapters use.
  • Check provenance before debating a number: Datasheet peaks, fleet studies, illustrative rates, and book conventions are sourced differently; table 18 and section captions say which is which.