Fleet Foundations

Purpose

What reference numbers and physical laws should every fleet-scale ML engineer carry into distributed system design decisions?

Designing a system for a single machine requires knowledge of memory hierarchy latencies, roofline ridge points, and precision trade-offs. The moment a training job spans two machines, however, a new set of numbers takes over. Network latency replaces cache latency as the dominant concern. Component failure rates compound from negligible to inevitable. Communication overhead erodes the scaling efficiency that justifies buying more accelerators in the first place. This appendix collects the reference numbers and compact models for fleet-scale reasoning: the three system paradigms that underpin this book, the numbers every fleet engineer should know across Compute, Communication, and Coordination, and the scaling physics and thermal constraints that govern cluster design. In C³ terms, these numbers define the physical scale on which compute capacity, communication bandwidth, and coordination overhead become measurable.

How to Use This Appendix

This appendix is designed as a reference. When a fleet-scale design question arises, use it to turn a vague symptom into a specific constraint and then choose the lever that can actually move.

“How fast can I communicate between nodes?”: Start with the Communication Numbers in section 1.2 for the bandwidth and latency hierarchy.
“How many GPUs do I actually need?”: Use the Scaling Physics in section 1.3 to understand why doubling GPUs does not halve training time.
“How often will my cluster fail?”: Check the Coordination Numbers in section 1.2 for MTBF tables and failure probability calculations.
“Is my cluster power-limited or compute-limited?”: See Thermal and Power Physics in section 1.4 for power density and cooling constraints.
“What does a typical overhead budget look like?”: The Coordination Numbers include the four overhead categories that erode goodput.

The Three System Paradigms

Machine learning infrastructure at scale inherits design DNA from two distinct computing lineages—and then breaks the assumptions of both. Understanding where ML systems borrow from High-Performance Computing (HPC) and where they borrow from Warehouse-Scale Computing (WSC) is essential for choosing the right design trade-offs. A system architect who treats ML training as “just HPC” will build infrastructure that cannot tolerate failures. One who treats it as “just web services” will build infrastructure that cannot sustain the tight coupling that synchronous training demands.

High-performance computing

HPC systems descend from the supercomputer tradition. Their design philosophy is to maximize FLOP/s on tightly coupled simulations—weather modeling, molecular dynamics, nuclear physics. Every node matters: they use specialized interconnects (InfiniBand), low-latency fabrics, and homogeneous hardware. Fault tolerance follows the checkpoint/restart model: if one node fails, the entire job stops, rolls back to the last checkpoint, and restarts from scratch. Scheduling is batch-oriented (Slurm), with jobs requesting rigid resource shapes (“512 nodes for 24 hours”). Nodes are pets—individually important, individually tracked.

Warehouse-scale computing

WSC systems descend from the web services tradition. Their design philosophy is to maximize queries per second across loosely coupled services—search, email, social media. Hardware is commodity Ethernet, varying generations coexist, and nodes are heterogeneous. Fault tolerance follows the redundancy model: if one node fails, the load balancer reroutes traffic to another replica. The user never notices. Scheduling is dynamic (Kubernetes, Borg), with elastic bin-packing. Nodes are cattle—interchangeable and expendable.

The ML fleet: A hybrid architecture

ML systems require the computational throughput of HPC (to train massive models with synchronous gradient updates) but must operate at the scale and unreliability of WSC (thousands of accelerators running for weeks). This creates a hybrid that borrows selectively from both traditions.

Training workloads are synchronous and bandwidth-hungry like HPC, but long-running and failure-tolerant like WSC. Inference workloads are latency-sensitive like WSC, but computationally heavy like HPC. The network is a fusion: TCP/IP for control planes, InfiniBand or NVLink for data planes. Fault tolerance uses elastic training strategies: training jobs can shrink, expand, pause, or resume without full restarts. Scheduling combines gang allocation (all-or-nothing for training) with dynamic preemption and replacement.

Table 1 summarizes the design trade-offs across the three paradigms. The key insight is that ML fleets cannot simply adopt either HPC or WSC patterns wholesale; they must selectively combine elements from each based on the workload phase.

Table 1: Three System Paradigms: ML fleets inherit design DNA from both HPC and WSC but break assumptions of each. Training resembles HPC (tight coupling), inference resembles WSC (elastic serving), and fault tolerance is a hybrid of both.

Dimension	HPC (Supercomputer)	WSC (Web Cloud)	ML Fleet (AI Cluster)
Philosophy	Maximize FLOP/s	Maximize QPS	Maximize Model Quality per Dollar/Watt
Coupling	Tight (MPI)	Loose (RPC/HTTP)	Hybrid (NCCL + RPC)
State	Stateful (RAM)	Stateless (DB-backed)	Semi-Stateful (Checkpoints + KV Cache)
Network	Latency-optimized	Bandwidth-optimized	Bisection-bandwidth critical
Bottleneck	Compute throughput (FLOP/s)	I/O (Disk/Net)	Memory bandwidth (HBM)
Fault model	Checkpoint/Restart	Redundancy/Replicas	Elastic shrink/expand
Scheduling	Batch (Slurm)	Orchestration (K8s)	Gang + preemption
Node model	Pets (tracked)	Cattle (expendable)	Pets during job, cattle between jobs

Foundations recap

The following provides a compact reference for the key foundational ideas that reappear throughout the distributed systems chapters.

The iron law (\(T \approx D_{\text{vol}}/\text{BW} + O/(R_{\text{peak}} \cdot \eta_{\text{hw}}) + L_{\text{lat}}\)): Performance is bounded by data movement or compute. At fleet scale, the data movement term expands to include inter-node communication, not just memory bandwidth.
Roofline Model: Distinguishes compute-bound from memory-bound workloads using arithmetic intensity. At fleet scale, a third ceiling appears: network-bound workloads whose performance is limited by inter-node bandwidth.
Amdahl’s Law: Caps strong-scaling speedup at \(1/s\) (Amdahl 1967). At fleet scale, the “serial fraction” includes not just sequential code but also synchronization barriers, collective communication, and pipeline bubbles.
Training state rule: about 14 bytes per parameter for common mixed-precision Adam checkpoint state (BF16 weights, FP32 master weights, and Adam moments), before framework metadata and implementation-specific extras. At fleet scale, this determines how model state is partitioned across nodes (ZeRO, tensor parallelism, pipeline parallelism).
Little’s Law (\(Q_{\text{req}} = \lambda_{\text{arr}} T_{\text{lat}}\)): Sizes inference infrastructure, where \(\lambda_{\text{arr}}\) is the request arrival rate and \(Q_{\text{req}}\) is the expected concurrent request count (Little 1961). At fleet scale, it determines how many serving replicas are needed behind a load balancer.
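As a rough illustration of the Little's Law bullet, the sketch below sizes a serving fleet from an arrival rate and a mean latency. The function name `replicas_needed`, the per-replica concurrency limit, and the example workload are hypothetical values chosen for illustration; they do not come from this appendix.

```python
import math

def replicas_needed(arrival_rate_rps: float, latency_s: float,
                    concurrency_per_replica: int) -> int:
    """Size a serving fleet with Little's Law: Q_req = lambda_arr * T_lat."""
    # Expected number of requests in flight across the whole fleet.
    concurrent_requests = arrival_rate_rps * latency_s
    # Assumption: each replica can hold a fixed number of in-flight requests.
    return math.ceil(concurrent_requests / concurrency_per_replica)

# Hypothetical workload: 2,000 requests/s, 0.5 s mean latency, 32 slots per replica.
print(replicas_needed(2000, 0.5, 32))  # 1,000 requests in flight -> 32 replicas
```

The result is a floor on replica count under steady load; headroom for bursts would have to be added on top.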

Amdahl, Gene M. 1967. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” Proceedings of the April 18–20, 1967, Spring Joint Computer Conference (AFIPS ’67 Spring), 483–85. https://doi.org/10.1145/1465482.1465560.

Little, John D. C. 1961. “A Proof for the Queuing Formula: \(L = \lambda W\).” Operations Research 9 (3): 383–87. https://doi.org/10.1287/opre.9.3.383.

The transition from single-machine to fleet-scale reasoning requires extending these models with new dimensions: network topology, failure probability, and coordination overhead. The numbers in the next section provide the quantitative foundation for that extension.

Numbers Every Fleet Engineer Should Know

Just as single-machine analysis depends on a set of core numbers, fleet-scale engineering is governed by a set of predictable ratios and scaling behaviors. The single-machine numbers still apply within each node, but a new set of numbers governs the spaces between nodes. While absolute values evolve with hardware generations, the ratios between communication tiers and the scaling behavior of failure rates remain remarkably stable. Memorize the ratios and scaling trends; use the specific numbers as sanity checks.

Systems Perspective 1.1: Node-level numbers for fleet reasoning

Fleet reasoning depends on a few node-level numbers that directly affect fleet design: the training state rule of about 14 bytes per parameter for common Adam checkpoint state, with larger footprints when gradients, activations, metadata, or extra optimizer buffers are included; NVLink vs. HBM bandwidth for intra-node parallelism placement; and peak FLOP/s and HBM capacity for MFU, effective FLOP/s, batch sizing, and model sharding. Table 2 and the communication numbers below give inter-node and current-generation values; the memory hierarchy and roofline ridge points within a node provide the necessary baseline.

Those node-level baselines make the fleet-level ratios easier to interpret.

Systems Perspective 1.2: Three fleet numbers that matter most

Bandwidth gap: NVLink per-direction bandwidth within a node is ~9× faster than an InfiniBand NDR port between nodes. This ratio determines where parallelism boundaries belong—model parallelism within a node, data parallelism across nodes.
MTBF scaling: A cluster’s mean time between failures is the single-component MTBF divided by the number of components (\(1/N\) scaling). At 100,000 GPUs, expect a failure every 30 minutes.
Effective utilization: After MFU, scaling efficiency, and overhead losses compound, a cluster with 1,024 GPUs delivers roughly 19.2 percent of its peak FLOP/s as useful training work.

Quick reference—table 2 condenses the fleet-scale numbers into one place. Use it for back-of-envelope checks; use the detailed tables in each subsection when designing or debugging.

Table 2: Numbers Every Fleet Engineer Should Know (Quick Reference): One-page summary of the fleet-scale reference numbers in this section. See table 3 and table 4 for Communication, table 5 for Compute, and table 6 for Coordination.

Category	Number	Use
Communication	NVLink per-direction bandwidth ~9× IB NDR	Parallelism boundary (in-node vs. cross-node)
Communication	IB NDR ~50 GB/s, ~5 μs one-way	Inter-node bandwidth and latency
Compute	MFU 30–50%, \(\eta_{\text{scaling}}\) ~50% @ 1K, ~35% @ 8K	Effective FLOP/s and scaling sanity checks
Coordination	MTBF 8K: ~366.2 min; 100K: ~30 min	Failure expectation and checkpoint cadence
Coordination	175B checkpoint ~2450 GB (14 B/param common Adam state)	Recovery and storage sizing
Coordination	Goodput ~77% after overheads	Wall-clock utilization
Power & sustainability	AI rack ~70 kW; air limit ~30 kW	Cooling feasibility (liquid required above air limit)
Power & sustainability	PUE liquid ~1.06, typical ~1.40; 10,000 H100s \(\times\) 700 W ≈ 7 MW IT; facility load = IT \(\times\) PUE	Facility load and carbon (see section 1.4)

The invariants: Ratios that will not change

These relationships are governed by physics or architecture—they will still be true in 2035.

Network hierarchy ratio

The bandwidth gap between intra-node and inter-node communication is an architectural invariant. Chip-to-chip links (NVLink, ICI) connect through short, wide, dedicated paths on a shared substrate. Inter-node links (InfiniBand, Ethernet) must traverse cables, switches, and protocol stacks. This structural difference guarantees that intra-node bandwidth will always be an order of magnitude higher than inter-node bandwidth.

Currently, NVLink 4.0 provides about 9× the per-direction bandwidth of an InfiniBand NDR port (450 GB/s versus 50 GB/s). Even as both technologies improve, the ratio persists because both are constrained by the same physics: signaling rates, lane counts, and connector density. This ratio is the single most important number for parallelism strategy: any operation requiring more bandwidth than the inter-node link can provide must be confined within a single node.

Failure scaling law

For \(N\) independent components, each with mean time between failures \(\text{MTBF}_{\text{component}}\) under a constant-failure-rate assumption, equation 1 gives the system MTBF:

\[ \text{MTBF}_{\text{system}} = \frac{\text{MTBF}_{\text{component}}}{N} \tag{1}\]

This is pure arithmetic, not an approximation. Doubling the cluster size halves the time between failures. At 100,000 GPUs with a per-GPU MTTF of 50,000 hours, the cluster experiences a GPU failure every 30 minutes. No fault tolerance strategy can avoid this—the question is how quickly the system recovers.
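Equation 1 is simple enough to check in a few lines. The sketch below applies it to several cluster sizes using the 50,000-hour per-GPU figure from the text; the function name `system_mtbf_hours` is an illustrative choice, and the model covers GPU failures only.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Equation 1: system MTBF under independent, constant-rate failures."""
    return component_mtbf_hours / n_components

GPU_MTTF_HOURS = 50_000  # per-GPU figure used in the text

for n in (256, 2_048, 8_192, 100_000):
    hours = system_mtbf_hours(GPU_MTTF_HOURS, n)
    print(f"{n:>7,} GPUs: {hours:7.2f} h ({hours * 60:9.1f} min)")
```

Doubling `n_components` halves the result, which is the \(1/N\) scaling the equation states.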

AllReduce overhead scaling

The bandwidth-optimal Ring AllReduce algorithm transfers \(2(N-1)M/N\) bytes per participant, where \(M\) is the message size and \(N\) is the number of participants (Patarasuk and Yuan 2009). As \(N\) grows large, this approaches \(2M\) per GPU, so the per-GPU bandwidth term is nearly independent of the number of GPUs; aggregate cluster traffic still grows with the number of participants. This is why Ring AllReduce scales well in the bandwidth term. The latency term, however, grows as \(2(N-1) \times \alpha\), making latency the bottleneck for small messages on large rings. This trade-off motivates hierarchical AllReduce strategies that use Ring AllReduce within nodes and Tree AllReduce across nodes.
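To make the two terms concrete, the sketch below evaluates the bandwidth and latency terms separately for a 256-participant ring. The link figures (~50 GB/s and ~5 μs) are the InfiniBand NDR values quoted elsewhere in this appendix. The two message sizes are assumptions for illustration: 14 GB corresponds to BF16 gradients for a 7B-parameter model, and 1 MB stands in for a small collective.

```python
def ring_allreduce_terms(message_bytes, n, bandwidth_bytes_per_s, alpha_s):
    """Return (bandwidth_term_s, latency_term_s) for a Ring AllReduce."""
    # Each participant transfers 2(N-1)M/N bytes in total.
    bytes_per_participant = 2 * (n - 1) * message_bytes / n
    bandwidth_term = bytes_per_participant / bandwidth_bytes_per_s
    # The ring takes 2(N-1) steps, each paying one link latency alpha.
    latency_term = 2 * (n - 1) * alpha_s
    return bandwidth_term, latency_term

IB_NDR_BYTES_PER_S = 50e9  # ~50 GB/s per port (Table 3)
IB_NDR_ALPHA_S = 5e-6      # ~5 microseconds one-way (Table 4)

# Message sizes are illustrative assumptions, not values from the text.
for label, size in (("14 GB gradient buffer", 14e9), ("1 MB message", 1e6)):
    bw, lat = ring_allreduce_terms(size, 256, IB_NDR_BYTES_PER_S, IB_NDR_ALPHA_S)
    print(f"{label}: bandwidth {bw * 1e3:.2f} ms, latency {lat * 1e3:.2f} ms")
```

Under these assumptions the large buffer is dominated by the bandwidth term, while the small message is dominated by the latency term. That is the regime in which a tree across nodes becomes attractive.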

Patarasuk, Pitch, and Xin Yuan. 2009. “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.” Journal of Parallel and Distributed Computing 69 (2): 117–24. https://doi.org/10.1016/j.jpdc.2008.09.002.

Communication numbers

Communication defines the boundaries of parallelism. These tables quantify the bandwidth and latency at each tier of the network hierarchy, from the fastest intra-node links to the slowest wide-area connections. The key question for any distributed ML operation is: which tier of the network hierarchy does this communication cross?

Table 3 shows the bandwidth available at each tier; these tiers determine where tensor, pipeline, and data-parallel traffic should stay local. Note the roughly order-of-magnitude drop as communication crosses the node boundary.

Table 3: Communication Bandwidth Hierarchy: Per-direction bandwidth drops by roughly 9× crossing from intra-node (NVLink) to inter-node (InfiniBand). This ratio compares the H100 NVLink one-way rate against one InfiniBand NDR port and determines parallelism placement.

Interconnect	Bandwidth	Typical Role
NVLink 4.0 (H100)	450 GB/s	Tensor/pipeline parallelism within a node
TPU v5p ICI	1,200 GB/s	Intra-pod model parallelism (Google)
PCIe Gen5 x16	64 GB/s	CPU-GPU data transfer, NIC attachment
IB GXDR (1.6 Tbps)	200 GB/s	Next-gen inter-node (2026+)
IB XDR (800 Gbps)	100 GB/s	Inter-node standard (2025+)
IB NDR (400 Gbps)	50 GB/s	Current inter-node standard for AI clusters
IB HDR (200 Gbps)	25 GB/s	Previous-gen inter-node
RoCE v2 (100 GbE)	12.5 GB/s	Budget clusters, inference fleets

Table 4 shows the one-way latency at each tier. For collective operations on small messages, latency—not bandwidth—is the bottleneck.

Table 4: Communication Latency Hierarchy: Latency determines whether synchronous training is feasible. TCP/IP is roughly 10\(\times\) slower than InfiniBand NDR, making it unsuitable for gradient synchronization in large clusters.

Interconnect	One-Way Latency	Implication
InfiniBand NDR	~5 μs	Low enough for synchronous AllReduce
InfiniBand HDR	~7 μs	Adequate for most training topologies
RoCE v2	~10 μs	Acceptable for data parallelism
TCP/IP (Ethernet)	~50 μs	Too slow for synchronous training
Cross-data-center	~40,000 μs (40 ms)	Physics floor; async training only

Compute numbers

Raw peak FLOP/s is a necessary but misleading metric for fleet capacity planning. Two multiplicative losses—Model FLOPs Utilization (MFU) and scaling efficiency—reduce effective throughput dramatically. Understanding these losses transforms fleet sizing from guesswork into engineering.

Model FLOPs Utilization (MFU) measures what fraction of peak FLOP/s a training workload actually achieves. Well-optimized large-model training on current hardware achieves 30–50 percent MFU. The gap comes from memory stalls, kernel launch overhead, pipeline bubbles, and suboptimal operator fusion. MFU below the low end of that range signals optimization opportunities; MFU above the high end indicates excellent hardware utilization.

Scaling efficiency (\(\eta_{\text{scaling}}\)) measures how much useful computation survives as accelerators are added. Table 5 shows the empirical ranges for well-optimized distributed training.

Table 5: Scaling Efficiency by Cluster Size: Efficiency degrades roughly as the logarithm of cluster size. These ranges assume well-optimized data parallelism with gradient compression. Poorly optimized systems can lose 2–3\(\times\) more.

Cluster Size	Scaling Efficiency (\(\eta_{\text{scaling}}\))	Implication
32 GPUs	~90%	Near-linear scaling; communication is negligible
256 GPUs	~70%	Communication starts to erode throughput
1,024 GPUs	~50%	Significant overhead; optimization critical
8,192 GPUs	~35%	Fleet-scale regime; 65% of compute is overhead

Coordination numbers

At fleet scale, coordination—failure recovery, checkpointing, and maintenance—consumes a measurable fraction of wall-clock time. These numbers quantify the costs of keeping a large cluster running.

Failure rates by cluster size

Table 6 shows how cluster MTBF shrinks with scale, using a per-GPU MTTF of 50,000 hours (~5.7 years). It translates per-GPU reliability into the cluster-level failure cadence that checkpoint planning must absorb. The failure probability column shows the likelihood of at least one GPU failure during a 24-hour training window.

Table 6: MTBF and Failure Probability by Cluster Size: GPU-only failure model with per-GPU MTTF of 50,000 hours. Real clusters also include NIC, PSU, cable, and switch failures, so these estimates understate the real failure rate. The probability column uses \(\Pr(\geq 1\ \text{failure}) = 1 - e^{-T/\text{MTBF}}\) for \(T = 24\) hours.

Cluster Size	MTBF (GPU-only)	Minutes	\(\Pr(\text{failure})\) in 24 hours
256 GPUs	195.3 h	11,718.8 min	11.6%
2,048 GPUs	24.4 h	1,464.8 min	62.6%
8,192 GPUs	6.1 h	366.2 min	98%
100,000 GPUs	0.50 h	30 min	100%

The key takeaway: at 8,192 GPUs and above, failure is no longer an edge case. GPU-only MTBF is about 366.2 min at 8,192 GPUs, and the probability of at least one GPU failure within 24 hours is 98 percent. Fault tolerance is not optional at fleet scale; it is a prerequisite for completing any long-running training job. Fault Tolerance covers the mechanisms in detail.
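The probability column follows directly from the exponential formula in the table caption. The sketch below is a minimal reproduction under the same GPU-only assumption; the function name and default arguments are illustrative choices.

```python
import math

def failure_probability(n_gpus: int, window_hours: float = 24.0,
                        gpu_mttf_hours: float = 50_000.0) -> float:
    """Pr(at least one GPU failure in the window) = 1 - exp(-T / MTBF_cluster)."""
    cluster_mtbf_hours = gpu_mttf_hours / n_gpus
    return 1.0 - math.exp(-window_hours / cluster_mtbf_hours)

for n in (256, 2_048, 8_192, 100_000):
    print(f"{n:>7,} GPUs: {failure_probability(n):6.1%} chance of a failure in 24 h")
```

Changing `window_hours` to the length of a planned run gives the chance that the run sees at least one GPU failure.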

Checkpoint sizes

Checkpointing is the primary recovery mechanism, and its cost depends on the model size. Table 7 shows checkpoint sizes for common mixed-precision Adam training checkpoints (14 bytes per parameter: 2B for BF16 weights and 12B for FP32 master weights + momentum + variance). Gradients are normally transient and recomputed after restore rather than serialized as durable checkpoint state.

Table 7: Checkpoint Sizes by Model Scale: Uses 14 bytes/parameter for common mixed-precision Adam checkpoint state. Write time assumes 100 GB/s aggregate storage bandwidth. Asynchronous checkpointing (Fault Tolerance) can overlap writes with training, reducing the visible overhead.

Model Size	Checkpoint Size	Write Time @ 100 GB/s
7B parameters	98 GB	0.98 s
70B parameters	980 GB	9.8 s
175B parameters	2,450 GB	24.5 s
1T parameters	14 TB	2.3 min
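The table rows can be regenerated from the 14 bytes/parameter rule and the assumed 100 GB/s storage bandwidth. This is a sketch of that arithmetic only. The function names are illustrative, and real checkpoints add framework metadata on top.

```python
BYTES_PER_PARAM = 14  # BF16 weights (2) + FP32 master weights, momentum, variance (12)

def checkpoint_gb(n_params: float) -> float:
    """Checkpoint size in GB for common mixed-precision Adam state."""
    return n_params * BYTES_PER_PARAM / 1e9

def write_seconds(n_params: float, storage_gb_per_s: float = 100.0) -> float:
    """Synchronous write time at a given aggregate storage bandwidth."""
    return checkpoint_gb(n_params) / storage_gb_per_s

for label, n in (("7B", 7e9), ("70B", 70e9), ("175B", 175e9), ("1T", 1e12)):
    print(f"{label:>4}: {checkpoint_gb(n):8,.0f} GB, {write_seconds(n):6.1f} s")
```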

Overhead budgets

Failure recovery and checkpointing are only part of the wall-clock budget. At fleet scale, table 8 separates four recurring categories of overhead that consume time not spent on useful training:

Table 8: Overhead Budgets for Fleet-Scale Training: These are fractions of wall-clock time. At 10,000+ GPUs, failure recovery dominates. The compound effect is additive: total goodput ratio \(\approx 1.0 - (0.05 + 0.03 + 0.10 + 0.05) = 0.77 \approx 77\%\).

Overhead Category	Typical Budget	Lever
Pipeline bubbles	~5%	Increase microbatches per pipeline stage
Checkpointing	~3%	Async checkpointing, faster storage
Failure recovery	~10%	Faster detection, elastic rescheduling
Maintenance windows	~5%	Rolling upgrades, live migration
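The additive goodput estimate in the caption can be written as a small budget calculation. The sketch below uses the table's typical values; the dictionary keys and function names are illustrative, and the additive model is the same simplification the caption uses.

```python
OVERHEAD_BUDGET = {
    "pipeline_bubbles": 0.05,
    "checkpointing": 0.03,
    "failure_recovery": 0.10,
    "maintenance_windows": 0.05,
}

def goodput_ratio(overheads: dict) -> float:
    """Additive model: fraction of wall-clock time left for useful training."""
    return 1.0 - sum(overheads.values())

def largest_overhead(overheads: dict) -> str:
    """Name of the category consuming the most wall-clock time."""
    return max(overheads, key=overheads.get)

print(f"goodput = {goodput_ratio(OVERHEAD_BUDGET):.2f}")
print(f"largest lever: {largest_overhead(OVERHEAD_BUDGET)}")
```

Substituting measured fractions for a specific cluster shows which lever in the table is worth pulling first.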

Power and sustainability numbers

Fleet-scale capacity planning and sustainability reporting require a few power numbers that every fleet engineer should know. At fleet scale, the critical numbers are rack power density, PUE, and the air-cooling limit—they determine where construction is feasible and what the facility load will be. Table 9 collects these reference values:

Table 9: Power and Cooling Reference Numbers: Rack power densities, air-cooling limits, and PUE values that govern fleet-scale capacity planning and sustainability reporting. Each row pairs a typical value with the engineering decision it informs.

Quantity	Typical value	Use
Traditional rack	12 kW	Baseline for non-AI data centers
AI rack (current gen)	70 kW	Liquid cooling required
AI rack (high-density)	100 kW	Direct-to-chip liquid
Air cooling limit	~30 kW per rack	Physics ceiling; above this, liquid is mandatory
PUE (liquid-cooled AI)	~1.06	Best case: facility load ≈ IT load
PUE (best air-cooled)	~1.12	Hyperscale best practice
PUE (industry average)	~1.40	Sanity check for cost/carbon
H100 TDP	700 W per GPU	IT load: 10,000 \(\times\) 700 W = 7 MW

Rule of thumb: IT load (MW) = (number of GPUs \(\times\) TDP per GPU)/\(10^6\); facility load = IT load \(\times\) PUE. A 10,000-GPU H100 cluster at 700 W each is 7 MW IT; at PUE 1.40 that is 9.8 MW facility draw. For carbon and cost, see section 1.4 and Sustainable AI.
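The rule of thumb translates directly into code. The sketch below counts accelerator TDP only, as the rule does, and compares the PUE values from table 9; the function names are illustrative.

```python
def it_load_mw(n_gpus: int, tdp_watts: float) -> float:
    """IT load in MW from accelerator TDP alone (host CPUs and NICs not counted)."""
    return n_gpus * tdp_watts / 1e6

def facility_load_mw(n_gpus: int, tdp_watts: float, pue: float) -> float:
    """Facility draw = IT load x PUE."""
    return it_load_mw(n_gpus, tdp_watts) * pue

for label, pue in (("liquid-cooled", 1.06), ("best air-cooled", 1.12),
                   ("industry average", 1.40)):
    print(f"{label:>16}: {facility_load_mw(10_000, 700, pue):5.2f} MW")
```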

Current hardware reference (c. 2025–2026)

These numbers (table 10) reflect the current generation of fleet-scale hardware. Use them for back-of-envelope calculations, but expect them to improve ~2\(\times\) every 2–3 years.

Table 10: Fleet-Scale Hardware Reference (c. 2025–2026): Per-accelerator specifications for the 2025–2026 generations. The B200, MI300X, and Tensor Processing Unit (TPU) v6 (Trillium) represent the frontier of fleet-scale compute density and memory bandwidth.

Spec	NVIDIA B200	AMD MI300X	Google TPU v6
BF16/FP16 Peak	2,250 TFLOP/s	1,307 TFLOP/s	918 TFLOP/s
Memory Bandwidth	8 TB/s	5.30 TB/s	1.60 TB/s
HBM Capacity	192 GB	192 GB	32 GB
Intra-Node Link	1,800 GB/s (NVLink 5.0)	~890 GB/s (Infinity Fabric)	~2,000 GB/s (estimated)
TDP	1,000 W	750 W	~600 W (estimated)

For scale context, a DGX H100 SuperPOD contains 32 DGX H100 nodes (256 H100 GPUs). Meta announced two 24,576-H100 data-center-scale clusters built on the Grand Teton hardware platform. Google’s TPU v5p pods scale to 8,960 chips. The largest announced clusters (as of 2025) exceed 100,000 accelerators.

Scaling Physics

The numbers in the previous section describe what the hardware can do. Scaling physics describes what happens when more of it is brought to bear. Reasoning starts from the ceiling on a single accelerator—the roofline—and then asks what survives as devices are added, governed by three models: Amdahl’s Law extended to fleet overhead, the communication-computation ratio, and weak scaling behavior.

The single-accelerator roofline

Fleet performance rests on per-accelerator performance, and a single accelerator is bounded by one of two resources: the rate its arithmetic units sustain (peak FLOP/s) or the rate its memory delivers operands (HBM bandwidth). Which one binds a given kernel depends on its arithmetic intensity \(I\), the number of useful operations performed per byte moved from memory, as equation 2 defines:

\[ I = \frac{\text{FLOPs}}{\text{bytes moved}} \tag{2}\]

Intensity is a property of the workload, not the hardware. A large matrix multiply has high intensity because each weight loaded from memory is reused across many multiply-accumulates; an element-wise operation, or the generation of a single token, has low intensity because each loaded value feeds only one or two operations before the next must be fetched.

The roofline model (Williams et al. 2009) turns intensity into a performance ceiling. Equation 3 states that attainable throughput is the lesser of the compute ceiling and the bandwidth-limited slope:

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://doi.org/10.1145/1498765.1498785.

\[ R_{\text{attain}} = \min\!\left(R_{\text{peak}},\; I \times \text{BW}\right) \tag{3}\]

Plotted against intensity, equation 3 traces a roof: a rising bandwidth-limited line on the left and a flat compute ceiling on the right. The two meet at the ridge point, the intensity at which a workload first saturates the arithmetic units, defined by equation 4:

\[ I_{\text{ridge}} = \frac{R_{\text{peak}}}{\text{BW}} \tag{4}\]

A workload with \(I < I_{\text{ridge}}\) is memory bound: it sits on the slope, and faster arithmetic units change nothing because the operands cannot arrive quickly enough. A workload with \(I > I_{\text{ridge}}\) is compute bound: it sits under the flat roof, limited by the arithmetic units themselves.

For an H100—BF16 peak 989 TFLOP/s, HBM bandwidth 3.35 TB/s—the ridge point is 295.2 FLOP/byte. A workload must perform hundreds of operations per byte fetched merely to keep the arithmetic units busy.

Autoregressive decoding falls far short of that line, which is why it is the defining serving bottleneck. Generating one token requires reading every model weight from HBM exactly once (16.1 GB for an 8B-parameter model in BF16) to perform about 16.1 GFLOP of matrix work, an intensity of 1 FLOP/byte. That sits far below the ridge point, so single-stream decode is purely bandwidth bound: throughput is the weight-read rate, about 208.6 tokens/s, no matter how much compute the accelerator advertises. Serving systems batch many requests precisely to raise this intensity, amortizing each weight read across every sequence in the batch and pushing decode back toward the compute roof.
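Equations 3 and 4 and the decode estimate above can be checked with a short sketch. The H100 figures are the ones quoted in the text. The rounded 16.1 GB weight size is used as-is, so the token rate comes out near 208 tokens/s; the exact decimal depends on the unrounded parameter count. The function names are illustrative.

```python
def ridge_point(peak_flops_per_s: float, mem_bytes_per_s: float) -> float:
    """Equation 4: intensity (FLOP/byte) at which the compute roof is reached."""
    return peak_flops_per_s / mem_bytes_per_s

def attainable_flops(intensity: float, peak_flops_per_s: float,
                     mem_bytes_per_s: float) -> float:
    """Equation 3: the lesser of the compute ceiling and the bandwidth slope."""
    return min(peak_flops_per_s, intensity * mem_bytes_per_s)

H100_PEAK = 989e12    # BF16 FLOP/s
H100_BW = 3.35e12     # HBM bytes/s
WEIGHT_BYTES = 16.1e9 # rounded figure from the text (8B-class model, BF16)

ridge = ridge_point(H100_PEAK, H100_BW)
decode_rate = attainable_flops(1.0, H100_PEAK, H100_BW)  # intensity ~1 FLOP/byte
tokens_per_s = H100_BW / WEIGHT_BYTES                    # one full weight read per token

print(f"ridge point: {ridge:.1f} FLOP/byte")
print(f"decode attains {decode_rate / H100_PEAK:.2%} of peak compute")
print(f"single-stream decode: ~{tokens_per_s:.0f} tokens/s")
```

Under this simple model, batching \(B\) sequences multiplies the decode intensity by roughly \(B\), since each weight read is shared across the batch.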

The roofline is a single-accelerator law, but its logic is what scales. At fleet level the binding bandwidth is no longer HBM but the interconnect, and the bytes moved are the gradients and activations crossing the network. That ratio of network bytes to local arithmetic is the communication-computation ratio (section 1.3.3): the roofline of the fleet, and the lens the rest of this section applies.

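Consider a training job that, when provisioned with 4,096 GPUs, achieves only 2.5× the throughput of a 1,024-GPU run. Quadrupling the hardware for a 2.5× gain is 62.5 percent relative scaling efficiency; compounded with the roughly 50 percent efficiency already typical at 1,024 GPUs (table 5), that implies about 31.2 percent scaling efficiency at the larger size. Scaling physics provides the diagnostic tools to determine whether such a system is performing as well as physics allows or whether there is an engineering problem to fix.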

Amdahl’s Law at fleet scale

Amdahl’s Law establishes that the maximum speedup of a parallel system is limited by its serial fraction \(s\). At fleet scale, the “serial fraction” is not just sequential code—it includes every operation that forces all \(N\) GPUs to wait:

AllReduce synchronization: All GPUs must complete their gradient computation before any can proceed.
Pipeline bubbles: The warmup and cooldown phases of pipeline parallelism leave stages idle.
Checkpoint writes: Even asynchronous checkpoints contend for storage bandwidth.
Python-level overhead: Single-threaded operations in the training loop (data loading, metric logging).

To see the fleet-scale implications, consider a training workload where 10 percent of wall-clock time is spent in synchronization, communication, and other serial overhead. Amdahl’s Law gives the following speedups:

With 32 GPUs: 7.8× speedup (already only about 24 percent parallel efficiency)
With 256 GPUs: 9.7× speedup (diminishing returns are severe)
With 1,024 GPUs: 9.9× speedup (approaching the ceiling)
With 8,192 GPUs: 10× speedup (nearly at the Amdahl limit)
With \(N \to \infty\): capped at 10×

With just 10 percent serial overhead, no amount of hardware can deliver more than 10× speedup on this fixed workload. This is why fleet-scale training does not simply add more GPUs to the same problem—it scales the problem (weak scaling) to keep the serial fraction small relative to the total work.
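The speedups listed above come from the standard Amdahl formula \(1/(s + (1-s)/N)\). The sketch below reproduces them for \(s = 0.10\) and also reports parallel efficiency (speedup divided by GPU count), a derived figure added here for illustration.

```python
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Strong-scaling speedup on n workers with a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

SERIAL_FRACTION = 0.10  # the 10 percent overhead assumed in the text

for n in (32, 256, 1_024, 8_192):
    s = amdahl_speedup(n, SERIAL_FRACTION)
    print(f"{n:>5,} GPUs: {s:4.1f}x speedup, {s / n:6.1%} parallel efficiency")

print(f"limit as N grows: {1.0 / SERIAL_FRACTION:.0f}x")
```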

Systems Perspective 1.3: The compound overhead trap

The 10 percent serial fraction above is optimistic. In practice, fleet-scale serial overhead accumulates from multiple sources: 5 percent pipeline bubbles + 3 percent checkpointing + 10 percent failure recovery + 5 percent maintenance, or 23 percent in total. These are not all strictly serial in the Amdahl sense, since some overlap with computation, but they illustrate how quickly small per-category overheads compound into significant throughput loss.

The communication-computation ratio

The fundamental question for any distributed training strategy is: does the computation between synchronization points take long enough to hide the communication? Equation 5 defines the communication-computation ratio (\(\rho\)) that answers this directly:

\[ \rho = \frac{T_{\text{comm}}(N)}{T_{\text{compute}}/N} \tag{5}\]

When \(\rho < 1\), computation dominates and communication can be overlapped. When \(\rho > 1\), the system is communication-bound—GPUs spend more time waiting for data than computing on it.

Table 11 shows the ratio for three representative scenarios. The contrast between them reveals why parallelism strategy must match the workload.

Table 11: Communication-Computation Ratio (\(\rho\)): When \(\rho \ll 1\), communication can be fully overlapped with computation. When \(\rho \gg 1\), the system is communication-bound and GPUs sit idle waiting for data. Tensor parallelism within a node benefits from NVLink’s high bandwidth, keeping \(\rho\) small.

Scenario	\(T_{\text{comm}}(N)\)	\(T_{\text{compute}}/N\)	\(\rho\)
Data parallel, 7B model, 256 GPUs	560.4 ms (AllReduce over IB NDR)	~5 s (fwd+bwd)	0.11
Data parallel, 350M model, 256 GPUs	29.6 ms (AllReduce over IB NDR)	~10 ms (fwd+bwd)	3
Tensor parallel, 8 GPUs (NVLink)	~0.04 ms (activation over NVLink)	~1 ms (one layer)	0.036

The 7B model on 256 GPUs achieves \(\rho =\) 0.11: communication takes about 11.2 percent as long as computation, which can be partially overlapped. The 350M model on the same cluster has \(\rho =\) 3, so communication dominates and the configuration is communication-bound. The solution is either to use fewer GPUs (which raises the compute per GPU per step and shortens the AllReduce ring) or to increase the computation per step (larger batch size, gradient accumulation).

Tensor parallelism within a node achieves \(\rho =\) 0.036, confirming that NVLink bandwidth is sufficient to keep intra-node parallelism compute-bound. This is the quantitative reason why tensor parallelism is confined within nodes while data parallelism spans across them.
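As a minimal sketch of equation 5, the code below computes \(\rho\) for the two data-parallel rows using the communication and compute times listed in table 11, then labels the regime. The function names and regime labels are illustrative. The tensor-parallel row is omitted because its communication time is given only as a rounded value.

```python
def comm_compute_ratio(t_comm_s: float, t_compute_per_gpu_s: float) -> float:
    """Equation 5: rho = T_comm(N) / (T_compute / N)."""
    return t_comm_s / t_compute_per_gpu_s

def regime(rho: float) -> str:
    """Label the regime using the rho = 1 threshold from the text."""
    return "communication-bound" if rho > 1 else "compute-dominated"

# (T_comm, T_compute / N) in seconds, taken from the table rows.
scenarios = {
    "DP, 7B model, 256 GPUs": (560.4e-3, 5.0),
    "DP, 350M model, 256 GPUs": (29.6e-3, 10e-3),
}

for name, (t_comm, t_comp) in scenarios.items():
    rho = comm_compute_ratio(t_comm, t_comp)
    print(f"{name}: rho = {rho:.2f} ({regime(rho)})")
```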

Weak scaling at fleet scale

Amdahl’s Law paints a pessimistic picture because it assumes a fixed problem size. In practice, fleet-scale training often follows weak scaling: the problem size (tokens, data, model parameters) grows proportionally with the number of GPUs. Gustafson’s Law captures this more optimistic view (Gustafson 1988).

Gustafson, John L. 1988. “Reevaluating Amdahl’s Law.” Communications of the ACM 31 (5): 532–33. https://doi.org/10.1145/42411.42415.

The key insight for fleet-scale ML is that weak scaling is not just a mathematical convenience—it reflects reality. Engineers do not use 8,192 GPUs to train a 7B model faster; they use them to train a 70B or 700B model in reasonable time. As models and training batches grow, engineers often increase the compute performed between synchronization points. For dense Transformers, compute is commonly estimated as about 6\(\times\) parameters \(\times\) tokens, while gradient communication scales with parameter bytes, so the communication-computation ratio \(\rho\) depends on batch size, sequence length, accumulation, and parallelism layout rather than on parameter count alone.
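The text cites Gustafson's Law without stating it. In its standard form the scaled speedup is \(S(N) = N - s(N-1)\), where \(s\) is the serial fraction of the scaled run. The sketch below contrasts it with Amdahl's fixed-size speedup for the same 10 percent serial fraction. The comparison is illustrative and is not a prediction for any particular training job.

```python
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Fixed problem size (strong scaling)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(n: int, serial_fraction: float) -> float:
    """Problem size grows with n (weak scaling): S = N - s(N - 1)."""
    return n - serial_fraction * (n - 1)

SERIAL_FRACTION = 0.10

for n in (32, 1_024, 8_192):
    print(f"{n:>5,} GPUs: Amdahl {amdahl_speedup(n, SERIAL_FRACTION):5.1f}x, "
          f"Gustafson {gustafson_speedup(n, SERIAL_FRACTION):8.1f}x")
```

The gap between the two columns is the quantitative case for growing the model, batch, or token budget with the cluster rather than holding the problem fixed.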

Systems Perspective 1.4: The compound loss of fleet utilization

A 1,024-GPU H100 cluster has a peak aggregate throughput of 1,012,736 TFLOP/s. After the three multiplicative losses, the effective throughput is:

\[R_{\text{eff}} = N R_{\text{peak}} \times \text{MFU} \times \eta_{\text{scaling}} \times \eta_{\text{goodput}}\]

\[= 1{,}012{,}736 \text{ TFLOP/s} \times 0.50 \times 0.50 \times 0.77 \approx 194{,}951.7 \text{ TFLOP/s}\]

The cluster delivers about 19.2 percent of its peak FLOP/s as useful training work. The remaining 80.8 percent is lost to the compounding of three retained fractions: 50 percent MFU, 50 percent scaling efficiency, and 77 percent goodput.

This is not a failure of engineering—it is the physics of fleet-scale computation. Every additional GPU adds less marginal useful work, but the total throughput still far exceeds what a smaller cluster could achieve. The goal is not to reach 100 percent utilization; the goal is to deliver trained models faster than any smaller configuration could.
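The compound-loss arithmetic above can be checked with a short calculation. This is a minimal sketch: the function name and parameters are illustrative, and the per-GPU peak of 989 TFLOP/s is inferred by dividing the stated aggregate (1,012,736 TFLOP/s) by 1,024 GPUs.

```python
def effective_throughput(
    n_gpus: int,
    peak_per_gpu_tflops: float,
    mfu: float,
    scaling_eff: float,
    goodput: float,
) -> float:
    """R_eff = N * R_peak * MFU * eta_scaling * eta_goodput, in TFLOP/s."""
    return n_gpus * peak_per_gpu_tflops * mfu * scaling_eff * goodput


N_GPUS = 1024
PEAK_PER_GPU = 989.0  # TFLOP/s, inferred from the stated aggregate peak

r_eff = effective_throughput(N_GPUS, PEAK_PER_GPU, 0.50, 0.50, 0.77)
fraction_of_peak = r_eff / (N_GPUS * PEAK_PER_GPU)

print(f"Effective throughput: {r_eff:,.1f} TFLOP/s")
print(f"Fraction of peak:     {fraction_of_peak:.2%}")
```

Because the three factors multiply, improving any single one by a given ratio raises effective throughput by the same ratio. This makes the calculation a convenient way to compare where an engineering effort pays off.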

Thermal and Power Physics

Compute performance ultimately converts to heat, and heat must be removed. At fleet scale, thermal and power constraints are not secondary concerns—they determine where construction is feasible, how densely accelerators can be packed, and what the operating costs will be. A cluster that is architecturally sound but thermally infeasible cannot be built.

Consider a cluster design calling for 10,000 H100 GPUs at 700 W each. The GPUs alone draw 7 MW of IT load, and at a PUE of 1.40 the total facility load reaches 9.8 MW. This is the electrical load of a small town. Before optimizing software, the physics of power delivery and heat removal must be feasible.
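The facility-load estimate is a two-step multiplication, sketched below. The function name is illustrative, and, like the estimate in the text, it counts only GPU power as IT load. Host CPUs, memory, storage, and network equipment would add to the real figure.

```python
def facility_load_mw(n_gpus: int, watts_per_gpu: float, pue: float) -> tuple[float, float]:
    """Return (IT load, total facility load) in megawatts.

    Simplification: IT load here is GPU power only.
    """
    it_load_mw = n_gpus * watts_per_gpu / 1e6
    return it_load_mw, it_load_mw * pue


it_mw, total_mw = facility_load_mw(n_gpus=10_000, watts_per_gpu=700, pue=1.40)
print(f"IT load:       {it_mw:.1f} MW")
print(f"Facility load: {total_mw:.1f} MW")
```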

Power density wall

The shift from traditional data center workloads to AI training has created a power density crisis. Table 12 quantifies the gap.

Table 12: Power Density: Traditional vs. AI Workloads: AI racks consume roughly 5.8× the power of traditional racks. Air cooling physically cannot remove heat fast enough above ~30 kW per rack, making liquid cooling mandatory for modern AI clusters.

Configuration	Power per Rack	Cooling Implication
Traditional data center	12 kW	Standard air cooling sufficient
AI cluster (current gen)	70 kW	Liquid cooling required
AI cluster (high-density)	100 kW	Direct-to-chip liquid cooling required
Air cooling limit	~30 kW	Physics ceiling for forced-air convection

The 5.8× increase in rack power density between traditional and AI workloads is not merely an engineering challenge—it represents a fundamental constraint on facility design. Existing data centers built for 12 kW racks cannot simply be “retrofitted” with AI hardware. The electrical distribution, cooling infrastructure, and floor loading must all be redesigned. This is why new AI-focused data centers are being built from the ground up with liquid cooling as the baseline assumption.
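The density figures in Table 12 reduce to a ratio and a threshold check. The sketch below assumes the approximate 30 kW forced-air ceiling from the table as a hard cutoff. Real facilities vary, so the constant and function names should be read as illustrative.

```python
AIR_COOLING_LIMIT_KW = 30.0  # approximate forced-air ceiling from Table 12


def density_ratio(ai_rack_kw: float, traditional_rack_kw: float) -> float:
    """How many times more power an AI rack draws than a traditional rack."""
    return ai_rack_kw / traditional_rack_kw


def needs_liquid_cooling(rack_kw: float) -> bool:
    """True when rack power exceeds the assumed air-cooling ceiling."""
    return rack_kw > AIR_COOLING_LIMIT_KW


print(f"Density ratio: {density_ratio(70, 12):.1f}x")
for rack_kw in (12, 70, 100):
    label = "liquid" if needs_liquid_cooling(rack_kw) else "air"
    print(f"{rack_kw:>3} kW rack -> {label} cooling")
```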

The energy hierarchy at scale

Power Usage Effectiveness (PUE) measures facility energy overhead: total facility energy divided by IT equipment energy (The Green Grid 2007; Uptime Institute 2022). A PUE of 1.0 would mean all facility energy is delivered to IT equipment; values above 1.0 represent overhead for cooling, power conversion, lighting, and other facility systems.

The Green Grid. 2007. Green Grid Data Center Power Efficiency Metrics: PUE and DCIE. The Green Grid.

Uptime Institute. 2022. Uptime Institute Global Data Center Survey 2022. Uptime Institute.

Table 13 compares PUE across data center generations and cooling technologies.

Table 13: Power Usage Effectiveness (PUE): Lower is better. Liquid cooling achieves PUE near 1.0 because it removes heat directly from the chip without the intermediate step of heating air. The gap between legacy (1.58) and liquid-cooled (1.06) represents about a 32.9 percent reduction in total facility power at fixed IT load; the non-IT overhead drops from 0.58 MW to 0.06 MW per MW of IT load.

Data Center Type	PUE	Overhead per MW of IT Load
Liquid-cooled AI data center	1.06	60 kW cooling + infrastructure
Best-in-class air-cooled	1.12	120 kW overhead
Industry average	1.40	400 kW overhead
Legacy enterprise	1.58	580 kW overhead

For fleet-scale cost calculations, PUE directly multiplies the electricity bill. A 10 MW IT load in a legacy data center (PUE 1.58) requires 15.8 MW total, while the same load in a liquid-cooled facility (PUE 1.06) requires only 10.6 MW—a savings of 5.2 MW. At typical commercial electricity rates, this difference translates to millions of dollars per year. Sustainable AI covers the full sustainability implications.
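To make "millions of dollars per year" concrete, the sketch below converts the 5.2 MW difference into an annual cost. The electricity rate of $0.10 per kWh is a hypothetical placeholder, not a figure from the text, and the calculation assumes the load runs continuously for all 8,760 hours of the year.

```python
HOURS_PER_YEAR = 8760


def facility_power_mw(it_load_mw: float, pue: float) -> float:
    """Total facility power for a given IT load and PUE."""
    return it_load_mw * pue


def annual_cost_usd(power_mw: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost, assuming continuous operation."""
    return power_mw * 1000 * HOURS_PER_YEAR * usd_per_kwh


legacy_mw = facility_power_mw(10, 1.58)
liquid_mw = facility_power_mw(10, 1.06)
saved_mw = legacy_mw - liquid_mw

RATE_USD_PER_KWH = 0.10  # hypothetical rate, for illustration only
print(f"Power saved:     {saved_mw:.1f} MW ({saved_mw / legacy_mw:.1%} of legacy)")
print(f"Annual savings:  ${annual_cost_usd(saved_mw, RATE_USD_PER_KWH):,.0f}")
```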

Summary

Key Takeaways: Fleet scale changes every constraint

Three paradigms, one hybrid: ML fleets combine HPC’s tight coupling (for training) with WSC’s elastic fault tolerance (for inference), creating a new architectural paradigm that cannot adopt either predecessor’s patterns wholesale.
Bandwidth gap: NVLink provides ~9× more per-direction bandwidth than InfiniBand NDR. This invariant ratio determines parallelism placement: tensor parallelism within nodes, data parallelism across nodes.
Failures become routine at scale: Cluster MTBF scales as \(1/N\). At 8,192 GPUs, expect a GPU failure about every 366.2 minutes, and a 98.04 percent probability of at least one failure in any 24-hour window. Fault tolerance is not optional; it is a prerequisite for completing any fleet-scale training job.
Compound utilization loss: After MFU (~50 percent), scaling efficiency (~50 percent), and goodput after operational overhead (~77 percent) compound, a 1,024-GPU cluster delivers ~19.2 percent of peak FLOP/s as useful work.
Power density demands liquid cooling: AI racks consume 5.8× the power of traditional racks. Air cooling fails above ~30 kW per rack, making liquid cooling a physical requirement for modern AI clusters.
Communication-computation ratio (\(\rho\)) governs scaling strategy: When \(\rho > 1\), GPUs are idle waiting for data—reduce parallelism or increase computation per step. When \(\rho \ll 1\), communication can be fully overlapped.
Weak scaling is the fleet-scale paradigm: Engineers do not use more GPUs to solve the same problem faster (strong scaling); they use more GPUs to solve larger problems in reasonable time. This keeps the serial fraction small and utilization high.
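The failure figures in the takeaways can be reproduced with a short calculation. The sketch below assumes independent failures with a constant rate (an exponential model) and a per-GPU MTBF of 50,000 hours. That per-GPU value is an assumption inferred from the 366.2-minute cluster figure, not a number stated in this section.

```python
import math


def cluster_mtbf_hours(per_gpu_mtbf_hours: float, n_gpus: int) -> float:
    """Cluster MTBF scales as 1/N under independent, identical failure rates."""
    return per_gpu_mtbf_hours / n_gpus


def prob_at_least_one_failure(mtbf_hours: float, window_hours: float) -> float:
    """P(at least one failure in the window) under an exponential model."""
    return 1.0 - math.exp(-window_hours / mtbf_hours)


PER_GPU_MTBF_HOURS = 50_000  # assumed value, inferred from the cluster figure
mtbf = cluster_mtbf_hours(PER_GPU_MTBF_HOURS, 8192)

print(f"Cluster MTBF: {mtbf * 60:.1f} minutes")
print(f"P(failure in 24 h): {prob_at_least_one_failure(mtbf, 24):.2%}")
```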