Fleet Foundations
Purpose
What reference numbers and physical laws should every fleet-scale ML engineer carry into distributed system design decisions?
Designing a system for a single machine requires knowledge of memory hierarchy latencies, roofline ridge points, and precision trade-offs. The moment a training job spans two machines, however, a new set of numbers takes over. Network latency replaces cache latency as the dominant concern. Component failure rates compound from negligible to inevitable. Communication overhead erodes the scaling efficiency that justifies buying more accelerators in the first place. This appendix collects the reference numbers and compact models for fleet-scale reasoning: the three system paradigms that underpin this book, the numbers every fleet engineer should know across Compute, Communication, and Coordination, and the scaling physics and thermal constraints that govern cluster design. In C³ terms, these numbers define the physical scale on which compute capacity, communication bandwidth, and coordination overhead become measurable.
How to Use This Appendix
This appendix is designed as a reference. When a fleet-scale design question arises, use it to turn a vague symptom into a specific constraint and then choose the lever that can actually move.
- “How fast can I communicate between nodes?”: Start with the Communication Numbers in section 1.2 for the bandwidth and latency hierarchy.
- “How many GPUs do I actually need?”: Use the Scaling Physics in section 1.3 to understand why doubling GPUs does not halve training time.
- “How often will my cluster fail?”: Check the Coordination Numbers in section 1.2 for MTBF tables and failure probability calculations.
- “Is my cluster power-limited or compute-limited?”: See Thermal and Power Physics in section 1.4 for power density and cooling constraints.
- “What does a typical overhead budget look like?”: The Coordination Numbers include the four overhead categories that erode goodput.
The Three System Paradigms
Machine learning infrastructure at scale inherits design DNA from two distinct computing lineages—and then breaks the assumptions of both. Understanding where ML systems borrow from High-Performance Computing (HPC) and where they borrow from Warehouse-Scale Computing (WSC) is essential for choosing the right design trade-offs. A system architect who treats ML training as “just HPC” will build infrastructure that cannot tolerate failures. One who treats it as “just web services” will build infrastructure that cannot sustain the tight coupling that synchronous training demands.
High-performance computing
HPC systems descend from the supercomputer tradition. Their design philosophy is to maximize FLOP/s on tightly coupled simulations—weather modeling, molecular dynamics, nuclear physics. Every node matters: they use specialized interconnects (InfiniBand), low-latency fabrics, and homogeneous hardware. Fault tolerance follows the checkpoint/restart model: if one node fails, the entire job stops, rolls back to the last checkpoint, and restarts from scratch. Scheduling is batch-oriented (Slurm), with jobs requesting rigid resource shapes (“512 nodes for 24 hours”). Nodes are pets—individually important, individually tracked.
Warehouse-scale computing
WSC systems descend from the web services tradition. Their design philosophy is to maximize queries per second across loosely coupled services—search, email, social media. Hardware is commodity Ethernet, varying generations coexist, and nodes are heterogeneous. Fault tolerance follows the redundancy model: if one node fails, the load balancer reroutes traffic to another replica. The user never notices. Scheduling is dynamic (Kubernetes, Borg), with elastic bin-packing. Nodes are cattle—interchangeable and expendable.
The ML fleet: A hybrid architecture
ML systems require the computational throughput of HPC (to train massive models with synchronous gradient updates) but must operate at the scale and unreliability of WSC (thousands of accelerators running for weeks). This creates a hybrid that borrows selectively from both traditions.
Training workloads are synchronous and bandwidth-hungry like HPC, but long-running and failure-tolerant like WSC. Inference workloads are latency-sensitive like WSC, but computationally heavy like HPC. The network is a fusion: TCP/IP for control planes, InfiniBand or NVLink for data planes. Fault tolerance uses elastic training strategies: training jobs can shrink, expand, pause, or resume without full restarts. Scheduling combines gang allocation (all-or-nothing for training) with dynamic preemption and replacement.
Table 1 summarizes the design trade-offs across the three paradigms. The key insight is that ML fleets cannot simply adopt either HPC or WSC patterns wholesale; they must selectively combine elements from each based on the workload phase.
| Dimension | HPC (Supercomputer) | WSC (Web Cloud) | ML Fleet (AI Cluster) |
|---|---|---|---|
| Philosophy | Maximize FLOP/s | Maximize QPS | Maximize Model Quality per Dollar/Watt |
| Coupling | Tight (MPI) | Loose (RPC/HTTP) | Hybrid (NCCL + RPC) |
| State | Stateful (RAM) | Stateless (DB-backed) | Semi-Stateful (Checkpoints + KV Cache) |
| Network | Latency-optimized | Bandwidth-optimized | Bisection-bandwidth critical |
| Bottleneck | Compute throughput (FLOP/s) | I/O (Disk/Net) | Memory bandwidth (HBM) |
| Fault model | Checkpoint/Restart | Redundancy/Replicas | Elastic shrink/expand |
| Scheduling | Batch (Slurm) | Orchestration (K8s) | Gang + preemption |
| Node model | Pets (tracked) | Cattle (expendable) | Pets during job, cattle between jobs |
Foundations recap
The following provides a compact reference for the key foundational ideas that reappear throughout the distributed systems chapters.
- The iron law (\(T \approx D_{\text{vol}}/\text{BW} + O/(R_{\text{peak}} \cdot \eta_{\text{hw}}) + L_{\text{lat}}\)): Performance is bounded by data movement or compute. At fleet scale, the data movement term expands to include inter-node communication, not just memory bandwidth.
- Roofline Model: Distinguishes compute-bound from memory-bound workloads using arithmetic intensity. At fleet scale, a third ceiling appears: network-bound workloads whose performance is limited by inter-node bandwidth.
- Amdahl’s Law: Caps strong-scaling speedup at \(1/s\), where \(s\) is the serial fraction of the workload (Amdahl 1967). At fleet scale, the “serial fraction” includes not just sequential code but also synchronization barriers, collective communication, and pipeline bubbles.
- Training state rule: about 14 bytes per parameter for common mixed-precision Adam checkpoint state (BF16 weights, FP32 master weights, and Adam moments), before framework metadata and implementation-specific extras. At fleet scale, this determines how model state is partitioned across nodes (ZeRO, tensor parallelism, pipeline parallelism).
- Little’s Law (\(Q_{\text{req}} = \lambda_{\text{arr}} T_{\text{lat}}\)): Sizes inference infrastructure, where \(\lambda_{\text{arr}}\) is the request arrival rate, \(T_{\text{lat}}\) is the mean time a request spends in the system, and \(Q_{\text{req}}\) is the expected concurrent request count (Little 1961). At fleet scale, it determines how many serving replicas are needed behind a load balancer.
The transition from single-machine to fleet-scale reasoning requires extending these models with new dimensions: network topology, failure probability, and coordination overhead. The numbers in the next section provide the quantitative foundation for that extension.
Numbers Every Fleet Engineer Should Know
Just as single-machine analysis depends on a set of core numbers, fleet-scale engineering is governed by a set of predictable ratios and scaling behaviors. The single-machine numbers still apply within each node, but a new set of numbers governs the spaces between nodes. While absolute values evolve with hardware generations, the ratios between communication tiers and the scaling behavior of failure rates remain remarkably stable. Memorize the ratios and scaling trends; use the specific numbers as sanity checks.
Systems Perspective 1.1: Node-level numbers for fleet reasoning
Those node-level baselines make the fleet-level ratios easier to interpret.
Systems Perspective 1.2: Three fleet numbers that matter most
- Bandwidth gap: NVLink bandwidth within a node is ~9× faster than InfiniBand between nodes. This ratio determines where parallelism boundaries belong—model parallelism within a node, data parallelism across nodes.
- MTBF scaling: A cluster’s mean time between failures is the single-component MTBF divided by the number of components (\(1/N\) scaling). At 100,000 GPUs, expect a failure every 30 minutes.
- Effective utilization: After MFU, scaling efficiency, and overhead losses compound, a cluster with 1,024 GPUs delivers roughly 19.2 percent of its peak FLOP/s as useful training work.
Quick reference—table 2 condenses the fleet-scale numbers into one place. Use it for back-of-envelope checks; use the detailed tables in each subsection when designing or debugging.
| Category | Number | Use |
|---|---|---|
| Communication | NVLink ~9× IB NDR | Parallelism boundary (in-node vs. cross-node) |
| Communication | IB NDR ~50 GB/s, ~5 μs one-way | Inter-node bandwidth and latency |
| Compute | MFU 30–50%, \(\eta_{\text{scaling}}\) ~50% @ 1K, ~35% @ 8K | Effective FLOP/s and scaling sanity checks |
| Coordination | MTBF 8K: ~366.2 min; 100K: ~30 min | Failure expectation and checkpoint cadence |
| Coordination | 175B checkpoint ~2450 GB (14 B/param common Adam state) | Recovery and storage sizing |
| Coordination | Goodput ~77% after overheads | Wall-clock utilization |
| Power & sustainability | AI rack ~70 kW; air limit ~30 kW | Cooling feasibility (liquid required above air limit) |
| Power & sustainability | PUE liquid ~1.06, typical ~1.40; 10,000 H100s at 700 W ≈ 7 MW IT; facility load = IT load \(\times\) PUE | Facility load and carbon (see section 1.4) |
The invariants: Ratios that will not change
These relationships are governed by physics or architecture—they will still be true in 2035.
Network hierarchy ratio
The bandwidth gap between intra-node and inter-node communication is an architectural invariant. Chip-to-chip links (NVLink, ICI) connect through short, wide, dedicated paths on a shared substrate. Inter-node links (InfiniBand, Ethernet) must traverse cables, switches, and protocol stacks. This structural difference keeps intra-node bandwidth roughly an order of magnitude above inter-node bandwidth, generation after generation.
Currently, NVLink 4.0 provides about 9× the bandwidth of InfiniBand NDR on a like-for-like basis: 900 GB/s of bidirectional NVLink bandwidth per GPU is 450 GB/s in each direction, against 50 GB/s in each direction for a 400 Gbps NDR port. Even as both technologies improve, the ratio persists because both are constrained by the same physics: signaling rates, lane counts, and connector density. This ratio is the single most important number for parallelism strategy: any operation requiring more bandwidth than the inter-node link can provide must be confined within a single node.
Failure scaling law
For \(N\) independent components, each with mean time between failures \(\text{MTBF}_{\text{component}}\) under a constant-failure-rate assumption, equation 1 gives the system MTBF:
\[ \text{MTBF}_{\text{system}} = \frac{\text{MTBF}_{\text{component}}}{N} \tag{1}\]
Under those assumptions the relationship is exact rather than an approximation: doubling the cluster size halves the time between failures. At 100,000 GPUs with a per-GPU MTTF of 50,000 hours (used here as \(\text{MTBF}_{\text{component}}\)), the cluster experiences a GPU failure every 30 minutes on average. No fault tolerance strategy can avoid this; the question is how quickly the system recovers.
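One way to see where equation 1 comes from, as a sketch under the same independence and constant-failure-rate assumptions (the symbols \(\lambda\) and \(T_{\min}\) are introduced here for illustration only): each component's time to failure is exponential with rate \(\lambda = 1/\text{MTBF}_{\text{component}}\), and the cluster fails as soon as its first component does.

\[ \Pr(T_{\min} > t) = \prod_{i=1}^{N} e^{-\lambda t} = e^{-N\lambda t} \quad\Longrightarrow\quad \text{MTBF}_{\text{system}} = \frac{1}{N\lambda} = \frac{\text{MTBF}_{\text{component}}}{N} \]

The same survival function gives the probability of at least one failure in a window of length \(t\), namely \(1 - e^{-t/\text{MTBF}_{\text{system}}}\).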
AllReduce overhead scaling
The bandwidth-optimal Ring AllReduce algorithm transfers \(2(N-1)M/N\) bytes per participant, where \(M\) is the message size and \(N\) is the number of participants (Patarasuk and Yuan 2009). As \(N\) grows large, this approaches \(2M\) per GPU, so the per-GPU bandwidth term is nearly independent of the number of GPUs; aggregate cluster traffic still grows with the number of participants. This is why Ring AllReduce scales well in the bandwidth term. The latency term, however, grows as \(2(N-1) \times \alpha\), where \(\alpha\) is the per-step message latency, making latency the bottleneck for small messages on large rings. This trade-off motivates hierarchical AllReduce strategies that use Ring AllReduce within nodes and Tree AllReduce across nodes.
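The two terms can be put side by side in a minimal sketch of the ring cost model. The function names are hypothetical, and the inputs (a 14 GB message, standing in for BF16 gradients of a 7B-parameter model, plus the 50 GB/s and ~5 μs InfiniBand NDR figures from tables 3 and 4) are illustrative assumptions rather than measurements.

```python
def ring_allreduce_seconds(n, message_bytes, bandwidth_bytes_per_s, alpha_s):
    """Ring AllReduce time as a latency term plus a bandwidth term."""
    steps = 2 * (n - 1)                  # reduce-scatter steps + all-gather steps
    latency_term = steps * alpha_s       # grows linearly with ring size
    bandwidth_term = steps / n * message_bytes / bandwidth_bytes_per_s
    return latency_term + bandwidth_term


def crossover_message_bytes(n, bandwidth_bytes_per_s, alpha_s):
    """Message size at which the latency and bandwidth terms are equal."""
    # 2(N-1)*alpha == 2(N-1)/N * M/BW  =>  M == N * alpha * BW
    return n * alpha_s * bandwidth_bytes_per_s


IB_NDR_BYTES_PER_S = 50e9   # from table 3
IB_NDR_ALPHA_S = 5e-6       # from table 4
GRADIENT_BYTES = 14e9       # assumption: 7e9 parameters x 2 bytes (BF16)

t = ring_allreduce_seconds(256, GRADIENT_BYTES, IB_NDR_BYTES_PER_S, IB_NDR_ALPHA_S)
m_star = crossover_message_bytes(256, IB_NDR_BYTES_PER_S, IB_NDR_ALPHA_S)
print(f"256-GPU ring, 14 GB message: {t * 1e3:.1f} ms")
print(f"latency dominates below about {m_star / 1e6:.0f} MB")
```

Under these assumptions the 256-GPU ring takes roughly 560 ms, almost all of it bandwidth, while messages smaller than about 64 MB are dominated by the latency term. That is the regime where a hierarchical or tree-based collective tends to help.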
Communication numbers
Communication defines the boundaries of parallelism. These tables quantify the bandwidth and latency at each tier of the network hierarchy, from the fastest intra-node links to the slowest wide-area connections. The key question for any distributed ML operation is: which tier of the network hierarchy does this communication cross?
Table 3 shows the bandwidth available at each tier, which determines where tensor, pipeline, and data-parallel traffic should stay local. Note the order-of-magnitude drops as communication crosses node boundaries.
| Interconnect | Bandwidth | Typical Role |
|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | Tensor/pipeline parallelism within a node |
| TPU v5p ICI | 1,200 GB/s | Intra-pod model parallelism (Google) |
| PCIe Gen5 x16 | 64 GB/s | CPU-GPU data transfer, NIC attachment |
| IB GXDR (1.6 Tbps) | 200 GB/s | Next-gen inter-node (2026+) |
| IB XDR (800 Gbps) | 100 GB/s | Inter-node standard (2025+) |
| IB NDR (400 Gbps) | 50 GB/s | Current inter-node standard for AI clusters |
| IB HDR (200 Gbps) | 25 GB/s | Previous-gen inter-node |
| RoCE v2 (100 GbE) | 12.5 GB/s | Budget clusters, inference fleets |
Table 4 shows the one-way latency at each tier. For collective operations on small messages, latency—not bandwidth—is the bottleneck.
| Interconnect | One-Way Latency | Implication |
|---|---|---|
| InfiniBand NDR | ~5 μs | Low enough for synchronous AllReduce |
| InfiniBand HDR | ~7 μs | Adequate for most training topologies |
| RoCE v2 | ~10 μs | Acceptable for data parallelism |
| TCP/IP (Ethernet) | ~50 μs | Too slow for synchronous training |
| Cross-data-center | ~40,000 μs (40 ms) | Physics floor; async training only |
Compute numbers
Raw peak FLOP/s is a necessary but misleading metric for fleet capacity planning. Two multiplicative losses—Model FLOPs Utilization (MFU) and scaling efficiency—reduce effective throughput dramatically. Understanding these losses transforms fleet sizing from guesswork into engineering.
Model FLOPs Utilization (MFU) measures what fraction of peak FLOP/s a training workload actually achieves. Well-optimized large-model training on current hardware achieves 30–50 percent MFU. The gap comes from memory stalls, kernel launch overhead, pipeline bubbles, and suboptimal operator fusion. MFU below the low end of that range signals optimization opportunities; MFU above the high end indicates excellent hardware utilization.
Scaling efficiency (\(\eta_{\text{scaling}}\)) measures how much useful computation survives as accelerators are added. Table 5 shows the empirical ranges for well-optimized distributed training.
| Cluster Size | Scaling Efficiency (\(\eta_{\text{scaling}}\)) | Implication |
|---|---|---|
| 32 GPUs | ~90% | Near-linear scaling; communication is negligible |
| 256 GPUs | ~70% | Communication starts to erode throughput |
| 1,024 GPUs | ~50% | Significant overhead; optimization critical |
| 8,192 GPUs | ~35% | Fleet-scale regime; 65% of compute is overhead |
Coordination numbers
At fleet scale, coordination—failure recovery, checkpointing, and maintenance—consumes a measurable fraction of wall-clock time. These numbers quantify the costs of keeping a large cluster running.
Failure rates by cluster size
Table 6 translates per-GPU reliability into cluster-level failure cadence, the number checkpoint planning must absorb, using a per-GPU MTTF of 50,000 hours (~5.7 years). The failure probability column shows the likelihood of at least one GPU failure during a 24-hour training window.
| Cluster Size | MTBF (GPU-only) | Minutes | \(\Pr(\text{failure})\) in 24 hours |
|---|---|---|---|
| 256 GPUs | 195.3 h | 11,718.8 min | 11.6% |
| 2,048 GPUs | 24.4 h | 1,464.8 min | 62.6% |
| 8,192 GPUs | 6.1 h | 366.2 min | 98% |
| 100,000 GPUs | 0.50 h | 30 min | 100% |
The key takeaway: at 8,192 GPUs and above, failure is no longer an edge case. GPU-only MTBF is about 366.2 min at 8,192 GPUs, and the probability of at least one GPU failure within 24 hours is 98 percent. Fault tolerance is not optional at fleet scale; it is a prerequisite for completing any long-running training job. Fault Tolerance covers the mechanisms in detail.
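The entries in table 6 can be reproduced with a few lines of arithmetic. This is a sketch that assumes independent GPUs with exponentially distributed failure times, which is one reading of the constant-failure-rate assumption behind equation 1; the function names are hypothetical.

```python
import math


def cluster_mtbf_hours(n_gpus, gpu_mttf_hours=50_000):
    """Cluster MTBF under 1/N scaling (equation 1)."""
    return gpu_mttf_hours / n_gpus


def prob_failure(n_gpus, window_hours=24.0, gpu_mttf_hours=50_000):
    """Probability of at least one GPU failure within the window."""
    mtbf = cluster_mtbf_hours(n_gpus, gpu_mttf_hours)
    return 1.0 - math.exp(-window_hours / mtbf)


for n in (256, 2_048, 8_192, 100_000):
    mtbf = cluster_mtbf_hours(n)
    print(f"{n:>7,} GPUs  MTBF {mtbf:7.2f} h ({mtbf * 60:9.1f} min)  "
          f"P(failure in 24 h) {prob_failure(n):6.1%}")
```

Changing `window_hours` to the planned job duration gives a quick estimate of how likely a run is to see at least one GPU failure before it finishes.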
Checkpoint sizes
Checkpointing is the primary recovery mechanism, and its cost depends on the model size. Table 7 shows checkpoint sizes for common mixed-precision Adam training checkpoints (14 bytes per parameter: 2B for BF16 weights and 12B for FP32 master weights + momentum + variance). Gradients are normally transient and recomputed after restore rather than serialized as durable checkpoint state.
| Model Size | Checkpoint Size | Write Time @ 100 GB/s |
|---|---|---|
| 7B parameters | 98 GB | 0.98 s |
| 70B parameters | 980 GB | 9.8 s |
| 175B parameters | 2,450 GB | 24.5 s |
| 1T parameters | 14 TB | 2.3 min |
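The checkpoint sizes above follow directly from the 14 bytes-per-parameter rule. The sketch below recomputes them; the function names are hypothetical, and the 100 GB/s figure is the aggregate storage write bandwidth assumed in table 7, not a property of any particular file system.

```python
BYTES_PER_PARAM = 14  # 2 B BF16 weights + 12 B FP32 master weights and Adam moments


def checkpoint_bytes(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Durable checkpoint size, before framework metadata."""
    return n_params * bytes_per_param


def write_seconds(n_params, storage_bytes_per_s=100e9):
    """Time to write one checkpoint at a sustained storage bandwidth."""
    return checkpoint_bytes(n_params) / storage_bytes_per_s


for name, n_params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9), ("1T", 1e12)]:
    size_gb = checkpoint_bytes(n_params) / 1e9
    print(f"{name:>4}: {size_gb:>8,.0f} GB, {write_seconds(n_params):6.1f} s at 100 GB/s")
```

Passing a lower `storage_bytes_per_s` shows how quickly write time grows on slower storage, which is what drives the checkpointing share of the overhead budget.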
Overhead budgets
Failure recovery and checkpointing are only part of the wall-clock budget. At fleet scale, table 8 separates four recurring categories of overhead that consume time not spent on useful training:
| Overhead Category | Typical Budget | Lever |
|---|---|---|
| Pipeline bubbles | ~5% | Increase microbatches per pipeline stage |
| Checkpointing | ~3% | Async checkpointing, faster storage |
| Failure recovery | ~10% | Faster detection, elastic rescheduling |
| Maintenance windows | ~5% | Rolling upgrades, live migration |
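As a rough check, and assuming the four budgets in table 8 are additive fractions of wall-clock time (an assumption made here; overheads that overlap would cost less), the goodput figure quoted in table 2 follows from:

\[ \eta_{\text{goodput}} \approx 1 - (0.05 + 0.03 + 0.10 + 0.05) = 0.77 \]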
Power and sustainability numbers
Fleet-scale capacity planning and sustainability reporting require a few power numbers that every fleet engineer should know. At fleet scale, the critical numbers are rack power density, PUE, and the air-cooling limit—they determine where construction is feasible and what the facility load will be. Table 9 collects these reference values:
| Quantity | Typical value | Use |
|---|---|---|
| Traditional rack | 12 kW | Baseline for non-AI data centers |
| AI rack (current gen) | 70 kW | Liquid cooling required |
| AI rack (high-density) | 100 kW | Direct-to-chip liquid |
| Air cooling limit | ~30 kW per rack | Physics ceiling; above this, liquid is mandatory |
| PUE (liquid-cooled AI) | ~1.06 | Best case: facility load ≈ IT load |
| PUE (best air-cooled) | ~1.12 | Hyperscale best practice |
| PUE (industry average) | ~1.40 | Sanity check for cost/carbon |
| H100 TDP | 700 W per GPU | IT load: 10,000 \(\times\) 700 W = 7 MW |
Rule of thumb: IT load (MW) = (number of GPUs \(\times\) TDP per GPU)/\(10^6\); facility load = IT load \(\times\) PUE. A 10,000-GPU H100 cluster at 700 W each is 7 MW IT; at PUE 1.40 that is 9.8 MW facility draw. For carbon and cost, see section 1.4 and Sustainable AI.
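The rule of thumb is easy to turn into a small calculator. This is a sketch with hypothetical function names; like the rule itself, it counts accelerator TDP only and ignores CPUs, NICs, and storage.

```python
def it_load_mw(n_gpus, tdp_watts):
    """IT load in megawatts from accelerator count and per-device TDP."""
    return n_gpus * tdp_watts / 1e6


def facility_load_mw(n_gpus, tdp_watts, pue):
    """Facility load in megawatts: IT load scaled by PUE."""
    return it_load_mw(n_gpus, tdp_watts) * pue


def needs_liquid_cooling(rack_kw, air_limit_kw=30.0):
    """True when rack power exceeds the ~30 kW air-cooling limit from table 9."""
    return rack_kw > air_limit_kw


print(f"IT load:       {it_load_mw(10_000, 700):.1f} MW")
print(f"Facility load: {facility_load_mw(10_000, 700, 1.40):.1f} MW at PUE 1.40")
print(f"Facility load: {facility_load_mw(10_000, 700, 1.06):.2f} MW at PUE 1.06")
print(f"70 kW rack needs liquid cooling: {needs_liquid_cooling(70)}")
```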
Current hardware reference (c. 2025–2026)
These numbers (table 10) reflect the current generation of fleet-scale hardware. Use them for back-of-envelope calculations, but expect them to improve ~2\(\times\) every 2–3 years.
| Spec | NVIDIA B200 | AMD MI300X | Google TPU v6 |
|---|---|---|---|
| BF16/FP16 Peak | 2,250 TFLOP/s | 1,307 TFLOP/s | 918 TFLOP/s |
| Memory Bandwidth | 8 TB/s | 5.30 TB/s | 1.60 TB/s |
| HBM Capacity | 192 GB | 192 GB | 32 GB |
| Intra-Node Link | 1,800 GB/s (NVLink 5.0) | ~890 GB/s (Infinity Fabric) | ~2,000 GB/s (estimated) |
| TDP | 1,000 W | 750 W | ~600 W (estimated) |
For scale context, a DGX H100 SuperPOD contains 32 DGX H100 nodes (256 H100 GPUs). Meta announced two 24,576-H100 data-center-scale clusters built on the Grand Teton hardware platform. Google’s TPU v5p pods scale to 8,960 chips. The largest announced clusters (as of 2025) exceed 100,000 accelerators.
Scaling Physics
The numbers in the previous section describe what the hardware can do. Scaling physics describes what happens when more of it is brought to bear. Reasoning starts from the ceiling on a single accelerator—the roofline—and then asks what survives as devices are added, governed by three models: Amdahl’s Law extended to fleet overhead, the communication-computation ratio, and weak scaling behavior.
The single-accelerator roofline
Fleet performance rests on per-accelerator performance, and a single accelerator is bounded by one of two resources: the rate its arithmetic units sustain (peak FLOP/s) or the rate its memory delivers operands (HBM bandwidth). Which one binds a given kernel depends on its arithmetic intensity \(I\), the number of useful operations performed per byte moved from memory, as equation 2 defines:
\[ I = \frac{\text{FLOPs}}{\text{bytes moved}} \tag{2}\]
Intensity is a property of the workload, not the hardware. A large matrix multiply has high intensity because each weight loaded from memory is reused across many multiply-accumulates; an element-wise operation, or the generation of a single token, has low intensity because each loaded value feeds only one or two operations before the next must be fetched.
The roofline model (Williams et al. 2009) turns intensity into a performance ceiling. Equation 3 states that attainable throughput is the lesser of the compute ceiling and the bandwidth-limited slope:
\[ R_{\text{attain}} = \min\!\left(R_{\text{peak}},\; I \times \text{BW}\right) \tag{3}\]
Plotted against intensity, equation 3 traces a roof: a rising bandwidth-limited line on the left and a flat compute ceiling on the right. The two meet at the ridge point, the intensity at which a workload first saturates the arithmetic units, defined by equation 4:
\[ I_{\text{ridge}} = \frac{R_{\text{peak}}}{\text{BW}} \tag{4}\]
A workload with \(I < I_{\text{ridge}}\) is memory bound: it sits on the slope, and faster arithmetic units change nothing because the operands cannot arrive quickly enough. A workload with \(I > I_{\text{ridge}}\) is compute bound: it sits under the flat roof, limited by the arithmetic units themselves.
For an H100—BF16 peak 989 TFLOP/s, HBM bandwidth 3.35 TB/s—the ridge point is 295.2 FLOP/byte. A workload must perform hundreds of operations per byte fetched merely to keep the arithmetic units busy.
Autoregressive decoding falls far short of that line, which is why it is the defining serving bottleneck. Generating one token requires reading every model weight from HBM exactly once (16.1 GB for an 8B-parameter model in BF16) to perform about 16.1 GFLOP of matrix work, roughly 2 FLOPs per parameter, for an intensity of about 1 FLOP/byte. That sits far below the ridge point, so single-stream decode is purely bandwidth bound: throughput is the weight-read rate, about 208.6 tokens/s, no matter how much compute the accelerator advertises. Serving systems batch many requests precisely to raise this intensity, amortizing each weight read across every sequence in the batch and pushing decode back toward the compute roof.
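The roofline arithmetic for the H100 and the decode example can be sketched as follows. The function names are hypothetical, and the parameter count of 8.03 billion is an assumption chosen to match the 16.1 GB weight footprint quoted above.

```python
def ridge_point(peak_flops, mem_bw):
    """Arithmetic intensity (FLOP/byte) where the two roofline ceilings meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity, peak_flops, mem_bw):
    """Roofline ceiling: the lesser of compute peak and bandwidth-limited rate."""
    return min(peak_flops, intensity * mem_bw)


def decode_tokens_per_s(n_params, bytes_per_param, mem_bw):
    """Single-stream decode rate when every weight is read once per token."""
    return mem_bw / (n_params * bytes_per_param)


H100_PEAK = 989e12   # BF16 FLOP/s
H100_BW = 3.35e12    # HBM bytes/s
N_PARAMS = 8.03e9    # assumption: matches the 16.1 GB BF16 footprint

ridge = ridge_point(H100_PEAK, H100_BW)
tokens = decode_tokens_per_s(N_PARAMS, 2, H100_BW)
share = attainable_flops(1.0, H100_PEAK, H100_BW) / H100_PEAK
print(f"ridge point: {ridge:.1f} FLOP/byte")
print(f"single-stream decode: {tokens:.1f} tokens/s")
print(f"share of peak at 1 FLOP/byte: {share:.2%}")
```

If each batched sequence adds about 1 FLOP per weight byte read, a batch size near the ridge point (roughly 295 here) is what it would take to reach the compute roof. In practice KV-cache reads lower the effective intensity, so treat that figure as an upper-bound estimate.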
The roofline is a single-accelerator law, but its logic is what scales. At fleet level the binding bandwidth is no longer HBM but the interconnect, and the bytes moved are the gradients and activations crossing the network. That ratio of network bytes to local arithmetic is the communication-computation ratio (section 1.3.3): the roofline of the fleet, and the lens the rest of this section applies.
Consider a training job provisioned with 4,096 GPUs that achieves only 2.5× the throughput of a 1,024-GPU run. Quadrupling the hardware for a 2.5× gain is a relative scaling efficiency of 62.5 percent; if the 1,024-GPU run already sits near the ~50 percent efficiency shown in table 5, the larger run is at about 31 percent of ideal linear scaling. Scaling physics provides the diagnostic tools to determine whether the system is performing as physics allows or whether there is an engineering problem to fix.
Amdahl’s Law at fleet scale
Amdahl’s Law establishes that the maximum speedup of a parallel system is limited by its serial fraction \(s\). At fleet scale, the “serial fraction” is not just sequential code—it includes every operation that forces all \(N\) GPUs to wait:
- AllReduce synchronization: All GPUs must complete their gradient computation before any can proceed.
- Pipeline bubbles: The warmup and cooldown phases of pipeline parallelism leave stages idle.
- Checkpoint writes: Even asynchronous checkpoints contend for storage bandwidth.
- Python-level overhead: Single-threaded operations in the training loop (data loading, metric logging).
To see the fleet-scale implications, consider a training workload where 10 percent of wall-clock time is spent in synchronization, communication, and other serial overhead. Amdahl’s Law gives the following speedups:
- With 32 GPUs: 7.8× speedup (about 24 percent parallel efficiency)
- With 256 GPUs: 9.7× speedup (diminishing returns: 8× the GPUs for about 1.2× the speed)
- With 1,024 GPUs: 9.9× speedup (approaching the ceiling)
- With 8,192 GPUs: 10× speedup (nearly at the Amdahl limit)
- With \(N \to \infty\): capped at 10×
With just 10 percent serial overhead, no amount of hardware can deliver more than 10× speedup on this fixed workload. This is why fleet-scale training does not simply add more GPUs to the same problem—it scales the problem (weak scaling) to keep the serial fraction small relative to the total work.
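The speedups listed above come straight from Amdahl's formula. A minimal sketch, with a hypothetical function name, that recomputes them for a 10 percent serial fraction:

```python
def amdahl_speedup(n, serial_fraction):
    """Strong-scaling speedup on n workers for a fixed workload."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


SERIAL_FRACTION = 0.10

for n in (32, 256, 1_024, 8_192):
    speedup = amdahl_speedup(n, SERIAL_FRACTION)
    efficiency = speedup / n
    print(f"{n:>5,} GPUs: {speedup:5.2f}x speedup, {efficiency:6.2%} parallel efficiency")

print(f"ceiling: {1.0 / SERIAL_FRACTION:.0f}x")
```

Printing the parallel efficiency alongside the speedup makes the cost of the ceiling visible: each step up in cluster size buys almost no additional speed on the fixed workload.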
Systems Perspective 1.3: The compound overhead trap
The communication-computation ratio
The fundamental question for any distributed training strategy is: does the computation between synchronization points take long enough to hide the communication? Equation 5 defines the communication-computation ratio (\(\rho\)) that answers this directly:
\[ \rho = \frac{T_{\text{comm}}(N)}{T_{\text{compute}}/N} \tag{5}\]
When \(\rho < 1\), computation dominates and communication can be overlapped. When \(\rho > 1\), the system is communication-bound—GPUs spend more time waiting for data than computing on it.
Table 11 shows the ratio for three representative scenarios. The contrast between them reveals why parallelism strategy must match the workload.
| Scenario | \(T_{\text{comm}}(N)\) | \(T_{\text{compute}}/N\) | \(\rho\) |
|---|---|---|---|
| Data parallel, 7B model, 256 GPUs | 560.4 ms (AllReduce over IB NDR) | ~5 s (fwd+bwd) | 0.11 |
| Data parallel, 350M model, 256 GPUs | 29.6 ms (AllReduce over IB NDR) | ~10 ms (fwd+bwd) | 3 |
| Tensor parallel, 8 GPUs (NVLink) | ~0.04 ms (activation over NVLink) | ~1 ms (one layer) | 0.036 |
The 7B model on 256 GPUs achieves \(\rho =\) 0.11: communication takes about 11.2 percent as long as computation, which can be partially overlapped. The 350M model on the same cluster has \(\rho =\) 3: communication dominates, making this configuration communication-bound. The solution is either to use fewer GPUs (a smaller \(N\) raises the per-GPU compute time \(T_{\text{compute}}/N\)) or to increase the computation per step (larger batch size, gradient accumulation).
Tensor parallelism within a node achieves \(\rho =\) 0.036, confirming that NVLink bandwidth is sufficient to keep intra-node parallelism compute-bound. This is the quantitative reason why tensor parallelism is confined within nodes while data parallelism spans across them.
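Equation 5 can be applied to the three scenarios with a short sketch. The timings are the rounded values from table 11, so the tensor-parallel result comes out near 0.04 rather than the table's 0.036; the function names and the two-way classification are illustrative choices, not a standard API.

```python
def comm_compute_ratio(t_comm_s, t_compute_per_gpu_s):
    """Equation 5: communication time over per-GPU compute time."""
    return t_comm_s / t_compute_per_gpu_s


def regime(rho):
    """Classify a configuration; rho == 1 is treated as the boundary case."""
    return "communication-bound" if rho > 1.0 else "compute-dominated"


# (T_comm, T_compute / N) in seconds, rounded values from table 11
scenarios = {
    "data parallel, 7B, 256 GPUs": (0.5604, 5.0),
    "data parallel, 350M, 256 GPUs": (0.0296, 0.010),
    "tensor parallel, 8 GPUs (NVLink)": (0.00004, 0.001),
}

results = {}
for name, (t_comm, t_compute) in scenarios.items():
    rho = comm_compute_ratio(t_comm, t_compute)
    results[name] = (rho, regime(rho))
    print(f"{name:<34} rho = {rho:5.3f}  ({regime(rho)})")
```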
Weak scaling at fleet scale
Amdahl’s Law paints a pessimistic picture because it assumes a fixed problem size. In practice, fleet-scale training often follows weak scaling: the problem size (tokens, data, model parameters) grows proportionally with the number of GPUs. Gustafson’s Law captures this more optimistic view (Gustafson 1988).
The key insight for fleet-scale ML is that weak scaling is not just a mathematical convenience—it reflects reality. Engineers do not use 8,192 GPUs to train a 7B model faster; they use them to train a 70B or 700B model in reasonable time. As models and training batches grow, engineers often increase the compute performed between synchronization points. For dense Transformers, compute is commonly estimated as about 6\(\times\) parameters \(\times\) tokens, while gradient communication scales with parameter bytes, so the communication-computation ratio \(\rho\) depends on batch size, sequence length, accumulation, and parallelism layout rather than on parameter count alone.
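For comparison with the Amdahl example, Gustafson's scaled speedup can be written in its standard form, where \(s\) is the serial fraction measured on the \(N\)-GPU run (the symbol \(S_{\text{weak}}\) is introduced here for illustration):

\[ S_{\text{weak}}(N) = s + (1 - s)\,N \]

With \(s = 0.10\) and \(N = 8{,}192\), this gives \(0.10 + 0.90 \times 8{,}192 \approx 7{,}373\times\), or about 90 percent of linear scaling, against the 10× cap under strong scaling. This is an idealized bound: it assumes the serial fraction stays at 10 percent as the problem grows, which is exactly what the communication-computation ratio \(\rho\) is used to check.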
Systems Perspective 1.4: The compound loss of fleet utilization
\[R_{\text{eff}} = N R_{\text{peak}} \times \text{MFU} \times \eta_{\text{scaling}} \times \eta_{\text{goodput}}\]
\[= 1{,}012{,}736 \text{ TFLOP/s} \times 0.50 \times 0.50 \times 0.77 \approx 194{,}951.7 \text{ TFLOP/s}\]
The peak of 1,012,736 TFLOP/s corresponds to 1,024 GPUs at the H100's 989 TFLOP/s BF16 peak. The cluster delivers 19.2 percent of that peak as useful training work. The remaining 80.8 percent is lost to three compounding factors: 50 percent MFU, 50 percent scaling efficiency, and 77 percent goodput.
This is not a failure of engineering—it is the physics of fleet-scale computation. Every additional GPU adds less marginal useful work, but the total throughput still far exceeds what a smaller cluster could achieve. The goal is not to reach 100 percent utilization; the goal is to deliver trained models faster than any smaller configuration could.
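The compound loss is a single product, which makes it easy to explore how each factor moves the result. A minimal sketch, with hypothetical function names, using 1,024 GPUs at the H100's 989 TFLOP/s BF16 peak:

```python
def effective_tflops(n_gpus, peak_tflops_per_gpu, mfu, scaling_eff, goodput):
    """Useful training throughput after the three multiplicative losses."""
    return n_gpus * peak_tflops_per_gpu * mfu * scaling_eff * goodput


def utilization(mfu, scaling_eff, goodput):
    """Fraction of peak FLOP/s delivered as useful work."""
    return mfu * scaling_eff * goodput


r_eff = effective_tflops(1_024, 989, mfu=0.50, scaling_eff=0.50, goodput=0.77)
print(f"effective throughput: {r_eff:,.1f} TFLOP/s")
print(f"fraction of peak:     {utilization(0.50, 0.50, 0.77):.2%}")

# Raising any single factor by 10 points shows which lever is worth pulling.
for label, args in [("MFU 0.60", (0.60, 0.50, 0.77)),
                    ("scaling 0.60", (0.50, 0.60, 0.77)),
                    ("goodput 0.87", (0.50, 0.50, 0.87))]:
    print(f"{label:<13} -> {utilization(*args):.2%} of peak")
```

Because the factors multiply, a 10-point gain in MFU or scaling efficiency (from a base of 50 percent) is worth more than the same gain in goodput (from a base of 77 percent).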
Thermal and Power Physics
Compute performance ultimately converts to heat, and heat must be removed. At fleet scale, thermal and power constraints are not secondary concerns—they determine where construction is feasible, how densely accelerators can be packed, and what the operating costs will be. A cluster that is architecturally sound but thermally infeasible cannot be built.
A cluster design calling for 10,000 H100 GPUs at 700 W each—7 MW of IT load—and a PUE of 1.40 puts the facility load at 9.8 MW total. This is the electrical load of a small town. Before optimizing software, the physics of power delivery and heat removal must be feasible.
Power density wall
The shift from traditional data center workloads to AI training has created a power density crisis. Table 12 quantifies the gap.
| Configuration | Power per Rack | Cooling Implication |
|---|---|---|
| Traditional data center | 12 kW | Standard air cooling sufficient |
| AI cluster (current gen) | 70 kW | Liquid cooling required |
| AI cluster (high-density) | 100 kW | Direct-to-chip liquid cooling required |
| Air cooling limit | ~30 kW | Physics ceiling for forced-air convection |
The 5.8× increase in rack power density between traditional and AI workloads is not merely an engineering challenge—it represents a fundamental constraint on facility design. Existing data centers built for 12 kW racks cannot simply be “retrofitted” with AI hardware. The electrical distribution, cooling infrastructure, and floor loading must all be redesigned. This is why new AI-focused data centers are being built from the ground up with liquid cooling as the baseline assumption.
The energy hierarchy at scale
Power Usage Effectiveness (PUE) measures facility energy overhead: total facility energy divided by IT equipment energy (The Green Grid 2007; Uptime Institute 2022). A PUE of 1.0 would mean all facility energy is delivered to IT equipment; values above 1.0 represent overhead for cooling, power conversion, lighting, and other facility systems.
Table 13 compares PUE across data center generations and cooling technologies.
| Data Center Type | PUE | Overhead per MW of IT Load |
|---|---|---|
| Liquid-cooled AI data center | 1.06 | 60 kW cooling + infrastructure |
| Best-in-class air-cooled | 1.12 | 120 kW overhead |
| Industry average | 1.40 | 400 kW overhead |
| Legacy enterprise | 1.58 | 580 kW overhead |
For fleet-scale cost calculations, PUE directly multiplies the electricity bill. A 10 MW IT load in a legacy data center (PUE 1.58) requires 15.8 MW total, while the same load in a liquid-cooled facility (PUE 1.06) requires only 10.6 MW—a savings of 5.2 MW. At typical commercial electricity rates, this difference translates to millions of dollars per year. Sustainable AI covers the full sustainability implications.
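To put a rough number on that claim, assume continuous operation at full load and an electricity price of $80 per MWh ($0.08 per kWh). Both are illustrative assumptions, not figures from this appendix.

\[ 5.2\ \text{MW} \times 8{,}760\ \text{h/yr} \times \$80/\text{MWh} \approx \$3.6\ \text{million per year} \]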
Summary
Key Takeaways: Fleet scale changes every constraint
- Three paradigms, one hybrid: ML fleets combine HPC’s tight coupling (for training) with WSC’s elastic fault tolerance (for inference), creating a new architectural paradigm that cannot adopt either predecessor’s patterns wholesale.
- Bandwidth gap: NVLink provides ~9× more bandwidth than InfiniBand. This invariant ratio determines parallelism placement: tensor parallelism within nodes, data parallelism across nodes.
- Failures become routine at scale: Cluster MTBF scales as \(1/N\). At 8,192 GPUs, expect a GPU failure about every 366.2 minutes, with 98.04 percent probability over 24 hours. Fault tolerance is not optional—it is a prerequisite for completing any fleet-scale training job.
- Compound utilization loss: After MFU (~50 percent), scaling efficiency (~50 percent), and goodput after operational overhead (~77 percent) compound, a 1,024-GPU cluster delivers ~19.2 percent of peak FLOP/s as useful work.
- Power density demands liquid cooling: AI racks consume 5.8× the power of traditional racks. Air cooling fails above ~30 kW per rack, making liquid cooling a physical requirement for modern AI clusters.
- Communication-computation ratio (\(\rho\)) governs scaling strategy: When \(\rho > 1\), GPUs are idle waiting for data—reduce parallelism or increase computation per step. When \(\rho \ll 1\), communication can be fully overlapped.
- Weak scaling is the fleet-scale paradigm: Engineers do not use more GPUs to solve the same problem faster (strong scaling); they use more GPUs to solve larger problems in reasonable time. This keeps the serial fraction small and utilization high.