Communication
Purpose
What communication primitives bind the fleet together, and how do we predict their cost before running a single experiment?
Distributed ML training spends a surprising fraction of wall time not computing, but communicating. A large distributed training run may dedicate a substantial fraction of every iteration to synchronizing gradients, redistributing activations, or shuffling expert tokens. When that communication overhead is the bottleneck, no amount of faster arithmetic will help. The question shifts from “how many FLOPs?” to “how many bytes, across what topology, at what latency?” This appendix collects the communication cost models that answer those questions quantitatively: startup-latency and bandwidth models, the cost of every collective operation used in distributed training, pipeline bubble overhead, and the economics of gradient compression. These models support the distributed strategies, collective algorithms, fault tolerance mechanisms, and performance analysis used throughout the volume. In C³ terms, the appendix develops the communication axis, the cost of moving bytes across a fleet.
How to Use This Appendix
This appendix is designed as a reference. Reach for it when translating a distributed training symptom (“AllReduce is slow,” “pipeline bubbles are killing throughput,” “should we compress gradients?”) into a quantitative diagnosis.
Conventions used here follow the book-wide notation. We use \(\alpha\) for per-message startup latency, \(\beta\) for link bandwidth in bytes per second, \(n\) for point-to-point message size, \(n^*\) for the \(\alpha\)-\(\beta\) crossover point, \(N\) for the number of participating GPUs, and \(M\) for collective payload size in bytes.
- When AllReduce is slow: Use section 1.2 to compute the expected time and compare ring vs. tree costs.
- When choosing ring vs. tree: Use section 1.3 and the crossover analysis to match algorithm to message size.
- When pipeline bubbles dominate: Use section 1.4 to compute bubble fraction and determine the required microbatch count.
- When gradient compression is proposed: Use section 1.5 to run the break-even calculation before committing engineering effort.
The \(\alpha\)-\(\beta\) Communication Model
Systems Perspective 1.1: Why this matters
Every point-to-point transfer on a network follows the same fundamental pattern: the sender pays a fixed startup cost to initiate the message, then transmits the payload at a rate determined by the link’s bandwidth. This two-parameter model captures the essential physics of network communication.
The \(\alpha\)-\(\beta\) model, given by equation 1, predicts the time to transfer a point-to-point message of \(n\) bytes:
\[ T(n) = \alpha + \frac{n}{\beta} \tag{1}\]
where:
- \(\alpha\) is the startup latency (seconds)—the fixed cost of initiating a message, including software overhead, NIC processing, and routing setup
- \(\beta\) is the link bandwidth (bytes/s)—the sustained data rate after startup
- \(n\) is the message size (bytes)
The model decomposes every transfer into exactly two costs: a per-message tax (\(\alpha\)) paid regardless of size, and a per-byte tax (\(n/\beta\)) that scales linearly with the payload. This decomposition drives all the algorithm selection decisions in section 1.3.
Table 1 lists the \(\alpha\) and \(\beta\) values for interconnects commonly found in ML training clusters. These numbers are the starting point for every communication cost estimate in this appendix.
| Interconnect | \(\alpha\) (Latency) | \(\beta\) (Bandwidth) | Typical Role |
|---|---|---|---|
| NVLink 4.0 (H100) | 500 ns | 900 GB/s | Intra-node GPU-to-GPU |
| PCIe Gen5 | 1000 ns | 64 GB/s | CPU-GPU, NIC-GPU |
| InfiniBand NDR (400 Gbps) | 5 μs | 50 GB/s | Inter-node (high-end clusters) |
| InfiniBand HDR (200 Gbps) | 7 μs | 25 GB/s | Inter-node (previous generation) |
| RoCE v2 (100 GbE) | 10 μs | 12.5 GB/s | Inter-node (Ethernet clusters) |
| TCP/IP (Ethernet) | 50 μs | Varies | Control plane, non-RDMA fallback |
Latency-dominated vs. bandwidth-dominated regimes
The \(\alpha\)-\(\beta\) model reveals two distinct operating regimes. For small messages, the startup latency \(\alpha\) dominates—the link spends most of its time starting transfers, not sustaining them. For large messages, the bandwidth term \(n/\beta\) dominates, and the startup cost becomes negligible.
The crossover message size \(n^*\), defined by equation 2, is the point where both terms contribute equally:
\[ n^* = \alpha \times \beta \tag{2}\]
Below \(n^*\), communication is latency-dominated: sending more small messages wastes time on repeated startups. Above \(n^*\), communication is bandwidth-dominated: the link is fully utilized, and the only way to go faster is higher bandwidth.
Napkin Math 1.1: Worked example: Crossover on InfiniBand NDR
Question: At what message size does the bandwidth term equal the latency term?
Math:
\(n^* = \alpha \times \beta =\) 5 μs \(\times\) 50 GB/s \(=\) 250 KB
Analysis: Messages smaller than 250 KB are latency-dominated on IB NDR. Gradient tensors from a single layer of a large model are typically 10–100 MB—well above this threshold. Control messages, heartbeats, and barrier synchronizations, however, are well below it.
Systems insight: This is why NCCL and Gloo batch small gradient tensors into larger “buckets” before launching AllReduce. Sending each tensor individually would pay the 5 μs startup cost hundreds of times per iteration instead of a handful.
The LogGP model extension
The \(\alpha\)-\(\beta\) model assumes a message is a single, atomic transfer. In practice, networks pipeline messages: a sender can inject a new packet before the previous one has been fully received. The LogGP model (Alexandrov et al. 1995) extends LogP with a long-message gap parameter:
- \(g\) (gap): The minimum interval between consecutive message injections at the sender. This models the NIC’s injection rate limit.
- \(o_{\text{send}}\), \(o_{\text{recv}}\) (overhead): The CPU time consumed by the processor to initiate or complete a message, during which it cannot do useful compute.
For a long message of \(n\) bytes, LogGP models transfer time as approximately: \[T_{\text{long}}(n) \approx o_{\text{send}} + L_{\text{LogGP}} + (n - 1)G + o_{\text{recv}}\]
where \(L_{\text{LogGP}}\) is the network latency, \(o_{\text{send}}\) and \(o_{\text{recv}}\) are sender and receiver overheads, \(G\) is the gap per byte for long messages, and \(g\) controls the minimum interval between consecutive message injections. Informal alpha-beta-style approximations can be useful for ML profiling, but the standard LogGP model keeps these gap parameters distinct from a simple bandwidth term.
In most ML training scenarios, the \(\alpha\)-\(\beta\) model is sufficient because gradient tensors are large enough that pipelining effects are secondary to raw bandwidth. The LogGP model becomes important when analyzing the overhead of many small control messages or when modeling the interaction between communication and computation overlap—situations that arise in fine-grained pipeline parallelism and asynchronous gradient updates.
The \(\alpha\)-\(\beta\) model describes point-to-point transfers. Collective operations—where all GPUs participate—compose multiple point-to-point transfers, and their cost formulas derive directly from the model above.
Collective Operation Complexity
Systems Perspective 1.2: Why this matters
Collective operations move data between all \(N\) GPUs simultaneously. Each collective has a characteristic bandwidth term (how much data flows) and latency term (how many sequential message steps are required). The formulas below use the bandwidth-optimal ring algorithm unless otherwise noted; section 1.3 discusses when other algorithms are preferable.
AllReduce
AllReduce is the workhorse of data-parallel training: every GPU starts with a local gradient tensor of size \(M\) bytes and ends with the globally reduced (summed) result. The ring algorithm decomposes AllReduce into a ReduceScatter phase followed by an AllGather phase. Ring AllReduce time is given by equation 3:
\[ T_{\text{ring}} = 2 \cdot \frac{N - 1}{N} \cdot \frac{M}{\beta} + 2(N - 1) \cdot \alpha \tag{3}\]
The factor \(2(N-1)/N\) in the bandwidth term approaches 2 as \(N\) grows—each byte effectively traverses the ring twice (once for reduce-scatter, once for all-gather). The latency term grows linearly with \(N\), which makes the ring algorithm expensive in latency for large GPU counts.
For comparison, tree AllReduce has the cost shown in equation 4:
\[ T_{\text{tree}} = 2 \log_2 N \cdot \frac{M}{\beta} + 2 \log_2 N \cdot \alpha \tag{4}\]
The tree algorithm trades bandwidth efficiency for latency efficiency: the latency term grows as \(\log_2 N\) instead of \(N\), but the bandwidth term sends the full message \(M\) at each tree level instead of \(M/N\) chunks.
Napkin Math 1.2: Worked example: Ring vs. tree AllReduce
Question: How long does AllReduce take with ring vs. tree?
Ring AllReduce:
\(T_{\text{ring}} \approx\) 42.4 ms
Tree AllReduce:
\(T_{\text{tree}} \approx\) 320.1 ms
Analysis: For this 1 GB message, ring AllReduce takes 42.4 ms, and tree takes 320.1 ms—the ring is significantly faster because it distributes the bandwidth load across all links. The tree’s logarithmic latency advantage is negligible compared to its bandwidth penalty for large messages.
Systems insight: For the large gradient tensors typical of data-parallel training, ring (or ring-based) algorithms dominate. Tree algorithms are useful only for small messages or when latency—not bandwidth—is the bottleneck.
AllGather
AllGather is the complement of ReduceScatter: each GPU starts with a \(M/N\)-sized shard and ends with the complete \(M\)-byte tensor. It is the core communication primitive in Fully Sharded Data Parallelism (FSDP) for reconstructing parameters before each forward pass, and in tensor parallelism for reassembling split activations.
Ring AllGather, given by equation 5:
\[ T_{\text{allgather}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{5}\]
AllGather is exactly half the cost of AllReduce (one ring pass instead of two) because it does not include a reduction step.
ReduceScatter
ReduceScatter is the dual of AllGather: each GPU starts with a full \(M\)-byte tensor and ends with a \(M/N\)-sized shard that contains the globally reduced values for that shard. It is the gradient synchronization primitive in FSDP and ZeRO, replacing the full AllReduce with a cheaper operation when each GPU only needs its own parameter shard’s gradient.
Ring ReduceScatter, given by equation 6:
\[ T_{\text{reducescatter}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{6}\]
The symmetry is not a coincidence: ring AllReduce decomposes into one ReduceScatter followed by one AllGather, and the costs add up exactly to equation 3.
AllToAll
AllToAll is the most general collective: each GPU sends a distinct \(M/N\)-sized chunk to every other GPU, and receives a distinct chunk from each. It is the routing primitive for Mixture-of-Experts (MoE) architectures, where tokens must be dispatched to the correct expert and the results gathered back.
Ring AllToAll, given by equation 7:
\[ T_{\text{alltoall}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{7}\]
Although the formula matches AllGather and ReduceScatter in form, the communication pattern is fundamentally different: AllToAll is a personalized exchange (each GPU sends different data to each peer), which makes it harder to overlap with computation and more sensitive to network congestion.
Broadcast
Broadcast sends a single \(M\)-byte tensor from one root GPU to all \(N - 1\) others. It is used for distributing initial model weights, updated learning rates, and configuration changes during training.
Tree Broadcast, given by equation 8:
\[ T_{\text{broadcast}} = \log_2 N \left(\alpha + \frac{M}{\beta}\right) \tag{8}\]
Broadcast is inherently asymmetric (one sender, many receivers), which makes a tree topology natural. In an unsegmented tree broadcast, the full \(M\)-byte payload is forwarded at each tree level, so both the latency and bandwidth terms appear along the \(\log_2 N\)-level critical path. Segmented or scatter-allgather broadcast variants can reduce the bandwidth term, but they require a different algorithm-specific model.
Collective complexity summary
With the individual formulas in place, table 2 provides a compact reference for comparing the cost structure of each collective operation.
| Operation | Bandwidth Term | Latency Term | Primary Use Case |
|---|---|---|---|
| AllReduce (Ring) | \(2 \cdot \frac{N-1}{N} \cdot \frac{M}{\beta}\) | \(2(N-1) \cdot \alpha\) | Data-parallel gradient sync |
| AllReduce (Tree) | \(2\log_2 N \cdot \frac{M}{\beta}\) | \(2\log_2 N \cdot \alpha\) | Small-message sync |
| AllGather (Ring) | \(\frac{N-1}{N} \cdot \frac{M}{\beta}\) | \((N-1) \cdot \alpha\) | FSDP parameter reconstruction |
| ReduceScatter (Ring) | \(\frac{N-1}{N} \cdot \frac{M}{\beta}\) | \((N-1) \cdot \alpha\) | FSDP/ZeRO gradient sharding |
| AllToAll (Ring) | \(\frac{N-1}{N} \cdot \frac{M}{\beta}\) | \((N-1) \cdot \alpha\) | MoE expert routing |
| Broadcast (Tree) | \(\log_2 N \cdot \frac{M}{\beta}\) | \(\log_2 N \cdot \alpha\) | Weight distribution, config sync |
Two patterns emerge from this table. First, bandwidth-optimal ring algorithms all share the \((N-1)/N\) factor, which approaches 1 for large \(N\). Per-rank communication therefore approaches a constant multiple of \(M\): about \(2M\) for AllReduce and about \(M\) for AllGather, ReduceScatter, or AllToAll. Aggregate traffic still scales with the number of ranks. Second, the latency term grows linearly with \(N\) for ring algorithms but logarithmically for tree algorithms, creating the algorithm selection trade-off analyzed next.
Algorithm Selection
Systems Perspective 1.3: Why this matters
Ring vs. tree vs. recursive halving-doubling
The ring algorithm minimizes bandwidth usage but pays a latency cost that grows linearly with \(N\). The tree algorithm minimizes latency but wastes bandwidth because it sends the full message at each level. Recursive halving-doubling splits the difference: it uses halving (like a tree) for the reduce phase and doubling for the gather phase, achieving near-optimal bandwidth with \(\mathcal{O}(\log N)\) latency steps.
The decision boundary between ring and tree depends on message size:
- \(M \gg n^*\) (large gradients): Use ring. The bandwidth term dominates, and ring’s \((N-1)/N\) factor is optimal.
- \(M \ll n^*\) (barriers, scalars): Use tree. The latency term dominates, and tree’s \(\log_2 N\) steps are far fewer than ring’s \(2(N-1)\).
- \(M \approx n^*\) (medium tensors): Use recursive halving-doubling for balanced performance, or test empirically.
Table 3 provides guidance for choosing the right algorithm.
| Regime | Ring | Tree | Recursive Halving-Doubling |
|---|---|---|---|
| Large messages (\(M > 10 \times n^*\)) | Best—optimal bandwidth utilization | Poor—sends \(M\) at every tree level | Good—near-optimal bandwidth, \(\mathcal{O}(\log N)\) latency |
| Small messages (\(M < n^*/10\)) | Poor—\(2(N-1)\) startup costs dominate | Best—\(\log_2 N\) startup costs | Good—\(\mathcal{O}(\log N)\) startup costs |
| Medium messages (\(M \approx n^*\)) | Acceptable | Acceptable | Best—balances both terms |
| Power-of-2 \(N\) | Works for any \(N\) | Best when \(N = 2^k\) | Requires \(N = 2^k\) |
Hierarchical AllReduce
Modern clusters have a two-level topology: GPUs within a node are connected by NVLink (900 GB/s), while nodes are connected by InfiniBand (50 GB/s). Hierarchical AllReduce exploits this asymmetry in three phases:
- Intra-node reduce (NVLink): Each node reduces its 8 local GPUs to a single result using NVLink’s 900 GB/s bandwidth. This is fast because NVLink’s per-direction rate is 9× that of an InfiniBand NDR port.
- Inter-node AllReduce (InfiniBand): The per-node results are all-reduced across nodes. Only one representative per 8 local GPUs participates, so the inter-node ring has one-eighth as many participants as a flat GPU-level ring.
- Intra-node broadcast (NVLink): Each node broadcasts the global result to its 8 local GPUs.
The hierarchical approach reduces the inter-node ring size from all GPUs to one representative per 8 local GPUs, cutting the latency-step term by about 8\(\times\) and confining the bandwidth-hungry phases to the fast intra-node links. NCCL selects among topology-aware algorithms such as Ring, Tree, CollNet variants, NVLS/NVLSTree, and PAT based on the collective, message size, topology, architecture, and version; hierarchical communication is common on multi-node systems but is not a universal default.
Understanding the cost structure of collectives enables reasoning about data-parallel overhead. Pipeline parallelism, however, introduces a different kind of overhead: idle time caused by the sequential dependency between pipeline stages. The next section quantifies that cost.
Pipeline Arithmetic
Systems Perspective 1.4: Why this matters
Bubble fraction
In a pipeline with \(p\) stages and \(m\) microbatches, equation 9 defines the bubble fraction as the proportion of total GPU-time spent idle:
\[ b_{\text{pipe}} = \frac{p - 1}{p - 1 + m} \tag{9}\]
This formula applies to both GPipe (where all forward passes complete before any backward pass begins) and 1F1B (one-forward-one-backward interleaving). The 1F1B schedule has the same asymptotic bubble fraction but uses less peak memory because it retires microbatches earlier.
The key insight: the bubble fraction depends on the ratio of stages to microbatches. Adding more microbatches (for a fixed number of stages) amortizes the pipeline fill and drain across more useful work.
Rule of thumb: Keeping bubble overhead at or below 20 percent requires \(m \geq 4(p - 1)\); strictly below 20 percent requires \(m > 4(p - 1)\). The practical rule \(m \geq 4p\) is a convenient conservative target.
Table 4 shows how bubble fraction varies with pipeline depth and microbatch count.
| Pipeline Stages (\(p\)) | \(m\) = 8 | \(m\) = 16 | \(m\) = 32 | \(m\) = 64 |
|---|---|---|---|---|
| \(p\) = 4 | 27.3% | 15.8% | 8.6% | 4.5% |
| \(p\) = 8 | 46.7% | 30.4% | 17.9% | 9.9% |
Interleaved scheduling
Interleaved scheduling (used in Megatron-LM) assigns \(V\) virtual stages to each physical GPU instead of one. Each GPU processes \(V\) nonconsecutive chunks of the model, which means the pipeline has \(p \times V\) virtual stages but the bubble overhead uses the original \(p\), as equation 10 shows:
\[ b_{\text{interleaved}} = \frac{p - 1}{p - 1 + m \times V} \tag{10}\]
The microbatch term in the denominator is multiplied by \(V\) compared to equation 9, so the bubble reduction approaches \(V\) when \(m\) dominates \(p - 1\). For \(p\) = 8, \(m\) = 16, and \(V\) = 4, the bubble fraction drops from 30.4 percent to 9.9 percent, a 3.1× reduction that comes at the cost of more frequent, smaller communication between stages.
Interleaved scheduling highlights a recurring theme in distributed ML: communication granularity is a tunable knob. Smaller, more frequent transfers improve overlap and reduce idle time, but increase the aggregate latency overhead. The same trade-off appears in gradient compression, which is the final cost model in this appendix.
Compression Economics
Systems Perspective 1.5: Why this matters
The compression equation
Gradient compression replaces the original \(M\)-byte gradient with a compressed representation of size \(M/r_{\text{comp}}\), where \(r_{\text{comp}}\) is the compression ratio. Equation 11 expresses the total time with compression:
\[ T_{\text{compressed}} = T_{\text{encode}} + T_{\text{transfer}}\!\left(\frac{M}{r_{\text{comp}}}\right) + T_{\text{decode}} \tag{11}\]
Compression is beneficial when \(T_{\text{compressed}} < T_{\text{transfer}}(M)\). Rearranging this inequality yields the break-even condition in equation 12:
\[ T_{\text{encode}} + T_{\text{decode}} < T_{\text{transfer}}(M) - T_{\text{transfer}}\!\left(\frac{M}{r_{\text{comp}}}\right) \tag{12}\]
In the bandwidth-dominated regime (\(M \gg n^*\)), the communication time scales linearly with message size, and the right-hand side simplifies to approximately \(T_{\text{transfer}}(M) \times (1 - 1/r_{\text{comp}})\). For high compression ratios (\(r_{\text{comp}} \gg 1\)), this approaches \(T_{\text{transfer}}(M)\)—meaning compression is worthwhile as long as the codec overhead is less than the uncompressed communication time.
Napkin Math 1.3: Worked example: Is top-k compression worth it?
Question: Does compression reduce the total AllReduce time?
Math:
- Uncompressed AllReduce: 42.4 ms
- Compressed communication: 3.2 ms
- Total with compression: 5.2 ms
- Speedup: 8.2×
Result: Compression delivers an 8.2× speedup in communication time. The 2 ms codec overhead is a small fraction of the 42.4 ms uncompressed baseline.
Systems insight: At this scale (256 GPUs, 1 GB gradients), gradient compression is clearly profitable. However, the convergence impact must be validated: Top-k sparsification discards information, and the model may require more iterations to reach the same accuracy, potentially negating the per-iteration speedup.
Table 5 shows how the break-even codec overhead varies with compression ratio and uncompressed AllReduce time. Higher compression ratios tolerate larger codec overheads.
| Compression Ratio (\(r_{\text{comp}}\)) | AllReduce = 10 ms (Max Codec Overhead) | AllReduce = 40 ms (Max Codec Overhead) | AllReduce = 100 ms (Max Codec Overhead) |
|---|---|---|---|
| 4\(\times\) | 7.5 | 30 | 75 |
| 16\(\times\) | 9.4 | 37.5 | 93.8 |
| 64\(\times\) | 9.8 | 39.4 | 98.4 |
| 256\(\times\) | 9.96 | 39.8 | 99.6 |
Summary
Key Takeaways: Communication costs are predictable
- The \(\alpha\)-\(\beta\) model: The model (\(T(n) = \alpha + n/\beta\)) decomposes every network transfer into a fixed startup cost and a payload-proportional bandwidth cost. The crossover message size \(n^* = \alpha \times \beta\) determines which cost dominates.
- Ring-based collectives: AllReduce, AllGather, ReduceScatter, and AllToAll achieve bandwidth-optimal communication with a \((N-1)/N\) factor, but their latency term grows linearly with GPU count. Tree-based algorithms trade bandwidth for logarithmic latency scaling.
- Algorithm selection: Message size relative to \(n^*\) determines the algorithm: ring for large messages (gradients), tree for small messages (barriers), and hierarchical two-level schemes for multi-node clusters.
- Pipeline bubble fraction: \(b_{\text{pipe}} = (p-1)/(p-1+m)\) is the fundamental overhead of pipeline parallelism. The practical rule \(m \geq 4p\) keeps bubbles below 20 percent, and interleaved scheduling with \(V\) virtual stages makes the bubble reduction approach \(V\) when the microbatch term dominates.
- Gradient compression: Gradient compression is profitable when the codec overhead (encode + decode) is less than the communication time saved. At high compression ratios in the bandwidth-dominated regime, nearly the entire uncompressed AllReduce time is available as overhead budget—but convergence impact must always be validated.
- Four parameters predict communication cost: The point-to-point and collective communication models in this appendix derive from two hardware parameters (\(\alpha\) and \(\beta\)) and two workload parameters (message payload \(n\) or \(M\), plus participant count \(N\) for collectives). Measuring these quantities for a cluster yields a complete first-order prediction of distributed training overhead, with pipeline and compression sections introducing additional parameters as noted earlier.