Communication

Purpose

What communication primitives bind the fleet together, and how do we predict their cost before running a single experiment?

Distributed ML training spends a surprising fraction of wall time not computing, but communicating. A large distributed training run may dedicate a substantial fraction of every iteration to synchronizing gradients, redistributing activations, or shuffling expert tokens. When that communication overhead is the bottleneck, no amount of faster arithmetic will help. The question shifts from “how many FLOPs?” to “how many bytes, across what topology, at what latency?” This appendix collects the communication cost models that answer those questions quantitatively: startup-latency and bandwidth models, the cost of every collective operation used in distributed training, pipeline bubble overhead, and the economics of gradient compression. These models support the distributed strategies, collective algorithms, fault tolerance mechanisms, and performance analysis used throughout the volume. In C³ terms, the appendix develops the communication axis, the cost of moving bytes across a fleet.

How to Use This Appendix

This appendix is designed as a reference. Reach for it when translating a distributed training symptom (“AllReduce is slow,” “pipeline bubbles are killing throughput,” “should we compress gradients?”) into a quantitative diagnosis.

Conventions used here follow the book-wide notation. We use \(\alpha\) for per-message startup latency, \(\beta\) for link bandwidth in bytes per second, \(n\) for point-to-point message size, \(n^*\) for the \(\alpha\)-\(\beta\) crossover point, \(N\) for the number of participating GPUs, and \(M\) for collective payload size in bytes.

When AllReduce is slow: Use section 1.2 to compute the expected time and compare ring vs. tree costs.
When choosing ring vs. tree: Use section 1.3 and the crossover analysis to match algorithm to message size.
When pipeline bubbles dominate: Use section 1.4 to compute bubble fraction and determine the required microbatch count.
When gradient compression is proposed: Use section 1.5 to run the break-even calculation before committing engineering effort.

The \(\alpha\)-\(\beta\) Communication Model

Systems Perspective 1.1: Why this matters

Estimating how long a gradient synchronization will take across 256 GPUs must happen before committing to a cluster topology. The \(\alpha\)-\(\beta\) model is the “Roofline for networks”: it gives a first-order prediction of transfer time from just two hardware parameters and the message size.

Every point-to-point transfer on a network follows the same fundamental pattern: the sender pays a fixed startup cost to initiate the message, then transmits the payload at a rate determined by the link’s bandwidth. This two-parameter model captures the essential physics of network communication.

The \(\alpha\)-\(\beta\) model, given by equation 1, predicts the time to transfer a point-to-point message of \(n\) bytes:

\[ T(n) = \alpha + \frac{n}{\beta} \tag{1}\]

where:

\(\alpha\) is the startup latency (seconds)—the fixed cost of initiating a message, including software overhead, NIC processing, and routing setup
\(\beta\) is the link bandwidth (bytes/s)—the sustained data rate after startup
\(n\) is the message size (bytes)

The model decomposes every transfer into exactly two costs: a per-message tax (\(\alpha\)) paid regardless of size, and a per-byte tax (\(n/\beta\)) that scales linearly with the payload. This decomposition drives all the algorithm selection decisions in section 1.3.

Table 1 lists the \(\alpha\) and \(\beta\) values for interconnects commonly found in ML training clusters. These numbers are the starting point for every communication cost estimate in this appendix.

Table 1: \(\alpha\)-\(\beta\) Parameters by Interconnect: Startup latency and sustained bandwidth for interconnects used in ML clusters. The \(\beta\) column uses one-way or per-direction bandwidth, so the H100 NVLink entry is half of the 900 GB/s bidirectional aggregate specification. NVLink and PCIe operate within a node; InfiniBand and RoCE operate between nodes. TCP latency is dominated by kernel software stack overhead.

Interconnect	\(\alpha\) (Latency)	\(\beta\) (Bandwidth)	Typical Role
NVLink 4.0 (H100)	500 ns	450 GB/s	Intra-node GPU-to-GPU
PCIe Gen5	1000 ns	64 GB/s	CPU-GPU, NIC-GPU
InfiniBand NDR (400 Gbps)	5 μs	50 GB/s	Inter-node (high-end clusters)
InfiniBand HDR (200 Gbps)	7 μs	25 GB/s	Inter-node (previous generation)
RoCE v2 (100 GbE)	10 μs	12.5 GB/s	Inter-node (Ethernet clusters)
TCP/IP (Ethernet)	50 μs	Varies	Control plane, non-RDMA fallback

Latency-dominated vs. bandwidth-dominated regimes

The \(\alpha\)-\(\beta\) model reveals two distinct operating regimes. For small messages, the startup latency \(\alpha\) dominates—the link spends most of its time starting transfers, not sustaining them. For large messages, the bandwidth term \(n/\beta\) dominates, and the startup cost becomes negligible.

The crossover message size \(n^*\), defined by equation 2, is the point where both terms contribute equally:

\[ n^* = \alpha \times \beta \tag{2}\]

Below \(n^*\), communication is latency-dominated: sending more small messages wastes time on repeated startups. Above \(n^*\), communication is bandwidth-dominated: the link is fully utilized, and the only way to go faster is higher bandwidth.

Napkin Math 1.1: Worked example: Crossover on InfiniBand NDR

Setup: InfiniBand NDR has \(\alpha\) = 5 μs and \(\beta\) = 50 GB/s.

Question: At what message size does the bandwidth term equal the latency term?

Math:

\(n^* = \alpha \times \beta =\) 5 μs \(\times\) 50 GB/s \(=\) 250 KB

Analysis: Messages smaller than 250 KB are latency-dominated on IB NDR. Gradient tensors from a single layer of a large model are typically 10–100 MB—well above this threshold. Control messages, heartbeats, and barrier synchronizations, however, are well below it.

Systems insight: This is why NCCL and Gloo batch small gradient tensors into larger “buckets” before launching AllReduce. Sending each tensor individually would pay the 5 μs startup cost hundreds of times per iteration instead of a handful.

The LogGP model extension

The \(\alpha\)-\(\beta\) model assumes a message is a single, atomic transfer. In practice, networks pipeline messages: a sender can inject a new packet before the previous one has been fully received. The LogGP model (Alexandrov et al. 1995) extends LogP with a long-message gap parameter:

Alexandrov, Albert D., Mihai F. Ionescu, Klaus E. Schauser, and Chris J. Scheiman. 1995. “LogGP: Incorporating Long Messages into the LogP Model—One Step Closer Towards a Realistic Model for Parallel Computation.” Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures - SPAA ’95 30: 95–105. https://doi.org/10.1145/215399.215427.

\(g\) (gap): The minimum interval between consecutive message injections at the sender. This models the NIC’s injection rate limit.
\(o_{\text{send}}\), \(o_{\text{recv}}\) (overhead): The CPU time consumed by the processor to initiate or complete a message, during which it cannot do useful compute.

For a long message of \(n\) bytes, LogGP models transfer time as approximately: \[T_{\text{long}}(n) \approx o_{\text{send}} + L_{\text{LogGP}} + (n - 1)G + o_{\text{recv}}\]

where \(L_{\text{LogGP}}\) is the network latency, \(o_{\text{send}}\) and \(o_{\text{recv}}\) are sender and receiver overheads, \(G\) is the gap per byte for long messages, and \(g\) controls the minimum interval between consecutive message injections. Informal alpha-beta-style approximations can be useful for ML profiling, but the standard LogGP model keeps these gap parameters distinct from a simple bandwidth term.

In most ML training scenarios, the \(\alpha\)-\(\beta\) model is sufficient because gradient tensors are large enough that pipelining effects are secondary to raw bandwidth. The LogGP model becomes important when analyzing the overhead of many small control messages or when modeling the interaction between communication and computation overlap—situations that arise in fine-grained pipeline parallelism and asynchronous gradient updates.

The \(\alpha\)-\(\beta\) model describes point-to-point transfers. Collective operations—where all GPUs participate—compose multiple point-to-point transfers, and their cost formulas derive directly from the model above.

Collective Operation Complexity

Systems Perspective 1.2: Why this matters

Every distributed training step ends with a collective synchronization. The cost of that collective—not the cost of the matrix multiplies—often determines whether scaling from 8 GPUs to 256 is feasible. Understanding the exact complexity of each collective reveals which parallelism strategy (data, tensor, pipeline, expert) will dominate wall time at a given scale.

Collective operations move data between all \(N\) GPUs simultaneously. Each collective has a characteristic bandwidth term (how much data flows) and latency term (how many sequential message steps are required). The formulas below use the bandwidth-optimal ring algorithm unless otherwise noted; section 1.3 discusses when other algorithms are preferable.

AllReduce

AllReduce is the workhorse of data-parallel training: every GPU starts with a local gradient tensor of size \(M\) bytes and ends with the globally reduced (summed) result. The ring algorithm decomposes AllReduce into a ReduceScatter phase followed by an AllGather phase. Ring AllReduce time is given by equation 3:

\[ T_{\text{ring}} = 2 \cdot \frac{N - 1}{N} \cdot \frac{M}{\beta} + 2(N - 1) \cdot \alpha \tag{3}\]

The factor \(2(N-1)/N\) in the bandwidth term approaches 2 as \(N\) grows—each byte effectively traverses the ring twice (once for reduce-scatter, once for all-gather). The latency term grows linearly with \(N\), which makes the ring algorithm expensive in latency for large GPU counts.

For comparison, tree AllReduce has the cost shown in equation 4:

\[ T_{\text{tree}} = 2 \log_2 N \cdot \frac{M}{\beta} + 2 \log_2 N \cdot \alpha \tag{4}\]

The tree algorithm trades bandwidth efficiency for latency efficiency: the latency term grows as \(\log_2 N\) instead of \(N\), but the bandwidth term sends the full message \(M\) at each tree level instead of \(M/N\) chunks.

Napkin Math 1.2: Worked example: Ring vs. tree AllReduce

Setup: 256 H100 GPUs connected via InfiniBand NDR (\(\alpha\) = 5 μs, \(\beta\) = 50 GB/s). The gradient tensor is 1 GB (a 500M-parameter model in FP16).

Question: How long does AllReduce take with ring vs. tree?

Ring AllReduce:

\(T_{\text{ring}} \approx\) 42.4 ms

Tree AllReduce:

\(T_{\text{tree}} \approx\) 320.1 ms

Analysis: For this 1 GB message, ring AllReduce takes 42.4 ms, and tree takes 320.1 ms—the ring is significantly faster because it distributes the bandwidth load across all links. The tree’s logarithmic latency advantage is negligible compared to its bandwidth penalty for large messages.

Systems insight: For the large gradient tensors typical of data-parallel training, ring (or ring-based) algorithms dominate. Tree algorithms are useful only for small messages or when latency—not bandwidth—is the bottleneck.

AllGather

AllGather is the complement of ReduceScatter: each GPU starts with a \(M/N\)-sized shard and ends with the complete \(M\)-byte tensor. It is the core communication primitive in Fully Sharded Data Parallelism (FSDP) for reconstructing parameters before each forward pass, and in tensor parallelism for reassembling split activations.

Ring AllGather, given by equation 5:

\[ T_{\text{allgather}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{5}\]

AllGather is exactly half the cost of AllReduce (one ring pass instead of two) because it does not include a reduction step.

ReduceScatter

ReduceScatter is the dual of AllGather: each GPU starts with a full \(M\)-byte tensor and ends with a \(M/N\)-sized shard that contains the globally reduced values for that shard. It is the gradient synchronization primitive in FSDP and ZeRO, replacing the full AllReduce with a cheaper operation when each GPU only needs its own parameter shard’s gradient.

Ring ReduceScatter, given by equation 6:

\[ T_{\text{reducescatter}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{6}\]

The symmetry is not a coincidence: ring AllReduce decomposes into one ReduceScatter followed by one AllGather, and the costs add up exactly to equation 3.

AllToAll

AllToAll is the most general collective: each GPU sends a distinct \(M/N\)-sized chunk to every other GPU, and receives a distinct chunk from each. It is the routing primitive for Mixture-of-Experts (MoE) architectures, where tokens must be dispatched to the correct expert and the results gathered back.

Ring AllToAll, given by equation 7:

\[ T_{\text{alltoall}} = \frac{N - 1}{N} \cdot \frac{M}{\beta} + (N - 1) \cdot \alpha \tag{7}\]

Although the formula matches AllGather and ReduceScatter in form, the communication pattern is fundamentally different: AllToAll is a personalized exchange (each GPU sends different data to each peer), which makes it harder to overlap with computation and more sensitive to network congestion.

Broadcast

Broadcast sends a single \(M\)-byte tensor from one root GPU to all \(N - 1\) others. It is used for distributing initial model weights, updated learning rates, and configuration changes during training.

Tree Broadcast, given by equation 8:

\[ T_{\text{broadcast}} = \log_2 N \left(\alpha + \frac{M}{\beta}\right) \tag{8}\]

Broadcast is inherently asymmetric (one sender, many receivers), which makes a tree topology natural. In an unsegmented tree broadcast, the full \(M\)-byte payload is forwarded at each tree level, so both the latency and bandwidth terms appear along the \(\log_2 N\)-level critical path. Segmented or scatter-allgather broadcast variants can reduce the bandwidth term, but they require a different algorithm-specific model.

Collective complexity summary

With the individual formulas in place, table 2 provides a compact reference for comparing the cost structure of each collective operation.

Table 2: Collective Operation Cost Summary: Bandwidth and latency terms for ring and tree algorithms. The bandwidth term determines cost for large messages (gradients, parameters); the latency term determines cost for small messages (scalars, barriers). \(N\) = GPU count, \(M\) = total message size in bytes.

Operation	Bandwidth Term	Latency Term	Primary Use Case
AllReduce (Ring)	\(2 \cdot \frac{N-1}{N} \cdot \frac{M}{\beta}\)	\(2(N-1) \cdot \alpha\)	Data-parallel gradient sync
AllReduce (Tree)	\(2\log_2 N \cdot \frac{M}{\beta}\)	\(2\log_2 N \cdot \alpha\)	Small-message sync
AllGather (Ring)	\(\frac{N-1}{N} \cdot \frac{M}{\beta}\)	\((N-1) \cdot \alpha\)	FSDP parameter reconstruction
ReduceScatter (Ring)	\(\frac{N-1}{N} \cdot \frac{M}{\beta}\)	\((N-1) \cdot \alpha\)	FSDP/ZeRO gradient sharding
AllToAll (Ring)	\(\frac{N-1}{N} \cdot \frac{M}{\beta}\)	\((N-1) \cdot \alpha\)	MoE expert routing
Broadcast (Tree)	\(\log_2 N \cdot \frac{M}{\beta}\)	\(\log_2 N \cdot \alpha\)	Weight distribution, config sync

Two patterns emerge from this table. First, bandwidth-optimal ring algorithms all share the \((N-1)/N\) factor, which approaches 1 for large \(N\). Per-rank communication therefore approaches a constant multiple of \(M\): about \(2M\) for AllReduce and about \(M\) for AllGather, ReduceScatter, or AllToAll. Aggregate traffic still scales with the number of ranks. Second, the latency term grows linearly with \(N\) for ring algorithms but logarithmically for tree algorithms, creating the algorithm selection trade-off analyzed next.

Algorithm Selection

Systems Perspective 1.3: Why this matters

NCCL automatically selects a collective algorithm based on message size and GPU topology, but understanding why it chooses ring for large messages and tree for small ones helps diagnose when the auto-selection is suboptimal—and how to structure communication to stay in the efficient regime.

Ring vs. tree vs. recursive halving-doubling

The ring algorithm minimizes bandwidth usage but pays a latency cost that grows linearly with \(N\). The tree algorithm minimizes latency but wastes bandwidth because it sends the full message at each level. Recursive halving-doubling splits the difference: it uses halving (like a tree) for the reduce phase and doubling for the gather phase, achieving near-optimal bandwidth with \(\mathcal{O}(\log N)\) latency steps.

The decision boundary between ring and tree depends on message size:

\(M \gg n^*\) (large gradients): Use ring. The bandwidth term dominates, and ring’s \((N-1)/N\) factor is optimal.
\(M \ll n^*\) (barriers, scalars): Use tree. The latency term dominates, and tree’s \(\log_2 N\) steps are far fewer than ring’s \(2(N-1)\).
\(M \approx n^*\) (medium tensors): Use recursive halving-doubling for balanced performance, or test empirically.

Table 3 provides guidance for choosing the right algorithm.

Table 3: Algorithm Selection Guide: Strengths and weaknesses of ring, tree, and recursive halving-doubling for different message sizes and GPU counts. \(n^* = \alpha \times \beta\) is the crossover message size from equation 2.

Regime	Ring	Tree	Recursive Halving-Doubling
*Large messages (\(M > 10 \times n^\))**	Best—optimal bandwidth utilization	Poor—sends \(M\) at every tree level	Good—near-optimal bandwidth, \(\mathcal{O}(\log N)\) latency
*Small messages (\(M < n^/10\))**	Poor—\(2(N-1)\) startup costs dominate	Best—\(\log_2 N\) startup costs	Good—\(\mathcal{O}(\log N)\) startup costs
*Medium messages (\(M \approx n^\))**	Acceptable	Acceptable	Best—balances both terms
Power-of-2 \(N\)	Works for any \(N\)	Best when \(N = 2^k\)	Requires \(N = 2^k\)

Hierarchical AllReduce

Modern clusters have a two-level topology: GPUs within a node are connected by NVLink (450 GB/s), while nodes are connected by InfiniBand (50 GB/s). Hierarchical AllReduce exploits this asymmetry in three phases:

Intra-node reduce (NVLink): Each node reduces its 8 local GPUs to a single result using NVLink’s 450 GB/s bandwidth. This is fast because NVLink’s per-direction rate is 9× that of an InfiniBand NDR port.
Inter-node AllReduce (InfiniBand): The per-node results are all-reduced across nodes. Only one representative per 8 local GPUs participates, so the inter-node ring has one-eighth as many participants as a flat GPU-level ring.
Intra-node broadcast (NVLink): Each node broadcasts the global result to its 8 local GPUs.

The hierarchical approach reduces the inter-node ring size from all GPUs to one representative per 8 local GPUs, cutting the latency-step term by about 8\(\times\) and confining the bandwidth-hungry phases to the fast intra-node links. NCCL selects among topology-aware algorithms such as Ring, Tree, CollNet variants, NVLS/NVLSTree, and PAT based on the collective, message size, topology, architecture, and version; hierarchical communication is common on multi-node systems but is not a universal default.

Understanding the cost structure of collectives enables reasoning about data-parallel overhead. Pipeline parallelism, however, introduces a different kind of overhead: idle time caused by the sequential dependency between pipeline stages. The next section quantifies that cost.

Pipeline Arithmetic

Systems Perspective 1.4: Why this matters

Pipeline parallelism splits a model across multiple GPUs so that each GPU holds only a fraction of the layers. The trade-off is bubble time: idle cycles where GPUs wait for activations or gradients from upstream or downstream stages. Quantifying the bubble fraction reveals whether pipeline parallelism is worth the complexity at a given scale.

Bubble fraction

In a pipeline with \(p\) stages and \(m\) microbatches, equation 9 defines the bubble fraction as the proportion of total GPU-time spent idle:

\[ b_{\text{pipe}} = \frac{p - 1}{p - 1 + m} \tag{9}\]

This formula applies to both GPipe (where all forward passes complete before any backward pass begins) and 1F1B (one-forward-one-backward interleaving). The 1F1B schedule has the same asymptotic bubble fraction but uses less peak memory because it retires microbatches earlier.

The key insight: the bubble fraction depends on the ratio of stages to microbatches. Adding more microbatches (for a fixed number of stages) amortizes the pipeline fill and drain across more useful work.

Rule of thumb: Keeping bubble overhead at or below 20 percent requires \(m \geq 4(p - 1)\); strictly below 20 percent requires \(m > 4(p - 1)\). The practical rule \(m \geq 4p\) is a convenient conservative target.

Table 4 shows how bubble fraction varies with pipeline depth and microbatch count.

Table 4: Pipeline Bubble Fraction: Percentage of GPU-time lost to pipeline bubbles for various stage counts and microbatch counts. Values below 20 percent (the typical target) require \(m \geq 4p\).

Pipeline Stages (\(p\))	\(m\) = 8	\(m\) = 16	\(m\) = 32	\(m\) = 64
\(p\) = 4	27.3%	15.8%	8.6%	4.5%
\(p\) = 8	46.7%	30.4%	17.9%	9.9%

Interleaved scheduling

Interleaved scheduling (used in Megatron-LM) assigns \(V\) virtual stages to each physical GPU instead of one. Each GPU processes \(V\) nonconsecutive chunks of the model, which means the pipeline has \(p \times V\) virtual stages but the bubble overhead uses the original \(p\), as equation 10 shows:

\[ b_{\text{interleaved}} = \frac{p - 1}{p - 1 + m \times V} \tag{10}\]

The microbatch term in the denominator is multiplied by \(V\) compared to equation 9, so the bubble reduction approaches \(V\) when \(m\) dominates \(p - 1\). For \(p\) = 8, \(m\) = 16, and \(V\) = 4, the bubble fraction drops from 30.4 percent to 9.9 percent, a 3.1× reduction that comes at the cost of more frequent, smaller communication between stages.

Interleaved scheduling highlights a recurring theme in distributed ML: communication granularity is a tunable knob. Smaller, more frequent transfers improve overlap and reduce idle time, but increase the aggregate latency overhead. The same trade-off appears in gradient compression, which is the final cost model in this appendix.

Compression Economics

Systems Perspective 1.5: Why this matters

Gradient compression promises to reduce communication volume by 10–1000\(\times\), making AllReduce nearly free. Compression, however, has overhead (encoding and decoding time), and it only helps if that overhead is smaller than the communication time it saves. A quick break-even analysis prevents wasted effort on compression schemes that will not actually speed up training.

The compression equation

Gradient compression replaces the original \(M\)-byte gradient with a compressed representation of size \(M/r_{\text{comp}}\), where \(r_{\text{comp}}\) is the compression ratio. Equation 11 expresses the total time with compression:

\[ T_{\text{compressed}} = T_{\text{encode}} + T_{\text{transfer}}\!\left(\frac{M}{r_{\text{comp}}}\right) + T_{\text{decode}} \tag{11}\]

Compression is beneficial when \(T_{\text{compressed}} < T_{\text{transfer}}(M)\). Rearranging this inequality yields the break-even condition in equation 12:

\[ T_{\text{encode}} + T_{\text{decode}} < T_{\text{transfer}}(M) - T_{\text{transfer}}\!\left(\frac{M}{r_{\text{comp}}}\right) \tag{12}\]

In the bandwidth-dominated regime (\(M \gg n^*\)), the communication time scales linearly with message size, and the right-hand side simplifies to approximately \(T_{\text{transfer}}(M) \times (1 - 1/r_{\text{comp}})\). For high compression ratios (\(r_{\text{comp}} \gg 1\)), this approaches \(T_{\text{transfer}}(M)\)—meaning compression is worthwhile as long as the codec overhead is less than the uncompressed communication time.

Napkin Math 1.3: Worked example: Is top-k compression worth it?

Setup: Ring AllReduce of a 1 GB gradient across 256 H100 GPUs on InfiniBand NDR, taking 42.4 ms uncompressed. A Top-k sparsification scheme achieves 64× compression with 2 ms total encode + decode overhead.

Question: Does compression reduce the total AllReduce time?

Math:

Uncompressed AllReduce: 42.4 ms
Compressed communication: 3.2 ms
Total with compression: 5.2 ms
Speedup: 8.2×

Result: Compression delivers an 8.2× speedup in communication time. The 2 ms codec overhead is a small fraction of the 42.4 ms uncompressed baseline.

Systems insight: At this scale (256 GPUs, 1 GB gradients), gradient compression is clearly profitable. However, the convergence impact must be validated: Top-k sparsification discards information, and the model may require more iterations to reach the same accuracy, potentially negating the per-iteration speedup.

Table 5 shows how the break-even codec overhead varies with compression ratio and uncompressed AllReduce time. Higher compression ratios tolerate larger codec overheads.

Table 5: Compression Break-Even Thresholds: Maximum tolerable codec overhead (encode + decode) for compression to reduce total time, computed as \(T_{\text{transfer}}(M) \times (1 - 1/r_{\text{comp}})\). At high compression ratios, nearly the entire uncompressed AllReduce time is available as codec budget.

Compression Ratio (\(r_{\text{comp}}\))	AllReduce = 10 ms (Max Codec Overhead)	AllReduce = 40 ms (Max Codec Overhead)	AllReduce = 100 ms (Max Codec Overhead)
4\(\times\)	7.5	30	75
16\(\times\)	9.4	37.5	93.8
64\(\times\)	9.8	39.4	98.4
256\(\times\)	9.96	39.8	99.6

Summary

Key Takeaways: Communication costs are predictable

The \(\alpha\)-\(\beta\) model: The model (\(T(n) = \alpha + n/\beta\)) decomposes every network transfer into a fixed startup cost and a payload-proportional bandwidth cost. The crossover message size \(n^* = \alpha \times \beta\) determines which cost dominates.
Ring-based collectives: AllReduce, AllGather, ReduceScatter, and AllToAll achieve bandwidth-optimal communication with a \((N-1)/N\) factor, but their latency term grows linearly with GPU count. Tree-based algorithms trade bandwidth for logarithmic latency scaling.
Algorithm selection: Message size relative to \(n^*\) determines the algorithm: ring for large messages (gradients), tree for small messages (barriers), and hierarchical two-level schemes for multi-node clusters.
Pipeline bubble fraction: \(b_{\text{pipe}} = (p-1)/(p-1+m)\) is the fundamental overhead of pipeline parallelism. The practical rule \(m \geq 4p\) keeps bubbles below 20 percent, and interleaved scheduling with \(V\) virtual stages makes the bubble reduction approach \(V\) when the microbatch term dominates.
Gradient compression: Gradient compression is profitable when the codec overhead (encode + decode) is less than the communication time saved. At high compression ratios in the bandwidth-dominated regime, nearly the entire uncompressed AllReduce time is available as overhead budget—but convergence impact must always be validated.
Four parameters predict communication cost: The point-to-point and collective communication models in this appendix derive from two hardware parameters (\(\alpha\) and \(\beta\)) and two workload parameters (message payload \(n\) or \(M\), plus participant count \(N\) for collectives). Measuring these quantities for a cluster yields a complete first-order prediction of distributed training overhead, with pipeline and compression sections introducing additional parameters as noted earlier.