At 1024 GPUs, communication overhead is the cost everyone talks about. But is it the dominant one?
Sweep from 8 to 1024 GPUs and discover which overhead actually dominates at scale. The answer is not what most engineers expect.
The Question
You have a 70B-parameter model and a budget for 1024 H100 GPUs. Communication overhead (AllReduce, pipeline bubbles) will eat some of your scaling efficiency — everyone knows that. But is communication really the dominant cost at scale? Or is something else quietly stealing more of your training time?
Calculate scaling efficiency across GPU counts from 8 to 1024
Predict cluster Mean Time Between Failures (MTBF) using the reliability model
Compare communication overhead vs. checkpoint overhead over a 30-day training run
Explain why reliability — not communication — is the dominant cost at scale
Background: Cluster Reliability and the Young-Daly Formula
At scale, hardware failures are not rare events — they are routine. If a single GPU has a Mean Time To Failure (MTTF) of 50,000 hours, a cluster of N GPUs has a collective MTBF of approximately MTTF / N. At 512 GPUs, that is ~97 hours. At 1024 GPUs, it is roughly 49 hours.
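That 1/N scaling is easy to verify; a minimal sketch, counting GPU failures only (the simulator's node model also accounts for NICs and PSUs, which lowers MTBF further):

```python
def cluster_mtbf_hours(gpu_mttf_h: float, n_gpus: int) -> float:
    """With independent, exponentially distributed failures, the cluster
    fails at the combined rate n_gpus / gpu_mttf_h."""
    return gpu_mttf_h / n_gpus

for n in (8, 512, 1024):
    print(n, round(cluster_mtbf_hours(50_000, n), 1))  # 6250.0, 97.7, 48.8
```

This GPU-only estimate is slightly optimistic compared with the full node model used later in the tutorial.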
To survive failures, training jobs checkpoint periodically: save the full model state (weights + optimizer) to storage. The Young-Daly formula gives the optimal checkpoint interval that minimizes total wasted time:

\[ \tau_{opt} = \sqrt{2 \delta M} \]

where \(\delta\) is the time to write one checkpoint and \(M\) is the cluster MTBF. This assumes failures arrive as a Poisson process (exponentially distributed inter-arrival times) and that on failure, work since the last checkpoint is lost: on average, half the checkpoint interval. The cost of checkpointing is therefore not just the write time; it is also this rollback time.
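As a sanity check, plugging a 60-second checkpoint write (the value used in the simulation below) into \( \sqrt{2 \delta M} \) reproduces the intervals in the reliability table:

```python
import math

def young_daly_interval_h(checkpoint_time_s: float, mtbf_h: float) -> float:
    """Optimal checkpoint interval tau = sqrt(2 * delta * M), in hours."""
    delta_h = checkpoint_time_s / 3600.0
    return math.sqrt(2.0 * delta_h * mtbf_h)

# Cluster MTBF values from the reliability table: 8 GPUs and 1024 GPUs
print(round(young_daly_interval_h(60.0, 4285.7), 1))  # 12.0 h
print(round(young_daly_interval_h(60.0, 33.5), 1))    # 1.1 h
```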
1. Setup
```python
import mlsysim
from mlsysim import Engine
```
2. Single-Node Baseline
Let’s start with the simplest case: 8 GPUs in a single DGX node. No network fabric, no cross-node communication. This is our ceiling for scaling efficiency.
── Single-Node Baseline ────────────────────
GPUs: 8
Compute latency: 6,642.8 ms
Communication: 19.61 ms
Scaling efficiency: 99.7%
With 8 GPUs on a single node, NVLink provides enough bandwidth that communication overhead is small. This is our baseline — near-perfect scaling.
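The 99.7% figure follows directly from the two latencies printed above; a quick check, assuming efficiency is defined as compute time over total step time with fully non-overlapped communication:

```python
compute_ms = 6642.8   # compute latency from the baseline output
comm_ms = 19.61       # NVLink communication time

efficiency = compute_ms / (compute_ms + comm_ms)
print(f"{efficiency:.1%}")  # 99.7%
```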
Note: This tutorial uses the three dimensions of 3D parallelism:
TP (Tensor Parallelism) = 8: splits each layer’s weight matrices across 8 GPUs within a node, communicating via NVLink
PP (Pipeline Parallelism) = 1: no pipeline stages (the model is not split across sequential stages), so no pipeline bubbles
DP (Data Parallelism) = N/8: each node processes different data batches, synchronizing gradients via AllReduce over InfiniBand NDR (400 Gbps)
Setting PP=1 isolates the communication vs. reliability comparison cleanly. Real production systems often use PP > 1 for models too large to fit on one node with TP alone, but this introduces pipeline bubble overhead (idle time proportional to 1/PP stages).
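The degree arithmetic above (TP x PP x DP = total GPUs) can be sketched in a few lines; the helper is hypothetical, not part of mlsysim:

```python
def parallelism_degrees(n_gpus: int, tp: int = 8, pp: int = 1) -> dict:
    """The product of the three parallelism degrees must equal the GPU count."""
    assert n_gpus % (tp * pp) == 0, "GPU count must divide evenly"
    return {"TP": tp, "PP": pp, "DP": n_gpus // (tp * pp)}

print(parallelism_degrees(1024))  # {'TP': 8, 'PP': 1, 'DP': 128}
```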
3. Scale Sweep: 8 to 1024 GPUs
Now let’s sweep across cluster sizes. At each scale, we measure communication overhead (AllReduce + pipeline bubbles) as a fraction of total step time.
Communication overhead grows with GPU count, but scaling efficiency degrades gradually. This is Amdahl’s Law in action: if the serial fraction (communication) is \(f\), the maximum speedup at \(N\) GPUs is \(1 / (f + (1-f)/N)\). At 1024 GPUs, even a 5% serial fraction caps ideal speedup at ~20×, not 1024×. But if communication were the whole story, large-scale training would be predictable and manageable.
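The ~20x cap quoted above can be reproduced directly from Amdahl's formula; a minimal sketch of the idealized model, treating communication as a fixed serial fraction:

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Maximum speedup at n processors with serial fraction f."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

print(round(amdahl_speedup(0.05, 1024), 1))  # 19.6
```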
If communication were the only overhead, we could predict scaling efficiency from network bandwidth alone and call it a day. The numbers above look manageable — even at 1024 GPUs, communication takes a modest fraction of each step.
But this analysis assumed something we have not examined: that the cluster stays up for the entire training run. Communication is not the whole story.
4. The Reliability Reveal
Let’s now ask a different question: how often does a 512-GPU or 1024-GPU cluster experience a hardware failure? The ReliabilityModel estimates cluster-level MTBF and the resulting checkpoint schedule.
```python
from mlsysim import ReliabilityModel

rel_solver = ReliabilityModel()
rows = []
rel_results = {}
for n_gpus in gpu_counts:
    n_nodes = n_gpus // 8
    fleet = Fleet(
        name=f"{n_gpus}-GPU",
        node=Systems.Nodes.DGX_H100,
        count=n_nodes,
        fabric=Systems.Fabrics.InfiniBand_NDR,
    )
    # 30-day training run, 60-second checkpoint write time
    r = rel_solver.solve(fleet=fleet, job_duration_hours=30 * 24, checkpoint_time_s=60.0)
    rel_results[n_gpus] = r
    mtbf_h = r.fleet_mtbf.to("hour").magnitude
    rows.append([
        n_gpus,
        n_nodes,
        f"{mtbf_h:.1f} h",
        f"{r.expected_failures:.1f}",
        f"{r.optimal_checkpoint_interval.to('hour').magnitude:.1f} h",
    ])

table(["GPUs", "Nodes", "Cluster MTBF", "Failures/30d", "Optimal Ckpt"], rows)
```
GPUs Nodes Cluster MTBF Failures/30d Optimal Ckpt
─────────────────────────────────────────────────────
8 1 4285.7 h 0.2 12.0 h
32 4 1071.4 h 0.7 6.0 h
64 8 535.7 h 1.3 4.2 h
128 16 267.9 h 2.7 3.0 h
256 32 133.9 h 5.4 2.1 h
512 64 67.0 h 10.8 1.5 h
1024 128 33.5 h 21.5 1.1 h
At 512 GPUs the cluster fails roughly every three days; at 1024 GPUs, roughly every day and a half. The optimal checkpoint interval drops to about 1-2 hours. Every checkpoint pauses training, and every failure rolls back to the last checkpoint, discarding all work in between.
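The failure counts in the table are simply job duration divided by cluster MTBF; checking the 1024-GPU row:

```python
job_hours = 30 * 24          # 30-day training run
mtbf_h = 33.5                # 1024-GPU cluster MTBF from the table

expected_failures = job_hours / mtbf_h
print(round(expected_failures, 1))  # 21.5
```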
5. The Hidden Cost: Communication vs. Checkpoints
Now the key comparison. Over a 30-day training run, how much total time is lost to communication overhead vs. checkpoint overhead?
```python
from mlsysim import CheckpointModel

ckpt_solver = CheckpointModel()
job_hours = 30 * 24  # 30 days
rows = []
for n_gpus in [64, 256, 512, 1024]:
    dist_r = results[n_gpus]
    rel_r = rel_results[n_gpus]

    # Communication overhead per step as a fraction of step time
    comm_fraction = 1.0 - dist_r.scaling_efficiency
    comm_loss_hours = comm_fraction * job_hours

    # Checkpoint overhead: write time per checkpoint + rollback from failures
    ckpt_r = ckpt_solver.solve(
        model=model,
        hardware=Systems.Nodes.DGX_H100.accelerator,
        optimizer="adam",
        checkpoint_interval_hours=rel_r.optimal_checkpoint_interval.to("hour").magnitude,
    )
    ckpt_write_s = ckpt_r.write_time_seconds.to("second").magnitude
    ckpt_interval_h = rel_r.optimal_checkpoint_interval.to("hour").magnitude

    # Number of checkpoints in 30 days
    n_checkpoints = job_hours / ckpt_interval_h if ckpt_interval_h > 0 else 0
    total_ckpt_write_h = n_checkpoints * ckpt_write_s / 3600

    # Rollback cost: each failure loses ~half the checkpoint interval on average
    avg_rollback_h = ckpt_interval_h / 2
    total_rollback_h = rel_r.expected_failures * avg_rollback_h

    total_ckpt_loss_h = total_ckpt_write_h + total_rollback_h
    ratio = total_ckpt_loss_h / comm_loss_hours if comm_loss_hours > 0 else 0
    rows.append([n_gpus, f"{comm_loss_hours:.1f}", f"{total_ckpt_loss_h:.1f}", f"{ratio:.1f}x"])

table(["GPUs", "Comm Loss (h)", "Ckpt Loss (h)", "Ckpt/Comm"], rows)
```
GPUs Comm Loss (h) Ckpt Loss (h) Ckpt/Comm
─────────────────────────────────────────────
64 5.8 9.5 1.6x
256 61.3 19.0 0.3x
512 69.2 26.9 0.4x
1024 73.1 38.1 0.5x
Key Insight
At scale, reliability overhead becomes a first-order cost alongside communication. As GPU count grows, cluster MTBF falls, the optimal checkpoint interval shrinks, and checkpoint writes plus failure rollbacks claim a steadily growing share of training time: the Ckpt/Comm ratio above climbs from 256 GPUs onward. A seemingly well-tuned distributed training job can lose 10-30% of its wall-clock time to these combined overheads, and the checkpoint side is invisible if you only measure communication efficiency. The first question at scale is not “how fast is my network?” but “how often does something break?”
6. The Checkpoint Size Problem
The reason checkpoint overhead is so painful is that modern models produce enormous checkpoint files. Let’s quantify the checkpoint size for Llama-3 70B with the Adam optimizer:
A single checkpoint for a 70B model with Adam optimizer states is hundreds of gigabytes. At hourly intervals, the I/O burden is substantial. At the optimal Young-Daly interval for large clusters, the burden becomes a dominant factor.
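A back-of-the-envelope estimate makes the "hundreds of gigabytes" concrete. This assumes mixed-precision training with bf16 weights, an fp32 master copy, and two fp32 Adam moments (14 bytes per parameter); mlsysim's CheckpointModel may use a different accounting:

```python
params = 70e9                       # Llama-3 70B parameter count
bytes_per_param = 2 + 4 + 4 + 4     # bf16 weights + fp32 master + Adam m and v

ckpt_gb = params * bytes_per_param / 1e9
print(f"{ckpt_gb:.0f} GB")  # 980 GB
```

At a 1-hour checkpoint interval, that is nearly a terabyte written to storage every hour for the full 30 days.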
Your Turn
Exercises
Exercise 1: Predict before you compute. At 2048 GPUs (256 DGX nodes), predict the cluster MTBF. Write your estimate, then build the fleet and run ReliabilityModel. How many failures do you expect in a 30-day run? Were you close?
Exercise 2: What checkpoint interval minimizes total overhead? For a 512-GPU cluster, sweep checkpoint intervals from 0.5 hours to 8 hours. For each interval, calculate: (a) total checkpoint write time over 30 days, and (b) total rollback time (expected failures x half the interval). Plot or print the total overhead. Does the minimum match the Young-Daly optimal interval?
Exercise 3: What if GPU reliability doubles? Imagine next-generation GPUs with MTTF of 100,000 hours instead of 50,000 hours. Build a custom Node with nics_per_node=8, psus_per_node=2 and re-run the reliability analysis for 1024 GPUs. How does the MTBF change? At what GPU count does the “doubled reliability” cluster match the original cluster’s MTBF at 512 GPUs?
Self-check: If single-GPU MTTF is 50,000 hours and a node has 8 GPUs, 8 NICs, and 2 PSUs, what is the approximate node MTTF? (Hint: MTTF_node = 1 / sum(1/MTTF_i) for each component.)
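A sketch of the hint's harmonic-sum formula. The NIC and PSU MTTFs below are illustrative assumptions, chosen so the result matches the node MTTF implied by the 8-GPU row of the reliability table:

```python
def node_mttf_h(components: list[tuple[int, float]]) -> float:
    """Series reliability: the node's failure rate is the sum of the
    per-component failure rates (count / MTTF for each component type)."""
    rate = sum(count / mttf for count, mttf in components)
    return 1.0 / rate

# (count, per-unit MTTF in hours); NIC and PSU MTTFs are assumed values
node = [(8, 50_000.0), (8, 200_000.0), (2, 60_000.0)]
print(round(node_mttf_h(node), 1))  # 4285.7
```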
Key Takeaways
Summary
Communication overhead grows gradually with GPU count but remains manageable with good fabric
Cluster MTBF drops inversely with component count: at 1024 GPUs, the cluster fails roughly every 33 hours
Checkpoint overhead becomes a first-order cost at scale: write time plus rollback time grows with cluster size and rivals communication loss
The Young-Daly formula gives the optimal checkpoint interval, balancing write cost against rollback risk
Reliability is a first-order systems variable: no amount of network tuning compensates for a cluster that breaks every day