Scaling to 1000 GPUs

At 1024 GPUs, communication overhead is the cost everyone talks about. But is it the dominant one?

Sweep from 8 to 1024 GPUs and discover which overhead actually dominates at scale. The answer is not what most engineers expect.

The Question

You have a 70B-parameter model and a budget for 1024 H100 GPUs. Communication overhead (AllReduce, pipeline bubbles) will eat some of your scaling efficiency — everyone knows that. But is communication really the dominant cost at scale? Or is something else quietly stealing more of your training time?

Note: Prerequisites

Complete Tutorial 0: Hello, Roofline and Tutorial 1: The Memory Wall. You should understand memory-bound vs. compute-bound regimes and the roofline model.

Note: What You Will Learn
  • Calculate scaling efficiency across GPU counts from 8 to 1024
  • Predict cluster Mean Time Between Failures (MTBF) using the reliability model
  • Compare communication overhead vs. checkpoint overhead over a 30-day training run
  • Explain why reliability, not just communication, is a first-order cost at scale
Tip: Background on Cluster Reliability and the Young-Daly Formula

At scale, hardware failures are not rare events — they are routine. If a single GPU has a Mean Time To Failure (MTTF) of 50,000 hours, a cluster of N GPUs has a collective MTBF of approximately MTTF / N. At 512 GPUs, that is ~98 hours. At 1024 GPUs, it is ~49 hours, and real clusters do worse still, because NICs, power supplies, and other node components add failure rates of their own.

To survive failures, training jobs checkpoint periodically: save the full model state (weights + optimizer) to storage. The Young-Daly formula gives the optimal checkpoint interval that minimizes total wasted time:

\[T_{\text{opt}} = \sqrt{2 \cdot \delta \cdot M}\]

where \(\delta\) is the time to write one checkpoint and \(M\) is the cluster MTBF. This assumes failures arrive as a Poisson process (exponentially distributed inter-arrival times) and that on failure, work since the last checkpoint is lost — on average, half the checkpoint interval. The cost of checkpointing is not just the write time — it is also this rollback time.
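Both formulas are easy to sanity-check in plain Python. The helper names below are our own, not part of mlsysim; the sketch assumes the 50,000-hour per-GPU MTTF and the 60-second checkpoint write used later in this tutorial.

```python
import math

def cluster_mtbf_hours(gpu_mttf_h: float, n_gpus: int) -> float:
    # Independent exponential failures: rates add, so MTBF_cluster = MTTF_gpu / N
    return gpu_mttf_h / n_gpus

def young_daly_interval_hours(ckpt_write_s: float, mtbf_h: float) -> float:
    # T_opt = sqrt(2 * delta * M), with the write time delta converted to hours
    return math.sqrt(2.0 * (ckpt_write_s / 3600.0) * mtbf_h)

mtbf_512 = cluster_mtbf_hours(50_000, 512)          # ~97.7 h
t_opt = young_daly_interval_hours(60.0, mtbf_512)   # ~1.8 h between checkpoints
```

Note that this simplified model counts only GPUs; the simulator's numbers below come out lower because it also accounts for other node components.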


1. Setup

import mlsysim
from mlsysim import Engine

2. Single-Node Baseline

Let’s start with the simplest case: 8 GPUs in a single DGX node. No network fabric, no cross-node communication. This is our ceiling for scaling efficiency.

from mlsysim import DistributedModel, Models, Systems
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_
from mlsysim.show import table, info

model = Models.Llama3_70B
solver = DistributedModel()

# Single DGX H100 node: 8 GPUs, NVLink interconnect
single_node = Fleet(
    name="Single Node",
    node=Systems.Nodes.DGX_H100,
    count=1,
    fabric=Systems.Fabrics.InfiniBand_NDR
)

baseline = solver.solve(
    model=model, fleet=single_node,
    batch_size=64, precision="fp16", tp_size=8, pp_size=1
)

info("Single-Node Baseline",
     GPUs=single_node.total_accelerators,
     Compute_latency=baseline.node_profile.latency.to('ms'),
     Communication=baseline.communication_latency.to('ms'),
     Scaling_efficiency=f"{baseline.scaling_efficiency:.1%}")
── Single-Node Baseline ────────────────────
GPUs:                8
Compute latency:     6,642.8 ms
Communication:       19.61 ms
Scaling efficiency:  99.7%

With 8 GPUs on a single node, NVLink provides enough bandwidth that communication overhead is small. This is our baseline — near-perfect scaling.

Note: This tutorial uses the three dimensions of 3D parallelism:

  • TP (Tensor Parallelism) = 8: splits each layer’s weight matrices across 8 GPUs within a node, communicating via NVLink
  • PP (Pipeline Parallelism) = 1: no pipeline stages (the model is not split across sequential stages), so no pipeline bubbles
  • DP (Data Parallelism) = N/8: each node processes different data batches, synchronizing gradients via AllReduce over InfiniBand NDR (400 Gbps)

Setting PP=1 isolates the communication vs. reliability comparison cleanly. Real production systems often use PP > 1 for models too large to fit on one node with TP alone, but this introduces pipeline bubble overhead: in a GPipe-style schedule, an idle fraction of roughly (PP-1)/(PP-1+microbatches) per step.
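For intuition, the classic GPipe-style bubble estimate can be written out directly. This is the textbook formula, not necessarily the exact schedule mlsysim simulates:

```python
def pipeline_bubble_fraction(pp: int, n_microbatches: int) -> float:
    """Idle fraction of a step in a GPipe-style pipeline:
    (pp - 1) fill/drain slots out of (n_microbatches + pp - 1) total."""
    return (pp - 1) / (n_microbatches + pp - 1)

pipeline_bubble_fraction(1, 8)    # 0.0 -> PP=1 has no bubble, as in this tutorial
pipeline_bubble_fraction(8, 32)   # ~0.18 -> deep pipelines need many microbatches
```

This is why the "Bubble (ms)" column in the sweep below stays at zero: with PP=1 there are no fill/drain slots at all.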


3. Scale Sweep: 8 to 1024 GPUs

Now let’s sweep across cluster sizes. At each scale, we measure communication overhead (AllReduce + pipeline bubbles) as a fraction of total step time.

gpu_counts = [8, 32, 64, 128, 256, 512, 1024]

rows = []
results = {}
for n_gpus in gpu_counts:
    n_nodes = n_gpus // 8
    fleet = Fleet(
        name=f"{n_gpus}-GPU Cluster",
        node=Systems.Nodes.DGX_H100,
        count=n_nodes,
        fabric=Systems.Fabrics.InfiniBand_NDR
    )
    # Use TP=8 within node, DP across nodes
    r = solver.solve(
        model=model, fleet=fleet,
        batch_size=max(64, n_gpus),
        precision="fp16", tp_size=8, pp_size=1
    )
    results[n_gpus] = r
    rows.append([
        n_gpus, n_nodes,
        f"{r.communication_latency.to('ms').magnitude:.1f}",
        f"{r.pipeline_bubble_latency.to('ms').magnitude:.1f}",
        f"{r.scaling_efficiency:.1%}"
    ])

table(["GPUs", "Nodes", "Comm (ms)", "Bubble (ms)", "Efficiency"], rows)
GPUs  Nodes  Comm (ms)  Bubble (ms)  Efficiency
───────────────────────────────────────────────
8         1       19.6          0.0       99.7%
32        4       49.0          0.0       99.3%
64        8       53.9          0.0       99.2%
128      16      441.3          0.0       93.8%
256      32      617.8          0.0       91.5%
512      64      706.1          0.0       90.4%
1024    128      750.3          0.0       89.9%

Communication overhead grows with GPU count, but scaling efficiency degrades gradually. This is Amdahl’s Law in action: if the serial fraction (communication) is \(f\), the maximum speedup at \(N\) GPUs is \(1 / (f + (1-f)/N)\). At 1024 GPUs, even a 5% serial fraction caps ideal speedup at ~20×, not 1024×.
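Amdahl's bound is a one-liner to verify (the function name is ours):

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    # Maximum speedup on n workers when a fraction f of the work is serial
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

amdahl_speedup(0.05, 1024)   # ~19.6x: a 5% serial fraction idles ~98% of 1024 GPUs
```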

If communication were the only overhead, we could predict scaling efficiency from network bandwidth alone and call it a day. The numbers above look manageable — even at 1024 GPUs, communication takes a modest fraction of each step.

But this analysis assumed something we have not examined: that the cluster stays up for the entire training run. Communication is not the whole story.


4. The Reliability Reveal

Let’s now ask a different question: how often does a 512-GPU or 1024-GPU cluster experience a hardware failure? The ReliabilityModel estimates cluster-level MTBF.

from mlsysim import ReliabilityModel

rel_solver = ReliabilityModel()

rows = []
rel_results = {}
for n_gpus in gpu_counts:
    n_nodes = n_gpus // 8
    fleet = Fleet(
        name=f"{n_gpus}-GPU",
        node=Systems.Nodes.DGX_H100,
        count=n_nodes,
        fabric=Systems.Fabrics.InfiniBand_NDR
    )
    # 30-day training run, 60-second checkpoint write time
    r = rel_solver.solve(fleet=fleet, job_duration_hours=30*24, checkpoint_time_s=60.0)
    rel_results[n_gpus] = r
    mtbf_h = r.fleet_mtbf.to("hour").magnitude
    rows.append([
        n_gpus, n_nodes,
        f"{mtbf_h:.1f} h",
        f"{r.expected_failures:.1f}",
        f"{r.optimal_checkpoint_interval.to('hour').magnitude:.1f} h"
    ])

table(["GPUs", "Nodes", "Cluster MTBF", "Failures/30d", "Optimal Ckpt"], rows)
GPUs  Nodes  Cluster MTBF  Failures/30d  Optimal Ckpt
─────────────────────────────────────────────────────
8         1      4285.7 h           0.2        12.0 h
32        4      1071.4 h           0.7         6.0 h
64        8       535.7 h           1.3         4.2 h
128      16       267.9 h           2.7         3.0 h
256      32       133.9 h           5.4         2.1 h
512      64        67.0 h          10.8         1.5 h
1024    128        33.5 h          21.5         1.1 h

At 512+ GPUs, something fails roughly every day. The optimal checkpoint interval drops to about 1-2 hours. Every checkpoint pauses training, and every failure rolls back to the last checkpoint, discarding all work in between.
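To see what a day-scale MTBF costs, the rollback arithmetic behind the table can be spelled out. The function is our own sketch; the numbers plugged in come from the 1024-GPU row above:

```python
def rollback_loss_hours(job_hours: float, mtbf_h: float, ckpt_interval_h: float) -> float:
    # Each failure discards, on average, half a checkpoint interval of work
    expected_failures = job_hours / mtbf_h
    return expected_failures * ckpt_interval_h / 2.0

rollback_loss_hours(30 * 24, 33.5, 1.1)   # ~11.8 h of recomputation over 30 days
```

And that is with the optimal interval; checkpoint less often and each of those ~21 failures throws away proportionally more work.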


5. The Hidden Cost: Communication vs. Checkpoints

Now the key comparison. Over a 30-day training run, how much total time is lost to communication overhead vs. checkpoint overhead?

from mlsysim import CheckpointModel

ckpt_solver = CheckpointModel()
job_hours = 30 * 24  # 30 days

rows = []
for n_gpus in [64, 256, 512, 1024]:
    dist_r = results[n_gpus]
    rel_r = rel_results[n_gpus]

    # Fraction of each step lost to communication, applied over the whole 30-day job
    comm_fraction = 1.0 - dist_r.scaling_efficiency
    comm_loss_hours = comm_fraction * job_hours

    # Checkpoint overhead: write time per checkpoint + rollback from failures
    ckpt_r = ckpt_solver.solve(
        model=model, hardware=Systems.Nodes.DGX_H100.accelerator,
        optimizer="adam", checkpoint_interval_hours=rel_r.optimal_checkpoint_interval.to("hour").magnitude
    )
    ckpt_write_s = ckpt_r.write_time_seconds.to("second").magnitude
    ckpt_interval_h = rel_r.optimal_checkpoint_interval.to("hour").magnitude

    # Number of checkpoints in 30 days
    n_checkpoints = job_hours / ckpt_interval_h if ckpt_interval_h > 0 else 0
    total_ckpt_write_h = n_checkpoints * ckpt_write_s / 3600

    # Rollback cost: each failure loses ~half the checkpoint interval on average
    avg_rollback_h = ckpt_interval_h / 2
    total_rollback_h = rel_r.expected_failures * avg_rollback_h

    total_ckpt_loss_h = total_ckpt_write_h + total_rollback_h
    ratio = total_ckpt_loss_h / comm_loss_hours if comm_loss_hours > 0 else 0

    rows.append([n_gpus, f"{comm_loss_hours:.1f}", f"{total_ckpt_loss_h:.1f}", f"{ratio:.1f}x"])

table(["GPUs", "Comm Loss (h)", "Ckpt Loss (h)", "Ckpt/Comm"], rows)
GPUs  Comm Loss (h)  Ckpt Loss (h)  Ckpt/Comm
─────────────────────────────────────────────
64              5.8            9.5       1.6x
256            61.3           19.0       0.3x
512            69.2           26.9       0.4x
1024           73.1           38.1       0.5x
Important: Key Insight

Reliability is a first-order cost at scale. Checkpoint overhead grows from a rounding error on one node to tens of hours over a 30-day run at 1024 GPUs, and at 64 GPUs it already exceeds communication loss outright. Unlike communication, this cost is invisible if you only measure per-step efficiency: the job looks healthy right up until a failure rolls it back an hour of work. And the model here is optimistic, assuming instant failure detection and restart; in practice, detection, diagnosis, and requeueing add further downtime on top of the ~5% of wall-clock time lost in this simulation. The first question at scale is not “how fast is my network?” but “how often does something break?”


6. The Checkpoint Size Problem

The reason checkpoint overhead is so painful is that modern models produce enormous checkpoint files. Let’s quantify the checkpoint size for Llama-3 70B with the Adam optimizer:

ckpt_result = ckpt_solver.solve(
    model=model,
    hardware=Systems.Nodes.DGX_H100.accelerator,
    optimizer="adam",
    checkpoint_interval_hours=1.0
)

info("Checkpoint Analysis",
     Model=model.name,
     Checkpoint_size=ckpt_result.checkpoint_size.to('GB'),
     Write_time=ckpt_result.write_time_seconds.to('second'),
     MFU_penalty_1h=f"{ckpt_result.mfu_penalty_pct:.2%}",
     Storage_bottleneck=ckpt_result.storage_bottleneck)
── Checkpoint Analysis ─────────────────────
Model:               Llama-3.1-70B
Checkpoint size:     988.4 GB
Write time:          141.2 s
MFU penalty 1h:      3.92%
Storage bottleneck:  True

A single checkpoint for a 70B model with Adam optimizer states is nearly a terabyte. At hourly intervals, the I/O burden is substantial; at the roughly hourly Young-Daly intervals that large clusters require, it becomes a first-order cost.
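The 988.4 GB figure is consistent with simple bytes-per-parameter arithmetic. The sketch below is our own back-of-envelope, assuming mixed-precision Adam state (fp16 weights plus an fp32 master copy and two fp32 moments, 14 bytes per parameter) and the ~70.6B parameter count implied by the reported size:

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 2 + 4 + 4 + 4) -> float:
    # fp16 weights (2 B) + fp32 master copy (4 B) + Adam m and v moments (4 B each)
    return n_params * bytes_per_param / 1e9

checkpoint_size_gb(70.6e9)   # 988.4 GB, matching the output above
988.4 / 141.2                # ~7 GB/s effective storage write bandwidth
```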


Your Turn

Caution: Exercises

Exercise 1: Predict before you compute. At 2048 GPUs (256 DGX nodes), predict the cluster MTBF. Write your estimate, then build the fleet and run ReliabilityModel. How many failures do you expect in a 30-day run? Were you close?

Exercise 2: What checkpoint interval minimizes total overhead? For a 512-GPU cluster, sweep checkpoint intervals from 0.5 hours to 8 hours. For each interval, calculate: (a) total checkpoint write time over 30 days, and (b) total rollback time (expected failures x half the interval). Plot or print the total overhead. Does the minimum match the Young-Daly optimal interval?

Exercise 3: What if GPU reliability doubles? Imagine next-generation GPUs with MTTF of 100,000 hours instead of 50,000 hours. Build a custom Node with nics_per_node=8, psus_per_node=2 and re-run the reliability analysis for 1024 GPUs. How does the MTBF change? At what GPU count does the “doubled reliability” cluster match the original cluster’s MTBF at 512 GPUs?

Self-check: If single-GPU MTTF is 50,000 hours and a node has 8 GPUs, 8 NICs, and 2 PSUs, what is the approximate node MTTF? (Hint: MTTF_node = 1 / sum(1/MTTF_i) for each component.)


Key Takeaways

Tip: Summary
  • Communication overhead grows gradually with GPU count but remains manageable with good fabric
  • Cluster MTBF drops inversely with component count: at 1024 GPUs, something fails roughly every 33 hours in this model
  • Checkpoint overhead is a first-order cost at scale: write time plus rollback time adds tens of hours over a month-long run, and it is invisible to per-step profiling
  • The Young-Daly formula gives the optimal checkpoint interval, balancing write cost against rollback risk
  • Reliability is a first-order systems variable: no amount of network tuning compensates for a cluster that breaks every day
