Scaling to 1000 GPUs

At 1024 GPUs, communication overhead is the cost everyone talks about. But is it the dominant one?

Sweep from 8 to 1024 GPUs and discover which overhead actually dominates at scale. The answer is not what most engineers expect.

The Question

You have a 70B-parameter model and a budget for 1024 H100 GPUs. Communication overhead (AllReduce, pipeline bubbles) will eat some of your scaling efficiency — everyone knows that. But is communication really the dominant cost at scale? Or is something else quietly stealing more of your training time?

Note: Prerequisites

Complete Tutorial 0: Hello, Roofline and Tutorial 1: The Memory Wall. You should understand memory-bound vs. compute-bound regimes and the roofline model.

Note: What You Will Learn
  • Calculate scaling efficiency across GPU counts from 8 to 1024
  • Predict cluster Mean Time Between Failures (MTBF) using the reliability model
  • Compare communication overhead vs. checkpoint overhead over a 30-day training run
  • Explain why reliability, not just communication, is a first-order cost at scale
Tip: Background on Cluster Reliability and the Young-Daly Formula

At scale, hardware failures are not rare events — they are routine. If a single GPU has a Mean Time To Failure (MTTF) of 50,000 hours, a cluster of N GPUs has a collective MTBF of approximately MTTF / N. At 512 GPUs, that is ~98 hours. At 1024 GPUs, it is ~49 hours, and real clusters do worse still, because NICs, power supplies, and other node components add failure rates of their own.

To survive failures, training jobs checkpoint periodically: save the full model state (weights + optimizer) to storage. The Young-Daly formula gives the optimal checkpoint interval that minimizes total wasted time:

\[T_{\text{opt}} = \sqrt{2 \cdot \delta \cdot M}\]

where \(\delta\) is the time to write one checkpoint and \(M\) is the cluster MTBF. This assumes failures arrive as a Poisson process (exponentially distributed inter-arrival times) and that on failure, work since the last checkpoint is lost — on average, half the checkpoint interval. The cost of checkpointing is not just the write time — it is also this rollback time.
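Both formulas are easy to sanity-check in plain Python. The helper names below are our own, not part of mlsysim; the sketch assumes the 50,000-hour per-GPU MTTF and the 60-second checkpoint write used later in this tutorial.

```python
import math

def cluster_mtbf_hours(gpu_mttf_h: float, n_gpus: int) -> float:
    # Independent exponential failures: rates add, so MTBF_cluster = MTTF_gpu / N
    return gpu_mttf_h / n_gpus

def young_daly_interval_hours(ckpt_write_s: float, mtbf_h: float) -> float:
    # T_opt = sqrt(2 * delta * M), with the write time delta converted to hours
    return math.sqrt(2.0 * (ckpt_write_s / 3600.0) * mtbf_h)

mtbf_512 = cluster_mtbf_hours(50_000, 512)          # ~97.7 h
t_opt = young_daly_interval_hours(60.0, mtbf_512)   # ~1.8 h between checkpoints
```

Note that this simplified model counts only GPUs; the simulator's numbers below come out lower because it also accounts for other node components.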


1. Setup

import mlsysim
from mlsysim import Engine

2. Single-Node Baseline

Let’s start with the simplest case: 8 GPUs in a single DGX node. No network fabric, no cross-node communication. This is our ceiling for scaling efficiency.

from mlsysim import DistributedModel, Models, Systems
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_
from mlsysim.show import table, info

model = Models.Llama3_70B
solver = DistributedModel()

# Single DGX H100 node: 8 GPUs, NVLink interconnect
single_node = Fleet(
    name="Single Node",
    node=Systems.Nodes.DGX_H100,
    count=1,
    fabric=Systems.Fabrics.InfiniBand_NDR
)

baseline = solver.solve(
    model=model, fleet=single_node,
    batch_size=64, precision="fp16", tp_size=8, pp_size=1
)

info("Single-Node Baseline",
     GPUs=single_node.total_accelerators,
     Compute_latency=baseline.node_profile.latency.to('ms'),
     Communication=baseline.communication_latency.to('ms'),
     Scaling_efficiency=f"{baseline.scaling_efficiency:.1%}")
── Single-Node Baseline ────────────────────
GPUs:                8
Compute latency:     6,642.8 ms
Communication:       19.61 ms
Scaling efficiency:  99.7%

With 8 GPUs on a single node, NVLink provides enough bandwidth that communication overhead is small. This is our baseline — near-perfect scaling.

Note: This tutorial uses the three dimensions of 3D parallelism:

  • TP (Tensor Parallelism) = 8: splits each layer’s weight matrices across 8 GPUs within a node, communicating via NVLink
  • PP (Pipeline Parallelism) = 1: no pipeline stages (the model is not split across sequential stages), so no pipeline bubbles
  • DP (Data Parallelism) = N/8: each node processes different data batches, synchronizing gradients via AllReduce over InfiniBand NDR (400 Gbps)

Setting PP=1 isolates the communication vs. reliability comparison cleanly. Real production systems often use PP > 1 for models too large to fit on one node with TP alone, but this introduces pipeline bubble overhead: in a GPipe-style schedule, an idle fraction of roughly (PP-1)/(PP-1+microbatches) per step.
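For intuition, the classic GPipe-style bubble estimate can be written out directly. This is the textbook formula, not necessarily the exact schedule mlsysim simulates:

```python
def pipeline_bubble_fraction(pp: int, n_microbatches: int) -> float:
    """Idle fraction of a step in a GPipe-style pipeline:
    (pp - 1) fill/drain slots out of (n_microbatches + pp - 1) total."""
    return (pp - 1) / (n_microbatches + pp - 1)

pipeline_bubble_fraction(1, 8)    # 0.0 -> PP=1 has no bubble, as in this tutorial
pipeline_bubble_fraction(8, 32)   # ~0.18 -> deep pipelines need many microbatches
```

This is why the "Bubble (ms)" column in the sweep below stays at zero: with PP=1 there are no fill/drain slots at all.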


3. Scale Sweep: 8 to 1024 GPUs

Now let’s sweep across cluster sizes. At each scale, we measure communication overhead (AllReduce + pipeline bubbles) as a fraction of total step time.

gpu_counts = [8, 32, 64, 128, 256, 512, 1024]

rows = []
results = {}
for n_gpus in gpu_counts:
    n_nodes = n_gpus // 8
    fleet = Fleet(
        name=f"{n_gpus}-GPU Cluster",
        node=Systems.Nodes.DGX_H100,
        count=n_nodes,
        fabric=Systems.Fabrics.InfiniBand_NDR
    )
    # Use TP=8 within node, DP across nodes
    r = solver.solve(
        model=model, fleet=fleet,
        batch_size=max(64, n_gpus),
        precision="fp16", tp_size=8, pp_size=1
    )
    results[n_gpus] = r
    rows.append([
        n_gpus, n_nodes,
        f"{r.communication_latency.to('ms').magnitude:.1f}",
        f"{r.pipeline_bubble_latency.to('ms').magnitude:.1f}",
        f"{r.scaling_efficiency:.1%}"
    ])

table(["GPUs", "Nodes", "Comm (ms)", "Bubble (ms)", "Efficiency"], rows)
GPUs  Nodes  Comm (ms)  Bubble (ms)  Efficiency
───────────────────────────────────────────────
8         1       19.6          0.0       99.7%
32        4       49.0          0.0       99.3%
64        8       53.9          0.0       99.2%
128      16      441.3          0.0       93.8%
256      32      617.8          0.0       91.5%
512      64      706.1          0.0       90.4%
1024    128      750.3          0.0       89.9%

Communication overhead grows with GPU count, but scaling efficiency degrades gradually. This is Amdahl’s Law in action: if the serial fraction (communication) is \(f\), the maximum speedup at \(N\) GPUs is \(1 / (f + (1-f)/N)\). At 1024 GPUs, even a 5% serial fraction caps ideal speedup at ~20×, not 1024×.
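Amdahl's bound is a one-liner to verify (the function name is ours):

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    # Maximum speedup on n workers when a fraction f of the work is serial
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

amdahl_speedup(0.05, 1024)   # ~19.6x: a 5% serial fraction idles ~98% of 1024 GPUs
```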

If communication were the only overhead, we could predict scaling efficiency from network bandwidth alone and call it a day. The numbers above look manageable — even at 1024 GPUs, communication takes a modest fraction of each step.

But this analysis assumed something we have not examined: that the cluster stays up for the entire training run. Communication is not the whole story.


4. The Reliability Reveal

Let’s now ask a different question: how often does a 512-GPU or 1024-GPU cluster experience a hardware failure? The ReliabilityModel estimates cluster-level MTBF.

from mlsysim import ReliabilityModel

rel_solver = ReliabilityModel()

rows = []
rel_results = {}
for n_gpus in gpu_counts:
    n_nodes = n_gpus // 8
    fleet = Fleet(
        name=f"{n_gpus}-GPU",
        node=Systems.Nodes.DGX_H100,
        count=n_nodes,
        fabric=Systems.Fabrics.InfiniBand_NDR
    )
    # 30-day training run, 60-second checkpoint write time
    r = rel_solver.solve(fleet=fleet, job_duration_hours=30*24, checkpoint_time_s=60.0)
    rel_results[n_gpus] = r
    mtbf_h = r.fleet_mtbf.to("hour").magnitude
    rows.append([
        n_gpus, n_nodes,
        f"{mtbf_h:.1f} h",
        f"{r.expected_failures:.1f}",
        f"{r.optimal_checkpoint_interval.to('hour').magnitude:.1f} h"
    ])

table(["GPUs", "Nodes", "Cluster MTBF", "Failures/30d", "Optimal Ckpt"], rows)
GPUs  Nodes  Cluster MTBF  Failures/30d  Optimal Ckpt
─────────────────────────────────────────────────────
8         1      4285.7 h           0.2        12.0 h
32        4      1071.4 h           0.7         6.0 h
64        8       535.7 h           1.3         4.2 h
128      16       267.9 h           2.7         3.0 h
256      32       133.9 h           5.4         2.1 h
512      64        67.0 h          10.8         1.5 h
1024    128        33.5 h          21.5         1.1 h

At 512+ GPUs, something fails roughly every day. The optimal checkpoint interval drops to about 1-2 hours. Every checkpoint pauses training, and every failure rolls back to the last checkpoint, discarding all work in between.
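To see what a day-scale MTBF costs, the rollback arithmetic behind the table can be spelled out. The function is our own sketch; the numbers plugged in come from the 1024-GPU row above:

```python
def rollback_loss_hours(job_hours: float, mtbf_h: float, ckpt_interval_h: float) -> float:
    # Each failure discards, on average, half a checkpoint interval of work
    expected_failures = job_hours / mtbf_h
    return expected_failures * ckpt_interval_h / 2.0

rollback_loss_hours(30 * 24, 33.5, 1.1)   # ~11.8 h of recomputation over 30 days
```

And that is with the optimal interval; checkpoint less often and each of those ~21 failures throws away proportionally more work.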


5. The Hidden Cost: Communication vs. Checkpoints

Now the key comparison. Over a 30-day training run, how much total time is lost to communication overhead vs. checkpoint overhead?

from mlsysim import CheckpointModel

ckpt_solver = CheckpointModel()
job_hours = 30 * 24  # 30 days

rows = []
for n_gpus in [64, 256, 512, 1024]:
    dist_r = results[n_gpus]
    rel_r = rel_results[n_gpus]

    # Fraction of each step lost to communication, applied over the whole 30-day job
    comm_fraction = 1.0 - dist_r.scaling_efficiency
    comm_loss_hours = comm_fraction * job_hours

    # Checkpoint overhead: write time per checkpoint + rollback from failures
    ckpt_r = ckpt_solver.solve(
        model=model, hardware=Systems.Nodes.DGX_H100.accelerator,
        optimizer="adam", checkpoint_interval_hours=rel_r.optimal_checkpoint_interval.to("hour").magnitude
    )
    ckpt_write_s = ckpt_r.write_time_seconds.to("second").magnitude
    ckpt_interval_h = rel_r.optimal_checkpoint_interval.to("hour").magnitude

    # Number of checkpoints in 30 days
    n_checkpoints = job_hours / ckpt_interval_h if ckpt_interval_h > 0 else 0
    total_ckpt_write_h = n_checkpoints * ckpt_write_s / 3600

    # Rollback cost: each failure loses ~half the checkpoint interval on average
    avg_rollback_h = ckpt_interval_h / 2
    total_rollback_h = rel_r.expected_failures * avg_rollback_h

    total_ckpt_loss_h = total_ckpt_write_h + total_rollback_h
    ratio = total_ckpt_loss_h / comm_loss_hours if comm_loss_hours > 0 else 0

    rows.append([n_gpus, f"{comm_loss_hours:.1f}", f"{total_ckpt_loss_h:.1f}", f"{ratio:.1f}x"])

table(["GPUs", "Comm Loss (h)", "Ckpt Loss (h)", "Ckpt/Comm"], rows)
GPUs  Comm Loss (h)  Ckpt Loss (h)  Ckpt/Comm
─────────────────────────────────────────────
64              5.8            9.5       1.6x
256            61.3           19.0       0.3x
512            69.2           26.9       0.4x
1024           73.1           38.1       0.5x
Important: Key Insight

Reliability is a first-order cost at scale. Checkpoint overhead grows from a rounding error on one node to tens of hours over a 30-day run at 1024 GPUs, and at 64 GPUs it already exceeds communication loss outright. Unlike communication, this cost is invisible if you only measure per-step efficiency: the job looks healthy right up until a failure rolls it back an hour of work. And the model here is optimistic, assuming instant failure detection and restart; in practice, detection, diagnosis, and requeueing add further downtime on top of the ~5% of wall-clock time lost in this simulation. The first question at scale is not “how fast is my network?” but “how often does something break?”


6. The Checkpoint Size Problem

The reason checkpoint overhead is so painful is that modern models produce enormous checkpoint files. Let’s quantify the checkpoint size for Llama-3 70B with the Adam optimizer:

ckpt_result = ckpt_solver.solve(
    model=model,
    hardware=Systems.Nodes.DGX_H100.accelerator,
    optimizer="adam",
    checkpoint_interval_hours=1.0
)

info("Checkpoint Analysis",
     Model=model.name,
     Checkpoint_size=ckpt_result.checkpoint_size.to('GB'),
     Write_time=ckpt_result.write_time_seconds.to('second'),
     MFU_penalty_1h=f"{ckpt_result.mfu_penalty_pct:.2%}",
     Storage_bottleneck=ckpt_result.storage_bottleneck)
── Checkpoint Analysis ─────────────────────
Model:               Llama-3.1-70B
Checkpoint size:     988.4 GB
Write time:          141.2 s
MFU penalty 1h:      3.92%
Storage bottleneck:  True

A single checkpoint for a 70B model with Adam optimizer states is nearly a terabyte. At hourly intervals, the I/O burden is substantial; at the roughly hourly Young-Daly intervals that large clusters require, it becomes a first-order cost.
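The 988.4 GB figure is consistent with simple bytes-per-parameter arithmetic. The sketch below is our own back-of-envelope, assuming mixed-precision Adam state (fp16 weights plus an fp32 master copy and two fp32 moments, 14 bytes per parameter) and the ~70.6B parameter count implied by the reported size:

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 2 + 4 + 4 + 4) -> float:
    # fp16 weights (2 B) + fp32 master copy (4 B) + Adam m and v moments (4 B each)
    return n_params * bytes_per_param / 1e9

checkpoint_size_gb(70.6e9)   # 988.4 GB, matching the output above
988.4 / 141.2                # ~7 GB/s effective storage write bandwidth
```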


Your Turn

Caution: Exercises

Exercise 1: Predict before you compute. At 2048 GPUs (256 DGX nodes), predict the cluster MTBF. Write your estimate, then build the fleet and run ReliabilityModel. How many failures do you expect in a 30-day run? Were you close?

Exercise 2: What checkpoint interval minimizes total overhead? For a 512-GPU cluster, sweep checkpoint intervals from 0.5 hours to 8 hours. For each interval, calculate: (a) total checkpoint write time over 30 days, and (b) total rollback time (expected failures x half the interval). Plot or print the total overhead. Does the minimum match the Young-Daly optimal interval?

Exercise 3: What if GPU reliability doubles? Imagine next-generation GPUs with MTTF of 100,000 hours instead of 50,000 hours. Build a custom Node with nics_per_node=8, psus_per_node=2 and re-run the reliability analysis for 1024 GPUs. How does the MTBF change? At what GPU count does the “doubled reliability” cluster match the original cluster’s MTBF at 512 GPUs?

Self-check: If single-GPU MTTF is 50,000 hours and a node has 8 GPUs, 8 NICs, and 2 PSUs, what is the approximate node MTTF? (Hint: MTTF_node = 1 / sum(1/MTTF_i) for each component.)


Key Takeaways

Tip: Summary
  • Communication overhead grows gradually with GPU count but remains manageable with good fabric
  • Cluster MTBF drops inversely with component count: at 1024 GPUs, something fails roughly every 33 hours in this model
  • Checkpoint overhead is a first-order cost at scale: write time plus rollback time adds tens of hours over a month-long run, and it is invisible to per-step profiling
  • The Young-Daly formula gives the optimal checkpoint interval, balancing write cost against rollback risk
  • Reliability is a first-order systems variable: no amount of network tuning compensates for a cluster that breaks every day
