Reliability Foundations
Purpose
How does failure become a statistical certainty at fleet scale, and what does the math require for recovery?
Individual accelerators fail rarely, but fleets multiply component failure rates until failures become a continuous operating condition. At cluster scale, the expected time between failures drops from years to hours, and recovery becomes part of normal execution rather than an exceptional incident. This appendix collects the reference calculations for reasoning quantitatively about failure, recovery, and availability at scale, providing the mathematical tools behind fault tolerance strategies, fleet orchestration policies, and operational practice. In C³ terms, it develops the coordination axis, where checkpointing, recovery, and availability determine how much of the fleet’s compute and communication becomes useful work.
How to Use This Appendix
This appendix is a reference for moving from intuition (“failures happen more often at scale”) to quantitative engineering decisions (“how often should one checkpoint?” or “how many spare nodes are needed?”).
The worked examples use a canonical 10,000-GPU training fleet unless noted otherwise. To find the relevant calculation, start from the question at hand:
- “How often will something fail?”—Start with section 1.1 and the MTBF cascade in section 1.1.2.
- “How often should I checkpoint?”—Use the Young-Daly model in section 1.2.1 and the worked example in section 1.2.3.
- “How much time do I lose to recovery?”—See the recovery anatomy in section 1.3.1 and the goodput analysis in section 1.3.2.
- “Should I use redundancy or checkpointing?”—Compare strategies in section 1.4.1 and availability stacking in section 1.4.2.
Failure Probability at Scale
Consider a training run on a 10,000-GPU cluster running for three weeks. What is the probability that at least one GPU fails during that time? The answer—effectively 100 percent—determines whether the system needs fault tolerance as a core design requirement or merely a nice-to-have. The calculations in this section make that determination precise.
Individual hardware components are remarkably reliable. A data center-grade GPU operates for tens of thousands of hours before failing. The physics of large-scale systems, however, works against reliability: every additional component is another opportunity for failure, and the aggregate failure rate scales linearly with component count. This section develops the arithmetic that transforms component-level reliability into system-level failure predictions.
Component failure rates
Reliability engineers characterize components using two complementary metrics. The Failure in Time (FIT) rate counts failures per \(10^9\) device-hours of operation, a unit chosen because individual components fail so rarely that failures-per-hour would produce inconveniently small numbers. The reciprocal quantity, Mean Time To Failure (MTTF), gives the average lifetime in hours, as equation 1 shows:
\[ \text{MTTF} = \frac{10^9}{\text{FIT}} \tag{1}\]
Table 1 lists reference FIT rates and MTTF values for components found in a typical GPU training node, grounded in large-fleet and warehouse-scale experience (Kokolis et al. 2025; Zu et al. 2024; Barroso et al. 2019). These values assume the steady-state “useful life” phase of the bathtub curve, where the failure rate is approximately constant—neither dominated by infant mortality (early life) nor wear-out (end of life) (Klutke et al. 2003).
| Component | FIT Rate | MTTF | MTTF (years) | Typical Failure Mode |
|---|---|---|---|---|
| GPU | 20,000 | 50,000 hours | 5.7 years | Die defect, thermal fatigue |
| HBM | 5,000 | 200,000 hours | 22.8 years | Bit-flip accumulation, TSV |
| NIC | 6,666.7 | 150,000 hours | 17.1 years | Transceiver degradation |
| PSU | 10,000 | 100,000 hours | 11.4 years | Capacitor aging |
| PCIe Switch | 5,000 | 200,000 hours | 22.8 years | Solder joint, ESD damage |
| Optical Cable | 20,000 | 50,000 hours | 5.7 years | Fiber bend, connector wear |
| ToR Switch | 3,333.3 | 300,000 hours | 34.2 years | ASIC failure, fan bearing |
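
The conversions in table 1 follow directly from equation 1. The sketch below reproduces three rows; the function names and the 8,760-hour year (leap years ignored) are illustrative assumptions, not part of the original text.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap years ignored (assumption)


def fit_to_mttf_hours(fit: float) -> float:
    """Equation 1: MTTF in hours from a FIT rate (failures per 1e9 device-hours)."""
    return 1e9 / fit


def mttf_hours_to_years(mttf_hours: float) -> float:
    """Convert an MTTF in hours to years for easier reading."""
    return mttf_hours / HOURS_PER_YEAR


# FIT rates copied from table 1
fit_rates = {"GPU": 20_000, "HBM": 5_000, "PSU": 10_000}

for name, fit in fit_rates.items():
    hours = fit_to_mttf_hours(fit)
    print(f"{name}: {hours:,.0f} hours = {mttf_hours_to_years(hours):.1f} years")
```
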
Each component in isolation appears highly reliable—a GPU lasts 5.7 years on average. The trouble begins when we ask how a node behaves with many such components operating simultaneously.
The MTBF cascade
A compute node is a series system: if any component fails, the node fails. For independent components with constant failure rates, equation 2 sums the individual rates into the node-level failure rate:
\[ \frac{1}{\text{MTBF}_{\text{node}}} = \frac{n_{\text{gpu}}}{\text{MTTF}_{\text{gpu}}} + \frac{n_{\text{nic}}}{\text{MTTF}_{\text{nic}}} + \frac{n_{\text{psu}}}{\text{MTTF}_{\text{psu}}} + \cdots \tag{2}\]
Think of each component as a ticking clock counting down to failure. A node with 8 GPUs, 2 NICs, and 2 PSUs has 12 independent clocks—the node fails when the first clock reaches zero. More clocks mean a shorter expected wait.
For a cluster of \(N_{\text{nodes}}\) identical nodes, the same logic applies one level up, as equation 3 shows:
\[ \text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{node}}}{N_{\text{nodes}}} \tag{3}\]
This is the MTBF cascade: the failure rate grows linearly with component count at each level, and the levels compound. A node with 5,172.4 hours MTBF (the result of equation 2 for 8 GPUs, 2 NICs, and 2 PSUs at the table 1 rates) sounds reliable. A cluster of 1,250 such nodes has an MTBF of just 4.14 hours, so a failure every few hours is the expected steady state.
Table 2 shows how cluster MTBF shrinks as fleet size grows.
| Cluster GPUs | Nodes | Cluster MTBF | Expected Failures/Day |
|---|---|---|---|
| 256 | 32 | 161.6 hours | 0.1 |
| 1,024 | 128 | 40.4 hours | 0.6 |
| 2,048 | 256 | 20.2 hours | 1.2 |
| 8,192 | 1,024 | 5.1 hours | 4.8 |
| 10,000 | 1,250 | 4.1 hours | 5.8 |
| 100,000 | 12,500 | 24.8 minutes | 58 |
The table makes a visceral point: the transition from “hundreds of GPUs” to “tens of thousands” is not merely a quantitative change but a qualitative one. At 256 GPUs, nearly a week passes between failures on average. At 10,000 GPUs, the cluster expects multiple failures per shift. At 100,000 GPUs, failures are a continuous background condition: the system is never fully healthy.
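
Equations 2 and 3 can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the node contains only the 8 GPUs, 2 NICs, and 2 PSUs named in the text, which is the combination that reproduces the 5,172.4-hour node figure; the dictionary layout and function names are illustrative.

```python
# (count per node, MTTF in hours) using table 1 values.
# Only the components named in the text are included (assumption).
NODE_COMPONENTS = {
    "gpu": (8, 50_000),
    "nic": (2, 150_000),
    "psu": (2, 100_000),
}


def node_mtbf_hours(components: dict) -> float:
    """Equation 2: sum the component failure rates, then invert."""
    rate_per_hour = sum(count / mttf for count, mttf in components.values())
    return 1.0 / rate_per_hour


def cluster_mtbf_hours(node_mtbf: float, n_nodes: int) -> float:
    """Equation 3: identical, independent nodes."""
    return node_mtbf / n_nodes


node = node_mtbf_hours(NODE_COMPONENTS)
print(f"node MTBF: {node:,.1f} hours")
for n_nodes in (32, 1_250, 12_500):
    mtbf = cluster_mtbf_hours(node, n_nodes)
    print(f"{n_nodes:>6} nodes: MTBF {mtbf:7.2f} h, {24 / mtbf:5.1f} failures/day")
```

Adding further components to `NODE_COMPONENTS` (for example HBM stacks or PCIe switches) would lower the node MTBF and shift every row of table 2 accordingly.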
Probability of failure during a job
Knowing the MTBF tells us the average time between failures, but training jobs have fixed durations. The question practitioners ask is: what is the probability that my job will be interrupted at least once?
Under the exponential failure model (constant failure rate), equation 4 gives the probability of at least one failure during a job of duration \(T_{\text{job}}\):
\[ \Pr(\geq 1\ \text{failure}) = 1 - e^{-T_{\text{job}} / \text{MTBF}} \tag{4}\]
When \(T_{\text{job}} \gg \text{MTBF}\), this probability approaches 1 rapidly. Table 3 shows the concrete numbers for various cluster sizes and job durations.
| Cluster GPUs | 1 Day (24 h) | 1 Week (168 h) | 30 Days (720 h) |
|---|---|---|---|
| 256 | 13.8% | 64.6% | 98.8% |
| 1,024 | 44.8% | 98.4% | > 99.9% |
| 2,048 | 69.5% | > 99.9% | > 99.9% |
| 8,192 | 99.1% | > 99.9% | > 99.9% |
| 10,000 | 99.7% | > 99.9% | > 99.9% |
| 100,000 | > 99.9% | > 99.9% | > 99.9% |
The message is stark: for any cluster above a few thousand GPUs running jobs longer than a day, the probability of experiencing at least one failure is effectively 100 percent. This is why Fault Tolerance treats fault tolerance not as a defensive measure but as a fundamental architectural requirement.
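
Equation 4 is easy to evaluate for any job length. The sketch below recomputes a few entries of table 3 from the rounded cluster MTBF values in table 2; the function name is an illustrative choice, and the constant-failure-rate model is the same assumption the text makes.

```python
import math


def prob_at_least_one_failure(job_hours: float, mtbf_hours: float) -> float:
    """Equation 4 under the exponential (constant failure rate) model."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x
    return -math.expm1(-job_hours / mtbf_hours)


# (cluster GPUs, cluster MTBF in hours) taken from table 2
for gpus, mtbf in ((256, 161.6), (1_024, 40.4), (10_000, 4.14)):
    p_day = prob_at_least_one_failure(24, mtbf)
    p_week = prob_at_least_one_failure(168, mtbf)
    print(f"{gpus:>6} GPUs: 1 day {p_day:.1%}, 1 week {p_week:.1%}")
```
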
The exponential failure model assumes a constant failure rate, which holds during the steady-state useful-life phase. During burn-in (first few hundred hours) and wear-out (approaching end-of-life), failure rates are higher. In practice, fleet operators observe that newly deployed nodes exhibit 2–3\(\times\) higher failure rates in their first week, making burn-in testing essential before admitting nodes to production clusters.
The inevitability of failure during long training jobs leads directly to the next question: if we will lose progress, how do we minimize how much?
Checkpoint Optimization
Every checkpoint saves progress but costs time. Checkpoint too rarely and a failure destroys hours of training. Checkpoint too frequently and the overhead of writing checkpoints itself becomes the bottleneck. The Young-Daly formula gives the mathematically optimal balance point, and it depends on just two measurable quantities: how long a checkpoint takes to write and how often failures occur (Young 1974; Daly 2006).
The Young-Daly model
The optimal checkpoint interval balances two competing costs. Writing a checkpoint takes time \(T_{\text{write}}\) (the checkpoint cost), during which no useful training occurs. The longer the interval between checkpoints, however, the more work is lost when a failure strikes—on average, half the interval. Equation 5 states the Young-Daly formula that minimizes the expected total overhead:
\[ \tau_{\text{opt}} = \sqrt{2 \times T_{\text{write}} \times \text{MTBF}_{\text{system}}} \tag{5}\]
where \(T_{\text{write}}\) is the checkpoint write time in seconds and \(\text{MTBF}_{\text{system}}\) is the cluster or job-level system MTBF in seconds.
The intuition behind equation 5 is geometric: the optimal interval is \(\sqrt{2}\) times the geometric mean of the two time scales, so when checkpoints are cheap relative to the MTBF (\(T_{\text{write}} \ll \text{MTBF}_{\text{system}}\), the common case), it sits between them. If checkpoints took zero time, the optimal cadence would be every step. If the system never failed, no checkpoints would be required. The square root interpolates between these extremes.
The formula assumes that failures follow an exponential distribution (memoryless property) and that checkpoint cost \(T_{\text{write}}\) is small compared to \(\text{MTBF}_{\text{system}}\). Both assumptions hold well for production training clusters: the exponential model fits observed failure data, and modern checkpointing systems write to fast parallel storage in tens of seconds, while MTBF is measured in hours.
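
A first-order sketch shows where equation 5 comes from. Assume each interval \(\tau\) pays one checkpoint write, and each failure loses half an interval of work on average; the overhead symbol \(O(\tau)\) is notation introduced here for illustration, and the approximation ignores detection, rescheduling, and reload time. The overhead per unit of training time is then approximately:

\[ O(\tau) \approx \frac{T_{\text{write}}}{\tau} + \frac{\tau}{2\,\text{MTBF}_{\text{system}}} \]

Setting the derivative to zero recovers the Young-Daly interval:

\[ \frac{dO}{d\tau} = -\frac{T_{\text{write}}}{\tau^{2}} + \frac{1}{2\,\text{MTBF}_{\text{system}}} = 0 \quad\Longrightarrow\quad \tau_{\text{opt}} = \sqrt{2\,T_{\text{write}}\,\text{MTBF}_{\text{system}}} \]

At this optimum the two terms are equal, each \(\sqrt{T_{\text{write}} / (2\,\text{MTBF}_{\text{system}})}\), so under these assumptions checkpoint overhead and expected lost work contribute the same amount.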
Checkpoint sizing
Checkpoint size determines the write time \(T_{\text{write}}\) that feeds into the Young-Daly formula. For common mixed-precision training checkpoints with the Adaptive Moment Estimation (Adam) optimizer, each parameter requires approximately 14 bytes of persistent state:
- 2 bytes for BF16 model weights
- 4 bytes for FP32 master weights
- 4 bytes for FP32 first moment (Adam \(m\))
- 4 bytes for FP32 second moment (Adam \(v\))
Equation 6 aggregates these per-parameter costs into the total checkpoint footprint:
\[ \text{Checkpoint Size} = P \times 14 \text{ bytes/parameter} \tag{6}\]
Here, \(P\) is the number of trainable model parameters whose persistent optimizer and weight state must be serialized.
Table 4 shows how checkpoint state moves from manageable gigabytes to frontier-scale terabytes as model size grows.
| Model Size | Checkpoint Size | Write Time at 100 GB/s |
|---|---|---|
| 7B | 98 GB | 0.98 s |
| 13B | 182 GB | 1.82 s |
| 70B | 980 GB | 9.8 s |
| 175B | 2,450 GB | 24.5 s |
| 1T | 14,000 GB | 140 s |
As table 4 shows, at frontier scale (175B+ parameters), checkpoint sizes reach the terabyte range. This makes checkpoint write time a significant cost that directly affects the Young-Daly optimal interval. The checkpoint strategies in Fault Tolerance discuss techniques for reducing \(T_{\text{write}}\) (asynchronous checkpointing, incremental deltas, and distributed storage), all of which improve the Young-Daly result by shrinking the \(T_{\text{write}}\) factor under the square root.
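
The sizing arithmetic in equation 6 can be sketched as follows. The 100 GB/s default matches the table header, sizes use decimal gigabytes (\(10^9\) bytes) as the table does, and the function names are illustrative assumptions.

```python
# Persistent bytes per parameter: BF16 weights + FP32 master + Adam m + Adam v
BYTES_PER_PARAM = 2 + 4 + 4 + 4


def checkpoint_bytes(n_params: float) -> float:
    """Equation 6: total checkpoint footprint in bytes."""
    return n_params * BYTES_PER_PARAM


def write_seconds(n_params: float, bandwidth_bytes_per_s: float = 100e9) -> float:
    """Time to write one checkpoint at the given aggregate storage bandwidth."""
    return checkpoint_bytes(n_params) / bandwidth_bytes_per_s


for label, n_params in (("7B", 7e9), ("70B", 70e9), ("175B", 175e9), ("1T", 1e12)):
    size_gb = checkpoint_bytes(n_params) / 1e9
    print(f"{label:>4}: {size_gb:,.0f} GB, {write_seconds(n_params):.2f} s")
```

Halving the per-parameter state or doubling the bandwidth halves `write_seconds`, which is the quantity that feeds equation 5.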
Worked example: Optimal checkpoint interval
Combining the checkpoint write time with the 10,000-GPU MTBF gives a concrete checkpoint cadence.
Example 1.1: Young-Daly: 175B model on a 10,000-GPU cluster
Step 1: Checkpoint write time (\(T_{\text{write}}\)). \[T_{\text{write}} = \frac{\text{Checkpoint Size}}{\text{Write Bandwidth}} = \frac{2{,}450 \text{ GB}}{100 \text{ GB/s}} = 24.5 \text{ s}\]
Step 2: Apply the Young-Daly formula (equation 5). \[\tau_{\text{opt}} = \sqrt{2 \times T_{\text{write}} \times \text{MTBF}_{\text{system}}} = \sqrt{2 \times 24.5 \text{ s} \times 4.14 \text{ h} \times 3{,}600 \text{ s/h}} = 14.2 \text{ min}\]
Interpretation. The optimal checkpoint interval is approximately 14.2 minutes. The overhead from checkpointing alone is \(T_{\text{write}} / \tau_{\text{opt}} \approx\) 2.9 percent of training time.
Systems insight: If the cluster were doubled to 20,000 GPUs, the MTBF would halve, and the optimal interval would shrink to 10.1 minutes—checkpointing more frequently because failures happen more often. This illustrates the fundamental tension at scale: larger clusters are faster but demand more frequent interruption to protect progress.
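
The example above can be reproduced numerically. This is a minimal sketch using the same inputs (24.5 s write time, 4.14 h MTBF); the function and variable names are illustrative.

```python
import math


def young_daly_interval_s(t_write_s: float, mtbf_s: float) -> float:
    """Equation 5: optimal checkpoint interval in seconds."""
    return math.sqrt(2.0 * t_write_s * mtbf_s)


T_WRITE_S = 24.5      # 175B checkpoint at 100 GB/s (Step 1)
MTBF_S = 4.14 * 3600  # 10,000-GPU cluster MTBF in seconds

tau = young_daly_interval_s(T_WRITE_S, MTBF_S)
tau_doubled = young_daly_interval_s(T_WRITE_S, MTBF_S / 2)  # 20,000 GPUs

print(f"interval: {tau / 60:.1f} min, write overhead: {T_WRITE_S / tau:.1%}")
print(f"interval at 20,000 GPUs: {tau_doubled / 60:.1f} min")
```

Because the interval scales with the square root of MTBF, halving the MTBF shortens the interval by a factor of \(\sqrt{2}\) rather than 2.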
The boundary conditions of the Young-Daly formula merit attention. When \(T_{\text{write}}\) approaches \(\text{MTBF}_{\text{system}}\), the small-checkpoint-cost assumption behind equation 5 no longer holds, and checkpoint and rework overheads become so large that checkpoint/restart alone may fail to maintain useful forward progress. In such cases, redundancy or elastic training becomes necessary, as discussed in section 1.4.1.
Recovery Budgets
When a failure occurs, the system does not instantly resume training. Detection, rescheduling, reloading state, and replaying lost work each consume time. Understanding this recovery anatomy reveals which phase dominates and where to invest engineering effort.
The anatomy of recovery time
Recovery is not a single event but a pipeline of phases, each with its own time budget. Equation 7 sums them into total recovery time:
\[ T_{\text{recovery}} = T_{\text{detect}} + T_{\text{reschedule}} + T_{\text{reload}} + T_{\text{replay}} \tag{7}\]
The terms are all durations: failure detection, replacement scheduling, checkpoint reload, and replay of lost training work since the last checkpoint. Their sum is the wall-clock time before the job returns to productive training.
Table 5 breaks recovery into the phases that determine how long the cluster remains below full productivity after a failure.
| Phase | Typical Duration | What Happens |
|---|---|---|
| \(T_{\text{detect}}\) | 30 s | Heartbeat timeout expires; worker declared failed (confirmation adds to the budget in Fault Tolerance) |
| \(T_{\text{reschedule}}\) | 60 s | Replacement node allocated from spare pool |
| \(T_{\text{reload}}\) | 24.5 s | Checkpoint read from storage into GPU memory |
| \(T_{\text{replay}}\) | ~7.1 min | Recompute training steps since last checkpoint |
| Total \(T_{\text{recovery}}\) | ~9 min | System fully productive again |
As table 5 illustrates, \(T_{\text{replay}}\) typically dominates, and it is directly controlled by the checkpoint interval: on average, half the interval must be replayed. This is the same trade-off the Young-Daly formula resolves: shorter intervals mean less replay but more checkpoint overhead, and the formula finds the minimum of their sum.
The other phases offer engineering optimization targets. \(T_{\text{detect}}\) can be reduced with more aggressive heartbeat intervals (at the cost of false positives). \(T_{\text{reschedule}}\) depends on having hot spare nodes preallocated, a fleet orchestration decision covered in Fleet Orchestration. \(T_{\text{reload}}\) scales with checkpoint size and storage bandwidth, motivating the checkpoint compression and sharding techniques discussed in Fault Tolerance.
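
The recovery budget in equation 7 can be tallied directly. The sketch below uses the phase durations from table 5 and takes expected replay as half of the 14.2-minute checkpoint interval; the function name and the dictionary of phases are illustrative assumptions.

```python
def recovery_phases_s(detect_s, reschedule_s, reload_s, checkpoint_interval_s):
    """Equation 7 terms, with expected replay equal to half the checkpoint interval."""
    return {
        "detect": detect_s,
        "reschedule": reschedule_s,
        "reload": reload_s,
        "replay": checkpoint_interval_s / 2,
    }


phases = recovery_phases_s(30, 60, 24.5, 14.2 * 60)
total_s = sum(phases.values())

for name, seconds in phases.items():
    print(f"{name:>10}: {seconds:6.1f} s ({seconds / total_s:.0%} of recovery)")
print(f"{'total':>10}: {total_s / 60:.1f} min")
```

With these inputs replay accounts for roughly four fifths of the budget, which is why the checkpoint interval is the first lever to examine.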
Goodput vs. rawput
Not all time spent on a training cluster produces useful progress. Rawput is the total number of training steps executed (including steps that will be discarded after a failure). Goodput is the number of training steps that actually contribute to the final model:
\[ \text{Goodput Ratio} = \frac{\text{Useful Steps}}{\text{Total Executed Steps}} \approx \frac{T_{\text{useful}}}{T_{\text{wall}}} \]
The gap between rawput and goodput comes from three sources:
- Checkpoint overhead (\(\sim\) 2.9 percent): Training pauses during each checkpoint write.
- Recovery overhead (\(\sim\) 3.6 percent): Time lost to detection, rescheduling, reloading, and replay after each failure.
- Wasted work: Training steps computed between the last checkpoint and the failure, which must be discarded and recomputed.
At 10,000-GPU scale, published reports from Meta, Google, and others consistently show 10–25 percent total overhead from failures and checkpointing combined. A cluster nominally capable of completing a training run in 30 days therefore requires roughly 33–38 days of wall-clock time. The fleet orchestration strategies in Fleet Orchestration and ML Operations at Scale focus on narrowing this gap: every percentage point of overhead recovered translates directly to dollars saved and training time shortened.
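
The two percentages quoted in the list above can be reproduced with a first-order estimate: one checkpoint write per interval, and one recovery per MTBF on average. This sketch assumes the overheads are small enough to add linearly, and it covers only these two modeled sources, not everything behind the published 10–25 percent figures; the function name is illustrative.

```python
def overhead_fractions(t_write_s, interval_s, t_recovery_s, mtbf_s):
    """First-order overhead estimates as fractions of wall-clock time."""
    checkpoint = t_write_s / interval_s  # one write pause per checkpoint interval
    recovery = t_recovery_s / mtbf_s     # one recovery per MTBF on average
    return checkpoint, recovery


ckpt, rec = overhead_fractions(
    t_write_s=24.5,         # 175B checkpoint at 100 GB/s
    interval_s=14.2 * 60,   # Young-Daly interval from example 1.1
    t_recovery_s=540.5,     # sum of the table 5 phases
    mtbf_s=4.14 * 3600,     # 10,000-GPU cluster MTBF
)
goodput_ratio = 1.0 - ckpt - rec  # linear approximation (assumption)

print(f"checkpoint overhead: {ckpt:.1%}")
print(f"recovery overhead:   {rec:.1%}")
print(f"estimated goodput:   {goodput_ratio:.1%}")
```
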
Strategy Selection
Checkpoint/restart is not the only fault tolerance strategy. For serving workloads where downtime is measured in lost revenue, redundancy provides a fundamentally different trade-off. For elastic training, the system can shrink around failures rather than stopping. Choosing the right strategy depends on the workload’s tolerance for latency, cost, and complexity.
Checkpoint/restart vs. redundancy vs. elastic training
The three canonical strategies represent different points in the trade-off space between cost, complexity, and recovery speed.
Checkpoint/restart periodically saves full system state and rolls back to the last checkpoint after failure. It is the workhorse of large-scale training: conceptually simple, well-understood, and effective when MTBF is much larger than checkpoint cost. The weakness is that recovery requires stopping all workers and replaying lost computation.
Redundancy maintains duplicate copies of state or computation. If one replica fails, another immediately takes over. This is the dominant strategy for inference serving, where even seconds of downtime are unacceptable. The cost is 2–3\(\times\) the compute resources, which is prohibitive for training but justified for revenue-critical serving.
Elastic training allows the training job to continue with fewer workers when a failure occurs, rather than stopping entirely. Workers are added back when replacement nodes become available. This minimizes wall-clock interruption but requires frameworks that support dynamic world-size changes (for example, TorchElastic), and it introduces complexity in learning rate adjustment and gradient normalization. Table 6 summarizes the trade-offs.
| Criterion | Checkpoint/Restart | Redundancy | Elastic Training |
|---|---|---|---|
| Recovery latency | Minutes (replay) | Milliseconds (failover) | Seconds (reconfigure) |
| Resource overhead | ~3–13% (storage + IO) | 100–200% (replicas) | ~5–10% (spare capacity) |
| Workload fit | Training (batch) | Serving (online) | Training (long-running) |
| Implementation | Simple | Moderate | Complex |
| State management | Periodic snapshots | Continuous replication | Distributed with resharding |
| Failure mode | Job pauses, replays | Transparent to user | Throughput dip, continues |
The availability stacking formula
For serving workloads, availability is typically expressed as a percentage: 99 percent (“two nines”), 99.9 percent (“three nines”), and so on. Redundancy improves availability by running \(k\) independent replicas. The system is unavailable only when all replicas are simultaneously down, which equation 8 captures:
\[ A_{\text{system}} = 1 - (1 - A)^k \tag{8}\]
where \(A\) is the availability of a single replica and \(k\) is the number of replicas.
Table 7 quantifies the availability gain from adding independent serving replicas.
| Replicas \(k\) | System Availability | Nines | Downtime per Year |
|---|---|---|---|
| 1 | 99.00% | 2.0 | 87.6 hours |
| 2 | 99.9900% | 4.0 | 52.6 minutes |
| 3 | 99.9999% | 6.0 | 31.5 seconds |
As table 7 shows, the power of stacking is dramatic: two replicas of a 99 percent-available system yield 99.99 percent availability, reducing annual downtime from roughly 88 hours to under an hour. This is why inference serving systems almost universally deploy multiple replicas behind a load balancer: the cost of an extra replica is small compared to the business value of four-nines availability.
The independence assumption is critical, however. Correlated failures—power outages affecting an entire rack, software bugs triggered by a specific input, or network partitions isolating a failure domain—defeat availability stacking. This is why Fault Tolerance emphasizes failure domain isolation: replicas must be placed in different racks, different power zones, and ideally different data centers to ensure that their failure modes are truly independent.
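
Equation 8 and the downtime column of table 7 can be recomputed as follows. The sketch assumes fully independent replicas, which is exactly the condition the paragraph above warns may not hold; function names and the 8,760-hour year are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # leap years ignored (assumption)


def system_unavailability(a: float, k: int) -> float:
    """Probability that all k independent replicas are down at once."""
    return (1.0 - a) ** k


def system_availability(a: float, k: int) -> float:
    """Equation 8: availability of k independent replicas."""
    return 1.0 - system_unavailability(a, k)


for k in (1, 2, 3):
    unavail = system_unavailability(0.99, k)
    nines = -math.log10(unavail)
    downtime_min = unavail * HOURS_PER_YEAR * 60
    print(f"k={k}: {system_availability(0.99, k):.4%} "
          f"({nines:.1f} nines), {downtime_min:,.2f} min/year")
```

If replicas share a failure domain, the true unavailability is bounded below by the probability of the shared failure, no matter how large `k` becomes.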
Summary
Key Takeaways: Failure as a physical constraint
- Failure rate scales linearly with component count: A single GPU fails once per 5.7 years on average; a 10,000-GPU cluster experiences a failure every 4.14 hours on average. At fleet scale, failure is a continuous background condition, not an exceptional event.
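- The MTBF cascade compounds through system levels: Node failure rate is the sum of its component failure rates, so the most numerous and least reliable components (here, the GPUs) dominate node MTBF; cluster MTBF is node MTBF divided by node count. Table 2 provides the reference numbers for capacity planning.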
- Job failure probability approaches certainty quickly: For clusters above a few thousand GPUs running multi-day jobs, the probability of at least one failure exceeds 99 percent. Fault tolerance is not optional at this scale—it is a prerequisite for completing any training run.
- Young-Daly checkpoint interval: The formula \(\tau_{\text{opt}} = \sqrt{2 T_{\text{write}} \, \text{MTBF}_{\text{system}}}\) optimizes checkpoint frequency by balancing the cost of writing checkpoints against the cost of lost work. It requires only two measurable inputs: checkpoint write time and system MTBF.
- Recovery has four phases: Detection, rescheduling, reloading, and replay. Replay typically dominates and is controlled by the checkpoint interval. Each phase offers distinct optimization opportunities.
- Strategy selection depends on workload type: Checkpoint/restart suits batch training. Redundancy suits latency-sensitive serving. Elastic training bridges the two but adds complexity.
- Availability stacks exponentially with independent replicas: With \(k\) independent replicas, unavailability shrinks as \((1 - A)^k\), but correlated failures collapse that benefit. Failure domain isolation is the prerequisite that makes redundancy effective.