Reliability Foundations

Purpose

How does failure become a statistical certainty at fleet scale, and what does the math require for recovery?

Individual accelerators fail rarely, but fleets multiply component failure rates until failures become a continuous operating condition. At cluster scale, the expected time between failures drops from years to hours, and recovery becomes part of normal execution rather than an exceptional incident. This appendix collects the reference calculations for reasoning quantitatively about failure, recovery, and availability at scale, providing the mathematical tools behind fault tolerance strategies, fleet orchestration policies, and operational practice. In C³ terms, it develops the coordination axis, where checkpointing, recovery, and availability determine how much of the fleet’s compute and communication becomes useful work.

How to Use This Appendix

This appendix is a reference for moving from intuition (“failures happen more often at scale”) to quantitative engineering decisions (“how often should one checkpoint?” or “how many spare nodes are needed?”).

The worked examples in this appendix use a canonical 10,000-GPU training fleet unless noted otherwise.

Failure Probability at Scale

Consider a training run on a 10,000-GPU cluster running for three weeks. What is the probability that at least one GPU fails during that time? The answer, effectively 100 percent, determines whether fault tolerance is a core design requirement or merely a nice-to-have. The calculations in this section make that determination precise.

Individual hardware components are remarkably reliable. A data center-grade GPU operates for tens of thousands of hours before failing. The physics of large-scale systems, however, works against reliability: every additional component is another opportunity for failure, and the aggregate failure rate scales linearly with component count. This section develops the arithmetic that transforms component-level reliability into system-level failure predictions.

Component failure rates

Reliability engineers characterize components using two complementary metrics. The Failure in Time (FIT) rate counts failures per \(10^9\) device-hours of operation, a unit chosen because individual components fail so rarely that failures-per-hour would produce inconveniently small numbers. The reciprocal quantity, Mean Time To Failure (MTTF), gives the average lifetime in hours, as equation 1 shows:

\[ \text{MTTF} = \frac{10^9}{\text{FIT}} \tag{1}\]

Table 1 lists reference FIT rates and MTTF values for components found in a typical GPU training node, grounded in large-fleet and warehouse-scale experience (Kokolis et al. 2025; Zu et al. 2024; Barroso et al. 2019). These values assume the steady-state “useful life” phase of the bathtub curve, where the failure rate is approximately constant—neither dominated by infant mortality (early life) nor wear-out (end of life) (Klutke et al. 2003).

Table 1: Component Failure Rates: Order-of-magnitude reference FIT and MTTF values in the steady-state useful-life phase, informed by large-GPU research-cluster analysis (Kokolis et al. 2025), TPUv4 supercomputer resiliency and operations (Zu et al. 2024), and warehouse-scale machine design (Barroso et al. 2019).
Component      FIT Rate   MTTF (hours)  MTTF (years)  Typical Failure Mode
GPU            20,000     50,000        5.7           Die defect, thermal fatigue
HBM            5,000      200,000       22.8          Bit-flip accumulation, TSV
NIC            6,666.7    150,000       17.1          Transceiver degradation
PSU            10,000     100,000       11.4          Capacitor aging
PCIe Switch    5,000      200,000       22.8          Solder joint, ESD damage
Optical Cable  20,000     50,000        5.7           Fiber bend, connector wear
ToR Switch     3,333.3    300,000       34.2          ASIC failure, fan bearing

Barroso, Luiz André, Urs Hölzle, and Parthasarathy Ranganathan. 2019. The Datacenter as a Computer: Designing Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Springer International Publishing. https://doi.org/10.1007/978-3-031-01761-2.
Klutke, G., P. C. Kiessler, and M. A. Wortman. 2003. “A Critical Look at the Bathtub Curve.” IEEE Transactions on Reliability 52 (1): 125–29. https://doi.org/10.1109/tr.2002.804492.
Kokolis, Apostolos, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zach DeVito, Shubho Sengupta, Kalyan Saladi, and Carole-Jean Wu. 2025. “Revisiting Reliability in Large-Scale Machine Learning Research Clusters.” 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 1259–74. https://doi.org/10.1109/hpca61900.2025.00096.
Zu, Y., A. Ghaffarkhah, H.-V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, et al. 2024. “Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer.” 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 761–74.

Each component in isolation appears highly reliable—a GPU lasts 5.7 years on average. The trouble begins when we ask how a node behaves with many such components operating simultaneously.
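Equation 1 and the year conversion in table 1 can be checked with a few lines of Python. This is a minimal sketch: the function names and the 8,760 hours-per-year constant (365 days) are assumptions introduced here for illustration, and the MTTF inputs are copied from table 1.

```python
HOURS_PER_YEAR = 8_760  # assumption: 365 days x 24 hours


def fit_to_mttf_hours(fit: float) -> float:
    """Equation 1: MTTF in hours from a FIT rate (failures per 1e9 device-hours)."""
    return 1e9 / fit


def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Inverse of equation 1: FIT rate from an MTTF given in hours."""
    return 1e9 / mttf_hours


# MTTF values (hours) copied from table 1.
TABLE_1_MTTF_HOURS = {
    "GPU": 50_000,
    "HBM": 200_000,
    "NIC": 150_000,
    "PSU": 100_000,
    "PCIe Switch": 200_000,
    "Optical Cable": 50_000,
    "ToR Switch": 300_000,
}

for name, mttf in TABLE_1_MTTF_HOURS.items():
    fit = mttf_hours_to_fit(mttf)
    years = mttf / HOURS_PER_YEAR
    print(f"{name:14s} FIT={fit:>9,.1f}  MTTF={mttf:>8,} h  ({years:.1f} years)")
```

Working from MTTF rather than FIT avoids the small rounding error that the table's one-decimal FIT values (for example 6,666.7) would otherwise introduce.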

The MTBF cascade

A compute node is a series system: if any component fails, the node fails. For independent components with constant failure rates, equation 2 sums the individual rates into the node-level failure rate:

\[ \frac{1}{\text{MTBF}_{\text{node}}} = \frac{n_{\text{gpu}}}{\text{MTTF}_{\text{gpu}}} + \frac{n_{\text{nic}}}{\text{MTTF}_{\text{nic}}} + \frac{n_{\text{psu}}}{\text{MTTF}_{\text{psu}}} + \cdots \tag{2}\]

Think of each component as a ticking clock counting down to failure. A node with 8 GPUs, 2 NICs, and 2 PSUs has 12 independent clocks—the node fails when the first clock reaches zero. More clocks mean a shorter expected wait.

For a cluster of \(N_{\text{nodes}}\) identical nodes, the same logic applies one level up, as equation 3 shows:

\[ \text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{node}}}{N_{\text{nodes}}} \tag{3}\]

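This is the MTBF cascade: reliability degrades linearly with component count at each level, and the levels compound. For the reference node (8 GPUs, 2 NICs, 2 PSUs) with the table 1 MTTF values, equation 2 gives \(8/50{,}000 + 2/150{,}000 + 2/100{,}000 \approx 1.93 \times 10^{-4}\) failures per hour, or a node MTBF of 5,172.4 hours. That sounds reliable, but a cluster of 1,250 such nodes has an MTBF of just 4.14 hours: a failure every few hours is the expected steady state.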
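The cascade in equations 2 and 3 can be reproduced numerically. The sketch below assumes the reference node configuration used in this appendix (8 GPUs, 2 NICs, 2 PSUs) and the MTTF values from table 1; the function and variable names are illustrative, not part of any library.

```python
def series_mtbf_hours(parts: dict) -> float:
    """Equation 2: sum count/MTTF over component types, then invert.

    `parts` maps a component name to (count, MTTF in hours).
    Assumes independent components with constant failure rates.
    """
    failure_rate_per_hour = sum(count / mttf for count, mttf in parts.values())
    return 1.0 / failure_rate_per_hour


def cluster_mtbf_hours(node_mtbf: float, n_nodes: int) -> float:
    """Equation 3: cluster MTBF for n_nodes identical, independent nodes."""
    return node_mtbf / n_nodes


# Reference node: counts from the text, MTTF values from table 1.
NODE_PARTS = {
    "gpu": (8, 50_000.0),
    "nic": (2, 150_000.0),
    "psu": (2, 100_000.0),
}

node_mtbf = series_mtbf_hours(NODE_PARTS)
print(f"Node MTBF: {node_mtbf:,.1f} hours")

for n_nodes in (32, 128, 256, 1_024, 1_250, 12_500):
    mtbf = cluster_mtbf_hours(node_mtbf, n_nodes)
    print(f"{n_nodes * 8:>7,} GPUs  MTBF={mtbf:7.2f} h  failures/day={24 / mtbf:5.1f}")
```

Extending `NODE_PARTS` with further rows from table 1 (HBM stacks, PCIe switches, cables) would shorten the node MTBF further; the three-component node is kept here only because it matches the configuration stated in the text.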

Table 2 shows how cluster MTBF shrinks as fleet size grows.

Table 2: Cluster MTBF by Scale: As cluster size grows, the aggregate MTBF shrinks proportionally. At 10,000 GPUs, failures occur every few hours; at 100,000 GPUs, they occur continuously. Node configuration: 8 GPUs, 2 NICs, 2 PSUs per node.
Cluster GPUs  Nodes   Cluster MTBF  Expected Failures/Day
256           32      161.6 hours   0.1
1,024         128     40.4 hours    0.6
2,048         256     20.2 hours    1.2
8,192         1,024   5.1 hours     4.8
10,000        1,250   4.1 hours     5.8
100,000       12,500  24.8 minutes  58

The table makes a visceral point: the transition from “hundreds of GPUs” to “tens of thousands” is not merely a quantitative change but a qualitative one. At 256 GPUs, nearly a week passes between failures on average. At 10,000 GPUs, the cluster expects multiple failures per shift. At 100,000 GPUs, failures are a continuous background condition: the system is never fully healthy.

Probability of failure during a job

Knowing the MTBF tells us the average time between failures, but training jobs have fixed durations. The question practitioners ask is: what is the probability that my job will be interrupted at least once?

Under the exponential failure model (constant failure rate), equation 4 gives the probability of at least one failure during a job of duration \(T_{\text{job}}\):

\[ \Pr(\geq 1\ \text{failure}) = 1 - e^{-T_{\text{job}} / \text{MTBF}} \tag{4}\]

When \(T_{\text{job}} \gg \text{MTBF}\), this probability approaches 1 rapidly. Table 3 shows the concrete numbers for various cluster sizes and job durations.

Table 3: Probability of At Least One Failure: For large clusters and multi-day jobs, failure is a near-certainty. Any system operating in the bottom-right region of this table must treat fault tolerance as a core design requirement, not an optimization.
Cluster GPUs  1 Day (24 h)  1 Week (168 h)  30 Days (720 h)
256           13.8%         64.6%           98.8%
1,024         44.8%         98.4%           > 99.9%
2,048         69.5%         > 99.9%         > 99.9%
8,192         99.1%         > 99.9%         > 99.9%
10,000        99.7%         > 99.9%         > 99.9%
100,000       > 99.9%       > 99.9%         > 99.9%

The message is stark: for any cluster above a few thousand GPUs running jobs longer than a day, the probability of experiencing at least one failure is effectively 100 percent. This is why Fault Tolerance treats fault tolerance not as a defensive measure but as a fundamental architectural requirement.
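Equation 4 is simple enough to evaluate directly. The following sketch assumes the exponential model and the 5,172.4-hour node MTBF quoted earlier; the helper name is hypothetical. It uses `math.expm1` because \(1 - e^{-x}\) loses precision when \(x\) is very small.

```python
import math

NODE_MTBF_HOURS = 5_172.4  # reference node MTBF quoted in the text
GPUS_PER_NODE = 8


def p_at_least_one_failure(job_hours: float, mtbf_hours: float) -> float:
    """Equation 4: 1 - exp(-T_job / MTBF), under a constant failure rate."""
    return -math.expm1(-job_hours / mtbf_hours)


durations = {"1 day": 24, "1 week": 168, "30 days": 720}

for gpus in (256, 1_024, 2_048, 8_192, 10_000, 100_000):
    cluster_mtbf = NODE_MTBF_HOURS / (gpus / GPUS_PER_NODE)
    cells = ", ".join(
        f"{label}: {100 * p_at_least_one_failure(hours, cluster_mtbf):.1f}%"
        for label, hours in durations.items()
    )
    print(f"{gpus:>7,} GPUs  {cells}")
```

Values that print as 100.0% correspond to the "> 99.9%" cells in table 3; the probability never reaches exactly 1 under this model.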

The exponential failure model assumes a constant failure rate, which holds during the steady-state useful-life phase. During burn-in (first few hundred hours) and wear-out (approaching end-of-life), failure rates are higher. In practice, fleet operators observe that newly deployed nodes exhibit 2–3\(\times\) higher failure rates in their first week, making burn-in testing essential before admitting nodes to production clusters.

The inevitability of failure during long training jobs leads directly to the next question: if we will lose progress, how do we minimize how much?


Checkpoint Optimization

Every checkpoint saves progress but costs time. Checkpoint too rarely and a failure destroys hours of training. Checkpoint too frequently and the overhead of writing checkpoints itself becomes the bottleneck. The Young-Daly formula gives the mathematically optimal balance point, and it depends on just two measurable quantities: how long a checkpoint takes to write and how often failures occur (Young 1974; Daly 2006).

Young, John W. 1974. “A First Order Approximation to the Optimum Checkpoint Interval.” Communications of the ACM 17 (9): 530–31. https://doi.org/10.1145/361147.361115.
Daly, J. T. 2006. “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps.” Future Generation Computer Systems 22 (3): 303–12. https://doi.org/10.1016/j.future.2004.11.016.

The Young-Daly model

The optimal checkpoint interval balances two competing costs. Writing a checkpoint takes time \(T_{\text{write}}\) (the checkpoint cost), during which no useful training occurs. The longer the interval between checkpoints, however, the more work is lost when a failure strikes—on average, half the interval. Equation 5 states the Young-Daly formula that minimizes the expected total overhead:

\[ \tau_{\text{opt}} = \sqrt{2 \times T_{\text{write}} \times \text{MTBF}_{\text{system}}} \tag{5}\]

where \(T_{\text{write}}\) is the checkpoint write time in seconds and \(\text{MTBF}_{\text{system}}\) is the cluster or job-level system MTBF in seconds.

The intuition behind equation 5 is geometric-mean-like: when checkpoints are cheap relative to the MTBF (\(T_{\text{write}} \ll \text{MTBF}_{\text{system}}\), the common case), the optimal interval sits between the two time scales. If checkpoints took zero time, the optimal cadence would be every step. If the system never failed, no checkpoints would be needed. The square root interpolates between these extremes.

The formula assumes that failures follow an exponential distribution (memoryless property) and that checkpoint cost \(T_{\text{write}}\) is small compared to \(\text{MTBF}_{\text{system}}\). Both assumptions hold well for production training clusters: the exponential model fits observed failure data, and modern checkpointing systems write to fast parallel storage in tens of seconds, while MTBF is measured in hours.
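Equation 5 follows from a short first-order argument, sketched here under the assumptions just stated. The symbol \(W(\tau)\), the approximate fraction of time lost to checkpointing and rework, is introduced only for this sketch. Each interval of length \(\tau\) pays for one checkpoint write, and failures arriving at rate \(1/\text{MTBF}_{\text{system}}\) each destroy about half an interval of work:

\[ W(\tau) \approx \frac{T_{\text{write}}}{\tau} + \frac{\tau}{2\,\text{MTBF}_{\text{system}}} \]

Setting the derivative to zero recovers the optimal interval:

\[ \frac{dW}{d\tau} = -\frac{T_{\text{write}}}{\tau^{2}} + \frac{1}{2\,\text{MTBF}_{\text{system}}} = 0 \quad\Longrightarrow\quad \tau_{\text{opt}} = \sqrt{2 \times T_{\text{write}} \times \text{MTBF}_{\text{system}}} \]

At the optimum the two terms are equal, so in this approximation the checkpoint-writing overhead matches the expected rework, and the combined loss is \(W(\tau_{\text{opt}}) \approx \sqrt{2\,T_{\text{write}} / \text{MTBF}_{\text{system}}}\).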

Checkpoint sizing

Checkpoint size determines the write time \(T_{\text{write}}\) that feeds into the Young-Daly formula. For common mixed-precision training checkpoints with the Adaptive Moment Estimation (Adam) optimizer, each parameter requires approximately 14 bytes of persistent state:

  • 2 bytes for BF16 model weights
  • 4 bytes for FP32 master weights
  • 4 bytes for FP32 first moment (Adam \(m\))
  • 4 bytes for FP32 second moment (Adam \(v\))

Equation 6 aggregates these per-parameter costs into the total checkpoint footprint:

\[ \text{Checkpoint Size} = P \times 14 \text{ bytes/parameter} \tag{6}\]

Here, \(P\) is the number of trainable model parameters whose persistent optimizer and weight state must be serialized.

Table 4 shows how checkpoint state moves from manageable gigabytes to frontier-scale terabytes as model size grows.

Table 4: Checkpoint Sizes for Mixed-Precision Adam Training: Each parameter requires about 14 bytes of persistent checkpoint state (model weights + FP32 master weights + optimizer moments); gradients are normally transient and recomputed after restore rather than serialized as durable checkpoint state. Write times assume 100 GB/s aggregate storage bandwidth.
Model Size  Checkpoint Size  Write Time at 100 GB/s
7B          98 GB            0.98 s
13B         182 GB           1.82 s
70B         980 GB           9.8 s
175B        2,450 GB         24.5 s
1T          14,000 GB        140 s

As table 4 shows, at frontier scale (175B+ parameters), checkpoint sizes reach the terabyte range. This makes checkpoint write time a significant cost that directly affects the Young-Daly optimal interval. The checkpoint strategies in Fault Tolerance discuss techniques for reducing \(T_{\text{write}}\) (asynchronous checkpointing, incremental deltas, and distributed storage), all of which lower the overhead at the Young-Daly optimum by shrinking the \(T_{\text{write}}\) factor under the square root.
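The sizing rule in equation 6 translates directly into code. This sketch assumes decimal gigabytes (1 GB = \(10^9\) bytes) and the 100 GB/s aggregate bandwidth stated in the table 4 caption; the function names are illustrative only.

```python
# Per-parameter persistent state for mixed-precision Adam (from the list above).
BYTES_PER_PARAM = (
    2    # BF16 model weights
    + 4  # FP32 master weights
    + 4  # FP32 first moment (Adam m)
    + 4  # FP32 second moment (Adam v)
)

GB = 1e9  # assumption: decimal gigabytes


def checkpoint_bytes(num_params: float) -> float:
    """Equation 6: total checkpoint size in bytes."""
    return num_params * BYTES_PER_PARAM


def write_time_seconds(num_params: float, bandwidth_gb_per_s: float = 100.0) -> float:
    """Time to write one full checkpoint at the given aggregate bandwidth."""
    return checkpoint_bytes(num_params) / (bandwidth_gb_per_s * GB)


for label, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9), ("175B", 175e9), ("1T", 1e12)]:
    size_gb = checkpoint_bytes(params) / GB
    print(f"{label:>5s}  {size_gb:>9,.0f} GB  {write_time_seconds(params):7.2f} s")
```

Changing `BYTES_PER_PARAM` is the natural way to explore other optimizer or precision choices, though any such variant is outside what table 4 reports.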

Worked example: Optimal checkpoint interval

Combining the checkpoint write time with the 10,000-GPU MTBF gives a concrete checkpoint cadence.

Example 1.1: Young-Daly: 175B model on a 10,000-GPU cluster
Setup: Consider training a 175B-parameter model on a 10,000-GPU cluster. The cluster MTBF is 4.14 hours (table 2). The checkpoint size is 2,450 GB, and the parallel storage system writes at 100 GB/s.

Step 1: Checkpoint write time (\(T_{\text{write}}\)). \[T_{\text{write}} = \frac{\text{Checkpoint Size}}{\text{Write Bandwidth}} = \frac{2{,}450 \text{ GB}}{100 \text{ GB/s}} = 24.5 \text{ s}\]

Step 2: Apply the Young-Daly formula (equation 5). \[\tau_{\text{opt}} = \sqrt{2 \times T_{\text{write}} \times \text{MTBF}_{\text{system}}} = \sqrt{2 \times 24.5 \text{ s} \times 4.14 \text{ h} \times 3{,}600 \text{ s/h}} = 14.2 \text{ min}\]

Interpretation. The optimal checkpoint interval is approximately 14.2 minutes. The overhead from checkpointing alone is \(T_{\text{write}} / \tau_{\text{opt}} \approx\) 2.9 percent of training time.

Systems insight: If the cluster were doubled to 20,000 GPUs, the MTBF would halve, and the optimal interval would shrink to 10.1 minutes—checkpointing more frequently because failures happen more often. This illustrates the fundamental tension at scale: larger clusters are faster but demand more frequent interruption to protect progress.
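The arithmetic in example 1.1 can be reproduced as follows. This is a minimal sketch using the example's inputs (24.5 s write time, 4.14 h MTBF); the function name is an assumption made for illustration.

```python
import math


def young_daly_interval_s(t_write_s: float, mtbf_s: float) -> float:
    """Equation 5: first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * t_write_s * mtbf_s)


t_write_s = 24.5          # 2,450 GB at 100 GB/s
mtbf_s = 4.14 * 3_600     # 10,000-GPU cluster MTBF from table 2

tau_s = young_daly_interval_s(t_write_s, mtbf_s)
print(f"Optimal interval: {tau_s / 60:.1f} min")
print(f"Checkpoint overhead: {100 * t_write_s / tau_s:.1f}%")

# Doubling the cluster halves the MTBF, so the interval shrinks by sqrt(2).
tau_doubled_s = young_daly_interval_s(t_write_s, mtbf_s / 2)
print(f"Interval at 20,000 GPUs: {tau_doubled_s / 60:.1f} min")
```

Because the interval depends on the square root of the MTBF, halving the MTBF shortens the interval by a factor of about 1.41 rather than 2.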

The boundary conditions of the Young-Daly formula merit attention. When \(T_{\text{write}}\) approaches \(\text{MTBF}_{\text{system}}\), checkpoint and rework overheads become so large that checkpoint/restart alone may fail to maintain useful forward progress. In such cases, redundancy or elastic training becomes necessary, as discussed in section 1.4.1.


Recovery Budgets

When a failure occurs, the system does not instantly resume training. Detection, rescheduling, reloading state, and replaying lost work each consume time. Understanding this recovery anatomy reveals which phase dominates and where to invest engineering effort.

The anatomy of recovery time

Recovery is not a single event but a pipeline of phases, each with its own time budget. Equation 7 sums them into total recovery time:

\[ T_{\text{recovery}} = T_{\text{detect}} + T_{\text{reschedule}} + T_{\text{reload}} + T_{\text{replay}} \tag{7}\]

The terms are all durations: failure detection, replacement scheduling, checkpoint reload, and replay of lost training work since the last checkpoint. Their sum is the wall-clock time before the job returns to productive training.

Table 5 breaks recovery into the phases that determine how long the cluster remains below full productivity after a failure.

Table 5: Recovery Time Breakdown: Each phase contributes to the total time between failure and full-speed resumption. For the 10K-GPU, 175B-model scenario, replay dominates because it recomputes work lost since the last checkpoint.
Phase                          Typical Duration  What Happens
\(T_{\text{detect}}\)          30 s              Heartbeat timeout expires; worker declared failed (confirmation adds to the budget in Fault Tolerance)
\(T_{\text{reschedule}}\)      60 s              Replacement node allocated from spare pool
\(T_{\text{reload}}\)          24.5 s            Checkpoint read from storage into GPU memory
\(T_{\text{replay}}\)          ~7.1 min          Recompute training steps since last checkpoint
Total \(T_{\text{recovery}}\)  ~9 min            System fully productive again

As table 5 illustrates, \(T_{\text{replay}}\) typically dominates, and it is directly controlled by the checkpoint interval: on average, half the interval must be replayed. This ties recovery time back to the Young-Daly formula: shorter intervals mean less replay but more checkpoint overhead, and the formula finds the minimum of their sum.

The other phases offer engineering optimization targets. \(T_{\text{detect}}\) can be reduced with more aggressive heartbeat intervals (at the cost of false positives). \(T_{\text{reschedule}}\) depends on having hot spare nodes preallocated, a fleet orchestration decision covered in Fleet Orchestration. \(T_{\text{reload}}\) scales with checkpoint size and storage bandwidth, motivating the checkpoint compression and sharding techniques discussed in Fault Tolerance.
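A recovery budget of this kind is easy to keep as a small data structure so that the dominant phase is always visible. The sketch below uses the table 5 durations and takes replay as half of the 14.2-minute checkpoint interval; the `RecoveryBudget` class is hypothetical and exists only for this illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecoveryBudget:
    """Durations in seconds for the four phases of equation 7."""

    detect: float
    reschedule: float
    reload: float
    replay: float

    def total(self) -> float:
        """Equation 7: wall-clock time until the job is productive again."""
        return self.detect + self.reschedule + self.reload + self.replay

    def dominant_phase(self) -> str:
        """Name of the phase that consumes the most time."""
        phases = asdict(self)
        return max(phases, key=phases.get)


checkpoint_interval_s = 14.2 * 60  # Young-Daly interval from example 1.1

budget = RecoveryBudget(
    detect=30.0,
    reschedule=60.0,
    reload=24.5,
    replay=checkpoint_interval_s / 2,  # on average, half an interval is lost
)

print(f"Total recovery: {budget.total() / 60:.1f} min")
print(f"Dominant phase: {budget.dominant_phase()}")
```

Substituting measured values for each field shows quickly whether an investment in faster detection or rescheduling would matter, or whether replay still dwarfs the other phases.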

Goodput vs. rawput

Not all time spent on a training cluster produces useful progress. Rawput is the total number of training steps executed (including steps that will be discarded after a failure). Goodput is the number of training steps that actually contribute to the final model:

\[ \text{Goodput Ratio} = \frac{\text{Useful Steps}}{\text{Total Executed Steps}} \approx \frac{T_{\text{useful}}}{T_{\text{wall}}} \]

The gap between rawput and goodput comes from three sources:

  1. Checkpoint overhead (\(\sim\) 2.9 percent): Training pauses during each checkpoint write.
  2. Recovery overhead (\(\sim\) 3.6 percent): Time lost to detection, rescheduling, reloading, and replay after each failure.
  3. Wasted work: Training steps computed between the last checkpoint and the failure, which must be discarded and recomputed.

At a 10,000-GPU scale, published reports from Meta, Google, and others consistently show 10–25 percent total overhead from failures and checkpointing combined. A cluster nominally capable of completing a training run in 30 days therefore requires 33–38 days of wall-clock time. The fleet orchestration strategies in Fleet Orchestration and ML Operations at Scale focus on narrowing this gap—every percentage point of overhead recovered translates directly to dollars saved and training time shortened.
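The first two overhead figures in the list above can be estimated from quantities already computed in this appendix. This is a rough first-order sketch, not a validated goodput model: it assumes one expected failure per MTBF, folds discarded work into the replay term rather than modeling it separately, and follows the text's arithmetic in treating total overhead as a multiplier on ideal run time. All names are illustrative.

```python
import math

T_WRITE_S = 24.5                    # checkpoint write time (example 1.1)
MTBF_S = 4.14 * 3_600               # 10,000-GPU cluster MTBF
FIXED_RECOVERY_S = 30 + 60 + 24.5   # detect + reschedule + reload (table 5)


def overhead_fractions(t_write_s: float, mtbf_s: float, fixed_recovery_s: float):
    """Return (checkpoint overhead, recovery overhead) as fractions of time."""
    tau_s = math.sqrt(2.0 * t_write_s * mtbf_s)  # Young-Daly interval
    checkpoint = t_write_s / tau_s               # one write per interval
    recovery = (fixed_recovery_s + tau_s / 2) / mtbf_s  # one failure per MTBF
    return checkpoint, recovery


def wall_clock_days(ideal_days: float, total_overhead: float) -> float:
    """Wall-clock duration when overhead is a fraction of ideal run time."""
    return ideal_days * (1.0 + total_overhead)


ckpt, rec = overhead_fractions(T_WRITE_S, MTBF_S, FIXED_RECOVERY_S)
print(f"Checkpoint overhead: {100 * ckpt:.1f}%")
print(f"Recovery overhead:   {100 * rec:.1f}%")

for overhead in (0.10, 0.25):
    print(f"{overhead:.0%} overhead: 30-day run takes {wall_clock_days(30, overhead):.1f} days")
```

The modeled sum of roughly 6.5 percent sits below the 10–25 percent range quoted from published reports, which is unsurprising given how much this simple model leaves out.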

Systems Perspective 1.1: The hidden cost of scale
A common misconception is that doubling cluster size halves training time. In practice, doubling from 5,000 to 10,000 GPUs halves the MTBF, roughly doubling the failure-related overhead. The effective speedup is less than 2\(\times\), and at extreme scale, adding more GPUs can actually increase wall-clock time if the fault tolerance mechanisms cannot keep pace. This is the reliability analogue of Amdahl’s Law: the serial overhead of recovery bounds the benefit of parallelism.


Strategy Selection

Checkpoint/restart is not the only fault tolerance strategy. For serving workloads where downtime is measured in lost revenue, redundancy provides a fundamentally different trade-off. For elastic training, the system can shrink around failures rather than stopping. Choosing the right strategy depends on the workload’s tolerance for latency, cost, and complexity.

Checkpoint/restart vs. redundancy vs. elastic training

The three canonical strategies represent different points in the trade-off space between cost, complexity, and recovery speed.

Checkpoint/restart periodically saves full system state and rolls back to the last checkpoint after failure. It is the workhorse of large-scale training: conceptually simple, well-understood, and effective when MTBF is much larger than checkpoint cost. The weakness is that recovery requires stopping all workers and replaying lost computation.

Redundancy maintains duplicate copies of state or computation. If one replica fails, another immediately takes over. This is the dominant strategy for inference serving, where even seconds of downtime are unacceptable. The cost is 2–3\(\times\) the compute resources, which is prohibitive for training but justified for revenue-critical serving.

Elastic training allows the training job to continue with fewer workers when a failure occurs, rather than stopping entirely. Workers are added back when replacement nodes become available. This minimizes wall-clock interruption but requires frameworks that support dynamic world-size changes (for example, TorchElastic), and it introduces complexity in learning rate adjustment and gradient normalization. Table 6 summarizes the trade-offs.

Table 6: Fault Tolerance Strategy Comparison: Each strategy excels in a different regime. Real-world systems often combine strategies: checkpoint/restart for training with redundancy for the metadata service and checkpoint storage layer.
Criterion          Checkpoint/Restart     Redundancy               Elastic Training
Recovery latency   Minutes (replay)       Milliseconds (failover)  Seconds (reconfigure)
Resource overhead  ~3–13% (storage + IO)  100–200% (replicas)      ~5–10% (spare capacity)
Workload fit       Training (batch)       Serving (online)         Training (long-running)
Implementation     Simple                 Moderate                 Complex
State management   Periodic snapshots     Continuous replication   Distributed with resharding
Failure mode       Job pauses, replays    Transparent to user      Throughput dip, continues

The availability stacking formula

For serving workloads, availability is typically expressed as a percentage: 99 percent (“two nines”), 99.9 percent (“three nines”), and so on. Redundancy improves availability by running \(k\) independent replicas. The system is unavailable only when all replicas are simultaneously down, which equation 8 captures:

\[ A_{\text{system}} = 1 - (1 - A)^k \tag{8}\]

where \(A\) is the availability of a single replica and \(k\) is the number of replicas.

Table 7 quantifies the availability gain from adding independent serving replicas.

Table 7: Availability Stacking with Independent Replicas: Starting from a single-replica availability of 99 percent, each additional replica dramatically reduces expected downtime. Assumes replica failures are independent.
Replicas \(k\)  System Availability  Nines  Downtime per Year
1               99.00%               2.0    87.6 hours
2               99.9900%             4.0    52.6 minutes
3               99.9999%             6.0    31.5 seconds

As table 7 shows, the power of stacking is dramatic: two replicas of a 99 percent-available system yield 99.99 percent availability, reducing annual downtime from roughly 88 hours to under an hour. This is why inference serving systems almost universally deploy multiple replicas behind a load balancer: the cost of an extra replica is small compared to the business value of four-nines availability.
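Equation 8 and the downtime column of table 7 can be reproduced with the sketch below. It assumes fully independent replicas and an 8,760-hour year; the helper names are illustrative. Unavailability is computed directly as \((1 - A)^k\) to avoid the precision loss of subtracting a number very close to 1.

```python
import math

HOURS_PER_YEAR = 8_760  # assumption: 365-day year


def unavailability(single_replica_availability: float, k: int) -> float:
    """Probability that all k independent replicas are down at once."""
    return (1.0 - single_replica_availability) ** k


def system_availability(single_replica_availability: float, k: int) -> float:
    """Equation 8: A_system = 1 - (1 - A)^k."""
    return 1.0 - unavailability(single_replica_availability, k)


def nines(unavail: float) -> float:
    """Number of nines, e.g. 0.0001 unavailability corresponds to 4.0."""
    return -math.log10(unavail)


for k in (1, 2, 3):
    u = unavailability(0.99, k)
    downtime_h = u * HOURS_PER_YEAR
    print(
        f"k={k}  availability={100 * system_availability(0.99, k):.4f}%  "
        f"nines={nines(u):.1f}  downtime={downtime_h * 3_600:,.1f} s/year"
    )
```

These figures are an upper bound on the benefit: the calculation has no term for correlated failures, so real deployments will see less improvement unless replicas are placed in separate failure domains.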

The independence assumption is critical, however. Correlated failures—power outages affecting an entire rack, software bugs triggered by a specific input, or network partitions isolating a failure domain—defeat availability stacking. This is why Fault Tolerance emphasizes failure domain isolation: replicas must be placed in different racks, different power zones, and ideally different data centers to ensure that their failure modes are truly independent.


Summary

Key Takeaways: Failure as a physical constraint
  • Failure rate scales linearly with component count: A single GPU fails once per 5.7 years; a 10,000-GPU cluster experiences a failure every 4.14 hours. At fleet scale, failure is a continuous background condition, not an exceptional event.
  • The MTBF cascade compounds through system levels: Node failure rate is the sum of its component failure rates, so the most numerous and least reliable components (here, the GPUs) dominate; cluster MTBF divides by node count. Table 2 provides the reference numbers for capacity planning.
  • Job failure probability approaches certainty quickly: For clusters above a few thousand GPUs running multi-day jobs, the probability of at least one failure exceeds 99 percent. Fault tolerance is not optional at this scale—it is a prerequisite for completing any training run.
  • Young-Daly checkpoint interval: The formula \(\tau_{\text{opt}} = \sqrt{2 T_{\text{write}} \, \text{MTBF}_{\text{system}}}\) optimizes checkpoint frequency by balancing the cost of writing checkpoints against the cost of lost work. It requires only two measurable inputs: checkpoint write time and system MTBF.
  • Recovery has four phases: Detection, rescheduling, reloading, and replay. Replay typically dominates and is controlled by the checkpoint interval. Each phase offers distinct optimization opportunities.
  • Strategy selection depends on workload type: Checkpoint/restart suits batch training. Redundancy suits latency-sensitive serving. Elastic training bridges the two but adds complexity.
  • Availability stacks exponentially with independent replicas: Each added replica multiplies unavailability by \(1 - A\), but correlated failures collapse that benefit. Failure domain isolation is the prerequisite that makes redundancy effective.