solvers.ReliabilityModel
solvers.ReliabilityModel()
Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.
This model estimates the reliability of large clusters to help determine the ‘Goodput’ of long-running training jobs. It computes the probability that a job fails before completion and the Young-Daly optimal checkpoint interval, which minimizes wasted compute time.
Literature Sources: 1. Young (1974), “A First Order Approximation to the Optimum Checkpoint Interval.” 2. Daly (2006), “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps.”
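The quantities named above can be illustrated with a minimal sketch. It assumes independent, exponentially distributed node failures and uses Young's first-order formula, interval = sqrt(2 × checkpoint time × MTBF). The function names and the per-node MTBF input are hypothetical and are not part of `solvers.ReliabilityModel`; the library's actual computation may differ (for example, by using Daly's higher-order estimate).

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Cluster MTBF, assuming independent exponential node failures.

    Under that assumption, failure rates add, so the cluster MTBF is the
    per-node MTBF divided by the node count.
    """
    return node_mtbf_hours / num_nodes


def failure_probability(job_duration_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure before the job completes."""
    return 1.0 - math.exp(-job_duration_hours / mtbf_hours)


def young_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


# Illustrative numbers only: 1,000 nodes, each with a 50,000-hour MTBF.
mtbf_h = cluster_mtbf_hours(50_000.0, 1_000)          # 50 hours
p_fail = failure_probability(100.0, mtbf_h)           # 100-hour job
interval = young_interval_s(60.0, mtbf_h * 3600.0)    # 60 s checkpoint write

print(f"cluster MTBF: {mtbf_h:.1f} h")
print(f"P(failure during job): {p_fail:.3f}")
print(f"checkpoint interval: {interval / 60.0:.1f} min")
```

With these inputs the job is more likely than not to see at least one failure, which is why the checkpoint interval matters for long runs.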
Methods
| Name | Description |
|---|---|
| solve | Calculates reliability and checkpointing metrics for a fleet. |
solve
solvers.ReliabilityModel.solve(
fleet,
job_duration_hours,
checkpoint_time_s=60.0,
avg_recovery_time_s=300.0,
)
Calculates reliability and checkpointing metrics for a fleet.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fleet | Fleet | The hardware cluster configuration. | required |
| job_duration_hours | float | Total job duration in hours. | required |
| checkpoint_time_s | float | Time to write one checkpoint, in seconds. | 60.0 |
| avg_recovery_time_s | float | Average time to recover from a failure, in seconds. Includes checkpoint reload, process restart, and re-warmup. | 300.0 |
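To show how the two timing parameters interact, here is a rough first-order waste model, offered as a sketch and not as the implementation of `solve`. It assumes that each failure costs the recovery time plus, on average, half a checkpoint interval of lost work. `wasted_fraction` and its `interval_s` and `mtbf_s` arguments are hypothetical names introduced for illustration.

```python
import math


def wasted_fraction(
    interval_s: float,
    mtbf_s: float,
    checkpoint_time_s: float = 60.0,
    avg_recovery_time_s: float = 300.0,
) -> float:
    """First-order fraction of wall-clock time not spent on useful work.

    Two terms:
      * checkpoint overhead: one checkpoint write per interval
      * failure cost: per failure, the recovery time plus (on average)
        half an interval of recomputed work, amortized over the MTBF
    """
    checkpoint_overhead = checkpoint_time_s / interval_s
    failure_cost = (interval_s / 2.0 + avg_recovery_time_s) / mtbf_s
    return checkpoint_overhead + failure_cost


# Illustrative cluster MTBF of 50 hours, expressed in seconds.
mtbf_s = 50.0 * 3600.0
best = math.sqrt(2.0 * 60.0 * mtbf_s)  # Young's first-order interval

for label, tau in [("half", best / 2), ("optimal", best), ("double", best * 2)]:
    waste = wasted_fraction(tau, mtbf_s)
    print(f"{label:>8}: interval={tau:8.0f} s  goodput={1.0 - waste:.4f}")
```

Checkpointing too often pays more in write overhead, and checkpointing too rarely loses more work per failure; in this model the recovery time adds a fixed cost that does not shift the optimal interval.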