solvers.ReliabilityModel

solvers.ReliabilityModel()

Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This model covers the reliability of large clusters and helps determine the ‘Goodput’ of long-running training jobs. It estimates the probability that a job fails before completion and calculates the Young-Daly optimal checkpoint interval, which minimizes wasted compute time.

Literature Sources:

1. Young (1974), “A First Order Approximation to the Optimum Checkpoint Interval.”
2. Daly (2006), “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps.”
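The quantities named above can be illustrated with a minimal sketch. It is not this library's implementation: the helper names (`cluster_mtbf_s`, `young_interval_s`, `failure_probability`) are hypothetical, failures are assumed to be independent and exponentially distributed, and only Young's first-order formula is shown, not Daly's higher-order refinement.

```python
import math


def cluster_mtbf_s(node_mtbf_s: float, num_nodes: int) -> float:
    """Cluster MTBF, assuming independent nodes with exponential failures.

    Under that assumption the cluster failure rate is the sum of the node
    rates, so the cluster MTBF is the node MTBF divided by the node count.
    """
    if num_nodes <= 0 or node_mtbf_s <= 0:
        raise ValueError("node_mtbf_s and num_nodes must be positive")
    return node_mtbf_s / num_nodes


def young_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum interval: tau = sqrt(2 * C * M)."""
    if checkpoint_time_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint_time_s and mtbf_s must be positive")
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


def failure_probability(job_duration_s: float, mtbf_s: float) -> float:
    """Probability of at least one failure during the job: 1 - exp(-T / M)."""
    if job_duration_s < 0 or mtbf_s <= 0:
        raise ValueError("job_duration_s must be >= 0 and mtbf_s positive")
    return 1.0 - math.exp(-job_duration_s / mtbf_s)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 nodes, each with a 50,000-hour MTBF.
    mtbf = cluster_mtbf_s(node_mtbf_s=50_000 * 3600, num_nodes=1000)
    tau = young_interval_s(checkpoint_time_s=60.0, mtbf_s=mtbf)
    p_fail = failure_probability(job_duration_s=100 * 3600, mtbf_s=mtbf)
    print(f"cluster MTBF: {mtbf / 3600:.1f} h")
    print(f"checkpoint every {tau / 60:.1f} min")
    print(f"P(failure during a 100 h job): {p_fail:.3f}")
```

With these illustrative inputs the cluster MTBF is 50 hours, so a 100-hour job is more likely than not to see at least one failure, which is why the checkpoint interval matters.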

Methods

Name	Description
solve	Calculates reliability and checkpointing metrics for a fleet.

solve

solvers.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
    avg_recovery_time_s=300.0,
)

Calculates reliability and checkpointing metrics for a fleet.

Parameters

Name	Type	Description	Default
fleet	Fleet	The hardware cluster configuration.	required
job_duration_hours	float	Total job duration in hours.	required
checkpoint_time_s	float	Time to write one checkpoint, in seconds.	`60.0`
avg_recovery_time_s	float	Average time to recover from a failure, in seconds. Includes checkpoint reload, process restart, and re-warmup.	`300.0`
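To show how these parameters interact, the sketch below combines them in a common first-order overhead model. It is an assumption about how such metrics can be derived, not a description of what `solve` computes internally: the function `estimate_goodput` and its `mtbf_s` and `interval_s` arguments are hypothetical, and the model assumes that each failure loses half a checkpoint interval of work on average, plus the recovery time.

```python
def estimate_goodput(
    job_duration_hours: float,
    mtbf_s: float,
    interval_s: float,
    checkpoint_time_s: float = 60.0,
    avg_recovery_time_s: float = 300.0,
) -> dict:
    """First-order goodput estimate (hypothetical helper, not the library API).

    Wasted fraction ~= C / tau              (time spent writing checkpoints)
                     + (tau / 2 + R) / M    (rework plus recovery per failure)
    """
    if min(job_duration_hours, mtbf_s, interval_s) <= 0:
        raise ValueError("durations, MTBF, and interval must be positive")

    checkpoint_overhead = checkpoint_time_s / interval_s
    failure_overhead = (interval_s / 2.0 + avg_recovery_time_s) / mtbf_s
    wasted = checkpoint_overhead + failure_overhead

    return {
        # Expected number of failures over the whole job.
        "expected_failures": job_duration_hours * 3600.0 / mtbf_s,
        # Clamp to [0, 1]; the approximation breaks down when waste is large.
        "goodput": max(0.0, 1.0 - wasted),
    }


if __name__ == "__main__":
    # Illustrative inputs: 50-hour cluster MTBF, checkpoint about every 77 min.
    result = estimate_goodput(
        job_duration_hours=100.0,
        mtbf_s=180_000.0,
        interval_s=4647.58,
    )
    print(result)
```

In this model the checkpoint overhead and the expected rework are equal at the Young interval, so moving the interval in either direction trades one cost for the other. Note the unit mix in the signature: the job duration is given in hours, while the checkpoint and recovery times are in seconds.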