solvers.CheckpointModel
solvers.CheckpointModel()
Analyzes the storage constraints and I/O burst penalties of saving model states.
Training massive models requires periodically saving hundreds of gigabytes of state (weights plus optimizer states) to persistent storage. This model estimates the time training spends blocked on checkpoint I/O and the resulting reduction in the cluster’s Model FLOPs Utilization (MFU).
Literature Source: Eisenman et al. (2022), “Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models.”
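The relationship between checkpoint size, write time, and lost training time can be illustrated with a back-of-the-envelope sketch. This is not the solver's implementation. The function names are hypothetical, the 14 bytes per parameter figure assumes mixed-precision Adam (2-byte weights, a 4-byte master copy, and two 4-byte moment tensors), and the checkpoint is assumed to block training for the full duration of the write.

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 14.0) -> float:
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes).

    Assumption: 14 bytes/param = 2 (bf16 weights) + 4 (fp32 master weights)
    + 4 + 4 (Adam first and second moments). Other optimizers differ.
    """
    return n_params * bytes_per_param / 1e9


def blocked_fraction(size_gb: float, write_bw_gbs: float, interval_hours: float) -> float:
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    Assumption: each checkpoint is synchronous, so every interval of useful
    training is followed by one full write during which no compute happens.
    """
    write_s = size_gb / write_bw_gbs
    interval_s = interval_hours * 3600.0
    return write_s / (interval_s + write_s)


# Hypothetical example: 70B parameters, 10 GB/s write bandwidth, 4-hour interval.
size = checkpoint_size_gb(70e9)
penalty = blocked_fraction(size, write_bw_gbs=10.0, interval_hours=4.0)
print(f"checkpoint: {size:.0f} GB, blocked fraction: {penalty:.4%}")
```

Under these assumptions the achieved MFU would scale by roughly `1 - blocked_fraction`. Asynchronous or incremental checkpointing, as discussed in the cited paper, would reduce the blocked time below this estimate.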
Methods
| Name | Description |
|---|---|
| solve | Solves for checkpoint size, write time, and resulting MFU penalty. |
solve
solvers.CheckpointModel.solve(
model,
hardware,
optimizer='adam',
checkpoint_interval_hours=4.0,
n_writers=1,
filesystem_limit_gbs=500.0,
)
Solves for checkpoint size, write time, and resulting MFU penalty.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| n_writers | int | Number of parallel checkpoint writers. Distributed checkpointing (e.g., FSDP) shards the write across workers. | 1 |
| filesystem_limit_gbs | float | Maximum aggregate filesystem write bandwidth, in GB/s. Caps the effective bandwidth so that a large n_writers does not yield over-optimistic write times. | 500.0 |
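How `n_writers` and `filesystem_limit_gbs` might interact can be sketched as a simple bandwidth cap. This is an illustration only: `per_writer_gbs` is a hypothetical input (the solver presumably derives per-writer bandwidth from `hardware`), and linear scaling with the number of writers up to the cap is an assumption.

```python
def effective_write_bandwidth_gbs(
    n_writers: int,
    per_writer_gbs: float,
    filesystem_limit_gbs: float = 500.0,
) -> float:
    """Aggregate write bandwidth in GB/s, capped by the shared filesystem.

    Assumption: writers scale linearly until the filesystem limit is reached.
    """
    if n_writers < 1:
        raise ValueError("n_writers must be at least 1")
    return min(n_writers * per_writer_gbs, filesystem_limit_gbs)


# Hypothetical 2 GB/s per writer: scaling is linear, then saturates at the cap.
for writers in (1, 100, 1024):
    bw = effective_write_bandwidth_gbs(writers, per_writer_gbs=2.0)
    print(f"{writers:>5} writers: {bw:.1f} GB/s")
```

With the cap in place, adding writers beyond the saturation point no longer shortens the estimated write time, which is the over-optimism the parameter is described as preventing.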