solvers.CheckpointModel

solvers.CheckpointModel()

Analyzes the storage constraints and I/O burst penalties of saving model states.

Training massive models requires periodically saving hundreds of gigabytes of state (weights plus optimizer states) to persistent storage. This model calculates the time the cluster spends blocked on checkpoint I/O, which reduces its Model FLOPs Utilization (MFU).
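
The arithmetic behind this kind of estimate can be sketched in a few lines. This is a back-of-the-envelope illustration, not the solver's actual implementation: the function names are hypothetical, the 12 bytes per parameter figure assumes FP32 weights plus two FP32 Adam moments, and the write is assumed to block training completely.

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float = 12.0) -> float:
    """Approximate checkpoint size in GB.

    Assumption: 12 bytes/param = 4 (FP32 weights) + 4 + 4 (Adam first and
    second moments). Other optimizers or precisions change this figure.
    """
    return n_params * bytes_per_param / 1e9


def blocked_fraction(
    size_gb: float,
    write_bandwidth_gbs: float,
    checkpoint_interval_hours: float = 4.0,
) -> float:
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    Assumption: training runs for one interval, then stalls for the whole
    write (fully synchronous checkpointing, no overlap with compute).
    """
    write_time_s = size_gb / write_bandwidth_gbs
    interval_s = checkpoint_interval_hours * 3600.0
    return write_time_s / (interval_s + write_time_s)


# Hypothetical example: a 70B-parameter model written at 10 GB/s every 4 hours.
size = checkpoint_size_gb(70e9)
penalty = blocked_fraction(size, write_bandwidth_gbs=10.0)
print(f"checkpoint: {size:.0f} GB, blocked fraction: {penalty:.4f}")
```

Under these assumptions, achieved MFU scales by roughly `1 - blocked_fraction`, so shorter intervals or slower storage raise the penalty.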

Literature Source: 1. Eisenman et al. (2022), “Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models.”

Methods

| Name | Description |
| --- | --- |
| solve | Solves for checkpoint size, write time, and resulting MFU penalty. |

solve

solvers.CheckpointModel.solve(
    model,
    hardware,
    optimizer='adam',
    checkpoint_interval_hours=4.0,
    n_writers=1,
    filesystem_limit_gbs=500.0,
)

Solves for checkpoint size, write time, and resulting MFU penalty.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_writers | int | Number of parallel checkpoint writers (default 1). Distributed checkpointing (e.g., FSDP) shards the write across workers. | 1 |
| filesystem_limit_gbs | float | Maximum aggregate filesystem write bandwidth in GB/s (default 500). Prevents over-optimistic scaling when n_writers is large. | 500.0 |
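
The interaction between `n_writers` and `filesystem_limit_gbs` can be illustrated with a minimal sketch. It assumes aggregate bandwidth grows linearly with the number of writers until it reaches the filesystem cap. The function name and the `per_writer_gbs` argument are hypothetical and are not part of the `solve` signature; how the solver derives per-writer bandwidth from `hardware` is not specified here.

```python
def effective_write_bandwidth_gbs(
    n_writers: int,
    per_writer_gbs: float,
    filesystem_limit_gbs: float = 500.0,
) -> float:
    """Aggregate checkpoint write bandwidth in GB/s.

    Assumption: each writer contributes per_writer_gbs (hypothetical input)
    and the total is clamped to the shared filesystem's limit.
    """
    if n_writers < 1:
        raise ValueError("n_writers must be at least 1")
    return min(n_writers * per_writer_gbs, filesystem_limit_gbs)


# With 2 GB/s per writer, scaling is linear until the 500 GB/s cap is hit.
for writers in (1, 64, 1024):
    print(writers, effective_write_bandwidth_gbs(writers, per_writer_gbs=2.0))
```

Beyond the point where `n_writers * per_writer_gbs` exceeds the cap, adding writers no longer shortens the write, which is the over-optimistic scaling the limit guards against.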