core.solver.CheckpointModel

core.solver.CheckpointModel()

Analyzes the storage constraints and I/O burst penalties of saving model states.

Training massive models requires periodically saving hundreds of gigabytes of state (weights plus optimizer states) to persistent storage. This solver calculates the time training spends blocked on checkpoint I/O and the resulting reduction in the cluster’s Model FLOPs Utilization (MFU).

Literature Source: 1. Eisenman et al. (2022), “Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models.”

Methods

| Name | Description |
|------|-------------|
| solve | Solves for checkpoint size, write time, and resulting MFU penalty. |

solve

core.solver.CheckpointModel.solve(
    model,
    hardware,
    optimizer='adam',
    checkpoint_interval_hours=4.0,
)

Solves for checkpoint size, write time, and resulting MFU penalty.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | Workload | The model architecture. | required |
| hardware | HardwareNode | The target hardware for storage bandwidth. | required |
| optimizer | str | Optimizer type ('adam' or 'sgd'); determines bytes per parameter. | 'adam' |
| checkpoint_interval_hours | float | Hours between checkpoints. | 4.0 |
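
The way the optimizer choice could drive checkpoint size can be sketched as follows. This is a minimal illustration, not the solver's implementation: the byte counts assume mixed-precision training with fp32 master weights, and both `ASSUMED_BYTES_PER_PARAM` and `checkpoint_size_gb` are hypothetical names that do not come from `core.solver`.

```python
# Assumed per-parameter storage; the real solver may use different values.
ASSUMED_BYTES_PER_PARAM = {
    "adam": 2 + 4 + 4 + 4,  # fp16 weights + fp32 master weights + two fp32 moments
    "sgd": 2 + 4 + 4,       # fp16 weights + fp32 master weights + fp32 momentum
}


def checkpoint_size_gb(num_params: float, optimizer: str = "adam") -> float:
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes) for a parameter count."""
    try:
        bytes_per_param = ASSUMED_BYTES_PER_PARAM[optimizer]
    except KeyError:
        raise ValueError(f"unsupported optimizer: {optimizer!r}") from None
    return num_params * bytes_per_param / 1e9


# A 7-billion-parameter model under the assumptions above.
adam_gb = checkpoint_size_gb(7e9, "adam")
sgd_gb = checkpoint_size_gb(7e9, "sgd")
print(f"adam: {adam_gb:.1f} GB, sgd: {sgd_gb:.1f} GB")
```

Under these assumptions Adam checkpoints are larger than SGD checkpoints for the same model, because Adam stores two moment tensors per parameter instead of one momentum buffer.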

Returns

| Name | Type | Description |
|------|------|-------------|
| | CheckpointResult | Checkpoint size (GB), write time, storage bottleneck flag, and MFU penalty percentage. |
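
To make the returned quantities concrete, here is a rough sketch of how write time and an MFU penalty could follow from checkpoint size, storage bandwidth, and checkpoint interval. It assumes fully synchronous checkpointing (training stalls for the whole write) and flags a bottleneck when a write takes at least as long as the interval. The formulas, the `PenaltySketch` class, and `estimate_penalty` are illustrative assumptions, not the definitions used by `CheckpointModel.solve` or `CheckpointResult`.

```python
from dataclasses import dataclass


@dataclass
class PenaltySketch:  # hypothetical stand-in, not the library's CheckpointResult
    write_time_s: float
    storage_bottleneck: bool
    mfu_penalty_pct: float


def estimate_penalty(
    size_gb: float,
    write_bandwidth_gb_s: float,
    checkpoint_interval_hours: float = 4.0,
) -> PenaltySketch:
    """Estimate blocked time per checkpoint and the share of wall-clock time lost."""
    if write_bandwidth_gb_s <= 0 or checkpoint_interval_hours <= 0:
        raise ValueError("bandwidth and interval must be positive")
    write_time_s = size_gb / write_bandwidth_gb_s
    interval_s = checkpoint_interval_hours * 3600.0
    # Each cycle is one interval of training plus one blocking write.
    blocked_fraction = write_time_s / (interval_s + write_time_s)
    return PenaltySketch(
        write_time_s=write_time_s,
        # Assumed criterion: storage cannot keep up with the checkpoint cadence.
        storage_bottleneck=write_time_s >= interval_s,
        mfu_penalty_pct=100.0 * blocked_fraction,
    )


# A 98 GB checkpoint written at 2 GB/s every 4 hours.
result = estimate_penalty(98.0, 2.0, checkpoint_interval_hours=4.0)
print(result)
```

In this model, shortening the interval or lowering the write bandwidth raises the penalty. Asynchronous or incremental checkpointing would reduce the blocked time and is not captured here.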