core.solver.CheckpointModel
core.solver.CheckpointModel()
Analyzes the storage constraints and I/O burst penalties of saving model states.
Training massive models requires periodically saving hundreds of gigabytes of state (weights plus optimizer states) to persistent storage. This solver calculates the time the cluster spends blocked on checkpoint I/O and reports it as a reduction in Model FLOPs Utilization (MFU).
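To make the scale concrete, the size and write-time arithmetic can be sketched as below. This is a minimal illustration, not the solver's implementation: the bytes-per-parameter values (12 for Adam, assuming fp32 weights plus two fp32 moment tensors, and 4 for SGD, assuming fp32 weights only), the helper names, and the 10 GB/s storage bandwidth are all assumptions chosen for the example.

```python
# Assumed bytes saved per parameter (illustrative; not the solver's actual table).
# adam: fp32 weights + fp32 first moment + fp32 second moment = 12 bytes
# sgd:  fp32 weights only (no momentum buffer)                =  4 bytes
BYTES_PER_PARAM = {"adam": 12, "sgd": 4}


def checkpoint_size_gb(num_params: float, optimizer: str = "adam") -> float:
    """Checkpoint size in GB (1 GB = 1e9 bytes) for a given parameter count."""
    return num_params * BYTES_PER_PARAM[optimizer] / 1e9


def write_time_s(size_gb: float, storage_bw_gb_per_s: float) -> float:
    """Seconds needed to write the checkpoint at a sustained storage bandwidth."""
    return size_gb / storage_bw_gb_per_s


# Hypothetical 70B-parameter model written to storage sustaining 10 GB/s.
size_gb = checkpoint_size_gb(70e9, "adam")
seconds = write_time_s(size_gb, 10.0)
print(f"{size_gb:.0f} GB, {seconds:.0f} s per checkpoint")
```

Under these assumptions the checkpoint is 840 GB and takes 84 seconds to write, which is the blocked time the solver charges against MFU if the write is synchronous.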
Literature Source: 1. Eisenman et al. (2022), “Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models.”
Methods
| Name | Description |
|---|---|
| solve | Solves for checkpoint size, write time, and resulting MFU penalty. |
solve
core.solver.CheckpointModel.solve(
model,
hardware,
optimizer='adam',
checkpoint_interval_hours=4.0,
)
Solves for checkpoint size, write time, and resulting MFU penalty.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture. | required |
| hardware | HardwareNode | The target hardware for storage bandwidth. | required |
| optimizer | str | Optimizer type ('adam' or 'sgd'); determines bytes per parameter. | 'adam' |
| checkpoint_interval_hours | float | Hours between checkpoints. | 4.0 |
Returns
| Name | Type | Description |
|---|---|---|
| | CheckpointResult | Checkpoint size (GB), write time, storage bottleneck flag, and MFU penalty percentage. |
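One way the returned quantities could relate to each other is sketched below. The class `CheckpointEstimate` and its field names are hypothetical stand-ins, since the actual attributes of `CheckpointResult` are not documented here. The sketch also assumes a fully synchronous checkpoint (training is blocked for the whole write), a penalty defined as blocked time divided by the checkpoint interval, and an arbitrary 1% threshold for the bottleneck flag.

```python
from dataclasses import dataclass


@dataclass
class CheckpointEstimate:
    """Hypothetical stand-in for CheckpointResult; field names are assumptions."""
    size_gb: float
    write_time_s: float
    storage_bottleneck: bool
    mfu_penalty_pct: float


def estimate_checkpoint(
    size_gb: float,
    storage_bw_gb_per_s: float,
    checkpoint_interval_hours: float = 4.0,
    bottleneck_threshold_pct: float = 1.0,  # assumed cutoff, for illustration only
) -> CheckpointEstimate:
    # Time blocked on I/O for one checkpoint, assuming a synchronous write.
    write_time_s = size_gb / storage_bw_gb_per_s
    # Share of each checkpoint interval lost to the write, as a percentage.
    interval_s = checkpoint_interval_hours * 3600.0
    penalty_pct = 100.0 * write_time_s / interval_s
    return CheckpointEstimate(
        size_gb=size_gb,
        write_time_s=write_time_s,
        storage_bottleneck=penalty_pct > bottleneck_threshold_pct,
        mfu_penalty_pct=penalty_pct,
    )


fast = estimate_checkpoint(840.0, storage_bw_gb_per_s=10.0)
slow = estimate_checkpoint(840.0, storage_bw_gb_per_s=1.0)
print(f"{fast.mfu_penalty_pct:.2f}% vs {slow.mfu_penalty_pct:.2f}%")
```

With the default 4-hour interval, an 840 GB checkpoint costs roughly 0.58% of wall time at 10 GB/s and roughly 5.83% at 1 GB/s. Under this model the penalty scales inversely with both storage bandwidth and checkpoint interval.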