core.solver.CheckpointModel

core.solver.CheckpointModel()

Analyzes the storage constraints and I/O burst penalties of saving model states.

Training massive models requires saving hundreds of gigabytes (weights + optimizer states) to persistent storage. This solver calculates the time that training spends blocked on checkpoint I/O, which lowers the cluster’s Model FLOPs Utilization (MFU).

Literature Source: 1. Eisenman et al. (2022), “Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models.”

Methods

Name	Description
solve	Solves for checkpoint size, write time, and resulting MFU penalty.

solve

core.solver.CheckpointModel.solve(
    model,
    hardware,
    optimizer='adam',
    checkpoint_interval_hours=4.0,
)

Solves for checkpoint size, write time, and resulting MFU penalty.

Parameters

Name	Type	Description	Default
model	Workload	The model architecture.	required
hardware	HardwareNode	The target hardware, which supplies the storage bandwidth.	required
optimizer	str	Optimizer type (`'adam'` or `'sgd'`), which determines the bytes stored per parameter.	`'adam'`
checkpoint_interval_hours	float	Hours between checkpoints.	`4.0`

Returns

Name	Type	Description
	CheckpointResult	Checkpoint size (GB), write time, storage bottleneck flag, and MFU penalty percentage.
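
The relationship between the returned quantities can be illustrated with a minimal standalone sketch. It does not call the library and is not its implementation. The function `estimate_checkpoint`, the `CheckpointEstimate` dataclass, and the bytes-per-parameter values are all hypothetical, chosen only to show the arithmetic. It assumes fp32 weights and optimizer state, a fully synchronous write that blocks training, and a checkpoint interval measured in wall-clock time. The storage bottleneck flag is left out because its definition is not stated here.

```python
from dataclasses import dataclass

# Assumed bytes stored per parameter (illustrative, not taken from the library):
# fp32 weights plus fp32 optimizer state.
BYTES_PER_PARAM = {
    "adam": 12,  # 4 (weights) + 4 (first moment) + 4 (second moment)
    "sgd": 8,    # 4 (weights) + 4 (momentum buffer)
}


@dataclass
class CheckpointEstimate:
    """Hypothetical container mirroring some of the quantities described above."""
    size_gb: float
    write_time_s: float
    mfu_penalty_pct: float


def estimate_checkpoint(num_params, storage_gb_per_s,
                        optimizer="adam", checkpoint_interval_hours=4.0):
    """Estimate checkpoint size, blocking write time, and the MFU penalty."""
    # Checkpoint size: parameters times bytes per parameter, in decimal GB.
    size_gb = num_params * BYTES_PER_PARAM[optimizer] / 1e9
    # Write time: size divided by sustained storage write bandwidth.
    write_time_s = size_gb / storage_gb_per_s
    # Penalty: share of each checkpoint interval spent blocked on the write.
    interval_s = checkpoint_interval_hours * 3600.0
    mfu_penalty_pct = 100.0 * write_time_s / interval_s
    return CheckpointEstimate(size_gb, write_time_s, mfu_penalty_pct)


# Example: 70B parameters, 10 GB/s sustained write bandwidth, default settings.
est = estimate_checkpoint(num_params=70e9, storage_gb_per_s=10.0)
print(f"{est.size_gb:.0f} GB, {est.write_time_s:.0f} s, {est.mfu_penalty_pct:.2f}% penalty")
```

Under these assumptions, halving `checkpoint_interval_hours` doubles the penalty, while asynchronous or incremental checkpointing would shrink the blocking portion of the write and therefore the penalty.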