solvers.TrainingMemoryModel

solvers.TrainingMemoryModel()

Decomposes per-accelerator training memory into teachable components.

This model answers a different question than SingleNodeModel. Roofline feasibility asks whether a workload’s inference weights fit; training feasibility must also account for gradients, optimizer state, activations, and communication buffers. The accounting follows the common mixed-precision state breakdown used by Megatron-LM and ZeRO.
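The gap between inference and training feasibility can be made concrete with the per-parameter byte counts from the ZeRO paper's mixed-precision Adam analysis. The sketch below is illustrative only: the dictionary keys and function names are hypothetical and are not part of this library, and it assumes fp16 weights and gradients with fp32 master weights, momentum, and variance. Activations and communication buffers are left out.

```python
# Bytes per parameter for mixed-precision Adam, following the ZeRO paper's
# accounting. Assumption: this mirrors the breakdown the solver refers to;
# the names below are hypothetical and not part of the library.
BYTES_PER_PARAM = {
    "weights_fp16": 2,
    "gradients_fp16": 2,
    "master_weights_fp32": 4,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
}


def inference_weight_bytes(num_params: int) -> int:
    """Memory for the fp16 weights alone (the inference-only question)."""
    return num_params * BYTES_PER_PARAM["weights_fp16"]


def training_state_bytes(num_params: int) -> int:
    """Weights, gradients, and optimizer state, before any sharding."""
    return num_params * sum(BYTES_PER_PARAM.values())


if __name__ == "__main__":
    GB = 10**9  # decimal gigabytes, to keep the arithmetic easy to follow
    n = 7_000_000_000  # a 7B-parameter model, chosen only as an example
    print(f"inference weights: {inference_weight_bytes(n) / GB:.0f} GB")
    print(f"training states:   {training_state_bytes(n) / GB:.0f} GB")
```

Under these assumptions the model states alone are eight times the size of the inference weights, which is why a model that fits for inference may not fit for training.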

Literature Source:

1. Shoeybi et al. (2019), “Megatron-LM” (tensor/pipeline parallel state).
2. Rajbhandari et al. (2020), “ZeRO” (data-parallel state sharding).
3. Korthikanti et al. (2023), activation recomputation accounting.

Methods

Name Description
solve Estimate per-accelerator training memory.

solve

solvers.TrainingMemoryModel.solve(
    model,
    hardware,
    batch_size,
    seq_len=2048,
    precision='fp16',
    optimizer='adam',
    activation_checkpointing='selective',
    tp_size=1,
    pp_size=1,
    dp_size=1,
    ep_size=1,
    zero_stage=0,
    gradient_accumulation_steps=1,
    trainable_fraction=1.0,
    communication_buffer_fraction=0.05,
)

Estimate per-accelerator training memory.

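batch_size is the global batch size. The activation term uses the local microbatch, that is, the global batch divided across the data-parallel ranks and the gradient accumulation steps. Model states are sharded by tensor, pipeline, and expert parallelism first; ZeRO then shards states across the data-parallel group according to zero_stage: stage 1 shards optimizer state, stage 2 also shards gradients, and stage 3 also shards parameters. With zero_stage=0, every data-parallel rank keeps a full replica of its model-parallel shard.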
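The sharding order described above can be sketched in a few lines. This is a rough approximation under stated assumptions, not the solver's implementation: the function names are hypothetical, the byte counts assume mixed-precision Adam (2 bytes of weights, 2 of gradients, 12 of optimizer state per parameter), and expert parallelism, trainable_fraction, activations, and communication buffers are all omitted for brevity.

```python
def local_microbatch(batch_size: int, dp_size: int = 1,
                     gradient_accumulation_steps: int = 1) -> int:
    """Samples processed per accelerator in one forward/backward pass."""
    denom = dp_size * gradient_accumulation_steps
    if batch_size % denom != 0:
        raise ValueError("batch_size must be divisible by "
                         "dp_size * gradient_accumulation_steps")
    return batch_size // denom


def sharded_state_bytes(num_params: int, tp_size: int = 1, pp_size: int = 1,
                        dp_size: int = 1, zero_stage: int = 0) -> dict:
    """Per-accelerator model-state bytes (hypothetical helper, dense model)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")

    # Model parallelism first: each rank holds a 1/(tp * pp) slice.
    local_params = num_params / (tp_size * pp_size)
    params = 2.0 * local_params      # fp16 weights
    grads = 2.0 * local_params       # fp16 gradients
    optim = 12.0 * local_params      # fp32 master weights + Adam moments

    # ZeRO then shards across the data-parallel group, cumulatively by stage.
    if zero_stage >= 1:
        optim /= dp_size
    if zero_stage >= 2:
        grads /= dp_size
    if zero_stage >= 3:
        params /= dp_size

    return {"parameters": params, "gradients": grads, "optimizer": optim}


if __name__ == "__main__":
    print(local_microbatch(512, dp_size=8, gradient_accumulation_steps=4))
    states = sharded_state_bytes(8_000_000_000, tp_size=2, pp_size=2,
                                 dp_size=8, zero_stage=1)
    for name, nbytes in states.items():
        print(f"{name}: {nbytes / 1e9:.1f} GB")
```

Applying model parallelism before ZeRO matters for the arithmetic: ZeRO divides only the slice that remains on each rank after tensor and pipeline sharding, so the two reductions multiply rather than overlap.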
