solvers.TrainingMemoryModel
solvers.TrainingMemoryModel()

Decomposes per-accelerator training memory into teachable components.
This model answers a different question than SingleNodeModel. Roofline feasibility asks whether a workload’s inference weights fit; training feasibility must also account for gradients, optimizer state, activations, and communication buffers. The accounting follows the common mixed-precision state breakdown used by Megatron-LM and ZeRO.
Literature Sources:

1. Shoeybi et al. (2019), “Megatron-LM” (tensor/pipeline parallel state).
2. Rajbhandari et al. (2020), “ZeRO” (data-parallel state sharding).
3. Korthikanti et al. (2023), activation recomputation accounting.
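To make the gap between inference and training feasibility concrete, the sketch below tallies model-state bytes per parameter using the mixed-precision Adam accounting from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, 12 bytes of fp32 optimizer state). The function name `model_state_bytes` and the constants are illustrative assumptions, not part of `TrainingMemoryModel`; activations and communication buffers are deliberately left out.

```python
# Illustrative only: these names are hypothetical, not the library's API.
# Bytes per parameter for mixed-precision Adam, per the ZeRO paper's accounting.
FP16_WEIGHT_BYTES = 2
FP16_GRAD_BYTES = 2
FP32_OPTIMIZER_BYTES = 12  # fp32 master weights (4) + momentum (4) + variance (4)


def model_state_bytes(num_params: int) -> int:
    """Weights + gradients + optimizer state, before any sharding."""
    per_param = FP16_WEIGHT_BYTES + FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
    return num_params * per_param


num_params = 7_000_000_000  # a 7B-parameter model, chosen as an example
inference_gb = num_params * FP16_WEIGHT_BYTES / 1e9  # weights only
training_gb = model_state_bytes(num_params) / 1e9    # weights + grads + optimizer

print(f"inference weights: {inference_gb:.0f} GB")
print(f"training model states: {training_gb:.0f} GB")
```

Under these assumptions the same model needs eight times as much memory for model states as for inference weights alone, before activations are counted.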
Methods
| Name | Description |
|---|---|
| solve | Estimate per-accelerator training memory. |
solve
solvers.TrainingMemoryModel.solve(
model,
hardware,
batch_size,
seq_len=2048,
precision='fp16',
optimizer='adam',
activation_checkpointing='selective',
tp_size=1,
pp_size=1,
dp_size=1,
ep_size=1,
zero_stage=0,
gradient_accumulation_steps=1,
trainable_fraction=1.0,
communication_buffer_fraction=0.05,
)

Estimate per-accelerator training memory.
batch_size is the global batch. The activation term uses the local microbatch implied by data parallelism and gradient accumulation. Model states are sharded by tensor, pipeline, and expert parallelism first; ZeRO then shards optimizer, gradient, and parameter states across the data-parallel group according to its stage.
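The two rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the solver's actual implementation: the helper names are hypothetical, the local microbatch is assumed to be the global batch divided by `dp_size * gradient_accumulation_steps`, model states use the 2 + 2 + 12 bytes-per-parameter mixed-precision Adam breakdown, and expert parallelism is omitted because it applies only to expert parameters.

```python
# Hypothetical helpers for illustration; not part of TrainingMemoryModel.

def local_microbatch(batch_size: int, dp_size: int = 1,
                     gradient_accumulation_steps: int = 1) -> int:
    """Samples processed per accelerator per forward/backward pass (assumed rule)."""
    denom = dp_size * gradient_accumulation_steps
    if batch_size % denom != 0:
        raise ValueError("batch_size must be divisible by dp_size * accumulation steps")
    return batch_size // denom


def state_bytes_per_accelerator(num_params: float, tp_size: int = 1, pp_size: int = 1,
                                dp_size: int = 1, zero_stage: int = 0) -> float:
    """Model-state bytes per accelerator: model parallelism first, then ZeRO."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    # Tensor and pipeline parallelism split the parameters themselves.
    local_params = num_params / (tp_size * pp_size)
    weights = 2.0 * local_params     # fp16 weights
    grads = 2.0 * local_params       # fp16 gradients
    optimizer = 12.0 * local_params  # fp32 master weights + Adam moments
    # ZeRO shards progressively more state across the data-parallel group.
    if zero_stage >= 1:
        optimizer /= dp_size
    if zero_stage >= 2:
        grads /= dp_size
    if zero_stage >= 3:
        weights /= dp_size
    return weights + grads + optimizer


micro = local_microbatch(512, dp_size=8, gradient_accumulation_steps=4)
stage1 = state_bytes_per_accelerator(8e9, tp_size=2, pp_size=2, dp_size=4, zero_stage=1)
print(micro, stage1 / 1e9)
```

In this example each accelerator holds 2B of the 8B parameters after tensor and pipeline splitting; ZeRO stage 1 then divides only the optimizer state by the data-parallel degree, leaving weights and gradients replicated.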