core.solver.TrainingMemoryModel

core.solver.TrainingMemoryModel()

Decomposes per-accelerator training memory into weights, gradients, optimizer state, activations, and communication buffers.

This model is intended for first-order training feasibility analysis. It makes the difference between inference memory and training memory explicit without modeling framework internals.
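The gap between inference and training memory can be illustrated with a minimal sketch. This is not the library's implementation: the function name `estimate_static_memory_bytes`, the byte widths (2 bytes per fp16 or bf16 value, 8 bytes per parameter for Adam's two fp32 moments, no state for SGD without momentum), and the ZeRO sharding rule are assumptions chosen for illustration. Activations, communication buffers, and tensor, pipeline, and expert parallelism are left out.

```python
from dataclasses import dataclass

# Assumed byte widths per value; the real model may use different numbers.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}
# Assumed optimizer state per trainable parameter, in bytes:
# Adam/AdamW keep two fp32 moments; SGD (no momentum) and "none" keep nothing.
OPTIMIZER_STATE_BYTES = {"adam": 8, "adamw": 8, "sgd": 0, "none": 0}


@dataclass
class StaticMemory:
    weights: float
    gradients: float
    optimizer_state: float

    @property
    def total(self) -> float:
        return self.weights + self.gradients + self.optimizer_state


def estimate_static_memory_bytes(
    num_params: int,
    precision: str = "fp16",
    optimizer: str = "adam",
    dp_size: int = 1,
    zero_stage: int = 0,
    trainable_fraction: float = 1.0,
) -> StaticMemory:
    """Hypothetical first-order estimate of per-rank static training memory."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    width = BYTES_PER_PARAM[precision]
    trainable = num_params * trainable_fraction

    weights = num_params * width
    gradients = trainable * width
    opt_state = trainable * OPTIMIZER_STATE_BYTES[optimizer]

    # ZeRO shards optimizer state (stage 1), then gradients (stage 2),
    # then parameters (stage 3) across the data-parallel group.
    if zero_stage >= 1:
        opt_state /= dp_size
    if zero_stage >= 2:
        gradients /= dp_size
    if zero_stage >= 3:
        weights /= dp_size
    return StaticMemory(weights, gradients, opt_state)
```

Under these assumptions, a 1-billion-parameter model in fp16 needs about 2 GB for weights alone (the inference case) but about 12 GB once gradients and Adam state are added. ZeRO stage 1 across 8 data-parallel ranks would bring that down to about 5 GB per rank.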

Methods

| Name | Description |
|------|-------------|
| solve | Estimate per-accelerator training memory. |

solve

core.solver.TrainingMemoryModel.solve(
    model,
    hardware,
    batch_size,
    seq_len=2048,
    precision='fp16',
    optimizer='adam',
    activation_checkpointing='selective',
    tp_size=1,
    pp_size=1,
    dp_size=1,
    ep_size=1,
    zero_stage=0,
    gradient_accumulation_steps=1,
    trainable_fraction=1.0,
    communication_buffer_fraction=0.05,
)

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | Transformer workload to train. | required |
| hardware | HardwareNode | Per-rank accelerator target. | required |
| batch_size | int | Global batch size. | required |
| seq_len | int | Training sequence length. | 2048 |
| precision | str | Parameter/gradient precision. | 'fp16' |
| optimizer | str | One of adam, adamw, sgd, or none. | 'adam' |
| activation_checkpointing | str | One of none, selective, or full. | 'selective' |
| tp_size, pp_size, dp_size, ep_size | int | Tensor-, pipeline-, data-, and expert-parallel degrees. | 1 |
| zero_stage | int | ZeRO stage 0–3. | 0 |
| gradient_accumulation_steps | int | Number of gradient accumulation steps, used to derive the local microbatch size. | 1 |
| trainable_fraction | float | Fraction of local parameters with gradients and optimizer state. | 1.0 |
| communication_buffer_fraction | float | Gradient bucket buffer fraction. | 0.05 |
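The reference does not state how the local microbatch is derived from `batch_size`. A common convention is that the global batch equals the microbatch size times the data-parallel degree times the number of accumulation steps. The sketch below assumes that convention; the helper name `local_microbatch_size` is hypothetical and is not part of `core.solver`.

```python
def local_microbatch_size(
    batch_size: int,
    dp_size: int = 1,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Hypothetical helper: per-rank microbatch size from the global batch size.

    Assumes batch_size = microbatch * dp_size * gradient_accumulation_steps.
    """
    if dp_size < 1 or gradient_accumulation_steps < 1:
        raise ValueError("dp_size and gradient_accumulation_steps must be >= 1")
    divisor = dp_size * gradient_accumulation_steps
    if batch_size % divisor != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by "
            f"dp_size * gradient_accumulation_steps = {divisor}"
        )
    return batch_size // divisor
```

With a global batch of 512, 8 data-parallel ranks, and 4 accumulation steps, each rank would process 16 sequences per forward/backward pass. Activation memory scales with this microbatch size rather than with the global batch.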

Returns

TrainingMemoryResult with total memory, available memory, feasibility, utilization, and a component breakdown.
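The attribute names of `TrainingMemoryResult` are not listed here, so the following is only a sketch of how such a result could relate its fields. The class `MemoryResultSketch` and all of its field names are hypothetical. It assumes that total memory is the sum of the component breakdown, utilization is total divided by available memory, and a configuration is feasible when the total fits in available memory.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemoryResultSketch:
    """Hypothetical stand-in for a training memory result (not the real class)."""

    available_bytes: float
    breakdown_bytes: Dict[str, float] = field(default_factory=dict)

    @property
    def total_bytes(self) -> float:
        # Assumed: total is the sum of all components in the breakdown.
        return sum(self.breakdown_bytes.values())

    @property
    def utilization(self) -> float:
        # Assumed: fraction of available accelerator memory that is used.
        return self.total_bytes / self.available_bytes

    @property
    def feasible(self) -> bool:
        # Assumed: the configuration fits if total does not exceed available.
        return self.total_bytes <= self.available_bytes


result = MemoryResultSketch(
    available_bytes=80e9,
    breakdown_bytes={
        "weights": 2e9,
        "gradients": 2e9,
        "optimizer_state": 8e9,
        "activations": 4e9,
        "communication_buffers": 1e9,
    },
)
```

The numbers in `result` are illustrative inputs, not output from the solver.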
