core.solver.TrainingMemoryModel
core.solver.TrainingMemoryModel()
Decomposes per-accelerator training memory into weights, gradients, optimizer state, activations, and communication buffers.
This model is intended for first-order training feasibility analysis. It makes the difference between inference memory and training memory explicit without modeling framework internals.
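As a rough illustration of the decomposition described above, the sketch below sums the five components for a single accelerator. It is not the implementation in `core.solver`: the function name `sketch_training_memory`, the byte tables, and the choice to size communication buffers as a fraction of gradient memory are all assumptions made for this example. The Adam figure assumes an fp32 master copy plus two fp32 moments (12 bytes per trainable parameter).

```python
# Illustrative byte sizes; assumptions for this sketch, not values read from core.solver.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}
# Assumed optimizer state per trainable parameter, in bytes.
OPTIMIZER_STATE_BYTES = {"adam": 12, "adamw": 12, "sgd": 0, "none": 0}


def sketch_training_memory(
    num_params,
    activation_bytes,
    precision="fp16",
    optimizer="adam",
    trainable_fraction=1.0,
    communication_buffer_fraction=0.05,
):
    """Return a hypothetical per-accelerator memory breakdown in bytes."""
    param_bytes = BYTES_PER_PARAM[precision]
    trainable_params = num_params * trainable_fraction
    breakdown = {
        "weights": num_params * param_bytes,
        "gradients": trainable_params * param_bytes,
        "optimizer_state": trainable_params * OPTIMIZER_STATE_BYTES[optimizer],
        "activations": activation_bytes,
    }
    # Assumption: buffers are sized relative to gradient memory.
    breakdown["communication_buffers"] = (
        breakdown["gradients"] * communication_buffer_fraction
    )
    breakdown["total"] = sum(breakdown.values())
    return breakdown
```

Under these assumptions, the weights of a 7B-parameter model in fp16 take about 14 GB, while the training total before activations is several times larger, which is the inference-versus-training gap the model is meant to expose.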
Methods
| Name | Description |
|---|---|
| solve | Estimate per-accelerator training memory. |
solve
core.solver.TrainingMemoryModel.solve(
model,
hardware,
batch_size,
seq_len=2048,
precision='fp16',
optimizer='adam',
activation_checkpointing='selective',
tp_size=1,
pp_size=1,
dp_size=1,
ep_size=1,
zero_stage=0,
gradient_accumulation_steps=1,
trainable_fraction=1.0,
communication_buffer_fraction=0.05,
)
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | Transformer workload to train. | required |
| hardware | HardwareNode | Per-rank accelerator target. | required |
| batch_size | int | Global batch size. | required |
| seq_len | int | Training sequence length. | 2048 |
| precision | str | Parameter/gradient precision. | 'fp16' |
| optimizer | str | adam, adamw, sgd, or none. | 'adam' |
| activation_checkpointing | str | none, selective, or full. | 'selective' |
| tp_size, pp_size, dp_size, ep_size | int | Parallelism degrees. | 1 |
| zero_stage | int | ZeRO stage 0–3. | 0 |
| gradient_accumulation_steps | int | Number of gradient accumulation steps; used to derive the local microbatch size. | 1 |
| trainable_fraction | float | Fraction of local parameters with gradients and optimizer state. | 1.0 |
| communication_buffer_fraction | float | Gradient bucket buffer fraction. | 0.05 |
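The table does not state how the parallelism parameters enter the estimate. One plausible reading is sketched below; both helper functions are hypothetical and are not part of `core.solver`. The microbatch formula (global batch divided by `dp_size * gradient_accumulation_steps`, rounded up) is an assumption. The ZeRO mapping follows the standard definition of the stages: stage 1 shards optimizer state across data-parallel ranks, stage 2 also shards gradients, and stage 3 also shards parameters.

```python
def local_microbatch(batch_size, dp_size=1, gradient_accumulation_steps=1):
    """Assumed rule: split the global batch over DP ranks and accumulation steps."""
    per_step = dp_size * gradient_accumulation_steps
    # Ceiling division so a non-divisible batch is not underestimated.
    return max(1, -(-batch_size // per_step))


def zero_shard_divisors(zero_stage, dp_size):
    """Return the factor by which each component is divided across DP ranks."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    return {
        "optimizer_state": dp_size if zero_stage >= 1 else 1,
        "gradients": dp_size if zero_stage >= 2 else 1,
        "weights": dp_size if zero_stage >= 3 else 1,
    }
```

For example, with `batch_size=512`, `dp_size=8`, and `gradient_accumulation_steps=4`, this rule gives a local microbatch of 16 samples per rank per step.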
Returns
TrainingMemoryResult with total memory, available memory, feasibility, utilization, and a component breakdown.
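The relationship between the returned quantities can be sketched as follows. `MemoryResultSketch` and its field names are hypothetical stand-ins, not the actual `TrainingMemoryResult` definition; the sketch assumes that utilization is total memory divided by available memory and that a configuration is feasible when the total does not exceed what is available.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryResultSketch:
    """Hypothetical stand-in for a training memory result (field names assumed)."""

    total_bytes: float
    available_bytes: float
    breakdown: dict = field(default_factory=dict)

    @property
    def utilization(self):
        # Fraction of available accelerator memory consumed by the estimate.
        return self.total_bytes / self.available_bytes

    @property
    def feasible(self):
        # Assumed rule: the estimate fits if it does not exceed available memory.
        return self.total_bytes <= self.available_bytes
```

A utilization close to 1.0 leaves little headroom for allocator fragmentation and other overheads that a first-order model does not capture, so a threshold below 1.0 may be a safer feasibility criterion in practice.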