core.solver.DistributedModel
core.solver.DistributedModel()
Resolves fleet-wide communication, synchronization, and pipelining constraints.
This solver models the constraints that arise when training is scaled out across a cluster. It decomposes a workload across the cluster using 3D/4D parallelism (data, tensor, pipeline, and expert parallelism: DP, TP, PP, EP) and calculates the resulting communication overheads and idle times (pipeline bubbles) that determine Model FLOPs Utilization (MFU).
Literature Sources:
1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Efficient Pipeline Parallelism for Training Large Models.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Yuan (2009), “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
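The two cost models named in the sources above can be sketched in a few lines. This is a minimal illustration of the textbook formulas, not the solver's actual implementation: the function names, the `latency_s` term, and the treatment of `v_stages` as a divisor on the bubble are assumptions made for this example.

```python
def ring_allreduce_time(message_bytes: float, n_ranks: int,
                        bandwidth_bytes_per_s: float,
                        latency_s: float = 0.0) -> float:
    """Textbook ring all-reduce cost (hypothetical helper, not part of core.solver).

    A ring all-reduce runs 2 * (N - 1) steps (reduce-scatter, then all-gather),
    and each step moves message_bytes / N per rank.
    """
    if n_ranks <= 1:
        return 0.0  # a single rank has nothing to synchronize
    steps = 2 * (n_ranks - 1)
    chunk_bytes = message_bytes / n_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def pipeline_bubble_fraction(pp_size: int, microbatch_count: int,
                             v_stages: int = 1) -> float:
    """Idle fraction of a 1F1B pipeline step (hypothetical helper).

    With p stages, m microbatches, and v virtual stages per device, the
    bubble is (p - 1) / v microbatch slots against m useful slots, so the
    idle share of the whole step is (p - 1) / (v * m + p - 1).
    """
    if pp_size <= 1:
        return 0.0  # no pipeline, no bubble
    p, m, v = pp_size, microbatch_count, v_stages
    return (p - 1) / (v * m + p - 1)


if __name__ == "__main__":
    # Example: 1 GB of gradients across 8 ranks on 100 GB/s links (assumed values).
    print(f"ring all-reduce: {ring_allreduce_time(1e9, 8, 100e9):.4f} s")
    # More microbatches shrink the bubble for a fixed 4-stage pipeline.
    for m in (1, 4, 8, 32):
        print(f"m={m:>2}: bubble fraction {pipeline_bubble_fraction(4, m):.3f}")
```

The loop shows why `microbatch_count` matters: the bubble share falls as M grows relative to the pipeline depth, and interleaving (`v_stages > 1`) reduces it further at the cost of extra communication.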
Methods
| Name | Description |
|---|---|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |
solve
core.solver.DistributedModel.solve(
model,
fleet,
batch_size=1,
precision='fp16',
efficiency=0.5,
tp_size=1,
pp_size=1,
ep_size=1,
v_stages=1,
microbatch_count=1,
topology_override=None,
zero_stage=0,
is_lora=False,
activation_recomputation=False,
overlap_comm=False,
overlap_efficiency=0.85,
congestion_factor=1.0,
straggler_factor=1.0,
moe_routing_imbalance_factor=1.0,
gradient_accumulation_steps=1,
seq_len=2048,
)
Calculates distributed training performance using the 3D/4D Parallelism model.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture to analyze. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
| zero_stage | int | ZeRO optimization stage (0–3). | 0 |
| is_lora | bool | Whether to approximate LoRA-style reduced gradient communication. | False |
| activation_recomputation | bool | Whether to trade extra compute for activation memory savings. | False |
| overlap_comm | bool | Whether to hide DP communication behind backward compute. | False |
| overlap_efficiency | float | Fraction of DP communication hidden when overlap is enabled. | 0.85 |
| congestion_factor | float | Multiplier for network congestion or oversubscription beyond fabric metadata. | 1.0 |
| straggler_factor | float | Multiplier for bulk-synchronous slow-worker effects. | 1.0 |
| moe_routing_imbalance_factor | float | Multiplier on MoE routed-token traffic, where 1.0 is balanced. | 1.0 |
| gradient_accumulation_steps | int | Microsteps over which DP communication is amortized. | 1 |
| seq_len | int | Sequence length for activation and routing-volume estimates. | 2048 |
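How several of these parameters might interact can be sketched as follows. This is an illustrative guess at the arithmetic, not the solver's documented behavior: it assumes the DP degree is whatever remains after dividing the fleet by `tp_size * pp_size` (with expert parallelism carved out of the DP group, as is common), and that overlap, congestion, and gradient accumulation combine multiplicatively. Both helper names are hypothetical.

```python
def data_parallel_degree(total_accelerators: int, tp_size: int = 1,
                         pp_size: int = 1) -> int:
    """Derive the DP degree left over after TP and PP (assumed decomposition)."""
    model_parallel = tp_size * pp_size
    if total_accelerators % model_parallel != 0:
        raise ValueError(
            f"{total_accelerators} accelerators cannot be split evenly into "
            f"groups of tp_size * pp_size = {model_parallel}"
        )
    return total_accelerators // model_parallel


def exposed_dp_comm_time(dp_comm_s: float,
                         overlap_comm: bool = False,
                         overlap_efficiency: float = 0.85,
                         gradient_accumulation_steps: int = 1,
                         congestion_factor: float = 1.0) -> float:
    """Per-microstep DP communication time that is not hidden (assumed model).

    - congestion_factor inflates the raw communication time.
    - With overlap enabled, only (1 - overlap_efficiency) of it stays exposed.
    - The gradient sync happens once per accumulation window, so its cost is
      amortized over gradient_accumulation_steps microsteps.
    """
    t = dp_comm_s * congestion_factor
    if overlap_comm:
        t *= 1.0 - overlap_efficiency
    return t / gradient_accumulation_steps


if __name__ == "__main__":
    # Example: 64 accelerators with tp_size=8 and pp_size=4 leave a DP degree of 2.
    print(data_parallel_degree(64, tp_size=8, pp_size=4))
    # Example: 100 ms of raw all-reduce, overlapped and amortized over 4 microsteps.
    print(exposed_dp_comm_time(0.1, overlap_comm=True,
                               gradient_accumulation_steps=4))
```

Under these assumptions, enabling `overlap_comm` and raising `gradient_accumulation_steps` both shrink the exposed communication term, while `congestion_factor` values above 1.0 inflate it.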
Returns
| Name | Type | Description |
|---|---|---|
| | DistributedResult | Metrics including DP/TP/EP latency, pipeline bubble penalty, scaling efficiency, and parallelism. |