solvers.DistributedModel
solvers.DistributedModel()
Resolves fleet-wide communication, synchronization, and pipelining constraints.
This model analyzes the scaling constraints of distributed training. It decomposes a workload across a cluster using 3D Parallelism (data, tensor, and pipeline parallelism: DP, TP, PP) and calculates the resulting communication overheads and pipeline idle times (bubbles) that determine the Model FLOPs Utilization (MFU).
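The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the solver's actual code: it assumes the DP degree is whatever remains of the fleet after TP and PP groups are formed, and it uses the standard 1F1B bubble fraction (p - 1) / (m + p - 1). The helper names `data_parallel_degree` and `pipeline_bubble_fraction` are hypothetical.

```python
def data_parallel_degree(world_size: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """DP degree left over once TP x PP model-parallel groups are carved out.

    Assumption: world_size = dp_size * tp_size * pp_size (no expert parallelism).
    """
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // model_parallel


def pipeline_bubble_fraction(pp_size: int, microbatch_count: int) -> float:
    """Fraction of a training step spent idle in a non-interleaved 1F1B schedule."""
    if pp_size < 1 or microbatch_count < 1:
        raise ValueError("pp_size and microbatch_count must be >= 1")
    return (pp_size - 1) / (microbatch_count + pp_size - 1)


# 64 accelerators split as TP=8, PP=4 leaves 2 data-parallel replicas.
dp = data_parallel_degree(64, tp_size=8, pp_size=4)
# With 4 pipeline stages and 12 microbatches, 3 of 15 time slots are idle.
bubble = pipeline_bubble_fraction(pp_size=4, microbatch_count=12)
print(dp, bubble)
```

Raising `microbatch_count` shrinks the bubble toward zero, which is why the parameter appears alongside `pp_size` in `solve`.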
Literature Source:
1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Efficient Pipeline Parallelism for Training Large Models.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Mueller (2009), “Bandwidth-Optimal All-Reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
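For orientation, the bandwidth term of the ring all-reduce cost cited above can be written as 2 (N - 1) / N × S / B, where N is the number of ranks, S the message size, and B the per-link bandwidth. The sketch below assumes this bandwidth-only form and ignores per-hop latency; `ring_allreduce_seconds` is a hypothetical helper and may differ from what the solver computes internally.

```python
def ring_allreduce_seconds(message_bytes: float, num_ranks: int,
                           link_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time.

    Each rank sends and receives 2 * (N - 1) / N of the message:
    (N - 1) / N during reduce-scatter and (N - 1) / N during all-gather.
    Per-hop latency is deliberately ignored in this sketch.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    if link_bandwidth_bytes_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    if num_ranks == 1:
        return 0.0  # nothing to synchronize with a single rank
    volume_factor = 2.0 * (num_ranks - 1) / num_ranks
    return volume_factor * message_bytes / link_bandwidth_bytes_per_s


# 1 GB of gradients across 8 ranks on 100 GB/s links.
t = ring_allreduce_seconds(1e9, num_ranks=8, link_bandwidth_bytes_per_s=100e9)
print(f"{t * 1e3:.2f} ms")
```

The volume factor approaches 2 as N grows, so the bandwidth cost per rank is nearly independent of cluster size.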
Methods
| Name | Description |
|---|---|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |
solve
solvers.DistributedModel.solve(
model,
fleet,
batch_size=1,
precision='fp16',
efficiency=0.5,
tp_size=1,
pp_size=1,
ep_size=1,
v_stages=1,
microbatch_count=1,
topology_override=None,
zero_stage=0,
is_lora=False,
activation_recomputation=False,
overlap_comm=False,
overlap_efficiency=0.85,
congestion_factor=1.0,
straggler_factor=1.0,
moe_routing_imbalance_factor=1.0,
gradient_accumulation_steps=1,
seq_len=2048,
)
Calculates distributed training performance using the 3D/4D Parallelism model.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture to analyze. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
| zero_stage | int | ZeRO optimization stage (0, 1, 2, 3) for sharding memory and altering DP comms. | 0 |
| is_lora | bool | Whether using Low-Rank Adaptation (PEFT). | False |
| activation_recomputation | bool | Whether to recompute activations during the backward pass, trading extra compute (about +33% FLOPs) for activation memory savings. | False |
| overlap_comm | bool | Whether to overlap DP communication with backward pass compute. | False |
| overlap_efficiency | float | Fraction of communication hidden behind compute (0.0-1.0). Default 0.85 reflects typical Megatron-LM overlap efficiency. | 0.85 |
| congestion_factor | float | Multiplicative factor on communication time to account for network congestion (1.0 = ideal, 1.5-2.0 = shared fabric, 2.0-3.0 = oversubscribed multi-tenant). | 1.0 |
| moe_routing_imbalance_factor | float | Multiplier on routed MoE token traffic. A value of 1.0 is perfectly balanced routing; values above 1.0 approximate hot experts. | 1.0 |
| seq_len | int | Sequence length for memory calculation. | 2048 |
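As a rough illustration of how `congestion_factor`, `overlap_comm`, and `overlap_efficiency` could interact, the sketch below scales raw communication time by the congestion factor and then hides a fraction of it behind compute when overlap is enabled. The way these parameters are combined here is an assumption made for illustration, and `exposed_comm_seconds` is a hypothetical helper; the solver's internal formula may differ.

```python
def exposed_comm_seconds(raw_comm_seconds: float,
                         congestion_factor: float = 1.0,
                         overlap_comm: bool = False,
                         overlap_efficiency: float = 0.85) -> float:
    """Communication time left on the critical path (illustrative assumption).

    1. Congestion inflates the ideal communication time multiplicatively.
    2. If overlap is enabled, `overlap_efficiency` of that time is hidden
       behind backward-pass compute; the remainder stays exposed.
    """
    if congestion_factor < 1.0:
        raise ValueError("congestion_factor is expected to be >= 1.0")
    if not 0.0 <= overlap_efficiency <= 1.0:
        raise ValueError("overlap_efficiency must be in [0.0, 1.0]")
    congested = raw_comm_seconds * congestion_factor
    if not overlap_comm:
        return congested
    return congested * (1.0 - overlap_efficiency)


# 1 s of ideal DP communication on a shared fabric (1.5x congestion).
no_overlap = exposed_comm_seconds(1.0, congestion_factor=1.5)
with_overlap = exposed_comm_seconds(1.0, congestion_factor=1.5, overlap_comm=True)
print(no_overlap, with_overlap)
```

Under these assumptions, enabling overlap at the default efficiency leaves 15% of the congested communication time exposed.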
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency. |