core.solver.DistributedModel
`core.solver.DistributedModel()`

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver models the communication and scheduling constraints of large-scale distributed training. It decomposes a workload across a cluster using 3D Parallelism (DP, TP, PP) and calculates the resulting communication overheads and idle times (bubbles) that determine the Model FLOPs Utilization (MFU).
Literature Sources:

1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Generalized Pipeline Parallelism for DNN Training.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Mueller (2009), “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
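The two analytical results cited above can be re-derived in a few lines. The sketch below is illustrative only and is not the solver's actual implementation; the function names are hypothetical.

```python
def ring_allreduce_time(n_gpus: int, tensor_bytes: float, bandwidth_bps: float) -> float:
    """Bandwidth-optimal ring all-reduce (Patarasuk & Mueller, 2009).

    Each GPU transfers 2 * (n - 1) / n of the tensor, so the transfer
    time approaches 2 * S / B as the ring grows, independent of n.
    """
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes / bandwidth_bps


def pipeline_bubble_fraction(pp_size: int, microbatches: int) -> float:
    """Idle fraction of a 1F1B pipeline schedule (PipeDream-style).

    bubble = (p - 1) / (m + p - 1): with p stages, the ramp-up and
    ramp-down phases leave p - 1 microbatch slots partially empty.
    """
    return (pp_size - 1) / (microbatches + pp_size - 1)


# 8 pipeline stages with 32 microbatches -> ~18% of the schedule is idle
print(round(pipeline_bubble_fraction(8, 32), 3))  # 0.179
```

Note how increasing the microbatch count `m` shrinks the bubble, which is why `microbatch_count` is a first-class parameter of `solve` below.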
Methods
| Name | Description |
|---|---|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |
solve
core.solver.DistributedModel.solve(
model,
fleet,
batch_size=1,
precision='fp16',
efficiency=0.5,
tp_size=1,
pp_size=1,
ep_size=1,
v_stages=1,
microbatch_count=1,
topology_override=None,
)

Calculates distributed training performance using the 3D/4D Parallelism model.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture to simulate. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | Optional[str] | Force a specific communication topology (`ring` or `tree`) instead of auto-selecting. | None |
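The three parallelism degrees must jointly divide the cluster: the data-parallel degree is whatever remains after the tensor- and pipeline-parallel degrees are fixed. A minimal sketch of this constraint (the function is illustrative, not part of `core.solver`):

```python
def data_parallel_size(world_size: int, tp_size: int, pp_size: int) -> int:
    """Data-parallel replicas left over after TP and PP claim their GPUs."""
    denom = tp_size * pp_size
    if world_size % denom != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // denom


# 512 GPUs with TP=8 (one NVLink node) and PP=8 leaves DP=8 replicas
print(data_parallel_size(512, 8, 8))  # 8
```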
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency. |
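The returned metrics can be combined into the headline MFU figure using its standard definition (useful model FLOPs divided by peak fleet FLOPs). A hedged sketch; the argument names are assumptions, not the solver's actual output keys:

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: useful FLOPs / (step time * peak fleet FLOPs)."""
    return model_flops_per_step / (step_time_s * n_gpus * peak_flops_per_gpu)


# 1e15 useful FLOPs per step, 0.5 s step time, 8 GPUs at 312 TFLOP/s peak
print(round(mfu(1e15, 0.5, 8, 312e12), 3))  # 0.801
```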