core.solver.DistributedModel

core.solver.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver models the performance constraints of large-scale distributed training. It decomposes a workload across a cluster using 3D parallelism (data, tensor, and pipeline parallelism: DP, TP, PP) and calculates the resulting communication overheads and idle times (pipeline bubbles) that determine the Model FLOPs Utilization (MFU).
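As a rough illustration of the MFU metric mentioned above, here is a minimal sketch. The helper name and inputs are assumptions for illustration; the solver's internal formula may differ.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs per second divided by
    the fleet's aggregate peak FLOPs per second."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)
```

For example, a step that needs 6e15 FLOPs and takes 1 s on 8 GPUs rated at 1e15 FLOP/s each gives an MFU of 0.75.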

Literature Sources:

1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Generalized Pipeline Parallelism for DNN Training.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Mueller (2009), “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
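For reference, the bandwidth-optimal ring all-reduce of Patarasuk & Mueller moves 2(p−1)/p of the payload over each link, so its bandwidth term can be sketched as below. The helper name is illustrative, and latency (alpha) terms are ignored.

```python
def ring_allreduce_time(bytes_per_gpu: float, p: int,
                        link_bw_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce over p ranks: each rank sends
    2*(p-1)/p * N bytes (reduce-scatter + all-gather phases)."""
    return 2 * (p - 1) / p * bytes_per_gpu / link_bw_bytes_per_s
```

For example, all-reducing 1 GB across 4 GPUs over 1 GB/s links takes about 1.5 s in this model; note the per-rank traffic approaches 2N as p grows, which is why the cost is nearly independent of cluster size.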

Methods

| Name | Description |
|------|-------------|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |

solve

core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
)

Calculates distributed training performance using the 3D/4D Parallelism model.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | Workload | The model architecture to simulate. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
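The interaction between pp_size, v_stages, and microbatch_count can be sketched with the standard 1F1B bubble ratio (p − 1)/(v · m): more microbatches or more interleaved virtual stages shrink the bubble. The helper name is illustrative; the solver may use a refined formula.

```python
def pipeline_bubble_ratio(pp_size: int, microbatch_count: int,
                          v_stages: int = 1) -> float:
    """Bubble-to-compute time ratio of a 1F1B pipeline schedule.
    Interleaving with v_stages virtual stages divides the bubble by v_stages."""
    return (pp_size - 1) / (v_stages * microbatch_count)
```

For example, pp_size=4 with 8 microbatches gives a 0.375 bubble ratio, and adding v_stages=2 halves it to 0.1875.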

Returns

| Name | Type | Description |
|------|------|-------------|
| | Dict[str, Any] | Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency. |
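One illustrative way such a scaling-efficiency figure can be derived from the per-component latencies is shown below. This decomposition is an assumption for illustration, not necessarily the solver's exact formula.

```python
def scaling_efficiency(t_compute: float, t_comm: float,
                       t_bubble: float) -> float:
    """Fraction of the step time spent on useful compute, treating
    communication and pipeline-bubble time as pure overhead."""
    return t_compute / (t_compute + t_comm + t_bubble)
```

For example, 0.8 s of compute with 0.1 s of communication and 0.1 s of bubble yields 80% scaling efficiency.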