core.solver.DistributedModel

core.solver.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver models the constraints that arise when training is scaled across a cluster. It decomposes a workload using 3D/4D parallelism (data, tensor, pipeline, and expert parallelism: DP, TP, PP, EP) and calculates the resulting communication overheads and idle times (pipeline bubbles), which together determine the Model FLOPs Utilization (MFU).

Literature Sources:

1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Efficient Pipeline Parallelism for Training Large Models.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Yuan (2009), “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
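The two cost models named in the sources above can be sketched with their textbook forms. This is a minimal illustration, not the solver's confirmed implementation: the function names and the `latency_s` term are assumptions, and the formulas are the standard ring all-reduce cost (2(N-1) steps, each moving 1/N of the message) and the 1F1B bubble fraction (p-1)/(v·m + p-1) with optional interleaving.

```python
def ring_allreduce_seconds(message_bytes: float, n_ranks: int,
                           bandwidth_bytes_per_s: float,
                           latency_s: float = 0.0) -> float:
    """Textbook ring all-reduce time (hypothetical helper, not the library API).

    The ring runs a reduce-scatter followed by an all-gather: 2 * (N - 1)
    steps, each transferring message_bytes / N over one link.
    """
    if n_ranks <= 1:
        return 0.0  # nothing to synchronize on a single rank
    steps = 2 * (n_ranks - 1)
    chunk_bytes = message_bytes / n_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def pipeline_bubble_fraction(pp_size: int, microbatch_count: int,
                             v_stages: int = 1) -> float:
    """Idle fraction of a pipelined step (hypothetical helper).

    With p stages, m microbatches and v virtual stages per device, the
    bubble costs (p - 1) / v microbatch slots against m useful slots.
    """
    if pp_size <= 1:
        return 0.0  # no pipeline, no bubble
    bubble_slots = pp_size - 1
    return bubble_slots / (v_stages * microbatch_count + bubble_slots)


if __name__ == "__main__":
    # 1 GB of gradients across 8 ranks on 100 GB/s links.
    print(ring_allreduce_seconds(1e9, 8, 1e11))
    # 4 pipeline stages, 8 microbatches, with and without interleaving.
    print(pipeline_bubble_fraction(4, 8), pipeline_bubble_fraction(4, 8, v_stages=2))
```

Both helpers show why the corresponding parameters matter: raising `microbatch_count` or `v_stages` shrinks the bubble, while the all-reduce volume per rank approaches twice the message size as the ring grows.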

Methods

| Name | Description |
| --- | --- |
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |

solve

core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.
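As a rough illustration of how a cluster might be decomposed under this model, the sketch below derives the data-parallel degree from a GPU count and the tensor and pipeline degrees. The helper name, the `world_size` argument, and the rule that expert parallelism is carved out of the data-parallel dimension are all assumptions made for illustration; the solver's actual decomposition is not documented here.

```python
def data_parallel_degree(world_size: int, tp_size: int = 1,
                         pp_size: int = 1, ep_size: int = 1) -> int:
    """Derive the DP degree implied by a GPU count (hypothetical helper).

    Assumes world_size = dp * tp * pp, and that expert-parallel groups are
    formed inside the data-parallel dimension, so ep_size must divide dp.
    """
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    dp_size = world_size // model_parallel
    if dp_size % ep_size != 0:
        raise ValueError(f"ep_size={ep_size} does not divide dp_size={dp_size}")
    return dp_size


if __name__ == "__main__":
    # 64 GPUs with TP=4 and PP=2 leave 8 data-parallel replicas.
    print(data_parallel_degree(64, tp_size=4, pp_size=2))
```

Checking divisibility up front is a cheap way to reject configurations that cannot be laid out on the fleet before any cost estimate is attempted.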

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Workload | The model architecture to analyze. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
| zero_stage | int | ZeRO optimization stage (0–3). | 0 |
| is_lora | bool | Whether to approximate LoRA-style reduced gradient communication. | False |
| activation_recomputation | bool | Whether to trade extra compute for activation memory savings. | False |
| overlap_comm | bool | Whether to hide DP communication behind backward compute. | False |
| overlap_efficiency | float | Fraction of DP communication hidden when overlap is enabled. | 0.85 |
| congestion_factor | float | Multiplier for network congestion or oversubscription beyond fabric metadata. | 1.0 |
| straggler_factor | float | Multiplier for bulk-synchronous slow-worker effects. | 1.0 |
| moe_routing_imbalance_factor | float | Multiplier on MoE routed-token traffic, where 1.0 is balanced. | 1.0 |
| gradient_accumulation_steps | int | Microsteps over which DP communication is amortized. | 1 |
| seq_len | int | Sequence length for activation and routing-volume estimates. | 2048 |
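One plausible reading of how the overlap, congestion, and accumulation parameters interact is sketched below. The helper name, the `dp_comm_s` input, and the order in which the factors are combined are assumptions made for illustration; the solver's internal formula is not stated in this reference and may differ.

```python
def exposed_dp_comm_seconds(dp_comm_s: float,
                            overlap_comm: bool = False,
                            overlap_efficiency: float = 0.85,
                            congestion_factor: float = 1.0,
                            gradient_accumulation_steps: int = 1) -> float:
    """Estimate DP communication time left on the critical path per microstep.

    Hypothetical combination of the documented parameters:
      1. congestion_factor inflates the raw communication time;
      2. the cost is amortized over gradient_accumulation_steps;
      3. if overlap is enabled, overlap_efficiency of the remainder is
         hidden behind backward compute.
    """
    if gradient_accumulation_steps < 1:
        raise ValueError("gradient_accumulation_steps must be >= 1")
    if not 0.0 <= overlap_efficiency <= 1.0:
        raise ValueError("overlap_efficiency must be in [0, 1]")
    per_microstep = dp_comm_s * congestion_factor / gradient_accumulation_steps
    if overlap_comm:
        per_microstep *= 1.0 - overlap_efficiency
    return per_microstep


if __name__ == "__main__":
    # 1 s of raw all-reduce time with default overlap leaves about 0.15 s exposed.
    print(exposed_dp_comm_seconds(1.0, overlap_comm=True))
```

Under this reading, with the defaults (`overlap_comm=False`, all multipliers at 1.0) the raw DP communication time passes through unchanged, which matches the neutral default values in the table.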

Returns

| Name | Type | Description |
| --- | --- | --- |
| | DistributedResult | Metrics including DP/TP/EP latency, pipeline bubble penalty, scaling efficiency, and parallelism. |