solvers.DistributedModel

solvers.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This model analyzes how distributed training behaves at scale. It decomposes a workload across a cluster using 3D parallelism (data, tensor, and pipeline parallelism: DP, TP, PP) and calculates the resulting communication overheads and pipeline idle times (bubbles) that determine the Model FLOPs Utilization (MFU).
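As a rough illustration of the decomposition, the sketch below derives the data-parallel degree left over once tensor and pipeline parallelism have claimed their share of the cluster. The function name `data_parallel_degree` and the `world_size` argument are hypothetical and not part of this API; the relation DP = world size / (TP × PP) is the usual 3D parallelism convention and is assumed here, not confirmed from the solver's source.

```python
def data_parallel_degree(world_size: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """Return the DP degree implied by a TP x PP model-parallel group.

    Hypothetical helper: assumes every accelerator belongs to exactly one
    (DP, TP, PP) coordinate, so the three degrees multiply to world_size.
    """
    if world_size < 1 or tp_size < 1 or pp_size < 1:
        raise ValueError("all sizes must be positive integers")
    model_parallel = tp_size * pp_size  # accelerators holding one model replica
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // model_parallel


# 1024 accelerators with TP=8 and PP=4 leave 32 data-parallel replicas.
print(data_parallel_degree(1024, tp_size=8, pp_size=4))
```

Under this assumption, raising `tp_size` or `pp_size` on a fixed cluster shrinks the number of data-parallel replicas, which in turn changes how much gradient traffic each replica exchanges.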

Literature Sources:
1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Efficient Pipeline Parallelism for Training Large Models.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Mueller (2009), “Bandwidth-Optimal All-Reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
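For orientation, the ring all-reduce cost is usually estimated with the standard bandwidth-plus-latency model: 2(N − 1) steps, each moving one chunk of size S/N. The sketch below implements that textbook estimate. The function name and arguments are hypothetical, and whether the solver uses exactly this form is an assumption.

```python
def ring_allreduce_seconds(
    message_bytes: float,
    n_ranks: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Estimate ring all-reduce time with the standard alpha-beta cost model.

    Hypothetical helper: a reduce-scatter followed by an all-gather, each
    taking (n_ranks - 1) steps that send one chunk of message_bytes / n_ranks.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    if n_ranks == 1:
        return 0.0  # nothing to communicate with a single rank
    steps = 2 * (n_ranks - 1)
    chunk_bytes = message_bytes / n_ranks
    return steps * chunk_bytes / bandwidth_bytes_per_s + steps * latency_s


# 1 GB of gradients across 8 ranks on a 100 GB/s link, ignoring latency.
print(ring_allreduce_seconds(1e9, 8, 100e9))
```

The bandwidth term approaches 2S/B as the number of ranks grows, which is why ring all-reduce is described as bandwidth-optimal; the latency term, by contrast, grows linearly with the number of ranks.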

Methods

| Name | Description |
|------|-------------|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |

solve

solvers.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | Workload | The model architecture to analyze. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
| zero_stage | int | ZeRO optimization stage (0, 1, 2, 3) for sharding memory and altering DP communication. | 0 |
| is_lora | bool | Whether Low-Rank Adaptation (LoRA, a PEFT method) is used. | False |
| activation_recomputation | bool | Whether to trade extra compute (+33% FLOPs) for activation memory savings. | False |
| overlap_comm | bool | Whether to overlap DP communication with backward-pass compute. | False |
| overlap_efficiency | float | Fraction of communication hidden behind compute (0.0 to 1.0). The default of 0.85 reflects typical Megatron-LM overlap efficiency. | 0.85 |
| congestion_factor | float | Multiplicative factor on communication time to account for network congestion (1.0 = ideal, 1.5-2.0 = shared fabric, 2.0-3.0 = oversubscribed multi-tenant). | 1.0 |
| moe_routing_imbalance_factor | float | Multiplier on routed MoE token traffic. A value of 1.0 is perfectly balanced routing; values above 1.0 approximate hot experts. | 1.0 |
| seq_len | int | Sequence length for memory calculation. | 2048 |
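To make the interaction between pp_size, microbatch_count, and v_stages concrete, the sketch below uses the commonly cited bubble estimate for 1F1B-style schedules: the pipeline idles for (P − 1)/V microbatch slots per step, while useful work occupies M slots. The function name is hypothetical, and the exact expression used inside solve may differ.

```python
def pipeline_bubble_fraction(
    pp_size: int, microbatch_count: int, v_stages: int = 1
) -> float:
    """Fraction of a training step that a pipeline stage spends idle.

    Hypothetical helper based on the commonly cited estimate:
    idle slots = (pp_size - 1) / v_stages, useful slots = microbatch_count.
    """
    if pp_size < 1 or microbatch_count < 1 or v_stages < 1:
        raise ValueError("all arguments must be positive integers")
    idle_slots = (pp_size - 1) / v_stages  # warm-up plus drain, in microbatch units
    return idle_slots / (microbatch_count + idle_slots)


# With the default microbatch_count=1, a 4-stage pipeline idles 75% of the time.
print(pipeline_bubble_fraction(pp_size=4, microbatch_count=1))
# More microbatches, or interleaving with v_stages=2, shrink the bubble.
print(pipeline_bubble_fraction(pp_size=4, microbatch_count=16))
print(pipeline_bubble_fraction(pp_size=4, microbatch_count=16, v_stages=2))
```

Under this estimate, the default microbatch_count=1 is a poor fit for any pp_size above 1, so pipeline-parallel configurations are normally evaluated with M well above the number of stages.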

Returns

| Name | Type | Description |
|------|------|-------------|
|  | Dict[str, Any] | Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency. |
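One plausible way the communication and bubble terms could combine into a scaling efficiency is sketched below. Every name here is hypothetical, and the formulas are assumptions made for illustration: congestion is treated as a multiplier on communication time, overlap hides a fixed fraction of it, and the bubble stretches the step. The actual keys of the returned dictionary and the solver's internal arithmetic are not documented on this page.

```python
def exposed_comm_seconds(
    comm_s: float,
    congestion_factor: float = 1.0,
    overlap_comm: bool = False,
    overlap_efficiency: float = 0.85,
) -> float:
    """Communication time left on the critical path (illustrative assumption)."""
    comm = comm_s * congestion_factor  # slower network inflates raw comm time
    if overlap_comm:
        comm *= 1.0 - overlap_efficiency  # only the non-hidden remainder is exposed
    return comm


def scaling_efficiency(
    compute_s: float, exposed_comm_s: float, bubble_fraction: float = 0.0
) -> float:
    """Useful compute time divided by total step time (illustrative assumption)."""
    if not 0.0 <= bubble_fraction < 1.0:
        raise ValueError("bubble_fraction must be in [0, 1)")
    busy_s = compute_s + exposed_comm_s
    step_s = busy_s / (1.0 - bubble_fraction)  # idle time stretches the step
    return compute_s / step_s


# 1.0 s of compute, 0.2 s of raw communication on a shared fabric, with overlap.
comm = exposed_comm_seconds(0.2, congestion_factor=1.5, overlap_comm=True)
print(scaling_efficiency(1.0, comm, bubble_fraction=0.1))
```

In this sketch, enabling overlap_comm with the default overlap_efficiency of 0.85 leaves 15% of the congested communication time exposed, which is the portion that reduces scaling efficiency.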