solvers.DistributedModel

solvers.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This model captures how distributed training behaves at scale. It decomposes a workload across a cluster using 3D Parallelism (data, tensor, and pipeline parallelism: DP, TP, PP) and calculates the resulting communication overheads and pipeline idle times (bubbles) that determine the Model FLOPs Utilization (MFU).
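As a rough illustration of the decomposition described above, the sketch below derives the data-parallel degree left over once tensor and pipeline parallelism have claimed their GPUs, and computes MFU from its usual definition. The function names and the `world_size` and `peak_flops_per_gpu` arguments are hypothetical and not part of this API; whether `solve` also divides by `ep_size` is not stated here, so the sketch leaves it out.

```python
def data_parallel_degree(world_size: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """Return the DP degree implied by a TP x PP layout (illustrative helper)."""
    if min(world_size, tp_size, pp_size) < 1:
        raise ValueError("world_size, tp_size and pp_size must all be >= 1")
    model_parallel = tp_size * pp_size  # GPUs needed to hold one model replica
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // model_parallel  # number of model replicas


def model_flops_utilization(
    useful_flops_per_step: float,
    step_seconds: float,
    world_size: int,
    peak_flops_per_gpu: float,
) -> float:
    """MFU = useful model FLOPs per step / peak FLOPs the cluster could deliver."""
    return useful_flops_per_step / (step_seconds * world_size * peak_flops_per_gpu)
```

For example, 64 GPUs with `tp_size=8` and `pp_size=2` leave 4 data-parallel replicas.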

Literature Source:
1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Generalized Pipeline Parallelism for DNN Training.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Yuan (2009), “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
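The two closed-form costs these sources are usually cited for can be sketched as follows. This is a textbook approximation, not necessarily the exact expressions used inside `solve`: the bubble fraction assumes equal-cost stages, the interleaved scaling by `v_stages` is a common approximation rather than something stated on this page, and the `bandwidth_bytes_per_s` and `latency_s` arguments are hypothetical names.

```python
def pipeline_bubble_fraction(pp_size: int, microbatch_count: int, v_stages: int = 1) -> float:
    """Fraction of a step spent idle in a 1F1B pipeline with equal-cost stages.

    With P stages and M microbatches the idle share is (P - 1) / (M + P - 1).
    Interleaving with v virtual stages is approximated by shrinking the
    bubble by a factor of v: (P - 1) / (v * M + P - 1).
    """
    if min(pp_size, microbatch_count, v_stages) < 1:
        raise ValueError("all arguments must be >= 1")
    return (pp_size - 1) / (v_stages * microbatch_count + pp_size - 1)


def ring_allreduce_seconds(
    message_bytes: float,
    n_ranks: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Ring all-reduce cost: 2 * (N - 1) steps, each moving S / N bytes."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    if n_ranks == 1:
        return 0.0  # nothing to reduce across a single rank
    steps = 2 * (n_ranks - 1)  # reduce-scatter phase plus all-gather phase
    per_step_bytes = message_bytes / n_ranks
    return steps * (latency_s + per_step_bytes / bandwidth_bytes_per_s)
```

Under these assumptions, 4 pipeline stages with 8 microbatches idle for 3/11 of the step, which is why raising `microbatch_count` or `v_stages` shrinks the bubble.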

Methods

Name	Description
solve	Calculates distributed training performance using the 3D/4D Parallelism model.

solve

solvers.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.
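A minimal sketch of how `solve` might be driven in a parameter sweep, using only keyword arguments from the signature above. The `sweep_microbatches` helper is hypothetical, and it takes an already constructed solver, `model`, and `fleet`, because this page does not show how `Workload` or `Fleet` objects are built.

```python
from typing import Any, Dict, Iterable


def sweep_microbatches(
    solver: Any,
    model: Any,
    fleet: Any,
    counts: Iterable[int],
    **solve_kwargs: Any,
) -> Dict[int, Dict[str, Any]]:
    """Call solver.solve once per microbatch count and collect the metric dicts.

    `solver` is expected to expose solve(model, fleet, **kwargs) as documented
    above; every other keyword (tp_size, pp_size, ...) is forwarded unchanged.
    """
    results: Dict[int, Dict[str, Any]] = {}
    for m in counts:
        results[m] = solver.solve(model, fleet, microbatch_count=m, **solve_kwargs)
    return results
```

A call such as `sweep_microbatches(solvers.DistributedModel(), model, fleet, [1, 4, 16], tp_size=8, pp_size=4)` would then return one metrics dictionary per microbatch count, which makes it easy to compare the pipeline bubble penalty across settings.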

Parameters

Name	Type	Description	Default
model	Workload	The model architecture to analyze.	required
fleet	Fleet	The hardware cluster and network topology.	required
batch_size	int	Global batch size.	`1`
precision	str	Numerical precision (fp16, fp32, int8).	`'fp16'`
efficiency	float	Achieved compute efficiency (0.0 to 1.0).	`0.5`
tp_size	int	Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink.	`1`
pp_size	int	Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory.	`1`
ep_size	int	Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes.	`1`
v_stages	int	Number of virtual stages for interleaved pipeline schedules.	`1`
microbatch_count	int	Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead.	`1`
topology_override	str	Force a specific topology (ring, tree).	`None`
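zero_stage	int	ZeRO optimization stage (0, 1, 2, 3). Higher stages shard more training state across data-parallel ranks, which reduces per-GPU memory and changes DP communication.	`0`
is_lora	bool	Whether the run uses Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) method.	`False`
activation_recomputation	bool	Whether to trade extra compute (about +33% FLOPs) for activation memory savings.	`False`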
overlap_comm	bool	Whether to overlap DP communication with backward pass compute.	`False`
overlap_efficiency	float	Fraction of communication hidden behind compute (0.0-1.0). Default 0.85 reflects typical Megatron-LM overlap efficiency.	`0.85`
congestion_factor	float	Multiplicative factor on communication time to account for network congestion (1.0 = ideal, 1.5-2.0 = shared fabric, 2.0-3.0 = oversubscribed multi-tenant).	`1.0`
moe_routing_imbalance_factor	float	Multiplier on routed MoE token traffic. A value of 1.0 is perfectly balanced routing; values above 1.0 approximate hot experts.	`1.0`
seq_len	int	Sequence length for memory calculation.	`2048`
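To make the interaction of the communication and recomputation parameters concrete, here is one plausible way they could combine. This is an assumption for illustration only: the page does not state the order in which `solve` applies these factors, and the helper names are hypothetical.

```python
def exposed_comm_seconds(
    comm_seconds: float,
    overlap_comm: bool = False,
    overlap_efficiency: float = 0.85,
    congestion_factor: float = 1.0,
) -> float:
    """Communication time left on the critical path (assumed combination)."""
    if not 0.0 <= overlap_efficiency <= 1.0:
        raise ValueError("overlap_efficiency must be in [0.0, 1.0]")
    if congestion_factor <= 0.0:
        raise ValueError("congestion_factor must be positive")
    congested = comm_seconds * congestion_factor  # slow the link down first
    hidden = overlap_efficiency if overlap_comm else 0.0  # no overlap, nothing hidden
    return congested * (1.0 - hidden)


def compute_seconds(base_seconds: float, activation_recomputation: bool = False) -> float:
    """Apply the +33% FLOPs cost quoted in the table when recomputation is on."""
    return base_seconds * 1.33 if activation_recomputation else base_seconds
```

With these assumptions, 10 s of DP communication on a shared fabric (`congestion_factor=2.0`) with overlap enabled at the default 0.85 leaves about 3 s exposed, while the same call with `overlap_comm=False` leaves the full 20 s.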

Returns

Name	Type	Description
	Dict[str, Any]	Metrics including DP/TP/EP latency, the Pipeline Bubble penalty, and the final Scaling Efficiency.