core.solver.DistributedModel

core.solver.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver models the scaling constraints of distributed training. It decomposes a workload across a cluster using 3D/4D parallelism (data, tensor, pipeline, and expert parallelism: DP, TP, PP, EP) and calculates the resulting communication overheads and pipeline idle time (bubbles), which together determine the Model FLOPs Utilization (MFU).
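To make the bubble term concrete, here is a minimal sketch of the usual 1F1B estimate. It is not the solver's implementation. It assumes a pipeline of p stages running m microbatches idles for (p - 1) microbatch slots per step, and that an interleaved schedule with v virtual stages divides that idle time by v. The function name `pipeline_bubble_fraction` is hypothetical; its arguments mirror the `pp_size`, `microbatch_count`, and `v_stages` parameters of `solve`.

```python
def pipeline_bubble_fraction(pp_size: int, microbatch_count: int, v_stages: int = 1) -> float:
    """Estimate the idle fraction of one pipeline step (illustrative only).

    Assumption: 1F1B-style schedule with equal-cost microbatches.
    Useful work occupies `microbatch_count` slots and the bubble occupies
    (pp_size - 1) / v_stages slots.
    """
    if pp_size < 1 or microbatch_count < 1 or v_stages < 1:
        raise ValueError("pp_size, microbatch_count, and v_stages must be >= 1")
    bubble_slots = (pp_size - 1) / v_stages
    return bubble_slots / (microbatch_count + bubble_slots)


# Four pipeline stages, eight microbatches.
plain = pipeline_bubble_fraction(pp_size=4, microbatch_count=8)
# Same pipeline with two virtual stages per device.
interleaved = pipeline_bubble_fraction(pp_size=4, microbatch_count=8, v_stages=2)
print(round(plain, 3), round(interleaved, 3))  # 0.273 0.158
```

Under this estimate, raising the microbatch count or the number of virtual stages shrinks the bubble, and a single-stage pipeline has none.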

Literature Sources:

1. Shoeybi et al. (2019), “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” (3D Parallelism Framework)
2. Narayanan et al. (2019), “PipeDream: Generalized Pipeline Parallelism for DNN Training.” (1F1B Pipeline Bubble Model)
3. Patarasuk & Yuan (2009), “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations.” (Ring All-Reduce)
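For orientation, the ring all-reduce cost referenced above can be sketched with the textbook alpha-beta model. This is an illustration under stated assumptions, not the solver's code: each of N ranks performs 2(N - 1) steps (reduce-scatter followed by all-gather), and each step moves a chunk of size S / N over a link with a fixed bandwidth and per-step latency. The function and parameter names are hypothetical.

```python
def ring_all_reduce_seconds(
    message_bytes: float,
    world_size: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Alpha-beta estimate of ring all-reduce time (illustrative only)."""
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    if world_size == 1:
        return 0.0  # a single rank has nothing to exchange
    steps = 2 * (world_size - 1)        # reduce-scatter + all-gather
    chunk = message_bytes / world_size  # bytes sent per step per rank
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


# 1 GB of gradients across 8 ranks on 100 GB/s links, ignoring latency.
t = ring_all_reduce_seconds(1e9, 8, 100e9)
print(f"{t * 1e3:.1f} ms")  # 17.5 ms
```

The bandwidth term approaches 2S / B as N grows, which is why ring all-reduce is described as bandwidth-optimal; the latency term, by contrast, grows linearly with N.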

Methods

Name	Description
solve	Calculates distributed training performance using the 3D/4D Parallelism model.

solve

core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.

Parameters

Name	Type	Description	Default
model	Workload	The model architecture to analyze.	required
fleet	Fleet	The hardware cluster and network topology.	required
batch_size	int	Global batch size.	`1`
precision	str	Numerical precision (fp16, fp32, int8).	`'fp16'`
efficiency	float	Achieved compute efficiency (0.0 to 1.0).	`0.5`
tp_size	int	Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink.	`1`
pp_size	int	Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory.	`1`
ep_size	int	Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes.	`1`
v_stages	int	Number of virtual stages for interleaved pipeline schedules.	`1`
microbatch_count	int	Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead.	`1`
topology_override	str	Force a specific topology (ring, tree).	`None`
zero_stage	int	ZeRO optimization stage (0–3).	`0`
is_lora	bool	Whether to approximate LoRA-style reduced gradient communication.	`False`
activation_recomputation	bool	Whether to trade extra compute for activation memory savings.	`False`
overlap_comm	bool	Whether to hide DP communication behind backward compute.	`False`
overlap_efficiency	float	Fraction of DP communication hidden when overlap is enabled.	`0.85`
congestion_factor	float	Multiplier for network congestion or oversubscription beyond fabric metadata.	`1.0`
straggler_factor	float	Multiplier for bulk-synchronous slow-worker effects.	`1.0`
moe_routing_imbalance_factor	float	Multiplier on MoE routed-token traffic, where `1.0` is balanced.	`1.0`
gradient_accumulation_steps	int	Microsteps over which DP communication is amortized.	`1`
seq_len	int	Sequence length for activation and routing-volume estimates.	`2048`
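The parallelism degrees in the table interact with the cluster size. The sketch below assumes the common convention that the data-parallel degree is the total GPU count divided by `tp_size * pp_size`; this page does not state how `solve` derives it, so treat the rule as an assumption. The helper names `data_parallel_degree` and `check_solve_ranges` are hypothetical, and the range checks only encode the bounds documented above (`efficiency` in 0.0 to 1.0, `zero_stage` in 0 to 3, `overlap_efficiency` as a fraction).

```python
def data_parallel_degree(total_gpus: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """Derive the DP degree left over after TP and PP (assumed convention)."""
    if min(total_gpus, tp_size, pp_size) < 1:
        raise ValueError("total_gpus, tp_size, and pp_size must be >= 1")
    model_parallel = tp_size * pp_size
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"tp_size * pp_size = {model_parallel} does not divide {total_gpus} GPUs"
        )
    return total_gpus // model_parallel


def check_solve_ranges(
    efficiency: float = 0.5,
    zero_stage: int = 0,
    overlap_efficiency: float = 0.85,
) -> None:
    """Reject values outside the ranges given in the parameter table."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be in [0.0, 1.0]")
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if not 0.0 <= overlap_efficiency <= 1.0:
        raise ValueError("overlap_efficiency must be in [0.0, 1.0]")


# 64 GPUs with 8-way tensor and 4-way pipeline parallelism.
check_solve_ranges(efficiency=0.5, zero_stage=1, overlap_efficiency=0.85)
print(data_parallel_degree(64, tp_size=8, pp_size=4))  # 2
```

Running a check like this before calling `solve` surfaces an infeasible configuration early, for example a `tp_size * pp_size` product that does not divide the fleet size.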

Returns

Name	Type	Description
	DistributedResult	Metrics including DP, TP, and EP communication latency, the pipeline bubble penalty, scaling efficiency, and the parallelism settings.