core.solver

Classes

Name	Description
SingleNodeModel	Resolves single-node hardware Roofline bounds and feasibility.
ServingModel	Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.
TrainingMemoryModel	Decomposes per-accelerator training memory into teachable components.
ServingCapacityModel	Sizes an LLM serving deployment from a QPS and tail-latency target.
ContinuousBatchingModel	Analyzes production LLM serving with Continuous Batching and PagedAttention.
WeightStreamingModel	Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.
TailLatencyModel	Analyzes queueing delays and P99 tail latency for deployed inference (M/M/c).
DistributedModel	Resolves fleet-wide communication, synchronization, and pipelining constraints.
MoERoutingModel	Models first-order MoE routing imbalance and expert-parallel all-to-all cost.
ReliabilityModel	Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.
CheckpointModel	Analyzes checkpoint I/O burst penalties and MFU impact.
EconomicsModel	Calculates Total Cost of Ownership (TCO) including Capex and Opex.
SustainabilityModel	Calculates datacenter-scale sustainability metrics.

DistributedModel

core.solver.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver analyzes the constraints of training at distributed scale. It decomposes a workload across a cluster using 3D/4D parallelism (DP, TP, PP, EP) and calculates the resulting communication overheads and pipeline idle time (bubbles), which together determine Model FLOPs Utilization (MFU).
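As a rough illustration of where the DP communication overhead comes from, here is a minimal sketch assuming TP and PP claim GPUs first, DP takes the remainder, and gradients are synchronized with a bandwidth-only ring all-reduce. The helper names, the 7B-parameter model, and the 100 GB/s bandwidth are hypothetical, not part of `core.solver`; EP traffic, latency terms, topology choice, and congestion are ignored, and the solver's own model may differ.

```python
# Hypothetical helpers for illustration; not the core.solver API.

def data_parallel_degree(num_gpus: int, tp_size: int = 1, pp_size: int = 1) -> int:
    """DP degree left over once TP and PP have claimed their GPUs (EP ignored)."""
    model_parallel = tp_size * pp_size
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp_size * pp_size")
    return num_gpus // model_parallel


def ring_allreduce_seconds(message_bytes: float, world_size: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce (reduce-scatter + all-gather).

    Each worker sends, and receives, 2 * (N - 1) / N of the message.
    Per-message latency and network congestion are ignored.
    """
    if world_size <= 1:
        return 0.0
    return 2 * (world_size - 1) / world_size * message_bytes / bandwidth_bytes_per_s


# Example: a hypothetical 7B-parameter model with fp16 gradients (2 bytes each).
dp = data_parallel_degree(64, tp_size=8, pp_size=2)   # 4 DP replicas
grad_bytes = 7e9 * 2
t_dp = ring_allreduce_seconds(grad_bytes, dp, 100e9)  # assumed 100 GB/s per worker
print(f"DP={dp}, all-reduce ~{t_dp:.3f} s per optimizer step")
```

The 2(N-1)/N factor approaches 2 as the DP degree grows, which is why gradient synchronization cost stays roughly constant per step rather than shrinking with more replicas.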

Methods

Name	Description
solve	Calculates distributed training performance using the 3D/4D Parallelism model.

solve

core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.

Parameters

Name	Type	Description	Default
model	Workload	The model architecture to analyze.	required
fleet	Fleet	The hardware cluster and network topology.	required
batch_size	int	Global batch size.	`1`
precision	str	Numerical precision (fp16, fp32, int8).	`'fp16'`
efficiency	float	Achieved compute efficiency (0.0 to 1.0).	`0.5`
tp_size	int	Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink.	`1`
pp_size	int	Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory.	`1`
ep_size	int	Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes.	`1`
v_stages	int	Number of virtual stages for interleaved pipeline schedules.	`1`
microbatch_count	int	Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead.	`1`
topology_override	str	Force a specific topology (ring, tree).	`None`
zero_stage	int	ZeRO optimization stage (0–3).	`0`
is_lora	bool	Whether to approximate LoRA-style reduced gradient communication.	`False`
activation_recomputation	bool	Whether to trade extra compute for activation memory savings.	`False`
overlap_comm	bool	Whether to hide DP communication behind backward compute.	`False`
overlap_efficiency	float	Fraction of DP communication hidden when overlap is enabled.	`0.85`
congestion_factor	float	Multiplier for network congestion or oversubscription beyond fabric metadata.	`1.0`
straggler_factor	float	Multiplier for bulk-synchronous slow-worker effects.	`1.0`
moe_routing_imbalance_factor	float	Multiplier on MoE routed-token traffic, where `1.0` is balanced.	`1.0`
gradient_accumulation_steps	int	Microsteps over which DP communication is amortized.	`1`
seq_len	int	Sequence length for activation and routing-volume estimates.	`2048`

Returns

Name	Type	Description
	DistributedResult	Metrics including DP/TP/EP latency, pipeline bubble penalty, scaling efficiency, and parallelism.
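The `pp_size`, `v_stages`, `microbatch_count`, `overlap_comm`, `overlap_efficiency`, and `gradient_accumulation_steps` parameters interact through two first-order effects, sketched below. This is a minimal sketch assuming a 1F1B-style pipeline schedule whose bubble shrinks by the number of interleaved virtual stages, and assuming overlap hides a fixed fraction of DP communication; the function names are hypothetical, and the solver's exact accounting may differ.

```python
# Hypothetical helpers for illustration; not the core.solver API.

def pipeline_bubble_fraction(pp_size: int, microbatch_count: int,
                             v_stages: int = 1) -> float:
    """Idle fraction of a pipeline step under a 1F1B-style schedule.

    The bubble costs (p - 1) microbatch slots, divided by v when each
    device hosts v interleaved virtual stages.
    """
    bubble_slots = (pp_size - 1) / v_stages
    return bubble_slots / (microbatch_count + bubble_slots)


def exposed_dp_comm_seconds(comm_s: float, overlap_comm: bool = False,
                            overlap_efficiency: float = 0.85,
                            gradient_accumulation_steps: int = 1) -> float:
    """DP communication left on the critical path, amortized per microstep."""
    exposed = comm_s * (1.0 - overlap_efficiency) if overlap_comm else comm_s
    return exposed / gradient_accumulation_steps


print(pipeline_bubble_fraction(4, 1))       # 0.75: most of the step is idle
print(pipeline_bubble_fraction(4, 16))      # ~0.158
print(pipeline_bubble_fraction(4, 16, 2))   # ~0.086 with 2 virtual stages
```

Raising M from 1 to 16 cuts the idle fraction from 75% to about 16%, and interleaving roughly halves it again, at the price of more point-to-point transfers between stages.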

EconomicsModel

core.solver.EconomicsModel()

Calculates Total Cost of Ownership (TCO) including Capex and Opex.

Combines hardware costs, energy consumption, and maintenance into a single financial model for the fleet. This solver exposes the ROI of architectural efficiency by showing how reducing power draw or increasing throughput directly impacts the bottom line.
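A minimal sketch of how those pieces can combine, assuming the full purchase price is counted as CapEx (no amortization), energy is billed at facility level (IT power times PUE), and maintenance is a fixed annual fraction of CapEx. All names and numbers are hypothetical; the solver's own cost breakdown may differ.

```python
# Hypothetical toy TCO model; not the core.solver API.

def fleet_tco(capex: float, it_power_kw: float, duration_days: float,
              kwh_price: float = 0.12, pue: float = 1.0,
              annual_maintenance_rate: float = 0.0) -> dict:
    """Full purchase price plus energy and maintenance over the window.

    All costs are in the same currency as kwh_price.
    """
    hours = 24 * duration_days
    energy_kwh = it_power_kw * pue * hours                  # facility-level energy
    energy_cost = energy_kwh * kwh_price
    maintenance_cost = capex * annual_maintenance_rate * duration_days / 365
    opex = energy_cost + maintenance_cost
    return {"capex": capex, "opex": opex, "tco": capex + opex}


# Hypothetical fleet: 30M purchase price, 1 MW IT load, PUE 1.2, one year,
# 5% annual maintenance.
result = fleet_tco(capex=30e6, it_power_kw=1000, duration_days=365,
                   pue=1.2, annual_maintenance_rate=0.05)
print(result)
```

In this toy model, cutting power draw by 10% lowers only the energy term, while higher throughput pays off by letting a smaller fleet (lower CapEx) finish the same work.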

Methods

Name	Description
solve	Calculates the TCO for a fleet over a specified duration.

solve

core.solver.EconomicsModel.solve(fleet, duration_days, kwh_price=0.12)

Calculates the TCO for a fleet over a specified duration.

Parameters

Name	Type	Description	Default
fleet	Fleet	The hardware cluster configuration.	required
duration_days	float	Operation duration in days.	required
kwh_price	float	Price of electricity per kWh, by default 0.12.	`0.12`

Returns

Name	Type	Description
	Dict[str, Any]	Financial metrics including CapEx, OpEx, and total TCO.

ReliabilityModel

core.solver.ReliabilityModel()

Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This solver models the reliability of large clusters, helping determine the ‘Goodput’ of long-running training jobs. It estimates the probability that a job fails before completion and calculates the Young-Daly optimal checkpoint interval, which minimizes compute time wasted on checkpointing and on recomputing lost work.
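The three quantities above follow from textbook formulas when failures are assumed independent and exponentially distributed. The sketch below makes exactly that assumption; the per-node MTBF of 50,000 hours and the helper names are hypothetical, and the solver may use different inputs or refinements.

```python
import math

# Hypothetical helpers for illustration; not the core.solver API.


def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """With independent exponential failures, failure rates add across nodes."""
    return node_mtbf_hours / num_nodes


def job_failure_probability(job_duration_hours: float, mtbf_hours: float) -> float:
    """P(at least one failure before the job finishes)."""
    return 1.0 - math.exp(-job_duration_hours / mtbf_hours)


def young_daly_interval_s(checkpoint_time_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)


mtbf_h = fleet_mtbf_hours(50_000, 1_000)              # 50 h for the whole fleet
p_fail = job_failure_probability(24, mtbf_h)           # ~0.38 for a one-day job
interval = young_daly_interval_s(60.0, mtbf_h * 3600)  # ~4648 s, about 77 min
print(f"MTBF={mtbf_h} h, P(fail)={p_fail:.2f}, checkpoint every {interval:.0f} s")
```

Even with very reliable individual nodes, a 1,000-node fleet fails about every two days, so a one-day job has a substantial chance of being interrupted at least once.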

Methods

Name	Description
solve	Calculates reliability and checkpointing metrics for a fleet.

solve

core.solver.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
)

Calculates reliability and checkpointing metrics for a fleet.

Parameters

Name	Type	Description	Default
fleet	Fleet	The hardware cluster configuration.	required
job_duration_hours	float	Total wall-clock duration of the training job.	required
checkpoint_time_s	float	Time in seconds to save a single checkpoint, by default 60.0.	`60.0`

Returns

Name	Type	Description
	Dict[str, Any]	Reliability metrics including fleet MTBF and failure probability.

ServingModel

core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes:

Pre-fill Phase: The initial processing of the input prompt. This is a compute-heavy phase where prompt tokens are processed in parallel.
Decoding Phase: The token-by-token generation. This phase is usually memory-bandwidth dominated because each step reads the model weights and accumulated KV-cache while producing only one token per request.
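The two regimes can be estimated with back-of-the-envelope formulas: prefill as compute-bound work (roughly 2 FLOPs per parameter per prompt token, ignoring attention FLOPs) and each decode step as bandwidth-bound (stream the weights plus the KV-cache once). This is a minimal sketch under those assumptions; the accelerator figures (1e15 FLOP/s, 3e12 B/s), the 7B model, and the function names are hypothetical, not values or APIs from `core.solver`.

```python
# Hypothetical first-order estimates; not the core.solver API.

def prefill_seconds(params: float, prompt_tokens: int, batch_size: int,
                    peak_flops: float, efficiency: float = 0.5) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    flops = 2.0 * params * prompt_tokens * batch_size
    return flops / (peak_flops * efficiency)


def decode_step_seconds(weight_bytes: float, kv_cache_bytes: float,
                        mem_bandwidth: float) -> float:
    """Bandwidth-bound estimate: each step streams the weights plus the KV-cache."""
    return (weight_bytes + kv_cache_bytes) / mem_bandwidth


params = 7e9  # hypothetical 7B-parameter model in fp16 (2 bytes per weight)
ttft = prefill_seconds(params, prompt_tokens=1000, batch_size=1, peak_flops=1e15)
itl = decode_step_seconds(weight_bytes=params * 2, kv_cache_bytes=0.5e9,
                          mem_bandwidth=3e12)
print(f"prefill ~{ttft * 1e3:.1f} ms, per-token decode ~{itl * 1e3:.2f} ms")
```

Prefill processes 1,000 tokens in about 28 ms, while decode spends nearly 5 ms per single token, because each step rereads about 14.5 GB of weights and cache to produce one token per request.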

This solver also models the KV-cache, the memory that stores the key and value states of previous tokens. It grows linearly with sequence length and batch size and eventually hits the ‘Memory Wall’. The solver also supports modern serving options: prompt caching, speculative decoding, phase splitting (disaggregated prefill and decode nodes), and an optional chunked-prefill decode-stall proxy.
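The linear growth follows from the usual KV-cache size formula: one key and one value vector per layer, per KV head, per token, per request. The sketch below assumes a plain (non-paged) cache; the 32-layer, 32-head, head-size-128 model is hypothetical, and the solver's own accounting (for example for grouped-query attention or paging overhead) may differ.

```python
# Hypothetical helper for illustration; not the core.solver API.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) for every layer, KV head, token, and request."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 32-layer model with 32 KV heads of size 128, stored in fp16.
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)   # 2 GiB
sixteen = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)      # 32 GiB
print(one_request / 2**30, sixteen / 2**30)
```

At 16 concurrent 4k-token requests, the cache alone reaches 32 GiB, comparable to the memory taken by the weights of a multi-billion-parameter model.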

Methods

Name	Description
solve	Solves for LLM serving performance.

solve

core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth='100 GB/s',
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

Solves for LLM serving performance.

Parameters

Name	Type	Description	Default
model	TransformerWorkload	The LLM model architecture.	required
hardware	HardwareNode	The target hardware for inference, or the prefill node in disaggregated serving.	required
seq_len	int	The total context window (prompt + generated tokens).	required
batch_size	int	Number of concurrent user requests.	`1`
precision	str	Numerical format. Lower precision reduces memory pressure and speeds up the decode phase.	`'fp16'`
efficiency	float	Compute utilization efficiency, primarily affecting the prefill phase.	`0.5`
decode_hardware	HardwareNode	Optional decode node for phase-split serving with KV-cache transfer.	`None`
network_bandwidth	Quantity	Bandwidth between prefill and decode nodes.	`100 GB/s`
draft_model	TransformerWorkload	Optional draft model for speculative decoding.	`None`
draft_acceptance_rate	float	Expected draft token acceptance rate.	`0.7`
cached_prefix_len	int	Prefix tokens already covered by prompt-cache KV entries.	`0`
prefill_chunk_tokens	int	Optional prefill chunk budget for estimating a decode-stall proxy.	`None`

Returns

Name	Type	Description
	ServingResult	Inference metrics including TTFT, ITL, KV-cache footprint, memory feasibility, prompt-cache hit ratio, and chunked-prefill metadata.
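For intuition about `draft_acceptance_rate`, a common first-order model of speculative decoding assumes each of k drafted tokens is accepted independently with probability α, giving (1 - α^(k+1)) / (1 - α) tokens per target-model forward pass. The sketch below implements only that formula; the draft length `num_draft_tokens` is a hypothetical parameter that `solve` does not expose, and the solver's own speedup model may differ.

```python
# Hypothetical helper for illustration; not the core.solver API.

def expected_tokens_per_target_pass(acceptance_rate: float,
                                    num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; the target always contributes one token of its own.
    """
    a, k = acceptance_rate, num_draft_tokens
    if a >= 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


print(expected_tokens_per_target_pass(0.7, 4))   # ~2.77 tokens per pass
```

With the default acceptance rate of 0.7 and four drafted tokens, each target pass yields about 2.8 tokens instead of 1, before accounting for the cost of running the draft model.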

SingleNodeModel

core.solver.SingleNodeModel()

Resolves single-node hardware Roofline bounds and feasibility.

This solver handles the ‘Iron Law’ of machine learning systems, calculating whether a model fits in memory and predicting its throughput based on arithmetic intensity.
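The Roofline logic behind this can be sketched in a few lines: attainable throughput is the lower of the compute roof and arithmetic intensity times memory bandwidth, and the ridge point (peak FLOP/s divided by bandwidth) separates memory-bound from compute-bound workloads. This minimal sketch assumes `efficiency` scales whichever roof applies and that feasibility means weights plus activations fit in device memory; the function names and hardware figures are hypothetical.

```python
# Hypothetical helpers for illustration; not the core.solver API.

def roofline_throughput(arithmetic_intensity: float, peak_flops: float,
                        mem_bandwidth: float, efficiency: float = 0.5) -> float:
    """Attainable FLOP/s: the lower of the compute roof and the bandwidth slope."""
    return efficiency * min(peak_flops, arithmetic_intensity * mem_bandwidth)


def bottleneck(arithmetic_intensity: float, peak_flops: float,
               mem_bandwidth: float) -> str:
    """Classify a workload by comparing its intensity with the ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOP/byte where the two roofs meet
    return "compute" if arithmetic_intensity >= ridge else "memory"


def fits_in_memory(weight_bytes: float, activation_bytes: float,
                   device_memory_bytes: float) -> bool:
    return weight_bytes + activation_bytes <= device_memory_bytes


# Hypothetical device: 1e15 FLOP/s peak, 2e12 B/s bandwidth, ridge at 500 FLOP/byte.
print(bottleneck(100, 1e15, 2e12), roofline_throughput(100, 1e15, 2e12))
```

A workload at 100 FLOP/byte sits well left of the 500 FLOP/byte ridge, so on this device it is memory-bound, and raising batch size (which raises intensity) is the main lever.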

Methods

Name	Description
solve	Solves the performance profile for a single hardware node.

solve

core.solver.SingleNodeModel.solve(
    model,
    hardware,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    raise_errors=False,
)

Solves the performance profile for a single hardware node.

Parameters

Name	Type	Description	Default
model	Workload	The model architecture (Transformer, CNN).	required
hardware	HardwareNode	The target hardware specification.	required
batch_size	int	Number of samples per inference/step, by default 1.	`1`
precision	str	Numerical precision format (‘fp32’, ‘fp16’, ‘int8’, ‘int4’), by default ‘fp16’.	`'fp16'`
efficiency	float	Hardware utilization efficiency (0.0 to 1.0), by default 0.5.	`0.5`
raise_errors	bool	Whether to raise OOMError for infeasible workloads, by default False.	`False`

Returns

Name	Type	Description
	PerformanceProfile	The resulting latency, throughput, and bottleneck analysis.

SustainabilityModel

core.solver.SustainabilityModel()

Calculates datacenter-scale sustainability metrics.

Handles Power Usage Effectiveness (PUE), carbon intensity, and Water Usage Effectiveness (WUE) across different regional grids. This solver models the ‘Infrastructure Tax’: the energy spent on cooling and power delivery rather than on neural computation.
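These metrics compose simply. PUE scales IT energy up to facility energy (the difference is the Infrastructure Tax), grid carbon intensity converts facility energy to emissions, and WUE, defined per kWh of IT energy, gives water use. This is a minimal sketch under those definitions; the function name and the regional figures (PUE 1.2, 0.4 kgCO2e/kWh, 1.8 L/kWh) are hypothetical, and the solver's datacenter profiles may differ.

```python
# Hypothetical helper for illustration; not the core.solver API.

def sustainability(it_power_kw: float, duration_days: float, pue: float,
                   carbon_kg_per_kwh: float, wue_l_per_kwh: float) -> dict:
    it_kwh = it_power_kw * 24 * duration_days
    facility_kwh = it_kwh * pue                   # adds cooling and power delivery
    return {
        "energy_kwh": facility_kwh,
        "infrastructure_tax_kwh": facility_kwh - it_kwh,
        "carbon_kgco2e": facility_kwh * carbon_kg_per_kwh,
        "water_l": it_kwh * wue_l_per_kwh,        # WUE is normalized to IT energy
    }


# Hypothetical 1 MW IT load for one day in a region with the figures above.
report = sustainability(1000, 1, pue=1.2, carbon_kg_per_kwh=0.4, wue_l_per_kwh=1.8)
print(report)
```

Here 4,800 of the 28,800 kWh consumed never reach the accelerators, which is why a lower PUE reduces carbon even when the workload itself is unchanged.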

Methods

Name	Description
solve	Calculates energy, carbon, and water footprint for a fleet operation.

solve

core.solver.SustainabilityModel.solve(fleet, duration_days, datacenter=None)

Calculates energy, carbon, and water footprint for a fleet operation.

Parameters

Name	Type	Description	Default
fleet	Fleet	The hardware cluster configuration.	required
duration_days	float	Operating duration in days.	required
datacenter	Datacenter	A specific datacenter profile, defaults to fleet’s region.	`None`

Returns

Name	Type	Description
	Dict[str, Any]	Sustainability metrics including total energy (kWh) and carbon (kgCO2e).