core.solver

Classes

Name Description
SingleNodeModel Resolves single-node hardware Roofline bounds and feasibility.
ServingModel Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.
TrainingMemoryModel Decomposes per-accelerator training memory into teachable components.
ServingCapacityModel Sizes an LLM serving deployment from a QPS and tail-latency target.
ContinuousBatchingModel Analyzes production LLM serving with Continuous Batching and PagedAttention.
WeightStreamingModel Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.
TailLatencyModel Analyzes queueing delays and P99 tail latency for deployed inference (M/M/c).
DistributedModel Resolves fleet-wide communication, synchronization, and pipelining constraints.
MoERoutingModel Models first-order MoE routing imbalance and expert-parallel all-to-all cost.
ReliabilityModel Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.
CheckpointModel Analyzes checkpoint I/O burst penalties and MFU impact.
EconomicsModel Calculates Total Cost of Ownership (TCO) including Capex and Opex.
SustainabilityModel Calculates Datacenter-scale Sustainability metrics.

DistributedModel

core.solver.DistributedModel()

Resolves fleet-wide communication, synchronization, and pipelining constraints.

This solver analyzes the constraints of scaling distributed training. It decomposes a workload across a cluster using 3D/4D parallelism (DP, TP, PP, EP) and calculates the communication overheads and idle times (pipeline bubbles) that determine Model FLOPs Utilization (MFU).
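To make the pipeline-bubble idea concrete, the sketch below computes the idle fraction of a pipeline step using the commonly cited GPipe/1F1B approximation, with interleaving modeled as multiplying the useful work slots by the number of virtual stages. This is an illustrative assumption, not the solver's confirmed internal formula; the function name is hypothetical, and the argument names simply mirror the solve parameters of the same names.

```python
def pipeline_bubble_fraction(pp_size: int, microbatch_count: int, v_stages: int = 1) -> float:
    """Fraction of a pipeline step spent idle (assumed textbook approximation).

    bubble = (p - 1) / (v * m + p - 1), where p is the pipeline depth,
    m the number of microbatches, and v the number of virtual stages.
    """
    if pp_size < 1 or microbatch_count < 1 or v_stages < 1:
        raise ValueError("pp_size, microbatch_count, and v_stages must be >= 1")
    bubble_slots = pp_size - 1                  # warm-up/drain slots with idle stages
    work_slots = v_stages * microbatch_count    # slots doing useful work
    return bubble_slots / (work_slots + bubble_slots)


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        print(f"pp=4, m={m:>2}: bubble = {pipeline_bubble_fraction(4, m):.3f}")
```

Under this approximation the bubble shrinks as the microbatch count grows, which is the trade-off the microbatch_count parameter exposes.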

Methods

Name Description
solve Calculates distributed training performance using the 3D/4D Parallelism model.
solve
core.solver.DistributedModel.solve(
    model,
    fleet,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    tp_size=1,
    pp_size=1,
    ep_size=1,
    v_stages=1,
    microbatch_count=1,
    topology_override=None,
    zero_stage=0,
    is_lora=False,
    activation_recomputation=False,
    overlap_comm=False,
    overlap_efficiency=0.85,
    congestion_factor=1.0,
    straggler_factor=1.0,
    moe_routing_imbalance_factor=1.0,
    gradient_accumulation_steps=1,
    seq_len=2048,
)

Calculates distributed training performance using the 3D/4D Parallelism model.

Parameters
Name Type Description Default
model Workload The model architecture to analyze. required
fleet Fleet The hardware cluster and network topology. required
batch_size int Global batch size. 1
precision str Numerical precision (fp16, fp32, int8). 'fp16'
efficiency float Achieved compute efficiency (0.0 to 1.0). 0.5
tp_size int Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. 1
pp_size int Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. 1
ep_size int Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. 1
v_stages int Number of virtual stages for interleaved pipeline schedules. 1
microbatch_count int Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. 1
topology_override str Force a specific topology (ring, tree). None
zero_stage int ZeRO optimization stage (0–3). 0
is_lora bool Whether to approximate LoRA-style reduced gradient communication. False
activation_recomputation bool Whether to trade extra compute for activation memory savings. False
overlap_comm bool Whether to hide DP communication behind backward compute. False
overlap_efficiency float Fraction of DP communication hidden when overlap is enabled. 0.85
congestion_factor float Multiplier for network congestion or oversubscription beyond fabric metadata. 1.0
straggler_factor float Multiplier for bulk-synchronous slow-worker effects. 1.0
moe_routing_imbalance_factor float Multiplier on MoE routed-token traffic, where 1.0 is balanced. 1.0
gradient_accumulation_steps int Microsteps over which DP communication is amortized. 1
seq_len int Sequence length for activation and routing-volume estimates. 2048
Returns
Name Type Description
DistributedResult Metrics including DP/TP/EP latency, pipeline bubble penalty, scaling efficiency, and parallelism.
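
The data-parallel communication parameters can be illustrated with a first-order ring all-reduce estimate. The sketch below is an assumption about how such terms are typically combined (bandwidth-only ring cost, scaled by congestion, amortized over accumulation steps, and partially hidden by overlap); it is not taken from the solver's source, and both function names are hypothetical.

```python
def ring_allreduce_seconds(grad_bytes: float, dp_size: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only ring all-reduce time: 2 * (N - 1) / N * bytes / bandwidth."""
    if dp_size <= 1:
        return 0.0  # a single replica has nothing to synchronize
    return 2.0 * (dp_size - 1) / dp_size * grad_bytes / link_bytes_per_s


def exposed_dp_seconds(
    allreduce_s: float,
    overlap_comm: bool = False,
    overlap_efficiency: float = 0.85,
    congestion_factor: float = 1.0,
    gradient_accumulation_steps: int = 1,
) -> float:
    """Per-microstep DP communication time left on the critical path (assumed model)."""
    t = allreduce_s * congestion_factor / gradient_accumulation_steps
    if overlap_comm:
        t *= 1.0 - overlap_efficiency  # only the non-hidden fraction is exposed
    return t


if __name__ == "__main__":
    # Hypothetical numbers: 14 GB of fp16 gradients, 8 replicas, 50 GB/s links.
    t = ring_allreduce_seconds(14e9, 8, 50e9)
    print(f"all-reduce: {t:.3f} s, exposed with overlap: {exposed_dp_seconds(t, overlap_comm=True):.4f} s")
```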

EconomicsModel

core.solver.EconomicsModel()

Calculates Total Cost of Ownership (TCO) including Capex and Opex.

Combines hardware costs, energy consumption, and maintenance into a single financial model for the fleet. This solver exposes the ROI of architectural efficiency by showing how reducing power draw or increasing throughput directly impacts the bottom line.

Methods

Name Description
solve Calculates the TCO for a fleet over a specified duration.
solve
core.solver.EconomicsModel.solve(fleet, duration_days, kwh_price=0.12)

Calculates the TCO for a fleet over a specified duration.

Parameters
Name Type Description Default
fleet Fleet The hardware cluster configuration. required
duration_days float Operation duration in days. required
kwh_price float Price of electricity per kWh. 0.12
Returns
Name Type Description
Dict[str, Any] Financial metrics including CapEx, OpEx, and total TCO.
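
As a rough illustration of how CapEx and energy OpEx combine, the sketch below adds a hardware purchase cost to the electricity cost over the operating period. It is a simplified assumption, not the solver's implementation: maintenance is omitted, and the function name, the capex_usd and avg_power_kw inputs, and the returned keys are all hypothetical.

```python
def simple_tco(capex_usd: float, avg_power_kw: float, duration_days: float,
               kwh_price: float = 0.12) -> dict:
    """First-order TCO: hardware cost plus electricity cost (maintenance not modeled)."""
    energy_kwh = avg_power_kw * 24.0 * duration_days  # kW * hours
    opex_usd = energy_kwh * kwh_price
    return {
        "capex_usd": capex_usd,
        "energy_kwh": energy_kwh,
        "opex_usd": opex_usd,
        "tco_usd": capex_usd + opex_usd,
    }


if __name__ == "__main__":
    # Hypothetical fleet: $1M of hardware drawing 100 kW on average for one year.
    print(simple_tco(1_000_000, 100.0, 365))
```

Because OpEx scales linearly with average power, any reduction in power draw lowers the total in direct proportion to the energy term.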

ReliabilityModel

core.solver.ReliabilityModel()

Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.

This solver models the reliability of large clusters and helps determine the ‘Goodput’ of long-running training jobs. It estimates the probability that a job fails before completion and calculates the Young-Daly optimal checkpoint interval, which minimizes wasted compute time.
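
The quantities named above can be sketched with standard first-order formulas: fleet MTBF as node MTBF divided by node count (assuming independent, identical nodes), failure probability from an exponential failure model, and the Young-Daly interval sqrt(2 * C * MTBF) for checkpoint cost C. These are textbook approximations assumed for illustration; the function names are hypothetical and the solver may use different inputs or refinements.

```python
import math


def fleet_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Fleet MTBF assuming independent nodes with identical failure rates."""
    return node_mtbf_hours / node_count


def failure_probability(job_duration_hours: float, mtbf_hours: float) -> float:
    """P(at least one failure during the job) under an exponential failure model."""
    return 1.0 - math.exp(-job_duration_hours / mtbf_hours)


def young_daly_interval_s(checkpoint_time_s: float, mtbf_hours: float) -> float:
    """Young-Daly first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_hours * 3600.0)


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 nodes, each with a 50,000-hour MTBF.
    mtbf = fleet_mtbf_hours(50_000.0, 1_000)
    print(f"fleet MTBF: {mtbf:.1f} h")
    print(f"P(failure in 24 h): {failure_probability(24.0, mtbf):.3f}")
    print(f"checkpoint every {young_daly_interval_s(60.0, mtbf) / 60.0:.1f} min")
```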

Methods

Name Description
solve Calculates reliability and checkpointing metrics for a fleet.
solve
core.solver.ReliabilityModel.solve(
    fleet,
    job_duration_hours,
    checkpoint_time_s=60.0,
)

Calculates reliability and checkpointing metrics for a fleet.

Parameters
Name Type Description Default
fleet Fleet The hardware cluster configuration. required
job_duration_hours float Total wall-clock duration of the training job. required
checkpoint_time_s float Time taken to save a single checkpoint, in seconds. 60.0
Returns
Name Type Description
Dict[str, Any] Reliability metrics including fleet MTBF and failure probability.

ServingModel

core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes:

  1. Pre-fill Phase: The initial processing of the input prompt. This is a compute-heavy phase where prompt tokens are processed in parallel.
  2. Decoding Phase: The token-by-token generation. This phase is usually memory-bandwidth dominated because each step reads the model weights and accumulated KV-cache while producing only one token per request.
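
The two regimes above can be contrasted with a back-of-the-envelope estimate. The sketch assumes a dense transformer costing about 2 FLOPs per parameter per token, treats prefill as purely compute-bound and each decode step as purely memory-bound; the function names and inputs are hypothetical and do not reflect the solver's actual internals.

```python
def prefill_seconds(param_count: float, prompt_tokens: int, batch_size: int,
                    peak_flops: float, efficiency: float = 0.5) -> float:
    """Compute-bound prefill estimate: ~2 FLOPs per parameter per prompt token."""
    flops = 2.0 * param_count * prompt_tokens * batch_size
    return flops / (peak_flops * efficiency)


def decode_step_seconds(weight_bytes: float, kv_cache_bytes: float,
                        mem_bandwidth_bytes_per_s: float) -> float:
    """Memory-bound decode estimate: every step re-reads the weights and the KV-cache."""
    return (weight_bytes + kv_cache_bytes) / mem_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical 7B-parameter model in fp16 on a 1 PFLOP/s, 3 TB/s accelerator.
    ttft = prefill_seconds(7e9, 1024, 1, 1e15)
    itl = decode_step_seconds(14e9, 1e9, 3e12)
    print(f"prefill ~ {ttft * 1e3:.1f} ms, decode step ~ {itl * 1e3:.1f} ms")
```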

This solver also models the KV-Cache, the memory required to store previous token states, which grows linearly with sequence length and batch size, eventually hitting the ‘Memory Wall’. Modern serving options include prompt caching, speculative decoding, phase splitting, and optional chunked-prefill stall proxies.
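
The linear growth of the KV-cache can be checked with the usual sizing formula: two tensors (keys and values) per layer, each of shape batch x KV heads x sequence length x head dimension. The sketch below assumes this standard layout; the function name and arguments are hypothetical and are not the library's API.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_element: int = 2) -> int:
    """KV-cache footprint in bytes; the leading 2 accounts for keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical 32-layer model with 32 KV heads of dimension 128, stored in fp16.
    for seq_len in (2048, 4096, 8192):
        gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
        print(f"seq_len={seq_len}: {gib:.1f} GiB per request")
```

Doubling either the sequence length or the batch size doubles the footprint, which is why long contexts at high concurrency run out of accelerator memory.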

Methods

Name Description
solve Solves for LLM serving performance.
solve
core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth='100 GB/s',
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

Solves for LLM serving performance.

Parameters
Name Type Description Default
model TransformerWorkload The LLM model architecture. required
hardware HardwareNode The target hardware for inference, or the prefill node in disaggregated serving. required
seq_len int The total context window (prompt + generated tokens). required
batch_size int Number of concurrent user requests. 1
precision str Numerical format. Lower precision reduces memory pressure and speeds up the decode phase. 'fp16'
efficiency float Compute utilization efficiency, primarily affecting the prefill phase. 0.5
decode_hardware HardwareNode Optional decode node for phase-split serving with KV-cache transfer. None
network_bandwidth Quantity Bandwidth between prefill and decode nodes. 100 GB/s
draft_model TransformerWorkload Optional draft model for speculative decoding. None
draft_acceptance_rate float Expected draft token acceptance rate. 0.7
cached_prefix_len int Prefix tokens already covered by prompt-cache KV entries. 0
prefill_chunk_tokens int Optional prefill chunk budget for estimating a decode-stall proxy. None
Returns
Name Type Description
ServingResult Inference metrics including TTFT, ITL, KV-cache footprint, memory feasibility, prompt-cache hit ratio, and chunked-prefill metadata.

SingleNodeModel

core.solver.SingleNodeModel()

Resolves single-node hardware Roofline bounds and feasibility.

This solver handles the ‘Iron Law’ of machine learning systems, calculating whether a model fits in memory and predicting its throughput based on arithmetic intensity.
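
A minimal Roofline calculation shows how arithmetic intensity determines the bottleneck. The sketch assumes latency is the larger of compute time (derated by an efficiency factor) and memory-transfer time; the function name, inputs, and returned keys are hypothetical and are not the fields of PerformanceProfile.

```python
def roofline(flops: float, bytes_moved: float, peak_flops: float,
             mem_bandwidth: float, efficiency: float = 0.5) -> dict:
    """Classify a workload as compute- or memory-bound (assumed first-order model)."""
    compute_s = flops / (peak_flops * efficiency)  # time if limited by arithmetic
    memory_s = bytes_moved / mem_bandwidth         # time if limited by data movement
    return {
        "arithmetic_intensity": flops / bytes_moved,  # FLOPs per byte
        "latency_s": max(compute_s, memory_s),
        "bottleneck": "compute" if compute_s >= memory_s else "memory",
    }


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
    print(roofline(1e12, 1e10, 1e14, 1e12))  # high intensity
    print(roofline(1e10, 1e10, 1e14, 1e12))  # low intensity
```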

Methods

Name Description
solve Solves the performance profile for a single hardware node.
solve
core.solver.SingleNodeModel.solve(
    model,
    hardware,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    raise_errors=False,
)

Solves the performance profile for a single hardware node.

Parameters
Name Type Description Default
model Workload The model architecture (Transformer, CNN). required
hardware HardwareNode The target hardware specification. required
batch_size int Number of samples per inference/step. 1
precision str Numerical precision format ('fp32', 'fp16', 'int8', 'int4'). 'fp16'
efficiency float Hardware utilization efficiency (0.0 to 1.0). 0.5
raise_errors bool Whether to raise OOMError for infeasible workloads. False
Returns
Name Type Description
PerformanceProfile The resulting latency, throughput, and bottleneck analysis.

SustainabilityModel

core.solver.SustainabilityModel()

Calculates Datacenter-scale Sustainability metrics.

Handles Power Usage Effectiveness (PUE), Carbon Intensity, and Water Usage Effectiveness (WUE) across different regional grids. This solver models the ‘Infrastructure Tax’: the energy spent on cooling and power delivery rather than on neural computation.
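
The relationship between these metrics can be sketched as follows, assuming the conventional definitions: PUE is total facility energy divided by IT energy, carbon is total energy times grid carbon intensity, and WUE is liters of water per kWh of IT energy. The default values are placeholders chosen for illustration, not regional data, and the function name and returned keys are hypothetical.

```python
def footprint(it_power_kw: float, duration_days: float, pue: float = 1.2,
              carbon_kg_per_kwh: float = 0.4, wue_l_per_kwh: float = 1.8) -> dict:
    """First-order energy, carbon, and water footprint (placeholder defaults)."""
    it_energy_kwh = it_power_kw * 24.0 * duration_days
    total_energy_kwh = it_energy_kwh * pue  # IT load plus cooling and power delivery
    return {
        "it_energy_kwh": it_energy_kwh,
        "total_energy_kwh": total_energy_kwh,
        "infrastructure_tax_kwh": total_energy_kwh - it_energy_kwh,
        "carbon_kg": total_energy_kwh * carbon_kg_per_kwh,
        "water_l": it_energy_kwh * wue_l_per_kwh,
    }


if __name__ == "__main__":
    # Hypothetical 1 MW IT load running for 30 days.
    print(footprint(1000.0, 30, pue=1.5, carbon_kg_per_kwh=0.5, wue_l_per_kwh=2.0))
```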

Methods

Name Description
solve Calculates energy, carbon, and water footprint for a fleet operation.
solve
core.solver.SustainabilityModel.solve(fleet, duration_days, datacenter=None)

Calculates energy, carbon, and water footprint for a fleet operation.

Parameters
Name Type Description Default
fleet Fleet The hardware cluster configuration. required
duration_days float Operating duration in days. required
datacenter Datacenter A specific datacenter profile. If omitted, the fleet’s region is used. None
Returns
Name Type Description
Dict[str, Any] Sustainability metrics including total energy (kWh) and carbon (kgCO2e).