core.solver
Classes
| Name | Description |
|---|---|
| SingleNodeModel | Resolves single-node hardware Roofline bounds and feasibility. |
| ServingModel | Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding. |
| TrainingMemoryModel | Decomposes per-accelerator training memory into teachable components. |
| ServingCapacityModel | Sizes an LLM serving deployment from a QPS and tail-latency target. |
| ContinuousBatchingModel | Analyzes production LLM serving with Continuous Batching and PagedAttention. |
| WeightStreamingModel | Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming. |
| TailLatencyModel | Analyzes queueing delays and P99 tail latency for deployed inference (M/M/c). |
| DistributedModel | Resolves fleet-wide communication, synchronization, and pipelining constraints. |
| MoERoutingModel | Models first-order MoE routing imbalance and expert-parallel all-to-all cost. |
| ReliabilityModel | Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals. |
| CheckpointModel | Analyzes checkpoint I/O burst penalties and MFU impact. |
| EconomicsModel | Calculates Total Cost of Ownership (TCO) including Capex and Opex. |
| SustainabilityModel | Calculates Datacenter-scale Sustainability metrics. |
DistributedModel
core.solver.DistributedModel()
Resolves fleet-wide communication, synchronization, and pipelining constraints.
This solver analyzes the constraints of scaling distributed training. It decomposes a workload across a cluster using 3D/4D parallelism (DP, TP, PP, EP) and calculates the resulting communication overheads and idle times (bubbles) that determine the Model FLOPs Utilization (MFU).
Methods
| Name | Description |
|---|---|
| solve | Calculates distributed training performance using the 3D/4D Parallelism model. |
solve
core.solver.DistributedModel.solve(
model,
fleet,
batch_size=1,
precision='fp16',
efficiency=0.5,
tp_size=1,
pp_size=1,
ep_size=1,
v_stages=1,
microbatch_count=1,
topology_override=None,
zero_stage=0,
is_lora=False,
activation_recomputation=False,
overlap_comm=False,
overlap_efficiency=0.85,
congestion_factor=1.0,
straggler_factor=1.0,
moe_routing_imbalance_factor=1.0,
gradient_accumulation_steps=1,
seq_len=2048,
)
Calculates distributed training performance using the 3D/4D Parallelism model.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture to analyze. | required |
| fleet | Fleet | The hardware cluster and network topology. | required |
| batch_size | int | Global batch size. | 1 |
| precision | str | Numerical precision (fp16, fp32, int8). | 'fp16' |
| efficiency | float | Achieved compute efficiency (0.0 to 1.0). | 0.5 |
| tp_size | int | Tensor Parallelism degree. Splits individual layers across GPUs, usually within a single node over high-speed NVLink. | 1 |
| pp_size | int | Pipeline Parallelism degree. Chains model layers across multiple nodes, introducing ‘pipeline bubbles’ while saving memory. | 1 |
| ep_size | int | Expert Parallelism degree for MoE models. Introduces All-to-All communication overhead across nodes. | 1 |
| v_stages | int | Number of virtual stages for interleaved pipeline schedules. | 1 |
| microbatch_count | int | Number of microbatches (M). Increasing M reduces the pipeline bubble but increases synchronization overhead. | 1 |
| topology_override | str | Force a specific topology (ring, tree). | None |
| zero_stage | int | ZeRO optimization stage (0–3). | 0 |
| is_lora | bool | Whether to approximate LoRA-style reduced gradient communication. | False |
| activation_recomputation | bool | Whether to trade extra compute for activation memory savings. | False |
| overlap_comm | bool | Whether to hide DP communication behind backward compute. | False |
| overlap_efficiency | float | Fraction of DP communication hidden when overlap is enabled. | 0.85 |
| congestion_factor | float | Multiplier for network congestion or oversubscription beyond fabric metadata. | 1.0 |
| straggler_factor | float | Multiplier for bulk-synchronous slow-worker effects. | 1.0 |
| moe_routing_imbalance_factor | float | Multiplier on MoE routed-token traffic, where 1.0 is balanced. | 1.0 |
| gradient_accumulation_steps | int | Microsteps over which DP communication is amortized. | 1 |
| seq_len | int | Sequence length for activation and routing-volume estimates. | 2048 |
Returns
| Name | Type | Description |
|---|---|---|
| | DistributedResult | Metrics including DP/TP/EP latency, pipeline bubble penalty, scaling efficiency, and parallelism. |
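The pipeline-bubble and communication-overlap parameters above can be illustrated with a minimal sketch. It assumes the textbook bubble fraction (P - 1) / (V * M + P - 1) for a pipeline with P stages, M microbatches, and V virtual stages, and a simple linear treatment of overlap and gradient accumulation. Both helper functions are hypothetical and are not part of core.solver; the solver's internal formulas may differ.

```python
def pipeline_bubble_fraction(pp_size: int, microbatch_count: int, v_stages: int = 1) -> float:
    """Idle fraction of one pipeline step: (P - 1) / (V * M + P - 1).

    Assumption: the standard synchronous-pipeline estimate, not
    necessarily the formula core.solver uses.
    """
    if min(pp_size, microbatch_count, v_stages) < 1:
        raise ValueError("pp_size, microbatch_count, and v_stages must be >= 1")
    bubble_slots = pp_size - 1
    return bubble_slots / (v_stages * microbatch_count + bubble_slots)


def exposed_dp_comm_time(
    dp_comm_time_s: float,
    overlap_comm: bool = False,
    overlap_efficiency: float = 0.85,
    gradient_accumulation_steps: int = 1,
) -> float:
    """DP communication time per microstep that is not hidden behind compute.

    Assumption: overlap hides a fixed fraction of the communication, and
    the remainder is amortized evenly over the accumulation microsteps.
    """
    hidden = overlap_efficiency if overlap_comm else 0.0
    return dp_comm_time_s * (1.0 - hidden) / gradient_accumulation_steps


if __name__ == "__main__":
    for m in (1, 4, 12, 32):
        print(f"M={m:>2}  bubble={pipeline_bubble_fraction(4, m):.3f}")
    print(exposed_dp_comm_time(10.0, overlap_comm=True))
```

Under these assumptions, with pp_size=4, raising microbatch_count from 1 to 12 lowers the bubble fraction from 0.75 to 0.2, and v_stages=2 lowers it further.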
EconomicsModel
core.solver.EconomicsModel()
Calculates Total Cost of Ownership (TCO) including Capex and Opex.
Combines hardware costs, energy consumption, and maintenance into a single financial model for the fleet. This solver exposes the ROI of architectural efficiency by showing how reducing power draw or increasing throughput directly impacts the bottom line.
Methods
| Name | Description |
|---|---|
| solve | Calculates the TCO for a fleet over a specified duration. |
solve
core.solver.EconomicsModel.solve(fleet, duration_days, kwh_price=0.12)
Calculates the TCO for a fleet over a specified duration.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fleet | Fleet | The hardware cluster configuration. | required |
| duration_days | float | Operation duration in days. | required |
| kwh_price | float | Price of electricity per kWh, by default 0.12. | 0.12 |
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Financial metrics including CapEx, OpEx, and total TCO. |
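A rough sketch of the arithmetic behind a TCO estimate follows. It assumes CapEx is a single up-front figure and OpEx is energy cost plus a maintenance charge proportional to CapEx. The function name, its arguments other than duration_days and kwh_price, and the returned keys are illustrative assumptions, not the actual EconomicsModel interface.

```python
def tco_sketch(
    capex_usd: float,
    fleet_power_kw: float,
    duration_days: float,
    kwh_price: float = 0.12,
    annual_maintenance_ratio: float = 0.0,
) -> dict:
    """Hypothetical first-order TCO: CapEx + energy cost + maintenance."""
    hours = duration_days * 24.0
    energy_kwh = fleet_power_kw * hours
    energy_cost = energy_kwh * kwh_price
    # Assumption: maintenance is a yearly fraction of CapEx, prorated by days.
    maintenance = capex_usd * annual_maintenance_ratio * duration_days / 365.0
    opex = energy_cost + maintenance
    return {
        "capex_usd": capex_usd,
        "opex_usd": opex,
        "tco_usd": capex_usd + opex,
        "energy_kwh": energy_kwh,
    }


if __name__ == "__main__":
    result = tco_sketch(1_000_000, fleet_power_kw=100, duration_days=365,
                        annual_maintenance_ratio=0.05)
    print(result)
```

In this form, a reduction in fleet power draw lowers OpEx linearly, which is the effect the description above attributes to architectural efficiency.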
ReliabilityModel
core.solver.ReliabilityModel()
Calculates Mean Time Between Failures (MTBF) and optimal checkpointing intervals.
This solver handles the reliability modeling of massive clusters, helping determine the ‘Goodput’ of long-running training jobs. It identifies the probability of a job failure before completion and calculates the Young-Daly optimal interval to minimize wasted compute time.
Methods
| Name | Description |
|---|---|
| solve | Calculates reliability and checkpointing metrics for a fleet. |
solve
core.solver.ReliabilityModel.solve(
fleet,
job_duration_hours,
checkpoint_time_s=60.0,
)
Calculates reliability and checkpointing metrics for a fleet.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fleet | Fleet | The hardware cluster configuration. | required |
| job_duration_hours | float | Total wall-clock duration of the training job. | required |
| checkpoint_time_s | float | Time taken to save a single checkpoint, by default 60.0. | 60.0 |
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Reliability metrics including fleet MTBF and failure probability. |
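The quantities named above can be sketched under common simplifying assumptions: independent, exponentially distributed node failures (so fleet MTBF is node MTBF divided by node count) and the first-order Young-Daly interval sqrt(2 * checkpoint_time * MTBF). The function, its node-level arguments, and the result keys are hypothetical; the real solver reads these values from the Fleet object and may compute them differently.

```python
import math


def reliability_sketch(
    node_mtbf_hours: float,
    node_count: int,
    job_duration_hours: float,
    checkpoint_time_s: float = 60.0,
) -> dict:
    """Hypothetical reliability estimate for a homogeneous fleet."""
    # Assumption: failures are independent, so failure rates add across nodes.
    fleet_mtbf_hours = node_mtbf_hours / node_count
    # Probability of at least one failure before the job completes.
    failure_probability = 1.0 - math.exp(-job_duration_hours / fleet_mtbf_hours)
    # First-order Young-Daly optimum, in seconds.
    optimal_interval_s = math.sqrt(2.0 * checkpoint_time_s * fleet_mtbf_hours * 3600.0)
    return {
        "fleet_mtbf_hours": fleet_mtbf_hours,
        "failure_probability": failure_probability,
        "optimal_checkpoint_interval_s": optimal_interval_s,
    }


if __name__ == "__main__":
    print(reliability_sketch(node_mtbf_hours=50_000, node_count=1_000,
                             job_duration_hours=50))
```

Because fleet MTBF shrinks linearly with node count in this model, the optimal checkpoint interval shrinks with the square root of the cluster size.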
ServingModel
core.solver.ServingModel()
Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.
LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes:
- Pre-fill Phase: The initial processing of the input prompt. This is a compute-heavy phase where prompt tokens are processed in parallel.
- Decoding Phase: The token-by-token generation. This phase is usually memory-bandwidth dominated because each step reads the model weights and accumulated KV-cache while producing only one token per request.
This solver also models the KV-cache, the memory that stores key and value states for previous tokens. It grows linearly with sequence length and batch size and eventually hits the ‘Memory Wall’. Supported serving options include prompt caching, speculative decoding, phase splitting (separate prefill and decode hardware), and an optional chunked-prefill stall proxy.
Methods
| Name | Description |
|---|---|
| solve | Solves for LLM serving performance. |
solve
core.solver.ServingModel.solve(
model,
hardware,
seq_len,
batch_size=1,
precision='fp16',
efficiency=0.5,
decode_hardware=None,
network_bandwidth='100 GB/s',
draft_model=None,
draft_acceptance_rate=0.7,
cached_prefix_len=0,
prefill_chunk_tokens=None,
)
Solves for LLM serving performance.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The target hardware for inference, or the prefill node in disaggregated serving. | required |
| seq_len | int | The total context window (prompt + generated tokens). | required |
| batch_size | int | Number of concurrent user requests. | 1 |
| precision | str | Numerical format. Lower precision reduces memory pressure and speeds up the decode phase. | 'fp16' |
| efficiency | float | Compute utilization efficiency, primarily affecting the prefill phase. | 0.5 |
| decode_hardware | HardwareNode | Optional decode node for phase-split serving with KV-cache transfer. | None |
| network_bandwidth | Quantity | Bandwidth between prefill and decode nodes. | '100 GB/s' |
| draft_model | TransformerWorkload | Optional draft model for speculative decoding. | None |
| draft_acceptance_rate | float | Expected draft token acceptance rate. | 0.7 |
| cached_prefix_len | int | Prefix tokens already covered by prompt-cache KV entries. | 0 |
| prefill_chunk_tokens | int | Optional prefill chunk budget for estimating a decode-stall proxy. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | ServingResult | Inference metrics including TTFT, ITL, KV-cache footprint, memory feasibility, prompt-cache hit ratio, and chunked-prefill metadata. |
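The linear KV-cache growth and the bandwidth-bound decode step described for this solver can be sketched as follows. The sketch assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token, and a decode step that reads all weights and the full KV-cache once. Both helpers and their argument names are illustrative and are not part of core.solver.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 bytes for fp16
) -> int:
    """Hypothetical KV-cache size: 2 (K and V) x layers x heads x dim x tokens."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


def decode_step_time_s(
    weight_bytes: float,
    kv_bytes: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Memory-bound lower bound on one decode step (a proxy for ITL)."""
    return (weight_bytes + kv_bytes) / mem_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Illustrative 32-layer model with 32 KV heads of dimension 128.
    kv = kv_cache_bytes(32, 32, 128, seq_len=4096)
    print(f"KV-cache: {kv / 2**30:.1f} GiB")
    print(f"Decode step: {decode_step_time_s(14e9, kv, 2e12) * 1e3:.2f} ms")
```

Doubling either seq_len or batch_size doubles the cache in this sketch, which is why long contexts at high concurrency run out of accelerator memory before they run out of compute.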
SingleNodeModel
core.solver.SingleNodeModel()
Resolves single-node hardware Roofline bounds and feasibility.
This solver handles the ‘Iron Law’ of machine learning systems, calculating whether a model fits in memory and predicting its throughput based on arithmetic intensity.
Methods
| Name | Description |
|---|---|
| solve | Solves the performance profile for a single hardware node. |
solve
core.solver.SingleNodeModel.solve(
model,
hardware,
batch_size=1,
precision='fp16',
efficiency=0.5,
raise_errors=False,
)
Solves the performance profile for a single hardware node.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture (Transformer, CNN). | required |
| hardware | HardwareNode | The target hardware specification. | required |
| batch_size | int | Number of samples per inference/step, by default 1. | 1 |
| precision | str | Numerical precision format (‘fp32’, ‘fp16’, ‘int8’, ‘int4’), by default “fp16”. | 'fp16' |
| efficiency | float | Hardware utilization efficiency (0.0 to 1.0), by default 0.5. | 0.5 |
| raise_errors | bool | Whether to raise OOMError for infeasible workloads, by default False. | False |
Returns
| Name | Type | Description |
|---|---|---|
| | PerformanceProfile | The resulting latency, throughput, and bottleneck analysis. |
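The Roofline bound this solver resolves can be sketched as the larger of a compute time and a memory-transfer time, with a simple capacity check for feasibility. The sketch assumes that efficiency scales only the compute peak; the function, its arguments, and the returned keys are hypothetical and do not mirror the fields of PerformanceProfile.

```python
def roofline_sketch(
    flops: float,
    bytes_moved: float,
    peak_flops_per_s: float,
    mem_bandwidth_bytes_per_s: float,
    memory_required_bytes: float,
    memory_capacity_bytes: float,
    efficiency: float = 0.5,
) -> dict:
    """Hypothetical single-node Roofline estimate."""
    achievable_flops = peak_flops_per_s * efficiency
    compute_time_s = flops / achievable_flops
    memory_time_s = bytes_moved / mem_bandwidth_bytes_per_s
    return {
        "latency_s": max(compute_time_s, memory_time_s),
        "bottleneck": "compute" if compute_time_s >= memory_time_s else "memory",
        # FLOPs performed per byte moved.
        "arithmetic_intensity": flops / bytes_moved,
        # Intensity at which the two bounds meet.
        "ridge_point": achievable_flops / mem_bandwidth_bytes_per_s,
        "feasible": memory_required_bytes <= memory_capacity_bytes,
    }


if __name__ == "__main__":
    print(roofline_sketch(flops=1e12, bytes_moved=1e10,
                          peak_flops_per_s=1e14, mem_bandwidth_bytes_per_s=1e12,
                          memory_required_bytes=14e9, memory_capacity_bytes=80e9))
```

A workload is compute-bound in this model when its arithmetic intensity exceeds the ridge point, and memory-bound otherwise.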
SustainabilityModel
core.solver.SustainabilityModel()
Calculates Datacenter-scale Sustainability metrics.
Handles Power Usage Effectiveness (PUE), Carbon Intensity, and Water Usage Effectiveness (WUE) across different regional grids. This solver models the ‘Infrastructure Tax’: the energy spent on cooling and power delivery rather than on neural computation.
Methods
| Name | Description |
|---|---|
| solve | Calculates energy, carbon, and water footprint for a fleet operation. |
solve
core.solver.SustainabilityModel.solve(fleet, duration_days, datacenter=None)
Calculates energy, carbon, and water footprint for a fleet operation.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fleet | Fleet | The hardware cluster configuration. | required |
| duration_days | float | Operating duration in days. | required |
| datacenter | Datacenter | A specific datacenter profile, defaults to fleet’s region. | None |
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Sustainability metrics including total energy (kWh) and carbon (kgCO2e). |
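A minimal sketch of how PUE, carbon intensity, and WUE combine into a footprint is shown below. It assumes the conventional definitions: total facility energy is IT energy times PUE, carbon scales with total energy, and WUE is expressed in liters per kWh of IT energy. The function, its arguments, and the returned keys are illustrative assumptions rather than the SustainabilityModel interface, which takes these values from the fleet's region or a Datacenter profile.

```python
def footprint_sketch(
    it_power_kw: float,
    duration_days: float,
    pue: float,
    carbon_intensity_kg_per_kwh: float,
    wue_l_per_kwh: float,
) -> dict:
    """Hypothetical energy, carbon, and water footprint for a fleet."""
    it_energy_kwh = it_power_kw * 24.0 * duration_days
    total_energy_kwh = it_energy_kwh * pue
    return {
        "it_energy_kwh": it_energy_kwh,
        "total_energy_kwh": total_energy_kwh,
        # Energy spent on cooling and power delivery (the 'Infrastructure Tax').
        "overhead_energy_kwh": total_energy_kwh - it_energy_kwh,
        "carbon_kg_co2e": total_energy_kwh * carbon_intensity_kg_per_kwh,
        "water_l": it_energy_kwh * wue_l_per_kwh,
    }


if __name__ == "__main__":
    print(footprint_sketch(it_power_kw=1000, duration_days=1, pue=1.5,
                           carbon_intensity_kg_per_kwh=0.5, wue_l_per_kwh=2.0))
```

With a PUE of 1.5 in this sketch, one third of the total energy goes to infrastructure overhead rather than to computation.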