core.solver.ServingModel

core.solver.ServingModel()

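Analyzes the two-phase LLM serving lifecycle: prefill vs. decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes. Prefill processes the entire prompt in one parallel pass and is typically compute-bound. Decoding generates one token at a time, re-reading the model weights and KV cache at every step, and is typically memory-bandwidth-bound.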
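The two regimes can be illustrated with a first-order roofline estimate. The sketch below is not the library's implementation; it assumes roughly 2 FLOPs per parameter per token for prefill and one full read of the weights per decoded token, and it ignores attention FLOPs and KV-cache traffic. The function names and hardware numbers are hypothetical.

```python
def prefill_time_s(n_params, prompt_tokens, batch_size, peak_flops, efficiency=0.5):
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * n_params * prompt_tokens * batch_size
    return flops / (peak_flops * efficiency)


def decode_step_time_s(n_params, bytes_per_param, mem_bandwidth_bytes_s):
    """Memory-bound estimate: every weight is read once per generated token."""
    return n_params * bytes_per_param / mem_bandwidth_bytes_s


# Illustrative (assumed) numbers: a 7e9-parameter model in fp16 on an
# accelerator with 300e12 FLOP/s peak compute and 2e12 B/s memory bandwidth.
ttft = prefill_time_s(7e9, prompt_tokens=2048, batch_size=1, peak_flops=300e12)
itl = decode_step_time_s(7e9, bytes_per_param=2, mem_bandwidth_bytes_s=2e12)
print(f"prefill ~ {ttft:.3f} s, decode step ~ {itl * 1e3:.1f} ms")
```

Under these assumptions prefill time scales with prompt length and compute throughput, while per-token decode time depends only on model size and memory bandwidth.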

Literature sources:

1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).

Methods

Name	Description
solve	Estimates LLM serving performance (TTFT, ITL, KV-cache size, memory feasibility) for a model on the given hardware.

solve

core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth='100 GB/s',
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

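Estimates LLM serving performance for a model on the given hardware, covering both the prefill and decode phases, with optional disaggregated serving, speculative decoding, prompt caching, and chunked prefill.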

Parameters

Name	Type	Description	Default
model	TransformerWorkload	The primary model to be served.	required
hardware	HardwareNode	The serving node, or prefill node for disaggregated serving.	required
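seq_len	int	Sequence length (context window), in tokens.	required
batch_size	int	Number of sequences processed together in a batch.	`1`
precision	str	Numerical precision used for serving, e.g. `'fp16'`.	`'fp16'`
efficiency	float	Compute efficiency, as a fraction of peak hardware throughput.	`0.5`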
decode_hardware	HardwareNode	Optional decode node for phase-split serving with KV-cache transfer.	`None`
network_bandwidth	Quantity	Bandwidth between prefill and decode nodes, used for the KV-cache transfer in phase-split serving.	`'100 GB/s'`
draft_model	TransformerWorkload	Optional draft model for speculative decoding.	`None`
draft_acceptance_rate	float	Expected draft token acceptance rate.	`0.7`
cached_prefix_len	int	Prefix tokens already covered by prompt-cache KV entries.	`0`
prefill_chunk_tokens	int	Optional prefill chunk budget for estimating a decode-stall proxy.	`None`
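To make the roles of `network_bandwidth` and `draft_acceptance_rate` concrete, here is a minimal sketch of two textbook estimates. It is not taken from the library: the function names are hypothetical, and `draft_len` (the number of tokens drafted per verification step) is an assumed parameter that does not appear in the signature above. The speculative-decoding formula assumes each drafted token is accepted independently with the same probability.

```python
def kv_transfer_time_s(kv_cache_bytes, network_bandwidth_bytes_s):
    """Time to ship the KV cache from the prefill node to the decode node."""
    return kv_cache_bytes / network_bandwidth_bytes_s


def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model verification step.

    Geometric-series estimate: (1 - a**(k + 1)) / (1 - a) for acceptance
    rate a and draft length k, assuming independent acceptances.
    """
    if acceptance_rate >= 1.0:
        # Every drafted token is accepted, plus one token from the target model.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


# Assumed example: a 1 GB KV cache over a 100 GB/s link, and a 4-token draft
# with the default acceptance rate of 0.7.
transfer = kv_transfer_time_s(kv_cache_bytes=1e9, network_bandwidth_bytes_s=100e9)
tokens = expected_tokens_per_step(acceptance_rate=0.7, draft_len=4)
print(f"KV transfer ~ {transfer * 1e3:.0f} ms, ~{tokens:.2f} tokens per step")
```

In a disaggregated setup the transfer time adds to the latency before the first decoded token, which is why the link bandwidth matters even though it carries no compute.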

Returns

ServingResult with time to first token (TTFT), inter-token latency (ITL), KV-cache size, memory feasibility, prompt-cache hit ratio, and optional chunked-prefill fields (prefill_chunks, prefill_chunk_time, decode_stall_bound).
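As a rough guide to how some of these fields relate to the inputs, the sketch below computes a KV-cache size, a prompt-cache hit ratio, and a prefill chunk count. These are assumed definitions for illustration only: the helper names are hypothetical, the library may define the quantities differently, and the chunk count here assumes only uncached tokens are prefilled.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Standard KV-cache size: keys and values (factor 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def prompt_cache_hit_ratio(cached_prefix_len, seq_len):
    """Assumed definition: fraction of the sequence already covered by cached KV entries."""
    return min(cached_prefix_len, seq_len) / seq_len


def prefill_chunks(seq_len, cached_prefix_len, prefill_chunk_tokens):
    """Assumed definition: chunks needed to prefill the uncached part of the prompt."""
    uncached = max(seq_len - cached_prefix_len, 0)
    return math.ceil(uncached / prefill_chunk_tokens)


# Illustrative model shape: 32 layers, 32 KV heads of dimension 128, fp16,
# a 4096-token sequence with a 1024-token cached prefix and 512-token chunks.
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
ratio = prompt_cache_hit_ratio(cached_prefix_len=1024, seq_len=4096)
chunks = prefill_chunks(seq_len=4096, cached_prefix_len=1024, prefill_chunk_tokens=512)
print(size / 2**30, ratio, chunks)
```

Comparing the KV-cache size plus the weight footprint against device memory is one simple way to reason about the memory feasibility field.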