solvers.ServingModel

solvers.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

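LLM inference is not a single mathematical operation; it is a stateful process with two distinct performance regimes: compute-bound pre-fill, which processes the whole prompt in one pass, and memory-bound decoding, which generates one token per step while re-reading the model weights and the KV-cache.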
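The two regimes can be illustrated with a minimal roofline sketch. It is not the solver's implementation: the helper name `phase_time_s`, the 7B parameter count, the hardware numbers, and the approximation of 2 FLOPs per parameter per token are all assumptions chosen for illustration, and attention FLOPs and KV-cache traffic are ignored.

```python
def phase_time_s(flops, bytes_moved, peak_flops, mem_bw, efficiency=0.5):
    """Roofline estimate: a phase takes as long as its slower resource."""
    compute_s = flops / (peak_flops * efficiency)
    memory_s = bytes_moved / mem_bw
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


# Hypothetical 7B-parameter model in fp16 on a hypothetical accelerator.
params = 7e9
bytes_per_param = 2      # fp16
peak_flops = 300e12      # assumed 300 TFLOP/s peak
mem_bw = 2e12            # assumed 2 TB/s memory bandwidth
seq_len = 2048

weight_bytes = params * bytes_per_param

# Pre-fill: every prompt token is processed in one pass over the weights.
prefill_s, prefill_bound = phase_time_s(
    2 * params * seq_len, weight_bytes, peak_flops, mem_bw
)

# Decoding: one token per step, but the full weights are still read each step.
decode_s, decode_bound = phase_time_s(
    2 * params, weight_bytes, peak_flops, mem_bw
)

print(f"pre-fill: {prefill_s:.4f} s ({prefill_bound}-bound)")
print(f"decode step: {decode_s:.4f} s ({decode_bound}-bound)")
```

Under these assumed numbers the pre-fill pass is limited by arithmetic throughput, while each decode step is limited by how fast the weights can be streamed from memory.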

Literature Source:
1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).

Methods

Name	Description
solve	Estimates LLM serving performance across the pre-fill and decoding phases.

solve

solvers.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth=Q_('100 GB/s'),
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

Estimates LLM serving performance across the pre-fill and decoding phases.

Parameters

Name	Type	Description	Default
model	TransformerWorkload	The primary model to be served.	required
hardware	HardwareNode	The hardware node for serving (or pre-fill node in disaggregated serving).	required
seq_len	int	Sequence length (context window).	required
batch_size	int	Batch size.	`1`
precision	str	Numerical precision.	`'fp16'`
efficiency	float	Compute efficiency: the fraction of peak hardware compute assumed to be achieved.	`0.5`
decode_hardware	HardwareNode	If provided, models disaggregated serving: `hardware` runs pre-fill, `decode_hardware` runs decoding, and the KV-cache is transferred between them over the network.	`None`
network_bandwidth	Quantity	Network bandwidth between pre-fill and decode nodes.	`Q_('100 GB/s')`
draft_model	TransformerWorkload	If provided, models Speculative Decoding using this smaller draft model.	`None`
draft_acceptance_rate	float	Expected acceptance rate (0.0 to 1.0) of draft tokens per step.	`0.7`
cached_prefix_len	int	Number of tokens whose KV-cache is already computed (prompt caching / prefix caching). When > 0, the pre-fill phase processes only the (seq_len - cached_prefix_len) new tokens, reducing time to first token (TTFT) proportionally. The full KV-cache, including the cached prefix, still occupies memory. Must be less than seq_len.	`0`
prefill_chunk_tokens	int	If provided, splits the new pre-fill tokens into chunks of at most this size. This gives a Sarathi-Serve-style proxy for chunked-prefill stalls: total TTFT covers the same compute work plus one dispatch tax per chunk, while decode_stall_bound reports the slowest single chunk that can interfere with ongoing decode iterations. It is not a full scheduler simulation.	`None`
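The interaction between cached_prefix_len and prefill_chunk_tokens can be sketched with a toy calculation. This is an illustration of the description above, not the solver's code: the function name `chunked_prefill_proxy`, the constants `per_token_s` and `dispatch_tax_s`, the linear per-token cost, and the choice to treat unchunked pre-fill as a single chunk are all assumptions.

```python
import math


def chunked_prefill_proxy(seq_len, cached_prefix_len=0, prefill_chunk_tokens=None,
                          per_token_s=1e-4, dispatch_tax_s=1e-3):
    """Toy TTFT and decode-stall proxy.

    per_token_s and dispatch_tax_s are hypothetical constants, not library values.
    """
    if not 0 <= cached_prefix_len < seq_len:
        raise ValueError("cached_prefix_len must be >= 0 and < seq_len")
    if prefill_chunk_tokens is not None and prefill_chunk_tokens <= 0:
        raise ValueError("prefill_chunk_tokens must be positive")

    # Only tokens without a cached KV entry need pre-fill compute.
    new_tokens = seq_len - cached_prefix_len

    # Assumption: unchunked pre-fill is treated as a single chunk.
    if prefill_chunk_tokens is None:
        largest_chunk = new_tokens
    else:
        largest_chunk = min(prefill_chunk_tokens, new_tokens)
    n_chunks = math.ceil(new_tokens / largest_chunk)

    # Same total compute work, plus one dispatch tax per chunk.
    ttft_s = new_tokens * per_token_s + n_chunks * dispatch_tax_s
    # With a linear per-token cost, the largest chunk is the slowest one.
    decode_stall_bound_s = largest_chunk * per_token_s + dispatch_tax_s

    return {
        "new_tokens": new_tokens,
        "n_chunks": n_chunks,
        "ttft_s": ttft_s,
        "decode_stall_bound_s": decode_stall_bound_s,
    }


unchunked = chunked_prefill_proxy(4096, cached_prefix_len=1024)
chunked = chunked_prefill_proxy(4096, cached_prefix_len=1024, prefill_chunk_tokens=512)
print(unchunked)
print(chunked)
```

In this toy model, chunking raises TTFT slightly because more dispatch taxes are paid, but it shrinks the longest uninterrupted pre-fill slice, which is the trade-off the decode_stall_bound proxy is meant to expose.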

Returns

Name	Type	Description
	ServingResult	Serving performance metrics.
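For intuition about the decode_hardware, network_bandwidth, and draft_acceptance_rate parameters, the sketch below estimates a KV-cache transfer time and an expected token yield per speculative step. It is a back-of-the-envelope illustration under stated assumptions, not what solve necessarily computes: the model dimensions are hypothetical, `n_draft_tokens` is not a parameter of solve, and the token-yield formula is a commonly used geometric-series estimate that assumes each draft token is accepted independently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV-cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_transfer_s(cache_bytes, network_bandwidth_bytes_per_s=100e9):
    """Time to move the KV-cache from the pre-fill node to the decode node."""
    return cache_bytes / network_bandwidth_bytes_per_s


def expected_tokens_per_step(acceptance_rate, n_draft_tokens):
    """Expected tokens emitted per verification step (independent acceptance assumed)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0.0, 1.0]")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(n_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (n_draft_tokens + 1)) / (1.0 - acceptance_rate)


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, fp16.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048)
transfer = kv_transfer_s(cache)  # default mirrors the 100 GB/s default above
tokens_per_step = expected_tokens_per_step(0.7, n_draft_tokens=4)

print(f"KV-cache: {cache / 2**30:.2f} GiB, transfer: {transfer * 1e3:.2f} ms")
print(f"expected tokens per speculative step: {tokens_per_step:.2f}")
```

The transfer time is a one-off cost added between pre-fill and the first decode step, whereas the speculative yield applies to every decode step, so the two effects act on different parts of the request latency.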