solvers.ServingModel

solvers.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

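LLM inference is not a single mathematical operation; it is a stateful process with two distinct performance regimes: compute-bound pre-fill, which processes all prompt tokens in parallel, and memory-bound decoding, which generates one token at a time and must re-read the model weights and KV-cache at every step.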
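
The contrast between the two regimes can be illustrated with a back-of-envelope estimate. The sketch below is not this library's implementation: the function names, the 2-FLOPs-per-parameter-per-token rule of thumb, and all hardware numbers are assumptions chosen for illustration, and KV-cache reads during decoding are ignored.

```python
def prefill_time_s(n_params, n_tokens, peak_flops, efficiency=0.5):
    """Compute-bound estimate (assumption): ~2 FLOPs per parameter per token,
    executed at a fraction `efficiency` of the peak FLOP rate."""
    return 2.0 * n_params * n_tokens / (peak_flops * efficiency)


def decode_step_time_s(n_params, bytes_per_param, mem_bandwidth):
    """Memory-bound estimate (assumption): every weight is read from memory
    once per generated token; KV-cache traffic is ignored."""
    return n_params * bytes_per_param / mem_bandwidth


# Hypothetical workload and hardware, not taken from any real datasheet.
N_PARAMS = 7e9         # 7B-parameter model
PROMPT_TOKENS = 2048   # prompt length
PEAK_FLOPS = 1e15      # 1 PFLOP/s peak compute
MEM_BANDWIDTH = 2e12   # 2 TB/s memory bandwidth
BYTES_FP16 = 2         # bytes per parameter at fp16

ttft = prefill_time_s(N_PARAMS, PROMPT_TOKENS, PEAK_FLOPS, efficiency=0.5)
tpot = decode_step_time_s(N_PARAMS, BYTES_FP16, MEM_BANDWIDTH)

print(f"prefill (whole prompt): {ttft * 1e3:.1f} ms")
print(f"prefill per token:      {ttft / PROMPT_TOKENS * 1e6:.1f} us")
print(f"decode per token:       {tpot * 1e3:.1f} ms")
```

Under these assumed numbers, a prompt token costs far less time in pre-fill than a generated token costs in decoding, which is why the two phases are modeled separately.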

Literature Sources:

1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise,” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).

Methods

| Name | Description |
| --- | --- |
| solve | Solves for LLM serving performance. |

solve

solvers.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth=Q_('100 GB/s'),
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

Solves for LLM serving performance.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | TransformerWorkload | The primary model to be served. | required |
| hardware | HardwareNode | The hardware node for serving (or pre-fill node in disaggregated serving). | required |
| seq_len | int | Sequence length (context window). | required |
| batch_size | int | Batch size. | 1 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency. | 0.5 |
| decode_hardware | HardwareNode | If provided, models Disaggregated Serving where ‘hardware’ does pre-fill and ‘decode_hardware’ does decoding. KV-cache is transferred over the network. | None |
| network_bandwidth | Quantity | Network bandwidth between pre-fill and decode nodes. | Q_('100 GB/s') |
| draft_model | TransformerWorkload | If provided, models Speculative Decoding using this smaller draft model. | None |
| draft_acceptance_rate | float | Expected acceptance rate (0.0 to 1.0) of draft tokens per step. | 0.7 |
| cached_prefix_len | int | Number of tokens with pre-computed KV-cache (prompt caching / prefix caching). When > 0, the prefill phase only processes (seq_len - cached_prefix_len) new tokens, reducing TTFT proportionally. The full KV-cache (including cached prefix) still occupies memory. Must be < seq_len. | 0 |
| prefill_chunk_tokens | int | If provided, split new prefill tokens into chunks of at most this size. This estimates a Sarathi-Serve-style chunked-prefill stall proxy: total TTFT keeps the same compute work plus one dispatch tax per chunk, while decode_stall_bound reports the slowest single chunk that can interfere with ongoing decode iterations. It is not a full scheduler simulation. | None |
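
The arithmetic implied by cached_prefix_len, prefill_chunk_tokens, and network_bandwidth can be sketched as follows. This is a minimal illustration under stated assumptions, not the solver's actual code: all helper names are hypothetical, prefill time is assumed to scale linearly with token count, the per-chunk dispatch tax is a made-up constant, and the stall bound is assumed to include that tax.

```python
def new_prefill_tokens(seq_len, cached_prefix_len=0):
    """Tokens the prefill phase still has to process after prefix caching."""
    if not 0 <= cached_prefix_len < seq_len:
        raise ValueError("cached_prefix_len must satisfy 0 <= value < seq_len")
    return seq_len - cached_prefix_len


def prefill_chunks(new_tokens, prefill_chunk_tokens=None):
    """Split the new tokens into chunks of at most `prefill_chunk_tokens`."""
    if prefill_chunk_tokens is None:
        return [new_tokens]  # no chunking: one monolithic prefill
    if prefill_chunk_tokens <= 0:
        raise ValueError("prefill_chunk_tokens must be positive")
    full, rest = divmod(new_tokens, prefill_chunk_tokens)
    return [prefill_chunk_tokens] * full + ([rest] if rest else [])


def chunked_prefill_times_s(chunks, time_per_token_s, dispatch_tax_s):
    """Return (total_ttft, stall_bound).

    Total TTFT is the same compute work plus one dispatch tax per chunk;
    the stall bound is the slowest single chunk (assumed to include its tax).
    """
    chunk_times = [c * time_per_token_s + dispatch_tax_s for c in chunks]
    return sum(chunk_times), max(chunk_times)


def kv_transfer_time_s(kv_cache_bytes, network_bandwidth_bytes_per_s):
    """Time to ship the KV-cache from the pre-fill node to the decode node."""
    return kv_cache_bytes / network_bandwidth_bytes_per_s


# Hypothetical numbers for illustration only.
tokens = new_prefill_tokens(seq_len=4096, cached_prefix_len=1024)
chunks = prefill_chunks(tokens, prefill_chunk_tokens=1000)
total_ttft, stall_bound = chunked_prefill_times_s(
    chunks, time_per_token_s=2e-5, dispatch_tax_s=1e-4
)
transfer = kv_transfer_time_s(kv_cache_bytes=2e9, network_bandwidth_bytes_per_s=100e9)

print(tokens, chunks)
print(f"total TTFT {total_ttft * 1e3:.2f} ms, stall bound {stall_bound * 1e3:.2f} ms")
print(f"KV-cache transfer {transfer * 1e3:.1f} ms")
```

Smaller chunks lower the stall bound seen by ongoing decode iterations but add more dispatch taxes to the total TTFT, which is the trade-off the prefill_chunk_tokens parameter exposes.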

Returns

| Name | Type | Description |
| --- | --- | --- |
| | ServingResult | Serving performance metrics. |