solvers.ServingModel
solvers.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.
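LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes: compute-bound pre-fill, which processes the prompt tokens in parallel, and memory-bandwidth-bound decoding, which generates one token per step while re-reading the model weights and the KV-cache.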
Literature sources:

1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).
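The two regimes can be illustrated with a back-of-the-envelope roofline estimate. This is a minimal sketch, not the solver's implementation: the helper names (`prefill_time_s`, `decode_step_time_s`), the 2-FLOPs-per-parameter-per-token rule of thumb, and the hardware numbers are all assumptions chosen for illustration.

```python
def prefill_time_s(n_params, new_tokens, batch_size, peak_flops, efficiency=0.5):
    """Compute-bound estimate: roughly 2 FLOPs per parameter per token (assumed rule of thumb)."""
    flops = 2.0 * n_params * new_tokens * batch_size
    return flops / (peak_flops * efficiency)


def decode_step_time_s(n_params, bytes_per_param, kv_cache_bytes, mem_bandwidth):
    """Memory-bound estimate: one decode step streams the weights and the KV-cache once."""
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return bytes_moved / mem_bandwidth


# Hypothetical 7B-parameter model on a hypothetical accelerator (illustrative numbers only).
N_PARAMS = 7e9
PEAK_FLOPS = 300e12    # FLOP/s, assumed
MEM_BANDWIDTH = 2e12   # bytes/s, assumed

ttft = prefill_time_s(N_PARAMS, new_tokens=2048, batch_size=1, peak_flops=PEAK_FLOPS)
# The KV-cache is set to zero here to isolate the cost of streaming fp16 weights.
step = decode_step_time_s(N_PARAMS, bytes_per_param=2, kv_cache_bytes=0,
                          mem_bandwidth=MEM_BANDWIDTH)
print(f"prefill ~ {ttft * 1e3:.1f} ms, decode step ~ {step * 1e3:.1f} ms")
```

In this sketch the pre-fill time scales with the token count and the compute peak, while the per-step decode time depends only on bytes moved and memory bandwidth, which is why the two phases are modeled separately.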
Methods
| Name | Description |
|---|---|
| solve | Solves for LLM serving performance. |
solve
solvers.ServingModel.solve(
model,
hardware,
seq_len,
batch_size=1,
precision='fp16',
efficiency=0.5,
decode_hardware=None,
network_bandwidth=Q_('100 GB/s'),
draft_model=None,
draft_acceptance_rate=0.7,
cached_prefix_len=0,
prefill_chunk_tokens=None,
)

Solves for LLM serving performance.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The primary model to be served. | required |
| hardware | HardwareNode | The hardware node for serving (or pre-fill node in disaggregated serving). | required |
| seq_len | int | Sequence length (context window). | required |
| batch_size | int | Batch size. | 1 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency: the fraction of peak hardware throughput assumed to be achieved (0.0 to 1.0). | 0.5 |
| decode_hardware | HardwareNode | If provided, models disaggregated serving, where `hardware` performs pre-fill and `decode_hardware` performs decoding. The KV-cache is transferred between them over the network. | None |
| network_bandwidth | Quantity | Network bandwidth between pre-fill and decode nodes. | Q_('100 GB/s') |
| draft_model | TransformerWorkload | If provided, models Speculative Decoding using this smaller draft model. | None |
| draft_acceptance_rate | float | Expected acceptance rate (0.0 to 1.0) of draft tokens per step. | 0.7 |
| cached_prefix_len | int | Number of tokens with pre-computed KV-cache (prompt caching / prefix caching). When > 0, the prefill phase only processes (seq_len - cached_prefix_len) new tokens, reducing TTFT proportionally. The full KV-cache (including cached prefix) still occupies memory. Must be < seq_len. | 0 |
| prefill_chunk_tokens | int | If provided, split new prefill tokens into chunks of at most this size. This estimates a Sarathi-Serve-style chunked-prefill stall proxy: total TTFT keeps the same compute work plus one dispatch tax per chunk, while decode_stall_bound reports the slowest single chunk that can interfere with ongoing decode iterations. It is not a full scheduler simulation. | None |
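The interaction between `cached_prefix_len` and `prefill_chunk_tokens` can be sketched with simple arithmetic. The code below is an illustrative assumption, not the solver's code: the function names, the linear `time_per_token_s` cost, the `dispatch_tax_s` parameter, the choice to include the dispatch tax in the per-chunk stall, and the treatment of the unchunked case as a single chunk are all hypothetical.

```python
def new_prefill_tokens(seq_len, cached_prefix_len=0):
    """Tokens that still need pre-fill after the cached prefix is reused."""
    if not 0 <= cached_prefix_len < seq_len:
        raise ValueError("cached_prefix_len must satisfy 0 <= cached_prefix_len < seq_len")
    return seq_len - cached_prefix_len


def chunk_sizes(new_tokens, prefill_chunk_tokens=None):
    """Split new tokens into chunks of at most prefill_chunk_tokens (one chunk if None)."""
    if prefill_chunk_tokens is None:
        return [new_tokens]
    if prefill_chunk_tokens <= 0:
        raise ValueError("prefill_chunk_tokens must be positive")
    n_full, remainder = divmod(new_tokens, prefill_chunk_tokens)
    return [prefill_chunk_tokens] * n_full + ([remainder] if remainder else [])


def chunked_prefill_proxy(seq_len, cached_prefix_len, prefill_chunk_tokens,
                          time_per_token_s, dispatch_tax_s):
    """Same total compute work, plus one dispatch tax per chunk (assumed linear cost model)."""
    chunks = chunk_sizes(new_prefill_tokens(seq_len, cached_prefix_len), prefill_chunk_tokens)
    per_chunk = [c * time_per_token_s + dispatch_tax_s for c in chunks]
    return {
        "num_chunks": len(chunks),
        "ttft_s": sum(per_chunk),                # total time to first token
        "decode_stall_bound_s": max(per_chunk),  # slowest single chunk
    }


result = chunked_prefill_proxy(seq_len=4096, cached_prefix_len=1024,
                               prefill_chunk_tokens=512,
                               time_per_token_s=1e-4, dispatch_tax_s=1e-3)
print(result)
```

With these assumed inputs, 3072 new tokens are split into six chunks of 512. Smaller chunks lower the stall bound at the cost of more dispatch taxes in the total TTFT, which is the trade-off the parameter description refers to.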
Returns
| Name | Type | Description |
|---|---|---|
| | ServingResult | Serving performance metrics. |
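For intuition about the `decode_hardware` / `network_bandwidth` and `draft_model` / `draft_acceptance_rate` parameters, the sketch below estimates the KV-cache transfer time for disaggregated serving and the expected tokens per speculative step. It is a hedged illustration only: the helper names, the model shape, and the number of draft tokens per step (`num_draft_tokens`) are hypothetical, the speculative formula assumes independent per-token acceptance, and the solver may compute these quantities differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """KV-cache size: keys and values (factor 2) for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_transfer_time_s(cache_bytes, network_bandwidth_bytes_per_s):
    """Time to move the KV-cache from the pre-fill node to the decode node."""
    return cache_bytes / network_bandwidth_bytes_per_s


def expected_tokens_per_step(acceptance_rate, num_draft_tokens):
    """Expected tokens emitted per verification step, assuming i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128, fp16 (2 bytes).
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=1, bytes_per_elem=2)
transfer = kv_transfer_time_s(cache, network_bandwidth_bytes_per_s=100e9)  # 100 GB/s
tokens = expected_tokens_per_step(acceptance_rate=0.7, num_draft_tokens=4)  # 4 is assumed
print(f"KV-cache: {cache / 2**30:.1f} GiB, transfer ~ {transfer * 1e3:.1f} ms, "
      f"expected tokens/step ~ {tokens:.2f}")
```

The transfer time is a one-off cost added between pre-fill and the first decode step, whereas the speculative-decoding gain applies to every decode step, so the two effects act on different parts of the request latency.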