core.solver.ServingModel
core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.
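LLM inference is not a single mathematical operation; it is a stateful process with two distinct performance regimes. Pre-fill processes the whole prompt in parallel and is typically compute-bound. Decoding generates one token at a time, re-reading the weights and KV cache at every step, and is typically memory-bandwidth-bound.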
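The two regimes can be illustrated with a roofline-style estimate. The following is a minimal sketch, not the `ServingModel` implementation: the function names, the rule of thumb of about 2 FLOPs per parameter per token, and all hardware figures are assumptions chosen for illustration, and attention FLOPs are ignored.

```python
def prefill_time_s(n_params, prompt_tokens, batch_size, peak_flops, efficiency=0.5):
    """Compute-bound estimate: roughly 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * n_params * prompt_tokens * batch_size
    return flops / (peak_flops * efficiency)


def decode_step_time_s(n_params, bytes_per_param, kv_cache_bytes, mem_bandwidth):
    """Memory-bound estimate: each step streams the weights and KV cache once."""
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return bytes_moved / mem_bandwidth


# Hypothetical 7B-parameter model in fp16 on a device with
# 1e15 FLOP/s peak compute and 2e12 B/s memory bandwidth (assumed figures).
ttft = prefill_time_s(n_params=7e9, prompt_tokens=2048, batch_size=1,
                      peak_flops=1e15, efficiency=0.5)
itl = decode_step_time_s(n_params=7e9, bytes_per_param=2,
                         kv_cache_bytes=1e9, mem_bandwidth=2e12)

print(f"prefill (TTFT proxy): {ttft * 1e3:.2f} ms")
print(f"decode step (ITL proxy): {itl * 1e3:.2f} ms")
```

Under these assumptions, pre-fill time scales with prompt length and available compute, while decode step time scales with bytes moved per token and memory bandwidth, which is why the two phases respond differently to batching and hardware choice.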
Literature sources:

1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).
Methods
| Name | Description |
|---|---|
| solve | Solves for LLM serving performance. |
solve
core.solver.ServingModel.solve(
model,
hardware,
seq_len,
batch_size=1,
precision='fp16',
efficiency=0.5,
decode_hardware=None,
network_bandwidth='100 GB/s',
draft_model=None,
draft_acceptance_rate=0.7,
cached_prefix_len=0,
prefill_chunk_tokens=None,
)

Solves for LLM serving performance.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The primary model to be served. | required |
| hardware | HardwareNode | The serving node, or prefill node for disaggregated serving. | required |
| seq_len | int | Sequence length / context window. | required |
| batch_size | int | Batch size. | 1 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency. | 0.5 |
| decode_hardware | HardwareNode | Optional decode node for phase-split serving with KV-cache transfer. | None |
| network_bandwidth | Quantity | Bandwidth between prefill and decode nodes. | '100 GB/s' |
| draft_model | TransformerWorkload | Optional draft model for speculative decoding. | None |
| draft_acceptance_rate | float | Expected draft token acceptance rate. | 0.7 |
| cached_prefix_len | int | Prefix tokens already covered by prompt-cache KV entries. | 0 |
| prefill_chunk_tokens | int | Optional prefill chunk budget for estimating a decode-stall proxy. | None |
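To give a sense of what `decode_hardware`, `network_bandwidth`, and `draft_acceptance_rate` trade off, here is a minimal sketch of two standard back-of-the-envelope formulas. It is not the library's code: the helper names, the model dimensions, and the draft length `k` (which is not a parameter of `solve`) are hypothetical, and the speculative-decoding formula assumes each draft token is accepted independently with the same probability.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV-cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_transfer_time_s(cache_bytes, network_bandwidth_bytes_per_s):
    """Time to ship the KV cache from the prefill node to the decode node."""
    return cache_bytes / network_bandwidth_bytes_per_s


def expected_tokens_per_verify_step(acceptance_rate, k):
    """Expected tokens emitted per target-model pass with k draft tokens,
    assuming i.i.d. acceptance: (1 - a**(k + 1)) / (1 - a)."""
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16, 4096-token context.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=1, bytes_per_elem=2)
transfer = kv_transfer_time_s(cache, 100e9)  # 100 GB/s link
speedup_tokens = expected_tokens_per_verify_step(acceptance_rate=0.7, k=4)

print(f"KV cache: {cache / 2**30:.1f} GiB, transfer: {transfer * 1e3:.1f} ms")
print(f"expected tokens per verification step: {speedup_tokens:.2f}")
```

In a disaggregated setup the transfer time adds to time to first token, so a larger context or a slower link shifts the balance back toward single-node serving.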
Returns
ServingResult with time to first token (TTFT), inter-token latency (ITL), KV-cache size, memory feasibility, prompt-cache hit ratio, and, when prefill_chunk_tokens is set, the chunked-prefill fields prefill_chunks, prefill_chunk_time, and decode_stall_bound.
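One plausible way to read the prompt-cache and chunked-prefill fields is sketched below. The formulas are assumptions made for illustration, not the documented behavior of `ServingResult`: the hit ratio is taken as cached tokens over total tokens, the chunk count as the ceiling of uncached tokens over the chunk budget, and the decode-stall proxy as the time of one full chunk. The `ChunkedPrefillEstimate` class and the `time_per_token_s` input are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkedPrefillEstimate:
    prompt_cache_hit_ratio: float
    prefill_chunks: int
    prefill_chunk_time: float   # seconds per chunk
    decode_stall_bound: float   # seconds a decode step may wait behind one chunk


def estimate_chunked_prefill(seq_len, cached_prefix_len, prefill_chunk_tokens,
                             time_per_token_s):
    """Split the uncached part of the prompt into fixed-size chunks."""
    if seq_len <= 0 or prefill_chunk_tokens <= 0:
        raise ValueError("seq_len and prefill_chunk_tokens must be positive")
    if not 0 <= cached_prefix_len <= seq_len:
        raise ValueError("cached_prefix_len must be in [0, seq_len]")

    uncached = seq_len - cached_prefix_len
    hit_ratio = cached_prefix_len / seq_len
    chunks = math.ceil(uncached / prefill_chunk_tokens)
    # The largest chunk bounds how long an interleaved decode step can stall.
    chunk_time = min(prefill_chunk_tokens, uncached) * time_per_token_s
    return ChunkedPrefillEstimate(hit_ratio, chunks, chunk_time, chunk_time)


est = estimate_chunked_prefill(seq_len=4096, cached_prefix_len=1024,
                               prefill_chunk_tokens=512, time_per_token_s=1e-5)
print(est)
```

A smaller chunk budget lowers the stall bound at the cost of more chunks, which is the scheduling trade-off that chunked prefill exposes.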