core.solver.ServingModel

core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

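LLM inference is not a single uniform operation; it is a stateful process with two distinct performance regimes. Pre-fill processes the whole prompt in parallel and is typically compute-bound. Decoding generates one token at a time, re-reading the weights and KV cache at every step, and is typically memory-bandwidth-bound.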
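The two regimes can be illustrated with a roofline-style estimate. The sketch below is not the library's implementation: the function `phase_times`, its arguments, and the hardware numbers are assumptions chosen for illustration, and it uses the common approximation of about 2 FLOPs per parameter per token while ignoring attention and KV-cache traffic.

```python
def phase_times(n_params, seq_len, batch_size, peak_flops, mem_bw,
                bytes_per_param=2, efficiency=0.5):
    """Roofline-style estimate of prefill and per-step decode time in seconds.

    Hypothetical helper for illustration only; not part of core.solver.
    """
    weight_bytes = n_params * bytes_per_param
    usable_flops = peak_flops * efficiency

    # Prefill: all prompt tokens are processed in one pass over the weights.
    prefill_compute = 2 * n_params * seq_len * batch_size / usable_flops
    prefill_memory = weight_bytes / mem_bw

    # Decode: one new token per sequence per step, but the full weights
    # are still streamed from memory on every step.
    decode_compute = 2 * n_params * batch_size / usable_flops
    decode_memory = weight_bytes / mem_bw

    return {
        "prefill_s": max(prefill_compute, prefill_memory),
        "prefill_bound": "compute" if prefill_compute >= prefill_memory else "memory",
        "decode_s": max(decode_compute, decode_memory),
        "decode_bound": "compute" if decode_compute >= decode_memory else "memory",
    }


# Illustrative (assumed) numbers: 7e9 parameters in fp16,
# 312e12 FLOP/s peak compute, 2e12 B/s memory bandwidth.
example = phase_times(n_params=7e9, seq_len=2048, batch_size=1,
                      peak_flops=312e12, mem_bw=2e12)
print(example)
```

With these assumed numbers the prefill pass is limited by compute and each decode step is limited by memory bandwidth, which is the split the class models.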

Literature Sources:

1. Pope et al. (2023), “Efficiently Scaling Transformer Inference.”
2. Agrawal et al. (2024), “Sarathi-Serve” (chunked prefill scheduling).
3. Patel et al. (2024), “Splitwise,” and Zhong et al. (2024), “DistServe” (prefill/decode disaggregation).

Methods

| Name | Description |
| --- | --- |
| solve | Solves for LLM serving performance. |

solve

core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    decode_hardware=None,
    network_bandwidth='100 GB/s',
    draft_model=None,
    draft_acceptance_rate=0.7,
    cached_prefix_len=0,
    prefill_chunk_tokens=None,
)

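Solves for LLM serving performance: estimates time to first token, inter-token latency, KV-cache size, and memory feasibility for the given model, hardware, and request shape.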

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | TransformerWorkload | The primary model to be served. | required |
| hardware | HardwareNode | The serving node, or prefill node for disaggregated serving. | required |
| seq_len | int | Sequence length / context window. | required |
| batch_size | int | Batch size. | 1 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency. | 0.5 |
| decode_hardware | HardwareNode | Optional decode node for phase-split serving with KV-cache transfer. | None |
| network_bandwidth | Quantity | Bandwidth between prefill and decode nodes. | 100 GB/s |
| draft_model | TransformerWorkload | Optional draft model for speculative decoding. | None |
| draft_acceptance_rate | float | Expected draft token acceptance rate. | 0.7 |
| cached_prefix_len | int | Prefix tokens already covered by prompt-cache KV entries. | 0 |
| prefill_chunk_tokens | int | Optional prefill chunk budget for estimating a decode-stall proxy. | None |
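To make the disaggregation and speculative-decoding parameters more concrete, the following sketch shows one plausible way they could enter an estimate. It is not the solver's code. All function names are hypothetical, `draft_len` (the number of draft tokens proposed per verification step) is an assumed parameter that does not appear in the `solve` signature, and the expected-token formula assumes each draft token is accepted independently with the same probability.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of the KV cache in bytes (hypothetical helper)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_transfer_seconds(kv_bytes, bandwidth_bytes_per_s):
    """Time to ship the KV cache from the prefill node to the decode node."""
    return kv_bytes / bandwidth_bytes_per_s


def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification step in speculative decoding.

    Assumes independent, identically likely acceptance of each draft token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


# Illustrative (assumed) model shape: 32 layers, 32 KV heads, head_dim 128.
kv_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
transfer_s = kv_transfer_seconds(kv_bytes, bandwidth_bytes_per_s=100e9)  # 100 GB/s
tokens_per_step = expected_tokens_per_step(acceptance_rate=0.7, draft_len=4)
print(kv_bytes, transfer_s, tokens_per_step)
```

Under these assumptions, the transfer time is an added cost in phase-split serving and grows linearly with `seq_len` and `batch_size`, while a higher `draft_acceptance_rate` raises the number of tokens produced per target-model pass.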

Returns

A ServingResult containing time to first token (TTFT), inter-token latency (ITL), KV-cache size, memory feasibility, prompt-cache hit ratio, and the optional chunked-prefill fields (prefill_chunks, prefill_chunk_time, decode_stall_bound).
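As a rough guide to how the prompt-cache and chunking fields might relate to the inputs, here is a minimal sketch. It assumes the hit ratio is the cached prefix divided by the sequence length and that only uncached tokens are chunked; the library's actual definitions may differ, and `prompt_cache_and_chunks` is a hypothetical helper, not part of `core.solver`.

```python
import math


def prompt_cache_and_chunks(seq_len, cached_prefix_len=0, prefill_chunk_tokens=None):
    """Derive a cache hit ratio and a prefill chunk count (assumed definitions)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")

    # Clamp the cached prefix to the valid range [0, seq_len].
    cached = min(max(cached_prefix_len, 0), seq_len)
    uncached = seq_len - cached

    chunks = None  # stays None when no chunk budget is given
    if prefill_chunk_tokens is not None:
        if prefill_chunk_tokens <= 0:
            raise ValueError("prefill_chunk_tokens must be positive")
        chunks = math.ceil(uncached / prefill_chunk_tokens)

    return {
        "hit_ratio": cached / seq_len,
        "uncached_tokens": uncached,
        "prefill_chunks": chunks,
    }


summary = prompt_cache_and_chunks(seq_len=4096, cached_prefix_len=1024,
                                  prefill_chunk_tokens=512)
print(summary)
```

In this sketch a larger cached prefix lowers the number of tokens that still need prefill, and a smaller chunk budget increases the chunk count, each chunk being a shorter interruption for any decode requests sharing the node.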
