core.solver.ServingModel

core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: pre-fill vs. decoding.
LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes: compute-bound pre-fill and memory-bound decoding.

Literature Sources:

1. Pope et al. (2023), "Efficiently Scaling Transformer Inference" (inference bottlenecks).
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale."
3. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models."
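The split into two regimes follows from arithmetic intensity (FLOPs per byte of memory traffic). The sketch below is illustrative, not part of this library: it computes the intensity of a single weight-matrix multiply for both phases, using hypothetical shapes.

```python
# Illustrative sketch (not library code): arithmetic intensity of the two
# serving phases for one (d_model x d_model) weight matrix in fp16.

def arithmetic_intensity(seq_len: int, d_model: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte moved for one dense matmul over seq_len tokens.

    Pre-fill processes seq_len tokens in one pass; decoding generates one
    token per step, so pass seq_len=1 for the decode regime.
    """
    flops = 2 * seq_len * d_model * d_model            # multiply-accumulates
    bytes_moved = d_model * d_model * bytes_per_param  # weight traffic dominates
    return flops / bytes_moved

prefill = arithmetic_intensity(seq_len=2048, d_model=4096)  # 2048 FLOPs/byte
decode = arithmetic_intensity(seq_len=1, d_model=4096)      # 1 FLOP/byte
```

Each weight loaded from memory is reused `seq_len` times during pre-fill but only once per decode step, so pre-fill saturates compute while decoding is throttled by memory bandwidth.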
Methods
| Name | Description |
|---|---|
| solve | Solves for LLM serving performance. |
solve
core.solver.ServingModel.solve(
model,
hardware,
seq_len,
batch_size=1,
precision='fp16',
efficiency=0.5,
)Solves for LLM serving performance.