core.solver.ServingModel

core.solver.ServingModel()

Analyzes the two-phase LLM serving lifecycle: Pre-fill vs. Decoding.

LLM inference is not a single mathematical operation; it is a stateful process with two distinct physical regimes. Pre-fill processes the entire prompt in parallel and is compute-bound (limited by the device's peak FLOP/s), while decoding emits one token at a time and is memory-bound (limited by the bandwidth needed to stream the weights and KV cache on every step).
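The two regimes can be sketched with a back-of-the-envelope roofline estimate. The helper names, the ~2N-FLOPs-per-token rule of thumb, and the A100-like peak numbers below are illustrative assumptions, not this solver's actual implementation:

```python
def prefill_time_s(n_params, seq_len, batch, peak_flops, efficiency=0.5):
    """Pre-fill: roughly 2*N FLOPs per token, with all prompt tokens
    processed in parallel, so the phase is compute-bound."""
    flops = 2 * n_params * seq_len * batch
    return flops / (peak_flops * efficiency)

def decode_time_per_token_s(n_params, bytes_per_param, mem_bw, efficiency=0.5):
    """Decoding: every step streams all weights from memory to emit one
    token, so the phase is memory-bandwidth-bound."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (mem_bw * efficiency)

# 7B-parameter model in fp16 on an A100-like device (assumed peaks:
# 312 TFLOP/s compute, 2 TB/s memory bandwidth).
prefill = prefill_time_s(7e9, seq_len=2048, batch=1, peak_flops=312e12)
per_token = decode_time_per_token_s(7e9, bytes_per_param=2, mem_bw=2.0e12)
```

Under these assumptions a 2048-token prompt pre-fills in well under a second, while each decoded token costs on the order of ten milliseconds, which is why long generations are dominated by the memory-bound phase.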

Literature Sources:

1. Pope et al. (2023), "Efficiently Scaling Transformer Inference" (inference bottlenecks).
2. Aminabadi et al. (2022), "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale."
3. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models."

Methods

Name Description
solve Solves for LLM serving performance.

solve

core.solver.ServingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)

Solves for LLM serving performance across both phases, given a model description, a hardware specification, the prompt sequence length, the batch size, the numeric precision of the weights, and an efficiency factor applied to the hardware's peak rates.
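As an illustration only, a solver with this signature could combine the two phases roughly as follows. The formulas, the default peak rates, and the precision-to-byte-width table are assumptions for the sketch; the real `solve` may model KV-cache traffic and batching effects in more detail:

```python
def serve_estimate(n_params, seq_len, gen_tokens, batch_size=1,
                   precision="fp16", efficiency=0.5,
                   peak_flops=312e12, mem_bw=2.0e12):
    """Hypothetical end-to-end estimate: compute-bound pre-fill over the
    prompt plus memory-bound decoding for each generated token."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}[precision]
    prefill_s = 2 * n_params * seq_len * batch_size / (peak_flops * efficiency)
    decode_s = gen_tokens * n_params * bytes_per_param / (mem_bw * efficiency)
    return {"prefill_s": prefill_s,
            "decode_s": decode_s,
            "tokens_per_s": gen_tokens / (prefill_s + decode_s)}

# 7B model, 2048-token prompt, 256 generated tokens.
result = serve_estimate(7e9, seq_len=2048, gen_tokens=256)
```

Note how `precision` only changes the decode term here: halving the bytes per parameter (e.g. `int8` vs `fp16`) roughly doubles decode throughput, while pre-fill time is governed by FLOPs.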
