solvers.ServingCapacityModel

solvers.ServingCapacityModel()

Sizes an LLM serving deployment from a target request rate (QPS) and a P99 tail-latency target.

The model deliberately composes existing first-order pieces: ServingModel for TTFT/ITL, ContinuousBatchingModel for per-replica token capacity, and TailLatencyModel for queueing pressure. It is a capacity planner, not a request-level scheduler.
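To make the composition concrete, here is a minimal first-order sketch of how per-request service time, per-replica capacity, and queueing pressure could be combined into a P99 estimate. It is not the library's implementation: every function name below is hypothetical, and three modelling choices are illustrative assumptions only. Service time is taken as TTFT plus one ITL per remaining output token, waiting time comes from a Kingman-style approximation with Poisson arrivals, and the mean wait is scaled to a P99 by assuming an exponential tail.

```python
import math


def service_time_ms(ttft_ms, itl_ms, output_tokens):
    """End-to-end generation time: first token, then one ITL per remaining token (assumed)."""
    return ttft_ms + itl_ms * max(output_tokens - 1, 0)


def replica_capacity_rps(service_ms, max_batch_size):
    """Requests per second one replica sustains with max_batch_size requests in flight (assumed)."""
    return max_batch_size / (service_ms / 1000.0)


def estimate_p99_ms(qps, replicas, service_ms, max_batch_size, service_time_cv=1.0):
    """Crude P99 estimate: service time plus a queueing term that grows with utilization."""
    capacity = replica_capacity_rps(service_ms, max_batch_size)
    rho = qps / (replicas * capacity)  # utilization of each replica
    if rho >= 1.0:
        return math.inf  # offered load exceeds capacity, so the queue grows without bound
    mean_slot_ms = 1000.0 / capacity  # mean time between completions on one replica
    # Kingman-style mean wait with Poisson arrivals (arrival CV = 1): an assumption.
    variability = (1.0 + service_time_cv ** 2) / 2.0
    mean_wait_ms = (rho / (1.0 - rho)) * variability * mean_slot_ms
    # Exponential-tail heuristic for scaling the mean wait to the 99th percentile: an assumption.
    return service_ms + mean_wait_ms * math.log(100.0)


if __name__ == "__main__":
    svc = service_time_ms(ttft_ms=100, itl_ms=10, output_tokens=128)
    for n in (1, 4, 8):
        print(n, estimate_p99_ms(qps=50, replicas=n, service_ms=svc, max_batch_size=32))
```

In this sketch a single replica is overloaded at 50 QPS, so its estimate is infinite, and adding replicas lowers utilization and shrinks the queueing term toward the bare service time. That is the kind of trade-off a capacity planner explores; a request-level scheduler would simulate individual requests instead.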

Methods

| Name | Description |
| --- | --- |
| solve | Return the minimum replica count that satisfies the target P99. |

solve

solvers.ServingCapacityModel.solve(
    model,
    hardware,
    qps,
    target_p99_latency_ms,
    seq_len=2048,
    output_tokens=128,
    max_batch_size=32,
    precision='fp16',
    efficiency=0.5,
    max_replicas=1024,
    service_time_cv=1.0,
)

Return the minimum replica count that satisfies the target P99.
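The search implied by "minimum replica count" can be sketched as a simple upward scan bounded by `max_replicas`. This is an illustration under stated assumptions, not the actual body of `solve`. The helper `min_replicas` and its `p99_for` callback are hypothetical names, and returning `None` when no replica count up to the cap meets the target is an assumed convention, since the reference above does not say how an infeasible target is reported.

```python
from typing import Callable, Optional


def min_replicas(
    p99_for: Callable[[int], float],
    target_p99_latency_ms: float,
    max_replicas: int = 1024,
) -> Optional[int]:
    """Return the smallest replica count whose estimated P99 meets the target.

    p99_for is a hypothetical callback mapping a replica count to an
    estimated P99 latency in milliseconds.
    """
    for replicas in range(1, max_replicas + 1):
        if p99_for(replicas) <= target_p99_latency_ms:
            return replicas
    return None  # assumed convention: no feasible count within the cap


if __name__ == "__main__":
    # Toy latency curve: a fixed 1000 ms floor plus a term that shrinks with replicas.
    toy_p99 = lambda replicas: 1000.0 + 4000.0 / replicas
    print(min_replicas(toy_p99, target_p99_latency_ms=1500.0))  # prints 8
    print(min_replicas(toy_p99, target_p99_latency_ms=900.0))   # prints None
```

A linear scan is enough here because the cap is small. If the P99 estimate is known to be non-increasing in the replica count, a binary search over the same range would give the same answer with fewer evaluations.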
