solvers.ServingCapacityModel
solvers.ServingCapacityModel()

Sizes an LLM serving deployment from a QPS and tail-latency target.
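The model deliberately composes existing first-order pieces: ServingModel for time to first token (TTFT) and inter-token latency (ITL), ContinuousBatchingModel for per-replica token capacity, and TailLatencyModel for queueing pressure. It is a capacity planner, not a request-level scheduler: it estimates how many replicas a workload needs and does not simulate how individual requests are admitted or batched.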
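To make the composition concrete, here is a minimal back-of-the-envelope sketch of how latency figures and a batch size might turn into a per-replica request rate and a fleet utilization. It is an illustration only, not the library's implementation: the function names `per_replica_request_rate` and `utilization` are hypothetical, and it assumes every request holds one batch slot for its full TTFT plus decode time.

```python
def per_replica_request_rate(ttft_ms, itl_ms, output_tokens, max_batch_size):
    """Rough requests/second one replica sustains with a full batch.

    Assumption (not from the library): each request occupies one batch
    slot for TTFT plus output_tokens decode steps of ITL each.
    """
    request_time_s = (ttft_ms + output_tokens * itl_ms) / 1000.0
    return max_batch_size / request_time_s


def utilization(qps, replicas, rate_per_replica):
    """Offered load divided by total fleet capacity (1.0 means saturated)."""
    return qps / (replicas * rate_per_replica)


# Illustrative numbers only: 200 ms TTFT, 25 ms ITL, 128 output tokens, batch of 32.
rate = per_replica_request_rate(ttft_ms=200, itl_ms=25, output_tokens=128, max_batch_size=32)
rho = utilization(qps=50, replicas=8, rate_per_replica=rate)
print(f"{rate:.2f} req/s per replica, utilization {rho:.2f}")
```

Utilization is the quantity a tail-latency model reacts to most strongly: as it approaches 1.0, queueing delay grows without bound, which is why a planner cannot size a fleet from mean throughput alone.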
Methods
| Name | Description |
|---|---|
| solve | Return the minimum replica count that satisfies the target P99. |
solve
solvers.ServingCapacityModel.solve(
model,
hardware,
qps,
target_p99_latency_ms,
seq_len=2048,
output_tokens=128,
max_batch_size=32,
precision='fp16',
efficiency=0.5,
max_replicas=1024,
service_time_cv=1.0,
)

Return the minimum replica count that satisfies the target P99.
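The search that `solve` describes can be sketched as a linear scan over replica counts. The code below is a hypothetical stand-in, not the library's algorithm: `estimate_p99_ms` and `min_replicas` are made-up names, the queueing delay uses a Kingman-style approximation with Poisson arrivals, and the P99 is approximated by assuming an exponential wait tail. Returning `None` when no count up to `max_replicas` works is also an assumption; the source does not say how the real method reports an infeasible target.

```python
import math


def estimate_p99_ms(qps, replicas, base_latency_ms, rate_per_replica,
                    service_time_cv=1.0):
    """Heuristic P99 latency in ms for a given replica count.

    Assumptions (not from the library): load is split evenly, each replica
    behaves like a single-server queue with mean service time
    1 / rate_per_replica, and the wait tail is roughly exponential.
    """
    rho = qps / (replicas * rate_per_replica)
    if rho >= 1.0:
        return math.inf  # saturated: the queue grows without bound
    service_ms = 1000.0 / rate_per_replica
    # Kingman-style mean wait with arrival CV of 1 (Poisson arrivals).
    mean_wait_ms = (rho / (1.0 - rho)) * ((1.0 + service_time_cv ** 2) / 2.0) * service_ms
    # Exponential-tail heuristic: the 99th percentile is ln(100) times the mean.
    return base_latency_ms + math.log(100.0) * mean_wait_ms


def min_replicas(qps, target_p99_latency_ms, base_latency_ms, rate_per_replica,
                 max_replicas=1024, service_time_cv=1.0):
    """Smallest replica count whose estimated P99 meets the target, else None."""
    for n in range(1, max_replicas + 1):
        p99 = estimate_p99_ms(qps, n, base_latency_ms, rate_per_replica,
                              service_time_cv)
        if p99 <= target_p99_latency_ms:
            return n
    return None


# Illustrative numbers only: 50 QPS, 3 s unloaded latency, 10 req/s per replica.
print(min_replicas(qps=50, target_p99_latency_ms=4000,
                   base_latency_ms=3000, rate_per_replica=10))
```

A linear scan is enough here because the estimated P99 falls monotonically as replicas are added, so the first count that passes is the minimum. A larger `service_time_cv` inflates the wait term and pushes the answer up, which matches the intuition that burstier service times need more headroom.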