core.solver.ServingCapacityModel
core.solver.ServingCapacityModel()
Sizes an LLM serving deployment from a QPS target and a P99 latency budget.
The model composes ServingModel, ContinuousBatchingModel, and TailLatencyModel. It is a first-order capacity planner, not a request-level scheduler.
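
To illustrate what "first-order" can mean here, the sketch below estimates per-replica throughput from Little's law: a replica with `max_batch_size` concurrent slots, each held for one request's service time, sustains roughly `max_batch_size / service_time` requests per second. The function name and the service-time decomposition (TTFT plus `output_tokens` times ITL) are assumptions made for illustration, not the library's documented internals.

```python
def replica_qps_capacity(ttft_s: float, itl_s: float,
                         output_tokens: int, max_batch_size: int) -> float:
    """Rough per-replica QPS ceiling (hypothetical helper, not part of core.solver).

    Assumes each request occupies one batch slot for
    ttft_s + output_tokens * itl_s seconds and that all slots stay busy.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    service_time_s = ttft_s + output_tokens * itl_s
    if service_time_s <= 0:
        raise ValueError("service time must be positive")
    # Little's law: throughput = concurrency / time each request spends in service.
    return max_batch_size / service_time_s


# Example inputs are made up: 200 ms TTFT, 20 ms ITL, 128 output tokens, 32 slots.
capacity = replica_qps_capacity(0.2, 0.02, 128, 32)
```

A request-level scheduler would also track per-request arrival and completion; this estimate deliberately ignores that.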
Methods
| Name | Description |
|---|---|
| solve | Return the minimum replica count that satisfies the target P99. |
solve
core.solver.ServingCapacityModel.solve(
model,
hardware,
qps,
target_p99_latency_ms,
seq_len=2048,
output_tokens=128,
max_batch_size=32,
precision='fp16',
efficiency=0.5,
max_replicas=1024,
service_time_cv=1.0,
)
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | LLM workload to serve. | required |
| hardware | HardwareNode | Per-replica accelerator. | required |
| qps | float | Target request arrival rate, in queries per second. | required |
| target_p99_latency_ms | float | P99 request latency budget, in milliseconds. | required |
| seq_len | int | Prompt/context length, in tokens. | 2048 |
| output_tokens | int | Mean generated tokens per request. | 128 |
| max_batch_size | int | Maximum active batch per replica. | 32 |
| precision | str | Serving precision. | 'fp16' |
| efficiency | float | Compute efficiency. | 0.5 |
| max_replicas | int | Search limit for replica count. | 1024 |
| service_time_cv | float | Service-time coefficient of variation for queueing. | 1.0 |
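
One common way a service-time coefficient of variation enters a queueing estimate is Kingman's approximation for a single-server queue. The sketch below shows that formula only; whether `solve` uses this exact form is an assumption, and the function and its `arrival_cv` argument are hypothetical names.

```python
import math


def kingman_queue_wait_s(utilization: float, mean_service_s: float,
                         service_time_cv: float = 1.0,
                         arrival_cv: float = 1.0) -> float:
    """Mean queue wait from Kingman's G/G/1 approximation (illustrative only).

    wait ~= rho / (1 - rho) * (ca^2 + cs^2) / 2 * mean service time
    """
    if utilization < 0:
        raise ValueError("utilization must be non-negative")
    if utilization >= 1.0:
        return math.inf  # an overloaded queue has no finite steady-state wait
    variability = (arrival_cv ** 2 + service_time_cv ** 2) / 2.0
    return utilization / (1.0 - utilization) * variability * mean_service_s


# With Poisson arrivals (arrival_cv=1) and service_time_cv=1 this reduces to
# the M/M/1 mean wait; deterministic service (cv=0) halves it.
wait_exponential = kingman_queue_wait_s(0.5, 2.0, service_time_cv=1.0)
wait_deterministic = kingman_queue_wait_s(0.5, 2.0, service_time_cv=0.0)
```

Under this approximation, a larger `service_time_cv` raises the estimated queue wait at the same utilization, so it can raise the replica count needed to meet a given P99 budget.
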
Returns
ServingCapacityResult with feasibility, required replicas, QPS capacity, utilization, estimated P99 latency, queue wait, and TTFT/ITL details.
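
The search for a minimum replica count can be sketched as a linear scan that stops at the first count whose estimated P99 fits the budget. Everything in this block is an assumption for illustration: `CapacitySketch` is a stand-in for a few of the fields named above (not the real `ServingCapacityResult`), the queue wait uses a Kingman-style term with Poisson arrivals, and the P99 is approximated by adding `ln(100)` times the mean wait to a fixed base latency, which assumes an exponential waiting-time tail.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacitySketch:
    """Hypothetical result container; not the library's ServingCapacityResult."""
    feasible: bool
    replicas: int
    utilization: float
    p99_latency_ms: float


def min_replicas_sketch(qps: float, target_p99_latency_ms: float,
                        per_replica_qps: float, base_latency_ms: float,
                        service_time_cv: float = 1.0,
                        max_replicas: int = 1024) -> CapacitySketch:
    """Smallest replica count whose estimated P99 meets the budget."""
    rho, p99_ms = math.inf, math.inf
    mean_slot_ms = 1000.0 / per_replica_qps  # effective time per request per replica
    for n in range(1, max_replicas + 1):
        rho = qps / (n * per_replica_qps)
        if rho >= 1.0:
            continue  # overloaded: the queue would grow without bound
        # Kingman-style mean wait, assuming Poisson arrivals (arrival CV = 1).
        mean_wait_ms = rho / (1.0 - rho) * (1.0 + service_time_cv ** 2) / 2.0 * mean_slot_ms
        # Crude tail estimate: exponential wait tail, so P99 wait = ln(100) * mean.
        p99_ms = base_latency_ms + math.log(100.0) * mean_wait_ms
        if p99_ms <= target_p99_latency_ms:
            return CapacitySketch(True, n, rho, p99_ms)
    return CapacitySketch(False, max_replicas, rho, p99_ms)


# Made-up inputs: 100 QPS, 10 QPS per replica, 2 s base latency, 3 s P99 budget.
result = min_replicas_sketch(100.0, 3000.0, per_replica_qps=10.0, base_latency_ms=2000.0)
impossible = min_replicas_sketch(100.0, 1000.0, per_replica_qps=10.0,
                                 base_latency_ms=2000.0, max_replicas=50)
```

When the budget is below the base latency, no replica count can satisfy it, so the scan runs to `max_replicas` and reports the deployment as infeasible.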