core.solver.ServingCapacityModel

core.solver.ServingCapacityModel()

Sizes an LLM serving deployment from a QPS target and P99 latency budget.

The model composes ServingModel, ContinuousBatchingModel, and TailLatencyModel. It is a first-order capacity planner, not a request-level scheduler.

Methods

| Name | Description |
| --- | --- |
| solve | Return the minimum replica count that satisfies the target P99. |

solve

core.solver.ServingCapacityModel.solve(
    model,
    hardware,
    qps,
    target_p99_latency_ms,
    seq_len=2048,
    output_tokens=128,
    max_batch_size=32,
    precision='fp16',
    efficiency=0.5,
    max_replicas=1024,
    service_time_cv=1.0,
)

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | TransformerWorkload | LLM workload to serve. | required |
| hardware | HardwareNode | Per-replica accelerator. | required |
| qps | float | Target request arrival rate, in queries per second. | required |
| target_p99_latency_ms | float | P99 request latency budget, in milliseconds. | required |
| seq_len | int | Prompt/context length, in tokens. | 2048 |
| output_tokens | int | Mean generated tokens per request. | 128 |
| max_batch_size | int | Maximum active batch per replica. | 32 |
| precision | str | Serving precision. | 'fp16' |
| efficiency | float | Compute efficiency. | 0.5 |
| max_replicas | int | Search limit for replica count. | 1024 |
| service_time_cv | float | Service-time coefficient of variation for queueing. | 1.0 |

Returns

ServingCapacityResult with feasibility, required replicas, QPS capacity, utilization, estimated P99 latency, queue wait, and TTFT/ITL details.
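
To make the idea of a minimum-replica search concrete, the following is a minimal, self-contained sketch of how such a search could work. It is not the library's implementation. The helper names `estimate_p99_ms` and `min_replicas`, the inputs `ttft_ms` and `itl_ms`, the M/G/1 (Pollaczek-Khinchine) wait formula, and the exponential-tail P99 approximation are all assumptions chosen for illustration. Only the parameter names `qps`, `target_p99_latency_ms`, `output_tokens`, `max_batch_size`, `max_replicas`, and `service_time_cv` mirror the signature above.

```python
import math
from typing import Optional


def estimate_p99_ms(
    qps: float,
    replicas: int,
    ttft_ms: float,              # hypothetical input: time to first token
    itl_ms: float,               # hypothetical input: inter-token latency
    output_tokens: int = 128,
    max_batch_size: int = 32,
    service_time_cv: float = 1.0,
) -> float:
    """Rough P99 latency in ms for a given replica count (illustrative only)."""
    # Service time of one request with no queueing.
    service_ms = ttft_ms + output_tokens * itl_ms
    # Assumed per-replica capacity when the batch is kept full.
    capacity_qps = max_batch_size * 1000.0 / service_ms
    rho = (qps / replicas) / capacity_qps  # utilization per replica
    if rho >= 1.0:
        return math.inf  # overloaded: the queue grows without bound
    # Pollaczek-Khinchine mean wait, treating the replica as a single
    # server whose mean service time is 1 / capacity_qps (an assumption).
    mean_wait_ms = (
        rho / (1.0 - rho)
        * (1.0 + service_time_cv ** 2) / 2.0
        * (1000.0 / capacity_qps)
    )
    # Crude tail estimate: assume the wait is exponentially distributed.
    p99_wait_ms = mean_wait_ms * math.log(100.0)
    return service_ms + p99_wait_ms


def min_replicas(
    qps: float,
    target_p99_latency_ms: float,
    ttft_ms: float,
    itl_ms: float,
    output_tokens: int = 128,
    max_batch_size: int = 32,
    max_replicas: int = 1024,
    service_time_cv: float = 1.0,
) -> Optional[int]:
    """Smallest replica count meeting the P99 budget, or None if infeasible."""
    for n in range(1, max_replicas + 1):
        p99 = estimate_p99_ms(
            qps, n, ttft_ms, itl_ms,
            output_tokens, max_batch_size, service_time_cv,
        )
        if p99 <= target_p99_latency_ms:
            return n
    return None  # no count within the search limit satisfies the budget


if __name__ == "__main__":
    n = min_replicas(qps=100.0, target_p99_latency_ms=3000.0,
                     ttft_ms=200.0, itl_ms=20.0)
    print("replicas needed:", n)
```

In this sketch, a budget smaller than the no-queueing service time can never be met, so the search returns `None` once it reaches `max_replicas`. Raising `service_time_cv` inflates the queue wait and therefore the replica count, which is the role the queueing parameter plays in a first-order planner of this kind.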
