core.solver.ServingCapacityModel

core.solver.ServingCapacityModel()

Sizes an LLM serving deployment from a target QPS and a P99 latency budget.

The model composes ServingModel, ContinuousBatchingModel, and TailLatencyModel. It is a first-order capacity planner, not a request-level scheduler.

Methods

Name	Description
solve	Return the minimum replica count that satisfies the target P99.

solve

core.solver.ServingCapacityModel.solve(
    model,
    hardware,
    qps,
    target_p99_latency_ms,
    seq_len=2048,
    output_tokens=128,
    max_batch_size=32,
    precision='fp16',
    efficiency=0.5,
    max_replicas=1024,
    service_time_cv=1.0,
)

Parameters

Name	Type	Description	Default
model	TransformerWorkload	LLM workload to serve.	required
hardware	HardwareNode	Per-replica accelerator.	required
qps	float	Target request arrival rate, in queries per second.	required
target_p99_latency_ms	float	P99 request latency budget, in milliseconds.	required
seq_len	int	Prompt/context length, in tokens.	`2048`
output_tokens	int	Mean generated tokens per request.	`128`
max_batch_size	int	Maximum active batch per replica.	`32`
precision	str	Serving precision.	`'fp16'`
efficiency	float	Compute efficiency as a fraction of peak (0 to 1).	`0.5`
max_replicas	int	Search limit for replica count.	`1024`
service_time_cv	float	Service-time coefficient of variation (standard deviation divided by mean) used for queueing; `1.0` matches exponentially distributed service times.	`1.0`
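
To give a feel for how service_time_cv can influence a queueing estimate, the sketch below applies Kingman's approximation for a single-server queue. This is an illustration only: the page does not state which queueing formula TailLatencyModel uses, and the function name `kingman_wait_ms` and its `arrival_cv` argument are hypothetical.

```python
def kingman_wait_ms(utilization: float, mean_service_ms: float,
                    service_time_cv: float, arrival_cv: float = 1.0) -> float:
    """Approximate mean queue wait (ms) for a single-server queue.

    Kingman's approximation: Wq ~= rho / (1 - rho) * (ca^2 + cs^2) / 2 * tau.
    Assumption: this is NOT necessarily what the library computes.
    """
    if not 0.0 <= utilization < 1.0:
        # At or above saturation the queue grows without bound.
        return float("inf")
    variability = (arrival_cv ** 2 + service_time_cv ** 2) / 2.0
    return utilization / (1.0 - utilization) * variability * mean_service_ms


# Same load and mean service time, different service-time variability.
for cv in (0.0, 0.5, 1.0, 2.0):
    wait = kingman_wait_ms(utilization=0.5, mean_service_ms=100.0,
                           service_time_cv=cv)
    print(f"cv={cv:.1f} -> mean wait ~ {wait:.1f} ms")
```

Under this approximation, the wait grows with the square of the coefficient of variation, so a bursty workload (cv above 1) needs more headroom than a uniform one at the same utilization.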

Returns

ServingCapacityResult with feasibility, required replicas, QPS capacity, utilization, estimated P99 latency, queue wait, and time-to-first-token (TTFT) and inter-token latency (ITL) details.
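
The idea of searching for the smallest replica count that meets a P99 budget can be sketched as a simple loop. The code below is a toy model under stated assumptions, not the library's implementation: `CapacitySketch`, `min_replicas`, `base_latency_ms`, and `per_replica_qps` are hypothetical names, traffic is assumed to split evenly across replicas, the mean wait uses Kingman's approximation with Poisson arrivals, and the P99 wait is approximated as ln(100) times the mean wait.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacitySketch:
    """Hypothetical stand-in for a result object; not ServingCapacityResult."""
    feasible: bool
    replicas: int
    utilization: float
    p99_latency_ms: float


def min_replicas(qps: float, target_p99_latency_ms: float,
                 base_latency_ms: float, per_replica_qps: float,
                 service_time_cv: float = 1.0,
                 max_replicas: int = 1024) -> CapacitySketch:
    """Return the smallest replica count whose estimated P99 fits the budget."""
    slot_ms = 1000.0 / per_replica_qps  # effective service time per request
    variability = (1.0 + service_time_cv ** 2) / 2.0  # Poisson arrivals assumed
    for n in range(1, max_replicas + 1):
        rho = qps / (n * per_replica_qps)  # per-replica utilization
        if rho >= 1.0:
            continue  # overloaded: the queue would grow without bound
        mean_wait_ms = rho / (1.0 - rho) * variability * slot_ms
        # Crude tail estimate: exponential-like wait, so P99 ~= ln(100) * mean.
        p99_ms = base_latency_ms + math.log(100.0) * mean_wait_ms
        if p99_ms <= target_p99_latency_ms:
            return CapacitySketch(True, n, rho, p99_ms)
    # No replica count within the search limit met the budget.
    return CapacitySketch(False, max_replicas,
                          qps / (max_replicas * per_replica_qps), float("inf"))


result = min_replicas(qps=100.0, target_p99_latency_ms=1000.0,
                      base_latency_ms=800.0, per_replica_qps=20.0)
print(result)
```

With these toy inputs the search settles on 11 replicas. A budget below `base_latency_ms` can never be met by adding replicas, which is why a result of this kind needs a feasibility flag alongside the replica count.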