solvers.TailLatencyModel
solvers.TailLatencyModel()
Analyzes queueing delays and P99 tail latency for deployed inference models.
Models inference servers as M/M/c queues to determine if the deployment can sustain the target arrival rate while meeting strict SLA latency bounds.
Literature Source: Dean & Barroso (2013), “The Tail at Scale.”
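The M/M/c model described above can be sketched with the standard Erlang C formula. The snippet below is a minimal, self-contained illustration and not the library's implementation: the function names `erlang_c`, `mean_queue_wait_ms`, and `queue_wait_percentile_ms` are hypothetical, and the results cover queue wait only (service time is excluded). It assumes Poisson arrivals and exponential service times, and it treats a utilization of 1.0 or more as a deployment that cannot sustain the arrival rate.

```python
import math


def erlang_c(arrival_rate_qps: float, service_latency_ms: float,
             num_replicas: int = 1) -> float:
    """Probability that an arriving request has to queue in an M/M/c system."""
    service_rate = 1000.0 / service_latency_ms      # mu: requests/s per replica
    offered_load = arrival_rate_qps / service_rate  # a = lambda / mu (Erlangs)
    utilization = offered_load / num_replicas       # rho = a / c
    if utilization >= 1.0:
        # The queue grows without bound; no steady-state latency exists.
        raise ValueError("unstable queue: arrival rate meets or exceeds capacity")
    # Sum of a^k / k! for k = 0 .. c-1 (states where a replica is still free).
    below_c = sum(offered_load**k / math.factorial(k) for k in range(num_replicas))
    # Weight of the states where all c replicas are busy.
    at_c = offered_load**num_replicas / (
        math.factorial(num_replicas) * (1.0 - utilization)
    )
    return at_c / (below_c + at_c)


def mean_queue_wait_ms(arrival_rate_qps: float, service_latency_ms: float,
                       num_replicas: int = 1) -> float:
    """Mean time spent waiting in the queue: C / (c * mu - lambda)."""
    service_rate = 1000.0 / service_latency_ms
    drain_rate = num_replicas * service_rate - arrival_rate_qps  # c*mu - lambda
    p_wait = erlang_c(arrival_rate_qps, service_latency_ms, num_replicas)
    return 1000.0 * p_wait / drain_rate


def queue_wait_percentile_ms(q: float, arrival_rate_qps: float,
                             service_latency_ms: float,
                             num_replicas: int = 1) -> float:
    """q-quantile of queue wait, from P(W > t) = C * exp(-(c*mu - lambda) * t)."""
    p_wait = erlang_c(arrival_rate_qps, service_latency_ms, num_replicas)
    if p_wait <= 1.0 - q:
        return 0.0  # At least a fraction q of requests do not queue at all.
    service_rate = 1000.0 / service_latency_ms
    drain_rate = num_replicas * service_rate - arrival_rate_qps
    return 1000.0 * math.log(p_wait / (1.0 - q)) / drain_rate


if __name__ == "__main__":
    # One replica, 10 ms mean service time, 50 QPS (utilization 0.5).
    print(mean_queue_wait_ms(50.0, 10.0, 1))
    print(queue_wait_percentile_ms(0.99, 50.0, 10.0, 1))
```

The gap between the mean and the 99th percentile of queue wait is the tail effect this model is meant to expose. Adding replicas lowers both figures by reducing the probability that a request queues at all.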
Methods
| Name | Description |
|---|---|
| solve | Solves for P50 and P99 tail latencies under variable load. |
solve
solvers.TailLatencyModel.solve(
arrival_rate_qps,
service_latency_ms,
num_replicas=1,
service_time_cv=1.0,
)
Solves for P50 and P99 tail latencies under variable load.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| arrival_rate_qps | float | Request arrival rate in queries per second. | required |
| service_latency_ms | float | Mean service time per request in milliseconds. | required |
| num_replicas | int | Number of server replicas (c in M/M/c). | 1 |
| service_time_cv | float | Coefficient of variation (CV) of service time; 1.0 corresponds to exponential service times. When the CV is not 1, queue wait times are scaled by Kingman’s M/G/1 correction factor (cv^2 + 1) / 2, which approximates M/G/c behavior. | 1.0 |
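
The effect of `service_time_cv` can be illustrated with a short sketch. It applies the factor (cv^2 + 1) / 2 from the table to an M/M/c queue wait that is assumed to have been computed already. The helper names `cv_correction_factor` and `corrected_queue_wait_ms` are hypothetical and are not part of the library's API; how the library applies the factor internally is not shown in this reference.

```python
def cv_correction_factor(service_time_cv: float = 1.0) -> float:
    """Scaling factor (cv^2 + 1) / 2 for queue wait under non-exponential service."""
    if service_time_cv < 0.0:
        raise ValueError("coefficient of variation must be non-negative")
    return (service_time_cv**2 + 1.0) / 2.0


def corrected_queue_wait_ms(mmc_wait_ms: float,
                            service_time_cv: float = 1.0) -> float:
    """Scale an M/M/c queue wait (ms) to approximate M/G/c behavior."""
    return mmc_wait_ms * cv_correction_factor(service_time_cv)


if __name__ == "__main__":
    base_wait_ms = 10.0  # assumed M/M/c queue wait, for illustration only
    for cv in (0.0, 0.5, 1.0, 2.0):
        print(cv, corrected_queue_wait_ms(base_wait_ms, cv))
```

With a CV of 1.0 the factor is 1 and the M/M/c result is unchanged. A CV of 0 (deterministic service time) halves the queue wait, while a CV above 1 (bursty or heavy-tailed service times) inflates it.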