solvers.TailLatencyModel

solvers.TailLatencyModel()

Analyzes queueing delays and P99 tail latency for deployed inference models.

Models the inference servers as an M/M/c queue (Poisson arrivals, exponentially distributed service times, c replicas) to determine whether the deployment can sustain the target arrival rate while meeting strict SLA latency bounds.
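The M/M/c model can be illustrated with the standard Erlang C formula. The following is a minimal sketch of textbook queueing theory, not the library's implementation. The function names `erlang_c` and `mmc_mean_wait_ms` are hypothetical and chosen only for this example.

```python
import math


def erlang_c(num_servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system.

    offered_load is a = arrival_rate / per_server_service_rate (in Erlangs).
    """
    if offered_load >= num_servers:
        return 1.0  # utilization >= 1: the queue is unstable, every request waits
    # Sum of a^k / k! for k = 0 .. c-1
    partial = sum(offered_load**k / math.factorial(k) for k in range(num_servers))
    # Term for "all c servers busy", scaled by c / (c - a)
    tail = (
        offered_load**num_servers
        / math.factorial(num_servers)
        * num_servers
        / (num_servers - offered_load)
    )
    return tail / (partial + tail)


def mmc_mean_wait_ms(
    arrival_rate_qps: float, service_latency_ms: float, num_replicas: int = 1
) -> float:
    """Mean time spent queueing (excluding service) in milliseconds."""
    service_rate = 1000.0 / service_latency_ms  # requests/s handled by one replica
    offered_load = arrival_rate_qps / service_rate
    if offered_load >= num_replicas:
        return math.inf  # the deployment cannot sustain this arrival rate
    p_wait = erlang_c(num_replicas, offered_load)
    # Wq = C(c, a) / (c * mu - lambda), converted from seconds to milliseconds
    return 1000.0 * p_wait / (num_replicas * service_rate - arrival_rate_qps)
```

For instance, a single replica with a 10 ms mean service time receiving 50 queries per second runs at 50% utilization, and this sketch gives a mean queueing delay of 10 ms. At 200 queries per second the same replica is overloaded and the function returns infinity.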

Literature Source: 1. Dean & Barroso (2013), “The Tail at Scale.”

Methods

Name Description
solve Solves for the P50 (median) and P99 (tail) latencies under variable load.

solve

solvers.TailLatencyModel.solve(
    arrival_rate_qps,
    service_latency_ms,
    num_replicas=1,
    service_time_cv=1.0,
)

Solves for the P50 (median) and P99 (tail) latencies under variable load.

Parameters

Name Type Description Default
arrival_rate_qps float Request arrival rate in queries per second. required
service_latency_ms float Mean service time per request in milliseconds. required
num_replicas int Number of server replicas (c in M/M/c). 1
service_time_cv float Coefficient of variation (standard deviation divided by mean) of the service time; 1.0 corresponds to exponentially distributed service times. When the CV is not 1, queue wait times are multiplied by Kingman’s M/G/1 correction factor (cv^2 + 1) / 2, which approximates M/G/c behavior. 1.0
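The effect of `service_time_cv` can be seen by evaluating the correction factor on its own. This is a minimal sketch of the formula (cv^2 + 1) / 2 quoted in the parameter description. The helper names `service_time_correction` and `corrected_wait_ms` are hypothetical, and how the library applies the factor internally is an assumption here.

```python
def service_time_correction(service_time_cv: float) -> float:
    """Scaling factor (cv^2 + 1) / 2 applied to the exponential-service wait time."""
    return (service_time_cv**2 + 1.0) / 2.0


def corrected_wait_ms(mmc_wait_ms: float, service_time_cv: float = 1.0) -> float:
    """Scale an M/M/c queue wait (in ms) to approximate general service times."""
    return mmc_wait_ms * service_time_correction(service_time_cv)
```

With `service_time_cv=1.0` the factor is 1 and the M/M/c wait is unchanged. Deterministic service times (`service_time_cv=0.0`) halve the wait, while a highly variable service time with `service_time_cv=2.0` multiplies it by 2.5.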