core.solver.TailLatencyModel

core.solver.TailLatencyModel()

Analyzes queueing delays and P99 tail latency for deployed inference models.

Models inference servers as M/M/c queues to determine if the deployment can sustain the target arrival rate while meeting strict SLA latency bounds.
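The M/M/c model above can be illustrated with a short, self-contained sketch. It is not the library's implementation, which this page does not show. It assumes the textbook M/M/c setting (Poisson arrivals, exponential service times, identical replicas sharing one FIFO queue) and uses the Erlang C formula for the probability that a request has to wait. The function names `erlang_c`, `response_time_tail`, and `latency_percentile_ms` are hypothetical and chosen only for illustration.

```python
import math


def erlang_c(num_servers: int, offered_load: float) -> float:
    """Probability that an arriving request must queue (Erlang C formula)."""
    rho = offered_load / num_servers
    if rho >= 1.0:
        return 1.0  # unstable queue: every request eventually waits
    term = 1.0          # running value of a^k / k!
    partial_sum = 0.0   # sum of a^k / k! for k = 0 .. c-1
    for k in range(num_servers):
        partial_sum += term
        term *= offered_load / (k + 1)
    # After the loop, term == a^c / c!
    return term / ((1.0 - rho) * partial_sum + term)


def response_time_tail(t_s: float, arrival_rate: float,
                       service_rate: float, num_servers: int) -> float:
    """P(response time > t_s) for a stable M/M/c queue (rates per second)."""
    p_wait = erlang_c(num_servers, arrival_rate / service_rate)
    theta = num_servers * service_rate - arrival_rate  # queue drain rate
    no_wait = math.exp(-service_rate * t_s)
    if math.isclose(theta, service_rate):
        # Sum of two exponentials with equal rates (Erlang-2 tail).
        waited = (1.0 + service_rate * t_s) * no_wait
    else:
        # Sum of two exponentials with distinct rates theta and service_rate.
        waited = (theta * no_wait
                  - service_rate * math.exp(-theta * t_s)) / (theta - service_rate)
    return (1.0 - p_wait) * no_wait + p_wait * waited


def latency_percentile_ms(p: float, arrival_rate_qps: float,
                          service_latency_ms: float,
                          num_replicas: int = 1) -> float:
    """p-th percentile of response time in ms; math.inf if the queue is unstable."""
    service_rate = 1000.0 / service_latency_ms  # requests/s per replica
    if arrival_rate_qps >= num_replicas * service_rate:
        return math.inf
    target = 1.0 - p
    lo, hi = 0.0, service_latency_ms / 1000.0
    # Grow the bracket until the tail probability drops below the target.
    while response_time_tail(hi, arrival_rate_qps, service_rate, num_replicas) > target:
        hi *= 2.0
    # Bisect on the monotonically decreasing tail function.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if response_time_tail(mid, arrival_rate_qps, service_rate, num_replicas) > target:
            lo = mid
        else:
            hi = mid
    return hi * 1000.0
```

For a single replica this reduces to the M/M/1 case, where response time is exponential with rate (service rate minus arrival rate), which gives a quick sanity check on the sketch.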

Literature Source: 1. Dean & Barroso (2013), “The Tail at Scale.”

Methods

Name   Description
solve  Solves for P50 and P99 tail latencies under variable load.

solve

core.solver.TailLatencyModel.solve(
    arrival_rate_qps,
    service_latency_ms,
    num_replicas=1,
)

Solves for P50 and P99 tail latencies under variable load.

Parameters

Name                Type   Description                                           Default
arrival_rate_qps    float  Request arrival rate in queries per second.           required
service_latency_ms  float  Average service latency per request in milliseconds.  required
num_replicas        int    Number of inference replicas (servers).               1
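As a rough illustration of how these three parameters relate in a standard M/M/c reading, the sketch below converts them to per-replica service rate, utilization, and the smallest replica count that keeps the queue stable. The helper names `utilization` and `min_replicas_for_stability` are hypothetical and are not part of `core.solver`. The sketch assumes that `service_latency_ms` is the mean service time of one replica.

```python
import math


def utilization(arrival_rate_qps: float, service_latency_ms: float,
                num_replicas: int = 1) -> float:
    """Fraction of total service capacity in use (rho = lambda / (c * mu))."""
    service_rate_per_replica = 1000.0 / service_latency_ms  # requests/s
    return arrival_rate_qps / (num_replicas * service_rate_per_replica)


def min_replicas_for_stability(arrival_rate_qps: float,
                               service_latency_ms: float) -> int:
    """Smallest replica count with utilization strictly below 1."""
    offered_load = arrival_rate_qps * service_latency_ms / 1000.0  # in Erlangs
    return math.floor(offered_load) + 1
```

For example, 200 QPS at 10 ms per request is an offered load of 2 Erlangs, so two replicas would run at 100% utilization and the queue would grow without bound; at least three are needed under these assumptions.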

Returns

Name  Type               Description
      TailLatencyResult  P50 latency, P99 latency, queue utilization, stability flag, and SLO violation probability.