core.solver.ContinuousBatchingModel

core.solver.ContinuousBatchingModel()

Analyzes production LLM serving with Continuous Batching and PagedAttention.

Traditional static batching pads every sequence in a batch to a common length and reserves contiguous KV-cache memory per request, wasting capacity on padding and internal fragmentation. This solver models the throughput improvements achieved by iteration-level scheduling (continuous batching) and non-contiguous, paged KV cache allocation (PagedAttention).
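The memory argument can be illustrated with a minimal sketch (a hypothetical helper, not part of this API): when the KV cache is allocated in fixed-size pages, per-request waste is bounded by one partially filled page, rather than by the gap between the request's length and the padded batch length.

```python
from math import ceil

def kv_pages(seq_len: int, page_size: int = 16) -> tuple[int, int]:
    """Return (pages allocated, token slots wasted in the last partial page)."""
    pages = ceil(seq_len / page_size)
    waste = pages * page_size - seq_len
    return pages, waste

# A 1000-token context with 16-token pages allocates 63 pages and wastes
# only 8 token slots -- versus padding the request to a 2048-token static
# batch slot, which would waste 1048 slots.
```

The wasted slots never exceed `page_size - 1`, which is why small page sizes keep fragmentation low at the cost of more page-table entries.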

Literature Sources:

1. Kwon et al. (2023), “Efficient Memory Management for Large Language Model Serving with PagedAttention.”
2. Yu et al. (2022), “ORCA: A Distributed Serving System for Transformer-Based Generative Models.”

Methods

| Name | Description |
|------|-------------|
| solve | Solves for continuous batching throughput and PagedAttention memory. |

solve

core.solver.ContinuousBatchingModel.solve(
    model,
    hardware,
    seq_len,
    max_batch_size=1,
    page_size=16,
    precision='fp16',
    efficiency=0.5,
)

Solves for continuous batching throughput and PagedAttention memory.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The target hardware for inference. | required |
| seq_len | int | The total context window (prompt + generated tokens). | required |
| max_batch_size | int | Maximum concurrent requests in the batch. | 1 |
| page_size | int | Tokens per KV cache page (PagedAttention granularity). | 16 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |
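A key quantity the solver derives is how many requests can be active at once, which is bounded by the KV-cache memory each request consumes. The sketch below uses the standard KV-cache size formula; the function name, the concrete shapes, and the free-memory figure are illustrative assumptions, and the real solver reads such values from `TransformerWorkload` and `HardwareNode`.

```python
def max_active_requests(
    hbm_free_gb: float,       # HBM remaining after weights (assumed known)
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # fp16
) -> int:
    """Estimate how many concurrent requests fit in the KV-cache budget."""
    # K and V each store num_kv_heads * head_dim values per layer per token,
    # hence the factor of 2.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    per_request_bytes = kv_bytes_per_token * seq_len
    return int(hbm_free_gb * 1e9 // per_request_bytes)

# A Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128, fp16) at
# seq_len 2048 needs ~1.07 GB of KV cache per request, so 40 GB of free
# HBM admits roughly 37 concurrent requests.
```

Lower precision (`int8`, `int4`) shrinks `bytes_per_elem` and raises the ceiling proportionally, which is one reason the `precision` parameter affects the returned maximum active requests.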

Returns

| Name | Type | Description |
|------|------|-------------|
|  | ContinuousBatchingResult | Throughput (tokens/s), maximum active requests, memory fragmentation, time to first token (TTFT), inter-token latency (ITL), and speedup vs. static batching. |