core.solver.ContinuousBatchingModel
core.solver.ContinuousBatchingModel()

Analyzes production LLM serving with continuous batching and PagedAttention.
Traditional static batching wastes memory on padding and reserves contiguous KV cache for each request's full context window, causing severe fragmentation. This solver models the throughput improvements achieved by iteration-level scheduling and non-contiguous, page-based KV cache allocation.
Literature Sources:

1. Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention."
2. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models."
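The memory argument can be made concrete with a short, self-contained sketch. All model numbers below are illustrative assumptions for a 7B-class model, not values produced by this solver: static allocation reserves KV cache for the full context window, while paged allocation wastes at most one partial page per request.

```python
import math

# Illustrative 7B-class model shape (assumed, not from this solver).
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2  # fp16

# KV cache bytes per token: 2 (K and V) * layers * kv heads * head dim * dtype size
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def paged_kv_bytes(num_tokens: int, page_size: int = 16) -> int:
    """KV memory when allocating fixed-size pages (only the last page may be partial)."""
    pages = math.ceil(num_tokens / page_size)
    return pages * page_size * kv_bytes_per_token

def static_kv_bytes(num_tokens: int, max_seq_len: int = 4096) -> int:
    """KV memory when each request reserves the full context window up front."""
    return max_seq_len * kv_bytes_per_token

tokens = 1000
# Paged allocation uses 1008 token slots (63 pages of 16) instead of 4096:
print(paged_kv_bytes(tokens) / static_kv_bytes(tokens))  # → 0.24609375
```

Here internal fragmentation under paging is at most one page per request (8 of 1008 slots, under 1%), while static reservation leaves 3096 of 4096 slots (about 76%) unused.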
Methods
| Name | Description |
|---|---|
| solve | Solves for continuous batching throughput and PagedAttention memory. |
solve
core.solver.ContinuousBatchingModel.solve(
model,
hardware,
seq_len,
max_batch_size=1,
page_size=16,
precision='fp16',
efficiency=0.5,
)

Solves for continuous batching throughput and PagedAttention memory.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The target hardware for inference. | required |
| seq_len | int | The total context window (prompt + generated tokens). | required |
| max_batch_size | int | Maximum concurrent requests in the batch. | 1 |
| page_size | int | Tokens per KV cache page (PagedAttention granularity). | 16 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |
Returns
| Name | Type | Description |
|---|---|---|
| | ContinuousBatchingResult | Throughput (tokens/s), max active requests, memory fragmentation, TTFT, ITL, and speedup vs. static batching. |
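As a rough illustration of how the "max active requests" and speedup fields relate, the sketch below counts how many requests fit in a fixed KV-cache budget under static versus paged allocation. All hardware and model numbers are assumptions (an 80 GB device serving a 7B-class fp16 model), not values this solver emits.

```python
import math

# Assumed device and model budget (illustrative only).
hbm_bytes = 80 * 1024**3          # accelerator memory
weights_bytes = 14 * 1024**3      # ~7B params in fp16
kv_budget = hbm_bytes - weights_bytes

kv_bytes_per_token = 2 * 32 * 32 * 128 * 2  # K+V * layers * kv heads * head dim * fp16
max_seq_len = 4096                # full context window
avg_tokens = 1024                 # typical live request length
page_size = 16

# Static batching reserves the whole window per request.
static_per_req = max_seq_len * kv_bytes_per_token
# PagedAttention allocates only the pages a request actually uses.
paged_per_req = math.ceil(avg_tokens / page_size) * page_size * kv_bytes_per_token

static_requests = kv_budget // static_per_req
paged_requests = kv_budget // paged_per_req
print(static_requests, paged_requests)  # → 33 132
```

Under these assumptions about 4x more requests fit concurrently, which is consistent with the 2-4x throughput improvements reported by Kwon et al. (2023); the actual figure from `solve` also depends on `efficiency`, `precision`, and the compute roofline of the hardware.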