solvers.ContinuousBatchingModel
solvers.ContinuousBatchingModel()

Analyzes production LLM serving with Continuous Batching and PagedAttention.
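Traditional static batching pads every request to the longest sequence in the batch and reserves contiguous KV cache for the maximum sequence length, so it wastes memory through both padding and fragmentation. This model simulates the throughput improvements achieved by iteration-level scheduling (ORCA) and non-contiguous KV cache allocation (PagedAttention).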
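To make the scheduling difference concrete, here is a minimal toy sketch of iteration-level scheduling. It is not the code behind `ContinuousBatchingModel`. The function names are hypothetical, and the sketch assumes that every decode iteration costs one step regardless of batch occupancy and that each request is described only by its number of output tokens.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request frees its slot after each iteration."""
    waiting = deque(output_lens)  # remaining output tokens per queued request
    running = []                  # remaining output tokens per active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next iteration (iteration-level scheduling).
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths in tokens, assumed positive.
lens = [2, 8, 2, 8]
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)
```

For this workload, static batching needs 16 steps and the continuous schedule needs 12, because the slots vacated by the two short requests are reused immediately instead of idling.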
Literature Source:

1. Kwon et al. (2023), “Efficient Memory Management for Large Language Model Serving with PagedAttention.”
2. Yu et al. (2022), “ORCA: A Distributed Serving System for Transformer-Based Generative Models.”
Methods
| Name | Description |
|---|---|
| solve | Calculates continuous batching throughput and PagedAttention memory. |
solve
solvers.ContinuousBatchingModel.solve(
model,
hardware,
seq_len,
max_batch_size=1,
page_size=16,
precision='fp16',
efficiency=0.5,
)

Calculates continuous batching throughput and PagedAttention memory.
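The page does not show how `solve` turns `seq_len`, `page_size`, and `precision` into a memory figure, so the following is only a sketch of the standard paged-allocation arithmetic, not the library's implementation. The helper names and the model dimensions (`num_layers`, `num_kv_heads`, `head_dim`) are assumptions chosen for illustration.

```python
import math

# Assumed bytes per element for each precision label (hypothetical mapping).
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, precision="fp16"):
    """KV cache bytes for one token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[precision]


def paged_kv_usage(seq_len, page_size=16):
    """Pages needed for one sequence and the token slots left unused in the last page."""
    pages = math.ceil(seq_len / page_size)
    wasted_tokens = pages * page_size - seq_len  # internal fragmentation, < page_size
    return pages, wasted_tokens


# Hypothetical model dimensions, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
pages, wasted = paged_kv_usage(seq_len=100, page_size=16)
paged_bytes = pages * 16 * per_token
print(per_token, pages, wasted, paged_bytes)
```

With these assumed dimensions each token needs 524,288 bytes (512 KiB) of KV cache, and a 100-token sequence occupies 7 pages with 12 unused token slots. Paged allocation bounds the waste per sequence by `page_size - 1` tokens, whereas a contiguous reservation for the maximum length wastes every slot beyond the actual sequence length.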