solvers.ContinuousBatchingModel

solvers.ContinuousBatchingModel()

Analyzes production LLM serving with Continuous Batching and PagedAttention.

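Traditional static batching pads requests to a common length and reserves contiguous KV cache space for each one, which wastes memory through fragmentation and padding. This model simulates the throughput improvements achieved by iteration-level scheduling (continuous batching) and non-contiguous, page-based KV cache allocation (PagedAttention).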
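To make the scheduling difference concrete, the following is a minimal toy sketch, not the solver's implementation. It counts decode iterations for the same workload under static batching and under iteration-level scheduling. The function names and the assumption that each request is fully described by its number of output tokens are hypothetical simplifications.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lens), batch_size):
        batch = output_lens[start:start + batch_size]
        # Short requests sit idle (padding) until the longest one finishes.
        steps += max(batch)
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode iterations when free slots are refilled at every iteration."""
    waiting = deque(output_lens)  # remaining tokens per queued request
    running = []                  # remaining tokens per active request
    steps = 0
    while waiting or running:
        # Iteration-level scheduling: admit new requests as soon as slots free up.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Assumed workload: one long request and three short ones, two slots.
lens = [8, 2, 2, 2]
print(static_batching_steps(lens, batch_size=2))      # 10
print(continuous_batching_steps(lens, batch_size=2))  # 8
```

In this toy workload the same 14 tokens are produced in 8 iterations instead of 10, because the slot freed by a short request is reused immediately rather than waiting for the long request to finish.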

Literature Source:
1. Kwon et al. (2023), “Efficient Memory Management for Large Language Model Serving with PagedAttention.”
2. Yu et al. (2022), “ORCA: A Distributed Serving System for Transformer-Based Generative Models.”

Methods

Name     Description
solve    Calculates continuous batching throughput and PagedAttention memory.

solve

solvers.ContinuousBatchingModel.solve(
    model,
    hardware,
    seq_len,
    max_batch_size=1,
    page_size=16,
    precision='fp16',
    efficiency=0.5,
)

Calculates continuous batching throughput and PagedAttention memory.
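As a rough illustration of the memory side, the sketch below estimates KV cache size under page-based allocation. It is a back-of-the-envelope calculation under stated assumptions, not the formula `solve` actually uses: the helper names, the layer/head arguments, and the byte widths per precision are all hypothetical, chosen only to mirror the `seq_len`, `max_batch_size`, `page_size`, and `precision` parameters above.

```python
import math

# Assumed storage width per element for each precision label.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, precision="fp16"):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[precision]


def paged_kv_bytes(seq_len, batch_size, page_size, bytes_per_token):
    """KV memory when each sequence is allocated in fixed-size pages."""
    # Only the last page of a sequence can be partially filled.
    pages_per_seq = math.ceil(seq_len / page_size)
    return batch_size * pages_per_seq * page_size * bytes_per_token


def contiguous_kv_bytes(max_seq_len, batch_size, bytes_per_token):
    """KV memory when every sequence reserves space for the maximum length."""
    return batch_size * max_seq_len * bytes_per_token


# Assumed model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_bytes_per_token(32, 32, 128, precision="fp16")
paged = paged_kv_bytes(seq_len=100, batch_size=4, page_size=16,
                       bytes_per_token=per_token)
reserved = contiguous_kv_bytes(max_seq_len=2048, batch_size=4,
                               bytes_per_token=per_token)
print(per_token)  # 524288
print(paged)      # 234881024
print(reserved)   # 4294967296
```

With these assumed numbers, a 100-token sequence occupies 7 pages (112 token slots), so the per-sequence waste is bounded by one page rather than by the gap between the actual and the maximum sequence length.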
