core.solver.ContinuousBatchingModel
core.solver.ContinuousBatchingModel()

Analyzes production LLM serving with continuous batching and PagedAttention.
Traditional static batching wastes memory on padding and reserves contiguous KV cache for each request's full context window, causing severe fragmentation. This solver models the throughput improvements achieved by iteration-level scheduling and non-contiguous, page-based KV cache allocation.
Literature Sources:

1. Kwon et al. (2023), "Efficient Memory Management for Large Language Model Serving with PagedAttention."
2. Yu et al. (2022), "Orca: A Distributed Serving System for Transformer-Based Generative Models."
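The memory argument can be made concrete with a short, self-contained sketch. All model numbers below are illustrative assumptions for a 7B-class model, not values produced by this solver: static allocation reserves KV cache for the full context window, while paged allocation wastes at most one partial page per request.

```python
import math

# Illustrative 7B-class model shape (assumed, not from this solver).
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2  # fp16

# KV cache bytes per token: 2 (K and V) * layers * kv heads * head dim * dtype size
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def paged_kv_bytes(num_tokens: int, page_size: int = 16) -> int:
    """KV memory when allocating fixed-size pages (only the last page may be partial)."""
    pages = math.ceil(num_tokens / page_size)
    return pages * page_size * kv_bytes_per_token

def static_kv_bytes(num_tokens: int, max_seq_len: int = 4096) -> int:
    """KV memory when each request reserves the full context window up front."""
    return max_seq_len * kv_bytes_per_token

tokens = 1000
# Paged allocation uses 1008 token slots (63 pages of 16) instead of 4096:
print(paged_kv_bytes(tokens) / static_kv_bytes(tokens))  # → 0.24609375
```

Here internal fragmentation under paging is at most one page per request (8 of 1008 slots, under 1%), while static reservation leaves 3096 of 4096 slots (about 76%) unused.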
Methods
| Name | Description |
|---|---|
| solve | Solves for continuous batching throughput and PagedAttention memory. |
solve
core.solver.ContinuousBatchingModel.solve(
model,
hardware,
seq_len,
max_batch_size=1,
page_size=16,
precision='fp16',
efficiency=0.5,
)

Solves for continuous batching throughput and PagedAttention memory.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The target hardware for inference. | required |
| seq_len | int | The total context window (prompt + generated tokens). | required |
| max_batch_size | int | Maximum concurrent requests in the batch. | 1 |
| page_size | int | Tokens per KV cache page (PagedAttention granularity). | 16 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |
Returns
| Name | Type | Description |
|---|---|---|
| | ContinuousBatchingResult | Throughput (tokens/s), max active requests, memory fragmentation, TTFT, ITL, and speedup vs. static batching. |
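As a rough illustration of how the "max active requests" and speedup fields relate, the sketch below counts how many requests fit in a fixed KV-cache budget under static versus paged allocation. All hardware and model numbers are assumptions (an 80 GB device serving a 7B-class fp16 model), not values this solver emits.

```python
import math

# Assumed device and model budget (illustrative only).
hbm_bytes = 80 * 1024**3          # accelerator memory
weights_bytes = 14 * 1024**3      # ~7B params in fp16
kv_budget = hbm_bytes - weights_bytes

kv_bytes_per_token = 2 * 32 * 32 * 128 * 2  # K+V * layers * kv heads * head dim * fp16
max_seq_len = 4096                # full context window
avg_tokens = 1024                 # typical live request length
page_size = 16

# Static batching reserves the whole window per request.
static_per_req = max_seq_len * kv_bytes_per_token
# PagedAttention allocates only the pages a request actually uses.
paged_per_req = math.ceil(avg_tokens / page_size) * page_size * kv_bytes_per_token

static_requests = kv_budget // static_per_req
paged_requests = kv_budget // paged_per_req
print(static_requests, paged_requests)  # → 33 132
```

Under these assumptions about 4x more requests fit concurrently, which is consistent with the 2-4x throughput improvements reported by Kwon et al. (2023); the actual figure from `solve` also depends on `efficiency`, `precision`, and the compute roofline of the hardware.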