At 128K context, the cache alone fills an 80 GB GPU — room for exactly one user.
Discover that KV-cache memory — not model weights, not compute — determines how many users you can serve concurrently. Sweep batch size and context length to find the real OOM boundary.
The Question
You deploy Llama-3 8B on an H100. The model weights take 16 GB. You have 64 GB left. Surely you can serve dozens of users concurrently?
Not if they have long contexts. Every active user requires a KV-cache that grows linearly with sequence length. At 128K context, a single user’s cache can consume the entire remaining memory. This tutorial shows you exactly where the real memory wall lives and how to push it back.
Calculate the KV-cache size for any model, sequence length, and batch size
Identify the OOM boundary where KV-cache exhausts GPU memory
Explain why context length — not model size — is the binding memory constraint in serving
Compare static batching vs. paged attention for maximizing concurrent users
Tip: Background: What Is the KV-Cache?
During LLM decoding, every attention layer stores Key and Value matrices for all tokens generated so far. If you have studied data structures, this is memoization applied to the attention mechanism: store computed results instead of recomputing them. The names come from a database-style lookup: the Query is what you search for, the Key is what you match against, and the Value is what you retrieve. Without this cache, the model would need to recompute attention over the entire context at every step — quadratic cost. The KV-cache trades memory for compute:
Factor               Effect on KV-Cache
─────────────────────────────────────────────────────────
More layers          Linear growth (one K + one V per layer)
Longer context       Linear growth (one entry per token)
More users (batch)   Linear growth (independent cache per user)
Lower precision      Proportional reduction (INT8 = half of FP16)
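The memory-for-compute trade can be made concrete with a toy count (a hypothetical helper, not a performance model): without the cache, every decode step must recompute K/V projections for the entire prefix, so total projection work is quadratic in sequence length; with the cache it is linear.

```python
def kv_projections(n_tokens, cached):
    """Count K/V projection computations needed to decode n_tokens autoregressively."""
    if cached:
        # With the cache: compute K and V once per new token, reuse the rest
        return n_tokens
    # Without the cache: every step recomputes K/V for the whole prefix so far
    return sum(t for t in range(1, n_tokens + 1))

print(kv_projections(1024, cached=True))    # 1024 -> linear in sequence length
print(kv_projections(1024, cached=False))   # 524800 = 1024*1025/2 -> quadratic
```

The cache turns n(n+1)/2 projection computations into n, at the price of storing every K and V it has ever computed.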
The formula: KV-cache bytes = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element, where the leading 2 counts the K and the V matrix. At short contexts this is negligible. At long contexts it dominates everything.
Note on GQA (Grouped Query Attention): Modern architectures like Llama-3 use GQA, where kv_heads < num_heads. Llama-3 8B has 32 attention heads but only 8 KV-heads, reducing KV-cache by 4× compared to standard multi-head attention. Using num_heads instead of kv_heads in the formula is a common source of 4× overestimates.
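To make the formula concrete, here is a minimal sketch in plain Python (the helper `kv_cache_bytes` is ours, not part of any library), plugged with Llama-3 8B's GQA configuration. It also reproduces the 4× overestimate you get from using `num_heads` instead of `kv_heads`:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Leading 2 counts the K and the V matrix stored at every layer
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-3 8B with GQA: 32 layers, 8 KV-heads, head_dim 128, FP16 (2 bytes)
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1, batch=1)
print(per_token)              # 131072 bytes = 128 KB per token per user

# Plugging in num_heads=32 instead of kv_heads=8: the classic 4x overestimate
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1, batch=1)
print(mha // per_token)       # 4
```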
── Memory Breakdown ────────────────────────
Model weights: 16.06 GB
KV cache 1 user: 1.20 GB
Total memory: 17.26 GB
Memory utilization: 20.1%
KV as pct of total: 6.9%
At 2K context with one user, the KV-cache is tiny — a rounding error compared to the model weights. This is why many engineers assume memory pressure comes from model size. They are about to be surprised.
3. Batch Size Sweep: The Concurrency Wall
Now let’s add users. Each concurrent user needs their own KV-cache. Watch memory utilization climb:
Batch KV-Cache Total Util Feasible
───────────────────────────────────────────
1 1.20 GB 17.26 GB 20.1% OK
4 4.80 GB 20.86 GB 24.3% OK
8 9.59 GB 25.65 GB 29.9% OK
16 19.18 GB 35.24 GB 41.0% OK
32 38.36 GB 54.42 GB 63.4% OK
64 76.72 GB 92.78 GB 108.0% OOM
128 153.4 GB 169.5 GB 197.3% OOM
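The sweep logic behind this table is only a few lines. A minimal sketch that takes the simulator's per-user figures from the breakdown above as given (1.20 GB of KV-cache per user at 2K context, 16.06 GB of weights, and an 80 GiB H100, about 85.9 decimal GB):

```python
KV_PER_USER_GB = 1.20        # per-user KV-cache at 2K context, from the breakdown above
WEIGHTS_GB = 16.06           # Llama-3 8B weights in FP16
HBM_GB = 80 * 2**30 / 1e9    # 80 GiB of H100 HBM expressed in decimal GB (~85.9)

for batch in [1, 4, 8, 16, 32, 64, 128]:
    kv = KV_PER_USER_GB * batch
    total = WEIGHTS_GB + kv
    status = "OK" if total <= HBM_GB else "OOM"
    print(f"{batch:>5}  {kv:8.2f} GB  {total:8.2f} GB  {total / HBM_GB:6.1%}  {status}")
```

The OOM boundary lands between 32 and 64 users: KV-cache, not weights, is what runs out first.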
At 2K context, you can fit many users. The KV-cache per user is small enough that batch size scales comfortably. But this picture changes dramatically when we extend the context.
4. Context Length Sweep: The Real Memory Wall
Fix batch size at 8 users and sweep context length from 512 tokens to 128K. This is where the hidden tax reveals itself:
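A sketch of this sweep, anchored to the same per-user figure (1.20 GB at 2K context, i.e. roughly 0.6 MB of KV-cache per token) with batch size fixed at 8:

```python
KV_PER_TOKEN_GB = 1.20 / 2048   # ~0.6 MB per token, from the 2K breakdown above
WEIGHTS_GB = 16.06              # Llama-3 8B weights in FP16
HBM_GB = 80 * 2**30 / 1e9       # ~85.9 decimal GB on an 80 GiB H100
BATCH = 8

for seq_len in [512, 2048, 8192, 32768, 131072]:
    kv = KV_PER_TOKEN_GB * seq_len * BATCH
    total = WEIGHTS_GB + kv
    status = "OK" if total <= HBM_GB else "OOM"
    print(f"{seq_len:>7}  {kv:8.2f} GB  {total:8.2f} GB  {status}")
```

By 32K context, eight users already blow past the memory budget that comfortably held them at 2K.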
KV-cache grows linearly with sequence length and batch size. It is the hidden memory consumer that determines your maximum concurrent users — not model size, not compute, but cache state. At 2K context, the cache is negligible. At 128K context, a single user’s cache can exceed the model weights. The same 80 GB GPU that serves 64 users at short context can serve exactly one user at long context. The “context length” on the model card is not a feature — it is a memory bill.
Now let’s see what happens when we try to serve even a single user at 128K:
# Single user at 128K context — the extreme case
r_long = solver.solve(
    model=model,
    hardware=hardware,
    seq_len=131072,
    batch_size=1,
    precision="fp16",
)
info(
    "Single User @ 128K Context",
    Context="131,072 tokens (128K)",
    Model_weights=r_long.model_weights_size,
    KV_cache=r_long.kv_cache_size,
    Total=r_long.total_memory_required,
    Feasible=str(r_long.feasible),
    KV_as_pct_of_total=f"{r_long.kv_cache_size / r_long.total_memory_required * 100:.0f}%",
)
── Single User @ 128K Context ──────────────
Context: 131,072 tokens (128K)
Model weights: 16.06 GB
KV cache: 76.72 GB
Total: 92.78 GB
Feasible: False
KV as pct of total: 83%
5. Paged Attention: Pushing Back the Wall
So the KV-cache fills memory fast, and at long contexts you hit OOM with just a handful of users. Is the only option to buy more memory? No — the allocation strategy itself is wasting space. Most sequences do not actually use the maximum context length, yet static batching reserves memory for the worst case.
Static batching allocates contiguous memory for the maximum sequence length, wasting space on incomplete sequences. PagedAttention (from vLLM) allocates KV-cache in small, fixed-size pages — exactly like how an operating system uses virtual memory paging to avoid physical memory fragmentation. Just as the OS maps virtual pages to physical frames on demand, PagedAttention maps KV-cache blocks to GPU memory on demand, eliminating fragmentation and fitting more concurrent requests:
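The waste gap can be illustrated with a toy allocator comparison (hypothetical request lengths; a page size of 16 tokens, which is vLLM's default block size):

```python
import math

MAX_SEQ_LEN = 32768
PAGE_SIZE = 16    # tokens per KV-cache page (vLLM's default block size)

# Hypothetical in-flight request lengths from a mixed workload
actual_lens = [1200, 300, 7800, 22000, 450, 15000]

# Static batching: every request reserves the worst-case context up front
static_tokens = len(actual_lens) * MAX_SEQ_LEN

# Paged: each request holds only ceil(len / page) pages
paged_tokens = sum(math.ceil(n / PAGE_SIZE) * PAGE_SIZE for n in actual_lens)

used = sum(actual_lens)
print(f"static waste: {1 - used / static_tokens:.0%}")  # most of the reservation sits idle
print(f"paged waste:  {1 - used / paged_tokens:.0%}")   # only intra-page slack remains
```

Under static allocation the waste is whatever fraction of the worst-case reservation goes unused; under paging it shrinks to at most one partially filled page per request.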
Paged attention reduces fragmentation from ~50% to single digits, allowing more concurrent requests from the same memory budget. This is why vLLM and TensorRT-LLM default to paged KV-cache management in production.
Your Turn
Caution: Exercises
Exercise 1: Predict before you compute. Llama-3 70B has 80 layers (vs. 32 for the 8B model) and 8 KV-heads with 128 head_dim. Before running any code, predict: at seq_len=4096 and FP16, what batch size will cause OOM on an 80 GB H100? Write your prediction, then sweep batch sizes with mlsysim.Models.Llama3_70B to find the actual limit. How close were you?
Exercise 2: Maximum users at 128K context. Using the H200 (141 GB HBM3e), calculate the maximum number of concurrent users you can serve with Llama-3 8B at 128K context in FP16. Then try INT8. How many additional users does quantization buy you?
Exercise 3: Paged vs. static at long context. Run the ContinuousBatchingModel for Llama-3 8B at seq_len=32768 with max_batch_size=16. Compare page_size=16 vs. page_size=256. Which gives better throughput? Why does page size matter more at long context?
Self-check: If a model has 32 layers, 8 KV-heads, 128 head_dim, and uses FP16 (2 bytes), how many bytes does the KV-cache consume per token per user? (Answer: 2 × 32 × 8 × 128 × 2 = 131,072 bytes = 128 KB per token.)
Key Takeaways
Tip: Summary
KV-cache size scales linearly with layers, KV-heads, sequence length, and batch size
At short context, cache is negligible — model weights dominate and you can serve many users
At long context, cache dominates — a single 128K user’s cache can exceed model weights
The OOM boundary depends on context length × batch size, not just model size
Paged attention reduces fragmentation, fitting more concurrent requests in the same memory
Next Steps
Quantization: Not a Free Lunch — Learn when reducing precision shrinks the KV-cache effectively vs. when it doesn’t help
Two Phases, One Request — Revisit the prefill/decode split now that you understand the cache pressure
Where to Invest — Use sensitivity analysis to quantify whether more memory or more bandwidth helps more
Silicon Zoo — Compare HBM capacity across H100, H200, MI300X, and see which GPUs tolerate long context