At 128K context, the cache alone fills an 80 GB GPU — room for exactly one user.
Discover that KV-cache memory — not model weights, not compute — determines how many users you can serve concurrently. Sweep batch size and context length to find the real OOM boundary.
The Question
You deploy Llama-3 8B on an H100. The model weights take 16 GB. You have 64 GB left. Surely you can serve dozens of users concurrently?
Not if they have long contexts. Every active user requires a KV-cache that grows linearly with sequence length. At 128K context, a single user’s cache can consume the entire remaining memory. This tutorial shows you exactly where the real memory wall lives and how to push it back.
Calculate the KV-cache size for any model, sequence length, and batch size
Identify the OOM boundary where KV-cache exhausts GPU memory
Explain why context length — not model size — is the binding memory constraint in serving
Compare static batching vs. paged attention for maximizing concurrent users
Tip: Background: What Is the KV-Cache?
During LLM decoding, every attention layer stores Key and Value matrices for all tokens generated so far. If you have studied data structures, this is memoization applied to the attention mechanism: store computed results instead of recomputing them. The names come from a database-style lookup: the Query is what you search for, the Key is what you match against, and the Value is what you retrieve. Without this cache, the model would need to recompute attention over the entire context at every step — quadratic cost. The KV-cache trades memory for compute:
Factor               Effect on KV-Cache
─────────────────────────────────────────────────────────
More layers          Linear growth (one K + one V per layer)
Longer context       Linear growth (one entry per token)
More users (batch)   Linear growth (independent cache per user)
Lower precision      Proportional reduction (INT8 = half of FP16)
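The memory-for-compute trade can be made concrete with a toy count (a hypothetical helper, not a performance model): without the cache, every decode step must recompute K/V projections for the entire prefix, so total projection work is quadratic in sequence length; with the cache it is linear.

```python
def kv_projections(n_tokens, cached):
    """Count K/V projection computations needed to decode n_tokens autoregressively."""
    if cached:
        # With the cache: compute K and V once per new token, reuse the rest
        return n_tokens
    # Without the cache: every step recomputes K/V for the whole prefix so far
    return sum(t for t in range(1, n_tokens + 1))

print(kv_projections(1024, cached=True))    # 1024 -> linear in sequence length
print(kv_projections(1024, cached=False))   # 524800 = 1024*1025/2 -> quadratic
```

The cache turns n(n+1)/2 projection computations into n, at the price of storing every K and V it has ever computed.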
The formula: KV-cache bytes = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element, where the leading 2 counts the K and the V matrix. At short contexts this is negligible. At long contexts it dominates everything.
Note on GQA (Grouped Query Attention): Modern architectures like Llama-3 use GQA, where kv_heads < num_heads. Llama-3 8B has 32 attention heads but only 8 KV-heads, reducing KV-cache by 4× compared to standard multi-head attention. Using num_heads instead of kv_heads in the formula is a common source of 4× overestimates.
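To make the formula concrete, here is a minimal sketch in plain Python (the helper `kv_cache_bytes` is ours, not part of any library), plugged with Llama-3 8B's GQA configuration. It also reproduces the 4× overestimate you get from using `num_heads` instead of `kv_heads`:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Leading 2 counts the K and the V matrix stored at every layer
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-3 8B with GQA: 32 layers, 8 KV-heads, head_dim 128, FP16 (2 bytes)
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1, batch=1)
print(per_token)              # 131072 bytes = 128 KB per token per user

# Plugging in num_heads=32 instead of kv_heads=8: the classic 4x overestimate
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1, batch=1)
print(mha // per_token)       # 4
```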
── Memory Breakdown ────────────────────────
Model weights: 16.06 GB
KV cache 1 user: 1.20 GB
Total memory: 17.26 GB
Memory utilization: 20.1%
KV as pct of total: 6.9%
At 2K context with one user, the KV-cache is tiny — a rounding error compared to the model weights. This is why many engineers assume memory pressure comes from model size. They are about to be surprised.
3. Batch Size Sweep: The Concurrency Wall
Now let’s add users. Each concurrent user needs their own KV-cache. Watch memory utilization climb:
Batch KV-Cache Total Util Feasible
───────────────────────────────────────────
1 1.20 GB 17.26 GB 20.1% OK
4 4.80 GB 20.86 GB 24.3% OK
8 9.59 GB 25.65 GB 29.9% OK
16 19.18 GB 35.24 GB 41.0% OK
32 38.36 GB 54.42 GB 63.4% OK
64 76.72 GB 92.78 GB 108.0% OOM
128 153.4 GB 169.5 GB 197.3% OOM
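The sweep logic behind this table is only a few lines. A minimal sketch that takes the simulator's per-user figures from the breakdown above as given (1.20 GB of KV-cache per user at 2K context, 16.06 GB of weights, and an 80 GiB H100, about 85.9 decimal GB):

```python
KV_PER_USER_GB = 1.20        # per-user KV-cache at 2K context, from the breakdown above
WEIGHTS_GB = 16.06           # Llama-3 8B weights in FP16
HBM_GB = 80 * 2**30 / 1e9    # 80 GiB of H100 HBM expressed in decimal GB (~85.9)

for batch in [1, 4, 8, 16, 32, 64, 128]:
    kv = KV_PER_USER_GB * batch
    total = WEIGHTS_GB + kv
    status = "OK" if total <= HBM_GB else "OOM"
    print(f"{batch:>5}  {kv:8.2f} GB  {total:8.2f} GB  {total / HBM_GB:6.1%}  {status}")
```

The OOM boundary lands between 32 and 64 users: KV-cache, not weights, is what runs out first.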
At 2K context, you can fit many users. The KV-cache per user is small enough that batch size scales comfortably. But this picture changes dramatically when we extend the context.
4. Context Length Sweep: The Real Memory Wall
Fix batch size at 8 users and sweep context length from 512 tokens to 128K. This is where the hidden tax reveals itself:
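A sketch of this sweep, anchored to the same per-user figure (1.20 GB at 2K context, i.e. roughly 0.6 MB of KV-cache per token) with batch size fixed at 8:

```python
KV_PER_TOKEN_GB = 1.20 / 2048   # ~0.6 MB per token, from the 2K breakdown above
WEIGHTS_GB = 16.06              # Llama-3 8B weights in FP16
HBM_GB = 80 * 2**30 / 1e9       # ~85.9 decimal GB on an 80 GiB H100
BATCH = 8

for seq_len in [512, 2048, 8192, 32768, 131072]:
    kv = KV_PER_TOKEN_GB * seq_len * BATCH
    total = WEIGHTS_GB + kv
    status = "OK" if total <= HBM_GB else "OOM"
    print(f"{seq_len:>7}  {kv:8.2f} GB  {total:8.2f} GB  {status}")
```

By 32K context, eight users already blow past the memory budget that comfortably held them at 2K.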
KV-cache grows linearly with sequence length and batch size. It is the hidden memory consumer that determines your maximum concurrent users — not model size, not compute, but cache state. At 2K context, the cache is negligible. At 128K context, a single user’s cache can exceed the model weights. The same 80 GB GPU that serves 64 users at short context can serve exactly one user at long context. The “context length” on the model card is not a feature — it is a memory bill.
Now let’s see what happens when we try to serve even a single user at 128K:
# Single user at 128K context — the extreme case
r_long = solver.solve(
    model=model,
    hardware=hardware,
    seq_len=131072,
    batch_size=1,
    precision="fp16",
)
info(
    "Single User @ 128K Context",
    Context="131,072 tokens (128K)",
    Model_weights=r_long.model_weights_size,
    KV_cache=r_long.kv_cache_size,
    Total=r_long.total_memory_required,
    Feasible=str(r_long.feasible),
    KV_as_pct_of_total=f"{r_long.kv_cache_size / r_long.total_memory_required * 100:.0f}%",
)
── Single User @ 128K Context ──────────────
Context: 131,072 tokens (128K)
Model weights: 16.06 GB
KV cache: 76.72 GB
Total: 92.78 GB
Feasible: False
KV as pct of total: 83%
5. Paged Attention: Pushing Back the Wall
So the KV-cache fills memory fast, and at long contexts you hit OOM with just a handful of users. Is the only option to buy more memory? No — the allocation strategy itself is wasting space. Most sequences do not actually use the maximum context length, yet static batching reserves memory for the worst case.
Static batching allocates contiguous memory for the maximum sequence length, wasting space on incomplete sequences. PagedAttention (from vLLM) allocates KV-cache in small, fixed-size pages — exactly like how an operating system uses virtual memory paging to avoid physical memory fragmentation. Just as the OS maps virtual pages to physical frames on demand, PagedAttention maps KV-cache blocks to GPU memory on demand, eliminating fragmentation and fitting more concurrent requests:
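The waste gap can be illustrated with a toy allocator comparison (hypothetical request lengths; a page size of 16 tokens, which is vLLM's default block size):

```python
import math

MAX_SEQ_LEN = 32768
PAGE_SIZE = 16    # tokens per KV-cache page (vLLM's default block size)

# Hypothetical in-flight request lengths from a mixed workload
actual_lens = [1200, 300, 7800, 22000, 450, 15000]

# Static batching: every request reserves the worst-case context up front
static_tokens = len(actual_lens) * MAX_SEQ_LEN

# Paged: each request holds only ceil(len / page) pages
paged_tokens = sum(math.ceil(n / PAGE_SIZE) * PAGE_SIZE for n in actual_lens)

used = sum(actual_lens)
print(f"static waste: {1 - used / static_tokens:.0%}")  # most of the reservation sits idle
print(f"paged waste:  {1 - used / paged_tokens:.0%}")   # only intra-page slack remains
```

Under static allocation the waste is whatever fraction of the worst-case reservation goes unused; under paging it shrinks to at most one partially filled page per request.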
Paged attention reduces fragmentation from ~50% to single digits, allowing more concurrent requests from the same memory budget. This is why vLLM and TensorRT-LLM default to paged KV-cache management in production.
Your Turn
Caution: Exercises
Exercise 1: Predict before you compute. Llama-3 70B has 80 layers (vs. 32 for the 8B model) and 8 KV-heads with 128 head_dim. Before running any code, predict: at seq_len=4096 and FP16, what batch size will cause OOM on an 80 GB H100? Write your prediction, then sweep batch sizes with mlsysim.Models.Llama3_70B to find the actual limit. How close were you?
Exercise 2: Maximum users at 128K context. Using the H200 (141 GB HBM3e), calculate the maximum number of concurrent users you can serve with Llama-3 8B at 128K context in FP16. Then try INT8. How many additional users does quantization buy you?
Exercise 3: Paged vs. static at long context. Run the ContinuousBatchingModel for Llama-3 8B at seq_len=32768 with max_batch_size=16. Compare page_size=16 vs. page_size=256. Which gives better throughput? Why does page size matter more at long context?
Self-check: If a model has 32 layers, 8 KV-heads, 128 head_dim, and uses FP16 (2 bytes), how many bytes does the KV-cache consume per token per user? (Answer: 2 × 32 × 8 × 128 × 2 = 131,072 bytes = 128 KB per token.)
Key Takeaways
Tip: Summary
KV-cache size scales linearly with layers, KV-heads, sequence length, and batch size
At short context, cache is negligible — model weights dominate and you can serve many users
At long context, cache dominates — a single 128K user’s cache can exceed model weights
The OOM boundary depends on context length × batch size, not just model size
Paged attention reduces fragmentation, fitting more concurrent requests in the same memory
Next Steps
Quantization: Not a Free Lunch — Learn when reducing precision shrinks the KV-cache effectively vs. when it doesn’t help
Two Phases, One Request — Revisit the prefill/decode split now that you understand the cache pressure
Where to Invest — Use sensitivity analysis to quantify whether more memory or more bandwidth helps more
Silicon Zoo — Compare HBM capacity across H100, H200, MI300X, and see which GPUs tolerate long context