Quantization: Not a Free Lunch

INT4 gives 4x speedup for decode. For training, it gives 0x.

algorithm

intermediate

Discover that quantization only helps when you are memory-bound. Compare the effect of INT4 on LLM decode (dramatic) vs. training (negligible). The regime determines whether the optimization works.

The Question

INT4 quantization reduces model size by 4x. Intuitively, 4x fewer bytes should mean 4x faster. Blogs claim massive speedups. Does INT4 always give 4x speedup?

You already have the tools to predict the answer. Before reading further, think about what Tutorial 1 taught you: which ceiling determines performance? If quantization reduces bytes but not FLOPs, when would it help — and when would it do nothing?

This tutorial makes the prediction, runs the experiment, and reveals that the answer depends entirely on the regime.

Prerequisites

Complete Tutorial 1: The Memory Wall and Tutorial 2: Two Phases, One Request. You should understand memory-bound vs. compute-bound regimes and why LLM decode is memory-bound.

What You Will Learn

Predict whether quantization will help a given workload based on its regime
Measure the speedup from INT8 and INT4 for both memory-bound and compute-bound workloads
Explain why the same optimization yields 4x in one case and 0x in another
Evaluate the accuracy-compression trade-off using CompressionModel

Background: What Quantization Actually Does

Quantization reduces the number of bytes per parameter:

Precision	Bytes/Param	Relative to FP16
FP16	2	1x (baseline)
INT8	1	0.5x
INT4	0.5	0.25x

For memory-bound workloads, performance scales with bytes loaded from HBM per step. Halving bytes halves latency. For compute-bound workloads, performance scales with FLOP/s. Fewer bytes does not change the number of FLOPs, so latency stays the same.

The key question is always: which ceiling am I hitting?

1. Setup

import mlsysim
from mlsysim.solvers import ServingModel, SingleNodeModel
from mlsysim.solvers import CompressionModel

2. Memory-Bound Case: LLM Decode at Batch 1

LLM decoding at batch 1 is the textbook memory-bound workload. Each token generation must reload the entire model from HBM. Fewer bytes per parameter means fewer bytes to load means lower inter-token latency:

from mlsysim.solvers import ServingModel
from mlsysim.show import table, info

model = mlsysim.Models.Language.Llama3_8B
hardware = mlsysim.Hardware.Cloud.H100
solver = ServingModel()

rows = []
baseline_itl = None
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(
        model=model, hardware=hardware,
        seq_len=2048, batch_size=1, precision=prec
    )
    itl_ms = r.itl.to("ms").magnitude
    if baseline_itl is None:
        baseline_itl = itl_ms
    speedup = baseline_itl / itl_ms
    rows.append([prec, r.itl.to('ms'), r.model_weights_size, f"{speedup:.1f}x"])

table(["Precision", "ITL", "Weights", "Speedup vs FP16"], rows)

Precision      ITL   Weights  Speedup vs FP16
─────────────────────────────────────────────
fp16       5.19 ms  16.06 GB             1.0x
int8       2.76 ms   8.03 GB             1.9x
int4       1.54 ms   4.02 GB             3.4x

INT8 gives roughly 2x speedup. INT4 gives roughly 4x. The speedup tracks the byte reduction almost exactly — because the workload is purely memory-bound. Every byte you eliminate directly reduces the time to load model weights.

3. Compute-Bound Case: Training at Large Batch

Now let’s try the same optimization on a compute-bound workload — ResNet-50 training at batch 256 on the A100:

from mlsysim.solvers import SingleNodeModel

train_model = mlsysim.Models.Vision.ResNet50
train_hw = mlsysim.Hardware.Cloud.A100
train_solver = SingleNodeModel()

rows = []
baseline_lat = None
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(
        model=train_model, hardware=train_hw,
        batch_size=256, precision=prec
    )
    lat_ms = p.latency.to("ms").magnitude
    if baseline_lat is None:
        baseline_lat = lat_ms
    speedup = baseline_lat / lat_ms if lat_ms > 0 else 0
    rows.append([prec, p.latency.to('ms'), f"{p.throughput:.0f} img/s", f"{speedup:.1f}x"])

table(["Precision", "Latency", "Throughput", "Speedup"], rows)

Precision   Latency            Throughput  Speedup
──────────────────────────────────────────────────
fp16       13.97 ms  18323 / second img/s     1.0x
int8        7.24 ms  35343 / second img/s     1.9x
int4       13.97 ms  18323 / second img/s     1.0x

/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/mlsysim/engine/solvers/performance.py:74: UserWarning: Precision 'int4' not in NVIDIA A100 precision_flops (available: ['fp32', 'tf32', 'int8', 'fp16_sparse']); using default peak_flops (FP16). Results may be inaccurate.
  return Engine.solve(model, hardware, batch_size=batch_size, precision=precision, efficiency=efficiency, raise_errors=raise_errors, **kwargs)

The speedup is negligible. Why? Because at batch 256, ResNet-50 training is compute-bound. The bottleneck is arithmetic throughput (FLOP/s), not memory bandwidth. Reducing bytes per parameter does not change the number of FLOPs in the forward and backward passes. The GPU is already saturated with compute — loading weights faster does not help.

Nuance: INT8 Tensor Cores

In practice, GPUs with dedicated INT8/INT4 Tensor Cores (like A100 and H100) also gain higher compute throughput at lower precision — e.g., the A100 does 624 TFLOP/s INT8 vs. 312 TFLOP/s FP16, a 2× compute boost. This means quantization simultaneously changes both the memory ceiling (fewer bytes) and the compute ceiling (more INT ops/sec). For workloads near the ridge point, this dual effect can shift the regime classification itself. MLSys·im’s first-order model captures the memory effect; the compute boost is a second-order effect that depends on hardware-specific Tensor Core support.

4. The Reveal: Same Optimization, Two Regimes

Let’s put the results side by side to make the contrast stark:

rows = []

# Memory-bound: LLM decode
decode_row = ["Llama-3 8B decode"]
for prec in ["fp16", "int8", "int4"]:
    r = solver.solve(model=model, hardware=hardware, seq_len=2048, batch_size=1, precision=prec)
    decode_row.append(r.itl.to('ms'))
decode_row.append("Memory-bound")
rows.append(decode_row)

# Compute-bound: training
train_row = ["ResNet-50 train bs=256"]
for prec in ["fp16", "int8", "int4"]:
    p = train_solver.solve(model=train_model, hardware=train_hw, batch_size=256, precision=prec)
    train_row.append(p.latency.to('ms'))
train_row.append("Compute-bound")
rows.append(train_row)

table(["Workload", "FP16", "INT8", "INT4", "Regime"], rows)

Workload                    FP16     INT8      INT4         Regime
──────────────────────────────────────────────────────────────────
Llama-3 8B decode        5.19 ms  2.76 ms   1.54 ms   Memory-bound
ResNet-50 train bs=256  13.97 ms  7.24 ms  13.97 ms  Compute-bound

Key Insight

Quantization reduces bytes loaded from memory. If you are memory-bound, fewer bytes means proportional speedup. If you are compute-bound, fewer bytes means nothing — compute, not memory, is the ceiling. The regime determines whether the optimization works. INT4 gives ~4x for LLM decode (memory-bound) and ~0x for large-batch training (compute-bound). The same technique, applied to different regimes, yields completely different results. Always check the regime before choosing an optimization.

5. The Accuracy Tax: CompressionModel

Quantization is not free — it trades accuracy for speed. The CompressionModel quantifies this trade-off:

from mlsysim.solvers import CompressionModel

comp_solver = CompressionModel()

rows = []
for bits in [16, 8, 4]:
    c = comp_solver.solve(
        model=model, hardware=hardware,
        method="quantization", target_bitwidth=bits
    )
    rows.append([
        bits,
        c.compressed_size_gb,
        f"{c.compression_ratio:.1f}x",
        f"{c.estimated_accuracy_delta:+.1%}",
        f"{c.memory_savings_pct:.1f}%"
    ])

table(["Bits", "Compressed", "Compression", "Accuracy", "Savings"], rows)

Bits  Compressed  Compression  Accuracy  Savings
────────────────────────────────────────────────
16      16.06 GB         2.0x     +0.0%    50.0%
8        8.03 GB         4.0x     -0.2%    75.0%
4        4.02 GB         8.0x     -2.5%    87.5%

INT8 has minimal accuracy loss (< 1%). INT4 can degrade accuracy by 2-5% depending on the model and calibration method. The decision is not “always quantize” — it is “quantize when you are memory-bound AND the accuracy cost is acceptable for your application.”

When NOT to Quantize

Training: You are compute-bound at large batch sizes. Quantization does not help and introduces gradient noise that can harm convergence.
High-accuracy applications: Medical, financial, and safety-critical systems may not tolerate even 1% accuracy loss.
Already compute-bound inference: If your inference workload runs at large batch sizes (e.g., offline batch processing), you are likely compute-bound already.

Your Turn

Exercises

Exercise 1: Predict before you compute. Llama-3 70B has 5x more parameters than Llama-3 8B, making it even more memory-bound at batch 1. Before running code, predict: will INT4 give a larger or smaller speedup for the 70B model compared to the 8B? Write your prediction, then verify with mlsysim.Models.Language.Llama3_70B. (Hint: think about what determines speedup in the memory-bound regime.)

Exercise 2: Find the crossover batch size. At some batch size, LLM inference transitions from memory-bound to compute-bound. At that point, quantization stops helping. Sweep batch sizes from 1 to 256 for Llama-3 8B on the H100 and compare FP16 vs. INT4 ITL. At what batch size does the INT4 speedup drop below 2x? Below 1.5x?

Exercise 3: Accuracy-compression frontier. Use CompressionModel to compare quantization (INT8, INT4) vs. pruning — a technique that removes parameters entirely (setting them to zero), reducing both model size and computation. Try sparsity levels of 0.5, 0.75, and 0.9 for Llama-3 8B. Build a table showing compression ratio vs. accuracy delta for each method. Which method gives the best compression-to-accuracy trade-off?

Self-check: A workload has arithmetic intensity of 5 FLOP/byte and the hardware ridge point is 150 FLOP/byte. Is this workload memory-bound or compute-bound? Will quantization help? (Answer: Memory-bound. Yes, quantization will help proportionally.)

Key Takeaways

Summary

Quantization primarily reduces bytes loaded from memory: it helps memory-bound workloads proportionally and compute-bound workloads negligibly (though dedicated INT8/INT4 Tensor Cores also increase compute throughput)
LLM decode at batch 1 is the ideal case for quantization: ~2x for INT8, ~4x for INT4
Large-batch training is compute-bound: quantization provides near-zero speedup
The regime determines the outcome: always check whether you are memory-bound or compute-bound before applying quantization
Accuracy is the tax: INT8 costs < 1%, INT4 costs 2-5% — acceptable for some applications, not for others

Next Steps

KV-Cache: The Hidden Tax — Quantization also shrinks the KV-cache, allowing more concurrent users
The Memory Wall — Revisit the memory wall to see how quantization shifts the bandwidth bottleneck
Starving the GPU — Another case where the bottleneck is not where you expect
Where to Invest — Quantify exactly how much quantization buys you compared to hardware upgrades