dT/dBW = -0.88 vs. dT/dFLOPS = -0.06. One number tells you where to spend your budget.
Use partial derivatives of latency to identify the binding constraint for any model-hardware pair. Then invert the Roofline to derive minimum hardware specs from an SLA.
The Question
Your team has budget for one hardware upgrade. Do you buy more FLOPS or more bandwidth? Intuition says “more compute is always better” — but for LLM inference, bandwidth is 15x more valuable than FLOPS. This tutorial shows you how to compute that number analytically, and then invert the analysis to derive minimum hardware from an SLA.
Compute partial derivatives of latency with respect to each hardware parameter
Identify the binding constraint for any model-hardware pair
Quantify the asymmetry between bandwidth and FLOPS sensitivity
Derive minimum hardware specs from a latency SLA using inverse Roofline
Background: Sensitivity Analysis
In optimization, the binding constraint is the resource that actually limits performance — the one holding with equality at the solution. Sensitivity analysis perturbs each hardware parameter by a fixed percentage and measures how much latency changes. The result is a set of numerical partial derivatives: \(\frac{\Delta T / T}{\Delta x / x}\) for each parameter \(x\). The parameter with the largest absolute sensitivity is the binding constraint — the one most worth investing in.
Each sensitivity value is the elasticity: “If I increase this parameter by 10%, latency changes by this fraction.” A sensitivity of -0.88 on memory_bandwidth means a 10% bandwidth increase yields roughly an 8.8% latency decrease. A sensitivity near -0.06 on peak_flops means more compute does almost nothing.
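The perturb-and-measure procedure is easy to sketch by hand. The following is a minimal illustration on a toy Roofline latency model, not the tutorial's solver: the `roofline_latency` and `sensitivities` functions and the hardware numbers are assumptions for demonstration.

```python
# Minimal sketch: numerical sensitivity analysis on a toy Roofline latency
# model. Names and hardware values are illustrative, not the tutorial's API.

def roofline_latency(hw, bytes_moved, flops):
    """Latency is whichever limit is slower: memory streaming or compute."""
    return max(bytes_moved / hw["memory_bandwidth"], flops / hw["peak_flops"])

def sensitivities(hw, bytes_moved, flops, eps=0.10):
    """Elasticity of latency w.r.t. each parameter: (dT/T) / (dx/x)."""
    base = roofline_latency(hw, bytes_moved, flops)
    out = {}
    for param in hw:
        bumped = dict(hw, **{param: hw[param] * (1 + eps)})
        t = roofline_latency(bumped, bytes_moved, flops)
        out[param] = ((t - base) / base) / eps
    return out

# A100-like numbers: ~2.0e12 B/s HBM, ~312e12 FLOP/s. A batch-1 decode step
# for a 70B FP16 model streams ~140 GB and does ~140 GFLOP (~1 FLOP/byte).
hw = {"memory_bandwidth": 2.0e12, "peak_flops": 312e12}
sens = sensitivities(hw, bytes_moved=140e9, flops=140e9)
print(sens)  # memory_bandwidth ~ -0.91, peak_flops = 0.0
```

Note that a pure `max()` Roofline gives exactly zero FLOPS sensitivity when memory-bound; a richer latency model with partial compute/memory overlap yields the small nonzero value (like -0.06) discussed in this tutorial.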
3. The Binding Constraint
```python
info(
    "Binding Constraint",
    Constraint=res.binding_constraint,
    Interpretation=f"{res.binding_constraint} is the hardware knob most worth "
                   f"turning for {model.name} on {hardware.name}",
)
```
── Binding Constraint ──────────────────────
Constraint: memory_bandwidth
Interpretation: memory_bandwidth is the hardware knob most worth turning for Llama-3.1-70B on NVIDIA A100
For a 70B-parameter model at batch size 1, every decode step must stream the entire model from HBM. The arithmetic intensity is approximately 1 FLOP/byte — far below the A100’s ridge point. The system is deeply memory-bound, and the sensitivity analysis confirms it quantitatively.
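The "~1 FLOP/byte vs. ridge point" comparison can be checked with a few lines of arithmetic. This sketch assumes FP16 weights and approximate A100 80GB SXM datasheet figures:

```python
# Why batch-1 decode is memory-bound: compare the workload's arithmetic
# intensity to the hardware ridge point. Values are approximate assumptions.
params = 70e9                               # Llama-3.1-70B parameters
flops_per_token = 2 * params                # one multiply-add per weight
bytes_per_token = params * 2                # FP16: whole model streamed/step

intensity = flops_per_token / bytes_per_token   # FLOP per byte moved
ridge = 312e12 / 2.0e12                         # peak FLOP/s / peak bytes/s

print(f"Arithmetic intensity: {intensity:.1f} FLOP/byte")  # 1.0
print(f"A100 ridge point:     {ridge:.0f} FLOP/byte")      # 156
print("memory-bound" if intensity < ridge else "compute-bound")
```

At 1 FLOP/byte against a ridge point of ~156, the workload sits two orders of magnitude into the memory-bound region, which is why the bandwidth elasticity dominates.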
4. The 15x Asymmetry
Let us make the asymmetry concrete. How much improvement does each dollar of upgrade buy?
```python
sens_bw = abs(res.sensitivities.get("memory_bandwidth", 0))
sens_flops = abs(res.sensitivities.get("peak_flops", 0))

if sens_flops > 0:
    ratio = sens_bw / sens_flops
    info(
        "Sensitivity Asymmetry",
        Bandwidth_sensitivity=f"{sens_bw:.4f}",
        FLOPS_sensitivity=f"{sens_flops:.4f}",
        Ratio=f"{ratio:.1f}x",
        Verdict=f"A dollar spent on bandwidth improvement is ~{ratio:.0f}x "
                f"more impactful than the same dollar spent on more FLOP/s",
    )
else:
    info(
        "Sensitivity Asymmetry",
        Bandwidth_sensitivity=f"{sens_bw:.4f}",
        FLOPS_sensitivity=f"{sens_flops:.4f}",
        Verdict="FLOPS has zero sensitivity --- purely memory-bound",
    )
```
── Sensitivity Asymmetry ───────────────────
Bandwidth sensitivity: 0.8800
FLOPS sensitivity: 0.0600
Ratio: 14.7x
Verdict: A dollar spent on bandwidth improvement is ~15x more impactful than the same dollar spent on more FLOP/s
Key Insight
Sensitivity analysis reveals that bandwidth is ~15x more valuable than FLOPS for LLM inference. The partial derivative dT/dBW = -0.88 means a 10% bandwidth increase yields 8.8% latency reduction, while dT/dFLOPS = -0.06 means 10% more FLOPS yields only 0.6% improvement. This is not intuition — it is a quantitative measurement that should drive every hardware procurement decision. The binding constraint, not the headline spec, determines where your budget creates value.
Fallacy: Investing in the Highest-Spec Number Maximizes Performance
GPU vendors advertise peak FLOP/s prominently because the number is large and impressive. But for memory-bound workloads, a 10% bandwidth increase yields 15x more improvement than a 10% compute increase. The datasheet headline and the binding constraint are often different parameters — sensitivity analysis tells you which one actually matters.
5. Inverse Roofline: From SLA to Hardware
Sensitivity analysis tells you which parameter is worth improving. The natural follow-up is: given a performance target, how much improvement do you actually need?
The SynthesisSolver inverts the Roofline model. Instead of “given hardware, what is the latency?”, it asks: “given a latency SLA, what hardware do I need?”
Suppose your deployment requires an inter-token latency (ITL) of 50 ms or less:
── Inverse Roofline: Required Hardware ─────
Target SLA: 50 ms ITL
Min memory BW: 2.82 TB/s
Min compute: 5.65 TFLOP/s
Min memory: 141.2 GB
The synthesis tells us we need approximately 2.8 TB/s of memory bandwidth — 1.4x what the A100 provides. This immediately narrows the hardware search to H100-class or newer GPUs.
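The bandwidth and capacity floors in that output follow from simple arithmetic, sketched below under the assumption of FP16 weights; the compute floor is derived the same way from the solver's per-token FLOP accounting.

```python
# Inverse Roofline by hand: every decode step must stream the full model,
# so the SLA directly sets a bandwidth floor. FP16 sizing is an assumption.
sla_itl = 0.050                          # 50 ms per decoded token
params = 70.6e9                          # Llama-3.1-70B (approximate)
model_bytes = params * 2                 # FP16 weights

min_capacity_gb = model_bytes / 1e9      # weights must fit in memory
min_bw_tbs = model_bytes / sla_itl / 1e12  # stream all weights every step

print(f"Min memory:    {min_capacity_gb:.1f} GB")  # 141.2 GB
print(f"Min memory BW: {min_bw_tbs:.2f} TB/s")     # 2.82 TB/s
```

Halving the SLA doubles the bandwidth floor: the requirement scales as model bytes divided by the target inter-token latency.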
6. Generational Comparison: Does the Binding Constraint Shift?
The most important insight from sensitivity analysis is that hardware upgrades can shift the binding constraint. Let us compare across three GPU generations:
If all three GPUs show memory_bandwidth as the binding constraint, it confirms that the memory wall persists across generations. Compute has grown faster than bandwidth, so the problem is getting worse, not better. If the binding constraint shifts on newer hardware, it signals a qualitative regime change — your optimization strategy must change accordingly.
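One quick way to see the memory wall trend is to compare ridge points across generations. The datasheet figures below (dense BF16 peak over HBM bandwidth) are approximate values assumed for illustration:

```python
# Has the memory wall moved? Ridge point = peak FLOP/s / peak bytes/s.
# Datasheet figures are approximate and used only for illustration.
gpus = {
    "A100 SXM": (312e12, 2.0e12),   # ~312 TFLOP/s BF16, ~2.0 TB/s HBM2e
    "H100 SXM": (989e12, 3.35e12),  # ~989 TFLOP/s BF16, ~3.35 TB/s HBM3
}
ridges = {name: flops / bw for name, (flops, bw) in gpus.items()}
for name, ridge in ridges.items():
    print(f"{name}: ridge point = {ridge:.0f} FLOP/byte")
# The ridge point roughly doubles from A100 to H100: compute grew faster
# than bandwidth, so a ~1 FLOP/byte decode workload falls further below it.
```

A rising ridge point means the memory-bound region is widening, consistent with the "memory wall persists across generations" reading above.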
Your Turn
Exercises
Exercise 1: Predict before you compute. Before running any code, predict: which parameter has the highest sensitivity for ResNet-50 at batch size 256 on an H100? (Hint: CNNs at large batch sizes have very high arithmetic intensity.) Write your prediction, then verify with solver.solve(model=mlsysim.Models.ResNet50, hardware=mlsysim.Hardware.Cloud.H100). Were you right?
Exercise 2: Inverse solve for a tighter SLA. Use SynthesisSolver to find the minimum hardware specs for a 100 ms TTFT SLA on Llama-3 70B. What bandwidth does this require? Does any hardware in the Silicon Zoo meet this spec? What does this tell you about the feasibility of sub-100ms TTFT for 70B-parameter models?
Exercise 3: The crossover model size. Run the sensitivity analysis on three models of increasing size: mlsysim.Models.Llama3_8B, mlsysim.Models.Llama3_70B, and mlsysim.Models.GPT3 (175B). At what model size does the binding constraint shift from bandwidth to compute, if at all? What does the trend tell you about the direction of the memory wall?
Self-check: If a 10% bandwidth increase yields 8.8% latency reduction, and a 10% FLOPS increase yields 0.6% latency reduction, how much bandwidth increase would you need to match the effect of doubling FLOPS?
Key Takeaways
Summary
Sensitivity analysis computes numerical partial derivatives of latency, revealing which hardware parameter is worth investing in
Bandwidth is ~15x more valuable than FLOPS for LLM inference at batch size 1