Accuracy & Validation

How Well Do MLSYSIM Predictions Match Real Hardware?

MLSYSIM is a first-order analytical model. It predicts performance from closed-form equations, not from cycle-accurate simulation or empirical measurement. This page documents where those predictions are accurate, where they diverge, and what drives the gap.

Note: What “first-order” means

A first-order model captures the dominant system behavior without modeling second-order effects such as cache hierarchy dynamics, memory fragmentation, NIC DMA contention, or driver overhead. Expect predictions within 15–30% of measured throughput for well-optimized workloads on modern hardware. Use MLSYSIM to reason about bottlenecks and compare configurations, not to produce production SLA estimates.

For the formal treatment of roofline modeling and arithmetic intensity, see the Hardware Acceleration slides (Vol I, Ch 11).


Design Philosophy

MLSYSIM follows the same trade-off that Hennessy & Patterson’s MIPS simulator made for computer architecture: sacrifice cycle accuracy for taxonomic completeness and execution speed. A gem5 simulation of a single LLaMA-70B inference step can take hours; MLSYSIM solves it in milliseconds. The goal is not to replace empirical benchmarks but to enable rapid, principled reasoning about system design trade-offs before committing to expensive experiments.

This philosophy maps directly to the six solvers in the Math Foundations page. Each solver implements a specific analytical model grounded in published research, and each model has a well-characterized accuracy envelope documented below.


The Accuracy Hierarchy

Not all MLSYSIM predictions are created equal. The table below ranks prediction types from most to least accurate, so you know how much to trust each output.

| Prediction Type | Typical Accuracy | Why |
|---|---|---|
| KV-cache sizing | Exact | Definitional formula, not approximated |
| Checkpoint sizing | Exact | Direct calculation: \(N \times \text{bytes\_per\_param}\) |
| Bottleneck classification (compute vs. memory) | >95% correct | The roofline ridge point is a structural property of the hardware |
| Relative configuration ranking (which config is faster?) | >90% correct | Errors cancel when comparing two configurations on the same hardware |
| Scaling efficiency direction (how does MFU change with DP/TP/PP?) | ±10% on MFU | Communication models are bandwidth-optimal lower bounds |
| Single-node throughput (absolute latency) | ±15–30% | Sensitive to the efficiency parameter η |
| TCO and carbon estimates | ±20% | Dominated by PUE and grid carbon intensity assumptions |
| ITL in production serving | −25% to −50% | Missing KV-cache paging, batch scheduling, and quantization kernel overhead |
Tip: The golden rule

Trust MLSYSIM for direction and classification. Be cautious with absolute numbers. The model tells you which resource is the bottleneck and which configuration is better. It does not promise the exact millisecond.


Validation Against Published Benchmarks

The table below compares MLSYSIM roofline predictions against publicly reported results from MLPerf Inference v4.0 (July 2024) and community benchmarks.

| Workload | Hardware | Predicted | Measured | Error | Source |
|---|---|---|---|---|---|
| ResNet-50 (BS=1) | A100 SXM4 | ~0.42 ms | ~0.38 ms | +11% | MLPerf Inference v4.0 |
| ResNet-50 (BS=64) | A100 SXM4 | ~8.1 ms | ~7.5 ms | +8% | MLPerf Inference v4.0 |
| BERT-Large (BS=1) | H100 SXM5 | ~2.1 ms | ~1.9 ms | +11% | MLPerf Inference v4.0 |
| Llama2-70B TTFT | H100 SXM5 | ~45 ms (2K ctx) | ~40–50 ms | ±10% | vLLM benchmarks |
| Llama2-70B ITL | H100 SXM5 | ~4.2 ms/token | ~5–8 ms/token | −25% | vLLM benchmarks |
Warning: ITL underprediction is expected

MLSYSIM predicts the roofline lower bound for inter-token latency. It does not model quantization kernel overhead, KV-cache paging latency (PagedAttention), or batch scheduling overhead in production serving systems. Real ITL is typically 1.5–2× the roofline bound. Use ITL predictions as the theoretical floor, not as a production estimate.

For the full treatment of serving-system overheads (continuous batching, PagedAttention, SLO compliance), see the Inference at Scale slides (Vol II, Ch 9).
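The roofline floor for ITL can be sketched directly: in memory-bound decode, each generated token must stream the full (sharded) weight set through HBM. The following is a minimal sketch with illustrative hardware numbers, not MLSYSIM output; the H100 bandwidth figure and TP degree are assumptions.

```python
# Roofline lower bound on inter-token latency (ITL) during decode.
# Small-batch decode is memory-bound: every token streams the full
# weight set (plus KV cache) through HBM. Numbers below are illustrative.

def itl_floor_ms(n_params, bytes_per_param, kv_bytes, hbm_bw_bytes_per_s, tp=1):
    """Theoretical floor on ms/token: bytes moved per token / aggregate bandwidth."""
    bytes_per_token = n_params * bytes_per_param + kv_bytes
    return bytes_per_token / (hbm_bw_bytes_per_s * tp) * 1e3

# Example: 70B params in fp16, sharded TP=8 across H100-class GPUs (~3.35 TB/s each, assumed)
floor_ms = itl_floor_ms(70e9, 2, kv_bytes=0, hbm_bw_bytes_per_s=3.35e12, tp=8)
print(f"ITL floor ~= {floor_ms:.1f} ms/token")  # production ITL is typically 1.5-2x this
```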

Interpreting the error column

All errors are relative to the measured value: \(\text{Error} = (\text{Predicted} - \text{Measured}) / \text{Measured}\). A positive error means MLSYSIM overpredicts latency (conservative). A negative error means MLSYSIM underpredicts (optimistic). The sign matters: overprediction is safer for capacity planning; underprediction is dangerous for SLA commitments.
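Applied to the first row of the table above (ResNet-50 BS=1), the convention works out as follows; this is a trivial sketch of the definition, not MLSYSIM code:

```python
def rel_error(predicted, measured):
    """Signed relative error; positive = overprediction (conservative)."""
    return (predicted - measured) / measured

# ResNet-50 (BS=1) row: predicted ~0.42 ms vs. measured ~0.38 ms
print(f"{rel_error(0.42, 0.38):+.0%}")  # positive: MLSYSIM overpredicts, safe for capacity planning
```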


Where MLSYSIM Is Most Accurate

Bottleneck classification (roofline)

The model is most reliable for determining which resource limits performance. If MLSYSIM predicts “Memory Bound,” the actual workload will almost always be memory-bound too, even if the exact latency differs. This classification is typically >95% correct across documented workloads because the roofline ridge point is a structural property of the hardware that does not depend on software efficiency.

The underlying roofline model is documented in the Hardware Acceleration slides (Vol I, Ch 11) and applied to real H100 workloads in the Compute Infrastructure slides (Vol II, Ch 2).
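The classification itself reduces to one comparison: arithmetic intensity against the hardware ridge point. A minimal sketch, using rounded H100 datasheet figures as assumptions:

```python
# Roofline bottleneck classification: a workload is memory-bound when its
# arithmetic intensity (FLOPs per byte moved) falls below the hardware
# ridge point (peak FLOP/s divided by peak memory bandwidth).
# The H100 figures below are rounded and illustrative.

def classify(ai_flops_per_byte, peak_flops, peak_bw):
    ridge = peak_flops / peak_bw
    return "Memory Bound" if ai_flops_per_byte < ridge else "Compute Bound"

H100_BF16_FLOPS = 989e12   # dense bf16 Tensor Core peak, FLOP/s (assumed)
H100_HBM_BW = 3.35e12      # HBM3 bandwidth, bytes/s (assumed)

print(classify(1.0, H100_BF16_FLOPS, H100_HBM_BW))   # decode-like AI ~1 FLOP/byte
print(classify(600.0, H100_BF16_FLOPS, H100_HBM_BW)) # large-batch GEMM
```

Because the ridge point depends only on two datasheet constants, the verdict is insensitive to the efficiency parameter η, which is why this classification is the model's most trustworthy output.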

Scaling efficiency direction (distributed training)

The model correctly predicts how scaling efficiency changes as you vary DP/TP/PP configuration. The relative ranking of configurations is reliable even when absolute MFU values are off by ±10%. This is because the communication cost models (Ring AllReduce, Tree AllReduce, Hierarchical AllReduce) implement bandwidth-optimal lower bounds from Patarasuk & Mueller (2009), so the relative costs scale correctly.

See the Distributed Training slides (Vol II, Ch 5) for the 3D parallelism framework and scaling efficiency analysis.
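The bandwidth-optimal bound is simple enough to state in a few lines. A sketch of the Ring AllReduce cost from Patarasuk & Mueller (2009); the link bandwidth below is an assumption, and this mirrors the formula rather than MLSYSIM internals:

```python
# Bandwidth-optimal Ring AllReduce (Patarasuk & Mueller, 2009): each of p
# ranks transfers 2*(p-1)/p * N bytes, so the time bound depends only weakly
# on p -- which is why relative configuration rankings stay reliable even
# when absolute MFU is off.

def ring_allreduce_s(n_bytes, p, link_bw_bytes_per_s):
    if p == 1:
        return 0.0  # nothing to reduce across ranks
    return 2 * (p - 1) / p * n_bytes / link_bw_bytes_per_s

# Example: 1 GB of gradients over a 400 GB/s NVLink-class link (assumed)
for p in (2, 4, 8):
    print(f"p={p}: {ring_allreduce_s(1e9, p, 400e9) * 1e3:.2f} ms")
```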

KV-cache and checkpoint sizing

The KV-cache formula (\(\text{KV} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times S \times B \times \text{bpe}\)) is definitional, not approximated. It computes the exact number of bytes the K and V tensors occupy. Similarly, checkpoint sizing (\(N \times \text{bytes\_per\_param}\)) is a direct count. Memory feasibility checks (feasible: True/False) are accurate because they compare against the same HBM capacity reported in manufacturer datasheets.
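Both formulas can be written out directly. The sketch below uses a Llama2-70B-like shape (GQA with 8 KV heads) as an illustrative example; it restates the definitional formulas above, not MLSYSIM's implementation:

```python
# Exact KV-cache size, straight from the definitional formula:
# KV = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def checkpoint_bytes(n_params, bytes_per_param):
    return n_params * bytes_per_param

GIB = 1024**3
# Llama2-70B-like shape (assumed): 80 layers, 8 KV heads (GQA), head_dim 128, fp16
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=8, bytes_per_elem=2)
print(f"KV cache: {kv / GIB:.1f} GiB")  # for 8 concurrent 4K-token sequences
print(f"fp16 checkpoint: {checkpoint_bytes(70e9, 2) / GIB:.1f} GiB")
```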

Carbon and TCO estimates

Sustainability and economics predictions are accurate to within 20% for standard cloud deployments. The main source of error is the assumed PUE (power usage effectiveness), which varies from ~1.1 (hyperscaler) to ~1.6 (enterprise). The carbon accounting methodology aligns with the lifecycle framework in the Sustainable AI slides (Vol II, Ch 15).
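The PUE sensitivity is easy to see in a two-line model. A sketch of operational carbon accounting, where the GPU power draw and grid carbon intensity are illustrative assumptions:

```python
# Operational carbon estimate: energy at the wall = IT energy * PUE,
# emissions = wall energy * grid carbon intensity.
# Power draw, runtime, and grid intensity below are illustrative.

def operational_co2_kg(it_energy_kwh, pue, grid_kgco2_per_kwh):
    return it_energy_kwh * pue * grid_kgco2_per_kwh

job_kwh = 8 * 0.7 * 24  # 8 GPUs at ~700 W each for 24 h (assumed)
for label, pue in (("hyperscaler", 1.1), ("enterprise", 1.6)):
    kg = operational_co2_kg(job_kwh, pue, grid_kgco2_per_kwh=0.4)
    print(f"{label} (PUE {pue}): {kg:.0f} kg CO2e")
```

Holding the workload fixed, the PUE spread alone moves the estimate by ~45%, which is why it dominates the error budget.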


Where MLSYSIM Diverges From Measurement

| Source of Error | Typical Impact | When It Matters |
|---|---|---|
| efficiency=0.5 default | ±20% on latency | Any roofline prediction |
| No cache hierarchy model | 5–30% on small batches | Batch size 1–4, small models |
| No NVLink contention model | 5–15% on TP overhead | Tensor parallel with TP > 4 |
| No pipeline schedule optimization | 10–20% on PP efficiency | Interleaved 1F1B schedules |
| No quantization kernel overhead | −30% on INT8 ITL | Quantized serving |
| No memory fragmentation model | −10% to −20% on KV-cache capacity | Long-context serving |

The efficiency parameter (η)

MLSYSIM uses efficiency (η, default 0.5) as a single scalar representing hardware utilization. This is the largest source of absolute error. The table below provides calibrated ranges from published benchmarks.

| Workload Type | Recommended η | Reference |
|---|---|---|
| Training (fp16/bf16, Megatron-LM) | 0.35–0.55 | Shoeybi et al. (2019) |
| Inference (fp16, vLLM/TRT-LLM) | 0.25–0.45 | MLPerf Inference v4.0 |
| Inference (int8, quantized) | 0.20–0.40 | Community benchmarks |
| Edge inference (TFLite, ONNX RT) | 0.15–0.30 | MLPerf Tiny |

When you use the default efficiency=0.5, you are modeling a well-optimized training job on datacenter hardware. For inference, pass efficiency=0.35 for more conservative estimates. For edge devices, use efficiency=0.2.
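To see how much η moves the answer, consider a standalone sketch of an attainable-performance estimate scaled by a single utilization scalar. The hardware peaks and workload FLOP count below are assumptions for illustration, not MLSYSIM internals:

```python
# Effect of the efficiency scalar (eta) on a roofline latency estimate:
# attainable = eta * min(peak_flops, AI * peak_bw), latency = work / attainable.
# Hardware figures and workload size are illustrative assumptions.

def latency_s(flops, ai, eta, peak_flops, peak_bw):
    attainable = eta * min(peak_flops, ai * peak_bw)
    return flops / attainable

work = 1e12  # 1 TFLOP of work at high arithmetic intensity (compute-bound)
for label, eta in (("training default", 0.5), ("inference", 0.35), ("edge", 0.2)):
    ms = latency_s(work, ai=600, eta=eta, peak_flops=989e12, peak_bw=3.35e12) * 1e3
    print(f"{label} (eta={eta}): {ms:.2f} ms")
```

Because latency scales as 1/η, picking η from the wrong row of the table above translates directly into the ±20% (or worse) absolute error documented earlier.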

For a rigorous treatment of precision engineering and its impact on performance, see the Performance Engineering slides (Vol II, Ch 10).

What is not modeled

MLSYSIM deliberately omits the following effects. Each omission is a design decision, not an oversight:

  1. Cache hierarchy behavior. L1/L2/SRAM tiling effects can swing latency by 5–30% for small batch sizes. Modeling them would require operator-level simulation, which conflicts with the goal of millisecond solve times.

  2. NVLink contention under tensor parallelism. For TP ≤ 4 (within a single NVSwitch domain), contention is negligible. For TP > 4 (cross-switch), contention can add 5–15% overhead that MLSYSIM does not capture.

  3. Pipeline schedule variants. MLSYSIM implements the standard GPipe bubble formula: \(\text{bubble} = (P - 1) / (V \cdot M + P - 1)\). Advanced schedules (zero-bubble, interleaved 1F1B with \(V > 1\)) can reduce bubbles by 10–20% beyond what the formula predicts.

  4. Quantization kernel efficiency. INT8/INT4 kernels achieve lower utilization than FP16 Tensor Core kernels due to dequantization overhead and less mature compiler support. MLSYSIM treats precision as a pure ops-per-byte ratio.

  5. Memory fragmentation and PagedAttention overhead. The KV-cache formula gives the dense allocation size. In practice, PagedAttention introduces fragmentation (typically 5–10%) and paging latency.
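The bubble formula from item 3 is easy to evaluate directly; this sketch just restates it, with example P/M/V values chosen for illustration:

```python
# GPipe-style pipeline bubble fraction as the model states it:
# bubble = (P - 1) / (V * M + P - 1), with P pipeline stages, M microbatches,
# and V interleaved virtual stages (V=1 recovers plain GPipe).

def bubble_fraction(p, m, v=1):
    return (p - 1) / (v * m + p - 1)

# More microbatches or interleaving shrink the bubble:
print(f"P=8, M=8,  V=1: {bubble_fraction(8, 8):.2f}")
print(f"P=8, M=32, V=1: {bubble_fraction(8, 32):.2f}")
print(f"P=8, M=32, V=2: {bubble_fraction(8, 32, v=2):.2f}")
```

Zero-bubble and other advanced schedules fall below even the V > 1 value, which is the residual 10–20% error noted above.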


When to Use MLSYSIM (and When Not To)

Use MLSYSIM for:

  • Comparing hardware for a known workload (“H100 vs. MI300X for Llama-3 70B?”)
  • Choosing a parallelism strategy (“DP=8 or TP=4×PP=2?”)
  • Estimating memory feasibility (“Does Llama-3 405B fit in 8×H100 HBM at FP16?”)
  • Budgeting TCO and carbon before procurement decisions
  • Teaching the Iron Law, the roofline model, and systems reasoning
  • Rapidly prototyping system designs before committing to cluster time

Do not use MLSYSIM for:

  • Production SLA guarantees (use empirical benchmarks instead)
  • Kernel-level optimization (use PyTorch Profiler or NSight Compute)
  • Quantized model accuracy (MLSYSIM models throughput, not model quality)
  • Exact latency targets where ±30% error is unacceptable
  • Novel hardware with no published specs in the Silicon Zoo

Slide Deck Reference

The theory behind MLSYSIM’s analytical models is covered across several lecture decks. Use this table to find the right slides for each validation domain.

| Validation Domain | MLSYSIM Solver | Companion Slides |
|---|---|---|
| Roofline model, arithmetic intensity, bottleneck classification | SingleNodeModel | Hardware Acceleration (Vol I, Ch 11) |
| Statistical rigor, MLPerf methodology, benchmark anti-patterns | All solvers | Benchmarking (Vol I, Ch 12) |
| H100 roofline landscape, GPU specs, TCO analysis | SingleNodeModel, EconomicsModel | Compute Infrastructure (Vol II, Ch 2) |
| 3D parallelism, scaling efficiency, communication overhead | DistributedModel | Distributed Training (Vol II, Ch 5) |
| Serving latency, KV-cache, continuous batching, SLO compliance | ServingModel | Inference at Scale (Vol II, Ch 9) |
| Precision engineering, operator fusion, profiling methodology | SingleNodeModel | Performance Engineering (Vol II, Ch 10) |
| Energy, carbon lifecycle, PUE, energy roofline | SustainabilityModel | Sustainable AI (Vol II, Ch 15) |

Citing Sources

The hardware specifications in the Silicon Zoo are sourced from official manufacturer datasheets. See the source_url and last_verified metadata fields in mlsysim/hardware/registry.py for the specific document and verification date for each entry.

For the MLPerf comparison data on this page, see MLPerf Inference v4.0 Results (MLCommons, July 2024).


If you observe a significant discrepancy between MLSYSIM predictions and measured results on your hardware, please open an issue with the workload, hardware, and measured numbers. Discrepancies often reveal bugs or missing constants in the model.
