Model Accuracy & Validation

How well do MLSys·im predictions match measured hardware performance?

MLSys·im is a first-order analytical model: it predicts performance from closed-form equations, not from empirical measurement. This page documents where those predictions are accurate, where they diverge, and why.

Note: What “first-order” means

A first-order analytical model captures the dominant system behavior without modeling second-order effects like cache hierarchy behavior, memory fragmentation, NIC DMA contention, or driver overhead. Expect predictions to be within 15–30% of measured throughput for well-optimized workloads on modern hardware. Use MLSys·im to reason about bottlenecks and compare configurations, not to produce production SLA estimates.


Validation Against Published Benchmarks

MLSys·im validates across both regimes of the Roofline model: a compute-bound anchor (large-batch ResNet-50, where batch size pushes arithmetic intensity above the ridge point) and a memory-bound anchor (LLM autoregressive decoding, where single-token generation streams the full weight matrix). Confirming accuracy in both regimes demonstrates that the analytical model captures the dominant bottleneck on each side of the ridge point, not just one.

| Workload | Hardware | Regime | Predicted | Measured | Error | Source |
|---|---|---|---|---|---|---|
| ResNet-50 (BS=1) | A100 SXM4 | Memory | ~0.42 ms | ~0.38 ms | +11% | MLPerf Inference v4.0 |
| ResNet-50 (BS=64) | A100 SXM4 | Compute | ~8.1 ms | ~7.5 ms | +8% | MLPerf Inference v4.0 |
| BERT-Large (BS=1) | H100 SXM5 | Memory | ~2.1 ms | ~1.9 ms | +11% | MLPerf Inference v4.0 |
| Llama2-70B TTFT | H100 SXM5 | Compute | ~45 ms (2K ctx) | ~40–50 ms | ±10% | vLLM benchmarks |
| Llama2-70B ITL | H100 SXM5 | Memory | ~4.2 ms/token | ~5–8 ms/token | −25% | vLLM benchmarks |
Warning: ITL underprediction is expected

MLSys·im’s ITL prediction does not include quantization kernel overhead, KV-cache paging latency (as in PagedAttention), or batch scheduling overhead in production serving systems like vLLM. Real ITL in production is typically 1.5–2× the roofline lower bound. Treat ITL predictions as a best-case estimate of what efficient hardware can theoretically achieve.


Where MLSys·im is Most Accurate

Single-node roofline (compute vs. memory bound classification):

The model is most reliable for determining which resource limits performance. If the model predicts “Memory Bound,” the actual workload will almost always be memory-bound too — even if the exact latency numbers differ. This classification is typically >95% correct across documented workloads.
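The classification itself is simple to sketch: compare a workload’s arithmetic intensity against the hardware ridge point. The snippet below is a minimal illustration using A100 datasheet figures, not MLSys·im’s actual API:

```python
def classify(intensity_flop_per_byte: float, peak_flops: float, memory_bw: float) -> str:
    """Roofline classification: a workload whose arithmetic intensity
    (FLOPs per byte moved) falls below the ridge point is memory-bound."""
    ridge = peak_flops / memory_bw  # FLOP/byte where the two roofs meet
    return "compute-bound" if intensity_flop_per_byte >= ridge else "memory-bound"

# A100 SXM4 datasheet peaks: 312 TFLOP/s bf16, 2.039 TB/s HBM2e (ridge ≈ 153)
print(classify(1.0, 312e12, 2.039e12))    # single-token decode → memory-bound
print(classify(300.0, 312e12, 2.039e12))  # large-batch GEMM → compute-bound
```

Even when the absolute latency prediction is off, this comparison lands on the correct side of the ridge point for the workloads in the table above.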

Scaling efficiency direction (distributed training):

The model correctly predicts how scaling efficiency changes as you vary DP/TP/PP configuration. The relative ranking of configurations is reliable, even if absolute MFU values are off by ±10%.

KV-cache sizing (LLM serving memory planning):

The formula \(\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}\) is exact — this is definitional, not approximated. Memory feasibility checks (feasible: True/False) are accurate because they compare against the same HBM capacity reported in datasheets.
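Because the formula is definitional, it is directly computable. The sketch below uses Llama2-70B’s published shape (80 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per parameter in fp16); the helper function is illustrative, not MLSys·im’s API:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_param):
    # KV-Cache = 2 (K and V) * L * H_kv * d_head * S * B * bpp
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_param

# Llama2-70B in fp16 at 4K context, batch 1
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=4096, batch=1, bytes_per_param=2)
print(size / 2**30)  # 1.25 GiB per 4K-context sequence
```

Dividing free HBM after weights by this per-sequence figure gives the maximum concurrent batch, which is exactly the feasibility check described above.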

Carbon and TCO estimates (order-of-magnitude):

Sustainability and economics predictions are accurate to within 20% for standard cloud deployments. The main source of error is the assumed PUE (power usage effectiveness), which varies significantly by datacenter operator and workload intensity.
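The underlying arithmetic is energy at the wall (device power scaled by PUE) multiplied by grid carbon intensity. A minimal sketch with assumed values for PUE and grid intensity (both hypothetical, chosen for illustration):

```python
def job_emissions_kgco2(avg_power_w, hours, pue, grid_kgco2_per_kwh):
    """Wall energy = device energy * PUE; emissions = wall energy * grid intensity."""
    energy_kwh = avg_power_w / 1000 * hours * pue
    return energy_kwh * grid_kgco2_per_kwh

# Hypothetical: one GPU averaging 400 W for 24 h, PUE 1.2, grid at 0.4 kgCO2/kWh
print(job_emissions_kgco2(400, 24, 1.2, 0.4))  # ≈ 4.6 kg CO2
```

Note how directly the result scales with PUE: moving the same job between a PUE 1.1 and a PUE 1.6 facility shifts the estimate by nearly 50%, which is why PUE dominates the error budget.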


Where MLSys·im Diverges From Measurement

| Source of Error | Typical Impact | When It Matters |
|---|---|---|
| efficiency=0.5 default | ±20% on latency | Any roofline prediction |
| No cache hierarchy model | 5–30% on small batches | Batch size 1–4, small models |
| No NVLink contention model | 5–15% on TP overhead | Tensor parallel with TP > 4 |
| No pipeline schedule optimization | 10–20% on PP efficiency | Interleaved 1F1B schedules |
| No quantization kernel overhead | −30% on INT8 ITL | Quantized serving |
| No memory fragmentation | −10–20% on KV-cache capacity | Long-context serving |

The efficiency parameter

MLSys·im uses efficiency (η, default 0.5) as a single scalar representing hardware utilization. The table below provides framework-calibrated starting points:

| Workload | Framework | Hardware | Recommended η | Source |
|---|---|---|---|---|
| Training (bf16) | Megatron-LM | A100/H100 | 0.45–0.55 | Chowdhery et al. (2022), PaLM |
| Training (bf16) | DeepSpeed ZeRO-3 | A100/H100 | 0.35–0.45 | Rajbhandari et al. (2020) |
| Training (bf16) | PyTorch FSDP | A100/H100 | 0.30–0.40 | Zhao et al. (2023) |
| Inference (fp16) | vLLM | A100/H100 | 0.25–0.35 | Kwon et al. (2023) |
| Inference (fp16) | TensorRT-LLM | A100/H100 | 0.35–0.45 | NVIDIA (2024) |
| Inference (int8) | Any quantized | A100/H100 | 0.20–0.35 | Dettmers et al. (2022) |
| Training (fp16) | Any | MI300X | 0.30–0.45 | MLPerf Training v4.0 |

When you use the default efficiency=0.5, you are modeling a well-optimized training job on NVIDIA hardware with Megatron-LM. For inference workloads, pass efficiency=0.30 as a conservative starting point. For AMD or other accelerators, reduce by 10–15% from the NVIDIA baseline until framework maturity catches up.
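In practice η simply scales the datasheet peak before the roofline division. A minimal sketch of a compute-bound latency estimate, using the H100 bf16 peak (989 TFLOP/s dense) and a hypothetical FLOP count:

```python
def compute_latency_ms(flops: float, peak_flops: float, eta: float = 0.5) -> float:
    """First-order compute-bound latency: work / (eta * datasheet peak)."""
    return flops / (eta * peak_flops) * 1e3

# Hypothetical forward pass needing 1e12 FLOPs on an H100 (989 TFLOP/s bf16 peak)
print(compute_latency_ms(1e12, 989e12, eta=0.50))  # ~2.0 ms, optimized training stack
print(compute_latency_ms(1e12, 989e12, eta=0.30))  # ~3.4 ms, conservative inference
```

The two calls bracket the same workload between the Megatron-LM-calibrated η and the conservative inference default, which is usually a more honest way to report a prediction than a single point estimate.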


When to Use MLSys·im vs. When to Profile

The relevant question is not “How accurate is MLSys·im?” but “Does it identify the correct binding constraint?” A first-order model that correctly determines whether a system is memory-bound, compute-bound, or network-bound provides actionable architectural insight even when its absolute latency prediction is ±20%.

| Decision Stage | Use MLSys·im? | Use Profiling? | Why |
|---|---|---|---|
| Exploratory design — “Which GPU family?” | Yes | No | Sweep 100+ configs in seconds |
| Parallelism strategy — “DP=32 or TP=8?” | Yes | No | Relative ranking is reliable |
| Memory feasibility — “Does it fit?” | Yes | No | KV-cache formula is exact |
| Bottleneck classification — “Memory or compute?” | Yes | No | >95% correct classification |
| Carbon/TCO estimation — “Region comparison” | Yes | No | Order-of-magnitude sufficient |
| SLA validation — “Will we meet P99?” | Start here | Then validate | Use as lower bound, then profile |
| Kernel optimization — “Which op is slow?” | No | Yes | Requires profiler trace |
| Production capacity — “Exact QPS at P99?” | No | Yes | Second-order effects matter |
| Framework tuning — “vLLM vs. TensorRT?” | No | Yes | Framework-specific behavior |
| Debugging regression — “Why did ITL jump?” | No | Yes | Regressions are measurement problems |
Tip: The MLSys·im Workflow

Use MLSys·im to narrow the design space (from 1,000 configurations to 5), then use empirical profiling to validate the shortlist. This is the same methodology that chip architects use: analytical models first, then RTL simulation for the promising designs.


What MLSys·im Cannot Model

For transparency, here are the effects that MLSys·im deliberately omits in exchange for sub-second execution:

  • No microarchitectural effects — L1/L2 cache hierarchies, warp scheduling, register pressure are absorbed into the efficiency parameter (η)
  • No real network congestion — Uses the classical α-β formulation; does not model adaptive routing or multi-tenant contention
  • No OS/runtime overhead — Kernel launch latency, CUDA stream scheduling, Python GIL contention are absent
  • No dynamic behavior — Models steady-state throughput; thermal throttling, dynamic clock boosting, and memory fragmentation over time are outside scope
  • Heuristic accuracy models — Compression accuracy curves are conservative estimates, not architecture-specific empirical fits
  • No model loading or warmup — Cold start time for serverless inference and model swapping overhead in multi-model serving are not modeled
  • No GPU partitioning — MIG, MPS, and time-slicing are not represented; all estimates assume exclusive device access

These omissions are not oversights; they are deliberate design choices that trade second-order accuracy for three orders of magnitude improvement in evaluation speed.


Dimensional Correctness as a Validation Layer

Beyond numerical accuracy, MLSys·im enforces dimensional correctness via pint — a runtime invariant analogous to memory safety in Rust. Every quantity carries its physical units, and operations on incompatible units raise a DimensionalityError before producing a result.

This catches an entire class of silent errors that numerical validation cannot:

# These all raise DimensionalityError at runtime:
bandwidth + flops          # GB/s + FLOP/s: incompatible dimensions
memory_capacity + latency  # GB + ms: incompatible dimensions
peak_flops.to("GB/s")      # FLOP/s cannot be converted to GB/s

In practice, the most common back-of-envelope errors — confusing gigabytes with gigabits, mixing per-device and per-node bandwidth, forgetting to convert precision to bytes-per-parameter — are structurally impossible in MLSys·im. The ridge point calculation I* = Peak_FLOPs / Memory_BW is a worked example: if the numerator and denominator have the wrong units, pint catches the error immediately rather than silently producing a meaningless number.


Evaluation Speed

MLSys·im evaluates >1,000 hardware × model × parallelism configurations per second on a laptop CPU. A full 22-wall analysis of a single configuration completes in <0.3 seconds. This is three orders of magnitude faster than cycle-accurate simulators (gem5: hours per configuration) and eliminates the need for hardware access entirely.

The speed advantage enables workflows that are impossible with empirical profiling: sweeping every GPU in the registry against every model at multiple batch sizes to build a complete design-space map before provisioning a single instance.
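Such a sweep is little more than nested loops over analytical formulas. A minimal sketch with a two-entry hypothetical registry and a Llama2-7B-sized model (again, illustrative rather than MLSys·im’s actual API):

```python
# Hypothetical registry: name -> (HBM bandwidth B/s, HBM capacity bytes), datasheet values
DEVICES = {
    "A100-SXM4-80G": (2.039e12, 80e9),
    "H100-SXM5-80G": (3.35e12, 80e9),
}

def decode_ms_per_token(weight_bytes, bw, eta=0.5):
    # Memory-bound decode: each generated token streams the full weight matrix once
    return weight_bytes / (eta * bw) * 1e3

weights = 7e9 * 2  # Llama2-7B in fp16: ~14 GB of weights
ranked = sorted(DEVICES.items(),
                key=lambda kv: decode_ms_per_token(weights, kv[1][0]))
for name, (bw, cap) in ranked:
    fits = weights <= cap
    print(f"{name}: {decode_ms_per_token(weights, bw):.1f} ms/token, fits={fits}")
```

Each evaluation is a handful of floating-point operations, which is why thousands of configurations per second is the expected regime rather than an optimization.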


Citing Sources

The hardware specifications in the Silicon Zoo are sourced from official manufacturer datasheets. See the source_url and last_verified metadata fields in mlsysim/hardware/registry.py for the specific document and verification date for each entry.

For the MLPerf comparison data in this page, see: MLPerf Inference v4.0 Results (MLCommons, July 2024).


If you observe a significant discrepancy between MLSys·im predictions and measured results on your hardware, please open an issue with the workload, hardware, and measured numbers. Discrepancies often reveal bugs or missing constants.
