Model Accuracy & Validation

How well do MLSys·im predictions match measured hardware performance?

MLSys·im is a first-order analytical model: it predicts performance from closed-form equations, not from empirical measurement. This page documents where those predictions are accurate, where they diverge, and why.

Note: What “first-order” means

A first-order analytical model captures the dominant system behavior without modeling second-order effects like cache hierarchy behavior, memory fragmentation, NIC DMA contention, or driver overhead. Expect predictions to be within 15–30% of measured throughput for well-optimized workloads on modern hardware. Use MLSys·im to reason about bottlenecks and compare configurations, not to produce production SLA estimates.


Validation Against Published Benchmarks

MLSys·im validates across both regimes of the Roofline model: a compute-bound anchor (large-batch ResNet-50, where batch size pushes arithmetic intensity above the ridge point) and a memory-bound anchor (LLM autoregressive decoding, where single-token generation streams the full weight matrix). Confirming accuracy in both regimes demonstrates that the analytical model captures the dominant bottleneck on each side of the ridge point, not just one.

| Workload | Hardware | Regime | Predicted | Measured | Error | Source |
|---|---|---|---|---|---|---|
| ResNet-50 (BS=1) | A100 SXM4 | Memory | ~0.42 ms | ~0.38 ms | +11% | MLPerf Inference v4.0 |
| ResNet-50 (BS=64) | A100 SXM4 | Compute | ~8.1 ms | ~7.5 ms | +8% | MLPerf Inference v4.0 |
| BERT-Large (BS=1) | H100 SXM5 | Memory | ~2.1 ms | ~1.9 ms | +11% | MLPerf Inference v4.0 |
| Llama2-70B TTFT | H100 SXM5 | Compute | ~45 ms (2K ctx) | ~40–50 ms | ±10% | vLLM benchmarks |
| Llama2-70B ITL | H100 SXM5 | Memory | ~4.2 ms/token | ~5–8 ms/token | −25% | vLLM benchmarks |
Warning: ITL underprediction is expected

MLSys·im’s ITL prediction does not include quantization kernel overhead, KV-cache paging latency (as in PagedAttention), or batch scheduling overhead in production serving systems like vLLM. Real ITL in production is typically 1.5–2× the roofline lower bound. Treat ITL predictions as a best-case estimate of what efficient hardware can theoretically achieve.


Where MLSys·im is Most Accurate

Single-node roofline (compute vs. memory bound classification):

The model is most reliable for determining which resource limits performance. If the model predicts “Memory Bound,” the actual workload will almost always be memory-bound too — even if the exact latency numbers differ. This classification is typically >95% correct across documented workloads.
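The classification itself is simple to sketch: compare a workload’s arithmetic intensity against the hardware ridge point. The snippet below is a minimal illustration using A100 datasheet figures, not MLSys·im’s actual API:

```python
def classify(intensity_flop_per_byte: float, peak_flops: float, memory_bw: float) -> str:
    """Roofline classification: a workload whose arithmetic intensity
    (FLOPs per byte moved) falls below the ridge point is memory-bound."""
    ridge = peak_flops / memory_bw  # FLOP/byte where the two roofs meet
    return "compute-bound" if intensity_flop_per_byte >= ridge else "memory-bound"

# A100 SXM4 datasheet peaks: 312 TFLOP/s bf16, 2.039 TB/s HBM2e (ridge ≈ 153)
print(classify(1.0, 312e12, 2.039e12))    # single-token decode → memory-bound
print(classify(300.0, 312e12, 2.039e12))  # large-batch GEMM → compute-bound
```

Even when the absolute latency prediction is off, this comparison lands on the correct side of the ridge point for the workloads in the table above.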

Scaling efficiency direction (distributed training):

The model correctly predicts how scaling efficiency changes as you vary DP/TP/PP configuration. The relative ranking of configurations is reliable, even if absolute MFU values are off by ±10%.

KV-cache sizing (LLM serving memory planning):

The formula \(\text{KV-Cache} = 2 \times L \times H_{kv} \times d_{head} \times S \times B \times \text{bpp}\) is exact — this is definitional, not approximated. Memory feasibility checks (feasible: True/False) are accurate because they compare against the same HBM capacity reported in datasheets.
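Because the formula is definitional, it is directly computable. The sketch below uses Llama2-70B’s published shape (80 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per parameter in fp16); the helper function is illustrative, not MLSys·im’s API:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_param):
    # KV-Cache = 2 (K and V) * L * H_kv * d_head * S * B * bpp
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_param

# Llama2-70B in fp16 at 4K context, batch 1
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=4096, batch=1, bytes_per_param=2)
print(size / 2**30)  # 1.25 GiB per 4K-context sequence
```

Dividing free HBM after weights by this per-sequence figure gives the maximum concurrent batch, which is exactly the feasibility check described above.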

Carbon and TCO estimates (order-of-magnitude):

Sustainability and economics predictions are accurate to within 20% for standard cloud deployments. The main source of error is the assumed PUE (power usage effectiveness), which varies significantly by datacenter operator and workload intensity.
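The underlying arithmetic is energy at the wall (device power scaled by PUE) multiplied by grid carbon intensity. A minimal sketch with assumed values for PUE and grid intensity (both hypothetical, chosen for illustration):

```python
def job_emissions_kgco2(avg_power_w, hours, pue, grid_kgco2_per_kwh):
    """Wall energy = device energy * PUE; emissions = wall energy * grid intensity."""
    energy_kwh = avg_power_w / 1000 * hours * pue
    return energy_kwh * grid_kgco2_per_kwh

# Hypothetical: one GPU averaging 400 W for 24 h, PUE 1.2, grid at 0.4 kgCO2/kWh
print(job_emissions_kgco2(400, 24, 1.2, 0.4))  # ≈ 4.6 kg CO2
```

Note how directly the result scales with PUE: moving the same job between a PUE 1.1 and a PUE 1.6 facility shifts the estimate by nearly 50%, which is why PUE dominates the error budget.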


Where MLSys·im Diverges From Measurement

| Source of Error | Typical Impact | When It Matters |
|---|---|---|
| efficiency=0.5 default | ±20% on latency | Any roofline prediction |
| No cache hierarchy model | 5–30% on small batches | Batch size 1–4, small models |
| No NVLink contention model | 5–15% on TP overhead | Tensor parallel with TP > 4 |
| No pipeline schedule optimization | 10–20% on PP efficiency | Interleaved 1F1B schedules |
| No quantization kernel overhead | −30% on INT8 ITL | Quantized serving |
| No memory fragmentation | −10–20% on KV-cache capacity | Long-context serving |

The efficiency parameter

MLSys·im uses efficiency (η, default 0.5) as a single scalar representing hardware utilization. The table below provides framework-calibrated starting points:

| Workload | Framework | Hardware | Recommended η | Source |
|---|---|---|---|---|
| Training (bf16) | Megatron-LM | A100/H100 | 0.45–0.55 | Chowdhery et al. (2022), PaLM |
| Training (bf16) | DeepSpeed ZeRO-3 | A100/H100 | 0.35–0.45 | Rajbhandari et al. (2020) |
| Training (bf16) | PyTorch FSDP | A100/H100 | 0.30–0.40 | Zhao et al. (2023) |
| Inference (fp16) | vLLM | A100/H100 | 0.25–0.35 | Kwon et al. (2023) |
| Inference (fp16) | TensorRT-LLM | A100/H100 | 0.35–0.45 | NVIDIA (2024) |
| Inference (int8) | Any quantized | A100/H100 | 0.20–0.35 | Dettmers et al. (2022) |
| Training (fp16) | Any | MI300X | 0.30–0.45 | MLPerf Training v4.0 |

When you use the default efficiency=0.5, you are modeling a well-optimized training job on NVIDIA hardware with Megatron-LM. For inference workloads, pass efficiency=0.30 as a conservative starting point. For AMD or other accelerators, reduce by 10–15% from the NVIDIA baseline until framework maturity catches up.
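In practice η simply scales the datasheet peak before the roofline division. A minimal sketch of a compute-bound latency estimate, using the H100 bf16 peak (989 TFLOP/s dense) and a hypothetical FLOP count:

```python
def compute_latency_ms(flops: float, peak_flops: float, eta: float = 0.5) -> float:
    """First-order compute-bound latency: work / (eta * datasheet peak)."""
    return flops / (eta * peak_flops) * 1e3

# Hypothetical forward pass needing 1e12 FLOPs on an H100 (989 TFLOP/s bf16 peak)
print(compute_latency_ms(1e12, 989e12, eta=0.50))  # ~2.0 ms, optimized training stack
print(compute_latency_ms(1e12, 989e12, eta=0.30))  # ~3.4 ms, conservative inference
```

The two calls bracket the same workload between the Megatron-LM-calibrated η and the conservative inference default, which is usually a more honest way to report a prediction than a single point estimate.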


When to Use MLSys·im vs. When to Profile

The relevant question is not “How accurate is MLSys·im?” but “Does it identify the correct binding constraint?” A first-order model that correctly determines whether a system is memory-bound, compute-bound, or network-bound provides actionable architectural insight even when its absolute latency prediction is ±20%.

| Decision Stage | Use MLSys·im? | Use Profiling? | Why |
|---|---|---|---|
| Exploratory design — “Which GPU family?” | Yes | No | Sweep 100+ configs in seconds |
| Parallelism strategy — “DP=32 or TP=8?” | Yes | No | Relative ranking is reliable |
| Memory feasibility — “Does it fit?” | Yes | No | KV-cache formula is exact |
| Bottleneck classification — “Memory or compute?” | Yes | No | >95% correct classification |
| Carbon/TCO estimation — “Region comparison” | Yes | No | Order-of-magnitude sufficient |
| SLA validation — “Will we meet P99?” | Start here | Then validate | Use as lower bound, then profile |
| Kernel optimization — “Which op is slow?” | No | Yes | Requires profiler trace |
| Production capacity — “Exact QPS at P99?” | No | Yes | Second-order effects matter |
| Framework tuning — “vLLM vs. TensorRT?” | No | Yes | Framework-specific behavior |
| Debugging regression — “Why did ITL jump?” | No | Yes | Regressions are measurement problems |
Tip: The MLSys·im Workflow

Use MLSys·im to narrow the design space (from 1,000 configurations to 5), then use empirical profiling to validate the shortlist. This is the same methodology that chip architects use: analytical models first, then RTL simulation for the promising designs.


What MLSys·im Cannot Model

For transparency, here are the effects that MLSys·im deliberately omits in exchange for sub-second execution:

  • No microarchitectural effects — L1/L2 cache hierarchies, warp scheduling, register pressure are absorbed into the efficiency parameter (η)
  • No real network congestion — Uses the classical α-β formulation; does not model adaptive routing or multi-tenant contention
  • No OS/runtime overhead — Kernel launch latency, CUDA stream scheduling, Python GIL contention are absent
  • No dynamic behavior — Models steady-state throughput; thermal throttling, dynamic clock boosting, and memory fragmentation over time are outside scope
  • Heuristic accuracy models — Compression accuracy curves are conservative estimates, not architecture-specific empirical fits
  • No model loading or warmup — Cold start time for serverless inference and model swapping overhead in multi-model serving are not modeled
  • No GPU partitioning — MIG, MPS, and time-slicing are not represented; all estimates assume exclusive device access

These omissions are not oversights; they are deliberate design choices that trade second-order accuracy for three orders of magnitude improvement in evaluation speed.


Dimensional Correctness as a Validation Layer

Beyond numerical accuracy, MLSys·im enforces dimensional correctness via pint — a runtime invariant analogous to memory safety in Rust. Every quantity carries its physical units, and operations on incompatible units raise a DimensionalityError before producing a result.

This catches an entire class of silent errors that numerical validation cannot:

# These all raise DimensionalityError at runtime:
bandwidth + flops          # GB/s + FLOP/s: incompatible dimensions
memory_capacity + latency  # GB + ms: incompatible dimensions
peak_flops.to("GB/s")      # FLOP/s cannot be converted to GB/s

In practice, the most common back-of-envelope errors — confusing gigabytes with gigabits, mixing per-device and per-node bandwidth, forgetting to convert precision to bytes-per-parameter — are structurally impossible in MLSys·im. The ridge point calculation I* = Peak_FLOPs / Memory_BW is a worked example: if the numerator and denominator have the wrong units, pint catches the error immediately rather than silently producing a meaningless number.


Evaluation Speed

MLSys·im evaluates >1,000 hardware × model × parallelism configurations per second on a laptop CPU. A full 22-wall analysis of a single configuration completes in <0.3 seconds. This is three orders of magnitude faster than cycle-accurate simulators (gem5: hours per configuration) and eliminates the need for hardware access entirely.

The speed advantage enables workflows that are impossible with empirical profiling: sweeping every GPU in the registry against every model at multiple batch sizes to build a complete design-space map before provisioning a single instance.
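Such a sweep is little more than nested loops over analytical formulas. A minimal sketch with a two-entry hypothetical registry and a Llama2-7B-sized model (again, illustrative rather than MLSys·im’s actual API):

```python
# Hypothetical registry: name -> (HBM bandwidth B/s, HBM capacity bytes), datasheet values
DEVICES = {
    "A100-SXM4-80G": (2.039e12, 80e9),
    "H100-SXM5-80G": (3.35e12, 80e9),
}

def decode_ms_per_token(weight_bytes, bw, eta=0.5):
    # Memory-bound decode: each generated token streams the full weight matrix once
    return weight_bytes / (eta * bw) * 1e3

weights = 7e9 * 2  # Llama2-7B in fp16: ~14 GB of weights
ranked = sorted(DEVICES.items(),
                key=lambda kv: decode_ms_per_token(weights, kv[1][0]))
for name, (bw, cap) in ranked:
    fits = weights <= cap
    print(f"{name}: {decode_ms_per_token(weights, bw):.1f} ms/token, fits={fits}")
```

Each evaluation is a handful of floating-point operations, which is why thousands of configurations per second is the expected regime rather than an optimization.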


Citing Sources

The hardware specifications in the Silicon Zoo are sourced from official manufacturer datasheets. See the source_url and last_verified metadata fields in mlsysim/hardware/registry.py for the specific document and verification date for each entry.

For the MLPerf comparison data in this page, see: MLPerf Inference v4.0 Results (MLCommons, July 2024).


If you observe a significant discrepancy between MLSys·im predictions and measured results on your hardware, please open an issue with the workload, hardware, and measured numbers. Discrepancies often reveal bugs or missing constants.
