The Literature Zoo

Cited Scalars for MFU, Chinchilla, Batch Size, and Communication

The Literature Zoo holds the published anchors cited in the textbook appendices: MFU bands, benchmark anchors, critical-batch-size anchors, Chinchilla ratios, and communication overheads. Every entry is a Sourced scalar with structured Provenance (see Provenance).

Important: Not hardware specs

Do not confuse Literature.Training.MfuHigh with a GPU datasheet field. Literature entries are cited teaching assumptions; silicon numbers live in Hardware. Operational run-overhead profiles live in Ops, and scenario scale-efficiency profiles live in Scenarios. Provenance records where a value came from; the registry path records what category of value it is.

Training MFU bands

| Entry | Value | Description |
|---|---|---|
| MFU Training (Upper Bound) | 0.5 | Upper bound MFU for excellent large-model training runs. |
| MFU Inference (Batch 1) | 0.05 | MFU for single-request inference, heavily memory-bandwidth-bound. |
| MFU Inference (Batched) | 0.4 | Illustrative MFU upper bound for large-batch inference. |
| MFU Training (Lower Bound) | 0.3 | Lower bound MFU for well-optimized large-model training. |
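
An MFU band is easiest to read as a multiplier on peak throughput. The sketch below is a minimal illustration, not the mlsysim implementation: the constants are copied by hand from the table above, and the peak figure of 1e15 FLOP/s is a hypothetical accelerator chosen for round numbers, not a datasheet value.

```python
# Values copied from the table above; in mlsysim they would come from the
# Literature registry rather than being hard-coded.
MFU_TRAINING_LOW = 0.3
MFU_TRAINING_HIGH = 0.5


def achieved_flops(peak_flops: float, mfu: float) -> float:
    """Return sustained FLOP/s given peak FLOP/s and an MFU fraction."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    return peak_flops * mfu


# Hypothetical accelerator peak (illustrative only, not a Hardware entry).
PEAK_FLOPS = 1.0e15

low = achieved_flops(PEAK_FLOPS, MFU_TRAINING_LOW)
high = achieved_flops(PEAK_FLOPS, MFU_TRAINING_HIGH)
print(f"sustained training throughput: {low:.2e} to {high:.2e} FLOP/s")
```

The same helper applies to the inference entries; only the MFU fraction changes.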

Benchmark anchors

| Entry | Value | Description |
|---|---|---|
| Llama 3 8B H100 ITL Lower Bound | 3.0 | Lower edge of the H100 Llama-family decode latency sanity envelope. |
| Llama 3 8B H100 ITL Upper Bound | 10.0 | Upper edge of the H100 Llama-family decode latency sanity envelope. |
| ResNet-50 A100 Training Throughput | 3200.0 | Single-accelerator ResNet-50/A100 throughput anchor for empirical sanity checks. |
| ResNet-50 H100 Training Throughput | 5000.0 | Single-accelerator ResNet-50/H100 throughput anchor for empirical sanity checks. |
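
A sanity check against these anchors can be as simple as an interval test for the latency envelope and a relative gap for the throughput anchors. The following is a sketch under stated assumptions: the helper names `within_envelope` and `relative_gap` are hypothetical, the measured values are made up for illustration, and measurements are assumed to be in the same units as the registry entries.

```python
# Anchor values copied from the table above. Measurements must be expressed
# in the same units as the registry entries.
ITL_LOW, ITL_HIGH = 3.0, 10.0
RESNET50_A100_THROUGHPUT = 3200.0


def within_envelope(measured: float, low: float, high: float) -> bool:
    """True if a measurement falls inside the closed interval [low, high]."""
    return low <= measured <= high


def relative_gap(measured: float, anchor: float) -> float:
    """Signed fractional deviation of a measurement from an anchor."""
    return (measured - anchor) / anchor


# Hypothetical measurements, for illustration only.
measured_itl = 6.5
measured_throughput = 2880.0

print("ITL inside envelope:", within_envelope(measured_itl, ITL_LOW, ITL_HIGH))
print(f"throughput gap vs anchor: "
      f"{relative_gap(measured_throughput, RESNET50_A100_THROUGHPUT):+.1%}")
```

A result far outside the envelope is a prompt to recheck the model or the measurement, not proof that either is wrong.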

Critical batch size anchors

| Entry | Value | Description |
|---|---|---|
| BERT critical batch size | 256.0 | Rounded critical batch-size anchor for BERT-scale training examples. |
| Default critical batch size | 1024.0 | Generic rounded critical batch-size anchor for first-pass training examples. |
| GPT-3 critical batch size | 4096.0 | Rounded critical batch-size anchor for GPT-3-scale training examples. |
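
One common way to interpret a critical batch size is the trade-off from the empirical large-batch training model, in which a run at batch size B needs roughly (1 + B_crit/B) times the minimum number of optimizer steps and (1 + B/B_crit) times the minimum number of training examples. Whether the textbook uses exactly this form is an assumption here; the sketch below only shows how the anchors would plug into it, with hypothetical function names.

```python
# Anchors copied from the table above.
CRITICAL_BATCH = {"bert": 256.0, "default": 1024.0, "gpt3": 4096.0}


def step_multiplier(batch: float, critical_batch: float) -> float:
    """Optimizer steps relative to the large-batch minimum: 1 + B_crit / B."""
    if batch <= 0:
        raise ValueError("batch must be positive")
    return 1.0 + critical_batch / batch


def sample_multiplier(batch: float, critical_batch: float) -> float:
    """Training examples relative to the small-batch minimum: 1 + B / B_crit."""
    if critical_batch <= 0:
        raise ValueError("critical_batch must be positive")
    return 1.0 + batch / critical_batch


b_crit = CRITICAL_BATCH["default"]
for batch in (256.0, 1024.0, 4096.0):
    print(f"B={batch:6.0f}  steps x{step_multiplier(batch, b_crit):.2f}  "
          f"samples x{sample_multiplier(batch, b_crit):.2f}")
```

At B equal to the critical batch size both multipliers are 2, which is why the anchor marks the point where larger batches stop paying for themselves.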

Chinchilla anchors

| Entry | Value | Description |
|---|---|---|
| Training Compute Constant (C ≈ 6PD) | 6.0 | Training FLOPs multiplier (6PD): 2 forward + 4 backward FLOPs per parameter per token. |
| Decode Compute Constant (2P) | 2.0 | Autoregressive decode FLOPs multiplier (2P): 2 forward FLOPs per parameter per token. |
| Compute-Optimal Token Ratio | 20.0 | Optimal training tokens per parameter (D ≈ 20P). |
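
Combining the three constants gives a quick budget estimate: D ≈ 20P tokens, C ≈ 6PD training FLOPs, and 2P FLOPs per decoded token. The sketch below hard-codes the table values and uses hypothetical helper names; the 70-billion-parameter model is only a worked example.

```python
# Constants copied from the table above.
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0   # C ≈ 6PD
DECODE_FLOPS_PER_PARAM = 2.0        # 2P per generated token
TOKENS_PER_PARAM = 20.0             # D ≈ 20P


def compute_optimal_tokens(params: float) -> float:
    """Compute-optimal training tokens for a model with `params` parameters."""
    return TOKENS_PER_PARAM * params


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs, C ≈ 6PD."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * params * tokens


def decode_flops_per_token(params: float) -> float:
    """Approximate forward FLOPs to generate one token, 2P."""
    return DECODE_FLOPS_PER_PARAM * params


params = 70e9  # worked example: a 70B-parameter model
tokens = compute_optimal_tokens(params)
print(f"tokens: {tokens:.2e}")
print(f"training FLOPs: {training_flops(params, tokens):.2e}")
print(f"decode FLOPs per token: {decode_flops_per_token(params):.2e}")
```

Because D scales with P, the compute-optimal budget grows quadratically: C ≈ 6 · P · 20P = 120P².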

Communication

| Entry | Value | Description |
|---|---|---|
| AllReduce factor | 2.0 | Ring AllReduce communication multiplier (2×). |
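
The 2× figure is the large-group limit of the ring AllReduce volume: with N participants each one sends 2(N − 1)/N times the message size, which approaches 2 as N grows. The sketch below compares the exact factor with the rounded anchor. The message size and link bandwidth are hypothetical values, and the time estimate ignores latency terms.

```python
RING_ALLREDUCE_FACTOR = 2.0  # rounded anchor from the table above


def ring_allreduce_factor(num_ranks: int) -> float:
    """Exact per-rank traffic multiplier for ring AllReduce: 2(N - 1) / N."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    return 2.0 * (num_ranks - 1) / num_ranks


def transfer_time(message_bytes: float, bandwidth_bytes_per_s: float,
                  factor: float) -> float:
    """Bandwidth-only transfer time; per-step latency is ignored."""
    return factor * message_bytes / bandwidth_bytes_per_s


# Hypothetical workload: 1 GB of gradients over a 100 GB/s link.
MESSAGE_BYTES = 1.0e9
BANDWIDTH = 1.0e11

for n in (2, 8, 64, 1024):
    exact = ring_allreduce_factor(n)
    t = transfer_time(MESSAGE_BYTES, BANDWIDTH, exact)
    print(f"N={n:5d}  factor={exact:.4f}  time={t * 1e3:.2f} ms")

anchor_time = transfer_time(MESSAGE_BYTES, BANDWIDTH, RING_ALLREDUCE_FACTOR)
print(f"anchor (2x) estimate: {anchor_time * 1e3:.2f} ms")
```

For small groups the rounded anchor overestimates traffic, so it is a conservative choice for first-pass estimates.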

Python Access

```python
import mlsysim

chinchilla_ratio = mlsysim.Literature.Chinchilla.TokensPerParam
mfu_high = mlsysim.Literature.Training.MfuHigh
ring_factor = mlsysim.Literature.Communication.RingAllreduceFactor
```

When calibrating η in accuracy.qmd, cross-check against these literature bands before picking solver efficiency kwargs.
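
Such a cross-check might look like the sketch below, assuming η is a utilization-like fraction directly comparable to the training MFU band. The function name `classify_efficiency` is hypothetical, and the band edges are copied by hand from the Training MFU table rather than read from the registry.

```python
# Band edges copied from the Training MFU table (lower and upper bound).
MFU_TRAINING_LOW = 0.3
MFU_TRAINING_HIGH = 0.5


def classify_efficiency(eta: float,
                        low: float = MFU_TRAINING_LOW,
                        high: float = MFU_TRAINING_HIGH) -> str:
    """Report where a calibrated efficiency sits relative to a literature band."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be a fraction in [0, 1]")
    if eta < low:
        return "below"
    if eta > high:
        return "above"
    return "within"


# Hypothetical calibrated values, for illustration only.
for eta in (0.22, 0.41, 0.63):
    print(f"eta={eta:.2f} is {classify_efficiency(eta)} the literature band")
```

A value above the band is not necessarily wrong, but it deserves a second look before it is passed to a solver.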
