The Literature Zoo

Cited Scalars for MFU, Chinchilla, Batch Size, and Communication

The Literature Zoo holds published anchors cited in the textbook appendices — MFU bands, Chinchilla ratios, critical-batch-size anchors, and communication overheads. Every entry is a Sourced scalar with structured Provenance (see Provenance).

Not hardware specs

Do not confuse Literature.Training.MfuHigh with a GPU datasheet field. Literature entries are cited teaching assumptions; silicon numbers live in Hardware. Operational run-overhead profiles live in Ops, and scenario scale-efficiency profiles live in Scenarios. Provenance records where a value came from; the registry path records which category it belongs to.

Training MFU bands

Entry	Value	Description
MFU Training (Lower Bound)	0.3	Lower bound MFU for well-optimized large-model training.
MFU Training (Upper Bound)	0.5	Upper bound MFU for excellent large-model training runs.
MFU Inference (Batch 1)	0.05	MFU for single-request inference, heavily memory-bandwidth-bound.
MFU Inference (Batched)	0.4	Illustrative MFU upper bound for large-batch inference.
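
As a rough illustration of how the training band might be used, the sketch below estimates MFU from an observed token throughput with the 6PD training-FLOPs rule and compares the result with the 0.3 to 0.5 band. The function names, the 7B-parameter model, the throughput, and the 1e15 FLOP/s peak are hypothetical placeholders, not mlsysim APIs or datasheet values; the band edges are copied by hand from the table.

```python
# Band edges copied by hand from the table above (assumption: treated as plain floats).
MFU_TRAINING_LOW = 0.3
MFU_TRAINING_HIGH = 0.5
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0  # the 6PD rule: 2 forward + 4 backward


def training_mfu(params: float, tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Achieved training FLOP/s divided by peak FLOP/s (hypothetical helper)."""
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved = TRAIN_FLOPS_PER_PARAM_TOKEN * params * tokens_per_second
    return achieved / peak_flops_per_second


def in_training_band(mfu: float) -> bool:
    """True when the MFU falls inside the cited training band."""
    return MFU_TRAINING_LOW <= mfu <= MFU_TRAINING_HIGH


# Illustrative inputs only: a 7B-parameter model at 10,000 tokens/s
# on a device with a placeholder peak of 1e15 FLOP/s.
mfu = training_mfu(params=7e9, tokens_per_second=1e4, peak_flops_per_second=1e15)
print(f"MFU = {mfu:.2f}, within band: {in_training_band(mfu)}")
```

With these placeholder inputs the estimate lands at roughly 0.42, inside the band. A value well above 0.5 would more likely point to a wrong peak figure or throughput measurement than to an exceptional run.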

Benchmark anchors

Entry	Value	Description
Llama 3 8B H100 ITL Lower Bound	3.0	Lower edge of the H100 Llama-family decode latency sanity envelope.
Llama 3 8B H100 ITL Upper Bound	10.0	Upper edge of the H100 Llama-family decode latency sanity envelope.
ResNet-50 A100 Training Throughput	3200.0	Single-accelerator ResNet-50/A100 throughput anchor for empirical sanity checks.
ResNet-50 H100 Training Throughput	5000.0	Single-accelerator ResNet-50/H100 throughput anchor for empirical sanity checks.

Critical batch size anchors

Entry	Value	Description
BERT critical batch size	256.0	Rounded critical batch-size anchor for BERT-scale training examples.
Default critical batch size	1024.0	Generic rounded critical batch-size anchor for first-pass training examples.
GPT-3 critical batch size	4096.0	Rounded critical batch-size anchor for GPT-3-scale training examples.

Chinchilla anchors

Entry	Value	Description
Training Compute Constant (C ≈ 6PD)	6.0	Training FLOPs multiplier (6PD): 2 forward + 4 backward FLOPs per parameter per token.
Decode Compute Constant (2P)	2.0	Autoregressive decode FLOPs multiplier (2P): 2 forward FLOPs per parameter per token.
Compute-Optimal Token Ratio	20.0	Optimal training tokens per parameter (D ≈ 20P).
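
The three constants combine into a quick back-of-the-envelope budget. The sketch below is a minimal example that hard-codes the table values instead of reading them from mlsysim; the helper names and the 70B-parameter model are illustrative assumptions.

```python
import math

# Values copied by hand from the table above.
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0   # C ≈ 6PD
DECODE_FLOPS_PER_PARAM_TOKEN = 2.0  # 2P per generated token
TOKENS_PER_PARAM = 20.0             # D ≈ 20P


def compute_optimal_tokens(params: float) -> float:
    """Compute-optimal training tokens for a model with `params` parameters."""
    return TOKENS_PER_PARAM * params


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs, C ≈ 6PD."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * params * tokens


def decode_flops_per_token(params: float) -> float:
    """Approximate forward FLOPs to generate one token, 2P."""
    return DECODE_FLOPS_PER_PARAM_TOKEN * params


params = 70e9                               # hypothetical 70B-parameter model
tokens = compute_optimal_tokens(params)     # about 1.4e12 tokens
total = training_flops(params, tokens)      # about 5.88e23 FLOPs
per_token = decode_flops_per_token(params)  # about 1.4e11 FLOPs per token
print(f"D = {tokens:.3g} tokens, C = {total:.3g} FLOPs, decode = {per_token:.3g} FLOPs/token")
```

Substituting D ≈ 20P into C ≈ 6PD gives C ≈ 120P², so under these anchors the compute budget grows quadratically with model size.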

Communication

Entry	Value	Description
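AllReduce factor	2.0	Ring AllReduce communication multiplier (2×): each device transfers 2(N−1)/N times the message size, which approaches 2× as the device count N grows.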
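
To show how close the rounded factor is to the exact ring expression, here is a small sketch. It compares the standard per-device ring volume, 2(N−1)/N times the message size, with the flat 2× anchor. The function names and the 1 GB gradient buffer are assumptions made for illustration, not part of mlsysim.

```python
RING_ALLREDUCE_FACTOR = 2.0  # copied by hand from the table above


def ring_allreduce_bytes_exact(message_bytes: float, num_devices: int) -> float:
    """Bytes each device sends in a ring AllReduce: 2(N-1)/N * message size."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return 2.0 * (num_devices - 1) / num_devices * message_bytes


def ring_allreduce_bytes_approx(message_bytes: float) -> float:
    """Large-N approximation using the flat 2x literature anchor."""
    return RING_ALLREDUCE_FACTOR * message_bytes


message = 1e9  # hypothetical 1 GB gradient buffer
for n in (2, 8, 64):
    exact = ring_allreduce_bytes_exact(message, n)
    approx = ring_allreduce_bytes_approx(message)
    print(f"N={n:>2}: exact={exact:.3e} B, approx={approx:.3e} B, ratio={exact / approx:.3f}")
```

The flat factor overestimates traffic by half at N = 2 and by about 12.5% at N = 8, so it is a conservative first-pass choice that tightens as the ring grows.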

Python Access

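import mlsysim

# Each entry is a Sourced scalar that carries its own Provenance.
chinchilla_ratio = mlsysim.Literature.Chinchilla.TokensPerParam  # compute-optimal tokens per parameter
mfu_high = mlsysim.Literature.Training.MfuHigh  # upper-bound training MFU
ring_factor = mlsysim.Literature.Communication.RingAllreduceFactor  # Ring AllReduce multiplier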

When calibrating η in accuracy.qmd, cross-check your chosen value against the MFU bands above before setting the solver's efficiency kwargs.
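
One way such a cross-check could look is sketched below. It assumes η is expressed as an MFU-like fraction, hard-codes the band values from the MFU table, and uses a hypothetical helper name (check_efficiency); none of this reflects an existing mlsysim function.

```python
import warnings

# (low, high) bands copied by hand from the MFU table.
# None means the table lists no lower anchor for that workload.
MFU_BANDS = {
    "training": (0.3, 0.5),
    "inference_batched": (None, 0.4),
}


def check_efficiency(eta: float, workload: str) -> bool:
    """Return True if eta sits inside the cited band; warn otherwise."""
    low, high = MFU_BANDS[workload]
    ok = (low is None or eta >= low) and eta <= high
    if not ok:
        warnings.warn(
            f"eta={eta} is outside the literature band {(low, high)} for {workload}"
        )
    return ok


eta = 0.45  # candidate efficiency value, illustrative only
if check_efficiency(eta, "training"):
    print(f"eta={eta} is consistent with the training MFU band")
```

A warning rather than an exception keeps the check advisory: the bands are teaching anchors, so a value outside them calls for a second look, not an automatic rejection.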