System Assumptions
Purpose
What assumptions sit underneath the “napkin math” throughout the book?
Every quantitative example in this book, from training-time estimates to energy-per-inference calculations, uses a specific set of assumed numbers: reference accelerator bandwidth, reference model scale, cloud electricity price, decimal GB definitions, and dozens more. This appendix collects those assumptions in one place so the book’s arithmetic can be audited, checked against chapter napkin math, or substituted with a different accelerator generation, regional power price, or local convention. In D·A·M terms, the assumptions keep data volume, algorithm work, and machine capacity measured on the same scale across chapters.
How to Use This Appendix
When a chapter says an H100 delivers a certain ridge point, or that training a 7B model needs a certain amount of memory, the underlying bandwidths, capacities, and model sizes are listed here. Find the relevant section (for example, NVIDIA H100 or Reference Model Specifications), read the Value and Unit columns, and plug them into your own estimate. The tables are grouped by topic: accelerators, reference models, energy per access, interconnect bandwidth, economics, production-scale anchors, and unit conventions. Assumptions come from vendor datasheet peaks, published studies or industry reports, illustrative market or grid statistics, and book conventions; section 1.1 lists the provenance in one place.
Napkin Math 1.1: Napkin math with these constants
Problem: Decide whether a workload is compute bound or memory bound.
Variables: For an H100 at FP16/BF16 peak, use 989 TFLOP/s peak compute and 3.35 TB/s memory bandwidth.
Math: Divide peak FLOP/s by memory bandwidth to get the roofline ridge point: 989 TFLOP/s / 3.35 TB/s \(\approx\) 295.2 FLOP/byte. A large general matrix multiply (GEMM) with \(n=\) 4,096 has intensity \(n/3 \approx\) 1,365.3 FLOP/byte.
Result: Operations above 295.2 FLOP/byte are compute bound; operations below it are memory bound. The GEMM example is compute bound, while a single-token autoregressive decode with intensity \(\approx 1\) FLOP/byte is deeply memory bound.
Systems insight: The same accelerator can be compute rich and memory constrained. Arithmetic intensity determines which resource the workload actually consumes.
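The ridge-point calculation above can be sketched in a few lines of Python. This is a minimal sketch, not the book's own tooling: the function names are hypothetical, and the GEMM intensity assumes 2-byte elements with each of the three \(n \times n\) matrices moved exactly once, which is the idealized reuse behind \(n/3\).

```python
# Roofline ridge point for the H100 reference values in this appendix.
PEAK_FLOPS = 989e12   # FP16/BF16 tensor peak, FLOP/s
MEM_BW = 3.35e12      # HBM bandwidth, bytes/s


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return peak_flops / mem_bw


def square_gemm_intensity(n: int, bytes_per_element: int = 2) -> float:
    """Intensity of an n-by-n GEMM under an idealized-reuse assumption."""
    flops = 2 * n**3                            # one multiply and one add per term
    bytes_moved = 3 * n**2 * bytes_per_element  # A, B, and C each moved once
    return flops / bytes_moved


def classify(intensity: float, ridge: float) -> str:
    """Label a workload by comparing its intensity to the ridge point."""
    return "compute bound" if intensity > ridge else "memory bound"


ridge = ridge_point(PEAK_FLOPS, MEM_BW)
gemm = square_gemm_intensity(4096)
print(f"ridge point: {ridge:.1f} FLOP/byte")
print(f"GEMM n=4096: {gemm:.1f} FLOP/byte -> {classify(gemm, ridge)}")
print(f"decode:      1.0 FLOP/byte -> {classify(1.0, ridge)}")
```

Swapping in another accelerator's peak and bandwidth from the tables below moves the ridge point without changing the classification logic.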
Problem: Estimate the memory needed to train a 7B model.
Variables: Mixed-precision Adaptive Moment Estimation (Adam) stores BF16 weights (2 bytes), BF16 gradients (2 bytes), FP32 master weights (4 bytes), FP32 momentum (4 bytes), and FP32 variance (4 bytes). The optimizer state budget totals 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter.
Math: For 7B parameters, model state is 7 \(\times 10^9 \times\) 16 bytes = 112 GB.
Result: An H100 has 80 GiB of HBM, about 85.9 GB in decimal units, so the 112 GB of model state alone exceeds a single accelerator before accounting for activations.
Systems insight: Optimizer state, not weights alone, sets the floor for training memory. Capacity planning that starts from parameter bytes underestimates the real requirement.
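The memory estimate above can be reproduced with a short Python sketch. The dictionary keys and function name are hypothetical labels chosen for illustration; the byte widths are the mixed-precision Adam convention stated in this callout, and the H100 capacity is the 80 GiB row from the accelerator tables.

```python
# Per-parameter storage under the mixed-precision Adam convention.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

GB = 10**9    # decimal gigabyte, the book's convention
GIB = 2**30   # binary gibibyte, used for HBM capacity rows


def model_state_bytes(num_params: int) -> int:
    """Bytes of weights, gradients, and optimizer state, before activations."""
    return num_params * sum(BYTES_PER_PARAM.values())


state = model_state_bytes(7_000_000_000)
h100_hbm = 80 * GIB

print(f"bytes per parameter: {sum(BYTES_PER_PARAM.values())}")
print(f"7B model state:      {state / GB:.1f} GB")
print(f"H100 HBM:            {h100_hbm / GB:.1f} GB")
print(f"fits on one H100:    {state <= h100_hbm}")
```

Keeping the components in a dictionary makes it easy to test a different layout, for example dropping the FP32 master weights, without touching the rest of the estimate.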
Problem: Estimate the electricity cost of a GPT-3-scale training run on the reference cluster.
Variables: Use 1,024 A100s, 400 W per accelerator, ~25 days wall-clock, and $0.12/kWh electricity.
Math: The run requires roughly \(3.14 \times 10^{23}\) FLOPs. The A100-equivalent energy estimate is 25 days \(\times\) 24 h/day \(\times\) 1,024 A100s \(\times\) 0.4 kW = 245,760 kWh. At $0.12/kWh, that comes to roughly $29,491.
Result: The electricity estimate is only a small fraction of total training cost, which is dominated by accelerator amortization. The original GPT-3 run used V100-era infrastructure; this is an A100-equivalent reference estimate, not a derivation of the run duration from peak FLOP/s alone.
Systems insight: Energy is measurable from power and time, but full cost accounting must also include capital utilization, networking, storage, staffing, and failed or repeated runs.
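The electricity estimate above reduces to one multiplication chain, sketched here in Python. The function name is hypothetical, and the sketch assumes every accelerator draws its full 400 W for the whole run; it ignores cooling, host power, and networking, which the callout also leaves out.

```python
def electricity_cost(num_accelerators: int, watts_each: float,
                     days: float, dollars_per_kwh: float) -> tuple[float, float]:
    """Return (energy in kWh, cost in dollars) for a constant-power run."""
    hours = days * 24
    energy_kwh = num_accelerators * (watts_each / 1000) * hours
    return energy_kwh, energy_kwh * dollars_per_kwh


# Reference cluster from the callout: 1,024 A100s at 400 W for about 25 days.
energy_kwh, cost = electricity_cost(1024, 400, 25, 0.12)
print(f"energy: {energy_kwh:,.0f} kWh")
print(f"cost:   ${cost:,.2f}")
```

Substituting a regional electricity price or a different TDP changes only the arguments, which is the kind of substitution this appendix is meant to support.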
Accelerator Specifications
These are the accelerator assumptions used in the book’s roofline, training-memory, energy, and cost examples: peak throughput, memory bandwidth, memory capacity, and thermal design power (TDP) for each generation the chapters cite. Values are vendor datasheet peaks (ceilings for napkin math, not sustained utilization), drawn from NVIDIA product documentation and IEEE Micro architecture articles (NVIDIA Corporation 2017, 2018, 2020, 2024; Choquette et al. 2021; Choquette 2023), AMD MI300X documentation (AMD 2023), Google TPU publications (Jouppi et al. 2023; Jouppi et al. 2021), and Google Cloud’s current TPU v6e specification page (Google Cloud 2026). Tables progress from table 1 (Volta) and table 2 (Turing) through current and forward-looking generations.
NVIDIA V100
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (V100) | 125 | TFLOP/s |
| Peak FP32 throughput (V100) | 15.7 | TFLOP/s |
| HBM bandwidth (V100) | 900 | GB/s |
| HBM capacity (V100) | 32 | GiB |
| TDP (V100) | 300 | W |
NVIDIA T4
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (T4) | 65 | TFLOP/s |
| Peak INT8 throughput (T4) | 130 | TOPS |
| Memory bandwidth (T4) | 320 | GB/s |
| TDP (T4) | 70 | W |
NVIDIA A100
Table 3 lists the Ampere-generation specs that anchor most training examples in the book.
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (A100) | 312 | TFLOP/s |
| Peak FP32 throughput (A100) | 19.5 | TFLOP/s |
| Peak INT8 throughput (A100) | 624 | TOPS |
| Peak TF32 throughput (A100) | 156 | TFLOP/s |
| HBM bandwidth (A100) | 2039 | GB/s |
| HBM capacity (A100) | 80 | GiB |
| TDP (A100) | 400 | W |
NVIDIA H100
Table 4 lists the Hopper-generation specs, which add FP8 Tensor Cores and the Transformer Engine and drive the book’s “current generation” estimates.
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (H100) | 989 | TFLOP/s |
| Peak FP32 CUDA throughput (H100) | 67 | TFLOP/s |
| Peak FP8 tensor throughput (H100) | 1979 | TFLOP/s |
| Peak INT8 throughput (H100) | 1979 | TOPS |
| Peak TF32 throughput (H100) | 494 | TFLOP/s |
| HBM bandwidth (H100) | 3.35 | TB/s |
| HBM capacity (H100) | 80 | GiB |
| TDP (H100) | 700 | W |
NVIDIA B200
Table 5 provides Blackwell-generation specs used in forward-looking capacity-planning examples.
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (B200) | 2250 | TFLOP/s |
| Peak FP8 tensor throughput (B200) | 4500 | TFLOP/s |
| Peak INT4 throughput (B200) | 9000 | TOPS |
| HBM bandwidth (B200) | 8 | TB/s |
| HBM capacity (B200) | 192 | GiB |
| TDP (B200) | 1000 | W |
AMD Instinct MI300X
Table 6 provides specifications for the MI300X, often used as the primary alternative to NVIDIA’s H100 in large-scale inference and training clusters.
| Assumption | Value | Unit |
|---|---|---|
| Peak FP16 tensor throughput (MI300X) | 1307 | TFLOP/s |
| HBM bandwidth (MI300X) | 5.3 | TB/s |
| HBM capacity (MI300X) | 192 | GiB |
| TDP (MI300X) | 750 | W |
Google TPU v4
Table 7 provides the ASIC-based alternative used when comparing training economics across accelerator families. It lists rows for both TPU v4 and the newer TPU v6e.
| Assumption | Value | Unit |
|---|---|---|
| Peak BF16 throughput (TPU v4) | 275 | TFLOP/s |
| Memory bandwidth (TPU v4) | 1200 | GB/s |
| Peak BF16 throughput (TPU v6e) | 918 | TFLOP/s |
| Memory bandwidth (TPU v6e) | 1600 | GB/s |
CPU and mobile/edge processors
Table 8 grounds the edge and mobile ML examples, where the contrast with data center accelerator throughput illustrates why deployment target shapes every design decision.
| Assumption | Value | Unit |
|---|---|---|
| Peak FP32 throughput (reference CPU) | 1 | TFLOP/s |
| DRAM bandwidth (reference server) | 50 | GB/s |
| Peak INT8 throughput (iPhone 15 Pro NPU) | 35 | TOPS |
| Memory bandwidth (iPhone 15 Pro) | 100 | GB/s |
| TDP (mobile device, reference) | 5 | W |
| Power (edge object detector, reference) | 2 | W |
| Battery capacity (phone, reference) | 15 | Wh |
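To make the contrast with data center accelerators concrete, the reference rows above can be combined into a few ratios. This is an illustrative sketch with hypothetical names; it assumes the whole battery is available to the ML workload, which ignores the display, radios, and everything else a phone powers.

```python
# Reference values from the edge/mobile table and the H100 table.
BATTERY_WH = 15.0
DETECTOR_W = 2.0
MOBILE_TDP_W = 5.0
PHONE_NPU_INT8_TOPS = 35.0
H100_INT8_TOPS = 1979.0
H100_TDP_W = 700.0


def runtime_hours(battery_wh: float, load_w: float) -> float:
    """Hours a battery lasts under a constant load (idealized)."""
    return battery_wh / load_w


detector_hours = runtime_hours(BATTERY_WH, DETECTOR_W)
full_tdp_hours = runtime_hours(BATTERY_WH, MOBILE_TDP_W)
throughput_gap = H100_INT8_TOPS / PHONE_NPU_INT8_TOPS
power_gap = H100_TDP_W / MOBILE_TDP_W

print(f"2 W detector on a 15 Wh battery: {detector_hours:.1f} h")
print(f"5 W sustained load:              {full_tdp_hours:.1f} h")
print(f"H100 vs phone NPU, INT8 peak:    {throughput_gap:.1f}x")
print(f"H100 vs phone, power budget:     {power_gap:.0f}x")
```

Under these peak numbers the power gap is larger than the throughput gap, which is one way to see why the deployment target constrains model choice before raw speed does.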
Model Specifications
These are the reference model assumptions behind training-cost, memory-footprint, and inference-workload examples: parameter counts, per-inference FLOP budgets, and published training-scale anchors. Published model sizes follow primary papers, model reports, and official model documentation (Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020; Touvron et al. 2023; Dubey et al. 2024; He et al. 2016; Sandler et al. 2018; Ultralytics 2023). GPT-3 training FLOPs and duration follow (Brown et al. 2020). The GPT-4 parameter count and training GPU-days are public third-party MoE estimates (SemiAnalysis 2023) because the GPT-4 technical report does not disclose architecture size. When a chapter estimates GPT-3-scale training time or 7B optimizer state, it uses table 9.
| Assumption | Value | Unit |
|---|---|---|
| Inference FLOPs (BERT-Base) | 2.2e+10 | flop |
| Parameters (BERT-Base) | 1.1e+08 | param |
| Parameters (Llama 3 8B) | 8.03e+09 | param |
| Hidden dimension (GPT-2) | 1600 | - |
| Layers (GPT-2) | 48 | - |
| Parameters (GPT-2) | 1.5e+09 | param |
| Parameters (GPT-3) | 1.75e+11 | param |
| Reference training duration (GPT-3) | 25 | d |
| Reference training FLOPs (GPT-3) | 3.14e+23 | flop |
| Parameters (GPT-4, public MoE estimate) | 1.76e+12 | param |
| Reference training GPU-days (GPT-4) | 2.5e+06 | - |
| Inference FLOPs (ResNet-50) | 4.1e+09 | flop |
| Parameters (ResNet-50) | 2.56e+07 | param |
| Inference FLOPs (MobileNetV2) | 3e+08 | flop |
| Parameters (MobileNetV2) | 3.50487e+06 | param |
| Inference FLOPs (YOLOv8-Nano) | 8.7e+09 | flop |
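One way to read the GPT-3 rows above together with the A100 peak is to ask what utilization the 25-day reference duration implies on 1,024 A100s. The sketch below is an interpretation under stated assumptions, not the source of the 25-day figure: it treats FP16 tensor peak as the ceiling and assumes constant utilization for the whole run. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

GPT3_TRAIN_FLOPS = 3.14e23   # reference training FLOPs (GPT-3)
A100_PEAK_FLOPS = 312e12     # peak FP16 tensor throughput, FLOP/s
NUM_A100 = 1024              # reference cluster size
REFERENCE_DAYS = 25          # reference training duration


def implied_utilization(total_flops: float, peak_per_device: float,
                        num_devices: int, days: float) -> float:
    """Fraction of peak compute needed to finish in the given time."""
    available = peak_per_device * num_devices * days * SECONDS_PER_DAY
    return total_flops / available


def training_days(total_flops: float, peak_per_device: float,
                  num_devices: int, utilization: float) -> float:
    """Wall-clock days at a constant fraction of peak compute."""
    sustained = peak_per_device * num_devices * utilization
    return total_flops / sustained / SECONDS_PER_DAY


util = implied_utilization(GPT3_TRAIN_FLOPS, A100_PEAK_FLOPS,
                           NUM_A100, REFERENCE_DAYS)
print(f"implied utilization: {util:.1%}")
print(f"days at 100% of peak: "
      f"{training_days(GPT3_TRAIN_FLOPS, A100_PEAK_FLOPS, NUM_A100, 1.0):.1f}")
```

The implied figure of roughly 45 percent sits inside the 30 to 50 percent range that the Summary quotes for real training runs.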
Training Memory Conventions
Training-memory napkin math assumes the mixed-precision Adam storage model used in the napkin-math callout and several training chapters: BF16 weights, BF16 gradients, and FP32 master weights plus Adam first- and second-moment buffers (2 + 2 + 4 + 4 + 4 = 16 bytes per parameter). This is a book convention aligned with common mixed-precision Adam training layouts (Kingma and Ba 2014; Micikevicius et al. 2017; NVIDIA 2017), not a measured hardware constant. Table 10 lists per-component byte widths; multiplying bytes per parameter (mixed-precision Adam) by the parameter count gives the model-state footprint before activations.
| Assumption | Value | Unit |
|---|---|---|
| Weight/gradient width (BF16) | 2 | bytes |
| Master weight and optimizer state width (FP32) | 4 | bytes |
| Adam state per parameter (momentum + variance, FP32) | 8 | bytes |
| Bytes per parameter (mixed-precision Adam) | 16 | bytes |
Hardware and model assumptions fix what runs where; energy assumptions fix whether the design is thermally and economically viable at the operation and memory-access level.
Energy Constants
These energy assumptions, drawn primarily from Horowitz’s 45 nm table (Horowitz 2014), underpin the book’s efficiency and sustainability discussions. Table 11 lists the hierarchy from register through DRAM; the orders-of-magnitude gap between the two ends is why data reuse dominates kernel design.
| Assumption | Value | Unit |
|---|---|---|
| Register access energy | 0.01 | pJ |
| L1 SRAM access energy | 0.5 | pJ |
| L2 SRAM access energy | 2 | pJ |
| DRAM access energy (per access) | 640 | pJ |
| DRAM access energy (per byte) | 160 | pJ/byte |
| FP16 FLOP energy | 1.1 | pJ/FLOP |
| FP32 FLOP energy | 3.7 | pJ/FLOP |
| INT8 multiply-add energy | 0.2 | pJ/MAC |
| MobileNetV2 inference energy (reference) | 0.1 | mJ |
| 5G transfer energy per MB | 100 | mJ/MB |
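The gaps in the table above are easier to reason about as ratios. The sketch below computes three of them; the names are hypothetical, and the "break-even intensity" is only an illustrative framing (the number of FP16 FLOPs whose energy equals fetching one byte from DRAM), not a quantity defined by the source table.

```python
# Energy per access, in picojoules, from the 45 nm reference rows.
ACCESS_PJ = {"register": 0.01, "l1_sram": 0.5, "l2_sram": 2.0, "dram": 640.0}
DRAM_PJ_PER_BYTE = 160.0
FP16_PJ_PER_FLOP = 1.1
MOBILENETV2_INFERENCE_MJ = 0.1
FIVE_G_MJ_PER_MB = 100.0


def relative_cost(level: str, baseline: str = "register") -> float:
    """How many baseline accesses cost the same energy as one access at `level`."""
    return ACCESS_PJ[level] / ACCESS_PJ[baseline]


dram_vs_register = relative_cost("dram")
breakeven_flops_per_byte = DRAM_PJ_PER_BYTE / FP16_PJ_PER_FLOP
inferences_per_mb_sent = FIVE_G_MJ_PER_MB / MOBILENETV2_INFERENCE_MJ

print(f"DRAM vs register access:         {dram_vs_register:,.0f}x")
print(f"FP16 FLOPs per DRAM byte:        {breakeven_flops_per_byte:.1f}")
print(f"local inferences per MB over 5G: {inferences_per_mb_sent:,.0f}")
```

The last ratio suggests why on-device inference can be cheaper in energy than sending raw data to a server, at least under these reference values.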
Energy costs operate at the chip level, but real ML systems also move data across interconnects—between accelerators, across racks, and over wide-area networks. The next section lists the bandwidth assumptions used when chapters estimate communication overhead.
Interconnect and Network Bandwidth
These bandwidth assumptions apply when chapters reason about gradient synchronization, pipeline bubbles, checkpoint I/O, or cross–data center latency. NVLink and PCIe rates follow accelerator product documentation (NVIDIA Corporation 2017, 2020; Choquette 2023); the InfiniBand architecture specification anchors the protocol family (InfiniBand Trade Association 2000), while current high-speed product families and Ethernet roadmaps anchor modern link-rate examples (NVIDIA 2026; Ethernet Alliance 2025); the speed-of-light-in-fiber floor is a physics identity. Table 12 lists NVLink, InfiniBand, PCIe, NVMe, Ethernet, and WAN latency anchors.
| Assumption | Value | Unit |
|---|---|---|
| NVLink bandwidth (V100) | 300 | GB/s |
| NVLink bandwidth (A100) | 600 | GB/s |
| NVLink bandwidth (H100) | 900 | GB/s |
| InfiniBand HDR link bandwidth | 200 | Gb/s |
| InfiniBand NDR link bandwidth | 400 | Gb/s |
| InfiniBand XDR link bandwidth | 800 | Gb/s |
| InfiniBand GXDR link bandwidth | 1600 | Gb/s |
| PCIe Gen4 bandwidth (A100 host) | 32 | GB/s |
| PCIe Gen5 bandwidth (H100 host) | 64 | GB/s |
| NVMe sequential read bandwidth | 7 | GB/s |
| 10 GbE bandwidth | 10 | Gb/s |
| 100 GbE bandwidth | 100 | Gb/s |
| Speed of light in fiber | 200000 | km/s |
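A common use of these rows is estimating how long a fixed payload takes on each link. The sketch below moves the 112 GB of 7B model state from the napkin-math callout across several links and computes a one-way fiber latency floor. It assumes peak link rates with no protocol overhead, and the 4,000 km distance is a hypothetical example, not a figure from the book. Note the bit-to-byte conversion for InfiniBand, which the table quotes in Gb/s.

```python
BITS_PER_BYTE = 8

# Peak link rates converted to bytes per second.
LINKS_BYTES_PER_S = {
    "NVLink (H100)": 900e9,
    "PCIe Gen5": 64e9,
    "InfiniBand NDR": 400e9 / BITS_PER_BYTE,  # 400 Gb/s -> 50 GB/s
    "NVMe sequential read": 7e9,
}
FIBER_KM_PER_S = 200_000


def transfer_seconds(num_bytes: float, bytes_per_s: float) -> float:
    """Ideal transfer time at peak bandwidth, ignoring latency and overhead."""
    return num_bytes / bytes_per_s


def fiber_one_way_ms(distance_km: float) -> float:
    """Lower bound on one-way latency over fiber, in milliseconds."""
    return distance_km / FIBER_KM_PER_S * 1000


CHECKPOINT_BYTES = 112e9  # 7B model state at 16 bytes per parameter

for name, rate in LINKS_BYTES_PER_S.items():
    print(f"{name:22s} {transfer_seconds(CHECKPOINT_BYTES, rate):7.2f} s")
print(f"4,000 km of fiber: at least {fiber_one_way_ms(4000):.0f} ms one way")
```

Real transfers run slower than these floors, so the results are best read as lower bounds on time.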
Economic Constants
These pricing assumptions underpin TCO and energy-cost napkin math in table 13. They are illustrative hyperscaler-order rates for ratio analysis, similar in spirit to the carbon accounting examples in Patterson et al. (2021), not quotes for a specific region or contract. Substitute your own values when absolute price dominates.
| Assumption | Value | Unit |
|---|---|---|
| Cloud electricity price | 0.12 | dollar/kWh |
| Cloud egress price per GB | 0.09 | dollar/GB |
Economic constants set the price per unit of compute and data transfer, but they mean little without a sense of the volumes involved. Production ML systems handle millions to billions of requests per day—numbers large enough to be difficult to internalize without concrete reference points.
Scale References
These scale assumptions anchor “how big is big?” in capacity-planning examples (table 14): order-of-magnitude public disclosures (email/search volume, autonomous-driving sensor rates) plus standard 1080p and 4K video parameters. They are magnitude anchors, not audited statistics for a specific year.
| Assumption | Value | Unit |
|---|---|---|
| Gmail emails per day | 1.21e+11 | - |
| Google searches per day | 8.5e+09 | - |
| Waymo sensor data rate (low) | 1 | TB/h |
| Waymo sensor data rate (high) | 19 | TB/h |
| 1080p frame width | 1920 | - |
| 1080p frame height | 1080 | - |
| 4K frame width | 3840 | - |
| 4K frame height | 2160 | - |
| Bytes per RGB pixel | 3 | bytes |
| Video frame rate (standard) | 30 | Hz |
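The video rows above combine into a raw (uncompressed) data rate, which can then be compared against the sensor-rate anchors. This is a minimal sketch with a hypothetical function name; it assumes uncompressed RGB frames, which real pipelines rarely store.

```python
def raw_video_bytes_per_s(width: int, height: int,
                          bytes_per_pixel: int = 3, fps: int = 30) -> int:
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps


TB = 10**12
SECONDS_PER_HOUR = 3600

rate_1080p = raw_video_bytes_per_s(1920, 1080)
rate_4k = raw_video_bytes_per_s(3840, 2160)

print(f"1080p raw: {rate_1080p / 1e6:.1f} MB/s, "
      f"{rate_1080p * SECONDS_PER_HOUR / TB:.2f} TB/h")
print(f"4K raw:    {rate_4k / 1e6:.1f} MB/s, "
      f"{rate_4k * SECONDS_PER_HOUR / TB:.2f} TB/h")
```

A single raw 4K stream at 30 Hz lands at a few TB/h, inside the 1 to 19 TB/h sensor-rate range listed in the table.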
The assumptions above use the unit conventions in table 15—decimal data prefixes (KB \(= 10^3\) bytes), separate FLOPs (work) and FLOP/s (throughput), and the aliases below.
Unit Conventions
Table 15 fixes the unit conventions used in every quantitative example in this volume. Each row gives the multiplier \(k\) in 1 alias \(=\) \(k\) base. Data prefixes use decimal SI (\(\mathrm{KB} = 10^3\) bytes, not 1024). Binary IEC prefixes appear in the HBM capacity rows above (1 GiB \(= 2^{30}\) bytes) and in a few storage-specific discussions, but they are omitted from this table because most fleet-scale estimates in the book use decimal KB/GB/TB. Throughput quantities (FLOP/s, GB/s) combine these aliases with time; adding incompatible dimensions (bytes to FLOP/s) is a category error in napkin math, not a unit conversion.
| Alias | Multiplier | Base unit |
|---|---|---|
| byte | 1 | byte |
| KB | 1000 | byte |
| MB | 1e+06 | byte |
| GB | 1e+09 | byte |
| TB | 1e+12 | byte |
| PB | 1e+15 | byte |
| flop | 1 | flop |
| GFLOPs | 1e+09 | flop |
| TFLOPs | 1e+12 | flop |
| ZFLOPs | 1e+21 | flop |
| param | 1 | param |
| Mparam | 1e+06 | param |
| Gbps | 1e+09 | bit/s |
| NS | \(10^{-9}\) | second |
| US | \(10^{-6}\) | second |
| MS | \(10^{-3}\) | second |
| second | 1 | second |
| hour | 3600 | second |
| day | 86400 | second |
| joule | 1 | joule |
| watt | 1 | watt |
| meter | 1 | meter |
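The alias table lends itself to a small lookup-based converter that also rejects conversions across dimensions. The sketch below is a hypothetical helper written for illustration, covering only a subset of the rows; it is not the mlsysim API.

```python
# alias -> (multiplier k, base unit), following "1 alias = k base".
UNITS = {
    "byte": (1, "byte"),
    "KB": (10**3, "byte"),
    "MB": (10**6, "byte"),
    "GB": (10**9, "byte"),
    "TB": (10**12, "byte"),
    "flop": (1, "flop"),
    "TFLOPs": (10**12, "flop"),
    "ZFLOPs": (10**21, "flop"),
    "second": (1, "second"),
    "hour": (3600, "second"),
    "day": (86400, "second"),
}

GIB = 2**30  # binary prefix, kept outside the decimal table on purpose


def convert(value: float, src: str, dst: str) -> float:
    """Convert between aliases that share a base unit; refuse anything else."""
    k_src, base_src = UNITS[src]
    k_dst, base_dst = UNITS[dst]
    if base_src != base_dst:
        raise ValueError(f"cannot convert {base_src} to {base_dst}")
    return value * k_src / k_dst


print(f"80 GiB  = {convert(80 * GIB, 'byte', 'GB'):.1f} GB")
print(f"25 days = {convert(25, 'day', 'second'):,.0f} s")
print(f"3.14e23 FLOPs = {convert(3.14e23, 'flop', 'ZFLOPs'):.0f} ZFLOPs")
```

Raising an error on mismatched base units mirrors the point above that mixing bytes with FLOPs is a category error rather than a conversion.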
With all constants, units, and scale references in place, the next section catalogs the provenance of each assumption so readers can trace the numbers back to their primary sources.
Assumption Provenance
The catalog below summarizes where each section’s numbers come from. The book cites sources with Quarto @citekey references in captions and in this table. The mlsysim package instead stores structured Provenance on Sourced registry scalars, along with metadata for audits and labs, and carries no BibTeX keys. See table 16 for the section-to-reference map.
| Appendix section | Source type | Primary references |
|---|---|---|
| Accelerator specifications | Vendor datasheet peaks (2026-Q1) | (NVIDIA Corporation 2017, 2018, 2020, 2024; Choquette et al. 2021; Choquette 2023; AMD 2023; Jouppi et al. 2023; Jouppi et al. 2021; Google Cloud 2026) |
| Model specifications | Published papers, model reports, official docs; GPT-4 size from public analysis | (Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020; Touvron et al. 2023; Dubey et al. 2024; He et al. 2016; Sandler et al. 2018; Ultralytics 2023; OpenAI et al. 2023; SemiAnalysis 2023) |
| Training memory conventions | Book convention (mixed-precision Adam layout) | (Kingma and Ba 2014; Micikevicius et al. 2017; NVIDIA 2017) |
| Energy constants | Published 45 nm energy table | (Horowitz 2014) |
| Interconnect bandwidth | Vendor specs; InfiniBand standard; physics | (NVIDIA Corporation 2017, 2020; Choquette 2023; InfiniBand Trade Association 2000; NVIDIA 2026; Ethernet Alliance 2025) |
| Economic assumptions | Illustrative cloud/utility rates | (Patterson et al. 2021) (methodology context) |
| Scale references | Order-of-magnitude public anchors | Editorial magnitude anchors |
| Unit conventions | Decimal SI; book notation | Editorial |
Summary
This appendix is the book’s shared assumption sheet: every napkin-math estimate in the volume should be traceable to a row in the tables above—and, where it matters, to section 1.1 and table 16.
Key Takeaways: Using the assumption tables
- One place for all shared numbers: Hardware peaks, model sizes, link bandwidths, energy per access, cloud prices, and scale anchors used in worked examples are listed here so you can audit or replace them without re-deriving from prose.
- Hardware rows are ceilings: Peak FLOP/s, HBM bandwidth, and TDP are datasheet maxima. Real training often reaches 30 to 50 percent of peak compute utilization; using peak values without discounting yields optimistic timelines.
- Ratios often outlast absolutes: Ridge point (FLOP/s ÷ bandwidth), mixed-precision Adam bytes per parameter (16), and the register-to-DRAM energy gap are more portable across generations than any single SKU’s peak TFLOP/s.
- Substitute your own assumptions: When your deployment differs—different GPU, region, or utilization—swap the Value column and rerun the same formulas the chapters use.
- Check provenance before debating a number: Datasheet peaks, published studies, illustrative rates, and book conventions are sourced differently; table 16 and section captions say which is which.