System Assumptions
Purpose
What assumptions sit underneath the “napkin math” throughout the book?
Every quantitative example in this book—from training-time estimates to energy-per-inference calculations—rests on a shared set of physical constants, hardware specifications, and economic assumptions. Rather than scatter these values across chapters (where they would inevitably diverge), we define them once in the book’s source code (mlsysim/core/constants.py) and import them wherever a calculation needs them. This appendix exposes the main constants used in the volume’s worked examples so the numbers that underpin the book’s reasoning are open to audit, verification, and update.
How to Use This Appendix
This appendix serves as a reference for verifying a napkin-math calculation, swapping in an alternative assumption (for example, a different memory bandwidth or electricity price), or tracing which constants a particular estimate depends on. Each constant name matches its Python identifier in mlsysim/core/constants.py, so searching for any name in the book’s source surfaces every chapter that uses it.
The reference tables are organized into thematic groups—accelerator specifications, model parameters, energy constants, interconnect bandwidths, and so on—so related assumptions surface quickly without scanning an alphabetical list. Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).
Because hardware generations change faster than textbooks, these tables also serve as a change log of sorts: updating a single value in mlsysim/core/constants.py propagates the change to every calculation in every chapter automatically. A few worked estimates show how these constants combine for quick back-of-the-envelope calculations.
Napkin Math 1.1: Napkin math with these constants
Is this workload compute bound or memory bound? Divide peak FLOPS by memory bandwidth to get the ridge point of the roofline model. For an H100 at its FP16 Tensor Core peak: 989 TFLOPS / 3.35 TB/s \(\approx\) 295 FLOP/byte. Any operation with arithmetic intensity above 295 FLOP/byte is compute bound; below it, memory bound. A large general matrix multiply (GEMM) with \(n=\) 4,096 and 2-byte operands has intensity \(n/3 \approx\) 1,365 FLOP/byte (compute bound). A single-token autoregressive decode has intensity \(\approx 1\) FLOP/byte (deeply memory bound).
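The ridge-point check can be sketched in a few lines of Python. This is a minimal sketch, not the book's own code: the two constants are copied by hand from the H100 table in this appendix rather than imported from mlsysim, and the helper names (`ridge_point`, `gemm_intensity`, `classify`) are hypothetical. The GEMM estimate assumes 2-byte operands and that each of the three matrices moves through memory exactly once.

```python
# Values copied from the H100 table in this appendix (datasheet peaks).
H100_FLOPS_FP16_TENSOR = 989e12  # FLOP/s
H100_MEM_BW = 3.35e12            # bytes/s


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory time balance."""
    return peak_flops / mem_bw


def gemm_intensity(n: int, bytes_per_element: int = 2) -> float:
    """Intensity of an n x n x n GEMM, assuming each matrix moves once.

    FLOPs = 2 * n^3 (one multiply and one add per inner-product term);
    traffic = 3 matrices of n^2 elements each.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved


def classify(intensity: float, ridge: float) -> str:
    """Label a workload by which side of the ridge point it falls on."""
    return "compute bound" if intensity > ridge else "memory bound"


ridge = ridge_point(H100_FLOPS_FP16_TENSOR, H100_MEM_BW)
print(f"ridge point: {ridge:.0f} FLOP/byte")
print("GEMM n=4096:", classify(gemm_intensity(4096), ridge))
print("single-token decode:", classify(1.0, ridge))
```

Swapping in another accelerator's peak and bandwidth from the tables below moves the ridge point, which is often enough to flip the classification of a mid-intensity kernel.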
How much memory does training a 7B model require? Mixed-precision Adaptive Moment Estimation (Adam) stores 2 bytes (BF16 weights) + 2 bytes (gradients) + 12 bytes (FP32 master weights + momentum + variance) = 16 bytes per parameter. For 7B parameters: 7 \(\times 10^9 \times\) 16 bytes = 112 GB. An H100 has 80 GB of HBM, so the model state alone exceeds a single accelerator—before accounting for activations.
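The same bookkeeping can be written as a short Python sketch. The per-parameter byte counts come straight from the paragraph above; the function name `training_state_gb` is hypothetical, and the 80 GB capacity follows the figure quoted in the text. Activations, temporary buffers, and fragmentation are deliberately left out.

```python
import math

# Bytes per parameter for mixed-precision Adam, as itemized above.
BYTES_BF16_WEIGHTS = 2
BYTES_GRADIENTS = 2
BYTES_FP32_STATE = 12  # FP32 master weights + momentum + variance, 4 bytes each
BYTES_PER_PARAM = BYTES_BF16_WEIGHTS + BYTES_GRADIENTS + BYTES_FP32_STATE

H100_MEM_CAPACITY_GB = 80  # capacity figure used in the text


def training_state_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Model state in GB (10^9 bytes), excluding activations."""
    return num_params * bytes_per_param / 1e9


state_gb = training_state_gb(7_000_000_000)
min_accelerators = math.ceil(state_gb / H100_MEM_CAPACITY_GB)

print(f"model state: {state_gb:.0f} GB")
print(f"accelerators needed for state alone: {min_accelerators}")
```

The accelerator count is a lower bound: it says only that the state cannot fit on one device, not that two devices are sufficient once activations are included.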
How much energy would a GPT-3-scale training run cost on the reference cluster? An A100 draws 400 W at TDP. Using 1,024 A100s as a normalization for a run requiring roughly \(3.14 \times 10^{23}\) FLOPs and the imported reference duration of ~25 wall-clock days gives an A100-equivalent energy estimate: 25 days \(\times\) 24 h/day \(\times\) 1,024 A100s \(\times\) 0.4 kW \(=\) 245,760 kWh. At $0.12/kWh, that is 245,760 kWh \(\times\) $0.12/kWh \(\approx\) USD 29,491 in electricity alone, a small fraction of the total cost, which is dominated by accelerator amortization. The original GPT-3 run used V100-era infrastructure; this calculation is an A100-equivalent reference estimate, not a derivation of the run duration from peak FLOPS alone.
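A sketch of the electricity estimate, with the constants copied from this appendix's tables rather than imported from mlsysim. The function names are hypothetical. As the pitfalls section notes, running every accelerator at TDP for the whole run is itself an assumption, and cooling and host power are ignored here.

```python
# Values copied from the tables in this appendix.
A100_TDP_KW = 0.4                  # 400 W
CLOUD_ELECTRICITY_PER_KWH = 0.12   # USD per kWh
GPT3_TRAINING_DAYS_REF = 25        # wall-clock days (reference value)
NUM_ACCELERATORS = 1024            # normalization used in the text


def cluster_energy_kwh(days: float, num_accelerators: int, power_kw: float) -> float:
    """Energy drawn by the accelerators alone, assuming constant power."""
    return days * 24 * num_accelerators * power_kw


def electricity_cost_usd(energy_kwh: float, price_per_kwh: float) -> float:
    """Electricity bill for a given energy at a flat price."""
    return energy_kwh * price_per_kwh


energy_kwh = cluster_energy_kwh(GPT3_TRAINING_DAYS_REF, NUM_ACCELERATORS, A100_TDP_KW)
cost_usd = electricity_cost_usd(energy_kwh, CLOUD_ELECTRICITY_PER_KWH)

print(f"energy: {energy_kwh:,.0f} kWh")
print(f"electricity: USD {cost_usd:,.0f}")
```

Changing `CLOUD_ELECTRICITY_PER_KWH` or the power figure is the quickest way to see how sensitive the estimate is to regional prices or to sub-TDP operation.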
Accelerator Specifications
The tables in this section capture the peak compute throughput, memory bandwidth, memory capacity, and thermal design power (TDP) for each accelerator generation referenced in the book. These are datasheet numbers—actual workloads rarely sustain peak rates—but they set the ceiling that roofline analysis and utilization calculations measure against. The accelerator tables progress chronologically from Table 1 (Volta) and Table 2 (Turing) through more recent generations.
NVIDIA V100
| Constant | Value | Unit |
|---|---|---|
| V100_FLOPS_FP16_TENSOR | 125 | TFLOPs/s |
| V100_FLOPS_FP32 | 15.7 | TFLOPs/s |
| V100_MEM_BW | 900 | GB/s |
| V100_MEM_CAPACITY | 32 | GiB |
| V100_TDP | 300 | W |
NVIDIA T4
| Constant | Value | Unit |
|---|---|---|
| T4_FLOPS_FP16_TENSOR | 65 | TFLOPs/s |
| T4_FLOPS_INT8 | 130 | TFLOPs/s |
| T4_MEM_BW | 320 | GB/s |
| T4_TDP | 70 | W |
NVIDIA A100
Table 3 lists the Ampere-generation specs that anchor most training examples in the book.
| Constant | Value | Unit |
|---|---|---|
| A100_FLOPS_FP16_TENSOR | 312 | TFLOPs/s |
| A100_FLOPS_FP32 | 19.5 | TFLOPs/s |
| A100_FLOPS_INT8 | 624 | TFLOPs/s |
| A100_FLOPS_TF32 | 156 | TFLOPs/s |
| A100_MEM_BW | 2039 | GB/s |
| A100_MEM_CAPACITY | 80 | GiB |
| A100_TDP | 400 | W |
NVIDIA H100
Table 4 adds Hopper-generation FP8 Tensor Cores and the Transformer Engine, driving “current generation” estimates.
| Constant | Value | Unit |
|---|---|---|
| H100_FLOPS_FP16_TENSOR | 989 | TFLOPs/s |
| H100_FLOPS_FP32_CUDA | 67 | TFLOPs/s |
| H100_FLOPS_FP8_TENSOR | 1979 | TFLOPs/s |
| H100_FLOPS_INT8 | 1979 | TFLOPs/s |
| H100_FLOPS_TF32 | 494 | TFLOPs/s |
| H100_MEM_BW | 3.35 | TB/s |
| H100_MEM_CAPACITY | 80 | GiB |
| H100_TDP | 700 | W |
NVIDIA B200
Table 5 provides Blackwell-generation specs used in forward-looking capacity-planning examples.
| Constant | Value | Unit |
|---|---|---|
| B200_FLOPS_FP16_TENSOR | 2250 | TFLOPs/s |
| B200_FLOPS_FP8_TENSOR | 4500 | TFLOPs/s |
| B200_FLOPS_INT4 | 9000 | TFLOPs/s |
| B200_MEM_BW | 8 | TB/s |
| B200_MEM_CAPACITY | 192 | GiB |
| B200_TDP | 1000 | W |
AMD Instinct MI300X
Table 6 provides specifications for the MI300X, often used as the primary alternative to NVIDIA’s H100 in large-scale inference and training clusters.
| Constant | Value | Unit |
|---|---|---|
| MI300X_FLOPS_FP16_TENSOR | 1307 | TFLOPs/s |
| MI300X_MEM_BW | 5.3 | TB/s |
| MI300X_MEM_CAPACITY | 192 | GiB |
| MI300X_TDP | 750 | W |
Google TPU v4 and v6
Table 7 provides the ASIC-based alternative used when comparing training economics across accelerator families.
| Constant | Value | Unit |
|---|---|---|
| TPUV4_FLOPS_BF16 | 275 | TFLOPs/s |
| TPUV4_MEM_BW | 1200 | GB/s |
| TPUV6_FLOPS_BF16 | 918 | TFLOPs/s |
| TPUV6_MEM_BW | 1600 | GB/s |
CPU and mobile/edge processors
Table 8 grounds the edge and mobile ML examples, where the contrast with data center accelerator throughput illustrates why deployment target shapes every design decision.
| Constant | Value | Unit |
|---|---|---|
| CPU_FLOPS_FP32 | 1 | TFLOPs/s |
| SYSTEM_MEMORY_BW | 50 | GB/s |
| MOBILE_NPU_TOPS_INT8 | 50 | TFLOPs/s |
| MOBILE_NPU_MEM_BW | 100 | GB/s |
| MOBILE_TDP_W | 3 | W |
| OBJECT_DETECTOR_POWER_W | 2 | W |
| PHONE_BATTERY_WH | 15 | Wh |
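To make the edge constraints concrete, the following sketch turns the battery and power constants into rough runtime budgets. It is illustrative only: the values are copied from this appendix (including `ENERGY_MOBILENET_INF_MJ` from the energy table), the function names are hypothetical, and the calculation assumes the battery powers nothing but the model, which no real phone does.

```python
# Values copied from the tables in this appendix.
PHONE_BATTERY_WH = 15          # Wh
OBJECT_DETECTOR_POWER_W = 2    # W
ENERGY_MOBILENET_INF_MJ = 0.1  # mJ per inference
MOBILE_TDP_W = 3               # W
H100_TDP_W = 700               # W

JOULES_PER_WH = 3600


def battery_hours(battery_wh: float, load_w: float) -> float:
    """Hours of continuous operation at a constant load."""
    return battery_wh / load_w


def inferences_per_charge(battery_wh: float, energy_per_inference_mj: float) -> float:
    """Upper bound on inferences if the whole battery went to inference."""
    battery_j = battery_wh * JOULES_PER_WH
    return battery_j / (energy_per_inference_mj * 1e-3)


hours = battery_hours(PHONE_BATTERY_WH, OBJECT_DETECTOR_POWER_W)
inferences = inferences_per_charge(PHONE_BATTERY_WH, ENERGY_MOBILENET_INF_MJ)

print(f"continuous object detection: {hours:.1f} h per charge")
print(f"MobileNet inferences per charge (upper bound): {inferences:.2e}")
print(f"H100 TDP / mobile TDP: {H100_TDP_W / MOBILE_TDP_W:.0f}x")
```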
Model Specifications
With hardware specifications established, the next question is what workloads run on this hardware. These constants define the parameter counts, per-inference FLOP budgets, and, where applicable, training costs for the reference models used throughout the book. Table 9 captures the model architectures whose values appear in later calculations; when a chapter estimates “how long to train GPT-3,” these are the numbers it plugs in.
| Constant | Value | Unit |
|---|---|---|
| BERT_BASE_FLOPs | 2.2e+10 | flop |
| BERT_BASE_PARAMS | 1.1e+08 | param |
| LLAMA3_8B_PARAMS | 8.03e+09 | param |
| GPT2_HIDDEN_DIM | 1600 | - |
| GPT2_LAYERS | 48 | - |
| GPT2_PARAMS | 1.5e+09 | param |
| GPT3_PARAMS | 1.75e+11 | param |
| GPT3_TRAINING_DAYS_REF | 25 | d |
| GPT3_TRAINING_OPS | 3.14e+23 | flop |
| GPT4_TRAINING_GPU_DAYS | 2.5e+06 | - |
| RESNET50_FLOPs | 4.1e+09 | flop |
| RESNET50_PARAMS | 2.56e+07 | param |
| MOBILENETV2_FLOPs | 3e+08 | flop |
| MOBILENETV2_PARAMS | 3.5e+06 | param |
| YOLOV8_NANO_FLOPs | 8.7e+09 | flop |
Knowing what hardware exists and what models run on it is necessary but not sufficient. The energy cost of computation—measured at the level of individual operations and memory accesses—determines whether a design is thermally viable and economically sustainable.
Energy Constants
Energy-per-operation and energy-per-access constants are illustrative reference values, primarily from Horowitz’s 45 nm energy table (Horowitz 2014), and are used in the book’s energy-efficiency and sustainability analyses. Table 10 quantifies this hierarchy from register access through DRAM, illustrating why the memory wall is fundamentally an energy wall.
| Constant | Value | Unit |
|---|---|---|
| ENERGY_REG_PJ | 0.01 | pJ |
| ENERGY_SRAM_L1_PJ | 0.5 | pJ |
| ENERGY_SRAM_L2_PJ | 2 | pJ |
| ENERGY_DRAM_ACCESS_PJ | 640 | pJ |
| ENERGY_DRAM_PJ_PER_BYTE | 160 | pJ/B |
| ENERGY_FLOP_PJ | 4.6 | pJ/flop |
| ENERGY_FLOP_FP16_PJ | 1.1 | pJ/flop |
| ENERGY_FLOP_FP32_PJ | 3.7 | pJ/flop |
| ENERGY_FLOP_INT8_PJ | 0.2 | pJ/flop |
| ENERGY_MOBILENET_INF_MJ | 0.1 | mJ |
| NETWORK_5G_ENERGY_PER_MB_MJ | 100 | mJ/MB |
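The claim that the memory wall is an energy wall can be checked with a rough split of energy between arithmetic and DRAM traffic. This is a sketch under strong assumptions: every byte is served from DRAM with no cache reuse, the constants are the illustrative values from the table above, and `energy_split_pj` is a hypothetical helper rather than part of mlsysim.

```python
# Values copied from the energy table in this appendix (illustrative, 45 nm era).
ENERGY_REG_PJ = 0.01             # pJ per register access
ENERGY_DRAM_ACCESS_PJ = 640      # pJ per DRAM access
ENERGY_DRAM_PJ_PER_BYTE = 160    # pJ per byte
ENERGY_FLOP_FP16_PJ = 1.1        # pJ per FP16 FLOP


def energy_split_pj(flops: float, dram_bytes: float) -> tuple[float, float]:
    """Return (compute energy, DRAM energy) in pJ, assuming no cache reuse."""
    compute_pj = flops * ENERGY_FLOP_FP16_PJ
    dram_pj = dram_bytes * ENERGY_DRAM_PJ_PER_BYTE
    return compute_pj, dram_pj


# A decode-like operation at ~1 FLOP/byte: as many bytes moved as FLOPs done.
compute_pj, dram_pj = energy_split_pj(flops=1e9, dram_bytes=1e9)
dram_fraction = dram_pj / (compute_pj + dram_pj)

hierarchy_ratio = ENERGY_DRAM_ACCESS_PJ / ENERGY_REG_PJ

print(f"DRAM share of energy at 1 FLOP/byte: {dram_fraction:.1%}")
print(f"DRAM access vs register access: {hierarchy_ratio:,.0f}x")
```

Under these assumptions almost all of the energy in a low-intensity operation goes to moving data, which is why raising arithmetic intensity tends to help energy as well as throughput.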
Energy costs operate at the chip level, but real ML systems also move data across interconnects—between accelerators, across racks, and over wide-area networks. The next table captures the bandwidth assumptions that determine communication overhead in distributed training and serving.
Interconnect and Network Bandwidth
Distributed training and multi-accelerator serving are bottlenecked by interconnect bandwidth as often as by compute. Table 11 captures the bandwidths for accelerator-to-accelerator links (NVLink), cross-node fabrics (InfiniBand), host buses (PCIe), Non-Volatile Memory Express (NVMe) storage, data center Ethernet, and the speed-of-light floor that sets the minimum latency for any network hop.
The bandwidth hierarchy—from NVLink within a node down to 10 GbE across a data center—shapes every distributed system design decision and provides the communication baseline for the cost assumptions that follow.
| Constant | Value | Unit |
|---|---|---|
| NVLINK_V100_BW | 300 | GB/s |
| NVLINK_A100_BW | 600 | GB/s |
| NVLINK_H100_BW | 900 | GB/s |
| INFINIBAND_HDR_BW | 200 | Gbps |
| INFINIBAND_NDR_BW | 400 | Gbps |
| INFINIBAND_XDR_BW | 800 | Gbps |
| INFINIBAND_GXDR_BW | 1600 | Gbps |
| PCIE_GEN4_BW | 32 | GB/s |
| PCIE_GEN5_BW | 64 | GB/s |
| NVME_SEQUENTIAL_BW | 7 | GB/s |
| NETWORK_10G_BW | 10 | Gbps |
| NETWORK_100G_BW | 100 | Gbps |
| SPEED_OF_LIGHT_FIBER_KM_S | 200000 | km/s |
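A quick way to feel the bandwidth hierarchy is to time the same payload over several links. The sketch below is an idealized estimate: it copies a few values from the table above, converts Gbps to GB/s by dividing by 8, and ignores protocol overhead, latency, and contention. The 14 GB payload is an assumed example (7 billion parameters at 2 bytes each), and the helper names are hypothetical.

```python
# Values copied from the interconnect table in this appendix.
NVLINK_H100_BW_GB_S = 900
PCIE_GEN5_BW_GB_S = 64
INFINIBAND_NDR_BW_GBPS = 400
NETWORK_10G_BW_GBPS = 10
SPEED_OF_LIGHT_FIBER_KM_S = 200_000


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def transfer_time_s(size_gb: float, bw_gb_per_s: float) -> float:
    """Ideal transfer time: payload divided by peak bandwidth."""
    return size_gb / bw_gb_per_s


def fiber_latency_ms(distance_km: float) -> float:
    """One-way propagation delay through fiber, the floor for any hop."""
    return distance_km / SPEED_OF_LIGHT_FIBER_KM_S * 1e3


PAYLOAD_GB = 14  # assumed example: 7e9 parameters * 2 bytes (BF16)

links_gb_per_s = {
    "NVLink (H100)": NVLINK_H100_BW_GB_S,
    "PCIe Gen5": PCIE_GEN5_BW_GB_S,
    "InfiniBand NDR": gbps_to_gb_per_s(INFINIBAND_NDR_BW_GBPS),
    "10 GbE": gbps_to_gb_per_s(NETWORK_10G_BW_GBPS),
}

for name, bw in links_gb_per_s.items():
    print(f"{name:>15}: {transfer_time_s(PAYLOAD_GB, bw):8.3f} s")

print(f"1,000 km of fiber, one way: {fiber_latency_ms(1000):.1f} ms")
```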
Economic Constants
Cost estimates throughout the book depend on the electricity and cloud pricing assumptions in Table 12. These are order-of-magnitude reference values; actual prices vary by region, provider, and contract terms, but the ratios between them are more stable than the absolute numbers.
| Constant | Value | Unit |
|---|---|---|
| CLOUD_ELECTRICITY_PER_KWH | 0.12 | dollar/kWh |
| CLOUD_EGRESS_PER_GB | 0.09 | dollar/GB |
Economic constants set the price per unit of compute and data transfer, but they mean little without a sense of the volumes involved. Production ML systems handle millions to billions of requests per day—numbers large enough to be difficult to internalize without concrete reference points.
Scale References
When reasoning about production ML systems, it helps to have concrete scale anchors for “how much traffic does a large service actually handle?” Table 13 provides order-of-magnitude reference points drawn from public disclosures.
| Constant | Value | Unit |
|---|---|---|
| GMAIL_EMAILS_PER_DAY | 1.21e+11 | - |
| GOOGLE_SEARCHES_PER_DAY | 8.5e+09 | - |
| WAYMO_DATA_PER_HOUR_LOW | 1 | TB/h |
| WAYMO_DATA_PER_HOUR_HIGH | 19 | TB/h |
| VIDEO_1080P_WIDTH | 1920 | - |
| VIDEO_1080P_HEIGHT | 1080 | - |
| VIDEO_BYTES_PER_PIXEL_RGB | 3 | B |
| VIDEO_FPS_STANDARD | 30 | Hz |
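Daily totals are easier to reason about once converted to per-second rates. The sketch below does that for two of the anchors above, using values copied from the table. The variable names are hypothetical, the query rate is a flat daily average that ignores peaks, and the video figure is for uncompressed RGB frames.

```python
# Values copied from the scale-reference table in this appendix.
GOOGLE_SEARCHES_PER_DAY = 8.5e9
VIDEO_1080P_WIDTH = 1920
VIDEO_1080P_HEIGHT = 1080
VIDEO_BYTES_PER_PIXEL_RGB = 3
VIDEO_FPS_STANDARD = 30

SECONDS_PER_DAY = 24 * 60 * 60

# Average queries per second, assuming traffic is spread evenly over the day.
avg_qps = GOOGLE_SEARCHES_PER_DAY / SECONDS_PER_DAY

# Raw (uncompressed) size of one frame and of one second of video.
frame_bytes = VIDEO_1080P_WIDTH * VIDEO_1080P_HEIGHT * VIDEO_BYTES_PER_PIXEL_RGB
stream_bytes_per_s = frame_bytes * VIDEO_FPS_STANDARD

print(f"average search rate: {avg_qps:,.0f} queries/s")
print(f"one 1080p RGB frame: {frame_bytes / 1e6:.2f} MB")
print(f"raw 1080p stream: {stream_bytes_per_s / 1e6:.1f} MB/s")
```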
All the preceding constants—hardware specs, model parameters, energy costs, economic rates, and scale references—are expressed in specific units. For these values to combine correctly in calculations, every quantity must carry its units explicitly.
Unit Definitions
The constants file defines the base and derived units listed in Table 14 so that all dimensional analysis in the book uses a consistent unit system via the pint dimensional-analysis library. These are not assumptions per se, but they ensure that every computed value carries its units, so mixing incompatible quantities (for example, adding bytes to FLOP/s) raises an immediate error rather than producing a silent wrong answer.
| Constant | Value | Unit |
|---|---|---|
| byte | byte | - |
| KB | KB | - |
| MB | MB | - |
| GB | GB | - |
| TB | TB | - |
| PB | PB | - |
| flop | flop | - |
| GFLOPs | GFLOPs | - |
| TFLOPs | TFLOPs | - |
| ZFLOPs | ZFLOPs | - |
| param | param | - |
| Mparam | Mparam | - |
| Gbps | Gbps | - |
| NS | NS | - |
| US | US | - |
| MS | MS | - |
| second | second | - |
| hour | hour | - |
| day | day | - |
| joule | joule | - |
| watt | watt | - |
| meter | meter | - |
| USD | dollar | - |
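The benefit of unit-carrying values can be shown with a deliberately tiny stand-in. The `Quantity` class below is a toy written for this illustration using only the standard library; it is not pint's API and does not reflect how mlsysim defines its units. It shows only the core idea that adding quantities with different units should fail loudly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number tagged with a unit name (toy stand-in, not pint's API)."""

    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Addition only makes sense when both sides share a unit.
        if self.unit != other.unit:
            raise TypeError(f"cannot add {other.unit} to {self.unit}")
        return Quantity(self.value + other.value, self.unit)


weights = Quantity(14e9, "byte")
gradients = Quantity(14e9, "byte")
peak = Quantity(989e12, "flop/s")

total = weights + gradients  # same unit, so the sum is well defined
print(f"weights + gradients = {total.value / 1e9:.0f} GB")

try:
    weights + peak  # bytes plus FLOP/s has no meaning
    mixed_units_rejected = False
except TypeError as err:
    mixed_units_rejected = True
    print(f"rejected: {err}")
```

A real dimensional-analysis library goes much further, tracking dimensions through multiplication and division and converting between compatible units, but the failure mode it guards against is the one shown here.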
With all constants, units, and scale references in place, the next concern is avoiding the most common mistakes practitioners make when applying these numbers to real-world estimates.
Fallacies and Pitfalls
Fallacy: Peak FLOPS predict real-world training throughput.
Datasheet FLOPS are measured under idealized conditions—perfectly aligned matrix dimensions, 100 percent occupancy, zero memory stalls. Real training workloads typically achieve 30–50 percent of peak (measured as Model FLOPS Utilization, or MFU). Using peak FLOPS to estimate training time without an MFU discount produces estimates that are 2–3.3\(\times\) too optimistic, leading to missed deadlines and budget overruns.
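The size of the MFU discount is easy to see in a sketch. The constants below are copied from the model and A100 tables in this appendix; the 1,024-accelerator cluster follows the normalization used earlier, and `training_days` is a hypothetical helper. The MFU values are the endpoints of the 30 to 50 percent range quoted above, not measurements.

```python
# Values copied from the tables in this appendix.
GPT3_TRAINING_OPS = 3.14e23        # total training FLOPs
A100_FLOPS_FP16_TENSOR = 312e12    # peak FLOP/s per accelerator
NUM_ACCELERATORS = 1024            # normalization used earlier in the appendix

SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, num_accelerators: int,
                  peak_flops: float, mfu: float = 1.0) -> float:
    """Wall-clock days, given the fraction of peak actually sustained (MFU)."""
    sustained_flops = num_accelerators * peak_flops * mfu
    return total_flops / sustained_flops / SECONDS_PER_DAY


ideal_days = training_days(GPT3_TRAINING_OPS, NUM_ACCELERATORS, A100_FLOPS_FP16_TENSOR)
print(f"at peak (MFU 100%): {ideal_days:.1f} days")

for mfu in (0.5, 0.3):
    days = training_days(GPT3_TRAINING_OPS, NUM_ACCELERATORS,
                         A100_FLOPS_FP16_TENSOR, mfu)
    print(f"MFU {mfu:.0%}: {days:.1f} days ({days / ideal_days:.1f}x the peak estimate)")
```

Because the estimate scales as 1/MFU, the uncertainty in MFU usually matters more than the uncertainty in any datasheet number.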
Pitfall: Using FP32 FLOPS when the workload runs in BF16 or FP8.
Modern accelerators have separate datapaths for different precisions, and the peak throughput varies dramatically: the H100 delivers 989 TFLOPS in FP16 tensor operations but only 67 TFLOPS in FP32 CUDA-core operations, about a 15\(\times\) difference. Quoting the wrong precision’s peak when computing utilization or estimating training time produces meaningless results. Always match the constant to the precision the workload actually uses.
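A small sketch shows how the wrong denominator distorts a utilization figure. The two peaks are copied from the H100 table in this appendix; the achieved throughput of 400 TFLOPS is a made-up measurement chosen purely for illustration, and `utilization` is a hypothetical helper.

```python
# Peaks copied from the H100 table in this appendix.
H100_FLOPS_FP16_TENSOR = 989e12  # FLOP/s
H100_FLOPS_FP32_CUDA = 67e12     # FLOP/s

# Hypothetical measurement for a BF16/FP16 training job (not real data).
ACHIEVED_FLOPS = 400e12


def utilization(achieved_flops: float, peak_flops: float) -> float:
    """Fraction of the chosen peak that the workload sustains."""
    return achieved_flops / peak_flops


right = utilization(ACHIEVED_FLOPS, H100_FLOPS_FP16_TENSOR)
wrong = utilization(ACHIEVED_FLOPS, H100_FLOPS_FP32_CUDA)
peak_ratio = H100_FLOPS_FP16_TENSOR / H100_FLOPS_FP32_CUDA

print(f"against the FP16 tensor peak: {right:.0%}")
print(f"against the FP32 CUDA-core peak: {wrong:.0%} (impossible, wrong denominator)")
print(f"ratio between the two peaks: {peak_ratio:.1f}x")
```

A utilization above 100 percent is an obvious red flag; the more dangerous case is the reverse mistake, where dividing an FP32 workload's throughput by the tensor-core peak yields a plausible-looking but far too low number.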
Fallacy: Hardware constants are stable enough to hardcode.
Accelerator specifications, cloud pricing, and energy costs change with every hardware generation and contract renegotiation. Hardcoding “the A100 has 2 TB/s bandwidth” in a calculation means the estimate silently rots as hardware evolves. This is precisely why the book uses mlsysim/core/constants.py—updating a single value propagates the correction everywhere.
Pitfall: Treating TDP as actual power consumption.
Thermal Design Power (TDP) is the maximum sustained power draw the cooling system must handle, not the power the accelerator actually consumes under a given workload. Real power consumption varies with utilization, memory access patterns, and clock frequency. Using TDP for energy calculations overestimates costs for inference workloads (which rarely sustain peak power) and may underestimate costs for sustained training workloads on newer hardware with dynamic boost.
Summary
Taken together, the reference values above turn the appendix from a catalogue into a set of checks for quantitative reasoning.
Key Takeaways: Auditing the book’s constants
- Every example traces to a constant: Every quantitative example in this book traces back to a specific constant in mlsysim/core/constants.py. This appendix exposes the main constants used in the worked examples so the numbers that underpin the book’s reasoning are open to audit, verification, and update.
- Hardware specs are ceilings: Hardware specs (peak FLOPS, memory bandwidth, TDP) set ceilings, not guarantees. Real utilization is typically 30–50 percent of peak for training workloads; using peak values without discounting produces dangerously optimistic estimates.
- Ratios are durable: The ratios between constants are often more stable and informative than the absolute values. The ridge point (FLOPS/bandwidth), the memory-per-parameter cost (16 bytes for mixed-precision Adam), and the order-of-magnitude energy hierarchy (tens of thousands of times between a register access and a DRAM access) persist across hardware generations.
- A single source of truth prevents drift: A single source of truth for constants eliminates the most common source of inconsistency in quantitative textbooks: the same number quoted differently in different chapters.