System Assumptions
Purpose
What assumptions sit underneath the “napkin math” throughout the book?
Every quantitative example in this book—from training-time estimates to energy-per-inference calculations—rests on a shared set of physical constants, hardware specifications, and economic assumptions. Rather than scatter these values across chapters (where they would inevitably diverge), we define them once in the book’s source code (mlsys/constants.py) and import them wherever a calculation needs them. This appendix exposes every constant in that file so you can audit, verify, or update the numbers that underpin the book’s reasoning.
Learning Objectives
- Locate the hardware specification (peak FLOPS, memory bandwidth, memory capacity, TDP) for any accelerator generation referenced in the book
- Trace any computed value in the book back to the specific constants and formulas that produced it
- Apply these constants in back-of-the-envelope calculations for training time, memory requirements, and energy costs
- Distinguish between peak (datasheet) values and achievable throughput, and explain why utilization gaps exist
- Update a single constant and understand how the change propagates to every chapter that depends on it
How to Use This Appendix
Use this appendix as a reference when you want to verify a napkin-math calculation, swap in an alternative assumption (for example, a different memory bandwidth or electricity price), or trace which constants a particular estimate depends on. Each constant name matches its Python identifier in mlsys/constants.py, so you can search for any name in the book’s source to find every chapter that uses it.
The following tables are organized into thematic groups—accelerator specifications, model parameters, energy constants, interconnect bandwidths, and so on—so you can quickly locate related assumptions without scanning an alphabetical list. Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).
Because hardware generations change faster than textbooks, these tables also serve as a change log of sorts: updating a single value in mlsys/constants.py propagates the change to every calculation in every chapter automatically. The following worked examples demonstrate how to combine these constants for quick back-of-the-envelope estimates.
Napkin Math 1.1: Quick Estimates with These Constants
Is this workload compute-bound or memory-bound? Divide peak FLOPS by memory bandwidth to get the ridge point of the roofline model. For an H100: 989 TFLOPS / 3.35 TB/s \(\approx\) 295 FLOP/byte. Any operation with arithmetic intensity above 295 FLOP/byte is compute-bound; below it, memory-bound. A large general matrix multiply (GEMM) with \(n=\) 4,096 has intensity \(n/3 \approx\) 1,365 FLOP/byte for 2-byte elements (compute-bound). A single-token autoregressive decode has intensity \(\approx 1\) FLOP/byte (deeply memory-bound).
How much memory does training a 7B model require? Mixed-precision Adaptive Moment Estimation (Adam) stores 2 bytes (BF16 weights) + 2 bytes (gradients) + 12 bytes (FP32 master weights + momentum + variance) = 16 bytes per parameter. For 7B parameters: 7 \(\times 10^9 \times\) 16 bytes = 112 GB. An A100 or H100 has 80 GB of HBM, so the model state alone exceeds a single accelerator—before accounting for activations.
How much energy does one GPT-3 training run cost? An A100 draws 400 W at TDP. GPT-3 training used roughly \(3.14 \times 10^{23}\) FLOPs over a reference run of about 25 days. At $0.12/kWh: 25 days \(\times\) 24 h/day \(\times\) 0.4 kW \(\times\) $0.12/kWh \(\approx\) $29 of electricity per accelerator—a small fraction of the total cost, which is dominated by accelerator amortization.
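The sketch below reproduces these three estimates in Python. It inlines the table values as plain floats rather than importing the pint-wrapped quantities from mlsys/constants.py, so each calculation is easy to follow on its own; treat it as a minimal sketch, not the book's production code.

```python
# Minimal sketch of the three napkin-math estimates above.
# Values mirror the tables in this appendix; the book's code imports
# them from mlsys/constants.py as pint quantities instead.

H100_FLOPS_FP16_TENSOR = 989e12   # FLOP/s
H100_MEM_BW = 3.35e12             # B/s
A100_TDP = 400                    # W
GPT3_TRAINING_DAYS_REF = 25       # days
CLOUD_ELECTRICITY_PER_KWH = 0.12  # $/kWh

# 1. Ridge point: arithmetic intensity where compute and memory balance.
ridge = H100_FLOPS_FP16_TENSOR / H100_MEM_BW
print(f"H100 ridge point: {ridge:.0f} FLOP/byte")     # ~295

# 2. Mixed-precision Adam holds 16 bytes of state per parameter.
train_state_gb = 7e9 * 16 / 1e9
print(f"7B training state: {train_state_gb:.0f} GB")  # 112 GB > 80 GB HBM

# 3. Per-accelerator electricity for the 25-day reference run at TDP.
kwh = GPT3_TRAINING_DAYS_REF * 24 * (A100_TDP / 1000)
print(f"Electricity: ${kwh * CLOUD_ELECTRICITY_PER_KWH:.0f} per accelerator")  # ~$29
```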
Accelerator Specifications
The tables in this section capture the peak compute throughput, memory bandwidth, memory capacity, and thermal design power (TDP) for each accelerator generation referenced in the book. These are datasheet numbers—actual workloads rarely sustain peak rates—but they set the ceiling that roofline analysis and utilization calculations measure against. The accelerator tables progress chronologically from Table 1 (Volta) and Table 2 (Turing) through more recent generations.
NVIDIA V100
| Constant | Value | Unit |
|---|---|---|
| V100_FLOPS_FP16_TENSOR | 125 | TFLOPs/s |
| V100_FLOPS_FP32 | 15.7 | TFLOPs/s |
| V100_MEM_BW | 900 | GB/s |
| V100_MEM_CAPACITY | 32 | GiB |
| V100_TDP | 300 | W |
NVIDIA T4
| Constant | Value | Unit |
|---|---|---|
| T4_FLOPS_FP16_TENSOR | 65 | TFLOPs/s |
| T4_FLOPS_INT8 | 130 | TFLOPs/s |
| T4_MEM_BW | 320 | GB/s |
| T4_TDP | 70 | W |
NVIDIA A100
Table 3 lists the Ampere-generation specs that anchor most training examples in the book.
| Constant | Value | Unit |
|---|---|---|
| A100_FLOPS_FP16_TENSOR | 312 | TFLOPs/s |
| A100_FLOPS_FP32 | 19.5 | TFLOPs/s |
| A100_FLOPS_INT8 | 624 | TFLOPs/s |
| A100_FLOPS_TF32 | 156 | TFLOPs/s |
| A100_MEM_BW | 2039 | GB/s |
| A100_MEM_CAPACITY | 80 | GiB |
| A100_TDP | 400 | W |
NVIDIA H100
Table 4 adds Hopper-generation FP8 Tensor Cores and the Transformer Engine, driving “current generation” estimates.
| Constant | Value | Unit |
|---|---|---|
| H100_FLOPS_FP16_TENSOR | 989 | TFLOPs/s |
| H100_FLOPS_FP8_TENSOR | 1979 | TFLOPs/s |
| H100_FLOPS_INT8 | 1979 | TFLOPs/s |
| H100_FLOPS_TF32 | 494 | TFLOPs/s |
| H100_MEM_BW | 3.35 | TB/s |
| H100_MEM_CAPACITY | 80 | GiB |
| H100_TDP | 700 | W |
NVIDIA B200
Table 5 provides Blackwell-generation specs used in forward-looking capacity-planning examples.
| Constant | Value | Unit |
|---|---|---|
| B200_FLOPS_FP16_TENSOR | 2250 | TFLOPs/s |
| B200_FLOPS_FP8_TENSOR | 4500 | TFLOPs/s |
| B200_FLOPS_INT4 | 9000 | TFLOPs/s |
| B200_MEM_BW | 8 | TB/s |
| B200_MEM_CAPACITY | 192 | GiB |
| B200_TDP | 1000 | W |
AMD Instinct MI300X
Table 6 provides specifications for the MI300X, often used as the primary alternative to NVIDIA’s H100 in large-scale inference and training clusters.
| Constant | Value | Unit |
|---|---|---|
| MI300X_FLOPS_FP16_TENSOR | 1307 | TFLOPs/s |
| MI300X_MEM_BW | 5.3 | TB/s |
| MI300X_MEM_CAPACITY | 192 | GiB |
| MI300X_TDP | 750 | W |
Google TPU v4 and v6
Table 7 provides the ASIC-based alternatives used when comparing training economics across accelerator families.
| Constant | Value | Unit |
|---|---|---|
| TPUV4_FLOPS_BF16 | 275 | TFLOPs/s |
| TPUV4_MEM_BW | 1200 | GB/s |
| TPUV6_FLOPS_BF16 | 2150 | TFLOPs/s |
| TPUV6_MEM_BW | 4.5 | TB/s |
CPU and mobile/edge processors
Table 8 grounds the edge and mobile ML examples, where the contrast with data center accelerator throughput illustrates why deployment target shapes every design decision.
| Constant | Value | Unit |
|---|---|---|
| CPU_FLOPS_FP32 | 1 | TFLOPs/s |
| SYSTEM_MEMORY_BW | 50 | GB/s |
| MOBILE_NPU_TOPS_INT8 | 50 | TOPs/s |
| MOBILE_NPU_MEM_BW | 100 | GB/s |
| MOBILE_TDP_W | 3 | W |
| OBJECT_DETECTOR_POWER_W | 2 | W |
| PHONE_BATTERY_WH | 15 | Wh |
With hardware specifications established, the next question is: what workloads run on this hardware? Table 9 captures the model architectures whose parameter counts, FLOP budgets, and training costs appear in the book’s calculations.
Model Specifications
These constants define the parameter counts, per-inference FLOP budgets, and (where applicable) training costs for the reference models used throughout the book. When a chapter estimates “how long to train GPT-3,” these are the numbers it plugs in.
| Constant | Value | Unit |
|---|---|---|
| BERT_BASE_FLOPs | 2.2e+10 | flop |
| BERT_BASE_PARAMS | 1.1e+08 | param |
| LLAMA3_8B_PARAMS | 8.03e+09 | param |
| GPT2_HIDDEN_DIM | 1600 | - |
| GPT2_LAYERS | 48 | - |
| GPT2_PARAMS | 1.5e+09 | param |
| GPT3_PARAMS | 1.75e+11 | param |
| GPT3_TRAINING_DAYS_REF | 25 | d |
| GPT3_TRAINING_OPS | 3.14e+23 | flop |
| GPT4_TRAINING_ACCELERATOR_DAYS | 2.5e+06 | - |
| RESNET50_FLOPs | 4.1e+09 | flop |
| RESNET50_PARAMS | 2.56e+07 | param |
| MOBILENETV2_FLOPs | 3e+08 | flop |
| MOBILENETV2_PARAMS | 3.5e+06 | param |
| YOLOV8_NANO_FLOPs | 8.7e+09 | flop |
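To see how the model constants combine with the accelerator tables, the following sketch estimates wall-clock training time for GPT-3 on an A100 cluster. The cluster size (1,024 accelerators) and the 40 percent MFU figure are illustrative assumptions for this example, not book constants.

```python
# Hedged sketch: GPT-3 training time on an assumed A100 cluster.
# GPT3_TRAINING_OPS and A100_FLOPS_FP16_TENSOR mirror the tables;
# N_ACCELERATORS and MFU are assumptions made for illustration.

GPT3_TRAINING_OPS = 3.14e23      # total training FLOPs (table value)
A100_FLOPS_FP16_TENSOR = 312e12  # peak FLOP/s (table value)
N_ACCELERATORS = 1024            # assumed cluster size
MFU = 0.40                       # assumed Model FLOPS Utilization

sustained_flops = A100_FLOPS_FP16_TENSOR * MFU * N_ACCELERATORS
days = GPT3_TRAINING_OPS / sustained_flops / 86_400
print(f"Estimated training time: {days:.0f} days")  # ~28 days
```

Under these assumptions the estimate lands near the 25-day GPT3_TRAINING_DAYS_REF value; dropping the MFU discount shrinks it to ~11 days, exactly the kind of 2–3\(\times\) optimism the Fallacies section warns about.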
Knowing what hardware exists and what models run on it is necessary but not sufficient. The energy cost of computation—measured at the level of individual operations and memory accesses—determines whether a design is thermally viable and economically sustainable.
Energy Constants
Energy-per-operation and energy-per-access constants come from semiconductor process measurements (primarily 45 nm and 7 nm nodes) and are used in the book’s energy-efficiency and sustainability analyses. Table 10 quantifies the memory-hierarchy energy costs from register access through DRAM, illustrating why the memory wall is fundamentally an energy wall.
| Constant | Value | Unit |
|---|---|---|
| ENERGY_REG_PJ | 0.01 | pJ |
| ENERGY_SRAM_L1_PJ | 0.5 | pJ |
| ENERGY_SRAM_L2_PJ | 2 | pJ |
| ENERGY_DRAM_ACCESS_PJ | 640 | pJ |
| ENERGY_DRAM_PJ_PER_BYTE | 160 | pJ/B |
| ENERGY_FLOP_PJ | 4.6 | pJ/flop |
| ENERGY_FLOP_FP16_PJ | 1.1 | pJ/flop |
| ENERGY_FLOP_FP32_PJ | 3.7 | pJ/flop |
| ENERGY_FLOP_INT8_PJ | 0.2 | pJ/flop |
| ENERGY_MOBILENET_INF_MJ | 0.1 | mJ |
| NETWORK_5G_ENERGY_PER_MB_MJ | 100 | mJ/MB |
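To see how these per-operation costs compose, the sketch below estimates the energy of one MobileNetV2 inference from compute and DRAM traffic. The assumption that all weights stream from DRAM once per inference, one byte per INT8 parameter, is illustrative; real caching behavior changes the number.

```python
# Hedged sketch: first-order energy for one MobileNetV2 inference in INT8.
# Constants mirror the tables; the one-byte-per-parameter, single-pass
# DRAM traffic model is an illustrative assumption.

MOBILENETV2_FLOPs = 3e8        # ops per inference (table value)
MOBILENETV2_PARAMS = 3.5e6     # parameters (table value)
ENERGY_FLOP_INT8_PJ = 0.2      # pJ per INT8 op (table value)
ENERGY_DRAM_PJ_PER_BYTE = 160  # pJ per DRAM byte (table value)

compute_mj = MOBILENETV2_FLOPs * ENERGY_FLOP_INT8_PJ / 1e9
dram_mj = MOBILENETV2_PARAMS * ENERGY_DRAM_PJ_PER_BYTE / 1e9
print(f"Arithmetic: {compute_mj:.2f} mJ")      # 0.06 mJ
print(f"Weights from DRAM: {dram_mj:.2f} mJ")  # 0.56 mJ
```

The arithmetic alone lands near ENERGY_MOBILENET_INF_MJ (0.1 mJ); streaming every weight from DRAM would cost roughly 9\(\times\) more than the compute itself—the energy wall in miniature.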
Energy costs operate at the chip level, but real ML systems also move data across interconnects—between accelerators, across racks, and over wide-area networks. The next table captures the bandwidth assumptions that determine communication overhead in distributed training and serving.
Interconnect and Network Bandwidth
Distributed training and multi-accelerator serving are bottlenecked by interconnect bandwidth as often as by compute. Table 11 captures the bandwidths for accelerator-to-accelerator links (NVLink), cross-node fabrics (InfiniBand), host buses (PCIe), Non-Volatile Memory Express (NVMe) storage, data center Ethernet, and the speed-of-light floor that sets the minimum latency for any network hop.
| Constant | Value | Unit |
|---|---|---|
| NVLINK_V100_BW | 300 | GB/s |
| NVLINK_A100_BW | 600 | GB/s |
| NVLINK_H100_BW | 900 | GB/s |
| INFINIBAND_HDR_BW | 200 | Gbps |
| INFINIBAND_NDR_BW | 400 | Gbps |
| INFINIBAND_XDR_BW | 800 | Gbps |
| INFINIBAND_GXDR_BW | 1600 | Gbps |
| PCIE_GEN4_BW | 32 | GB/s |
| PCIE_GEN5_BW | 64 | GB/s |
| NVME_SEQUENTIAL_BW | 7 | GB/s |
| NETWORK_10G_BW | 10 | Gbps |
| NETWORK_100G_BW | 100 | Gbps |
| SPEED_OF_LIGHT_FIBER_KM_S | 200000 | km/s |
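Two quick lower bounds fall directly out of these values, sketched below; the 100 km hop distance and the 2-byte (BF16) gradients are assumptions made for illustration.

```python
# Hedged sketch: two interconnect lower bounds from the table values.
# The 100 km hop and BF16 (2-byte) gradients are illustrative assumptions.

SPEED_OF_LIGHT_FIBER_KM_S = 200_000  # km/s (table value)
NVLINK_H100_BW = 900e9               # B/s (table value)

# 1. Speed-of-light floor: minimum one-way latency for a 100 km fiber hop.
latency_ms = 100 / SPEED_OF_LIGHT_FIBER_KM_S * 1000
print(f"100 km one-way floor: {latency_ms:.1f} ms")   # 0.5 ms

# 2. Lower bound to move 7B BF16 gradients (14 GB) over one NVLink link.
transfer_ms = 7e9 * 2 / NVLINK_H100_BW * 1000
print(f"7B gradient transfer: {transfer_ms:.0f} ms")  # ~16 ms
```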
The bandwidth hierarchy—from NVLink within a node down to 10 GbE across a data center—shapes every distributed system design decision. But bandwidth is only one dimension of cost. Building and operating ML infrastructure also requires reasoning about electricity prices and cloud service fees, which the next table quantifies.
Economic Constants
Cost estimates throughout the book depend on the electricity and cloud pricing assumptions in Table 12. These are order-of-magnitude reference values; actual prices vary by region, provider, and contract terms, but the ratios between them are more stable than the absolute numbers.
| Constant | Value | Unit |
|---|---|---|
| CLOUD_ELECTRICITY_PER_KWH | 0.12 | dollar/kWh |
| CLOUD_EGRESS_PER_GB | 0.09 | dollar/GB |
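As a quick illustration of the egress constant, the sketch below prices a single checkpoint download; the 140 GB size (a hypothetical 70B-parameter BF16 checkpoint) is an assumption for the example.

```python
# Hedged sketch: egress cost for one checkpoint download.
CLOUD_EGRESS_PER_GB = 0.09      # $/GB (table value)
checkpoint_gb = 70e9 * 2 / 1e9  # assumed 70B-parameter BF16 checkpoint
cost = checkpoint_gb * CLOUD_EGRESS_PER_GB
print(f"Egress per download: ${cost:.2f}")  # $12.60
```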
Economic constants set the price per unit of compute and data transfer, but they mean little without a sense of the volumes involved. Production ML systems handle millions to billions of requests per day—numbers large enough to be difficult to internalize without concrete reference points.
Scale References
When reasoning about production ML systems, it helps to have concrete scale anchors for “how much traffic does a large service actually handle?” Table 13 provides order-of-magnitude reference points drawn from public disclosures.
| Constant | Value | Unit |
|---|---|---|
| GMAIL_EMAILS_PER_DAY | 1.21e+11 | - |
| GOOGLE_SEARCHES_PER_DAY | 8.5e+09 | - |
| WAYMO_DATA_PER_HOUR_LOW | 1 | TB/h |
| WAYMO_DATA_PER_HOUR_HIGH | 19 | TB/h |
| VIDEO_1080P_WIDTH | 1920 | - |
| VIDEO_1080P_HEIGHT | 1080 | - |
| VIDEO_BYTES_PER_PIXEL_RGB | 3 | B |
| VIDEO_FPS_STANDARD | 30 | Hz |
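The video constants combine into a useful anchor, sketched below: the raw (uncompressed) data rate of a single 1080p RGB camera, which puts the Waymo per-vehicle figures in context. This is a direct calculation from the table values; no extra assumptions are needed.

```python
# Raw 1080p RGB data rate from the table's video constants.
width, height = 1920, 1080    # VIDEO_1080P_WIDTH / VIDEO_1080P_HEIGHT
bytes_per_pixel, fps = 3, 30  # VIDEO_BYTES_PER_PIXEL_RGB, VIDEO_FPS_STANDARD

mb_per_s = width * height * bytes_per_pixel * fps / 1e6
tb_per_h = mb_per_s * 3600 / 1e6
print(f"{mb_per_s:.0f} MB/s, {tb_per_h:.2f} TB/h")  # ~187 MB/s, ~0.67 TB/h
```

A handful of such uncompressed streams already lands in the table’s 1–19 TB/h Waymo range.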
All the preceding constants—hardware specs, model parameters, energy costs, economic rates, and scale references—are expressed in specific units. For these values to combine correctly in calculations, every quantity must carry its units explicitly.
Unit Definitions
The constants file defines the base and derived units listed in Table 14 so that all dimensional analysis in the book uses a consistent unit system, implemented with the pint library. These are not assumptions per se, but they ensure that every computed value carries its units: mixing incompatible quantities (for example, adding bytes to FLOP/s) raises an immediate error rather than producing a silent wrong answer.
| Constant | Value | Unit |
|---|---|---|
| byte | byte | - |
| KB | KB | - |
| MB | MB | - |
| GB | GB | - |
| TB | TB | - |
| PB | PB | - |
| flop | flop | - |
| GFLOPs | GFLOPs | - |
| TFLOPs | TFLOPs | - |
| ZFLOPs | ZFLOPs | - |
| param | param | - |
| Mparam | Mparam | - |
| Gbps | Gbps | - |
| NS | NS | - |
| US | US | - |
| MS | MS | - |
| second | second | - |
| hour | hour | - |
| day | day | - |
| joule | joule | - |
| watt | watt | - |
| meter | meter | - |
| USD | dollar | - |
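A minimal pint sketch of the behavior this buys is shown below; the unit definitions here are illustrative stand-ins, since the book’s actual registry lives in mlsys/constants.py.

```python
# Minimal pint sketch: custom units make dimensional errors fail loudly.
# These definitions are illustrative; the real registry is in mlsys/constants.py.
import pint

ureg = pint.UnitRegistry()
ureg.define("flop = [operation]")  # new base unit for compute operations

peak = 312e12 * ureg.flop / ureg.second  # A100 FP16 tensor peak
work = 3.14e23 * ureg.flop               # GPT-3 training FLOPs
print((work / peak).to(ureg.day))        # dimensionally sound: ~11,648 days

data = 80 * ureg.gigabyte                # H100 memory capacity
try:
    work + data                          # flop + gigabyte: incompatible
except pint.DimensionalityError as err:
    print("caught:", err)                # silent wrong answer becomes an error
```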
With all constants, units, and scale references in place, it is worth pausing to consider the most common mistakes practitioners make when applying these numbers to real-world estimates.
Fallacies and Pitfalls
Fallacy: Peak FLOPS predict real-world training throughput.
Datasheet FLOPS are measured under idealized conditions—perfectly aligned matrix dimensions, 100 percent occupancy, zero memory stalls. Real training workloads typically achieve 30–50 percent of peak (measured as Model FLOPS Utilization, or MFU). Using peak FLOPS to estimate training time without an MFU discount produces estimates that are 2–3\(\times\) too optimistic, leading to missed deadlines and budget overruns.
Pitfall: Using FP32 FLOPS when the workload runs in BF16 or FP8.
Modern accelerators have separate datapaths for different precisions, and the peak throughput varies dramatically: the H100 delivers 989 TFLOPS in FP16 tensor operations but only about 60 TFLOPS in FP32 non-tensor operations—a 16\(\times\) difference. Quoting the wrong precision’s peak when computing utilization or estimating training time produces meaningless results. Always match the constant to the precision your workload actually uses.
Fallacy: Hardware constants are stable enough to hardcode.
Accelerator specifications, cloud pricing, and energy costs change with every hardware generation and contract renegotiation. Hardcoding “the A100 has 2 TB/s bandwidth” in a calculation means the estimate silently rots as hardware evolves. This is precisely why the book uses mlsys/constants.py—updating a single value propagates the correction everywhere.
Pitfall: Treating TDP as actual power consumption.
Thermal Design Power (TDP) is the maximum sustained power draw the cooling system must handle, not the power the accelerator actually consumes under a given workload. Real power consumption varies with utilization, memory access patterns, and clock frequency. Using TDP for energy calculations overestimates costs for inference workloads (which rarely sustain peak power) and may underestimate costs for sustained training workloads on newer hardware with dynamic boost.
Summary
Key Takeaways: Auditing the Book's Constants
- Every quantitative example in this book traces back to a specific constant in mlsys/constants.py. This appendix exposes all of those constants so you can audit, verify, or update the numbers that underpin the book’s reasoning.
- Hardware specs (peak FLOPS, memory bandwidth, TDP) set ceilings, not guarantees. Real utilization is typically 30–50 percent of peak for training workloads; using peak values without discounting produces dangerously optimistic estimates.
- The ratios between constants are often more stable and informative than the absolute values. The ridge point (FLOPS/bandwidth), the memory-per-parameter cost (16 bytes for mixed-precision Adam), and the energy hierarchy (orders of magnitude between register and DRAM access) persist across hardware generations.
- A single source of truth for constants eliminates the most common source of inconsistency in quantitative textbooks: the same number quoted differently in different chapters.