System Assumptions

Purpose

What assumptions sit underneath the “napkin math” throughout the book?

Every quantitative example in this book—from training-time estimates to energy-per-inference calculations—rests on a shared set of physical constants, hardware specifications, and economic assumptions. Rather than scatter these values across chapters (where they would inevitably diverge), we define them once in the book’s source code (mlsys/constants.py) and import them wherever a calculation needs them. This appendix exposes every constant in that file so you can audit, verify, or update the numbers that underpin the book’s reasoning.

Learning Objectives
  • Locate the hardware specification (peak FLOPS, memory bandwidth, memory capacity, TDP) for any accelerator generation referenced in the book
  • Trace any computed value in the book back to the specific constants and formulas that produced it
  • Apply these constants in back-of-the-envelope calculations for training time, memory requirements, and energy costs
  • Distinguish between peak (datasheet) values and achievable throughput, and explain why utilization gaps exist
  • Update a single constant and understand how the change propagates to every chapter that depends on it

How to Use This Appendix

Use this appendix as a reference when you want to verify a napkin-math calculation, swap in an alternative assumption (for example, a different memory bandwidth or electricity price), or trace which constants a particular estimate depends on. Each constant name matches its Python identifier in mlsys/constants.py, so you can search for any name in the book’s source to find every chapter that uses it.

The following tables are organized into thematic groups—accelerator specifications, model parameters, energy constants, interconnect bandwidths, and so on—so you can quickly locate related assumptions without scanning an alphabetical list. Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).

Because hardware generations change faster than textbooks, these tables also serve as a change log of sorts: updating a single value in mlsys/constants.py propagates the change to every calculation in every chapter automatically. The following worked examples demonstrate how to combine these constants for quick back-of-the-envelope estimates.

Napkin Math 1.1: Napkin Math with These Constants
The constants in this appendix are not just for auditing—they are designed for quick calculations. Three examples illustrate the pattern.

Is this workload compute-bound or memory-bound? Divide peak FLOPS by memory bandwidth to get the ridge point of the roofline model. For an H100: 989 TFLOPS / 3.35 TB/s \(\approx\) 295 FLOP/byte. Any operation with arithmetic intensity above 295 FLOP/byte is compute-bound; below it, memory-bound. A large general matrix multiply (GEMM) with \(n=\) 4,096 in FP16 has intensity \(n/3 \approx\) 1,365 FLOP/byte (compute-bound). A single-token autoregressive decode has intensity \(\approx 1\) FLOP/byte (deeply memory-bound).
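In code, the ridge-point check is a few lines. This sketch inlines the H100 values from Table 4 (the names mirror mlsys/constants.py, but the definitions here are local so the snippet runs standalone):

```python
# Ridge point and GEMM arithmetic intensity (H100 values from Table 4).
H100_FLOPS_FP16_TENSOR = 989e12      # FLOP/s, datasheet peak
H100_MEM_BW = 3.35e12                # bytes/s

ridge = H100_FLOPS_FP16_TENSOR / H100_MEM_BW     # ~295 FLOP/byte

def gemm_intensity(n, bytes_per_elem=2):
    """n x n x n GEMM: 2n^3 FLOPs over 3n^2 elements read/written."""
    return 2 * n**3 / (3 * n**2 * bytes_per_elem)

intensity = gemm_intensity(4096)                  # ~1,365 FLOP/byte in FP16
bound = "compute" if intensity > ridge else "memory"
print(f"ridge {ridge:.0f}, GEMM n=4096: {intensity:.0f} FLOP/byte ({bound}-bound)")
```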

How much memory does training a 7B model require? Mixed-precision Adaptive Moment Estimation (Adam) stores 2 bytes (BF16 weights) + 2 bytes (gradients) + 12 bytes (FP32 master weights + momentum + variance) = 16 bytes per parameter. For 7B parameters: 7 \(\times 10^9 \times\) 16 bytes = 112 GB. An A100 or H100 has 80 GB of HBM, so the model state alone exceeds a single accelerator—before accounting for activations.
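The same arithmetic as a short sketch, with the 16-byte breakdown spelled out:

```python
# Mixed-precision Adam: bytes of model + optimizer state per parameter.
BYTES_PER_PARAM = 2 + 2 + 12     # BF16 weights + BF16 grads + FP32 master/m/v

params = 7e9                      # 7B-parameter model
model_state_gb = params * BYTES_PER_PARAM / 1e9
print(f"model state: {model_state_gb:.0f} GB vs. 80 GB of HBM")   # 112 GB
```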

How much energy does one GPT-3 training run cost? An A100 draws 400 W at TDP. GPT-3 training used roughly \(3.14 \times 10^{23}\) FLOPs, which at the reference 25 days corresponds to roughly 1,000 A100-class accelerators—about 25,000 accelerator-days (consistent with the 30–50 percent utilization discussed under Fallacies). At $0.12/kWh: 25,000 accelerator-days \(\times\) 24 h/day \(\times\) 0.4 kW \(\times\) $0.12/kWh \(\approx\) $29,000 in electricity alone—a small fraction of the total cost, which is dominated by accelerator amortization.
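And the electricity estimate itself, under the stated assumption of roughly 25,000 accelerator-days at TDP:

```python
# Electricity for a GPT-3-scale run, assuming ~25,000 accelerator-days
# at A100 TDP (a rough reference figure, not a measured value).
A100_TDP_KW = 0.4
CLOUD_ELECTRICITY_PER_KWH = 0.12

kwh = 25_000 * 24 * A100_TDP_KW                   # 240,000 kWh
print(f"electricity: ${kwh * CLOUD_ELECTRICITY_PER_KWH:,.0f}")    # ~$28,800
```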

Accelerator Specifications

The tables in this section capture the peak compute throughput, memory bandwidth, memory capacity, and thermal design power (TDP) for each accelerator generation referenced in the book. These are datasheet numbers—actual workloads rarely sustain peak rates—but they set the ceiling that roofline analysis and utilization calculations measure against. The accelerator tables progress chronologically from Table 1 (Volta) and Table 2 (Turing) through more recent generations.

NVIDIA V100

Table 1: NVIDIA V100 (Volta): The V100 introduced Tensor Cores for mixed-precision training. These specs anchor the “baseline generation” comparisons used in several training-cost estimates.
Constant Value Unit
V100_FLOPS_FP16_TENSOR 125 TFLOPs/s
V100_FLOPS_FP32 15.7 TFLOPs/s
V100_MEM_BW 900 GB/s
V100_MEM_CAPACITY 32 GiB
V100_TDP 300 W

NVIDIA T4

Table 2: NVIDIA T4 (Turing): A low-power inference accelerator widely deployed in cloud serving. Its 70 W TDP makes it the go-to comparison point for cost-per-inference calculations.
Constant Value Unit
T4_FLOPS_FP16_TENSOR 65 TFLOPs/s
T4_FLOPS_INT8 130 TFLOPs/s
T4_MEM_BW 320 GB/s
T4_TDP 70 W

NVIDIA A100

Table 3 lists the Ampere-generation specs that anchor most training examples in the book.

Table 3: NVIDIA A100 (Ampere): The A100 is the most commonly cited accelerator in the book’s training examples. Its 80 GB HBM2e capacity and TF32 Tensor Cores set the reference point for memory-capacity and compute-intensity calculations.
Constant Value Unit
A100_FLOPS_FP16_TENSOR 312 TFLOPs/s
A100_FLOPS_FP32 19.5 TFLOPs/s
A100_FLOPS_INT8 624 TFLOPs/s
A100_FLOPS_TF32 156 TFLOPs/s
A100_MEM_BW 2039 GB/s
A100_MEM_CAPACITY 80 GiB
A100_TDP 400 W

NVIDIA H100

Table 4 adds Hopper-generation FP8 Tensor Cores and the Transformer Engine, driving “current generation” estimates.

Table 4: NVIDIA H100 (Hopper): The H100 adds FP8 Tensor Cores and the Transformer Engine. Its specs drive the “current generation” training and serving estimates.
Constant Value Unit
H100_FLOPS_FP16_TENSOR 989 TFLOPs/s
H100_FLOPS_FP8_TENSOR 1979 TFLOPs/s
H100_FLOPS_INT8 1979 TFLOPs/s
H100_FLOPS_TF32 494 TFLOPs/s
H100_MEM_BW 3.35 TB/s
H100_MEM_CAPACITY 80 GiB
H100_TDP 700 W

NVIDIA B200

Table 5 provides Blackwell-generation specs used in forward-looking capacity-planning examples.

Table 5: NVIDIA B200 (Blackwell): The B200 represents the next-generation reference point. Its HBM3e bandwidth and FP8 throughput appear in forward-looking capacity-planning examples.
Constant Value Unit
B200_FLOPS_FP16_TENSOR 2250 TFLOPs/s
B200_FLOPS_FP8_TENSOR 4500 TFLOPs/s
B200_FLOPS_INT4 9000 TFLOPs/s
B200_MEM_BW 8 TB/s
B200_MEM_CAPACITY 192 GiB
B200_TDP 1000 W

AMD Instinct MI300X

Table 6 provides specifications for the MI300X, often used as the primary alternative to NVIDIA’s H100 in large-scale inference and training clusters.

Table 6: AMD Instinct MI300X: The MI300X features high HBM capacity (192 GB) and bandwidth, making it a common baseline for comparing memory-bound workload performance across vendors.
Constant Value Unit
MI300X_FLOPS_FP16_TENSOR 1307 TFLOPs/s
MI300X_MEM_BW 5.3 TB/s
MI300X_MEM_CAPACITY 192 GiB
MI300X_TDP 750 W

Google TPU v4

Table 7 provides the ASIC-based alternative used when comparing training economics across accelerator families.

Table 7: Google TPU v4 and v6 (Trillium): Tensor Processing Unit (TPU) specifications for comparing ASIC-based training economics across generations. TPU v6 (Trillium) represents a significant jump in compute density and bandwidth.
Constant Value Unit
TPUV4_FLOPS_BF16 275 TFLOPs/s
TPUV4_MEM_BW 1200 GB/s
TPUV6_FLOPS_BF16 2150 TFLOPs/s
TPUV6_MEM_BW 4.5 TB/s

CPU and mobile/edge processors

Table 8 grounds the edge and mobile ML examples, where the contrast with data center accelerator throughput illustrates why deployment target shapes every design decision.

Table 8: CPU, Mobile NPU, and Edge Device Specs: These constants ground the edge and mobile ML examples. The contrast between mobile NPU throughput and data center accelerator throughput illustrates why deployment target shapes every design decision.
Constant Value Unit
CPU_FLOPS_FP32 1 TFLOPs/s
SYSTEM_MEMORY_BW 50 GB/s
MOBILE_NPU_TOPS_INT8 50 TOPS
MOBILE_NPU_MEM_BW 100 GB/s
MOBILE_TDP_W 3 W
OBJECT_DETECTOR_POWER_W 2 W
PHONE_BATTERY_WH 15 Wh

With hardware specifications established, the next question is: what workloads run on this hardware? Table 9 captures the model architectures whose parameter counts, FLOP budgets, and training costs appear in the book’s calculations.

Model Specifications

These constants define the parameter counts, per-inference FLOP budgets, and (where applicable) training costs for the reference models used throughout the book. When a chapter estimates “how long to train GPT-3,” these are the numbers it plugs in.

Table 9: Reference Model Specifications: Parameter counts and FLOP budgets for the models used in worked examples. These span four orders of magnitude—from MobileNetV2 on a phone to GPT-4 across thousands of accelerators—illustrating how model scale drives every systems decision from memory planning to cluster sizing.
Constant Value Unit
BERT_BASE_FLOPs 2.2e+10 flop
BERT_BASE_PARAMS 1.1e+08 param
LLAMA3_8B_PARAMS 8.03e+09 param
GPT2_HIDDEN_DIM 1600 -
GPT2_LAYERS 48 -
GPT2_PARAMS 1.5e+09 param
GPT3_PARAMS 1.75e+11 param
GPT3_TRAINING_DAYS_REF 25 d
GPT3_TRAINING_OPS 3.14e+23 flop
GPT4_TRAINING_ACCELERATOR_DAYS 2.5e+06 -
RESNET50_FLOPs 4.1e+09 flop
RESNET50_PARAMS 2.56e+07 param
MOBILENETV2_FLOPs 3e+08 flop
MOBILENETV2_PARAMS 3.5e+06 param
YOLOV8_NANO_FLOPs 8.7e+09 flop
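As a worked illustration of how these model constants combine with the accelerator tables, the sketch below estimates GPT-3 training time. The cluster size and MFU are assumptions chosen for the example, not constants from the file:

```python
# Estimated GPT-3 training time from Table 9's FLOP budget and Table 3's
# peak throughput. Cluster size and MFU are assumed for illustration.
GPT3_TRAINING_OPS = 3.14e23          # total training FLOPs
A100_FLOPS_FP16_TENSOR = 312e12      # FLOP/s, datasheet peak
MFU = 0.45                           # assumed Model FLOPS Utilization
n_accelerators = 1024                # assumed cluster size

seconds = GPT3_TRAINING_OPS / (n_accelerators * A100_FLOPS_FP16_TENSOR * MFU)
print(f"~{seconds / 86400:.0f} days on {n_accelerators} A100s")  # ~25 days
```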

Knowing what hardware exists and what models run on it is necessary but not sufficient. The energy cost of computation—measured at the level of individual operations and memory accesses—determines whether a design is thermally viable and economically sustainable.

Energy Constants

Energy-per-operation and energy-per-access constants come from semiconductor process measurements (primarily 45 nm and 7 nm nodes) and are used in the book’s energy-efficiency and sustainability analyses. Table 10 quantifies the resulting energy hierarchy, from register access through DRAM, illustrating why the memory wall is fundamentally an energy wall.

Table 10: Energy per Operation and Access: These constants quantify the energy hierarchy from register file through DRAM and across precision formats. The gap of more than four orders of magnitude between a register read (0.01 pJ) and a DRAM access (640 pJ) explains why data reuse (tiling, fusion) dominates ML kernel optimization.
Constant Value Unit
ENERGY_REG_PJ 0.01 pJ
ENERGY_SRAM_L1_PJ 0.5 pJ
ENERGY_SRAM_L2_PJ 2 pJ
ENERGY_DRAM_ACCESS_PJ 640 pJ
ENERGY_DRAM_PJ_PER_BYTE 160 pJ/B
ENERGY_FLOP_PJ 4.6 pJ/flop
ENERGY_FLOP_FP16_PJ 1.1 pJ/flop
ENERGY_FLOP_FP32_PJ 3.7 pJ/flop
ENERGY_FLOP_INT8_PJ 0.2 pJ/flop
ENERGY_MOBILENET_INF_MJ 0.1 mJ
NETWORK_5G_ENERGY_PER_MB_MJ 100 mJ/MB
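A short sketch makes the energy-wall point concrete: using the Table 10 values (inlined here), moving one megabyte through DRAM costs about as much energy as roughly \(10^8\) FP16 operations:

```python
# The memory wall as an energy wall (values from Table 10, inlined).
ENERGY_DRAM_PJ_PER_BYTE = 160        # pJ per byte of DRAM traffic
ENERGY_FLOP_FP16_PJ = 1.1            # pJ per FP16 operation

dram_pj = 1e6 * ENERGY_DRAM_PJ_PER_BYTE        # energy to move 1 MB: 0.16 mJ
equivalent_flops = dram_pj / ENERGY_FLOP_FP16_PJ
print(f"1 MB of DRAM traffic ~ {equivalent_flops:.2e} FP16 FLOPs of energy")
```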

Energy costs operate at the chip level, but real ML systems also move data across interconnects—between accelerators, across racks, and over wide-area networks. The next table captures the bandwidth assumptions that determine communication overhead in distributed training and serving.

Interconnect and Network Bandwidth

Distributed training and multi-accelerator serving are bottlenecked by interconnect bandwidth as often as by compute. Table 11 captures the bandwidths for accelerator-to-accelerator links (NVLink), cross-node fabrics (InfiniBand), host buses (PCIe), Non-Volatile Memory Express (NVMe) storage, data center Ethernet, and the speed-of-light floor that sets the minimum latency for any network hop.

Table 11: Interconnect and Network Bandwidth: Ordered from fastest (intra-node NVLink) to slowest (data center Ethernet), these bandwidths determine gradient synchronization time, pipeline-parallel bubble overhead, and data-loading throughput. The speed of light in fiber sets the physical floor for cross-data center latency.
Constant Value Unit
NVLINK_V100_BW 300 GB/s
NVLINK_A100_BW 600 GB/s
NVLINK_H100_BW 900 GB/s
INFINIBAND_HDR_BW 200 Gbps
INFINIBAND_NDR_BW 400 Gbps
INFINIBAND_XDR_BW 800 Gbps
INFINIBAND_GXDR_BW 1600 Gbps
PCIE_GEN4_BW 32 GB/s
PCIE_GEN5_BW 64 GB/s
NVME_SEQUENTIAL_BW 7 GB/s
NETWORK_10G_BW 10 Gbps
NETWORK_100G_BW 100 Gbps
SPEED_OF_LIGHT_FIBER_KM_S 200000 km/s
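To see how these bandwidths bound distributed training, the sketch below estimates the bandwidth term of a ring all-reduce for 7B BF16 gradients over NVLink, plus the speed-of-light latency floor for a 1,000 km hop. The ~2\(\times\) ring factor and the single-accelerator-link simplification are assumptions of the sketch:

```python
# Bandwidth term of a ring all-reduce over NVLink (latency and
# compute/communication overlap are ignored in this sketch).
NVLINK_H100_BW = 900e9               # bytes/s, aggregate per accelerator
grad_bytes = 7e9 * 2                 # 7B parameters, BF16 gradients

sync_s = 2 * grad_bytes / NVLINK_H100_BW   # ring all-reduce moves ~2x the data
print(f"all-reduce lower bound: {sync_s * 1e3:.0f} ms")       # ~31 ms

# Speed-of-light floor for a 1,000 km fiber hop:
SPEED_OF_LIGHT_FIBER_KM_S = 200_000
print(f"1,000 km one-way: {1000 / SPEED_OF_LIGHT_FIBER_KM_S * 1e3:.0f} ms")  # 5 ms
```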

The bandwidth hierarchy—from NVLink within a node down to 10 GbE across a data center—shapes every distributed system design decision. But bandwidth is only one dimension of cost. Building and operating ML infrastructure also requires reasoning about electricity prices and cloud service fees, which the next table quantifies.

Economic Constants

Cost estimates throughout the book depend on the electricity and cloud pricing assumptions in Table 12. These are order-of-magnitude reference values; actual prices vary by region, provider, and contract terms, but the ratios between them are more stable than the absolute numbers.

Table 12: Economic Assumptions: Electricity and egress pricing used in total cost of ownership (TCO) calculations. These are representative cloud rates; on-premise costs differ but the relative magnitudes guide the same design decisions.
Constant Value Unit
CLOUD_ELECTRICITY_PER_KWH 0.12 dollar/kWh
CLOUD_EGRESS_PER_GB 0.09 dollar/GB
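Two quick applications of these rates (values inlined; the checkpoint size assumes a 7B-parameter BF16 model):

```python
# Two TCO terms from Table 12 (values inlined).
CLOUD_ELECTRICITY_PER_KWH = 0.12     # dollars per kWh
CLOUD_EGRESS_PER_GB = 0.09           # dollars per GB
H100_TDP_KW = 0.7

power_cost = H100_TDP_KW * 24 * CLOUD_ELECTRICITY_PER_KWH     # one H100-day at TDP
checkpoint_gb = 7e9 * 2 / 1e9                                 # 7B params in BF16
print(f"H100-day of electricity: ${power_cost:.2f}")          # ~$2.02
print(f"checkpoint egress: ${checkpoint_gb * CLOUD_EGRESS_PER_GB:.2f}")  # ~$1.26
```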

Economic constants set the price per unit of compute and data transfer, but they mean little without a sense of the volumes involved. Production ML systems handle millions to billions of requests per day—numbers large enough to be difficult to internalize without concrete reference points.

Scale References

When reasoning about production ML systems, it helps to have concrete scale anchors for “how much traffic does a large service actually handle?” Table 13 provides order-of-magnitude reference points drawn from public disclosures.

Table 13: Production Scale and Data Rate References: These anchors ground the “how big is big?” questions that arise in capacity planning. Waymo data rates illustrate sensor-fusion throughput requirements; Gmail and Google Search volumes calibrate serving-infrastructure estimates.
Constant Value Unit
GMAIL_EMAILS_PER_DAY 1.21e+11 -
GOOGLE_SEARCHES_PER_DAY 8.5e+09 -
WAYMO_DATA_PER_HOUR_LOW 1 TB/h
WAYMO_DATA_PER_HOUR_HIGH 19 TB/h
VIDEO_1080P_WIDTH 1920 -
VIDEO_1080P_HEIGHT 1080 -
VIDEO_BYTES_PER_PIXEL_RGB 3 B
VIDEO_FPS_STANDARD 30 Hz
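These constants combine directly: the sketch below computes the raw data rate of one uncompressed 1080p RGB camera stream, which lands near the low end of the Waymo per-hour range once several sensors are summed (the single-camera framing is an assumption of the example):

```python
# Raw (uncompressed) data rate of one 1080p RGB camera at 30 fps.
width, height = 1920, 1080
bytes_per_pixel = 3
fps = 30

rate_mb_s = width * height * bytes_per_pixel * fps / 1e6      # ~187 MB/s
tb_per_hour = rate_mb_s * 3600 / 1e6                          # ~0.67 TB/h
print(f"{rate_mb_s:.0f} MB/s per camera, {tb_per_hour:.2f} TB/h")
```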

All the preceding constants—hardware specs, model parameters, energy costs, economic rates, and scale references—are expressed in specific units. For these values to combine correctly in calculations, every quantity must carry its units explicitly.

Unit Definitions

The constants file defines the base and derived units listed in Table 14 so that all dimensional analysis in the book uses a consistent unit system via the pint library. These are not assumptions per se, but they ensure that every computed value carries its units and that unit-conversion errors are caught automatically.

Table 14: Unit Definitions: Base and derived units used by the pint dimensional-analysis library throughout the book. Every computed value carries its units, so mixing incompatible quantities (for example, adding bytes to FLOP/s) raises an immediate error rather than producing a silent wrong answer.
Constant Value Unit
byte byte -
KB KB -
MB MB -
GB GB -
TB TB -
PB PB -
flop flop -
GFLOPs GFLOPs -
TFLOPs TFLOPs -
ZFLOPs ZFLOPs -
param param -
Mparam Mparam -
Gbps Gbps -
NS NS -
US US -
MS MS -
second second -
hour hour -
day day -
joule joule -
watt watt -
meter meter -
USD dollar -
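A minimal sketch of the pint pattern, assuming only that constants.py defines flop as a new base dimension (the exact definitions in the file may differ):

```python
# Quantities carry units; mixing incompatible dimensions raises an
# error instead of silently producing a wrong number.
import pint

ureg = pint.UnitRegistry()
ureg.define("flop = [flop]")                 # new base dimension for FLOPs

peak = 989e12 * ureg.flop / ureg.second      # H100 FP16 tensor peak
bw = 3.35e12 * ureg.byte / ureg.second       # H100 memory bandwidth

print(peak / bw)                              # ridge point, in flop / byte
try:
    peak + bw                                 # adding FLOP/s to bytes/s
except pint.DimensionalityError as e:
    print("caught:", e)
```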

With all constants, units, and scale references in place, it is worth pausing to consider the most common mistakes practitioners make when applying these numbers to real-world estimates.

Fallacies and Pitfalls

Fallacy: Peak FLOPS predict real-world training throughput.

Datasheet FLOPS are measured under idealized conditions—perfectly aligned matrix dimensions, 100 percent occupancy, zero memory stalls. Real training workloads typically achieve 30–50 percent of peak (measured as Model FLOPS Utilization, or MFU). Using peak FLOPS to estimate training time without an MFU discount produces estimates that are 2–3\(\times\) too optimistic, leading to missed deadlines and budget overruns.
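The discount is easy to apply in code; the 1,024-accelerator cluster and 40 percent MFU below are assumptions chosen for illustration:

```python
# Peak-FLOPS estimate vs. the same estimate with an MFU discount.
GPT3_TRAINING_OPS = 3.14e23
cluster_flops = 1024 * 312e12        # assumed: 1,024 A100s at FP16 peak
MFU = 0.40                           # assumed utilization

naive_days = GPT3_TRAINING_OPS / cluster_flops / 86400
print(f"at peak: {naive_days:.0f} days; at 40% MFU: {naive_days / MFU:.0f} days")
```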

Pitfall: Using FP32 FLOPS when the workload runs in BF16 or FP8.

Modern accelerators have separate datapaths for different precisions, and the peak throughput varies dramatically: the H100 delivers 989 TFLOPS in FP16 tensor operations but only about 60 TFLOPS in FP32 non-tensor operations—a 16\(\times\) difference. Quoting the wrong precision’s peak when computing utilization or estimating training time produces meaningless results. Always match the constant to the precision your workload actually uses.

Fallacy: Hardware constants are stable enough to hardcode.

Accelerator specifications, cloud pricing, and energy costs change with every hardware generation and contract renegotiation. Hardcoding “the A100 has 2 TB/s bandwidth” in a calculation means the estimate silently rots as hardware evolves. This is precisely why the book uses mlsys/constants.py—updating a single value propagates the correction everywhere.

Pitfall: Treating TDP as actual power consumption.

Thermal Design Power (TDP) is the maximum sustained power draw the cooling system must handle, not the power the accelerator actually consumes under a given workload. Real power consumption varies with utilization, memory access patterns, and clock frequency. Using TDP for energy calculations overestimates costs for inference workloads (which rarely sustain peak power) and may underestimate costs for sustained training workloads on newer hardware with dynamic boost.

Summary

Key Takeaways: Auditing the Book's Constants
  • Every quantitative example in this book traces back to a specific constant in mlsys/constants.py. This appendix exposes all of those constants so you can audit, verify, or update the numbers that underpin the book’s reasoning.
  • Hardware specs (peak FLOPS, memory bandwidth, TDP) set ceilings, not guarantees. Real utilization is typically 30–50 percent of peak for training workloads; using peak values without discounting produces dangerously optimistic estimates.
  • The ratios between constants are often more stable and informative than the absolute values. The ridge point (FLOPS/bandwidth), the memory-per-parameter cost (16 bytes for mixed-precision Adam), and the energy hierarchy (a 0.01 pJ register read versus a 640 pJ DRAM access) persist across hardware generations.
  • A single source of truth for constants eliminates the most common source of inconsistency in quantitative textbooks: the same number quoted differently in different chapters.