System Assumptions

Purpose

What assumptions sit underneath the “napkin math” throughout the book?

Every quantitative example in this book—from training-time estimates to energy-per-inference calculations—rests on a shared set of physical constants, hardware specifications, and economic assumptions. Rather than scatter these values across chapters (where they would inevitably diverge), we define them once in the book’s source code (mlsysim/core/constants.py) and import them wherever a calculation needs them. This appendix exposes the main constants used in the volume’s worked examples so the numbers that underpin the book’s reasoning are open to audit, verification, and update.

How to Use This Appendix

This appendix serves as a reference for verifying a napkin-math calculation, swapping in an alternative assumption (for example, a different memory bandwidth or electricity price), or tracing which constants a particular estimate depends on. Each constant name matches its Python identifier in mlsysim/core/constants.py, so searching for any name in the book’s source surfaces every chapter that uses it.

The reference tables are organized into thematic groups—accelerator specifications, model parameters, energy constants, interconnect bandwidths, and so on—so related assumptions surface quickly without scanning an alphabetical list. Conventions used here follow the book-wide notation (for example, we reserve $B$ for batch size and use $\text{BW}$ for bandwidth).

Because hardware generations change faster than textbooks, these tables also serve as a change log of sorts: updating a single value in mlsysim/core/constants.py propagates the change to every calculation in every chapter automatically. A few worked estimates show how these constants combine for quick back-of-the-envelope calculations.

Napkin Math 1.1: Napkin math with these constants

The constants in this appendix are not just for auditing—they are designed for quick calculations. Three examples illustrate the pattern.

Is this workload compute bound or memory bound? Divide peak FLOPS by memory bandwidth to get the ridge point of the roofline model. For an H100: 989 TFLOPS / 3.35 TB/s $\approx$ 295 FLOP/byte. Any operation with arithmetic intensity above 295 FLOP/byte is compute bound; below it, memory bound. A large general matrix multiply (GEMM) with $n=$ 4,096 performs $2n^3$ FLOPs while moving three $n \times n$ matrices of 2-byte elements, so its intensity is $n/3 \approx$ 1,365 FLOP/byte (compute bound). A single-token autoregressive decode has intensity $\approx 1$ FLOP/byte (deeply memory bound).
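
The ridge-point check can be sketched in a few lines of Python. This is a minimal sketch, not the book's mlsysim code: the two constants are local copies of the H100 table values, and the helper names (`ridge_point`, `gemm_intensity`, `classify`) are hypothetical. The GEMM intensity assumes 2-byte elements and that each of the three matrices moves through memory exactly once.

```python
# Local copies of H100 table values (assumption: not imported from mlsysim).
H100_FLOPS_FP16_TENSOR = 989e12  # FLOP/s
H100_MEM_BW = 3.35e12            # bytes/s


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the compute and memory ceilings meet."""
    return peak_flops / mem_bw


def gemm_intensity(n: int, bytes_per_element: int = 2) -> float:
    """Square GEMM: 2n^3 FLOPs over three n x n matrices, each moved once."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


def classify(intensity: float, ridge: float) -> str:
    """Label a workload by comparing its intensity with the ridge point."""
    return "compute bound" if intensity > ridge else "memory bound"


ridge = ridge_point(H100_FLOPS_FP16_TENSOR, H100_MEM_BW)
print(f"ridge point: {ridge:.0f} FLOP/byte")
print("GEMM n=4096:", classify(gemm_intensity(4096), ridge))
print("decode (~1 FLOP/byte):", classify(1.0, ridge))
```

Swapping in another accelerator's peak throughput and bandwidth moves the ridge point, which is often the quickest way to see whether a kernel changes regime across hardware generations.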

How much memory does training a 7B model require? Mixed-precision Adaptive Moment Estimation (Adam) stores 2 bytes (BF16 weights) + 2 bytes (gradients) + 12 bytes (FP32 master weights + momentum + variance) = 16 bytes per parameter. For 7B parameters: 7 $\times 10^9 \times$ 16 bytes = 112 GB. An H100 has 80 GB of HBM, so the model state alone exceeds a single accelerator—before accounting for activations.
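
The same bookkeeping as a minimal Python sketch. The byte counts come from the paragraph above; the constant and function names are hypothetical, and the capacity uses decimal GB to match the prose. The accelerator count is only a lower bound, since it ignores activations and any sharding overhead.

```python
import math

# Bytes of model state per parameter under mixed-precision Adam (from the text).
BYTES_BF16_WEIGHTS = 2
BYTES_GRADIENTS = 2
BYTES_FP32_OPTIMIZER = 12  # FP32 master weights + momentum + variance, 4 bytes each

H100_HBM_GB = 80  # local copy of the capacity, in decimal GB to match the prose


def training_state_gb(num_params: float) -> float:
    """Model-state memory in decimal GB, excluding activations."""
    bytes_per_param = BYTES_BF16_WEIGHTS + BYTES_GRADIENTS + BYTES_FP32_OPTIMIZER
    return num_params * bytes_per_param / 1e9


state_gb = training_state_gb(7e9)
min_accelerators = math.ceil(state_gb / H100_HBM_GB)
print(f"model state: {state_gb:.0f} GB")
print(f"lower bound on H100s (state only): {min_accelerators}")
```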

How much energy would a GPT-3-scale training run cost on the reference cluster? An A100 draws 400 W at TDP. We normalize a run of roughly $3.14 \times 10^{23}$ FLOPs to 1,024 A100s and take the imported reference duration of about 25 wall-clock days. At USD 0.12/kWh: 25 days $\times$ 24 h/day $\times$ 1,024 A100s $\times$ 0.4 kW $\approx$ 245,760 kWh, which costs about USD 29,491 in electricity alone. That is a small fraction of the total cost, which is dominated by accelerator amortization. The original GPT-3 run used V100-era infrastructure; this calculation is an A100-equivalent reference estimate, not a derivation of the run duration from peak FLOPS alone.
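
The electricity estimate reduces to one multiplication chain. The sketch below uses local copies of the constants named in this appendix; the function name and the `NUM_A100S` variable are hypothetical, and the calculation assumes every accelerator draws its full TDP for the whole run.

```python
# Local copies of appendix constants (assumption: not imported from mlsysim).
A100_TDP_W = 400
CLOUD_ELECTRICITY_PER_KWH = 0.12  # USD per kWh
GPT3_TRAINING_DAYS_REF = 25
NUM_A100S = 1024  # cluster size used for normalization in the text


def electricity_cost_usd(num_gpus: int, days: float, tdp_w: float, usd_per_kwh: float) -> float:
    """Electricity cost assuming every accelerator draws TDP for the whole run."""
    gpu_hours = num_gpus * days * 24
    energy_kwh = gpu_hours * tdp_w / 1000
    return energy_kwh * usd_per_kwh


cost = electricity_cost_usd(NUM_A100S, GPT3_TRAINING_DAYS_REF, A100_TDP_W, CLOUD_ELECTRICITY_PER_KWH)
print(f"electricity: about USD {cost:,.0f}")
```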

Accelerator Specifications

The tables in this section capture the peak compute throughput, memory bandwidth, memory capacity, and thermal design power (TDP) for each accelerator generation referenced in the book. These are datasheet numbers—actual workloads rarely sustain peak rates—but they set the ceiling that roofline analysis and utilization calculations measure against. The accelerator tables progress chronologically from Table 1 (Volta) and Table 2 (Turing) through more recent generations.

NVIDIA V100

Table 1: NVIDIA V100 (Volta): The V100 introduced Tensor Cores for mixed-precision training. These specs anchor the “baseline generation” comparisons used in several training-cost estimates.

Constant	Value	Unit
`V100_FLOPS_FP16_TENSOR`	125	TFLOPs/s
`V100_FLOPS_FP32`	15.7	TFLOPs/s
`V100_MEM_BW`	900	GB/s
`V100_MEM_CAPACITY`	32	GiB
`V100_TDP`	300	W

NVIDIA T4

Table 2: NVIDIA T4 (Turing): A low-power inference accelerator widely deployed in cloud serving. Its 70 W TDP makes it a canonical reference point for cost-per-inference calculations.

Constant	Value	Unit
`T4_FLOPS_FP16_TENSOR`	65	TFLOPs/s
`T4_FLOPS_INT8`	130	TFLOPs/s
`T4_MEM_BW`	320	GB/s
`T4_TDP`	70	W

NVIDIA A100

Table 3 lists the Ampere-generation specs that anchor most training examples in the book.

Table 3: NVIDIA A100 (Ampere): The A100 is the most commonly cited accelerator in the book’s training examples. Its 80 GB HBM2e capacity and TF32 Tensor Cores set the reference point for memory-capacity and compute-intensity calculations.

Constant	Value	Unit
`A100_FLOPS_FP16_TENSOR`	312	TFLOPs/s
`A100_FLOPS_FP32`	19.5	TFLOPs/s
`A100_FLOPS_INT8`	624	TFLOPs/s
`A100_FLOPS_TF32`	156	TFLOPs/s
`A100_MEM_BW`	2039	GB/s
`A100_MEM_CAPACITY`	80	GiB
`A100_TDP`	400	W

NVIDIA H100

Table 4 lists the Hopper-generation specs, which add FP8 Tensor Cores and the Transformer Engine and drive the “current generation” estimates.

Table 4: NVIDIA H100 (Hopper): The H100 adds FP8 Tensor Cores and the Transformer Engine. Its specs drive the “current generation” training and serving estimates.

Constant	Value	Unit
`H100_FLOPS_FP16_TENSOR`	989	TFLOPs/s
`H100_FLOPS_FP32_CUDA`	67	TFLOPs/s
`H100_FLOPS_FP8_TENSOR`	1979	TFLOPs/s
`H100_FLOPS_INT8`	1979	TFLOPs/s
`H100_FLOPS_TF32`	494	TFLOPs/s
`H100_MEM_BW`	3.35	TB/s
`H100_MEM_CAPACITY`	80	GiB
`H100_TDP`	700	W

NVIDIA B200

Table 5 provides Blackwell-generation specs used in forward-looking capacity-planning examples.

Table 5: NVIDIA B200 (Blackwell): The B200 represents the next-generation reference point. Its HBM3e bandwidth and FP8 throughput appear in forward-looking capacity-planning examples.

Constant	Value	Unit
`B200_FLOPS_FP16_TENSOR`	2250	TFLOPs/s
`B200_FLOPS_FP8_TENSOR`	4500	TFLOPs/s
`B200_FLOPS_INT4`	9000	TFLOPs/s
`B200_MEM_BW`	8	TB/s
`B200_MEM_CAPACITY`	192	GiB
`B200_TDP`	1000	W

AMD Instinct MI300X

Table 6 provides specifications for the MI300X, often used as the primary alternative to NVIDIA’s H100 in large-scale inference and training clusters.

Table 6: AMD Instinct MI300X: This accelerator features high HBM capacity (192 GB) and bandwidth, making it a common baseline for comparing memory-bound workload performance across vendors.

Constant	Value	Unit
`MI300X_FLOPS_FP16_TENSOR`	1307	TFLOPs/s
`MI300X_MEM_BW`	5.3	TB/s
`MI300X_MEM_CAPACITY`	192	GiB
`MI300X_TDP`	750	W

Google TPU v4 and v6e

Table 7 provides the ASIC-based alternative used when comparing training economics across accelerator families.

Table 7: Google TPU v4 and v6e (Trillium): Tensor Processing Unit (TPU) specifications for comparing ASIC-based training economics across generations. TPU v6e (Trillium) represents a significant jump in compute density and bandwidth.

Constant	Value	Unit
`TPUV4_FLOPS_BF16`	275	TFLOPs/s
`TPUV4_MEM_BW`	1200	GB/s
`TPUV6_FLOPS_BF16`	918	TFLOPs/s
`TPUV6_MEM_BW`	1600	GB/s

CPU and mobile/edge processors

Table 8 grounds the edge and mobile ML examples, where the contrast with data center accelerator throughput illustrates why deployment target shapes every design decision.

Table 8: CPU, Mobile NPU, and Edge Device Specs: These constants ground the edge and mobile ML examples. The contrast between mobile NPU throughput and data center accelerator throughput illustrates why deployment target shapes every design decision.

Constant	Value	Unit
`CPU_FLOPS_FP32`	1	TFLOPs/s
`SYSTEM_MEMORY_BW`	50	GB/s
`MOBILE_NPU_TOPS_INT8`	50	TFLOPs/s
`MOBILE_NPU_MEM_BW`	100	GB/s
`MOBILE_TDP_W`	3	W
`OBJECT_DETECTOR_POWER_W`	2	W
`PHONE_BATTERY_WH`	15	Wh
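
A quick sketch of how the edge constants combine. The values are local copies from the tables in this appendix, and the helper name `battery_hours` is hypothetical. The battery figures are idealized: they assume a constant draw at the listed power and nothing else draining the battery.

```python
# Local copies of table values (assumption: not imported from mlsysim).
PHONE_BATTERY_WH = 15
MOBILE_TDP_W = 3
OBJECT_DETECTOR_POWER_W = 2
MOBILE_NPU_MEM_BW_GBS = 100
H100_MEM_BW_GBS = 3350  # 3.35 TB/s


def battery_hours(battery_wh: float, load_w: float) -> float:
    """Hours of runtime if the load alone drains the battery at constant power."""
    return battery_wh / load_w


hours_at_tdp = battery_hours(PHONE_BATTERY_WH, MOBILE_TDP_W)
hours_detector = battery_hours(PHONE_BATTERY_WH, OBJECT_DETECTOR_POWER_W)
bandwidth_gap = H100_MEM_BW_GBS / MOBILE_NPU_MEM_BW_GBS

print(f"battery life at mobile TDP: {hours_at_tdp:.1f} h")
print(f"battery life running the object detector: {hours_detector:.1f} h")
print(f"H100 vs mobile NPU memory bandwidth: {bandwidth_gap:.1f}x")
```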

Model Specifications

With hardware specifications established, the next question is what workloads run on this hardware. These constants define the parameter counts, per-inference FLOP budgets, and, where applicable, training costs for the reference models used throughout the book. Table 9 captures the model architectures whose values appear in later calculations; when a chapter estimates “how long to train GPT-3,” these are the numbers it plugs in.

Table 9: Reference Model Specifications: Parameter counts and FLOP budgets for the models used in worked examples. Displayed parameter counts alone span nearly five orders of magnitude—from MobileNetV2 on a phone to GPT-3-scale models in the cloud—illustrating how model scale drives every systems decision from memory planning to cluster sizing.

Constant	Value	Unit
`BERT_BASE_FLOPs`	2.2e+10	flop
`BERT_BASE_PARAMS`	1.1e+08	param
`LLAMA3_8B_PARAMS`	8.03e+09	param
`GPT2_HIDDEN_DIM`	1600	-
`GPT2_LAYERS`	48	-
`GPT2_PARAMS`	1.5e+09	param
`GPT3_PARAMS`	1.75e+11	param
`GPT3_TRAINING_DAYS_REF`	25	d
`GPT3_TRAINING_OPS`	3.14e+23	flop
`GPT4_TRAINING_GPU_DAYS`	2.5e+06	-
`RESNET50_FLOPs`	4.1e+09	flop
`RESNET50_PARAMS`	2.56e+07	param
`MOBILENETV2_FLOPs`	3e+08	flop
`MOBILENETV2_PARAMS`	3.5e+06	param
`YOLOV8_NANO_FLOPs`	8.7e+09	flop
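
To make the scale range concrete, the sketch below computes the span of parameter counts and a rough weight footprint per model. The parameter counts are copied from the table; the 2-bytes-per-parameter figure is an assumption (16-bit weights), and the dictionary and function names are hypothetical.

```python
import math

# Parameter counts copied from the reference model table.
PARAMS = {
    "MobileNetV2": 3.5e6,
    "ResNet-50": 2.56e7,
    "BERT-base": 1.1e8,
    "GPT-2": 1.5e9,
    "Llama 3 8B": 8.03e9,
    "GPT-3": 1.75e11,
}
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in 16 bits


def weights_gb(num_params: float) -> float:
    """Weight-only memory footprint in decimal GB."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9


span_orders = math.log10(max(PARAMS.values()) / min(PARAMS.values()))
print(f"parameter counts span {span_orders:.1f} orders of magnitude")
for name, count in PARAMS.items():
    print(f"{name:>12}: {weights_gb(count):10.3f} GB of 16-bit weights")
```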

Knowing what hardware exists and what models run on it is necessary but not sufficient. The energy cost of computation—measured at the level of individual operations and memory accesses—determines whether a design is thermally viable and economically sustainable.

Energy Constants

Energy-per-operation and energy-per-access constants are illustrative reference values, drawn primarily from Horowitz’s 45 nm energy table (Horowitz 2014), and feed the book’s energy-efficiency and sustainability analyses. Table 10 quantifies the energy hierarchy from register access through DRAM, illustrating why the memory wall is fundamentally an energy wall.

Horowitz, Mark. 2014. “1.1 Computing’s Energy Problem (and What We Can Do about It).” 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14. https://doi.org/10.1109/isscc.2014.6757323.

Table 10: Energy per Operation and Access: These constants quantify the energy hierarchy from register file through DRAM and across precision formats. The roughly 64,000$\times$ gap between a 0.01 pJ register access and a 640 pJ DRAM access explains why data reuse (tiling, fusion) dominates ML kernel optimization.

Constant	Value	Unit
`ENERGY_REG_PJ`	0.01	pJ
`ENERGY_SRAM_L1_PJ`	0.5	pJ
`ENERGY_SRAM_L2_PJ`	2	pJ
`ENERGY_DRAM_ACCESS_PJ`	640	pJ
`ENERGY_DRAM_PJ_PER_BYTE`	160	pJ/B
`ENERGY_FLOP_PJ`	4.6	pJ/flop
`ENERGY_FLOP_FP16_PJ`	1.1	pJ/flop
`ENERGY_FLOP_FP32_PJ`	3.7	pJ/flop
`ENERGY_FLOP_INT8_PJ`	0.2	pJ/flop
`ENERGY_MOBILENET_INF_MJ`	0.1	mJ
`NETWORK_5G_ENERGY_PER_MB_MJ`	100	mJ/MB

Energy costs operate at the chip level, but real ML systems also move data across interconnects—between accelerators, across racks, and over wide-area networks. The next table captures the bandwidth assumptions that determine communication overhead in distributed training and serving.

Interconnect and Network Bandwidth

Distributed training and multi-accelerator serving are bottlenecked by interconnect bandwidth as often as by compute. Table 11 captures the bandwidths for accelerator-to-accelerator links (NVLink), cross-node fabrics (InfiniBand), host buses (PCIe), Non-Volatile Memory Express (NVMe) storage, data center Ethernet, and the speed-of-light floor that sets the minimum latency for any network hop.

The bandwidth hierarchy—from NVLink within a node down to 10 GbE across a data center—shapes every distributed system design decision and provides the communication baseline for the cost assumptions that follow.

Table 11: Interconnect and Network Bandwidth: Grouped by link family (accelerator interconnects, cluster fabrics, host buses, storage, and Ethernet), these bandwidths determine gradient synchronization time, pipeline-parallel bubble overhead, and data-loading throughput. The speed of light in fiber sets the physical floor for cross-data center latency.

Constant	Value	Unit
`NVLINK_V100_BW`	300	GB/s
`NVLINK_A100_BW`	600	GB/s
`NVLINK_H100_BW`	900	GB/s
`INFINIBAND_HDR_BW`	200	Gbps
`INFINIBAND_NDR_BW`	400	Gbps
`INFINIBAND_XDR_BW`	800	Gbps
`INFINIBAND_GXDR_BW`	1600	Gbps
`PCIE_GEN4_BW`	32	GB/s
`PCIE_GEN5_BW`	64	GB/s
`NVME_SEQUENTIAL_BW`	7	GB/s
`NETWORK_10G_BW`	10	Gbps
`NETWORK_100G_BW`	100	Gbps
`SPEED_OF_LIGHT_FIBER_KM_S`	200000	km/s
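
A rough sketch of how these bandwidths translate into transfer times. The payload (16-bit gradients for a 7B-parameter model, 14 GB) is a hypothetical example, the helper names are hypothetical, and the model ignores protocol overhead, collective-algorithm factors, and link contention. Note the unit conversion: network links are quoted in Gbps, so they are divided by 8 to get GB/s.

```python
# Local copies of table values (assumption: not imported from mlsysim).
NVLINK_H100_BW_GBS = 900         # GB/s
INFINIBAND_NDR_BW_GBPS = 400     # Gbps
NETWORK_10G_BW_GBPS = 10         # Gbps
SPEED_OF_LIGHT_FIBER_KM_S = 200_000


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def transfer_seconds(payload_gb: float, bw_gb_per_s: float) -> float:
    """Idealized time to push a payload through a link at full bandwidth."""
    return payload_gb / bw_gb_per_s


def fiber_one_way_ms(distance_km: float) -> float:
    """Propagation delay over fiber, the floor for any network hop."""
    return distance_km / SPEED_OF_LIGHT_FIBER_KM_S * 1000


payload_gb = 7e9 * 2 / 1e9  # hypothetical: 2-byte gradients for 7B parameters
links = {
    "NVLink (H100)": NVLINK_H100_BW_GBS,
    "InfiniBand NDR": gbps_to_gb_per_s(INFINIBAND_NDR_BW_GBPS),
    "10 GbE": gbps_to_gb_per_s(NETWORK_10G_BW_GBPS),
}
for name, bw in links.items():
    print(f"{name:>15}: {transfer_seconds(payload_gb, bw):8.3f} s for {payload_gb:.0f} GB")
print(f"1,000 km of fiber, one way: {fiber_one_way_ms(1000):.1f} ms")
```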

Economic Constants

Cost estimates throughout the book depend on the electricity and cloud pricing assumptions in Table 12. These are order-of-magnitude reference values; actual prices vary by region, provider, and contract terms, but the ratios between them are more stable than the absolute numbers.

Table 12: Economic Assumptions: Electricity and egress pricing used in total cost of ownership (TCO) calculations. These are representative cloud rates; on-premises costs differ but the relative magnitudes guide the same design decisions.

Constant	Value	Unit
`CLOUD_ELECTRICITY_PER_KWH`	0.12	dollar/kWh
`CLOUD_EGRESS_PER_GB`	0.09	dollar/GB

Economic constants set the price per unit of compute and data transfer, but they mean little without a sense of the volumes involved. Production ML systems handle millions to billions of requests per day—numbers large enough to be difficult to internalize without concrete reference points.

Scale References

When reasoning about production ML systems, it helps to have concrete scale anchors for “how much traffic does a large service actually handle?” Table 13 provides order-of-magnitude reference points drawn from public disclosures.

Table 13: Production Scale and Data Rate References: These anchors ground the “how big is big?” questions that arise in capacity planning. Waymo data rates illustrate sensor-fusion throughput requirements; Gmail and Google Search volumes calibrate serving-infrastructure estimates.

Constant	Value	Unit
`GMAIL_EMAILS_PER_DAY`	1.21e+11	-
`GOOGLE_SEARCHES_PER_DAY`	8.5e+09	-
`WAYMO_DATA_PER_HOUR_LOW`	1	TB/h
`WAYMO_DATA_PER_HOUR_HIGH`	19	TB/h
`VIDEO_1080P_WIDTH`	1920	-
`VIDEO_1080P_HEIGHT`	1080	-
`VIDEO_BYTES_PER_PIXEL_RGB`	3	B
`VIDEO_FPS_STANDARD`	30	Hz
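
Two quick conversions show how these anchors become rates. The values are local copies of the table entries and the variable names are hypothetical. The per-second figures are daily averages, not peaks, and the video rate is for a single uncompressed stream.

```python
# Local copies of table values (assumption: not imported from mlsysim).
GOOGLE_SEARCHES_PER_DAY = 8.5e9
GMAIL_EMAILS_PER_DAY = 1.21e11
VIDEO_1080P_WIDTH = 1920
VIDEO_1080P_HEIGHT = 1080
VIDEO_BYTES_PER_PIXEL_RGB = 3
VIDEO_FPS_STANDARD = 30
SECONDS_PER_DAY = 86_400

# Average request rates implied by the daily volumes.
searches_per_s = GOOGLE_SEARCHES_PER_DAY / SECONDS_PER_DAY
emails_per_s = GMAIL_EMAILS_PER_DAY / SECONDS_PER_DAY

# Raw data rate of one uncompressed 1080p RGB stream.
raw_1080p_bytes_per_s = (
    VIDEO_1080P_WIDTH * VIDEO_1080P_HEIGHT * VIDEO_BYTES_PER_PIXEL_RGB * VIDEO_FPS_STANDARD
)
raw_1080p_tb_per_hour = raw_1080p_bytes_per_s * 3600 / 1e12

print(f"average searches per second: {searches_per_s:,.0f}")
print(f"average emails per second: {emails_per_s:,.0f}")
print(f"raw 1080p stream: {raw_1080p_bytes_per_s / 1e6:.1f} MB/s, {raw_1080p_tb_per_hour:.2f} TB/h")
```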

All the preceding constants—hardware specs, model parameters, energy costs, economic rates, and scale references—are expressed in specific units. For these values to combine correctly in calculations, every quantity must carry its units explicitly.

Unit Definitions

The constants file defines the base and derived units listed in Table 14 so that all dimensional analysis in the book uses a consistent unit system via the pint library. These are not assumptions per se, but they ensure that every computed value carries its units and that unit-conversion errors are caught automatically.

Table 14: Unit Definitions: Base and derived units used by the pint dimensional-analysis library throughout the book. Every computed value carries its units, so mixing incompatible quantities (for example, adding bytes to FLOP/s) raises an immediate error rather than producing a silent wrong answer.

Constant	Value	Unit
`byte`	byte	-
`KB`	KB	-
`MB`	MB	-
`GB`	GB	-
`TB`	TB	-
`PB`	PB	-
`flop`	flop	-
`GFLOPs`	GFLOPs	-
`TFLOPs`	TFLOPs	-
`ZFLOPs`	ZFLOPs	-
`param`	param	-
`Mparam`	Mparam	-
`Gbps`	Gbps	-
`NS`	NS	-
`US`	US	-
`MS`	MS	-
`second`	second	-
`hour`	hour	-
`day`	day	-
`joule`	joule	-
`watt`	watt	-
`meter`	meter	-
`USD`	dollar	-

With all constants, units, and scale references in place, the next concern is avoiding the most common mistakes practitioners make when applying these numbers to real-world estimates.

Fallacies and Pitfalls

Fallacy: Peak FLOPS predict real-world training throughput.

Datasheet FLOPS are measured under idealized conditions—perfectly aligned matrix dimensions, 100 percent occupancy, zero memory stalls. Real training workloads typically achieve 30–50 percent of peak (measured as Model FLOPS Utilization, or MFU). Using peak FLOPS to estimate training time without an MFU discount produces estimates that are 2–3.3$\times$ too optimistic, leading to missed deadlines and budget overruns.
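
The effect of the MFU discount is easy to see numerically. The sketch below uses local copies of the GPT-3 and A100 constants and a 1,024-accelerator cluster; the function name is hypothetical. It illustrates the discount only and is not a derivation of the book's reference training duration.

```python
# Local copies of appendix constants (assumption: not imported from mlsysim).
GPT3_TRAINING_OPS = 3.14e23       # FLOPs
A100_FLOPS_FP16_TENSOR = 312e12   # FLOP/s
NUM_A100S = 1024
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, num_gpus: int, peak_flops: float, mfu: float) -> float:
    """Wall-clock days when each accelerator sustains `mfu` of its peak throughput."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    sustained = num_gpus * peak_flops * mfu
    return total_flops / sustained / SECONDS_PER_DAY


at_peak = training_days(GPT3_TRAINING_OPS, NUM_A100S, A100_FLOPS_FP16_TENSOR, 1.0)
for mfu in (1.0, 0.5, 0.3):
    days = training_days(GPT3_TRAINING_OPS, NUM_A100S, A100_FLOPS_FP16_TENSOR, mfu)
    print(f"MFU {mfu:.0%}: {days:5.1f} days ({days / at_peak:.1f}x the peak-rate estimate)")
```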

Pitfall: Using FP32 FLOPS when the workload runs in BF16 or FP8.

Modern accelerators have separate datapaths for different precisions, and the peak throughput varies dramatically: the H100 delivers 989 TFLOPS in FP16 tensor operations but only 67 TFLOPS in FP32 CUDA-core operations, a difference of about 15$\times$. Quoting the wrong precision’s peak when computing utilization or estimating training time produces meaningless results. Always match the constant to the precision the workload actually uses.
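
A small sketch shows how the mismatch appears in a utilization number. The two peaks are local copies of the H100 table values. The achieved throughput is a hypothetical measurement chosen only for illustration, and `utilization` is a hypothetical helper.

```python
# Local copies of H100 table values (assumption: not imported from mlsysim).
H100_FLOPS_FP16_TENSOR = 989e12
H100_FLOPS_FP32_CUDA = 67e12

ACHIEVED_FLOPS = 400e12  # hypothetical measured throughput of a 16-bit workload


def utilization(achieved_flops: float, peak_flops: float) -> float:
    """Fraction of the given peak that the workload sustains."""
    return achieved_flops / peak_flops


right = utilization(ACHIEVED_FLOPS, H100_FLOPS_FP16_TENSOR)
wrong = utilization(ACHIEVED_FLOPS, H100_FLOPS_FP32_CUDA)
peak_ratio = H100_FLOPS_FP16_TENSOR / H100_FLOPS_FP32_CUDA

print(f"against the FP16 tensor peak: {right:.0%}")
print(f"against the FP32 CUDA-core peak: {wrong:.0%} (impossible, so the peak is wrong)")
print(f"ratio between the two peaks: {peak_ratio:.1f}x")
```

A utilization above 100 percent is a useful sanity check: it usually means the peak in the denominator belongs to a different precision than the workload.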

Fallacy: Hardware constants are stable enough to hardcode.

Accelerator specifications, cloud pricing, and energy costs change with every hardware generation and contract renegotiation. Hardcoding “the A100 has 2 TB/s bandwidth” in a calculation means the estimate silently rots as hardware evolves. This is precisely why the book uses mlsysim/core/constants.py—updating a single value propagates the correction everywhere.

Pitfall: Treating TDP as actual power consumption.

Thermal Design Power (TDP) is the maximum sustained power draw the cooling system must handle, not the power the accelerator actually consumes under a given workload. Real power consumption varies with utilization, memory access patterns, and clock frequency. Using TDP for energy calculations overestimates costs for inference workloads (which rarely sustain peak power) and may underestimate costs for sustained training workloads on newer hardware with dynamic boost.
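
The sensitivity of an energy estimate to this assumption can be sketched as follows. The TDP and electricity price are local copies of appendix constants. The average-power fractions are hypothetical placeholders, not measurements, and `daily_energy_kwh` is a hypothetical helper.

```python
# Local copies of appendix constants (assumption: not imported from mlsysim).
H100_TDP_W = 700
CLOUD_ELECTRICITY_PER_KWH = 0.12  # USD per kWh


def daily_energy_kwh(tdp_w: float, avg_fraction_of_tdp: float) -> float:
    """Energy per day when average draw is a given fraction of TDP."""
    return tdp_w * avg_fraction_of_tdp * 24 / 1000


# Hypothetical average draws, expressed as fractions of TDP.
for label, fraction in (("TDP assumed", 1.0), ("hypothetical 50% draw", 0.5)):
    kwh = daily_energy_kwh(H100_TDP_W, fraction)
    cost = kwh * CLOUD_ELECTRICITY_PER_KWH
    print(f"{label:>22}: {kwh:5.1f} kWh/day, USD {cost:.2f}/day per accelerator")
```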

Summary

Taken together, the reference values above turn the appendix from a catalogue into a set of checks for quantitative reasoning.

Key Takeaways: Auditing the book’s constants

Every example traces to a constant: Every quantitative example in this book traces back to a specific constant in mlsysim/core/constants.py. This appendix exposes the main constants used in the worked examples so the numbers that underpin the book’s reasoning are open to audit, verification, and update.
Hardware specs are ceilings: Hardware specs (peak FLOPS, memory bandwidth, TDP) set ceilings, not guarantees. Real utilization is typically 30–50 percent of peak for training workloads; using peak values without discounting produces dangerously optimistic estimates.
Ratios are durable: The ratios between constants are often more stable and informative than the absolute values. The ridge point (FLOPS/bandwidth), the memory-per-parameter cost (16 bytes for mixed-precision Adam), and the order-of-magnitude energy hierarchy (tens of thousands of times between a register access and a DRAM access) persist across hardware generations.
A single source of truth prevents drift: A single source of truth for constants eliminates the most common source of inconsistency in quantitative textbooks: the same number quoted differently in different chapters.