Machine Foundations
Purpose
What reference numbers and physical laws should every ML systems engineer carry into design decisions?
In ML systems, performance failures often masquerade as software problems: a training step mysteriously slows down, a serving stack misses its latency objective, or an accelerator upgrade fails to deliver the expected speedup. Many of these surprises are not bugs but the predictable consequences of physics (latency, bandwidth, energy) and architecture (memory hierarchy, precision, parallel scaling). This appendix collects the reference numbers and compact models for quick quantitative reasoning: numbers to know, roofline analysis, dimensional analysis, scaling laws, and precision trade-offs. In D·A·M terms, these numbers define the machine axis, the physical limits that algorithm choices and data movement must respect.
How to Use This Appendix
This appendix is designed as a reference. When diagnosing performance issues, use it to translate a vague symptom (“it’s slow”) into a specific constraint (“memory bound at batch size one”), and then choose the lever that can actually move that constraint.
Conventions used here follow the book-wide notation (for example, \(B\) is reserved for batch size and \(\text{BW}\) for bandwidth).
- Sanity-check feasibility: Start with section 1.1 for order-of-magnitude numbers.
- Diagnose the dominant ceiling: Use the Roofline Model in section 1.2.1 to decide whether the workload is compute bound or memory bound.
- Reason about scaling limits: Use Amdahl’s and Gustafson’s Laws in section 1.2.3 to understand why adding accelerators may not reduce time-to-train.
- Choose the right precision: Use section 1.4.1 to reason about FP32 vs. BF16/FP16 vs. INT8 as a systems trade-off.
- Cross-reference for depth: When you want the full narrative, jump back to Hardware Acceleration, Model Training, and Model Serving.
Numbers to Know
Just as Jeff Dean’s “Latency Numbers Every Programmer Should Know”1 shaped a generation of systems engineers, these reference numbers provide the order-of-magnitude intuition essential for ML systems design. While absolute values evolve with hardware generations, the ratios between categories remain remarkably stable. Memorize the relationships; use the specific numbers as sanity checks.
1 Jeff Dean: A Google Senior Fellow and one of the architects of Google’s distributed systems infrastructure, including MapReduce, BigTable, and TensorFlow. His latency numbers, originally presented with Peter Norvig around 2010, became a canonical reference for systems engineers. The numbers have been updated over the years as hardware evolved, but the hierarchy of latencies remains remarkably stable; Colin Scott’s interactive visualization shows the latency hierarchy across hardware generations (Scott 2012).
Systems Perspective 1.1: Three numbers that matter most
- Energy ratio: DRAM access costs ~581× more energy than an FP16 FLOP. This is why arithmetic intensity is everything.
- Training-state footprint: Model weights (2 bytes FP16) + gradients (2 bytes FP16) + master weights (4 bytes FP32) + optimizer states for Adaptive Moment Estimation (Adam) at 8 bytes. That totals 16 bytes per parameter, so a 7B model needs 112 GB just to start training.
- Fiber propagation limit: Light travels about 200 km/ms in fiber. Cross-country round-trip latency is ~40 ms. No optimization can reduce this; it is physics.
The invariants: Numbers that will not change
These relationships are governed by physics or arithmetic—they will still be true in 2035.
Speed of light tax
Table 1 shows the irreducible latency floor for any distributed system.
| Distance | Round-Trip Latency | Implication |
|---|---|---|
| Same data center | ~1 ms | Distributed training feasible |
| Cross-country (US) | ~40 ms | Edge needed for <100 ms apps |
| Cross-Atlantic | ~60 ms | CDN required for global users |
| Cross-Pacific | ~100 ms | Data locality is critical |
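
The latency floor follows directly from the ~200 km/ms fiber speed quoted earlier. The sketch below is a minimal illustration; the path lengths are hypothetical round figures, and real routes are longer than the geographic distance and add switching delay, so measured latency sits above this bound.

```python
# Signal speed in optical fiber, as quoted in the text (~200 km per millisecond).
FIBER_SPEED_KM_PER_MS = 200.0


def fiber_rtt_floor_ms(path_km: float) -> float:
    """Lower bound on round-trip time for a one-way fiber path of path_km."""
    return 2.0 * path_km / FIBER_SPEED_KM_PER_MS


# Hypothetical one-way path lengths, chosen only for illustration.
for label, km in [("metro, 100 km", 100), ("cross-country, 4,000 km", 4000)]:
    print(f"{label}: at least {fiber_rtt_floor_ms(km):.1f} ms round trip")
```

With an assumed 4,000 km path, the floor comes out at 40 ms, matching the cross-country row.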
Energy hierarchy
Table 2 quantifies the energy cost of data movement vs. computation—the fundamental reason why arithmetic intensity dominates ML performance optimization.2
2 Energy Hierarchy Source: Energy numbers from Horowitz (2014), “Computing’s Energy Problem” (ISSCC, 45nm process). While absolute values scale with process node, the ratios between memory access and compute remain remarkably stable because wire capacitance (distance) dominates.
| Relationship | Ratio | Why It is Stable |
|---|---|---|
| DRAM access vs. FP16 compute | ~581× | Wire capacitance scales with distance |
| FP32 vs. INT8 energy | ~18× | Bit width determines switching energy |
| FP32 vs. FP16 energy | ~3.4× | Narrower arithmetic reduces switching and datapath energy |
| L1 SRAM vs. register | ~50× | Distance to ALU |
Memory hierarchy
Table 3 shows why the memory hierarchy is uneven: nearby on-package hops can differ by only a few times, while off-chip, storage, and network tiers introduce orders-of-magnitude jumps.
| Relationship | Ratio | Why It Persists |
|---|---|---|
| Accelerator memory (HBM) vs. register | ~1000× slower | On-chip vs. off-chip |
| SSD vs. register | ~300,000× slower | Electrical vs. mechanical/flash |
| Network vs. local memory | ~16× slower | Speed of light + switching |
| Accelerator memory BW vs. CPU↔︎Accelerator link | ~52× faster | Architectural investment priority |
Scaling laws
Table 4 collects the arithmetic relationships that govern memory and compute requirements for training and inference.3
3 Training Memory (Adam): The 16 bytes/parameter rule assumes mixed-precision training with Adam. ZeRO optimization can reduce per-accelerator memory by sharding optimizer states across accelerators, but the total memory across all accelerators remains ~16\(\times\) parameters.
| Rule | Formula | Example |
|---|---|---|
| Inference memory (FP16) | 2 bytes\(\times\) parameters | 7B params → 14 GB |
| Inference memory (INT8) | 1 byte\(\times\) parameters | 7B params → 7 GB |
| Training memory (Adam) | 16 bytes\(\times\) parameters | 7B params → 112 GB |
| Inference FLOPs (transformer) | ~2\(\times\) parameters per token | 7B model → ~14 GFLOPs/token |
| Training FLOPs | ~6\(\times\) parameters\(\times\) tokens | 7B on 1T tokens → \(4 \times 10^{22}\) FLOPs |
| Data center vs. edge compute | ~28.3× | Compute per watt\(\times\) power budget |
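
The memory and FLOP rules in table 4 are simple enough to keep as helper functions. The following is a minimal sketch; the function names are illustrative, and the 16 bytes/parameter figure assumes mixed-precision training with Adam as described in the footnote.

```python
# Bytes per parameter for inference, by storage precision.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}

# 2 (FP16 weights) + 2 (FP16 gradients) + 4 (FP32 master weights) + 8 (Adam states)
TRAINING_BYTES_PER_PARAM = 16


def inference_memory_gb(params: float, precision: str = "fp16") -> float:
    """Weight memory only; KV cache and activations are not included."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def training_memory_gb(params: float) -> float:
    """Training-state memory only; activations are not included."""
    return params * TRAINING_BYTES_PER_PARAM / 1e9


def inference_flops_per_token(params: float) -> float:
    return 2 * params


def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens


P = 7e9  # the 7B-parameter example from the table
print(f"FP16 inference: {inference_memory_gb(P):.0f} GB")
print(f"INT8 inference: {inference_memory_gb(P, 'int8'):.0f} GB")
print(f"Adam training:  {training_memory_gb(P):.0f} GB")
print(f"Training on 1T tokens: {training_flops(P, 1e12):.1e} FLOPs")
```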
Latency budgets: The nonnegotiables
These budgets are set by physics (safety) or psychology (human perception)—not by engineering choice. Unlike hardware specs that improve each generation, these are constraints your system must meet (table 5).
| Application | Budget | Constraint |
|---|---|---|
| Autonomous braking | <10 ms | At 100 km/h, 10 ms = 28 cm of travel |
| Voice assistant | <100 ms | Human perception of “instant” |
| Web search | <200 ms | User patience threshold |
| Video streaming | <1 s | Buffer tolerance |
| Batch training | hours–days | Throughput dominates latency |
Current hardware reference (c. 2024)
These numbers reflect the current generation. Use them for back-of-envelope calculations, but expect them to improve ~2\(\times\) every 2–3 years.
Memory latency and bandwidth
Table 6 captures the full latency and bandwidth hierarchy for current-generation hardware.
| Level | Latency | Bandwidth |
|---|---|---|
| Register | ~0.3 ns | — |
| L1 Cache | ~1 ns | — |
| L2 Cache | ~4 ns | — |
| GPU HBM3 | ~300 ns | 3.4 TB/s |
| PCIe Gen5 (CPU↔︎GPU) | ~1000 ns | 64 GB/s |
| CPU DRAM | ~100 ns | 50 GB/s |
| InfiniBand (network) | ~5000 ns | 50 GB/s |
| NVMe SSD | ~100000 ns | 7 GB/s |
Compute throughput
Table 7 shows the raw throughput available at each tier of the deployment hierarchy.
| Platform | FP16/BF16 | INT8/FP8-class | Power |
|---|---|---|---|
| Data center GPU (H100) | 989 TFLOP/s | 1979 TFLOP/s (FP8 peak) | 700 W |
| Data center GPU (A100) | 312 TFLOP/s | 624 TOPS | 400 W |
| Mobile NPU | — | 35 TOPS | 3–5 W |
Roofline ridge points
Table 8 defines the arithmetic intensity thresholds that determine whether a workload is memory bound or compute bound.
| Accelerator | Ridge Point | Implication |
|---|---|---|
| A100 (FP16) | 153 FLOP/byte | Below → memory-bound |
| H100 (FP16) | 295 FLOP/byte | Higher bar for compute-bound |
Systems Perspective 1.2: A note on terminology: GPUs and accelerators
Knowing the numbers is only the first step. The real power comes from having compact models that tell you which number matters for your specific bottleneck. The next section provides exactly these diagnostic tools—starting with the Roofline Model, which translates raw hardware specs into actionable performance ceilings.
Physics of Computing
Raw hardware specs—TFLOP/s, TB/s, watt budgets—are necessary but insufficient for performance reasoning. Without compact analytical models, an engineer cannot distinguish a compute-bound workload from a memory-bound one, or predict whether doubling GPUs will halve training time. The models in this section provide exactly these diagnostic tools.
Systems Perspective 1.3: Why this matters
The roofline model
The Roofline Model (Williams et al. 2009) bounds how fast a workload can run on a given hardware target. The answer depends on whether you run out of compute or memory bandwidth first.
Every operation has an arithmetic intensity: the ratio of computations performed to bytes moved from memory. Matrix multiplication has high arithmetic intensity because each loaded element is reused many times. Element-wise operations like rectified linear unit (ReLU) have low intensity because each operation loads a number, performs one computation, and writes it back. As figure 1 illustrates, each workload is bounded by either memory bandwidth or compute throughput, and its arithmetic intensity determines which ceiling it hits first.
Two ratios define the model: arithmetic intensity characterizes the workload, and the ridge point characterizes the hardware’s balance. If a workload’s intensity falls below the ridge point, it is memory bound (sloped region). If above, it is compute bound (flat region). \[ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{bytes accessed}} \]
\[ \text{Ridge Point} = \frac{\text{Peak FLOP/s}}{\text{Memory Bandwidth}} \]
Systems Perspective 1.4: Batch size controls arithmetic intensity
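- FLOPs: For a linear layer with input width \(d_{\text{in}}\), output width \(d_{\text{out}}\), and batch size \(B\), the work is \(2 \times B \times d_{\text{in}} \times d_{\text{out}}\) (each multiply-add counts as two FLOPs).
- Bytes: Weights are loaded once per batch: \(d_{\text{in}} \times d_{\text{out}} \times \text{bytes}_{\text{precision}}\)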
Doubling the batch size \(B\) doubles FLOPs while keeping weight loads constant—directly increasing arithmetic intensity. This is why inference serving batches requests: batch size 1 is almost always memory bound, while batch size 64+ can approach the compute ceiling.
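
The batch-size effect can be sketched numerically. This is a simplified model that, like the formulas above, counts only weight traffic; activation reads and writes are ignored, so real intensities are somewhat lower. The 4096-wide layer is a hypothetical example, and the ridge point uses the A100 figures from this appendix.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_value: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte), counting only weight loads."""
    flops = 2 * batch * d_in * d_out                # multiply + add per weight per sample
    weight_bytes = d_in * d_out * bytes_per_value   # weights read once per batch
    return flops / weight_bytes


A100_RIDGE = 312e12 / 2.04e12  # about 153 FLOP/byte

for b in (1, 8, 64, 256):
    ai = linear_layer_intensity(b, d_in=4096, d_out=4096)
    share = min(1.0, ai / A100_RIDGE)
    print(f"B={b:>3}: {ai:6.1f} FLOP/byte, at most {share:.0%} of peak compute")
```

Under these assumptions the intensity in FP16 equals the batch size, so the layer dimensions cancel and only \(B\) and the precision matter.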
A concrete example: The A100 analysis
Consider an NVIDIA A100 GPU with FP16 Tensor Core performance of 312 TFLOP/s and HBM2e bandwidth of 2.04 TB/s. The ridge point is 312 TFLOP/s/2.04 TB/s = 153 FLOP/byte (the Tera prefixes cancel, yielding FLOP/byte).
Two common operations fall on opposite sides of that ridge. General matrix multiply (GEMM) on two square FP16 matrices of size \(N = 4096\) performs \(2N^3\) FLOPs while moving \(3N^2\) values of 2 bytes each (two inputs and one output, each touched once), for an arithmetic intensity of \(N/3 \approx 1365\) FLOP/byte. Since 1365 FLOP/byte > 153 FLOP/byte, the operation is compute bound. An element-wise ReLU on a square tensor of the same size performs one operation per element against a 2-byte read and a 2-byte write, for an intensity of only 0.25 FLOP/byte. Since 0.25 FLOP/byte ≪ 153 FLOP/byte, the operation is severely memory bound, achieving only about 0.16 percent of peak TFLOP/s. This contrast explains why modern frameworks fuse operations: combining ReLU with the preceding MatMul avoids writing intermediate results to memory, effectively increasing arithmetic intensity.
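
The A100 comparison can be reproduced with the standard roofline bound, attainable throughput = min(peak, intensity × bandwidth). The sketch below assumes FP16 storage and idealized data movement (each matrix or tensor crosses the memory bus exactly once); the function names are illustrative.

```python
PEAK_FLOPS = 312e12   # A100 FP16 Tensor Core peak, FLOP/s
MEM_BW = 2.04e12      # A100 HBM2e bandwidth, bytes/s
BYTES_FP16 = 2


def attainable_flops(intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(PEAK_FLOPS, intensity * MEM_BW)


def gemm_intensity(n: int) -> float:
    flops = 2 * n ** 3                      # n^3 multiply-adds
    bytes_moved = 3 * n * n * BYTES_FP16    # read A, read B, write C
    return flops / bytes_moved


def relu_intensity() -> float:
    return 1 / (2 * BYTES_FP16)             # one op per element; one read, one write


ridge = PEAK_FLOPS / MEM_BW
for name, ai in [("GEMM 4096", gemm_intensity(4096)), ("ReLU", relu_intensity())]:
    bound = "compute" if ai >= ridge else "memory"
    frac = attainable_flops(ai) / PEAK_FLOPS
    print(f"{name}: {ai:.2f} FLOP/byte, {bound} bound, {frac:.2%} of peak")
```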
Dimensional analysis
The Roofline Model helps diagnose where a bottleneck lies. Before applying any performance equation, however, it must be verified as physically meaningful. Dimensional analysis provides this sanity check: any valid equation must be dimensionally homogeneous—every term must resolve to the same units. If they do not, the equation contains an error.
Consider the iron law of ML systems (principle 3) introduced in Iron Law of ML Systems: \[ T = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}} \]
We verify correctness by confirming that every term resolves to time (seconds): \[ T [s] = \underbrace{ \frac{D_{\text{vol}} [\text{bytes}]}{\text{BW} [\text{bytes/s}]} }_{\text{seconds}} + \underbrace{ \frac{O [\text{FLOPs}]}{R_{\text{peak}} [\text{FLOP/s}] \cdot \eta_{\text{hw}} [1]} }_{\text{seconds}} + \underbrace{ L_{\text{lat}} [s] }_{\text{seconds}} \]
- Data term: \(\frac{\text{bytes}}{\text{bytes/s}} = \text{bytes} \times \frac{\text{s}}{\text{bytes}} = \mathbf{s}\)
- Compute term: \(\frac{\text{FLOPs}}{\text{FLOP/s}} = \text{FLOPs} \times \frac{\text{s}}{\text{FLOP}} = \mathbf{s}\)
- Latency term: \(L_{\text{lat}}\) is already in seconds.
The equation is physically consistent. Apply this technique to any systems equation: if the dimensions do not match, the formula is wrong. “FLOPs” and “Bandwidth” cannot be traded directly because they have different units. Any such trade-off must convert through Time, which is precisely what the iron law quantifies.
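
The same check can be mechanized. The following is a minimal sketch rather than a full units library: a unit is represented as a dictionary of exponents, and the helper names `units` and `divide` are hypothetical.

```python
def units(**exponents: int) -> dict:
    """Build a unit as a map from base unit to exponent, e.g. units(byte=1, s=-1)."""
    return {k: v for k, v in exponents.items() if v != 0}


def divide(num: dict, den: dict) -> dict:
    """Divide two units by subtracting exponents; cancelled units are dropped."""
    result = dict(num)
    for unit, exp in den.items():
        result[unit] = result.get(unit, 0) - exp
    return {k: v for k, v in result.items() if v != 0}


SECONDS = units(s=1)

data_term = divide(units(byte=1), units(byte=1, s=-1))     # bytes / (bytes/s)
compute_term = divide(units(flop=1), units(flop=1, s=-1))  # FLOPs / (FLOP/s); eta is unitless
latency_term = units(s=1)

print("iron law homogeneous:", data_term == compute_term == latency_term == SECONDS)

# A malformed term, FLOPs divided by bandwidth, does not reduce to seconds.
bad_term = divide(units(flop=1), units(byte=1, s=-1))
print("FLOPs / bandwidth has units:", bad_term)
```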
The fundamental limits of scaling across multiple devices are the subject of section 1.2.3.
Amdahl’s Law and Gustafson’s Law
Parallelization is the primary tool for scaling ML, but its limits depend on how you scale. These two laws frame the fundamental tension in parallel computing. Amdahl’s Law is the pessimist’s view, governing how much faster a fixed task can run (optimizing latency). Gustafson’s Law is the optimist’s view, governing how much more work we can do in the same time (optimizing throughput).
Strong scaling (Amdahl’s Law)
Strong scaling measures how much faster a fixed-size problem runs as processors are added.
Amdahl’s Law (Amdahl 1967) states that the speedup is limited by the serial portion of the task.4 If a fraction \(s\) of your task is serial (cannot be parallelized) and \(p = 1-s\) is parallelizable, the maximum speedup with \(n\) processors is: \[ \text{Speedup}(n) = \frac{1}{s + \frac{1-s}{n}} \]
4 Gene Amdahl (1922–2015): A legendary computer architect at IBM, where he was the chief architect of the System/360. He later founded Amdahl Corporation to compete with IBM in the mainframe market.
As \(n \to \infty\), the term \(\frac{1-s}{n} \to 0\), and the speedup converges to \(1/s\).
To see Amdahl’s Law in action, suppose 5 percent of a training step is serial overhead (for example, Python global interpreter lock (GIL), kernel launch latency) and 95 percent is parallelizable matrix math:
- With \(n=1\), speedup is 1.
- With \(n=\) 8, speedup is 1/(0.05 + 0.95/8) ≈ 5.9×.
- With \(n \to \infty\), speedup is capped at 1/0.05 = 20×.
No matter how many accelerators are added, this fixed workload cannot run faster than 20×.
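
A short sketch of the formula makes the saturation visible. The function name is illustrative, and the 5 percent serial fraction is the example value used above.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Strong-scaling speedup on n processors for a fixed-size problem."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


s = 0.05
for n in (1, 8, 64, 512, 4096):
    print(f"n={n:>4}: {amdahl_speedup(s, n):5.2f}x (limit {1 / s:.0f}x)")
```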
Weak scaling (Gustafson’s Law)
Weak scaling measures how much larger a problem can become while holding runtime fixed as processors are added.
This is the regime in which large language models are trained. Rather than using 1,000 accelerators to finish a small training job in milliseconds, practitioners use them to train on a dataset 1,000\(\times\) larger in a reasonable time.
Gustafson’s Law (Gustafson 1988) models this “scaled speedup”:5 \[ \text{Scaled Speedup}(n) = n - s(n - 1) \]
5 John Gustafson: A computer scientist known for his work in parallel computing and for introducing the Unum (universal number) format. His law was a direct response to the perceived “limits” of Amdahl’s Law when applied to massive scale.
Here, the parallel part of the workload grows linearly with \(n\), while the serial part \(s\) remains fixed.
Using the same 5 percent serial overhead (\(s\) = 0.05), Gustafson’s Law tells a very different story:
- With \(n=1\), speedup is 1.
- With \(n=\) 8, Scaled Speedup is 8 − 0.05 \(\times\) (7) = 8 − 0.35 = 7.65×.
- With \(n=\) 1000, Scaled Speedup is 1000 − 0.05 \(\times\) (999) ≈ 950×.
In weak scaling, efficiency remains high because the useful work (training the model) scales up to dwarf the fixed overheads.
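
The scaled-speedup formula can be sketched the same way, with parallel efficiency (speedup divided by \(n\)) shown alongside. The function name is illustrative.

```python
def gustafson_scaled_speedup(serial_fraction: float, n: int) -> float:
    """Weak-scaling speedup when the parallel work grows with n."""
    return n - serial_fraction * (n - 1)


s = 0.05
for n in (1, 8, 1000):
    speedup = gustafson_scaled_speedup(s, n)
    print(f"n={n:>4}: {speedup:7.2f}x, efficiency {speedup / n:.1%}")
```

Efficiency approaches \(1 - s\) as \(n\) grows, rather than collapsing toward zero as it does under strong scaling.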
The same scaling lens turns model size, data size, hardware count, and utilization into a single training-time estimate.
Napkin Math 1.1: The training time equation
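\[ T = \frac{6 \times P \times D}{N_{\text{GPU}} \times R_{\text{peak}} \times \eta_{\text{hw}}} \]
- Training FLOP factor: The factor \(6\) combines the forward pass (\(2P\) FLOPs per token) and the backward pass (\(4P\) FLOPs per token), so training on \(D\) tokens costs about \(6PD\) FLOPs.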
- \(P\): Number of model parameters.
- \(D\): Number of training tokens.
- \(N_{\text{GPU}}\): Number of GPUs.
- \(R_{\text{peak}}\): Peak FLOP/s of one accelerator.
- \(\eta_{\text{hw}}\): Hardware utilization, typically 30 percent–50 percent in this training estimate.
Example: Training a 1B-parameter model on 20B tokens using one A100 (312 TFLOP/s) at 40 percent utilization.
\(\text{Total FLOPs} = 6 \times 1 \times 10^{9} \times 2 \times 10^{10} = 1.2 \times 10^{20} \text{ FLOPs}\)
\(\text{Throughput} = 1 \times (3.12 \times 10^{14}) \times 0.40 = 1.248 \times 10^{14} \text{ FLOP/s}\)
\(T = \frac{1.2 \times 10^{20}}{1.248 \times 10^{14}} \approx 961,538.5 \text{ seconds} \approx 16,025.6 \text{ minutes}\)
That is about 11.1 days on a single accelerator.
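
The estimate is easy to wrap in a function for quick what-if checks. This is a minimal sketch with an illustrative function name; it assumes utilization stays constant as accelerators are added, which real multi-device runs rarely achieve.

```python
def training_time_seconds(params: float, tokens: float, n_gpu: int,
                          peak_flops: float, utilization: float) -> float:
    """Estimated wall-clock time: total training FLOPs over sustained FLOP/s."""
    total_flops = 6 * params * tokens
    sustained_flops_per_s = n_gpu * peak_flops * utilization
    return total_flops / sustained_flops_per_s


# 1B parameters, 20B tokens, one A100 at 40 percent utilization.
t = training_time_seconds(1e9, 20e9, n_gpu=1, peak_flops=312e12, utilization=0.40)
print(f"{t:,.1f} s = {t / 86400:.1f} days")

# Same job on 8 accelerators, optimistically assuming unchanged utilization.
t8 = training_time_seconds(1e9, 20e9, n_gpu=8, peak_flops=312e12, utilization=0.40)
print(f"{t8 / 86400:.2f} days on 8 accelerators")
```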
Before moving from scaling laws to queueing, pause on the consequences of these models for concrete design choices.
Checkpoint 1.1: Check your understanding: Performance models
A new accelerator doubles compute throughput but keeps memory bandwidth the same. For a workload that is memory-bound on the current hardware, how much speedup do you expect? What about a compute-bound workload?
Your training pipeline has 10 percent serial overhead. Using Amdahl’s Law, what is the maximum possible speedup regardless of how many accelerators you add? Using Gustafson’s Law with 256 accelerators, what is the scaled speedup?
An inference service must handle 500 queries per second (QPS) at 100 ms latency. Using Little’s Law, how many concurrent requests must the system support? If each request needs 2 GB of KV cache memory, what is the minimum accelerator memory required?
Little’s Law
For capacity planning in inference systems, Little’s Law (Little 1961) relates concurrency (\(Q_{\text{req}}\)), arrival rate (\(\lambda_{\text{arr}}\)), and time in system (\(T_{\text{lat}}\)):6 \[ Q_{\text{req}} = \lambda_{\text{arr}} \times T_{\text{lat}} \]
6 John Little: An Institute Professor at MIT and a pioneer in the field of operations research. His law, proved in 1961, is fundamental to queuing theory and is used across fields from manufacturing to computer network analysis.
To see this in practice, consider sustaining 1,000 QPS with 50 ms average latency. The law tells us the system must support 1000 \(\times\) 0.05 s = 50 concurrent requests.
This directly determines how to size inference worker pools. If serving one request requires 1 GB of temporary memory (KV cache, activations), handling 50 concurrent requests requires 50 GB of memory. If the accelerator has only 24 GB, the system is physically limited to 24 concurrent requests. With maximum concurrency \(N_{\text{max}} = 24\), throughput is capped at \(N_{\text{max}}/T_{\text{lat}} = 24 / 0.05 = 480 \text{ QPS}\), regardless of how many requests arrive.
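
The sizing argument reduces to two small functions. This is a sketch under the same simplification as the text: all accelerator memory is treated as available for per-request state. The function names are illustrative.

```python
def required_concurrency(arrival_qps: float, latency_s: float) -> float:
    """Little's Law: requests in flight = arrival rate x time in system."""
    return arrival_qps * latency_s


def max_throughput_qps(memory_gb: float, gb_per_request: float,
                       latency_s: float) -> float:
    """Throughput ceiling once memory caps the number of in-flight requests."""
    max_concurrent = int(memory_gb // gb_per_request)
    return max_concurrent / latency_s


print("in flight at 1,000 QPS, 50 ms:", required_concurrency(1000, 0.050))
print("QPS ceiling with 24 GB, 1 GB/request:", max_throughput_qps(24, 1, 0.050))
```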
These physics-based models—Roofline, Amdahl, Gustafson, and Little—diagnose where bottlenecks lie. Translating those diagnoses into actionable optimizations, however, requires understanding the concrete hardware structures that impose them: caches, memory buses, and interconnects.
Computer Architecture Essentials
A GPU advertises 1,000 TFLOP/s, yet your kernel achieves only 30 TFLOP/s. The missing 97 percent is not a software bug—it is the cost of moving data through a memory hierarchy that spans five orders of magnitude in latency. While physics sets theoretical performance bounds, computer architecture defines the machinery that determines how close a real workload can get. The following discussion covers the latency, bandwidth, and energy trade-offs that shape system design.
Latencies every programmer should know
The first step in systems intuition is understanding the cost of distance. Table 9 quantifies how long the processor waits for data from different levels of the memory hierarchy. If accessing a register is like picking up a pencil from your desk, fetching from HBM is walking across the office, and fetching from disk is flying to the moon.
| Component | Latency (ns) | Cycles (Approx) | Relative “Distance” |
|---|---|---|---|
| Register | ~0.3 ns | 1 cycle | 10 seconds |
| L1 Cache | ~1 ns | 3–4 cycles | 33.3 seconds |
| L2 Cache | ~4 ns | 12 cycles | 2.2 minutes |
| HBM3 (GPU Memory) | ~300 ns | 1,000 cycles | 2.8 hours |
| NVLink (GPU-GPU) | ~500 ns | 1,500 cycles | 4.6 hours |
| PCIe (CPU-GPU) | ~1000 ns | 3,000 cycles | 9.3 hours |
| InfiniBand (Network) | ~5000 ns | 15,000 cycles | 1.9 days |
| SSD (NVMe) | ~100000 ns | 300,000 cycles | 38.6 days |
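
The "Relative Distance" column rescales time so that a 0.3 ns register access reads as 10 seconds. A minimal sketch of that conversion, using the latencies from table 9 (the function name is illustrative):

```python
# Human-scale seconds per real nanosecond: 0.3 ns maps to 10 s.
SCALE = 10.0 / 0.3

LATENCY_NS = {
    "Register": 0.3, "L1 cache": 1, "L2 cache": 4, "HBM3": 300,
    "NVLink": 500, "PCIe": 1000, "InfiniBand": 5000, "NVMe SSD": 100000,
}


def human_scale(latency_ns: float) -> str:
    """Express a latency on the rescaled clock, in the largest fitting unit."""
    seconds = latency_ns * SCALE
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


for name, ns in LATENCY_NS.items():
    print(f"{name:<11} {ns:>8} ns  ->  {human_scale(ns)}")
```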
The AI hardware cheat sheet (modern reference)
While latency tells us how long we wait for the first byte, bandwidth tells us how many bytes follow. Table 10 provides the constants for back-of-the-envelope “Roofline” calculations. These represent the “standard units of compute” for the current era of machine learning.
| Spec | NVIDIA H100 (SXM) | Google TPU v5p | System Impact |
|---|---|---|---|
| FP16/BF16 Peak | 989 TFLOP/s | 459 TFLOP/s | The “Speed Limit” (\(R_{\text{peak}}\)) |
| Memory Bandwidth | 3.35 TB/s | 2.76 TB/s | The “Width of the Pipe” (\(\text{BW}\)) |
| HBM Capacity | 85 GB | 102 GB | Max Model Size (\(P\))/Batch Size (\(B\)) |
| L2/SRAM Cache | 50 MB | ~100 MB | Critical for Operator Fusion |
| Interconnect | 900 GB/s (NVLink) | 1200 GB/s (Inter-Chip Interconnect, ICI) | Determines Model Parallelism Scaling |
The memory hierarchy
Computer systems use a hierarchy because no single technology provides both high capacity and low latency. Examine the pyramid in figure 2 to see how each level balances this trade-off: every technique that keeps data higher in the pyramid (registers/cache) directly improves performance.
The memory hierarchy is the fundamental physical constraint of machine learning systems. Table 11 consolidates the physical properties—latency, bandwidth, and energy—across the entire stack.
| Layer | Technology | Latency | Bandwidth | Energy (per 32b) |
|---|---|---|---|---|
| Registers | Flip-Flops | ~0.3 ns | — | 0.01 pJ |
| L1 Cache | SRAM | ~1 ns | — | 0.5 pJ |
| L2 Cache | SRAM | ~4 ns | — | 2 pJ |
| Memory (Local) | HBM3 | ~300 ns | 3350 GB/s | 640 pJ |
| Interconnect | NVLink 4.0 | ~500 ns | 900 GB/s | ~640 pJ |
| Host Link | PCIe Gen5 | ~1000 ns | 64 GB/s | ~640 pJ |
| System RAM | DDR5 | ~100 ns | 50 GB/s | ~640 pJ |
| Network (Fabric) | InfiniBand NDR | ~5000 ns | 50 GB/s | ~10,000 pJ |
| Storage (Local) | NVMe SSD | ~100000 ns | 7 GB/s | ~5,000 pJ |
The hierarchy’s energy costs reveal why data movement dominates modern system design.
Napkin Math 1.2: The high cost of data movement
Bandwidth vs. latency
Bandwidth (throughput) and latency (delay) are distinct constraints. Total transfer time follows: \[ T = L_{\text{lat}} + \frac{D_{\text{vol}}}{\text{BW}} \]
The crossover point is the transfer size where the two terms are equal: \[ D_{\text{vol,cross}} = L_{\text{lat}} \times \text{BW} \]
For transfers below \(D_{\text{vol,cross}}\) (for example, single-token inference), latency dominates. For transfers above it (for example, loading weights), bandwidth dominates.
Consider sending data over a 10 Gb/s link with 10 ms ping (latency). The crossover size is 12.5 MB, so the dominant bottleneck depends entirely on which side of that threshold the transfer falls:
- Latency-bound packet (1 KB):
  - Transmission: 1 KB \(\times\) 8 / 10 Gb/s ≈ 0.8 μs.
  - Total time ≈ 10 ms + 0.8 μs ≈ 10 ms.
  - Result: The bandwidth is irrelevant; the speed of light (ping) is the bottleneck.
- Bandwidth-bound checkpoint (1 GB):
  - Transmission: 1 GB \(\times\) 8 / 10 Gb/s ≈ 800 ms.
  - Total time ≈ 10 ms + 800 ms = 810 ms.
  - Result: The ping is negligible; the pipe size is the bottleneck.
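
Both cases follow from the transfer-time equation. The sketch below reproduces them; it assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and takes bandwidth in bits per second, as link speeds are usually quoted. The function names are illustrative.

```python
def transfer_time_s(size_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Total transfer time: fixed latency plus serialization time."""
    return latency_s + size_bytes * 8 / bandwidth_bps


def crossover_bytes(latency_s: float, bandwidth_bps: float) -> float:
    """Transfer size at which latency and serialization time are equal."""
    return latency_s * bandwidth_bps / 8


LAT, BW = 0.010, 10e9  # 10 ms latency, 10 Gb/s link

print(f"crossover: {crossover_bytes(LAT, BW) / 1e6:.1f} MB")
for label, size in [("1 KB packet", 1e3), ("1 GB checkpoint", 1e9)]:
    total = transfer_time_s(size, LAT, BW)
    print(f"{label}: {total * 1e3:.4f} ms total, latency share {LAT / total:.1%}")
```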
Architecture determines how fast data can move, but there is another lever that directly controls how much data must move: the numerical precision of each value. Halving precision from FP32 to FP16 halves the bytes per parameter, which doubles effective bandwidth for free—if the model can tolerate the reduced precision. Understanding these trade-offs requires a closer look at how numbers are represented in hardware.
Numerical Representations
While statistics helps us understand data distributions, numerical representations determine how we store the values themselves. In ML systems, the choice of precision (FP32 vs. BF16 vs. INT8) is a direct trade-off between statistical fidelity and hardware throughput.
Systems Perspective 1.5: Why this matters
Floating-point format comparison
IEEE 754 formats such as FP32 and FP16, together with AI-specific formats such as BF16 and FP8, define different trade-offs between dynamic range (the span of representable values) and precision (the granularity of values within that range). Table 12 summarizes the key formats and their use cases, while figure 3 visualizes the bit allocations.
| Format | Bits | Exponent | Mantissa | Dynamic Range | Typical Use Case |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | \(\sim 10^{-38}\) to \(10^{38}\) | Training (full precision), reference inference |
| FP16 | 16 | 5 | 10 | \(\sim 6 \times 10^{-5}\) to \(6.5 \times 10^{4}\) | Training with loss scaling, inference |
| BF16 | 16 | 8 | 7 | Same as FP32 | Training (preferred), avoids loss scaling |
| FP8 | 8 | 4 or 5 | 3 or 2 | Varies | Inference on recent hardware (H100 and later) |
| INT8 | 8 | N/A | N/A | -128 to 127 | Inference after quantization |
Beyond bit width, the allocation of bits between exponent and mantissa determines what range of values each format can represent.
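
The range column can be derived from the bit allocations alone. The sketch below assumes an IEEE 754-style encoding in which the top exponent code is reserved for infinities and NaNs and subnormals are ignored; FP32, FP16, and BF16 follow that convention, while FP8 variants differ in detail and are left out.

```python
def format_range(exp_bits: int, mantissa_bits: int) -> tuple:
    """Return (smallest normal, largest finite) for an IEEE 754-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    return min_normal, max_finite


FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}

for name, (e, m) in FORMATS.items():
    lo, hi = format_range(e, m)
    print(f"{name}: {lo:.2e} to {hi:.2e}")
```

BF16 and FP32 share the same smallest normal value because they share the exponent width; the mantissa only changes how finely values are spaced within that range.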
Systems Perspective 1.6: The dynamic range wall
The choice of numerical format is a direct application of the iron law of ML systems (principle 3). Reducing precision from FP32 to BF16 or FP16 halves the data volume term, potentially doubling throughput on memory-bound workloads. However, the type of 16-bit format determines the engineering complexity:
- Dynamic range (the exponent): BF16 preserves the eight-bit exponent of FP32. This means it can represent the same range of extremely large and extremely small values (gradients) (Kalamkar et al. 2019).
- Precision (the mantissa): FP16 has a larger 10-bit mantissa than BF16 (7 bits), offering higher precision for values within its range. Its five-bit exponent, however, is a major constraint; gradients often “vanish” to zero (underflow) because the exponent cannot represent them. To solve this, FP16 training requires Loss Scaling, an operational overhead where gradients are multiplied by a large constant to push them into the representable range (Micikevicius et al. 2017).
- Energy efficiency: INT8 operations can be substantially more efficient than floating-point equivalents because they move less data and use simpler integer arithmetic paths. Moving to INT8 for inference is a primary lever for deploying neural networks under tight memory, latency, or power budgets (Jacob et al. 2018; Krishnamoorthi 2018).
7 BF16 (Brain Floating Point 16): Originally introduced with Google TPUv2 and later adopted by Intel, Arm, and NVIDIA. Its ML systems value is that it keeps FP32’s exponent range while halving storage and memory traffic, so large training runs can use lower precision without the loss-scaling machinery FP16 often needs.
Among these formats, BF16 deserves special attention (Cloud 2019; Kalamkar et al. 2019).7 By matching FP32’s eight-bit exponent while truncating the mantissa to just 7 bits, BF16 preserves the full dynamic range needed for gradient representation. This avoids the underflow problems that plague FP16 training and, with them, most of the need for loss scaling.
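
FP16 underflow and the effect of loss scaling can be demonstrated with the standard library alone, since Python's `struct` module can round a value to IEEE 754 half precision. This is a single-value illustration, not a training loop: the gradient value and the scale of 1024 are hypothetical, and real frameworks scale the loss before the backward pass and unscale the gradients in FP32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8           # hypothetical small gradient, below FP16's smallest subnormal
LOSS_SCALE = 1024.0   # hypothetical scale factor (a power of two)

naive = to_fp16(grad)                    # stored directly in FP16
scaled = to_fp16(grad * LOSS_SCALE)      # scaled into the representable range first
recovered = scaled / LOSS_SCALE          # unscaled in higher precision

print("stored directly in FP16:", naive)
print("scaled, stored, unscaled:", recovered)
```

Stored directly, the value flushes to zero; scaled first, it survives with a small rounding error.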
Integer quantization
Quantization maps continuous floating-point values to discrete integers, typically INT8. The key challenge is choosing how to map the floating-point range to integers. Two approaches dominate.
Symmetric quantization centers the mapping at zero: \[ x_{\text{int}} = \text{round}\left(\frac{x}{s_{\text{quant}}} \times 127\right) \] where \(s_{\text{quant}}\) is the scale factor (typically the maximum absolute value). This works well for weight distributions centered around zero.
Asymmetric quantization handles distributions that are not centered (common after ReLU, which produces only nonnegative values) by shifting the range before scaling. A common activation variant maps to an unsigned 8-bit code. If \(x_{\min}\) is the minimum of the range and \(s_{\text{quant}}\) is the range width (\(x_{\max} - x_{\min}\)): \[ x_{\text{uint8}} = \text{round}\left(\frac{x - x_{\min}}{s_{\text{quant}}} \times 255\right) \]
The choice between symmetric and asymmetric quantization depends on your tensor’s distribution and has measurable accuracy implications.
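
The two mappings can be sketched for scalar values as follows. The clipping to the integer range and the dequantization helpers are additions for illustration and are not part of the formulas above; the example range of 0 to 6 is hypothetical.

```python
def quantize_symmetric(x: float, max_abs: float) -> int:
    """Map x in [-max_abs, max_abs] to a signed code in [-127, 127]."""
    q = round(x / max_abs * 127)   # Python's round() sends exact ties to the even integer
    return max(-127, min(127, q))


def quantize_asymmetric(x: float, x_min: float, x_max: float) -> int:
    """Map x in [x_min, x_max] to an unsigned code in [0, 255]."""
    q = round((x - x_min) / (x_max - x_min) * 255)
    return max(0, min(255, q))


def dequantize_symmetric(q: int, max_abs: float) -> float:
    return q * max_abs / 127


def dequantize_asymmetric(q: int, x_min: float, x_max: float) -> float:
    return x_min + q * (x_max - x_min) / 255


# Nonnegative activations (as after ReLU), with a hypothetical range of 0 to 6.
x = 1.5
sym = dequantize_symmetric(quantize_symmetric(x, 6.0), 6.0)
asym = dequantize_asymmetric(quantize_asymmetric(x, 0.0, 6.0), 0.0, 6.0)
print(f"symmetric error:  {abs(sym - x):.4f} (step {6.0 / 127:.4f})")
print(f"asymmetric error: {abs(asym - x):.4f} (step {6.0 / 255:.4f})")
```

For a nonnegative tensor, the symmetric mapping leaves the negative codes unused, so its step size is roughly twice that of the asymmetric mapping over the same range.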
With the full toolkit assembled—reference numbers, performance models, architectural constraints, and numerical trade-offs—use the reference numbers and models in this appendix as your first line of defense whenever a system behaves unexpectedly. A quick back-of-envelope calculation often reveals whether the culprit is physics, architecture, or a genuine software bug.
Summary
Key Takeaways: Numbers every engineer should know
- Energy dominates: Moving data costs ~600\(\times\) more energy than computing on it. Arithmetic intensity—the ratio of compute to data movement—is the single most important metric for ML workload performance.
- The Roofline model: The Roofline Model reveals whether a workload is compute bound or memory bound. Most inference workloads fall below the ridge point and are memory bound; batch size is the primary lever to shift toward compute-bound operation.
- Amdahl’s Law: Amdahl’s Law caps strong-scaling speedup at \(1/s\) (where \(s\) is the serial fraction). Gustafson’s Law shows that scaling the problem alongside hardware yields near-linear throughput gains—the paradigm that makes large-scale training feasible.
- Little’s Law: The relationship \(Q_{\text{req}} = \lambda_{\text{arr}} T_{\text{lat}}\) directly sizes inference infrastructure: concurrency, memory, and maximum throughput are all linked by this simple identity.
- Memory hierarchy: The memory hierarchy spans five orders of magnitude in latency (register at ~0.3 ns to SSD at ~100,000 ns). Keeping data close to compute is not an optimization—it is the optimization.
- Numerical precision: Numerical precision is a systems lever, not just a modeling choice. BF16 matches FP32’s dynamic range at half the memory cost; INT8 quantization can deliver 2–4\(\times\) inference speedup with careful calibration.
- Physics is nonnegotiable: Speed of light sets latency floors, energy ratios set efficiency ceilings, and no amount of software optimization can violate these constraints.