Machine Foundations

Purpose

What reference numbers and physical laws should every ML systems engineer carry into design decisions?

In ML systems, performance failures often masquerade as software problems: a training step mysteriously slows down, a serving stack misses its latency objective, or an accelerator upgrade fails to deliver the expected speedup. Many of these surprises are not bugs but the predictable consequences of physics (latency, bandwidth, energy) and architecture (memory hierarchy, precision, parallel scaling). This appendix collects the reference numbers and compact models for quick quantitative reasoning: numbers to know, roofline analysis, dimensional analysis, scaling laws, and precision trade-offs. In D·A·M terms, these numbers define the machine axis, the physical limits that algorithm choices and data movement must respect.

How to Use This Appendix

This appendix is designed as a reference. When diagnosing performance issues, use it to translate a vague symptom (“it’s slow”) into a specific constraint (“memory bound at batch size one”) and then choose the lever that can actually move that constraint.

Conventions used here follow the book-wide notation (for example, \(B\) is reserved for batch size and \(\text{BW}\) for bandwidth).

  • Sanity-check feasibility: Start with section 1.1 for order-of-magnitude numbers.
  • Diagnose the dominant ceiling: Use the Roofline Model in section 1.2.1 to decide whether the workload is compute bound or memory bound.
  • Reason about scaling limits: Use Amdahl’s and Gustafson’s Laws in section 1.2.3 to understand why adding accelerators may not reduce time-to-train.
  • Choose the right precision: Use section 1.4.1 to reason about FP32 vs. BF16/FP16 vs. INT8 as a systems trade-off.
  • Cross-reference for depth: When you want the full narrative, jump back to Hardware Acceleration, Model Training, and Model Serving.

Numbers to Know

Just as Jeff Dean’s “Latency Numbers Every Programmer Should Know”1 shaped a generation of systems engineers, these reference numbers provide the order-of-magnitude intuition essential for ML systems design. While absolute values evolve with hardware generations, the ratios between categories remain remarkably stable. Memorize the relationships; use the specific numbers as sanity checks.

1 Jeff Dean: A Google Senior Fellow and one of the architects of Google’s distributed systems infrastructure, including MapReduce, BigTable, and TensorFlow. His latency numbers, originally presented with Peter Norvig around 2010, became a canonical reference for systems engineers. The numbers have been updated over the years as hardware evolved, but the hierarchy of latencies remains remarkably stable; Colin Scott’s interactive visualization shows the latency hierarchy across hardware generations (Scott 2012).

Scott, Colin. 2012. “Numbers Every Programmer Should Know by Year.”
Systems Perspective 1.1: Three numbers that matter most
  • Energy ratio: DRAM access costs ~581× more energy than an FP16 FLOP. This is why arithmetic intensity is everything.
  • Training-state footprint: Model weights (2 bytes FP16) + gradients (2 bytes FP16) + master weights (4 bytes FP32) + optimizer states for Adaptive Moment Estimation (Adam) at 8 bytes. That totals 16 bytes per parameter, so a 7B model needs 112 GB just to start training.
  • Fiber propagation limit: Light travels about 200 km/ms in fiber. Cross-country latency is ~40 ms. No optimization can reduce this—it is physics.

The invariants: Numbers that will not change

These relationships are governed by physics or arithmetic—they will still be true in 2035.

Speed of light tax

Table 1 shows the irreducible latency floor for any distributed system.

Table 1: Speed of Light Reference: Light in fiber travels ~200 km/ms. These latencies are physics—no optimization can reduce them.
Distance | Round-Trip Latency | Implication
Same data center | ~1 ms | Distributed training feasible
Cross-country (US) | ~40 ms | Edge needed for <100 ms apps
Cross-Atlantic | ~60 ms | CDN required for global users
Cross-Pacific | ~100 ms | Data locality is critical
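The latency floor in Table 1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the path lengths are assumed round figures chosen to match the table, not measured fiber routes, and the function name is hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in fiber, from the text


def fiber_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber (no switching or routing delay)."""
    one_way_ms = distance_km / FIBER_KM_PER_MS
    return 2.0 * one_way_ms


# Assumed one-way path lengths in km (illustrative, not real cable routes).
assumed_paths = {
    "Cross-country (US)": 4_000,
    "Cross-Atlantic": 6_000,
    "Cross-Pacific": 10_000,
}

for name, km in assumed_paths.items():
    print(f"{name}: at least {fiber_round_trip_ms(km):.0f} ms round trip")
```

Real routes are longer than great-circle distances and add switching delay, so measured latencies sit above this floor.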

Energy hierarchy

Table 2 quantifies the energy cost of data movement vs. computation—the fundamental reason why arithmetic intensity dominates ML performance optimization.2

2 Energy Hierarchy Source: Energy numbers from Horowitz (2014), “Computing’s Energy Problem” (ISSCC, 45nm process). While absolute values scale with process node, the ratios between memory access and compute remain remarkably stable because wire capacitance (distance) dominates.

Horowitz, Mark. 2014. “1.1 Computing’s Energy Problem (and What We Can Do about It).” 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14. https://doi.org/10.1109/isscc.2014.6757323.
Table 2: The Energy Wall: Moving data costs ~581\(\times\) more energy than computing on it. This ratio is physics, not engineering.
Relationship | Ratio | Why It Is Stable
DRAM access vs. FP16 compute | ~581× | Wire capacitance scales with distance
FP32 vs. INT8 energy | ~18× | Bit width determines switching energy
FP32 vs. FP16 energy | ~3.4× | Narrower arithmetic reduces switching and datapath energy
L1 SRAM vs. register | ~50× | Distance to ALU

Memory hierarchy

Table 3 shows why the memory hierarchy is uneven: nearby on-package hops can differ by only a few times, while off-chip, storage, and network tiers introduce orders-of-magnitude jumps.

Table 3: The Latency Hierarchy: Selected latency relationships in the memory hierarchy. The largest gaps come from crossing physical boundaries: off-chip memory, peripheral links, storage, and network fabric.
Relationship | Ratio | Why It Persists
Accelerator memory (HBM) vs. register | ~1000× slower | On-chip vs. off-chip
SSD vs. register | ~300,000× slower | Electrical vs. mechanical/flash
Network vs. local memory | ~16× slower | Speed of light + switching
Accelerator memory BW vs. CPU↔︎Accelerator link | ~52× faster | Architectural investment priority

Scaling laws

Table 4 collects the arithmetic relationships that govern memory and compute requirements for training and inference.3

3 Training Memory (Adam): The 16 bytes/parameter rule assumes mixed-precision training with Adam. ZeRO optimization can reduce per-accelerator memory by sharding optimizer states across accelerators, but the total memory across all accelerators remains ~16\(\times\) parameters.

Table 4: Scaling Rules: These are arithmetic, not hardware-specific. Training memory includes FP16 weights (2 bytes), FP16 gradients (2 bytes), FP32 master weights (4 bytes), and Adam optimizer states (8 bytes for momentum + variance).
Rule | Formula | Example
Inference memory (FP16) | 2 bytes \(\times\) parameters | 7B params → 14 GB
Inference memory (INT8) | 1 byte \(\times\) parameters | 7B params → 7 GB
Training memory (Adam) | 16 bytes \(\times\) parameters | 7B params → 112 GB
Inference FLOPs (transformer) | ~2 \(\times\) parameters per token | 7B model → ~14 GFLOPs/token
Training FLOPs | ~6 \(\times\) parameters \(\times\) tokens | 7B on 1T tokens → \(\approx 4.2 \times 10^{22}\) FLOPs
Data center vs. edge compute | ~28.3× | Compute per watt \(\times\) power budget
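The arithmetic rules in Table 4 are simple enough to encode directly. The following is a minimal sketch, assuming the byte counts stated in the table caption (FP16 weights and gradients, FP32 master weights, Adam states); the function names are illustrative and not from any library.

```python
def inference_memory_gb(params: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory for inference: 2 bytes/param for FP16, 1 for INT8."""
    return params * bytes_per_param / 1e9


def training_memory_gb(params: float) -> float:
    """Mixed-precision Adam: 2 (weights) + 2 (grads) + 4 (master) + 8 (Adam)."""
    bytes_per_param = 2 + 2 + 4 + 8
    return params * bytes_per_param / 1e9


def inference_flops_per_token(params: float) -> float:
    """Roughly 2 FLOPs per parameter per generated token."""
    return 2.0 * params


def training_flops(params: float, tokens: float) -> float:
    """Roughly 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


P = 7e9  # 7B parameters, as in the table
print(f"FP16 inference: {inference_memory_gb(P):.0f} GB")
print(f"INT8 inference: {inference_memory_gb(P, 1.0):.0f} GB")
print(f"Adam training:  {training_memory_gb(P):.0f} GB")
print(f"Per token:      {inference_flops_per_token(P) / 1e9:.0f} GFLOPs")
print(f"1T tokens:      {training_flops(P, 1e12):.1e} FLOPs")
```

These estimates cover weights and optimizer state only; activations and KV cache come on top.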

Latency budgets: The nonnegotiables

These budgets are set by physics (safety) or psychology (human perception)—not by engineering choice. Unlike hardware specs that improve each generation, these are constraints your system must meet (table 5).

Table 5: Latency Targets: Miss these and the application fails, regardless of accuracy.
Application | Budget | Constraint
Autonomous braking | <10 ms | At 100 km/h, 10 ms = 28 cm of travel
Voice assistant | <100 ms | Human perception of “instant”
Web search | <200 ms | User patience threshold
Video streaming | <1 s | Buffer tolerance
Batch training | hours–days | Throughput dominates latency

Current hardware reference (c. 2024)

These numbers reflect the current generation. Use them for back-of-envelope calculations, but expect them to improve ~2\(\times\) every 2–3 years.

Memory latency and bandwidth

Table 6 captures the full latency and bandwidth hierarchy for current-generation hardware.

Table 6: Memory Hierarchy (c. 2024): Specific values for current hardware.
Level | Latency | Bandwidth
Register | ~0.3 ns | -
L1 Cache | ~1 ns | -
L2 Cache | ~4 ns | -
GPU HBM3 | ~300 ns | 3.4 TB/s
PCIe Gen5 (CPU↔︎GPU) | ~1000 ns | 64 GB/s
CPU DRAM | ~100 ns | 50 GB/s
InfiniBand (network) | ~5000 ns | 50 GB/s
NVMe SSD | ~100000 ns | 7 GB/s

Compute throughput

Table 7 shows the raw throughput available at each tier of the deployment hierarchy.

Table 7: Compute Reference (c. 2024): Using the current FP16 and mobile INT8 constants, data-center peak throughput is about 28.3× the mobile reference; exact ratios depend on precision mode and workload.
Platform | FP16/BF16 | INT8/FP8-class | Power
Data center GPU (H100) | 989 TFLOP/s | 1979 TFLOP/s (FP8 peak) | 700 W
Data center GPU (A100) | 312 TFLOP/s | 624 TOPS | 400 W
Mobile NPU | - | 35 TOPS | 3–5 W

Roofline ridge points

Table 8 defines the arithmetic intensity thresholds that determine whether a workload is memory bound or compute bound.

Table 8: Arithmetic Intensity Thresholds (c. 2024): Most inference workloads are <10 FLOP/byte—firmly memory bound.
Accelerator | Ridge Point | Implication
A100 (FP16) | 153 FLOP/byte | Below → memory-bound
H100 (FP16) | 295 FLOP/byte | Higher bar for compute-bound

Systems Perspective 1.2: A note on terminology: GPUs and accelerators
Throughout this book, we often use “accelerator” when discussing hardware acceleration. The principles covered here (roofline analysis, memory hierarchies, numerical precision, and performance modeling) apply equally to GPUs, Tensor Processing Units (TPUs), NPUs, custom ASICs, and other specialized AI accelerators. We use “accelerator” as the universal term; readers should assume a concept applies to all of these devices unless we explicitly discuss vendor-specific features (for example, CUDA, NVLink).

Knowing the numbers is only the first step. The real power comes from having compact models that tell you which number matters for your specific bottleneck. The next section provides exactly these diagnostic tools—starting with the Roofline Model, which translates raw hardware specs into actionable performance ceilings.

Physics of Computing

Raw hardware specs—TFLOP/s, TB/s, watt budgets—are necessary but insufficient for performance reasoning. Without compact analytical models, an engineer cannot distinguish a compute-bound workload from a memory-bound one, or predict whether doubling GPUs will halve training time. The models in this section provide exactly these diagnostic tools.

Systems Perspective 1.3: Why this matters
Consider a model that achieves good accuracy, but inference takes 200 ms when the SLA requires 50 ms. Performance analysis models provide a systematic method to diagnose whether the system is limited by computation, memory bandwidth, or other factors. Without these models, optimization relies on guesswork.

The roofline model

The Roofline Model (Williams et al. 2009) bounds how fast a workload can run on a given hardware target. The answer depends on whether you run out of compute or memory bandwidth first.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://doi.org/10.1145/1498765.1498785.

Every operation has an arithmetic intensity: the ratio of computations performed to bytes moved from memory. Matrix multiplication has high arithmetic intensity because each loaded element is reused many times. Element-wise operations like rectified linear unit (ReLU) have low intensity because each operation loads a number, performs one computation, and writes it back. As figure 1 illustrates, each workload is bounded by either memory bandwidth or compute throughput, and its arithmetic intensity determines which ceiling it hits first.

Figure 1: The Roofline model: Performance ceiling for a hypothetical accelerator. The sloped line represents memory bandwidth limits; the horizontal line represents peak compute. Every workload can be plotted on this diagram to determine its optimization strategy.

The ridge point determines the hardware’s balance. If a workload’s intensity falls below this point, it is memory-bound (sloped region). If above, it is compute-bound (flat region). \[ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{bytes accessed}} \]

\[ \text{Ridge Point} = \frac{\text{Peak FLOP/s}}{\text{Memory Bandwidth}} \]
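The two formulas above combine into a short roofline calculator. This is a sketch rather than a profiling tool: the A100 peak and bandwidth values are the ones quoted in this appendix, and the function names are illustrative.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def attainable_flops_per_s(intensity: float, peak: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak, intensity * bandwidth)


def classify(intensity: float, peak: float, bandwidth: float) -> str:
    """Label a workload by which ceiling it hits first."""
    if intensity < ridge_point(peak, bandwidth):
        return "memory-bound"
    return "compute-bound"


A100_PEAK = 312e12  # FP16 FLOP/s, from this appendix
A100_BW = 2.04e12   # bytes/s, from this appendix

print(f"ridge point: {ridge_point(A100_PEAK, A100_BW):.0f} FLOP/byte")
for ai in (0.25, 10.0, 1365.0):
    bound = attainable_flops_per_s(ai, A100_PEAK, A100_BW)
    label = classify(ai, A100_PEAK, A100_BW)
    print(f"AI={ai:>7}: at most {bound / 1e12:.2f} TFLOP/s ({label})")
```

The bound is an upper limit; real kernels land below it because of launch overhead, imperfect caching, and occupancy.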

Systems Perspective 1.4: Batch size controls arithmetic intensity
For matrix multiplications, arithmetic intensity scales with the batch dimension. When you compute \(Y = XW\) where \(X\) is \((B\times d_{\text{in}})\) and \(W\) is \((d_{\text{in}}\times d_{\text{out}})\):

  • FLOPs: \(2 \times B \times d_{\text{in}} \times d_{\text{out}}\) (multiply-adds)
  • Bytes: Weights are loaded once: \(d_{\text{in}} \times d_{\text{out}} \times \text{bytes}_{\text{precision}}\)

Doubling the batch size \(B\) doubles FLOPs while keeping weight loads constant—directly increasing arithmetic intensity. This is why inference serving batches requests: batch size 1 is almost always memory bound, while batch size 64+ can approach the compute ceiling.
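The batch-size effect can be checked numerically. The sketch below follows the simplification used in this callout and counts only weight traffic, ignoring activation reads and writes; with that assumption the layer dimensions cancel. The function name and the 4096 layer size are illustrative choices.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_value: int = 2) -> float:
    """FLOP/byte for Y = XW, counting only the one-time weight load."""
    flops = 2 * batch * d_in * d_out              # multiply-adds
    weight_bytes = d_in * d_out * bytes_per_value
    return flops / weight_bytes


for b in (1, 8, 64, 256):
    ai = linear_layer_intensity(b, d_in=4096, d_out=4096)
    print(f"batch {b:>3}: {ai:.0f} FLOP/byte")
```

Under this weights-only model with FP16 values, intensity equals the batch size, which makes the contrast between batch size 1 and large batches easy to see against a ridge point of 153 FLOP/byte.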

A concrete example: The A100 analysis

Consider an NVIDIA A100 GPU with FP16 Tensor Core performance of 312 TFLOP/s and HBM2e bandwidth of 2.04 TB/s. The ridge point is (312 TFLOP/s) / (2.04 TB/s) ≈ 153 FLOP/byte (the tera prefixes cancel, yielding FLOP/byte).

Two common operations fall on opposite sides of that ridge. General matrix multiply (GEMM) on two square matrices of size 4096 by 4096 has arithmetic intensity of approximately 1365 FLOP/byte, so 1365 FLOP/byte > 153 FLOP/byte and the operation is compute bound. An element-wise ReLU on a square tensor of the same size has intensity of only 0.25 FLOP/byte, so 0.25 FLOP/byte ≪ 153 FLOP/byte and the operation is severely memory bound, achieving only about 0.16 percent of peak TFLOP/s. This contrast explains why modern frameworks fuse operations: combining ReLU with the preceding MatMul avoids writing intermediate results to memory, effectively increasing arithmetic intensity.
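The two intensities quoted above can be derived from first principles. The sketch below assumes FP16 values (2 bytes each) and ideal reuse, meaning each matrix or tensor crosses the memory interface exactly once; these assumptions reproduce the figures in the text but are a best case, not a measurement.

```python
def gemm_intensity(n: int, bytes_per_value: int = 2) -> float:
    """Square n x n GEMM: 2n^3 FLOPs; read A, read B, write C once each."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_value
    return flops / bytes_moved


def relu_intensity(bytes_per_value: int = 2) -> float:
    """Element-wise ReLU: one op per element, one read plus one write."""
    return 1 / (2 * bytes_per_value)


A100_PEAK = 312e12  # FP16 FLOP/s
A100_BW = 2.04e12   # bytes/s

# Fraction of peak a memory-bound ReLU can reach: intensity * BW / peak.
relu_fraction_of_peak = relu_intensity() * A100_BW / A100_PEAK

print(f"GEMM 4096: {gemm_intensity(4096):.0f} FLOP/byte")
print(f"ReLU:      {relu_intensity():.2f} FLOP/byte")
print(f"ReLU reaches about {relu_fraction_of_peak:.2%} of peak")
```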

Dimensional analysis

The Roofline Model helps diagnose where a bottleneck lies. Before applying any performance equation, however, it must be verified as physically meaningful. Dimensional analysis provides this sanity check: any valid equation must be dimensionally homogeneous—every term must resolve to the same units. If they do not, the equation contains an error.

Consider the iron law of ML systems (principle 3) introduced in Iron Law of ML Systems: \[ T = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}} \]

We verify correctness by confirming that every term resolves to time (seconds): \[ T [s] = \underbrace{ \frac{D_{\text{vol}} [\text{bytes}]}{\text{BW} [\text{bytes/s}]} }_{\text{seconds}} + \underbrace{ \frac{O [\text{FLOPs}]}{R_{\text{peak}} [\text{FLOP/s}] \cdot \eta_{\text{hw}} [1]} }_{\text{seconds}} + \underbrace{ L_{\text{lat}} [s] }_{\text{seconds}} \]

  • Data term: \(\frac{\text{bytes}}{\text{bytes/s}} = \text{bytes} \times \frac{\text{s}}{\text{bytes}} = \mathbf{s}\)
  • Compute term: \(\frac{\text{FLOPs}}{\text{FLOP/s}} = \text{FLOPs} \times \frac{\text{s}}{\text{FLOP}} = \mathbf{s}\)
  • Latency term: \(L_{\text{lat}}\) is already in seconds.

The equation is physically consistent. Apply this technique to any systems equation: if the dimensions do not match, the formula is wrong. “FLOPs” and “Bandwidth” cannot be traded directly because they have different units. Any such trade-off must convert through Time, which is precisely what the iron law quantifies.
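The same check can be mechanized by tracking unit exponents. The sketch below is a deliberately tiny, hypothetical helper (not a units library): each quantity is a dictionary from unit name to exponent, and division subtracts exponents.

```python
def divide(numerator: dict, denominator: dict) -> dict:
    """Divide two unit signatures by subtracting exponents; drop zeros."""
    units = dict(numerator)
    for unit, exponent in denominator.items():
        units[unit] = units.get(unit, 0) - exponent
    return {u: e for u, e in units.items() if e != 0}


SECONDS = {"s": 1}
BYTES = {"byte": 1}
FLOPS = {"flop": 1}
BYTES_PER_S = {"byte": 1, "s": -1}
FLOP_PER_S = {"flop": 1, "s": -1}

data_term = divide(BYTES, BYTES_PER_S)     # D_vol / BW
compute_term = divide(FLOPS, FLOP_PER_S)   # O / (R_peak * eta); eta has no units
latency_term = SECONDS                     # L_lat

# Every term of the iron law must reduce to seconds.
assert data_term == compute_term == latency_term == SECONDS

# A malformed term is caught immediately: FLOPs divided by bandwidth
# leaves flop * s / byte, which is not a time.
bad_term = divide(FLOPS, BYTES_PER_S)
print("bad term units:", bad_term)
```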

The fundamental limits of scaling across multiple devices are the subject of section 1.2.3.

Amdahl’s Law and Gustafson’s Law

Parallelization is the primary tool for scaling ML, but its limits depend on how you scale. These two laws frame the fundamental tension in parallel computing. Amdahl’s Law is the pessimist’s view, governing how much faster a fixed task can run (optimizing latency). Gustafson’s Law is the optimist’s view, governing how much more work we can do in the same time (optimizing throughput).

Strong scaling (Amdahl’s Law)

Strong scaling measures how much faster a fixed-size problem runs as processors are added.

Amdahl’s Law (Amdahl 1967) states that the speedup is limited by the serial portion of the task.4 If a fraction \(s\) of your task is serial (cannot be parallelized) and \(p = 1-s\) is parallelizable, the maximum speedup with \(n\) processors is: \[ \text{Speedup}(n) = \frac{1}{s + \frac{1-s}{n}} \]

Amdahl, Gene M. 1967. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” Proceedings of the April 18–20, 1967, Spring Joint Computer Conference (AFIPS ’67 Spring), 483–85. https://doi.org/10.1145/1465482.1465560.

4 Gene Amdahl (1922–2015): A legendary computer architect at IBM, where he was the chief architect of the System/360. He later founded Amdahl Corporation to compete with IBM in the mainframe market.

As \(n \to \infty\), the term \(\frac{1-s}{n} \to 0\), and the speedup converges to \(1/s\).

To see Amdahl’s Law in action, suppose 5 percent of a training step is serial overhead (for example, Python global interpreter lock (GIL), kernel launch latency) and 95 percent is parallelizable matrix math:

  • With \(n=1\), speedup is 1.
  • With \(n = 8\), speedup is 1/(0.05 + 0.95/8) ≈ 5.9×.
  • With \(n \to \infty\), speedup is capped at 1/0.05 = 20×.

No matter how many accelerators are added, this fixed workload cannot run faster than 20×.
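The worked numbers above follow directly from the formula. A minimal sketch, with an illustrative function name, that also shows how quickly the curve flattens:

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Strong-scaling speedup on n processors for a fixed-size problem."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


s = 0.05  # 5 percent serial overhead, as in the example
for n in (1, 8, 64, 1024):
    print(f"n={n:>4}: {amdahl_speedup(s, n):.1f}x")
print(f"limit as n grows: {1.0 / s:.0f}x")
```

Most of the attainable speedup arrives early; beyond a few hundred processors the serial fraction dominates and additional hardware buys very little.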

Weak scaling (Gustafson’s Law)

Weak scaling measures how much larger a problem can become while holding runtime fixed as processors are added.

This is the reality of Large Language Models. Rather than using 1,000 accelerators to train a model on a small dataset in milliseconds, they are used to train on a dataset 1,000\(\times\) larger in reasonable time.

Gustafson’s Law (Gustafson 1988) models this “scaled speedup”:5 \[ \text{Scaled Speedup}(n) = n - s(n - 1) \]

Gustafson, John L. 1988. “Reevaluating Amdahl’s Law.” Communications of the ACM 31 (5): 532–33. https://doi.org/10.1145/42411.42415.

5 John Gustafson: A computer scientist known for his work in parallel computing and for introducing the Unum (universal number) format. His law was a direct response to the perceived “limits” of Amdahl’s Law when applied to massive scale.

Here, the parallel part of the workload grows linearly with \(n\), while the serial part \(s\) remains fixed.

Using the same 5 percent serial overhead (\(s\) = 0.05), Gustafson’s Law tells a very different story:

  • With \(n=1\), speedup is 1.
  • With \(n = 8\), scaled speedup is 8 − 0.05 \(\times\) 7 = 8 − 0.35 = 7.65×.
  • With \(n = 1000\), scaled speedup is 1000 − 0.05 \(\times\) 999 ≈ 950×.

In weak scaling, efficiency remains high because the useful work (training the model) scales up to dwarf the fixed overheads.
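The scaled-speedup figures can be reproduced in the same way. The sketch below uses an illustrative function name and adds parallel efficiency (scaled speedup divided by processor count) to make the contrast with strong scaling explicit.

```python
def gustafson_speedup(serial_fraction: float, n: int) -> float:
    """Weak-scaling (scaled) speedup: the problem grows with n."""
    return n - serial_fraction * (n - 1)


s = 0.05  # same 5 percent serial overhead as the Amdahl example
for n in (1, 8, 1000):
    scaled = gustafson_speedup(s, n)
    efficiency = scaled / n
    print(f"n={n:>4}: scaled speedup {scaled:.2f}x, efficiency {efficiency:.1%}")
```

With a 5 percent serial fraction, efficiency stays close to 95 percent at any scale, because the serial part is a fixed cost amortized over a growing amount of parallel work.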

The same scaling lens turns model size, data size, hardware count, and utilization into a single training-time estimate.

Napkin Math 1.1: The training time equation
Just as classical architecture has an “iron law” of performance, Large Language Model training has a fundamental governing equation. To estimate training time \(T\): \[ T \approx \frac{6 \cdot P \cdot D}{N_{\text{GPU}} \cdot R_{\text{peak}} \cdot \eta_{\text{hw}}} \] Where:

  • Training FLOP factor: The factor \(6\) is the sum of roughly \(2P\) FLOPs per token for the forward pass and \(4P\) FLOPs per token for the backward pass; over \(D\) tokens these total \(2PD\) and \(4PD\).
  • \(P\): Number of model parameters.
  • \(D\): Number of training tokens.
  • \(N_{\text{GPU}}\): Number of GPUs.
  • \(R_{\text{peak}}\): Peak FLOP/s of one accelerator.
  • \(\eta_{\text{hw}}\): Hardware utilization, typically 30 percent–50 percent in this training estimate.

Example: Training a 1B-parameter model on 20B tokens using 1 A100 (312 TFLOP/s) at 40 percent utilization.

\[ \text{Total FLOPs} = 6 \times (1 \times 10^{9}) \times (2 \times 10^{10}) = 1.2 \times 10^{20} \text{ FLOPs} \]

\[ \text{Throughput} = 1 \times (3.12 \times 10^{14}) \times 0.40 \approx 1.248 \times 10^{14} \text{ FLOP/s} \]

\[ T = \frac{1.2 \times 10^{20}}{1.248 \times 10^{14}} \approx 961{,}538.5 \text{ seconds} \]

The computed result is about 961,538.5 seconds (≈ 16,025.6 minutes, or about 11.1 days).
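The estimate is easy to rerun for other model sizes and cluster shapes. A minimal sketch of the training time equation, with an illustrative function name and the example's values as inputs:

```python
def training_time_seconds(params: float, tokens: float, n_gpus: int,
                          peak_flops_per_s: float, utilization: float) -> float:
    """T = 6 * P * D / (N_GPU * R_peak * eta_hw)."""
    total_flops = 6.0 * params * tokens
    sustained_flops_per_s = n_gpus * peak_flops_per_s * utilization
    return total_flops / sustained_flops_per_s


seconds = training_time_seconds(
    params=1e9,               # 1B parameters
    tokens=20e9,              # 20B training tokens
    n_gpus=1,
    peak_flops_per_s=312e12,  # A100 FP16 peak
    utilization=0.40,
)
print(f"{seconds:,.1f} s = {seconds / 60:,.1f} min = {seconds / 86_400:.1f} days")
```

The model assumes perfect linear scaling in the number of GPUs; communication overhead in multi-GPU runs is folded into the utilization term, so that value should be lowered accordingly.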

Before moving from scaling laws to queueing, pause on the consequences of these models for concrete design choices.

Checkpoint 1.1: Check your understanding: Performance models
  1. A new accelerator doubles compute throughput but keeps memory bandwidth the same. For a workload that is memory-bound on the current hardware, how much speedup do you expect? What about a compute-bound workload?

  2. Your training pipeline has 10 percent serial overhead. Using Amdahl’s Law, what is the maximum possible speedup regardless of how many accelerators you add? Using Gustafson’s Law with 256 accelerators, what is the scaled speedup?

  3. An inference service must handle 500 queries per second (QPS) at 100 ms latency. Using Little’s Law, how many concurrent requests must the system support? If each request needs 2 GB of KV cache memory, what is the minimum accelerator memory required?

Little’s Law

For capacity planning in inference systems, Little’s Law (Little 1961) relates concurrency (\(Q_{\text{req}}\)), arrival rate (\(\lambda_{\text{arr}}\)), and time in system (\(T_{\text{lat}}\)):6 \[ Q_{\text{req}} = \lambda_{\text{arr}} \times T_{\text{lat}} \]

Little, John D. C. 1961. “A Proof for the Queuing Formula: \(L = \lambda W\).” Operations Research 9 (3): 383–87. https://doi.org/10.1287/opre.9.3.383.

6 John Little: An Institute Professor at MIT and a pioneer in the field of operations research. His law, proved in 1961, is fundamental to queuing theory and is used across fields from manufacturing to computer network analysis.

To see this in practice, consider sustaining 1,000 QPS with 50 ms average latency. The law tells us the system must support 1000 \(\times\) 0.05 s = 50 concurrent requests.

This directly determines how to size inference worker pools. If serving one request requires 1 GB of temporary memory (KV cache, activations), handling 50 concurrent requests requires 50 GB of memory. If the accelerator only has 24 GB, the system is physically limited to 24 concurrent requests. Maximum throughput is capped at \(N_{\text{max}}/T_{\text{lat}} = 24 / 0.05 = 480 \text{ QPS}\), regardless of how many requests arrive.
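The sizing argument above reduces to two small functions. This is a sketch under the same simplification as the text (a fixed memory cost per in-flight request and a fixed latency); the function names are illustrative.

```python
def required_concurrency(arrival_rate_qps: float, latency_s: float) -> float:
    """Little's Law: requests in flight = arrival rate * time in system."""
    return arrival_rate_qps * latency_s


def max_throughput_qps(memory_gb: float, gb_per_request: float,
                       latency_s: float) -> float:
    """Throughput ceiling when memory caps the number of in-flight requests."""
    max_concurrent = int(memory_gb // gb_per_request)
    return max_concurrent / latency_s


print(f"needed in flight: {required_concurrency(1000, 0.05):.0f} requests")
print(f"24 GB device cap: {max_throughput_qps(24, 1, 0.05):.0f} QPS")
```

In practice latency rises as a server approaches saturation, so the ceiling computed here is optimistic.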

These physics-based models—Roofline, Amdahl, Gustafson, and Little—diagnose where bottlenecks lie. Translating those diagnoses into actionable optimizations, however, requires understanding the concrete hardware structures that impose them: caches, memory buses, and interconnects.

Computer Architecture Essentials

A GPU advertises 1,000 TFLOP/s, yet your kernel achieves only 30 TFLOP/s. The missing 97 percent is not a software bug—it is the cost of moving data through a memory hierarchy that spans five orders of magnitude in latency. While physics sets theoretical performance bounds, computer architecture defines the machinery that determines how close a real workload can get. The following discussion covers the latency, bandwidth, and energy trade-offs that shape system design.

Latencies every programmer should know

The first step in systems intuition is understanding the cost of distance. Table 9 quantifies how long the processor waits for data from different levels of the memory hierarchy. If accessing a register is like picking up a pencil from your desk, fetching from HBM is walking across the office, and fetching from disk is flying to the moon.

Table 9: The Latency Hierarchy: Access times for modern AI hardware. Note the massive jump from SRAM (Cache) to HBM. Any kernel that misses cache pays a heavy penalty.
Component | Latency (ns) | Cycles (Approx) | Relative “Distance”
Register | ~0.3 ns | 1 cycle | 10 seconds
L1 Cache | ~1 ns | 3–4 cycles | 33.3 seconds
L2 Cache | ~4 ns | 12 cycles | 2.2 minutes
HBM3 (GPU Memory) | ~300 ns | 1,000 cycles | 2.8 hours
NVLink (GPU-GPU) | ~500 ns | 1,500 cycles | 4.6 hours
PCIe (CPU-GPU) | ~1000 ns | 3,000 cycles | 9.3 hours
InfiniBand (Network) | ~5000 ns | 15,000 cycles | 1.9 days
SSD (NVMe) | ~100000 ns | 300,000 cycles | 38.6 days

The AI hardware cheat sheet (modern reference)

While latency tells us how long we wait for the first byte, bandwidth tells us how many bytes follow. Table 10 provides the constants for back-of-the-envelope “Roofline” calculations. These represent the “standard units of compute” for the current era of machine learning.

Table 10: Reference Specs: Key constants for quantitative analysis. Always check specific datasheets, but these serve as standard units of compute.
Spec | NVIDIA H100 (SXM) | Google TPU v5p | System Impact
FP16/BF16 Peak | 989 TFLOP/s | 459 TFLOP/s | The “Speed Limit” (\(R_{\text{peak}}\))
Memory Bandwidth | 3.35 TB/s | 2.76 TB/s | The “Width of the Pipe” (\(\text{BW}\))
HBM Capacity | 85 GB | 102 GB | Max Model Size (\(P\))/Batch Size (\(B\))
L2/SRAM Cache | 50 MB | ~100 MB | Critical for Operator Fusion
Interconnect | 900 GB/s (NVLink) | 1200 GB/s (Inter-Chip Interconnect, ICI) | Determines Model Parallelism Scaling

The memory hierarchy

Computer systems use a hierarchy because no single technology provides both high capacity and low latency. Examine the pyramid in figure 2 to see how each level balances this trade-off: every technique that keeps data higher in the pyramid (registers/cache) directly improves performance.

Figure 2: The Memory Hierarchy: Performance depends on data proximity. Accessing HBM is roughly 1,000\(\times\) slower than registers; accessing SSD is roughly 300,000\(\times\) slower.

The memory hierarchy is the fundamental physical constraint of machine learning systems. Table 11 consolidates the physical properties—latency, bandwidth, and energy—across the entire stack.

Table 11: Physical Properties of the Memory Hierarchy (c. 2024): Consolidating latency, bandwidth, and energy across the memory hierarchy. The hierarchy spans five orders of magnitude in latency and six orders of magnitude in energy per access. For the ML engineer, this table defines the “silicon contract”: every optimization that moves data one layer higher in the hierarchy delivers an order-of-magnitude dividend in performance.
Layer | Technology | Latency | Bandwidth | Energy (per 32b)
Registers | Flip-Flops | ~0.3 ns | - | 0.01 pJ
L1 Cache | SRAM | ~1 ns | - | 0.5 pJ
L2 Cache | SRAM | ~4 ns | - | 2 pJ
Memory (Local) | HBM3 | ~300 ns | 3350 GB/s | 640 pJ
Interconnect | NVLink 4.0 | ~500 ns | 900 GB/s | ~640 pJ
Host Link | PCIe Gen5 | ~1000 ns | 64 GB/s | ~640 pJ
System RAM | DDR5 | ~100 ns | 50 GB/s | ~640 pJ
Network (Fabric) | InfiniBand NDR | ~5000 ns | 50 GB/s | ~10,000 pJ
Storage (Local) | NVMe SSD | ~100000 ns | 7 GB/s | ~5,000 pJ

The hierarchy’s energy costs reveal why data movement dominates modern system design.

Napkin Math 1.2: The high cost of data movement
Fetching a 32-bit value from DRAM costs roughly 581× more energy than performing a floating-point operation on it (for example, ~640 pJ vs. ~1.1 pJ). This energy wall means that maximizing arithmetic intensity (doing many ops per loaded byte) is the only way to be energy efficient.

Bandwidth vs. latency

Bandwidth (throughput) and latency (delay) are distinct constraints. Total transfer time follows: \[ T = L_{\text{lat}} + \frac{D_{\text{vol}}}{\text{BW}} \]

The crossover point is the transfer size where the two terms are equal: \[ D_{\text{vol,cross}} = L_{\text{lat}} \times \text{BW} \]

For transfers below \(D_{\text{vol,cross}}\) (for example, single-token inference), latency dominates. For transfers above it (for example, loading weights), bandwidth dominates.

Consider sending data over a 10 Gb/s link with 10 ms ping (latency). The crossover size is 12.5 MB, so the dominant bottleneck depends entirely on which side of that threshold the transfer falls:

  • Latency-bound packet (1 KB):
    • Transmission: 1 KB \(\times\) 8 / 10 Gb/s ≈ 0.8 μs.
    • Total Time ≈ 10 ms + 0.8 μs ≈ 10 ms.
    • Result: The bandwidth is irrelevant; the speed of light (ping) is the bottleneck.
  • Bandwidth-bound checkpoint (1 GB):
    • Transmission: \(1\text{ GB} \times 8 / 10\text{ Gbps} \approx 800\text{ ms}\).
    • Total Time \(\approx 10\text{ ms} + 800\text{ ms} = 810\text{ ms}\).
    • Result: The ping is negligible; the pipe size is the bottleneck.
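The two cases above can be reproduced with the transfer-time and crossover formulas. The sketch below takes link bandwidth in bits per second, as in the example, and converts sizes from bytes; the function names are illustrative.

```python
def transfer_time_s(size_bytes: float, latency_s: float,
                    bandwidth_bits_per_s: float) -> float:
    """T = L_lat + D_vol / BW, with D_vol converted from bytes to bits."""
    return latency_s + size_bytes * 8 / bandwidth_bits_per_s


def crossover_bytes(latency_s: float, bandwidth_bits_per_s: float) -> float:
    """Transfer size at which the latency and bandwidth terms are equal."""
    return latency_s * bandwidth_bits_per_s / 8


LATENCY_S = 0.010   # 10 ms ping
LINK_BPS = 10e9     # 10 Gb/s

print(f"crossover: {crossover_bytes(LATENCY_S, LINK_BPS) / 1e6:.1f} MB")
for label, size in (("1 KB packet", 1e3), ("1 GB checkpoint", 1e9)):
    t = transfer_time_s(size, LATENCY_S, LINK_BPS)
    print(f"{label}: {t * 1e3:.1f} ms")
```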

Architecture determines how fast data can move, but there is another lever that directly controls how much data must move: the numerical precision of each value. Halving precision from FP32 to FP16 halves the bytes per parameter, which doubles effective bandwidth for free—if the model can tolerate the reduced precision. Understanding these trade-offs requires a closer look at how numbers are represented in hardware.

Numerical Representations

While statistics helps us understand data distributions, numerical representations determine how we store the values themselves. In ML systems, the choice of precision (FP32 vs. BF16 vs. INT8) is a direct trade-off between statistical fidelity and hardware throughput.

Systems Perspective 1.5: Why this matters
A production model might run at 50 QPS in FP32 when the target is 200 QPS. Switching to INT8 could achieve this throughput, but accuracy may suffer. Understanding numerical formats enables a quantitative evaluation of this trade-off.

Floating-point format comparison

IEEE 754 formats such as FP32 and FP16, together with AI-specific formats such as BF16 and FP8, define different trade-offs between dynamic range (the span of representable values) and precision (the granularity of values within that range). Table 12 summarizes the key formats and their use cases, while figure 3 visualizes the bit allocations.

Table 12: Numerical Format Comparison: Each format trades off precision, dynamic range, memory footprint, and compute throughput. BF16 has emerged as the preferred training format because it matches FP32’s range while using half the memory.
Format | Bits | Exponent | Mantissa | Dynamic Range | Typical Use Case
FP32 | 32 | 8 | 23 | \(\sim 10^{-38}\) to \(10^{38}\) | Training (full precision), reference inference
FP16 | 16 | 5 | 10 | \(\sim 10^{-5}\) to \(6.5 \times 10^{4}\) | Training with loss scaling, inference
BF16 | 16 | 8 | 7 | Same as FP32 | Training (preferred), avoids loss scaling
FP8 | 8 | 4 or 5 | 3 or 2 | Varies | Inference on recent hardware (H100 and later)
INT8 | 8 | N/A | N/A | -128 to 127 | Inference after quantization

Figure 3: Numerical Format Bit Layouts: A visual comparison of bit allocations. Note how BF16 (Brain Float 16) preserves the 8-bit exponent of FP32, ensuring the same dynamic range for training stability. FP16 trades range for precision, often requiring loss scaling to prevent underflow.

Beyond bit width, the allocation of bits between exponent and mantissa determines what range of values each format can represent.

Systems Perspective 1.6: The dynamic range wall

The choice of numerical format is a direct application of the iron law of ML systems (principle 3). Reducing precision from FP32 to BF16 or FP16 halves the data volume term, potentially doubling throughput on memory-bound workloads. However, the type of 16-bit format determines the engineering complexity:

  • Dynamic range (the exponent): BF16 preserves the eight-bit exponent of FP32. This means it can represent the same range of extremely large and extremely small values (gradients) (Kalamkar et al. 2019).
  • Precision (the mantissa): FP16 has a larger 10-bit mantissa than BF16 (7 bits), offering higher precision for values within its range. Its five-bit exponent, however, is a major constraint; gradients often “vanish” to zero (underflow) because the exponent cannot represent them. To solve this, FP16 training requires Loss Scaling, an operational overhead where gradients are multiplied by a large constant to push them into the representable range (Micikevicius et al. 2017).
  • Energy efficiency: INT8 operations can be substantially more efficient than floating-point equivalents because they move less data and use simpler integer arithmetic paths. Moving to INT8 for inference is a primary lever for deploying neural networks under tight memory, latency, or power budgets (Jacob et al. 2018; Krishnamoorthi 2018).
Krishnamoorthi, Raghuraman. 2018. “Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper.” arXiv Preprint arXiv:1806.08342.
Jacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2704–13. https://doi.org/10.1109/cvpr.2018.00286.
Micikevicius, Paulius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, et al. 2017. “Mixed Precision Training.” arXiv Preprint arXiv:1710.03740.
Kalamkar, Dhiraj, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, et al. 2019. A Study of BFLOAT16 for Deep Learning Training.

7 BF16 (Brain Floating Point 16): Originally introduced with Google TPUv2 and later adopted by Intel, Arm, and NVIDIA. Its ML systems value is that it keeps FP32’s exponent range while halving storage and memory traffic, so large training runs can use lower precision without the loss-scaling machinery FP16 often needs.

Cloud, Google. 2019. BFloat16: The Secret to High Performance on Cloud TPUs.

Among these formats, BF16⁷ deserves special attention (Cloud 2019; Kalamkar et al. 2019). By matching FP32’s eight-bit exponent while truncating the mantissa to 7 bits, BF16 preserves the dynamic range needed to represent gradients. This avoids the underflow problems that plague FP16 training and reduces the need for loss scaling.
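
The underflow behavior described above can be sketched with the standard library alone. The block below is a minimal illustration, not a training recipe: `to_fp16` round-trips a value through IEEE half precision using `struct`'s `"e"` format, and `to_bf16` emulates BF16 by rounding the FP32 encoding to its top 16 bits. The helper names, the gradient value `1e-8`, and the loss scale `1024` are all hypothetical choices made for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the FP32 encoding.

    Uses round-to-nearest-even on the discarded bits. NaN and values near
    the FP32 maximum are not handled; this is an illustration only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny_grad = 1e-8      # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical loss-scaling constant

# Stored directly in FP16, the value is below the smallest subnormal.
fp16_plain = to_fp16(tiny_grad)

# Scaled first, stored in FP16, then unscaled in higher precision.
fp16_scaled = to_fp16(tiny_grad * loss_scale) / loss_scale

# BF16 represents the value directly, with coarser precision.
bf16_plain = to_bf16(tiny_grad)

print(fp16_plain, fp16_scaled, bf16_plain)
```

In this sketch the unscaled FP16 value collapses to zero, the scaled path recovers it to within about one percent, and the emulated BF16 value survives without any scaling step.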

Integer quantization

Quantization maps continuous floating-point values to discrete integers, typically INT8. The key challenge is choosing how to map the floating-point range to integers. Two approaches dominate.

Symmetric quantization centers the mapping at zero: \[ x_{\text{int}} = \text{round}\left(\frac{x}{s_{\text{quant}}} \times 127\right) \] where \(s_{\text{quant}}\) is the scale factor (typically the maximum absolute value of the tensor), so every quantized value falls in \([-127, 127]\). This works well for weight distributions centered around zero.

Asymmetric quantization handles distributions that are not centered (common after ReLU, which produces only nonnegative values) by shifting the range before scaling. A common activation variant maps to an unsigned 8-bit code. If \(x_{\min}\) is the minimum of the range and \(s_{\text{quant}}\) is the range width (\(x_{\max} - x_{\min}\)): \[ x_{\text{uint8}} = \text{round}\left(\frac{x - x_{\min}}{s_{\text{quant}}} \times 255\right) \]

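The choice between symmetric and asymmetric quantization depends on the tensor’s distribution and has measurable accuracy implications. For example, applying the symmetric mapping to a nonnegative post-ReLU tensor leaves every negative code unused, so the quantization step is roughly twice as large as with an asymmetric mapping over the same range.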
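
The two mappings above can be sketched in a few lines of plain Python. This is a minimal per-tensor illustration under stated assumptions: the function names and the sample activations are hypothetical, \(s_{\text{quant}}\) is taken from the data itself (maximum absolute value for the symmetric case, range width for the asymmetric case), and the input is assumed not to be constant so that the scale is nonzero.

```python
def quantize_symmetric(xs):
    """x_int = round(x / s_quant * 127), with s_quant = max |x|."""
    s_quant = max(abs(x) for x in xs)
    q = [round(x / s_quant * 127) for x in xs]
    return q, s_quant


def dequantize_symmetric(q, s_quant):
    return [v * s_quant / 127 for v in q]


def quantize_asymmetric(xs):
    """x_uint8 = round((x - x_min) / s_quant * 255), with s_quant = x_max - x_min."""
    x_min = min(xs)
    s_quant = max(xs) - x_min
    q = [round((x - x_min) / s_quant * 255) for x in xs]
    return q, x_min, s_quant


def dequantize_asymmetric(q, x_min, s_quant):
    return [v * s_quant / 255 + x_min for v in q]


def max_abs_error(xs, ys):
    return max(abs(x - y) for x, y in zip(xs, ys))


# Hypothetical post-ReLU activations: all nonnegative.
activations = [0.0, 0.1, 0.5, 1.3, 2.0]

q_sym, s_sym = quantize_symmetric(activations)
q_asym, x_min, s_asym = quantize_asymmetric(activations)

sym_error = max_abs_error(activations, dequantize_symmetric(q_sym, s_sym))
asym_error = max_abs_error(
    activations, dequantize_asymmetric(q_asym, x_min, s_asym)
)

print("symmetric codes: ", q_sym, "max error:", sym_error)
print("asymmetric codes:", q_asym, "max error:", asym_error)
```

On this nonnegative sample the symmetric codes occupy only 0 to 127, while the asymmetric codes span 0 to 255, so the asymmetric round-trip error comes out smaller. For a zero-centered weight tensor the two mappings would use their code space about equally, and the simpler symmetric form is usually sufficient.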

With the full toolkit assembled (reference numbers, performance models, architectural constraints, and numerical trade-offs), treat this appendix as your first line of defense whenever a system behaves unexpectedly. A quick back-of-envelope calculation often reveals whether the culprit is physics, architecture, or a genuine software bug.

Summary

Key Takeaways: Numbers every engineer should know
  • Energy dominates: Moving data costs ~600\(\times\) more energy than computing on it. Arithmetic intensity—the ratio of compute to data movement—is the single most important metric for ML workload performance.
  • The Roofline model: The Roofline Model reveals whether a workload is compute bound or memory bound. Most inference workloads fall below the ridge point and are memory bound; batch size is the primary lever to shift toward compute-bound operation.
  • Amdahl’s Law: Amdahl’s Law caps strong-scaling speedup at \(1/s\) (where \(s\) is the serial fraction). Gustafson’s Law shows that scaling the problem alongside hardware yields near-linear throughput gains—the paradigm that makes large-scale training feasible.
  • Little’s Law: The relationship \(Q_{\text{req}} = \lambda_{\text{arr}} T_{\text{lat}}\) directly sizes inference infrastructure: concurrency, memory, and maximum throughput are all linked by this simple identity.
  • Memory hierarchy: The memory hierarchy spans five orders of magnitude in latency (register at ~0.3 ns to SSD at ~100,000 ns). Keeping data close to compute is not an optimization—it is the optimization.
  • Numerical precision: Numerical precision is a systems lever, not just a modeling choice. BF16 matches FP32’s dynamic range at half the memory cost; INT8 quantization can deliver 2–4\(\times\) inference speedup with careful calibration.
  • Physics is nonnegotiable: Speed of light sets latency floors, energy ratios set efficiency ceilings, and no amount of software optimization can violate these constraints.