Principles
A working model is rarely an efficient one. Part II established how to construct ML systems that respect physical constraints; Part III addresses how to meet real-world demands on time, memory, and energy. Every optimization involves navigating a frontier: improving one metric (accuracy, latency, energy) while managing the cost to others. The principles here define the physics of efficiency—the laws that determine why some models are fast and affordable while others are slow and prohibitively expensive.
Principle 1: The Pareto Frontier
- Quantization trades numerical precision for reduced memory footprint.
- Pruning trades model capacity for smaller representations and can improve speed when the resulting sparsity or removed structures are supported by the target hardware.
- Distillation trades training compute for inference efficiency.
Implication: Systems engineers navigate this frontier to find the operating point appropriate to a specific deployment environment. There is no universal optimum.
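To make the idea of a frontier concrete, the following minimal sketch filters a set of candidate model variants down to the non-dominated ones. The variant names and their accuracy and latency figures are hypothetical, chosen only for illustration; the `Candidate` type and both helper functions are likewise assumptions of this sketch, not part of any library.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    strictly_better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical measurements, for illustration only.
candidates = [
    Candidate("fp32_baseline", 0.760, 12.0),
    Candidate("int8_quantized", 0.752, 5.0),
    Candidate("pruned_50pct", 0.741, 7.5),
    Candidate("distilled_student", 0.735, 3.0),
]

frontier = pareto_frontier(candidates)
for c in frontier:
    print(f"{c.name}: accuracy={c.accuracy:.3f}, latency={c.latency_ms} ms")
```

With these made-up numbers, the pruned variant drops out because the quantized variant is both more accurate and faster. The three survivors are all defensible choices; which one is appropriate depends on the deployment's latency budget and accuracy floor.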
Navigating the Pareto frontier requires knowing which resource to optimize. Before selecting a technique, engineers must diagnose whether the workload is limited by computation, memory movement, or dispatch latency. Arithmetic intensity supplies the first test.
Principle 2: Arithmetic Intensity Law
Implication: Adding compute power to a memory-bound model yields essentially no performance gain, because execution time is set by data movement rather than arithmetic. Engineers must identify whether the bottleneck is compute (compute-bound) or memory bandwidth (memory-bound) before selecting an optimization technique.
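As a rough illustration of this diagnosis, the sketch below computes the arithmetic intensity (FLOPs per byte moved) of a dense matrix multiplication and compares it with a machine's ridge point (peak FLOP/s divided by memory bandwidth). The hardware figures are hypothetical round numbers, and the byte count assumes the idealized case where each operand and the output cross the memory interface exactly once.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_element: int) -> float:
    """Arithmetic intensity (FLOP/byte) of C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix is moved to or from memory exactly once (ideal reuse).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Compare intensity against the ridge point of the machine."""
    ridge = peak_flops / bandwidth  # FLOP/byte where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12  # ridge point = 100 FLOP/byte

# The same 4096x4096 FP16 layer at batch size 1 and batch size 512.
small_batch = gemm_intensity(m=1, n=4096, k=4096, bytes_per_element=2)
large_batch = gemm_intensity(m=512, n=4096, k=4096, bytes_per_element=2)

print(f"batch 1:   {small_batch:.1f} FLOP/byte ->",
      classify(small_batch, PEAK_FLOPS, BANDWIDTH))
print(f"batch 512: {large_batch:.1f} FLOP/byte ->",
      classify(large_batch, PEAK_FLOPS, BANDWIDTH))
```

Under these assumptions the batch-1 case performs about one FLOP per byte and sits far below the ridge point, while the batch-512 case reaches roughly 410 FLOP/byte and crosses it. The layer is identical; only the amount of reuse per byte of weights changed.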
Table 1 maps each bottleneck type to the optimization that addresses it—and, equally important, the optimization that would be wasted.
| If the workload is… | Dominant Term | Optimization That Works | Optimization That Is Wasted |
|---|---|---|---|
| Memory-Bound | \(D_{\text{vol}}/\text{BW}\) | Quantization, pruning, batching | Faster GPU (more FLOP/s) |
| Compute-Bound | \(O/(R_{\text{peak}} \cdot \eta_{\text{hw}})\) | Better kernels, Tensor Cores, faster GPU | More memory bandwidth |
| Latency-Bound | \(L_{\text{lat}}\) | Kernel fusion, async dispatch, bounded microbatching under the latency SLO | More FLOP/s or bandwidth alone |
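The three dominant terms in the table can be evaluated side by side. The sketch below does so for a single workload and reports the largest term as the likely bottleneck. Treating the largest term as the bottleneck is a simplifying assumption of this sketch, and the function name, parameter names, and hardware figures are hypothetical.

```python
def diagnose(ops: float, data_bytes: float, peak_flops: float,
             hw_efficiency: float, bandwidth: float,
             fixed_latency_s: float) -> tuple[str, dict[str, float]]:
    """Evaluate the table's three terms (in seconds) and name the largest."""
    terms = {
        "memory-bound": data_bytes / bandwidth,               # D_vol / BW
        "compute-bound": ops / (peak_flops * hw_efficiency),  # O / (R_peak * eta_hw)
        "latency-bound": fixed_latency_s,                     # L_lat
    }
    label = max(terms, key=terms.get)
    return label, terms


# Hypothetical accelerator: 100 TFLOP/s peak at 50% achieved efficiency,
# 1 TB/s bandwidth, 10 microseconds of fixed dispatch overhead.
hw = dict(peak_flops=100e12, hw_efficiency=0.5,
          bandwidth=1e12, fixed_latency_s=10e-6)

# A tiny kernel: 1 MFLOP over 1 MB of data.
tiny_label, tiny_terms = diagnose(ops=1e6, data_bytes=1e6, **hw)

# A streaming kernel: 1 GFLOP over 1 GB of data.
stream_label, stream_terms = diagnose(ops=1e9, data_bytes=1e9, **hw)

print("tiny kernel:     ", tiny_label)
print("streaming kernel:", stream_label)
```

In this example the tiny kernel is dominated by the fixed dispatch overhead, so a faster GPU or more bandwidth would not help it, whereas the streaming kernel is dominated by data movement.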
Many important ML kernels fall on the memory-bound side of the ridge point, especially low-reuse operations such as embedding lookup, normalization, softmax, depthwise convolution, and small-batch inference paths. Dense matrix multiplications and convolutions can instead be compute-bound when batching and hardware utilization are high. This split is a consequence of the memory wall: processor speed has historically outpaced memory bandwidth, and the cumulative gap has widened over three decades. Neural networks, with their massive weight tensors and uneven temporal locality, are especially vulnerable. The Arithmetic Intensity Law diagnoses where a workload sits relative to this wall. The cost of moving data explains why the wall is so punishing.
Principle 3: The Energy-Movement Invariant
Implication: Data locality is the primary driver of efficiency for memory-bound workloads. When movement dominates, optimization strategies should prioritize kernel fusion (keeping intermediate data in registers or on-chip memory) and quantization (reducing data size). For compute-bound workloads, reducing raw operation counts and improving arithmetic kernels remain central.
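A simple traffic count shows why fusion and quantization both attack data movement. The sketch below estimates main-memory bytes for a chain of unary elementwise operations (for example scale, shift, activation). It assumes that every unfused kernel reads its input from main memory and writes its output back, and that the fused kernel keeps intermediates on chip; the function name and the tensor size are hypothetical.

```python
def elementwise_chain_traffic(num_elements: int, bytes_per_element: int,
                              num_ops: int, fused: bool) -> int:
    """Main-memory bytes read plus written for a chain of elementwise ops.

    Unfused: each op is its own pass (one read and one write per element).
    Fused: a single pass reads the input once and writes the result once.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_element


N = 1_000_000  # hypothetical tensor size
OPS = 3        # e.g. scale, shift, activation

unfused_fp32 = elementwise_chain_traffic(N, 4, OPS, fused=False)
fused_fp32 = elementwise_chain_traffic(N, 4, OPS, fused=True)
fused_int8 = elementwise_chain_traffic(N, 1, OPS, fused=True)

print(f"unfused FP32: {unfused_fp32 / 1e6:.0f} MB")
print(f"fused FP32:   {fused_fp32 / 1e6:.0f} MB")
print(f"fused INT8:   {fused_int8 / 1e6:.0f} MB")
```

The arithmetic performed is the same in all three cases. Under these assumptions, fusion cuts traffic by the number of operations in the chain, and moving from 4-byte to 1-byte elements cuts it by a further factor of four.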
Even with perfect data locality and optimal bottleneck targeting, a final constraint limits how much speedup any optimization can deliver.
Principle 4: Amdahl's Law
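Implication: If 95 percent of a model's runtime is accelerated 100\(\times\) on a GPU, the total system speedup is only about 16.8\(\times\), since \(1/(0.05 + 0.95/100) \approx 16.8\); even an infinitely fast accelerator could not exceed 20\(\times\). This explains why data loading and preprocessing often become the ultimate bottlenecks in highly optimized systems.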
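The arithmetic behind this figure can be checked with a few lines. The sketch below uses the standard form of Amdahl's Law, speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that is accelerated and s is the speedup of that fraction; the function names are this sketch's own.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as s grows without limit."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


speedup = amdahl_speedup(p=0.95, s=100.0)
limit = amdahl_limit(p=0.95)

print(f"95% accelerated 100x: {speedup:.1f}x overall")
print(f"limit with infinite acceleration: {limit:.1f}x")
```

The remaining 5 percent of serial work sets a ceiling of 20x on the whole system, which is why shrinking the unaccelerated portion (data loading, preprocessing) can matter more than speeding up the accelerated portion further.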
Part III applies these principles systematically through the D·A·M taxonomy—Data, Algorithm, Machine—asking first whether the work is necessary, then whether it can be simplified, and finally how to do it faster (see The D·A·M Taxonomy for the full diagnostic framework).