Development Principles

Building a machine learning system requires more than assembling model components; it requires managing the flow of information and energy through silicon. Part I established that data is both the program and the physical anchor of every ML system. With that anchor in place, Part II turns to the algorithm-machine interaction: how mathematical models are co-designed around the physical limits of the hardware that must execute them. The principles here begin with the accounting that determines why certain architectures succeed while others fail at scale.

Principle 1: The Iron Law of ML Systems

Invariant: The total time (\(T\)) of any machine learning operation is the sum of three terms, one each for data movement, compute, and fixed system overhead: \[ T = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}} \] where \(D_{\text{vol}}\) is data volume (bytes moved), \(\text{BW}\) is memory bandwidth, \(O\) is the total number of floating-point operations, \(R_{\text{peak}}\) is peak compute rate, \(\eta_{\text{hw}}\) is hardware utilization efficiency (the fraction of peak actually achieved), and \(L_{\text{lat}}\) is fixed latency overhead such as kernel launch or network round-trip time. The full treatment appears in Iron Law of ML Systems.

Written as a sum, the equation describes the worst case, in which the three stages run back to back. When the stages overlap, as they do on modern hardware, wall-clock time approaches whichever term is largest. This is why the equation’s practical lesson is about dominance, not summation: the term that takes longest sets the floor.
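
The dominance reading can be made concrete with a short Python sketch. It evaluates the three terms of the equation, reports the largest one as the floor, and reports the plain sum as the no-overlap ceiling. The function names and every hardware and workload number below are illustrative assumptions, not measurements of any real device.

```python
def iron_law_terms(d_vol, bw, ops, r_peak, eta_hw, l_lat):
    """Return the three iron-law terms, each in seconds."""
    return {
        "data_movement": d_vol / bw,          # D_vol / BW
        "compute": ops / (r_peak * eta_hw),   # O / (R_peak * eta_hw)
        "overhead": l_lat,                    # L_lat
    }


def summarize(terms):
    """Return (dominant term name, floor, serial sum) for a set of terms."""
    dominant = max(terms, key=terms.get)
    floor = terms[dominant]        # best case: perfect overlap of the stages
    serial = sum(terms.values())   # worst case: stages run back to back
    return dominant, floor, serial


# Hypothetical workload and accelerator (assumed values for illustration only).
terms = iron_law_terms(
    d_vol=16e9,    # bytes moved
    bw=2e12,       # bytes per second
    ops=16e9,      # floating-point operations
    r_peak=1e15,   # peak operations per second
    eta_hw=0.5,    # fraction of peak achieved
    l_lat=1e-5,    # seconds of fixed overhead
)
dominant, floor, serial = summarize(terms)
print(f"dominant term: {dominant}")
print(f"floor: {floor * 1e3:.3f} ms, serial sum: {serial * 1e3:.3f} ms")
```

With these assumed numbers the data-movement term is about 8 ms, while compute and overhead together contribute less than 0.05 ms, so the floor and the serial sum differ by under one percent. Shrinking either of the small terms would leave wall-clock time essentially unchanged.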

Implication: Optimization is rarely free of trade-offs. Reducing one term often shifts the bottleneck to another. For example, unstructured pruning reduces compute (\(O\)) but introduces irregular memory access patterns that lower the bandwidth actually achieved, which can increase data-movement time (\(D_{\text{vol}}/\text{BW}\)). A “faster” algorithm on paper is faster in reality only if it reduces the dominant term for the target hardware.
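
A before-and-after comparison shows how the bottleneck can move. The Python sketch below models a dense layer and a hypothetically pruned version of it. The pruned version performs a quarter of the operations, but its irregular accesses are assumed to cut effective bandwidth to a fifth. All numbers, including the assumption that utilization stays the same after pruning, are invented for illustration.

```python
def iron_law_floor(d_vol, effective_bw, ops, r_peak, eta_hw, l_lat):
    """Return (bottleneck name, floor in seconds) for one configuration."""
    terms = {
        "data_movement": d_vol / effective_bw,
        "compute": ops / (r_peak * eta_hw),
        "overhead": l_lat,
    }
    bottleneck = max(terms, key=terms.get)
    return bottleneck, terms[bottleneck]


# Hypothetical accelerator; utilization is assumed unchanged by pruning,
# which is optimistic for sparse kernels.
HW = dict(r_peak=1e15, eta_hw=0.5, l_lat=1e-5)

# Dense layer: regular accesses reach the full assumed bandwidth.
dense = iron_law_floor(d_vol=1e9, effective_bw=1e12, ops=2e12, **HW)

# Pruned layer (assumed): 4x fewer operations, fewer bytes moved,
# but irregular accesses reach only a fifth of the bandwidth.
pruned = iron_law_floor(d_vol=0.4e9, effective_bw=0.2e12, ops=0.5e12, **HW)

speedup = dense[1] / pruned[1]
print(f"dense:  {dense[0]} bound, floor {dense[1] * 1e3:.1f} ms")
print(f"pruned: {pruned[0]} bound, floor {pruned[1] * 1e3:.1f} ms")
print(f"speedup: {speedup:.1f}x from a 4x reduction in operations")
```

Under these assumptions the dense layer is compute bound and the pruned layer is data-movement bound, so a fourfold cut in operations yields only about a twofold reduction in the floor. Further pruning would help only to the extent that it also reduces data-movement time.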

The iron law identifies what to optimize, but not how. That choice is made at the architecture level, before implementation begins.

Principle 2: The Silicon Contract

Invariant: Every model architecture and workload regime makes an implicit commitment to the hardware, a wager on which resource it will saturate first.

ResNet-50 assumes high-density floating-point compute. In batched training or inference on accelerators, it is often compute bound: performance is limited by \(O/(R_{\text{peak}} \cdot \eta_{\text{hw}})\).
8-billion-parameter Llama 3 assumes high-bandwidth memory access during autoregressive decoding. At small batch sizes, it is often bandwidth bound: performance is limited by \(D_{\text{vol}}/\text{BW}\).
DLRM assumes massive embedding tables and sparse lookups. It is shaped by both capacity and bandwidth: performance depends on whether embedding tables fit in the available memory hierarchy and how quickly sparse accesses can be served.
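
The compute-bound versus bandwidth-bound distinction follows directly from comparing the first two terms of the iron law: the compute term exceeds the data-movement term exactly when \(O / D_{\text{vol}} > (R_{\text{peak}} \cdot \eta_{\text{hw}}) / \text{BW}\), that is, when the workload's operations per byte exceed what the hardware can sustain per byte. The Python sketch below applies that test to a simplified model of autoregressive decoding. It assumes roughly 2 operations per parameter per generated token, 2 bytes per parameter, and weights read once per step regardless of batch size, and it ignores activations and the key-value cache. The function names and hardware figures are hypothetical.

```python
def bound_by(ops, d_vol, r_peak, eta_hw, bw):
    """Classify a workload by comparing the compute and data-movement terms."""
    intensity = ops / d_vol            # operations per byte the workload needs
    balance = (r_peak * eta_hw) / bw   # operations per byte the hardware sustains
    regime = "compute" if intensity > balance else "bandwidth"
    return regime, intensity, balance


def decode_step(n_params, batch, bytes_per_param=2):
    """Simplified cost of one decoding step (assumption: weights dominate traffic)."""
    ops = 2 * n_params * batch          # ~2 operations per parameter per token
    d_vol = bytes_per_param * n_params  # weights are read once and shared by the batch
    return ops, d_vol


# Hypothetical accelerator (assumed values for illustration only).
HW = dict(r_peak=1e15, eta_hw=0.5, bw=2e12)

small = bound_by(*decode_step(n_params=8e9, batch=1), **HW)
large = bound_by(*decode_step(n_params=8e9, batch=512), **HW)

print(f"batch 1:   {small[0]} bound (intensity {small[1]:.0f}, balance {small[2]:.0f})")
print(f"batch 512: {large[0]} bound (intensity {large[1]:.0f}, balance {large[2]:.0f})")
```

In this simplified model the same 8-billion-parameter network is bandwidth bound at batch size 1 and compute bound at batch size 512, which is why the contract depends on the workload regime as well as the architecture. Capacity limits, such as whether embedding tables fit in a given memory tier, are a separate check that this sketch does not model.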

Implication: Designing a model without knowing which hardware resource it will saturate is like designing a bridge without knowing the strength of the steel. The design must target the bottleneck.

Together, the iron law and the silicon contract frame every design decision in Part II. The chapters that follow translate these principles into the components of the ML stack: the mathematical foundations of gradient flow, the architectural patterns that commit to specific hardware resources, the frameworks that automate the iron law, and the training systems that execute the same physics at scale.