Notation
Machine Learning Systems spans Machine Learning (computer science/statistics) and Systems (computer architecture/hardware). Each field developed its notation independently, so the same symbol often means different things depending on the community or paper using it. These collisions cause real confusion when the disciplines merge. The conventions below establish a single notation that removes the ambiguity.
Consider a simple statement: “Increasing \(B\) improves throughput.” To an ML researcher, \(B\) means batch size. To a hardware engineer, \(B\) means bandwidth. Both interpretations are correct in their respective fields, but in ML Systems we need both concepts in the same equation—hence the need for a single, consistent convention.
The Iron Law of ML Systems
The fundamental performance equation of this book, introduced in the opening chapter, is:
\[T = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}}\]
Each variable was chosen deliberately to avoid collision with standard ML terminology.
| Symbol | Definition | Unit | Why This Symbol? |
|---|---|---|---|
| \(T\) | Time | seconds | Unambiguous. Wall-clock time for an operation. |
| \(D_{\text{vol}}\) | Data Volume | bytes | Avoids collision with \(D\) (Dataset Size). In scaling laws, \(D\) means training tokens. Here we need bytes moved through memory. The subscript disambiguates. |
| \(\text{BW}\) | Bandwidth | bytes/s | Avoids collision with \(B\) (Batch Size). Physics uses \(B\) for bandwidth, but ML papers almost universally use \(B\) for batch size. We preserve the ML convention. |
| \(O\) | Operations | FLOPs | Total floating-point operations. Clean in equations (vs. “\(\text{Ops}\)”). |
| \(R_{\text{peak}}\) | Peak Rate | FLOP/s | Avoids collision with \(P\) (Parameters). Roofline models use \(P\) for peak performance, but ML universally uses \(P\) for parameter count. We preserve the ML convention. |
| \(\eta_{\text{hw}}\) | Efficiency | — | Hardware utilization \((0 \le \eta_{\text{hw}} \le 1)\). Avoids collision with learning rate \((\eta)\). |
| \(L_{\text{lat}}\) | Latency | seconds | Avoids collision with \(\mathcal{L}\) (Loss). Fixed overhead time (kernel launch, network RTT). The subscript distinguishes from the loss function. |
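To make the roles of the symbols concrete, the following is a minimal sketch that evaluates the iron law term by term. The function name `iron_law_time` and all numeric values are hypothetical, chosen only for illustration. The sketch also assumes \(\eta_{\text{hw}} > 0\), since the compute term is undefined at zero utilization.

```python
def iron_law_time(d_vol, bw, ops, r_peak, eta_hw, l_lat):
    """Evaluate T = D_vol / BW + O / (R_peak * eta_hw) + L_lat, in seconds.

    d_vol  : bytes moved            bw     : bytes/s
    ops    : total FLOPs            r_peak : FLOP/s
    eta_hw : hardware utilization   l_lat  : fixed latency in seconds
    """
    # Assumption: eta_hw must be strictly positive to avoid division by zero.
    if not 0.0 < eta_hw <= 1.0:
        raise ValueError("eta_hw must lie in (0, 1]")
    data_time = d_vol / bw
    compute_time = ops / (r_peak * eta_hw)
    return data_time + compute_time + l_lat


# Hypothetical workload and hardware numbers (illustrative only).
t = iron_law_time(
    d_vol=2e9,       # 2 GB moved
    bw=2e12,         # 2 TB/s
    ops=1e12,        # 1 TFLOP of work
    r_peak=100e12,   # 100 TFLOP/s peak
    eta_hw=0.5,      # 50% utilization
    l_lat=10e-6,     # 10 microseconds of fixed overhead
)
print(f"T = {t * 1e3:.3f} ms")
```

With these placeholder values the compute term (20 ms) dominates the data term (1 ms) and the fixed latency (0.01 ms); different inputs shift the balance between the three terms.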
Why these choices matter
Without careful notation, sentences become ambiguous:
“Reducing \(D\) improves performance.”
The sentence has two possible readings:
- Reducing dataset size (fewer training samples) can speed training but may reduce accuracy.
- Reducing data volume moved (for example, through compression or quantization) can speed inference, with accuracy effects that depend on the technique and calibration.
With our notation, we can write precisely:
“Reducing \(D_{\text{vol}}\) through FP32-to-INT8 quantization cuts parameter memory traffic to one quarter while \(D\) (training data) remains unchanged.”
Our notation removes such ambiguity: “\(\text{BW}\) limits throughput” has exactly one reading.
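The one-quarter figure in the quantization statement follows directly from the storage sizes of the two formats. The sketch below works through the arithmetic. The parameter count and token count are hypothetical, and it assumes every parameter is read once and that quantization metadata (scales, zero points) is negligible.

```python
# Hypothetical model and dataset sizes (illustrative only).
P = 7_000_000_000        # parameter count
D = 1_000_000_000_000    # training tokens; quantization does not touch this

BYTES_FP32 = 4
BYTES_INT8 = 1

# D_vol for one full read of the weights, in bytes.
d_vol_fp32 = P * BYTES_FP32
d_vol_int8 = P * BYTES_INT8

ratio = d_vol_int8 / d_vol_fp32

print(f"D_vol (FP32): {d_vol_fp32 / 1e9:.0f} GB")
print(f"D_vol (INT8): {d_vol_int8 / 1e9:.0f} GB")
print(f"ratio: {ratio}")
print(f"D (tokens) unchanged: {D}")
```

Only \(D_{\text{vol}}\) changes between the two cases; \(P\) and \(D\) keep their ML meanings throughout.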
Subscripted variants
When a multi-letter quantity name takes a descriptive subscript (for example, distinguishing storage bandwidth from network bandwidth), wrap both the root and the subscript in \text{}. Single-letter roots stay in math italic, with only the subscript wrapped:
- \(\text{BW}_{\text{disk}}\), \(\text{BW}_{\text{network}}\), \(\text{BW}_{\text{accelerator}}\)
- \(D_{\text{vol}}\), \(R_{\text{peak}}\), \(L_{\text{lat}}\), \(\eta_{\text{hw}}\)
- \(E_{\text{move}}\), \(E_{\text{compute}}\), \(E_{\text{total}}\)
Without \text{}, math mode renders each letter as an italic variable and the spacing collapses to look like a product, not a label.
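As a small illustration of the difference, the fragment below sets the same quantities both ways. It is a sketch that assumes the `amsmath` package (or an equivalent that provides `\text`) is loaded.

```latex
% Preferred: labels are set upright and read as words.
\text{BW}_{\text{disk}} \qquad D_{\text{vol}} \qquad E_{\text{move}}

% Avoid: each letter becomes an italic variable, so "BW" reads as B times W
% and the subscript reads as a product of single-letter symbols.
BW_{disk} \qquad D_{vol} \qquad E_{move}
```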
The Degradation Equation
The silent failure mode of ML systems is captured by the degradation equation introduced in the opening chapter. Some symbols below (such as \(\tau\)) appear in chapter prose rather than in the block equation itself; defining them centrally keeps the canonical form unambiguous when they do appear.
\[\text{Accuracy}(t) \approx \text{Accuracy}_0 - \lambda \cdot \mathcal{D}(P_t \lVert P_0)\]
| Symbol | Definition | Unit/Type | Notes |
|---|---|---|---|
| \(\text{Accuracy}(t)\) | Accuracy at Time \(t\) | Scalar | Model accuracy after the model has been deployed for time \(t\). |
| \(\text{Accuracy}_0\) | Initial Accuracy | Scalar | Model accuracy at deployment time. |
| \(\lambda\) | Sensitivity | Scalar | Model sensitivity to distribution shift. Architecture-dependent. (Not wavelength.) |
| \(P_t\) | Current Distribution | Distribution | The data distribution at time \(t\). (Not parameters—use \(P\) for parameter count.) |
| \(P_0\) | Training Distribution | Distribution | The data distribution at training time. |
| \(\mathcal{D}(P_t \lVert P_0)\) | Statistical Divergence | Scalar \(\ge 0\) | Measures how far \(P_t\) has drifted from \(P_0\). Common choices: KL divergence, total variation, Wasserstein. (Calligraphic to avoid collision with \(D\) = dataset size.) |
| \(\tau\) | Drift Threshold | Scalar \(> 0\) | Retraining is triggered when \(\mathcal{D}(P_t \lVert P_0) > \tau\). |
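The sketch below shows how these symbols fit together, using KL divergence as one of the common choices for \(\mathcal{D}\). The function names, the three-bin distributions, and the values of \(\lambda\), \(\text{Accuracy}_0\), and \(\tau\) are all hypothetical. The KL computation assumes discrete distributions where \(P_0\) is nonzero wherever \(P_t\) is nonzero.

```python
import math


def kl_divergence(p_t, p_0):
    """KL(P_t || P_0) in nats for two discrete distributions of equal length.

    Assumption: p_0[i] > 0 wherever p_t[i] > 0; bins with p_t[i] == 0 add nothing.
    """
    return sum(pt * math.log(pt / p0) for pt, p0 in zip(p_t, p_0) if pt > 0)


def estimated_accuracy(accuracy_0, lam, divergence):
    """Accuracy(t) ~ Accuracy_0 - lambda * D(P_t || P_0)."""
    return accuracy_0 - lam * divergence


def needs_retraining(divergence, tau):
    """Retraining is triggered when the divergence exceeds the threshold tau."""
    return divergence > tau


# Hypothetical class frequencies at training time and at time t.
p_0 = [0.5, 0.3, 0.2]
p_t = [0.4, 0.3, 0.3]

div = kl_divergence(p_t, p_0)
acc_t = estimated_accuracy(accuracy_0=0.92, lam=0.5, divergence=div)
retrain = needs_retraining(div, tau=0.02)

print(f"D(P_t || P_0) = {div:.4f} nats")
print(f"estimated Accuracy(t) = {acc_t:.4f}")
print(f"retrain: {retrain}")
```

Swapping in total variation or Wasserstein distance only changes `kl_divergence`; the accuracy estimate and the threshold check keep the same form.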
The Energy Corollary
The energy cost of ML workloads follows from the iron law of ML systems and decomposes as:
\[E_{\text{total}} \approx D_{\text{vol}} \times E_{\text{move}} + O \times E_{\text{compute}}\]
| Symbol | Definition | Unit | Notes |
|---|---|---|---|
| \(E_{\text{total}}\) | Total Energy | joules | Total energy consumed by an ML workload, decomposed into data-movement and compute terms. |
| \(E_{\text{move}}\) | Energy per Byte Moved | joules/byte | Energy cost of moving one byte. Typically far exceeds the per-operation cost \((E_{\text{move}} \gg E_{\text{compute}})\), so data movement often dominates total energy. |
| \(E_{\text{compute}}\) | Energy per Operation | joules/FLOP | Energy cost of a single arithmetic operation. |
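A short sketch of the decomposition follows. The function name and the per-byte and per-operation energy values are placeholders chosen for illustration, not measurements of any particular device.

```python
def energy_breakdown(d_vol, e_move, ops, e_compute):
    """Return (movement, compute, total) energy in joules.

    E_total ~ D_vol * E_move + O * E_compute
    d_vol in bytes, e_move in joules/byte, ops in FLOPs, e_compute in joules/FLOP.
    """
    movement = d_vol * e_move
    compute = ops * e_compute
    return movement, compute, movement + compute


# Placeholder values (illustrative only, not measured).
movement_j, compute_j, total_j = energy_breakdown(
    d_vol=20e9,         # 20 GB moved
    e_move=100e-12,     # 100 pJ per byte
    ops=1e12,           # 1 TFLOP of work
    e_compute=1e-12,    # 1 pJ per FLOP
)
print(f"movement: {movement_j:.2f} J, compute: {compute_j:.2f} J, total: {total_j:.2f} J")
```

Keeping the two terms separate makes it easy to see which one a given optimization targets: reducing \(D_{\text{vol}}\) shrinks the first term, reducing \(O\) shrinks the second.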
Deep Learning Notation
We follow standard deep learning conventions (Goodfellow et al. 2016) with explicit disambiguation for systems variables.
| Symbol | Definition | Dimensions/Type |
|---|---|---|
| \(B\) | Batch Size | Integer. The number of samples processed in parallel. (Never bandwidth.) |
| \(P\) | Parameters | Integer. The total count of trainable weights in a model. (Never peak FLOP/s.) |
| \(D\) | Dataset Size | Integer. Number of training samples or tokens. (Never data volume in bytes—use \(D_{\text{vol}}\).) |
| \(S\) | Sequence Length | Integer. Number of tokens or time steps. |
| \(d\) | Hidden Dimension | Integer. Size of the hidden state vector. |
| \(d_{\text{head}}\) | Attention Head Dimension | Integer. Per-head hidden dimension in attention layers. |
| \(N_L\) | Number of Layers | Integer. Total number of layers in a network. |
| \(N_{\text{heads}}\) | Number of Attention Heads | Integer. Number of attention heads in a multi-head attention layer. |
| \(H_{\text{KV}}\) | Number of Key-Value Heads | Integer. Number of key-value heads in grouped-query or multi-query attention. |
| \(\ell\) | Layer Index | Integer. Index for a layer. Use instead of bare \(L\) when indexing layers, since \(L\) collides with loss and latency. |
| \(\mathcal{L}\) | Loss Function | Scalar. The objective function minimized during training. |
| \(\eta\) | Learning Rate | Scalar. Step size for the optimizer. (Never bare for hardware efficiency—use \(\eta_{\text{hw}}\).) |
| \(\theta\) | Model Weights | Vector/Matrix. The set of all learnable parameters. |
| \(p(x)\) | Distribution | Probability mass/density for a random variable. Use lowercase \(p(\cdot)\) for generic distributions to avoid collision with \(P\) (parameter count). |
| \(p(y \mid x)\) | Conditional Distribution | Conditional probability/density. Use this form for generic label relationships; reserve \(P_0\) and \(P_t\) for the degradation equation’s training/current distributions. |
| \(\Pr(E)\) | Event Probability | Probability of an event \(E\). Use for event statements such as \(\Pr(\text{batch}=0)\); use \(p(x)\) and \(p(y \mid x)\) for distributions. |
Local matrix algebra may use \(A\), \(B\), and \(C\) for operands when the equation explicitly introduces matrices (for example, GEMM \(C=\alpha AB+\beta C\)). This is a local linear-algebra convention, not a global override of \(B\) as batch size.
Performance, Serving, and Memory Notation
Reusable performance and serving quantities follow the same collision-avoidance rule as the iron law. Rates use \(R\) or descriptive Greek symbols; request counts use \(Q_{\text{req}}\) rather than overloading \(N\); queue utilization uses a subscripted \(\rho\) so bare \(\rho\) remains available for other ratio models.
| Symbol | Definition | Unit/Type | Notes |
|---|---|---|---|
| \(I\) | Arithmetic Intensity | FLOP/byte | Workload FLOPs per byte moved. The roofline model uses \(I\) as the independent variable. |
| \(I_{\text{ridge}}\) | Roofline Ridge Point | FLOP/byte | \(I_{\text{ridge}} = R_{\text{peak}}/\text{BW}\). Prefer this explicit form over starred shorthand so the meaning remains clear in prose. |
| \(R_{\text{attain}}\) | Attainable Compute Rate | FLOP/s | Roofline rate: \(R_{\text{attain}} = \min(R_{\text{peak}}, I \times \text{BW})\). Uses \(R\), not \(T\), because this quantity is a rate. |
| \(\text{MFU}\) | Model FLOPs Utilization | Dimensionless | Useful model FLOP/s divided by available peak FLOP/s. Text acronym avoids overloading \(\eta\). |
| \(r_{\text{comp}}\) | Compression Ratio | Dimensionless | Uncompressed size divided by compressed size. A compressed payload has size \(M/r_{\text{comp}}\); the subscript avoids bare \(C\) collisions. |
| \(Q_{\text{req}}\) | Request Concurrency | requests | Average in-flight requests in a stable serving system. Avoids collision with \(N\) as device count in distributed settings. |
| \(\lambda_{\text{arr}}\) | Arrival Rate | requests/s | Request arrival rate. The subscript avoids collision with \(\lambda\) as degradation sensitivity or failure rate. |
| \(T_{\text{lat}}\) | Request Time in System | seconds | End-to-end queueing/serving latency for Little’s Law. Distinct from \(L_{\text{lat}}\), the fixed-latency term in the iron law. |
| \(T_{\text{svc}}(B)\) | Batch Service Time | seconds | Time to serve a batch of size \(B\). |
| \(\mu_{\text{eff}}(B)\) | Effective Service Rate | requests/s | Batched service rate, typically \(\mu_{\text{eff}}(B) = B/T_{\text{svc}}(B)\). |
| \(\rho_{\text{serv}}\) | Serving Utilization | Dimensionless | Queue/server utilization. Use instead of bare \(\rho\), which is reserved for communication-computation ratio in distributed contexts. |
| \(M_{\text{total}}\) | Total Memory Footprint | bytes | Sum of explicit memory components; avoids bare \(M\) ambiguity. |
| \(M_{\text{weights}}\) | Weight Memory | bytes | Memory occupied by model parameters. |
| \(M_{\text{gradients}}\) | Gradient Memory | bytes | Memory occupied by stored gradients. |
| \(M_{\text{optimizer}}\) | Optimizer-State Memory | bytes | Momentum, variance, master weights, and related optimizer buffers. |
| \(M_{\text{activations}}\) | Activation Memory | bytes | Retained activations for the backward pass or serving intermediates. |
| \(s_{\text{elem}}\) | Element Storage Size | bytes/element | Bytes per stored tensor element. |
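The relationships stated in the table can be sketched as a few one-line functions. The function names and all numeric inputs are hypothetical. The utilization formula \(\rho_{\text{serv}} = \lambda_{\text{arr}} / \mu_{\text{eff}}(B)\) is the standard queueing definition and is an assumption here, since the table does not spell it out.

```python
def ridge_point(r_peak, bw):
    """I_ridge = R_peak / BW, in FLOP/byte."""
    return r_peak / bw


def attainable_rate(r_peak, bw, intensity):
    """Roofline: R_attain = min(R_peak, I * BW), in FLOP/s."""
    return min(r_peak, intensity * bw)


def effective_service_rate(batch_size, t_svc):
    """mu_eff(B) = B / T_svc(B), in requests/s."""
    return batch_size / t_svc


def request_concurrency(lambda_arr, t_lat):
    """Little's Law: Q_req = lambda_arr * T_lat, in requests."""
    return lambda_arr * t_lat


def serving_utilization(lambda_arr, mu_eff):
    """Assumed standard definition: rho_serv = lambda_arr / mu_eff."""
    return lambda_arr / mu_eff


# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
R_PEAK, BW = 100e12, 2e12
i_ridge = ridge_point(R_PEAK, BW)
r_low = attainable_rate(R_PEAK, BW, intensity=10)     # below the ridge: bandwidth-bound
r_high = attainable_rate(R_PEAK, BW, intensity=200)   # above the ridge: compute-bound

# Hypothetical serving workload.
mu_eff = effective_service_rate(batch_size=8, t_svc=0.04)
q_req = request_concurrency(lambda_arr=150, t_lat=0.1)
rho_serv = serving_utilization(lambda_arr=150, mu_eff=mu_eff)

print(f"I_ridge = {i_ridge:.0f} FLOP/byte")
print(f"R_attain at I=10: {r_low / 1e12:.0f} TFLOP/s, at I=200: {r_high / 1e12:.0f} TFLOP/s")
print(f"mu_eff = {mu_eff:.0f} req/s, Q_req = {q_req:.0f}, rho_serv = {rho_serv:.2f}")
```

Note that \(T_{\text{lat}}\) in Little's Law is the end-to-end time in system, not the fixed \(L_{\text{lat}}\) term of the iron law, which is why the two carry different symbols.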
Units and Precision
- Physical units: This book uses SI (metric) units throughout, including meters, kilograms, seconds, watts, and °C, consistent with standard engineering and scientific practice. Where source data was originally reported in imperial units, we convert to SI and note the original values parenthetically. A space always separates the number from the unit (for example, 100 ms, 2 TB/s).
- Data and memory: We use decimal SI prefixes only: KB = \(10^3\) bytes, MB = \(10^6\), GB = \(10^9\), TB = \(10^{12}\). We do not use binary-prefixed units in prose; all capacities, throughputs, and model sizes are reported in decimal units (for example, 80 GB, 2 TB/s, 102 MB).
- Compute: We distinguish operation counts from rates. Total work uses FLOPs (for example, GFLOPs, TFLOPs), while throughput uses FLOP/s with decimal prefixes (for example, GFLOP/s, TFLOP/s).
  - 1 TFLOP = \(10^{12}\) FLOPs
  - 1 TFLOP/s = \(10^{12}\) FLOPs per second
- Arithmetic intensity conventionally uses FLOP/byte as a unit ratio (floating-point operations per byte moved). This is a ratio unit, not a total-work symbol or throughput symbol.
- Currency: Dollar amounts use the dollar sign ($); unless otherwise noted, dollar-denominated costs are U.S. dollars (USD).
- Precision:
  - FP64: Double precision (8 bytes)
  - FP32: Single precision (4 bytes)
  - TF32: TensorFloat-32 (19-bit compute format, commonly stored as FP32)
  - FP16: Half precision (2 bytes, standard range)
  - BF16: Brain float (2 bytes, wide dynamic range)
  - FP8: Quarter precision (1 byte, E4M3 or E5M2 format)
  - FP4: 4-bit floating-point format
  - INT8: 8-bit integer (1 byte)
  - INT4: 4-bit integer; lower integer precisions follow the same uppercase INTn pattern (for example, INT3, INT2)
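The precision list and the decimal-prefix rule combine naturally when sizing a model. The sketch below computes weight memory as parameter count times element storage size and reports it in decimal GB. The table name, the function names, and the parameter count are hypothetical. It assumes 4-bit formats pack two elements per byte and ignores any scaling metadata.

```python
# Storage size per element in bytes, following the precision list above.
# Assumptions: TF32 is stored as FP32 (4 bytes); 4-bit formats pack two
# elements per byte, with no per-block scale factors counted.
BYTES_PER_ELEMENT = {
    "FP64": 8,
    "FP32": 4,
    "TF32": 4,
    "FP16": 2,
    "BF16": 2,
    "FP8": 1,
    "INT8": 1,
    "FP4": 0.5,
    "INT4": 0.5,
}


def weight_memory_bytes(param_count, fmt):
    """Weight memory in bytes: parameter count times bytes per element."""
    return param_count * BYTES_PER_ELEMENT[fmt]


def to_decimal_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


P = 7_000_000_000  # hypothetical parameter count

for fmt in ("FP32", "BF16", "INT8", "INT4"):
    gb = to_decimal_gb(weight_memory_bytes(P, fmt))
    print(f"{fmt}: {gb:.1f} GB")
```

Because the prefixes are decimal, an 80 GB capacity in this notation means exactly \(80 \times 10^9\) bytes.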
Quick Reference: Resolving Collisions
Common collision points in ML Systems literature include:
| Symbol | ML Meaning | Systems Meaning | Our Convention |
|---|---|---|---|
| \(B\) | Batch Size | Bandwidth | Batch Size. Use \(\text{BW}\) for bandwidth. |
| \(P\) | Parameters | Peak FLOP/s | Parameters. Use \(R_{\text{peak}}\) for peak rate. |
| \(D\) | Dataset Size | Data Volume | Dataset Size. Use \(D_{\text{vol}}\) for bytes moved. |
| \(L\) | Loss | Latency | Loss \((\mathcal{L})\). Use \(L_{\text{lat}}\) for latency. |
| \(\eta\) | Learning Rate | Efficiency | Learning Rate. Use \(\eta_{\text{hw}}\) for efficiency. |
The general principle: ML conventions take precedence for single letters; systems concepts get subscripts or multi-letter symbols. This reflects the primary audience (ML practitioners learning systems) and preserves compatibility with the vast ML literature.