Notation

Machine Learning Systems spans Machine Learning (computer science/statistics) and Systems (computer architecture/hardware). Each field developed its notation independently, so the same symbol often means different things depending on the community or paper using it. This collision creates real confusion when the disciplines merge. The conventions below establish a single notation that eliminates the ambiguity.

Consider a simple statement: “Increasing \(B\) improves throughput.” To an ML researcher, \(B\) means batch size. To a hardware engineer, \(B\) means bandwidth. Both interpretations are correct in their respective fields, but in ML Systems we need both concepts in the same equation—hence the need for a single, consistent convention.

The Iron Law of ML Systems

The fundamental performance equation of this book, introduced in the opening chapter, is:

\[T = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}}\]

Each variable was chosen deliberately to avoid collision with standard ML terminology.

Symbol Definition Unit Why This Symbol?
\(T\) Time seconds Unambiguous. Wall-clock time for an operation.
\(D_{\text{vol}}\) Data Volume bytes Avoids collision with \(D\) (Dataset Size). In scaling laws, \(D\) means training tokens. Here we need bytes moved through memory. The subscript disambiguates.
\(\text{BW}\) Bandwidth bytes/s Avoids collision with \(B\) (Batch Size). Physics uses \(B\) for bandwidth, but every ML paper uses \(B\) for batch size. We preserve the ML convention.
\(O\) Operations FLOPs Total floating-point operations. Clean in equations (vs. “\(\text{Ops}\)”).
\(R_{\text{peak}}\) Peak Rate FLOP/s Avoids collision with \(P\) (Parameters). Roofline models use \(P\) for peak performance, but ML universally uses \(P\) for parameter count. We preserve the ML convention.
\(\eta_{\text{hw}}\) Efficiency dimensionless Hardware utilization \((0 \le \eta_{\text{hw}} \le 1)\). Avoids collision with learning rate \((\eta)\).
\(L_{\text{lat}}\) Latency seconds Avoids collision with \(\mathcal{L}\) (Loss). Fixed overhead time (kernel launch, network RTT). The subscript distinguishes from the loss function.
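
As a minimal sketch, the iron law can be evaluated directly from its three terms. The function name and every numeric value below are illustrative assumptions, not measurements of any particular device. The sketch also requires \(\eta_{\text{hw}} > 0\), since a utilization of exactly zero would make the compute term undefined.

```python
def iron_law_time(d_vol, bw, ops, r_peak, eta_hw, l_lat):
    """T = D_vol/BW + O/(R_peak * eta_hw) + L_lat, in seconds.

    d_vol: bytes moved, bw: bytes/s, ops: FLOPs,
    r_peak: FLOP/s, eta_hw: utilization in (0, 1], l_lat: seconds.
    """
    if not 0.0 < eta_hw <= 1.0:
        raise ValueError("eta_hw must be in (0, 1]")
    data_term = d_vol / bw                 # time spent moving bytes
    compute_term = ops / (r_peak * eta_hw) # time spent on arithmetic
    return data_term + compute_term + l_lat


# Hypothetical workload: 2 GB moved at 2 TB/s, 1 TFLOP of work on a
# 100 TFLOP/s device at 50% utilization, 10 microseconds fixed overhead.
t = iron_law_time(d_vol=2e9, bw=2e12, ops=1e12,
                  r_peak=100e12, eta_hw=0.5, l_lat=10e-6)
print(f"T = {t * 1e3:.2f} ms")  # prints: T = 21.01 ms
```

With these assumed numbers the compute term (20 ms) dominates the data term (1 ms), so the workload is compute-bound.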

Why these choices matter

Without careful notation, sentences become ambiguous:

“Reducing \(D\) improves performance.”

The sentence has two possible readings:

  • Reducing dataset size (fewer training samples) can speed training but may reduce accuracy.
  • Reducing data volume moved (for example, through compression or quantization) can speed inference, with accuracy effects that depend on the technique and calibration.

With our notation, we can write precisely:

“Reducing \(D_{\text{vol}}\) through FP32-to-INT8 quantization cuts parameter memory traffic to one quarter while \(D\) (training data) remains unchanged.”

Our notation removes this kind of ambiguity: “\(\text{BW}\) limits throughput” has only one reading.

Subscripted variants

When a multi-letter quantity name takes a descriptive subscript (for example, distinguishing storage bandwidth from network bandwidth), wrap both the root and the subscript in \text{}. Single-letter roots stay in math italic, and only their descriptive subscript is wrapped:

  • \(\text{BW}_{\text{disk}}\), \(\text{BW}_{\text{network}}\), \(\text{BW}_{\text{accelerator}}\)
  • \(D_{\text{vol}}\), \(R_{\text{peak}}\), \(L_{\text{lat}}\), \(\eta_{\text{hw}}\)
  • \(E_{\text{move}}\), \(E_{\text{compute}}\), \(E_{\text{total}}\)

Without \text{}, math mode renders each letter as an italic variable and the spacing collapses to look like a product, not a label.

The Degradation Equation

The silent failure mode of ML systems is captured by the degradation equation introduced in the opening chapter. Some symbols below (such as \(\tau\)) appear in chapter prose rather than in the block equation itself; defining them centrally keeps the canonical form unambiguous when they do appear.

\[\text{Accuracy}(t) \approx \text{Accuracy}_0 - \lambda \cdot \mathcal{D}(P_t \lVert P_0)\]

Symbol Definition Unit/Type Notes
\(\text{Accuracy}(t)\) Accuracy at Time \(t\) Scalar Model accuracy after the model has been deployed for time \(t\).
\(\text{Accuracy}_0\) Initial Accuracy Scalar Model accuracy at deployment time.
\(\lambda\) Sensitivity Scalar Model sensitivity to distribution shift. Architecture-dependent. (Not wavelength.)
\(P_t\) Current Distribution Distribution The data distribution at time \(t\). (Not parameters—use \(P\) for parameter count.)
\(P_0\) Training Distribution Distribution The data distribution at training time.
\(\mathcal{D}(P_t \lVert P_0)\) Statistical Divergence Scalar \(\ge 0\) Measures how far \(P_t\) has drifted from \(P_0\). Common choices: KL divergence, total variation, Wasserstein. (Calligraphic to avoid collision with \(D\) = dataset size.)
\(\tau\) Drift Threshold Scalar \(> 0\) Retraining is triggered when \(\mathcal{D}(P_t \lVert P_0) > \tau\).
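
The following sketch evaluates the degradation equation with KL divergence as the choice of \(\mathcal{D}\), one of the options listed above. The histograms, the sensitivity \(\lambda = 0.5\), the initial accuracy, and the threshold \(\tau = 0.1\) are all hypothetical values chosen for illustration.

```python
import math


def kl_divergence(p_t, p_0):
    """KL(P_t || P_0) in nats for discrete distributions on the same bins.

    Assumes p_0 is nonzero wherever p_t is nonzero.
    """
    return sum(pt * math.log(pt / p0)
               for pt, p0 in zip(p_t, p_0) if pt > 0.0)


def predicted_accuracy(acc_0, lam, divergence):
    """Accuracy(t) ~= Accuracy_0 - lambda * D(P_t || P_0)."""
    return acc_0 - lam * divergence


def needs_retraining(divergence, tau):
    """Retraining is triggered when the divergence exceeds tau."""
    return divergence > tau


p_0 = [0.5, 0.3, 0.2]  # hypothetical training-time class frequencies
p_t = [0.3, 0.3, 0.4]  # hypothetical frequencies observed in production

drift = kl_divergence(p_t, p_0)
acc_t = predicted_accuracy(acc_0=0.95, lam=0.5, divergence=drift)
print(f"divergence = {drift:.3f}, predicted accuracy = {acc_t:.3f}, "
      f"retrain = {needs_retraining(drift, tau=0.1)}")
```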

The Energy Corollary

The energy cost of ML workloads follows from the iron law of ML systems and decomposes as:

\[E_{\text{total}} \approx D_{\text{vol}} \times E_{\text{move}} + O \times E_{\text{compute}}\]

Symbol Definition Unit Notes
\(E_{\text{total}}\) Total Energy joules Total energy consumed by an ML workload, decomposed into data-movement and compute terms.
\(E_{\text{move}}\) Energy per Byte Moved joules/byte Energy cost of moving one byte. Typically much larger than the cost of one operation \((E_{\text{move}} \gg E_{\text{compute}})\), so data movement often dominates total energy.
\(E_{\text{compute}}\) Energy per Operation joules/FLOP Energy cost of a single arithmetic operation.
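
A short sketch of the energy decomposition follows. The per-byte and per-operation energies (100 pJ/byte and 1 pJ/FLOP) are placeholder assumptions used only to show the arithmetic; real values depend on the memory level and the hardware generation.

```python
def energy_terms(d_vol, e_move, ops, e_compute):
    """Return (movement energy, compute energy) in joules.

    d_vol: bytes moved, e_move: joules/byte,
    ops: FLOPs, e_compute: joules/FLOP.
    """
    return d_vol * e_move, ops * e_compute


def total_energy(d_vol, e_move, ops, e_compute):
    """E_total ~= D_vol * E_move + O * E_compute."""
    move_j, compute_j = energy_terms(d_vol, e_move, ops, e_compute)
    return move_j + compute_j


# Placeholder per-unit costs (assumptions, not measured values).
E_MOVE = 100e-12     # joules per byte
E_COMPUTE = 1e-12    # joules per FLOP

move_j, compute_j = energy_terms(1e9, E_MOVE, 1e12, E_COMPUTE)
print(f"movement = {move_j:.2f} J, compute = {compute_j:.2f} J")
```

Under these assumed numbers the workload performs 1,000 FLOPs per byte, so the compute term is the larger one even though each byte costs 100 times more than each operation. Which term dominates depends on the ratio \(O / D_{\text{vol}}\) as well as on the per-unit costs.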

Deep Learning Notation

We follow standard deep learning conventions (Goodfellow et al. 2016) with explicit disambiguation for systems variables.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Symbol Definition Dimensions/Type
\(B\) Batch Size Integer. The number of samples processed in parallel. (Never bandwidth.)
\(P\) Parameters Integer. The total count of trainable weights in a model. (Never peak FLOP/s.)
\(D\) Dataset Size Integer. Number of training samples or tokens. (Never data volume in bytes—use \(D_{\text{vol}}\).)
\(S\) Sequence Length Integer. Number of tokens or time steps.
\(d\) Hidden Dimension Integer. Size of the hidden state vector.
\(d_{\text{head}}\) Attention Head Dimension Integer. Per-head hidden dimension in attention layers.
\(N_L\) Number of Layers Integer. Total number of layers in a network.
\(N_{\text{heads}}\) Number of Attention Heads Integer. Number of attention heads in a multi-head attention layer.
\(H_{\text{KV}}\) Number of Key-Value Heads Integer. Number of key-value heads in grouped-query or multi-query attention.
\(\ell\) Layer Index Integer. Index for a layer. Use instead of bare \(L\) when indexing layers, since \(L\) collides with loss and latency.
\(\mathcal{L}\) Loss Function Scalar. The objective function minimized during training.
\(\eta\) Learning Rate Scalar. Step size for the optimizer. (Never bare for hardware efficiency—use \(\eta_{\text{hw}}\).)
\(\theta\) Model Weights Vector/Matrix. The set of all learnable parameters.
\(p(x)\) Distribution Probability mass/density for a random variable. Use lowercase \(p(\cdot)\) for generic distributions to avoid collision with \(P\) (parameter count).
\(p(y \mid x)\) Conditional Distribution Conditional probability/density. Use this form for generic label relationships; reserve \(P_0\) and \(P_t\) for the degradation equation’s training/current distributions.
\(\Pr(E)\) Event Probability Probability of an event \(E\). Use for event statements such as \(\Pr(\text{batch}=0)\); use \(p(x)\) and \(p(y \mid x)\) for distributions.

Local matrix algebra may use \(A\), \(B\), and \(C\) for operands when the equation explicitly introduces matrices (for example, GEMM \(C=\alpha AB+\beta C\)). This is a local linear-algebra convention, not a global override of \(B\) as batch size.

Performance, Serving, and Memory Notation

Reusable performance and serving quantities follow the same collision-avoidance rule as the iron law. Rates use \(R\) or descriptive Greek symbols; request counts use \(Q_{\text{req}}\) rather than overloading \(N\); queue utilization uses a subscripted \(\rho\) so bare \(\rho\) remains available for other ratio models.

Symbol Definition Unit/Type Notes
\(I\) Arithmetic Intensity FLOP/byte Workload FLOPs per byte moved. The roofline model uses \(I\) as the independent variable.
\(I_{\text{ridge}}\) Roofline Ridge Point FLOP/byte \(I_{\text{ridge}} = R_{\text{peak}}/\text{BW}\). Prefer this explicit form over starred shorthand so the meaning remains clear in prose.
\(R_{\text{attain}}\) Attainable Compute Rate FLOP/s Roofline rate: \(R_{\text{attain}} = \min(R_{\text{peak}}, I \times \text{BW})\). Uses \(R\), not \(T\), because this quantity is a rate.
\(\text{MFU}\) Model FLOPs Utilization Dimensionless Useful model FLOP/s divided by available peak FLOP/s. Text acronym avoids overloading \(\eta\).
\(r_{\text{comp}}\) Compression Ratio Dimensionless Uncompressed size divided by compressed size. A compressed payload has size \(M/r_{\text{comp}}\); the subscript avoids bare \(C\) collisions.
\(Q_{\text{req}}\) Request Concurrency requests Average in-flight requests in a stable serving system. Avoids collision with \(N\) as device count in distributed settings.
\(\lambda_{\text{arr}}\) Arrival Rate requests/s Request arrival rate. The subscript avoids collision with \(\lambda\) as degradation sensitivity or failure rate.
\(T_{\text{lat}}\) Request Time in System seconds End-to-end queueing/serving latency for Little’s Law. Distinct from \(L_{\text{lat}}\), the fixed-latency term in the iron law.
\(T_{\text{svc}}(B)\) Batch Service Time seconds Time to serve a batch of size \(B\).
\(\mu_{\text{eff}}(B)\) Effective Service Rate requests/s Batched service rate, typically \(\mu_{\text{eff}}(B) = B/T_{\text{svc}}(B)\).
\(\rho_{\text{serv}}\) Serving Utilization Dimensionless Queue/server utilization. Use instead of bare \(\rho\), which is reserved for communication-computation ratio in distributed contexts.
\(M_{\text{total}}\) Total Memory Footprint bytes Sum of explicit memory components; avoids bare \(M\) ambiguity.
\(M_{\text{weights}}\) Weight Memory bytes Memory occupied by model parameters.
\(M_{\text{gradients}}\) Gradient Memory bytes Memory occupied by stored gradients.
\(M_{\text{optimizer}}\) Optimizer-State Memory bytes Momentum, variance, master weights, and related optimizer buffers.
\(M_{\text{activations}}\) Activation Memory bytes Retained activations for the backward pass or serving intermediates.
\(s_{\text{elem}}\) Element Storage Size bytes/element Bytes per stored tensor element.
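
The roofline and serving definitions in the table translate directly into code. The sketch below uses hypothetical device and traffic numbers; the function names are ours and are not taken from any library.

```python
def ridge_point(r_peak, bw):
    """I_ridge = R_peak / BW, in FLOP/byte."""
    return r_peak / bw


def attainable_rate(intensity, r_peak, bw):
    """R_attain = min(R_peak, I * BW), in FLOP/s."""
    return min(r_peak, intensity * bw)


def request_concurrency(lambda_arr, t_lat):
    """Little's Law: Q_req = lambda_arr * T_lat."""
    return lambda_arr * t_lat


def effective_service_rate(batch_size, t_svc):
    """mu_eff(B) = B / T_svc(B), in requests/s."""
    return batch_size / t_svc


# Hypothetical device: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
R_PEAK, BW = 100e12, 2e12
print(ridge_point(R_PEAK, BW))            # 50.0 FLOP/byte
print(attainable_rate(10.0, R_PEAK, BW))  # below the ridge: bandwidth-bound
print(attainable_rate(200.0, R_PEAK, BW)) # above the ridge: compute-bound

# Hypothetical serving load: 200 requests/s with 50 ms time in system.
print(request_concurrency(200.0, 0.05))
```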

Units and Precision

  • Physical units: This book uses SI (metric) units throughout, including meters, kilograms, seconds, watts, and °C, consistent with standard engineering and scientific practice. Where source data was originally reported in imperial units, we convert to SI and note the original values parenthetically. A space always separates the number from the unit (for example, 100 ms, 2 TB/s).
  • Data and memory: We use decimal SI prefixes only: KB = \(10^3\) bytes, MB = \(10^6\), GB = \(10^9\), TB = \(10^{12}\). We do not use binary-prefixed units in prose; all capacities, throughputs, and model sizes are reported in decimal units (for example, 80 GB, 2 TB/s, 102 MB).
  • Compute: We distinguish operation counts from rates. Total work uses FLOPs (for example, GFLOPs, TFLOPs), while throughput uses FLOP/s with decimal prefixes (for example, GFLOP/s, TFLOP/s).
    • 1 TFLOP = \(10^{12}\) FLOPs
    • 1 TFLOP/s = \(10^{12}\) FLOPs per second
    • Arithmetic intensity conventionally uses FLOP/byte as a unit ratio (floating-point operations per byte moved). This is a ratio unit, not a total-work symbol or throughput symbol.
  • Currency: Dollar amounts use the dollar sign ($); unless otherwise noted, dollar-denominated costs are U.S. dollars (USD).
  • Precision:
    • FP64: Double precision (8 bytes)
    • FP32: Single precision (4 bytes)
    • TF32: TensorFloat-32 (19-bit compute format, commonly stored as FP32)
    • FP16: Half precision (2 bytes, standard range)
    • BF16: Brain float (2 bytes, wide dynamic range)
    • FP8: Quarter precision (1 byte, E4M3 or E5M2 format)
    • FP4: 4-bit floating-point format
    • INT8: 8-bit integer (1 byte)
    • INT4: 4-bit integer; lower integer precisions follow the same uppercase INTn pattern (for example, INT3, INT2)
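
To make the unit conventions concrete, the sketch below maps the precisions above to a storage size \(s_{\text{elem}}\) and reports weight memory in decimal GB. The 7-billion-parameter model is hypothetical, and the half-byte entries assume that 4-bit values are packed two per byte.

```python
# Bytes per stored element. TF32 is listed at 4 bytes because it is
# commonly stored as FP32; 4-bit formats assume two values per byte.
BYTES_PER_ELEMENT = {
    "FP64": 8, "FP32": 4, "TF32": 4,
    "FP16": 2, "BF16": 2,
    "FP8": 1, "INT8": 1,
    "FP4": 0.5, "INT4": 0.5,
}


def weight_memory_bytes(param_count, precision):
    """M_weights = P * s_elem, in bytes."""
    return param_count * BYTES_PER_ELEMENT[precision]


def to_decimal_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10^9 bytes)."""
    return n_bytes / 1e9


P = 7_000_000_000  # hypothetical parameter count
for precision in ("FP32", "FP16", "INT8", "INT4"):
    gb = to_decimal_gb(weight_memory_bytes(P, precision))
    print(f"{precision}: {gb:.1f} GB")
```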

Quick Reference: Resolving Collisions

Common collision points in ML Systems literature include:

Symbol ML Meaning Systems Meaning Our Convention
\(B\) Batch Size Bandwidth Batch Size. Use \(\text{BW}\) for bandwidth.
\(P\) Parameters Peak FLOP/s Parameters. Use \(R_{\text{peak}}\) for peak rate.
\(D\) Dataset Size Data Volume Dataset Size. Use \(D_{\text{vol}}\) for bytes moved.
\(L\) Loss Latency Loss \((\mathcal{L})\). Use \(L_{\text{lat}}\) for latency.
\(\eta\) Learning Rate Efficiency Learning Rate. Use \(\eta_{\text{hw}}\) for efficiency.

The general principle: ML conventions take precedence for single letters; systems concepts get subscripts or multi-letter symbols. This reflects the primary audience (ML practitioners learning systems) and preserves compatibility with the vast ML literature.

Distributed Systems Notation

Distributed-systems work introduces a second layer of notation for coordination, communication, and reliability. Some symbols below are reserved for synchronization, repair, and gradient-sizing contexts; defining them centrally keeps the canonical form unambiguous when they appear. The central equation is the Fleet Law (the distributed step time law):

\[T_{\text{step}}(N) = \frac{T_{\text{compute}}}{N} + T_{\text{comm}}(N) + T_{\text{sync}}(N) - T_{\text{overlap}}\]

Symbol Definition Unit Notes
\(N\) Number of Devices Integer Accelerator count in a distributed job. (Not parameters—use \(P\).)
\(T_{\text{step}}(N)\) Distributed Step Time seconds Wall-clock time for one training step at scale \(N\).
\(T_{\text{compute}}\) Single-Device Compute Time seconds Forward + backward pass on one device.
\(T_{\text{comm}}(N)\) Communication Time seconds Time for collective operations (AllReduce, AllGather). Grows with \(N\).
\(T_{\text{overlap}}\) Overlapped Time seconds Communication hidden behind computation. Reduces effective overhead.
\(T_{\text{sync}}(N)\) Synchronization Time seconds Total nonoverlapped synchronization or coordination cost per step at scale \(N\).
\(\rho\) Communication-Computation Ratio Dimensionless \(\rho = T_{\text{comm}}(N)/(T_{\text{compute}}/N)\). Values near or above 1 indicate communication-bound scaling.
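
A minimal sketch of the Fleet Law and the communication-computation ratio follows. The communication, synchronization, and overlap times are passed in as already-evaluated numbers for a given \(N\); all values are hypothetical.

```python
def step_time(t_compute, n, t_comm, t_sync, t_overlap):
    """T_step(N) = T_compute/N + T_comm(N) + T_sync(N) - T_overlap."""
    return t_compute / n + t_comm + t_sync - t_overlap


def comm_compute_ratio(t_comm, t_compute, n):
    """rho = T_comm(N) / (T_compute / N)."""
    return t_comm / (t_compute / n)


# Hypothetical job: 8 s of single-device compute spread over 8 devices,
# 0.4 s of collectives, 0.1 s of synchronization, 0.3 s hidden by overlap.
t_step = step_time(t_compute=8.0, n=8, t_comm=0.4, t_sync=0.1, t_overlap=0.3)
rho = comm_compute_ratio(t_comm=0.4, t_compute=8.0, n=8)
print(f"T_step = {t_step:.2f} s, rho = {rho:.2f}")
```

Here \(\rho = 0.4\), well below 1, so under these assumed numbers the step is still dominated by computation.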

The communication model

Network communication time decomposes into a fixed latency and a bandwidth-dependent transfer:

\[T(n) = \alpha + \frac{n}{\beta}\]

Symbol Definition Unit Notes
\(T(n)\) Communication Time seconds Wall-clock time to transmit a message of size \(n\) across one link.
\(\alpha\) Network Latency seconds Fixed per-message overhead. (Not learning rate—context disambiguates.)
\(\beta\) Link Bandwidth bytes/s Effective throughput per link.
\(n\) Message Size bytes Size of the payload (for example, gradient tensor).
\(n^*\) Crossover Point bytes \(n^* = \alpha \cdot \beta\). Below \(n^*\): latency-bound. Above: bandwidth-bound.
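
The \(\alpha\)-\(\beta\) model and its crossover point can be sketched in a few lines. The link parameters (5 µs latency, 50 GB/s bandwidth) are hypothetical.

```python
def transfer_time(n_bytes, alpha, beta):
    """T(n) = alpha + n / beta, in seconds."""
    return alpha + n_bytes / beta


def crossover_bytes(alpha, beta):
    """n* = alpha * beta: the size at which both terms are equal."""
    return alpha * beta


ALPHA = 5e-6   # seconds, hypothetical per-message latency
BETA = 50e9    # bytes/s, hypothetical link bandwidth

n_star = crossover_bytes(ALPHA, BETA)
print(f"n* = {n_star / 1e3:.0f} KB")
for n in (1e3, n_star, 1e9):
    regime = "latency-bound" if n < n_star else "bandwidth-bound"
    print(f"n = {n:.0f} bytes: T = {transfer_time(n, ALPHA, BETA):.6f} s ({regime})")
```

At \(n = n^*\) the transfer takes exactly \(2\alpha\), which is a quick sanity check for any implementation of the model.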

The LogGP extension

LogGP refines the \(\alpha\)-\(\beta\) model by separating network latency, processor overhead, and long-message injection gaps. The standard names are short, but the book uses descriptive subscripts for the highest-collision term.

Symbol Definition Unit Notes
\(T_{\text{long}}(n)\) LogGP Long-Message Time seconds Transfer time for an \(n\)-byte long message. Reuses \(n\) from the \(\alpha\)-\(\beta\) model.
\(L_{\text{LogGP}}\) LogGP Network Latency seconds Standard LogGP latency parameter; subscript avoids bare \(L\) collisions.
\(o_{\text{send}}\) Sender Overhead seconds Processor/NIC overhead to initiate a message.
\(o_{\text{recv}}\) Receiver Overhead seconds Processor/NIC overhead to complete a message.
\(g\) Message Injection Gap seconds Minimum interval between consecutive message injections.
\(G\) Long-Message Gap per Byte seconds/byte Per-byte long-message gap; distinct from \(\beta\) link bandwidth.
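
The table defines the LogGP parameters but not how they combine. The sketch below assumes the form commonly given for a single long message, in which the first byte pays the overheads and latency and each further byte pays the per-byte gap \(G\). Treat this form as an assumption here; it is not a formula stated in this section. All parameter values are hypothetical.

```python
def loggp_long_message_time(n_bytes, l_loggp, o_send, o_recv, gap_per_byte):
    """Assumed form: T_long(n) = o_send + (n - 1) * G + L_LogGP + o_recv."""
    return o_send + (n_bytes - 1) * gap_per_byte + l_loggp + o_recv


# Hypothetical parameters: 2 us latency, 1 us overhead on each side,
# and G = 1e-10 s/byte (comparable to a 10 GB/s link).
L_LOGGP, O_SEND, O_RECV, G = 2e-6, 1e-6, 1e-6, 1e-10

t_small = loggp_long_message_time(1, L_LOGGP, O_SEND, O_RECV, G)
t_large = loggp_long_message_time(1_000_001, L_LOGGP, O_SEND, O_RECV, G)
print(f"1 byte: {t_small * 1e6:.1f} us, ~1 MB: {t_large * 1e6:.1f} us")
```

For large \(n\) the \((n-1)G\) term dominates, so \(1/G\) plays the role that \(\beta\) plays in the \(\alpha\)-\(\beta\) model.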

Scaling efficiency

\[\eta_{\text{scaling}} = \frac{T_1}{N \times T_N} \leq 1\]

Symbol Definition Unit Notes
\(\eta_{\text{scaling}}\) Scaling Efficiency Dimensionless Fraction of ideal linear speedup achieved. \(1.0\) is the theoretical limit.
\(T_1\) Single-Device Time seconds Baseline wall-clock time on one device.
\(T_N\) N-Device Time seconds Wall-clock time on \(N\) devices.
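
As a quick illustration, scaling efficiency follows from two wall-clock measurements. The timings below are hypothetical.

```python
def speedup(t_1, t_n):
    """Observed speedup of the N-device run over the single-device run."""
    return t_1 / t_n


def scaling_efficiency(t_1, n, t_n):
    """eta_scaling = T_1 / (N * T_N)."""
    return t_1 / (n * t_n)


# Hypothetical measurements: 8.0 s on one device, 1.2 s on eight devices.
print(f"speedup = {speedup(8.0, 1.2):.2f}x")
print(f"eta_scaling = {scaling_efficiency(8.0, 8, 1.2):.3f}")
```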

Fault tolerance and reliability

System reliability degrades with scale. Assuming independent components with exponential failure distributions, the system reliability across \(N\) components is:

\[R_{\text{system}}(t) = e^{-N\lambda t}\]

The aggregate exponent uses the system-level rate \(N\lambda\), where \(\lambda\) is the per-component failure rate under the independent, identical-component assumption. The exponent \(N\lambda t\) is dimensionless, so \(t\) must use the same time unit as \(\lambda\) (hours, when \(\lambda\) comes from FIT-derived rates below).

The optimal checkpoint interval balances I/O cost against rework cost:

\[\tau_{\text{opt}} = \sqrt{2 \cdot T_{\text{write}} \cdot \text{MTBF}_{\text{system}}}\]

The formula uses system-wide MTBF (the failure interval for the whole job), not component MTBF; substituting the per-component value would overestimate \(\tau_{\text{opt}}\) by a factor of \(\sqrt{N}\). For dimensional consistency, \(T_{\text{write}}\) and \(\text{MTBF}_{\text{system}}\) must be in the same time units. Industry-standard \(\text{MTBF}\) is reported in hours; to combine with seconds-scale write times, convert: \(\text{MTBF}_s = \text{MTBF}_h \times 3600\). The result \(\tau_{\text{opt}}\) then carries the same time unit chosen for the inputs.

Symbol Definition Unit Notes
\(R_{\text{system}}(t)\) Reliability Function Probability Probability of no failure across the whole system before time \(t\).
\(\lambda\) Per-Component Failure Rate failures/hour Per-component failure rate; the system-wide rate is \(N\lambda\) under independent, identical components. The industry-standard reporting unit is FIT (failures per \(10^9\) device-hours): \(\lambda_{\text{hour}} = \text{FIT} \times 10^{-9}\). (Not sensitivity—context disambiguates.)
\(\text{MTBF}\) Mean Time Between Failures hours Average time between consecutive failures. \(\text{MTBF}_{\text{system}} = \text{MTBF}_{\text{component}} / N\).
\(\text{MTTF}\) Mean Time To Failure hours Component lifetime metric derived from FIT rates. Distinct from system-level \(\text{MTBF}\) when repair/replacement cycles are modeled.
\(\text{MTTR}\) Mean Time To Repair hours Average recovery time after a failure.
\(\tau_{\text{opt}}\) Optimal Checkpoint Interval seconds Young-Daly formula. Minimizes total wasted time.
\(T_{\text{write}}\) Checkpoint Write Time seconds Time to persist model state to storage.
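
The unit conversions in this subsection are easy to get wrong, so the sketch below chains them explicitly: FIT to failures per hour, component rate to system MTBF, and hours to seconds before applying the checkpoint formula. The FIT rate, component count, and write time are hypothetical, and the code assumes independent, identical components with exponential failures, as stated above.

```python
import math


def fit_to_lambda_per_hour(fit):
    """lambda_hour = FIT * 1e-9 (FIT = failures per 10^9 device-hours)."""
    return fit * 1e-9


def system_reliability(n, lam_per_hour, t_hours):
    """R_system(t) = exp(-N * lambda * t), with t in hours."""
    return math.exp(-n * lam_per_hour * t_hours)


def system_mtbf_hours(n, lam_per_hour):
    """MTBF_system = MTBF_component / N = 1 / (N * lambda)."""
    return 1.0 / (n * lam_per_hour)


def optimal_checkpoint_interval_s(t_write_s, mtbf_system_hours):
    """tau_opt = sqrt(2 * T_write * MTBF_system), all in seconds."""
    mtbf_s = mtbf_system_hours * 3600.0  # hours -> seconds
    return math.sqrt(2.0 * t_write_s * mtbf_s)


# Hypothetical fleet: 1,000 components at 10,000 FIT each, 50 s checkpoints.
lam = fit_to_lambda_per_hour(10_000)
mtbf_sys = system_mtbf_hours(1_000, lam)
tau_opt = optimal_checkpoint_interval_s(50.0, mtbf_sys)
print(f"MTBF_system = {mtbf_sys:.0f} h, tau_opt = {tau_opt / 60:.0f} min, "
      f"R_system(24 h) = {system_reliability(1_000, lam, 24.0):.3f}")
```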

Parallelism dimensions

3D parallelism partitions the training workload across three orthogonal dimensions:

\[N_{\text{total}} = d \times p \times t\]

Symbol Definition Unit Notes
\(N_{\text{total}}\) Total Device Count Integer Product of the three parallelism dimensions. The total accelerators across the job.
\(d\) Data Parallelism Integer Number of model replicas. (Also hidden dimension—context disambiguates.)
\(p\) Physical Pipeline Stages Integer Number of physical pipeline stages in model-depth partitioning.
\(t\) Tensor Parallelism Integer Degree of intra-layer partitioning (model width). (Also time—context disambiguates.)
\(m\) Microbatch Count Integer Number of microbatches used to fill and drain the pipeline.
\(b_{\text{pipe}}\) Pipeline Bubble Fraction Dimensionless Fraction of pipeline GPU-time spent idle; avoids bare \(b\).
\(V\) Virtual Pipeline Stages per Device Integer Number of interleaved virtual chunks assigned to each physical pipeline stage.
\(M\) Message Payload Size bytes Total size of a gradient, model-state, activation, or compressed message payload to communicate.
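
The device-count product is exact, but the table does not give a formula for the bubble fraction. The sketch below assumes an idealized pipeline schedule with equal stage times, for which the idle fraction is \((p-1)/(V m + p - 1)\). This expression is an assumption for illustration, and the configuration values are hypothetical.

```python
def total_devices(d, p, t):
    """N_total = d * p * t."""
    return d * p * t


def pipeline_bubble_fraction(p, m, v=1):
    """Idle fraction under an idealized schedule (assumed form).

    p: physical pipeline stages, m: microbatches,
    v: virtual stages per device (v=1 means no interleaving).
    """
    return (p - 1) / (v * m + p - 1)


print(total_devices(d=8, p=4, t=8))                       # 256
print(f"{pipeline_bubble_fraction(p=4, m=12):.3f}")       # no interleaving
print(f"{pipeline_bubble_fraction(p=4, m=12, v=2):.3f}")  # two virtual stages
```

Under this assumed form, both more microbatches and more virtual stages shrink the bubble, which matches the roles the table assigns to \(m\) and \(V\).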

Fleet capacity factors

Fleet-scale capacity calculations compound local model efficiency, scaling efficiency, and operational goodput.

Symbol Definition Unit Notes
\(f_{\text{compute}}\) Useful Compute-Time Fraction Dimensionless \((T_{\text{compute}}/N)/T_{\text{step}}(N)\). Measures how much step time is useful local arithmetic after distribution.
\(f_{\text{overhead}}\) Overhead Fraction Dimensionless Fraction of wall time lost to pipeline bubbles, checkpoints, failure recovery, maintenance, or other nonproductive overheads.
\(\eta_{\text{goodput}}\) Goodput Ratio Dimensionless Fraction of allocated wall time that produces useful training work; often \(\eta_{\text{goodput}} = 1 - f_{\text{overhead}}\).
\(R_{\text{eff}}\) Effective Fleet Throughput FLOP/s \(R_{\text{eff}} = (N R_{\text{peak}}) \times \text{MFU} \times \eta_{\text{scaling}} \times \eta_{\text{goodput}}\).
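
The compounding of these factors is easiest to see numerically. In the sketch below every input is hypothetical, and the goodput ratio uses the common form \(1 - f_{\text{overhead}}\) noted in the table.

```python
def goodput_ratio(f_overhead):
    """eta_goodput = 1 - f_overhead (the common form noted above)."""
    return 1.0 - f_overhead


def effective_fleet_throughput(n, r_peak, mfu, eta_scaling, eta_goodput):
    """R_eff = (N * R_peak) * MFU * eta_scaling * eta_goodput, in FLOP/s."""
    return n * r_peak * mfu * eta_scaling * eta_goodput


# Hypothetical fleet: 1,024 devices at 1 PFLOP/s peak each.
eta_good = goodput_ratio(f_overhead=0.1)
r_eff = effective_fleet_throughput(n=1024, r_peak=1e15, mfu=0.4,
                                   eta_scaling=0.9, eta_goodput=eta_good)
fraction_of_peak = r_eff / (1024 * 1e15)
print(f"R_eff = {r_eff:.3e} FLOP/s ({fraction_of_peak:.1%} of aggregate peak)")
```

Three individually reasonable factors (0.4, 0.9, 0.9) multiply to roughly one third of aggregate peak under these assumptions.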

Compression economics

Compression models compare codec overhead against the communication time saved by reducing the payload.

Symbol Definition Unit Notes
\(T_{\text{transfer}}(x)\) Transfer Time for Payload \(x\) seconds Communication time for an \(x\)-byte payload under the relevant link or collective model.
\(T_{\text{compressed}}\) Compressed Communication Time seconds Total time after encode, transfer of \(M/r_{\text{comp}}\), and decode.
\(T_{\text{encode}}\) Codec Encode Time seconds Local compression overhead before transfer.
\(T_{\text{decode}}\) Codec Decode Time seconds Local decompression overhead after transfer.
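
A sketch of the break-even comparison follows. It assumes the \(\alpha\)-\(\beta\) model for \(T_{\text{transfer}}\), which is one possible choice of link model, and uses hypothetical codec times and link speeds.

```python
def alpha_beta_time(n_bytes, alpha, beta):
    """T_transfer(x) under the alpha-beta model (an assumed link model)."""
    return alpha + n_bytes / beta


def compressed_time(m_bytes, r_comp, t_encode, t_decode, transfer):
    """T_compressed = T_encode + T_transfer(M / r_comp) + T_decode."""
    return t_encode + transfer(m_bytes / r_comp) + t_decode


def compression_pays_off(m_bytes, r_comp, t_encode, t_decode, transfer):
    """True when the compressed path beats sending M uncompressed."""
    return compressed_time(m_bytes, r_comp, t_encode, t_decode,
                           transfer) < transfer(m_bytes)


def slow_link(n_bytes):
    return alpha_beta_time(n_bytes, alpha=5e-6, beta=10e9)   # 10 GB/s


def fast_link(n_bytes):
    return alpha_beta_time(n_bytes, alpha=5e-6, beta=100e9)  # 100 GB/s


# Hypothetical payload and codec: 1 GB, 4x ratio, 20 ms encode, 10 ms decode.
M, R_COMP, T_ENC, T_DEC = 1e9, 4.0, 0.02, 0.01
print(compression_pays_off(M, R_COMP, T_ENC, T_DEC, slow_link))
print(compression_pays_off(M, R_COMP, T_ENC, T_DEC, fast_link))
```

With these assumed numbers the same codec helps on the slower link and hurts on the faster one, because the fixed encode and decode cost outweighs the smaller transfer saving.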

Serving memory at scale

Symbol Definition Unit Notes
\(M_{\text{KV}}\) KV-Cache Memory bytes Key-value cache footprint for autoregressive serving.
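
The table names \(M_{\text{KV}}\) without expanding it. The sketch below uses the usual accounting for a decoder-only transformer: one key tensor and one value tensor per layer, each of shape \(B \times S \times H_{\text{KV}} \times d_{\text{head}}\). This expansion and the model shape are assumptions for illustration.

```python
def kv_cache_bytes(n_layers, h_kv, d_head, seq_len, batch_size, s_elem):
    """Assumed form: M_KV = 2 * N_L * H_KV * d_head * S * B * s_elem.

    The factor 2 counts keys and values separately.
    """
    return 2 * n_layers * h_kv * d_head * seq_len * batch_size * s_elem


# Hypothetical model: 32 layers, 8 KV heads of dimension 128,
# 4,096-token context, batch size 1, FP16 storage (2 bytes/element).
m_kv = kv_cache_bytes(n_layers=32, h_kv=8, d_head=128,
                      seq_len=4096, batch_size=1, s_elem=2)
print(f"M_KV = {m_kv} bytes = {m_kv / 1e9:.3f} GB")
```

Because the footprint is linear in \(H_{\text{KV}}\), \(S\), and \(B\), halving the number of key-value heads halves the cache under this accounting.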

Additional collision points

The distributed context introduces further symbol collisions beyond those noted in the base notation:

Symbol ML Meaning Distributed Meaning Our Convention
\(N\) - Number of Devices Device Count in distributed context (accelerators, not host nodes).
\(\alpha\) - Network Latency Network latency in \(\alpha\)-\(\beta\) model. Context disambiguates.
\(\lambda\) Sensitivity Failure Rate Context-dependent. Sensitivity in degradation; failure rate in reliability.
\(d\) Hidden Dimension Data Parallelism Degree Context-dependent. Parallelism in 3D notation; hidden dim in architectures.
\(t\) - Tensor Parallelism/Time Context-dependent. Tensor parallelism in 3D notation; time in reliability.
\(\tau\) Drift Threshold Local Interval/Delay Avoid bare reuse. Use \(\tau_{\text{ckpt}}\) for generic checkpoint intervals, \(\tau_{\text{opt}}\) for the Young-Daly optimum, and \(\tau_{\text{stale}}\) for gradient staleness.

Additional units

  • Network throughput: Reported in both bytes/s (GB/s, TB/s) and bits/s (Gbps) depending on convention. InfiniBand and Ethernet specifications use Gbps; application-level throughput uses GB/s. We note the convention on first use.
  • Power in sustainability contexts: Physical power may use explicitly subscripted \(P_{\text{...}}\) variables (for example, \(P_{\text{IT}}\), \(P_{\text{dynamic}}\), \(P_{\text{average}}\)) when the surrounding section is about power or energy. Bare \(P\) remains reserved for parameter count.
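
Because link specifications and application measurements use different units, a small conversion helper avoids factor-of-eight mistakes. The sketch assumes decimal prefixes throughout and ignores protocol and encoding overhead, so the result is an upper bound on payload throughput.

```python
BITS_PER_BYTE = 8


def gbps_to_gb_per_s(gbps):
    """Convert a line rate in Gbps (10^9 bits/s) to GB/s (10^9 bytes/s)."""
    return gbps / BITS_PER_BYTE


def gb_per_s_to_gbps(gb_per_s):
    """Convert GB/s back to Gbps."""
    return gb_per_s * BITS_PER_BYTE


# A 400 Gbps link carries at most 50 GB/s before any overhead.
print(gbps_to_gb_per_s(400))
```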