Glossary
Definitions for every term used in the MLSYSIM documentation.
This page defines every technical term used across the MLSYSIM documentation. When a term is first used on any page, it either links here or is defined inline. Terms marked with slide links point to the relevant lecture deck for deeper coverage.
All slide links point to the Machine Learning Systems lecture decks. Vol I covers single-machine foundations; Vol II covers distributed and at-scale systems.
A
- AllReduce
- A collective communication primitive in which every device contributes a local tensor and receives the globally reduced (typically summed) result. The dominant synchronization pattern in data-parallel training. Ring-AllReduce and tree-AllReduce are common algorithms; performance is modeled by the Alpha-Beta Model. Slides: Vol II Ch 5 – Distributed Training, Vol II Ch 6 – Collective Communication
- Alpha-Beta Model (\(\alpha\)-\(\beta\))
- An analytical model for communication cost: \(T_\text{comm} = \alpha + n\beta\), where \(\alpha\) is the per-message latency (seconds), \(n\) is the message size (bytes), and \(\beta\) is the inverse bandwidth (seconds/byte). Used throughout MLSYSIM to estimate collective communication overhead in distributed training. Slides: Vol II Ch 3 – Network Fabrics, Vol II Ch 6 – Collective Communication
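The cost model translates directly into a small helper. The default α and β below (2 µs latency, a 200 GB/s link) are illustrative placeholders, not values from the hardware registry:

```python
def comm_time(msg_bytes, alpha=2e-6, beta=1 / 200e9):
    """Alpha-beta estimate T = alpha + n * beta.

    alpha: per-message latency (s); beta: inverse bandwidth (s/byte).
    Defaults (2 us, 200 GB/s) are illustrative, not registry values.
    """
    return alpha + msg_bytes * beta

# A zero-byte message costs exactly the latency term alpha;
# a 1 GB gradient message is dominated by the bandwidth term n * beta.
```

Small messages are latency-bound (α dominates) while large ones are bandwidth-bound (nβ dominates), which is why collectives bucket many small tensors into fewer large messages.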
- Arithmetic Intensity (AI)
- The ratio of floating-point operations to bytes of memory accessed: \(I = \text{FLOPs} / \text{Bytes}\). High arithmetic intensity means the workload reuses data extensively (compute-bound); low arithmetic intensity means it streams data without reuse (memory-bound). Units: FLOP/byte. Determines which side of the Ridge Point a workload falls on in the Roofline Model. Slides: Vol I Ch 5 – Neural Network Computation, Vol I Ch 11 – Hardware Acceleration
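For a concrete feel, consider a dense matmul C = A·B with A of shape m×k and B of shape k×n: it performs 2mkn FLOPs and (counting each operand once) moves the bytes of A, B, and C. The fp16 element size and the shapes below are illustrative:

```python
def matmul_ai(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of C = A @ B."""
    flops = 2 * m * k * n                                   # multiply-accumulate pairs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved

# Square fp16 matmul: heavy data reuse, ~1365 FLOP/byte.
square = matmul_ai(4096, 4096, 4096)
# Batch-1 matrix-vector product (m = 1): essentially no reuse, under 1 FLOP/byte.
gemv = matmul_ai(1, 4096, 4096)
```

The gap between these two numbers is why batched prefill is compute-bound while batch-1 autoregressive decode is memory-bound.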
B
- Bandwidth (Memory Bandwidth)
- The rate at which data can be transferred between memory (DRAM/HBM) and compute units. Measured in GB/s or TB/s. The A100, for example, provides 2 TB/s of HBM bandwidth. Not to be confused with network bandwidth (inter-node communication rate) or bisection bandwidth (aggregate cross-section throughput of a network fabric). Slides: Vol II Ch 2 – Compute Infrastructure, Vol II Ch 3 – Network Fabrics
- Batch Size
- The number of inputs processed simultaneously in one forward pass. Larger batch sizes increase Arithmetic Intensity, shifting workloads from memory-bound toward compute-bound. In distributed training, the global batch size equals the per-device batch size multiplied by the number of data-parallel replicas. Slides: Vol I Ch 5 – Neural Network Computation, Vol I Ch 8 – Model Training
- Bottleneck
- The hardware resource that limits performance. For a given workload-hardware pair, either compute or memory bandwidth is the bottleneck, determined by comparing the workload’s Arithmetic Intensity to the hardware’s Ridge Point. Slides: Vol I Ch 11 – Hardware Acceleration, Vol II Ch 10 – Performance Engineering
C
- CapEx (Capital Expenditure)
- The upfront cost of purchasing hardware. In TCO analysis, CapEx is amortized over the hardware’s useful lifetime (typically 3–5 years). Slides: Vol II Ch 15 – Sustainable AI
- Carbon Intensity
- The mass of CO2-equivalent emissions per unit of electricity consumed, measured in gCO2e/kWh. Varies dramatically by region: ~20 gCO2e/kWh (Quebec hydro) to ~820 gCO2e/kWh (Poland coal). MLSYSIM uses per-region carbon intensity values from the sustainability registry to estimate training and inference emissions. Slides: Vol II Ch 15 – Sustainable AI
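A sketch of how these figures turn into emissions estimates, using the example grid intensities above; the 10 MWh training-run energy is a made-up input:

```python
def emissions_kg(energy_kwh, grid_gco2e_per_kwh):
    """Operational CO2-equivalent emissions in kilograms."""
    return energy_kwh * grid_gco2e_per_kwh / 1000.0

# The same hypothetical 10 MWh training run on two grids:
quebec = emissions_kg(10_000, 20)    # ~Quebec hydro
poland = emissions_kg(10_000, 820)   # ~Poland coal, a 41x spread
```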
- Compute-Bound
- A workload whose performance is limited by the hardware’s peak FLOP/s rate rather than memory bandwidth. Occurs when Arithmetic Intensity exceeds the Ridge Point. Remedies include using tensor cores, upgrading to a faster accelerator, or reducing precision. Contrast with Memory-Bound. Slides: Vol I Ch 11 – Hardware Acceleration, Vol II Ch 10 – Performance Engineering
- Continuous Batching
- A serving optimization that dynamically inserts and retires requests from a running batch, rather than waiting for all sequences in a static batch to finish before starting new ones. Dramatically improves GPU utilization for LLM inference, where sequence lengths vary widely. Also called iteration-level batching. Slides: Vol I Ch 13 – Model Serving, Vol II Ch 9 – Inference at Scale
- CUDA (Compute Unified Device Architecture)
- NVIDIA’s programming platform for writing GPU-accelerated programs. A “CUDA kernel” is a function that runs in parallel across thousands of GPU threads. Dispatch Tax is the per-kernel launch overhead inherent to this model. Slides: Vol I Ch 11 – Hardware Acceleration
D
- Data Parallelism (DP)
- A distributed training strategy where the full model is replicated across \(N\) devices, each processing a different shard of the batch. Requires an AllReduce synchronization step after each backward pass to average gradients. Scales well for models that fit in a single device’s memory. See also Tensor Parallelism, Pipeline Parallelism, and 3D Parallelism. Slides: Vol II Ch 5 – Distributed Training
- Dispatch Tax
- The constant per-operation overhead of launching a GPU kernel (typically 0.01–0.1 ms for CUDA kernel launch). Becomes significant at small batch sizes where kernel launch time dominates actual compute time. Captured as the additive term in the Iron Law. Slides: Vol I Ch 12 – Benchmarking
F
- FLOP/s (Floating-Point Operations per Second)
- The rate at which a device can perform floating-point arithmetic. The A100 achieves 312 TFLOP/s at FP16 via its Tensor Cores. Also written as TFLOP/s (tera-) or PFLOP/s (peta-). Not to be confused with FLOPs (a count, not a rate). Slides: Vol I Ch 5 – Neural Network Computation, Vol I Ch 12 – Benchmarking
- FLOPs (Floating-Point Operations)
- A count of arithmetic operations (multiplies, adds, etc.) required to execute a single inference or training step. A ResNet-50 inference requires ~8 GFLOPs; a GPT-3 forward pass requires ~350 TFLOPs. Not the same as FLOP/s (the rate). Slides: Vol I Ch 5 – Neural Network Computation
- Forward Pass / Backward Pass
- In neural network training, the forward pass runs input data through the model to produce a prediction. The backward pass (backpropagation) computes gradients—the direction and magnitude of change needed for each parameter to reduce error. In distributed systems, gradients must be synchronized across all devices after each backward pass via AllReduce. Slides: Vol I Ch 5 – Neural Network Computation, Vol I Ch 8 – Model Training
G
- GQA (Grouped Query Attention)
- A transformer attention variant where multiple query heads share a single key-value head, reducing KV-Cache memory by a factor equal to the group size without significantly affecting model quality. Used in Llama 3 and other modern LLMs. See also KV-Cache. Slides: Vol I Ch 6 – Network Architectures, Vol II Ch 9 – Inference at Scale
H
- HBM (High-Bandwidth Memory)
- Stacked DRAM technology used in modern AI accelerators. Provides far higher bandwidth than GDDR (e.g., 2 TB/s on A100, 3.35 TB/s on H100) at the cost of limited capacity (40–80 GB per device). The bandwidth ceiling in the Roofline Model is set by HBM. Slides: Vol II Ch 2 – Compute Infrastructure
I
- InfiniBand
- A high-throughput, low-latency network fabric commonly used in GPU clusters for distributed training. Supports RDMA (Remote Direct Memory Access) for zero-copy data transfer that bypasses the CPU. NDR InfiniBand provides 400 Gb/s per port. See also NVLink (intra-node) vs. InfiniBand (inter-node). Slides: Vol II Ch 3 – Network Fabrics
- Iron Law of ML Systems
- The fundamental performance equation: \[T = \max\!\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\; \frac{\text{Bytes}}{\text{BW}}\right) + \text{Dispatch\_Tax}\] The \(\max\) captures the Roofline Model insight that performance is limited by whichever resource—compute or memory bandwidth—is the bottleneck. Named by analogy with the Iron Law of processor performance in computer architecture. Slides: Vol I Ch 8 – Model Training, Vol I Ch 11 – Hardware Acceleration
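A direct transcription of the equation as a sketch; parameter names are mine, and η (utilization) must be supplied since it varies per device and workload:

```python
def iron_law_latency(flops, bytes_moved, peak_flops, eta, mem_bw, dispatch_tax=0.0):
    """T = max(FLOPs / (Peak * eta), Bytes / BW) + dispatch_tax (seconds)."""
    compute_time = flops / (peak_flops * eta)
    memory_time = bytes_moved / mem_bw
    return max(compute_time, memory_time) + dispatch_tax
```

Whichever term wins the max identifies the bottleneck, and therefore which remedy list applies (Compute-Bound vs. Memory-Bound).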
- ITL (Inter-Token Latency)
- The time to generate each successive token after the first during LLM autoregressive decoding. Almost always Memory-Bound—each decode step loads the full model weights plus the KV-Cache. Measured in ms/token. See also TTFT. Slides: Vol I Ch 13 – Model Serving, Vol II Ch 9 – Inference at Scale
K
- Knowledge Distillation
- A model compression technique where a smaller “student” model is trained to match the output distribution of a larger “teacher” model. Reduces model size and inference cost while retaining much of the teacher’s accuracy. See also Quantization and Pruning. Slides: Vol I Ch 10 – Model Compression
- KV-Cache
- The cached Key and Value matrices from the transformer attention mechanism, retained across decoding steps to avoid recomputation. Memory footprint grows linearly with sequence length and batch size: \(\text{Bytes} = 2 \times L \times B \times d \times \text{layers} \times \text{bytes\_per\_param}\). GQA reduces KV-Cache size; PagedAttention manages it more efficiently. Slides: Vol I Ch 13 – Model Serving, Vol II Ch 9 – Inference at Scale
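The formula translates directly to code; the model shape below (d = 4096, 32 layers, fp16) is a Llama-2-7B-like stand-in chosen for illustration:

```python
def kv_cache_bytes(seq_len, batch, d_model, n_layers, bytes_per_param=2):
    """2 (K and V) x L x B x d x layers x bytes_per_param."""
    return 2 * seq_len * batch * d_model * n_layers * bytes_per_param

# One fp16 sequence of 4096 tokens: 2 GiB of cache on top of the weights.
cache = kv_cache_bytes(seq_len=4096, batch=1, d_model=4096, n_layers=32)
```

Because the footprint scales with both L and B, long-context, high-batch serving is where GQA and PagedAttention pay off.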
L
- Latency
- The wall-clock time to complete one inference or training step. In MLSYSIM, latency is the primary output of the Iron Law equation. Measured in ms or \(\mu\)s. Maximizing Throughput often conflicts with minimizing latency. Slides: Vol I Ch 12 – Benchmarking, Vol I Ch 13 – Model Serving
- LLM (Large Language Model)
- A transformer-based model trained on large text corpora, typically with billions of parameters. Examples: GPT-4, Llama 3, Gemini. Key serving metrics: TTFT and ITL. Key memory bottleneck: KV-Cache. Slides: Vol I Ch 6 – Network Architectures
M
- Memory-Bound
- A workload whose performance is limited by the hardware’s memory Bandwidth, not its peak FLOP/s. Occurs when Arithmetic Intensity falls below the Ridge Point. Remedies include lower Precision, Operator Fusion, or faster memory (e.g., HBM3). Contrast with Compute-Bound. Slides: Vol I Ch 11 – Hardware Acceleration, Vol II Ch 10 – Performance Engineering
- MFU (Model FLOP Utilization)
- The fraction of theoretical peak FLOP/s actually achieved: \(\text{MFU} = \text{Achieved FLOP/s} / \text{Peak FLOP/s}\). Well-optimized training achieves 30–50% MFU; poorly optimized code may fall below 10%. MFU is the single most important efficiency metric for large-scale training runs. Slides: Vol I Ch 12 – Benchmarking, Vol II Ch 10 – Performance Engineering
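MFU is straightforward to compute if you know the model FLOPs per step; the step size and timing below are invented for illustration:

```python
def mfu(model_flops_per_step, step_time_s, n_devices, peak_flops_per_device):
    """Achieved FLOP/s divided by aggregate theoretical peak."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (n_devices * peak_flops_per_device)

# A step doing 1e15 FLOPs in 1 s on 8 A100s (312 TFLOP/s FP16 each): ~40% MFU,
# within the range quoted above for well-optimized training.
```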
- Microbatch
- A subdivision of the training batch used in Pipeline Parallelism. Increasing the number of microbatches \(M\) reduces the Pipeline Bubble fraction: \(\text{Bubble} = (P{-}1) / (P{-}1{+}M)\), where \(P\) is the pipeline depth. Slides: Vol II Ch 5 – Distributed Training
- MTBF (Mean Time Between Failures)
- The average time a component operates before failing. For a fleet of \(N\) identical nodes, \(\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N\). A 1,024-node cluster with 100,000-hour node MTBF has a fleet MTBF of ~98 hours. Input to the Young-Daly Formula. Slides: Vol II Ch 7 – Fault Tolerance
N
- NVLink
- NVIDIA’s high-bandwidth interconnect for GPU-to-GPU communication within a server. Provides 900 GB/s bidirectional bandwidth per GPU in DGX H100 systems. Required for Tensor Parallelism, where low-latency intra-node communication is critical. Contrast with InfiniBand for inter-node communication. Slides: Vol II Ch 2 – Compute Infrastructure, Vol II Ch 3 – Network Fabrics
O
- OpEx (Operational Expenditure)
- The ongoing costs of running hardware: electricity, networking, cooling, labor. In cloud pricing, OpEx over a 3-year period typically exceeds CapEx by 2–5x. Slides: Vol II Ch 15 – Sustainable AI
- Operator Fusion
- Combining multiple small GPU kernels into a single larger one to reduce memory transfers between operations. For example, fusing a matrix multiply followed by an activation function avoids writing and re-reading the intermediate result from HBM. A key optimization for reducing Memory-Bound overhead. Slides: Vol I Ch 10 – Model Compression, Vol II Ch 10 – Performance Engineering
P
- Pipeline Bubble
- The fraction of time a pipeline-parallel system spends idle waiting for the first Microbatch to propagate through all stages: \(\text{Bubble} = (P{-}1) / (P{-}1{+}M)\), where \(P\) is pipeline depth and \(M\) is microbatch count. Slides: Vol II Ch 5 – Distributed Training
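The formula makes the remedy obvious: increase M. A minimal sketch, with illustrative shapes:

```python
def bubble_fraction(pipeline_depth, n_microbatches):
    """Idle fraction (P-1)/(P-1+M) of an idealized pipeline schedule."""
    p, m = pipeline_depth, n_microbatches
    return (p - 1) / (p - 1 + m)

# Depth-8 pipeline: 8 microbatches idle ~47% of the time, 64 microbatches ~10%.
```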
- Pipeline Parallelism (PP)
- A distributed training strategy that splits the model’s layers across devices, each device processing a different “stage.” Introduces a Pipeline Bubble of idle time. Complementary to Data Parallelism and Tensor Parallelism in 3D Parallelism. Slides: Vol II Ch 5 – Distributed Training
- Precision
- The numerical format used to represent weights and activations. fp32 (32-bit float) is most accurate; fp16/bf16 (16-bit) halves memory and doubles throughput on Tensor Cores; int8 and int4 further reduce memory at the cost of accuracy. Lower precision increases Arithmetic Intensity by reducing bytes per operation. Slides: Vol I Ch 10 – Model Compression
- Progressive Lowering
- MLSYSIM’s architectural principle: workload specifications (demand) are progressively mapped onto hardware specifications (supply) through a chain of analytical transformations. The reverse of how hardware is typically specified—starting from the algorithm, not the chip.
- Pruning
- A model compression technique that removes redundant weights or entire structures (channels, attention heads) from a trained model. Unstructured pruning zeros out individual weights; structured pruning removes whole rows/columns for hardware-friendly speedups. See also Quantization and Knowledge Distillation. Slides: Vol I Ch 10 – Model Compression
- PUE (Power Usage Effectiveness)
- \(\text{PUE} = \text{Total Facility Power} / \text{IT Equipment Power}\). A PUE of 1.0 is theoretical perfection; hyperscale datacenters achieve 1.1–1.4. Higher PUE means more energy wasted on cooling and facility overhead. Used in MLSYSIM’s sustainability solver alongside Carbon Intensity and WUE. Slides: Vol II Ch 2 – Compute Infrastructure, Vol II Ch 15 – Sustainable AI
Q
- Quantization
- Reducing the numerical Precision of model weights and/or activations (e.g., FP32 to INT8 or INT4) to shrink memory footprint and increase throughput. Post-Training Quantization (PTQ) converts a pre-trained model without retraining; Quantization-Aware Training (QAT) simulates low-precision during training for higher accuracy. Slides: Vol I Ch 10 – Model Compression
R
- Ridge Point
- The Arithmetic Intensity at which a workload transitions from Memory-Bound to Compute-Bound on a given hardware platform: \(I^* = \text{Peak FLOP/s} / \text{Memory BW}\). For the A100 at FP16: \(I^* = 312 \text{ TFLOP/s} / 2 \text{ TB/s} = 156 \text{ FLOP/byte}\). Slides: Vol I Ch 11 – Hardware Acceleration, Vol II Ch 10 – Performance Engineering
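The A100 example above, as code; the classification helper is my own naming, not an MLSYSIM API:

```python
def ridge_point(peak_flops, mem_bw_bytes):
    """Arithmetic intensity (FLOP/byte) at which the roofline flattens."""
    return peak_flops / mem_bw_bytes

def bound_by(ai, peak_flops, mem_bw_bytes):
    """Classify a workload's bottleneck against a device's ridge point."""
    return "compute" if ai >= ridge_point(peak_flops, mem_bw_bytes) else "memory"

# A100 FP16: 312 TFLOP/s over 2 TB/s -> 156 FLOP/byte.
```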
- Roofline Model
- A visual and analytical tool that plots hardware performance ceilings (the “roofline”) and shows where workloads sit relative to them. The sloped region is Memory-Bound; the flat region is Compute-Bound; the inflection point is the Ridge Point. Introduced by Williams, Waterman, and Patterson (2009). MLSYSIM implements a generalized roofline via the Iron Law. Slides: Vol I Ch 11 – Hardware Acceleration, Vol II Ch 10 – Performance Engineering
S
- SLA (Service Level Agreement)
- A target performance guarantee, typically specifying maximum acceptable latency and minimum throughput. For LLM serving, common SLAs target TTFT < 200 ms and ITL < 50 ms/token. Slides: Vol I Ch 12 – Benchmarking, Vol I Ch 13 – Model Serving
- Speculative Decoding
- An inference optimization where a small, fast “draft” model generates candidate tokens that are then verified in parallel by the full model. Reduces ITL by converting sequential autoregressive steps into a single parallel verification pass, at the cost of occasional rejected tokens. Slides: Vol II Ch 9 – Inference at Scale, Vol II Ch 10 – Performance Engineering
- SSoT (Single Source of Truth)
- The principle that each specification (chip peak FLOP/s, grid carbon intensity, etc.) has exactly one authoritative location—the MLSys Zoo. All computations derive from the Zoo, eliminating inconsistencies from stale copied values.
- Systolic Array
- A grid of processing elements that rhythmically pass data to their neighbors, performing a multiply-accumulate at each step. The dominant dataflow architecture in ML accelerators: Google TPUs use systolic arrays for matrix multiplication, and NVIDIA Tensor Cores implement a similar systolic-like pattern. Slides: Vol I Ch 11 – Hardware Acceleration
T
- TCO (Total Cost of Ownership)
- The full cost of a system over its lifetime: \(\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}\). Includes hardware purchase, electricity, cooling, networking, and labor. MLSYSIM’s TCO solver computes this from hardware registry specs and regional energy costs. Slides: Vol II Ch 2 – Compute Infrastructure, Vol II Ch 15 – Sustainable AI
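A deliberately simplified yearly view that folds only amortized CapEx and electricity into the total; every number in the example (node price, power draw, PUE, tariff) is hypothetical, and a real TCO model adds networking, cooling infrastructure, and labor:

```python
def tco_per_year(capex, lifetime_years, it_power_kw, pue, price_per_kwh, other_opex=0.0):
    """Amortized CapEx + energy OpEx (facility draw = IT power x PUE), per year."""
    energy_opex = it_power_kw * pue * 24 * 365 * price_per_kwh
    return capex / lifetime_years + energy_opex + other_opex

# Hypothetical $200k 8-GPU node, 4-year life, 10 kW draw, PUE 1.2, $0.10/kWh.
yearly = tco_per_year(200_000, 4, 10, 1.2, 0.10)
```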
- TDP (Thermal Design Power)
- The maximum sustained power a chip is designed to dissipate under load, in Watts. Relevant for datacenter cooling capacity planning. An H100 SXM5 has a TDP of 700 W. Used in MLSYSIM to compute energy consumption and TCO. Slides: Vol II Ch 2 – Compute Infrastructure
- Tensor Core
- A specialized hardware unit in NVIDIA GPUs designed for matrix-multiply-accumulate operations. Achieves much higher throughput than standard CUDA cores for ML workloads. The A100’s 312 TFLOP/s peak (FP16) comes from its tensor cores, not its CUDA cores. Functionally similar to a Systolic Array. Slides: Vol I Ch 5 – Neural Network Computation, Vol I Ch 11 – Hardware Acceleration
- Tensor Parallelism (TP)
- A distributed training strategy that splits individual matrix multiplications across devices within a node. Requires high-bandwidth intra-node connectivity (NVLink). Combined with Data Parallelism and Pipeline Parallelism in 3D Parallelism. Slides: Vol II Ch 5 – Distributed Training
- 3D Parallelism
- The combination of Data Parallelism, Tensor Parallelism, and Pipeline Parallelism to scale training across hundreds or thousands of GPUs. TP operates within a node (over NVLink), PP across a small group of nodes, and DP across the remaining replicas. The standard recipe for training frontier LLMs. Slides: Vol II Ch 5 – Distributed Training
- Throughput
- The number of samples (or tokens) processed per unit time: \(\text{Throughput} = \text{Batch Size} / \text{Latency}\). Maximizing throughput often conflicts with minimizing Latency—larger batches increase throughput but also increase per-request latency. Slides: Vol I Ch 12 – Benchmarking
- TTFT (Time to First Token)
- The latency from receiving a user query to generating the first output token in an LLM serving system. Determined primarily by the prefill phase, which is Compute-Bound. Target: <200 ms for interactive applications. See also ITL. Slides: Vol I Ch 13 – Model Serving, Vol II Ch 9 – Inference at Scale
U
- Utilization (\(\eta\))
- The fraction of theoretical peak FLOP/s actually achieved in practice. Typical values: 30–50% for well-optimized training, 10–30% for inference. MLSYSIM uses \(\eta\) as a parameter in the Iron Law; see the hardware registry for per-device defaults. Closely related to MFU. Slides: Vol I Ch 12 – Benchmarking, Vol II Ch 10 – Performance Engineering
W
- WUE (Water Usage Effectiveness)
- Liters of water consumed per kilowatt-hour of energy. Relevant for datacenters using evaporative cooling. MLSYSIM estimates water usage as: \(\text{Water (L)} = \text{Energy (kWh)} \times \text{WUE}\). Slides: Vol II Ch 15 – Sustainable AI
Y
- Young-Daly Formula
- The optimal checkpoint interval for fault-tolerant distributed training: \(\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}\), where \(\delta\) is the time to save one checkpoint and MTBF is the mean time between failures of the fleet. Named after Young (1974) and Daly (2006). Slides: Vol II Ch 7 – Fault Tolerance
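Combining this with the fleet MTBF from the MTBF entry gives a small planning sketch; the 5-minute checkpoint time is an assumed figure:

```python
import math

def fleet_mtbf_hours(node_mtbf_hours, n_nodes):
    """Fleet-level MTBF shrinks linearly with fleet size."""
    return node_mtbf_hours / n_nodes

def young_daly_interval_hours(checkpoint_hours, fleet_mtbf):
    """tau_opt = sqrt(2 * delta * MTBF_fleet); keep both inputs in hours."""
    return math.sqrt(2 * checkpoint_hours * fleet_mtbf)

# 1,024 nodes at 100,000 h each -> ~98 h fleet MTBF; with an assumed
# 5-minute checkpoint cost, checkpoint roughly every 4 hours.
tau = young_daly_interval_hours(5 / 60, fleet_mtbf_hours(100_000, 1024))
```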
This glossary is updated with each MLSYSIM release. If a term is missing, please open an issue.