Glossary
Definitions for every term used in the MLSys·im documentation.
This page defines every technical term used across the MLSys·im documentation. When a term is first used on any page, it either links here or is defined inline.
A
- Arithmetic Intensity (AI)
- The ratio of floating-point operations to bytes of memory accessed: \(I = \text{FLOPs} / \text{Bytes}\). High arithmetic intensity means the workload reuses data (compute-efficient); low arithmetic intensity means it streams data without reuse (memory-constrained). Units: FLOP/byte.
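The ratio can be computed directly. A minimal Python sketch (the GEMM shape and fp16 byte counts below are illustrative, not part of any MLSys·im API):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity I = FLOPs / Bytes, in FLOP/byte."""
    return flops / bytes_moved

# GEMM example: C = A @ B with square N x N fp16 matrices.
# FLOPs = 2*N^3 (one multiply-add per output element, N times),
# Bytes = 3 * N^2 * 2 (read A, read B, write C at 2 bytes/element),
# assuming no cache reuse beyond the operands themselves.
N = 4096
flops = 2 * N**3
bytes_moved = 3 * N**2 * 2
print(arithmetic_intensity(flops, bytes_moved))  # ~1365 FLOP/byte
```

Large matrix multiplies have high intensity, which is why they map so well to GPUs; elementwise ops (intensity near 1 FLOP/byte) do not.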
B
- Bandwidth (Memory Bandwidth)
- The rate at which data can be transferred between memory (DRAM/HBM) and compute units. Measured in GB/s or TB/s. The A100, for example, has 2 TB/s of HBM bandwidth. Not to be confused with network bandwidth (how fast nodes communicate with each other).
- Batch Size
- The number of inputs processed simultaneously in one forward pass. Larger batch sizes increase arithmetic intensity, which tends to shift workloads from memory-bound to compute-bound.
- Bisection Bandwidth
- The total bandwidth available across the narrowest cut that divides a network into two equal halves. Determines worst-case all-to-all communication throughput. Fat-tree topologies provide full bisection bandwidth; oversubscribed networks reduce it.
- Bottleneck
- The hardware resource that limits performance. For a given workload-hardware pair, either compute or memory bandwidth is the bottleneck, determined by comparing the workload’s arithmetic intensity to the hardware’s roofline ridge point.
- Binding Constraint
- The single hardware parameter whose improvement would yield the largest performance gain for a given workload-hardware pair. Identified by the `SensitivitySolver` through numerical partial derivatives. Investing in non-binding parameters yields negligible returns.
C
- CapEx (Capital Expenditure)
- The upfront cost of purchasing hardware. In TCO analysis, CapEx is amortized over the hardware’s useful lifetime (typically 3–5 years).
- Carbon Intensity
- The mass of CO₂-equivalent emissions per unit of electricity consumed, measured in gCO₂e/kWh. Varies dramatically by region: ~20 g/kWh (Quebec hydro) to ~820 g/kWh (Poland coal).
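Operational emissions are the product of energy consumed and grid carbon intensity. A minimal sketch using the two grid figures from this entry:

```python
def co2e_grams(energy_kwh: float, carbon_intensity_g_per_kwh: float) -> float:
    """Operational emissions (gCO2e) = energy (kWh) x grid carbon intensity."""
    return energy_kwh * carbon_intensity_g_per_kwh

# The same 1 MWh training job on two grids:
print(co2e_grams(1000, 20))   # Quebec hydro: 20_000 gCO2e (20 kg)
print(co2e_grams(1000, 820))  # Poland coal: 820_000 gCO2e (820 kg)
```

The 40× spread means siting decisions can dominate every other sustainability lever.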
- Compute-Bound
- A workload whose performance is limited by the hardware’s peak FLOP/s rate. Increasing batch size, using tensor cores, or upgrading to a faster GPU helps. Contrast with Memory-Bound.
- Continuous Batching
- A serving optimization that dynamically inserts and removes requests from a running batch as individual sequences complete, rather than waiting for the entire batch to finish. Paired with PagedAttention to eliminate KV-cache fragmentation.
- CUDA (Compute Unified Device Architecture)
- NVIDIA’s programming platform for writing GPU-accelerated programs. A “CUDA kernel” is a function that runs in parallel across thousands of GPU threads.
D
- Demand-Supply Separation
- MLSys·im’s architectural principle: computational demand (FLOPs, bytes, arithmetic intensity) is specified independently of hardware supply (FLOP/s, GB/s, memory capacity). This decoupling means the same workload evaluates on any hardware, and the same hardware evaluates any workload — analogous to how a compiler IR separates source semantics from target-specific code generation.
- Domain
- One of six categories in the wall taxonomy, grouping walls by the scope of constraint: Node (single accelerator, Walls 1–7), Data (movement and pipelines, Walls 8–10), Algorithm (scaling and compression, Walls 11–13), Fleet (multi-node coordination, Walls 14–16), Ops (economics, sustainability, and safety, Walls 17–20), and Analysis (cross-cutting diagnostics, Walls 21–22).
- Data Parallelism (DP)
- A distributed training strategy where the full model is replicated across \(N\) devices, each processing a different shard of the batch. Requires an all-reduce synchronization step after each backward pass. Scales well for smaller models.
- Dispatch Tax
- The constant per-operation overhead of launching a GPU kernel (e.g., CUDA kernel launch overhead, typically 0.01–0.1 ms). Becomes significant at small batch sizes where kernel launch time dominates actual compute time.
- DP-SGD (Differentially Private Stochastic Gradient Descent)
- A training algorithm that clips per-sample gradients and adds calibrated Gaussian noise to guarantee differential privacy. The noise scale σ is proportional to 1/ε, where ε is the privacy budget. Imposes computational overhead from per-sample gradient computation.
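The per-sample clip-then-noise step can be sketched in a few lines of NumPy (an illustrative sketch of the aggregation only, not a complete DP-SGD implementation or a MLSys·im API; the shapes and hyperparameters are assumptions):

```python
import numpy as np

def dpsgd_gradient(per_sample_grads: np.ndarray,
                   clip_norm: float, sigma: float,
                   rng: np.random.Generator) -> np.ndarray:
    """One DP-SGD aggregation step: clip each sample's gradient to
    L2 norm <= clip_norm, sum, add Gaussian noise, then average."""
    n = per_sample_grads.shape[0]
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale             # per-sample clipping
    noise = rng.normal(0.0, sigma * clip_norm,
                       size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / n       # noisy mean

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))  # 32 samples, 10 parameters
g = dpsgd_gradient(grads, clip_norm=1.0, sigma=1.1, rng=rng)
```

The overhead comes from materializing `per_sample_grads`: standard backprop computes only the batch-summed gradient, so DP-SGD needs per-example gradients (or vectorized tricks) first.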
E
- Edge Inference
- Running ML models on-device (smartphone, microcontroller, embedded system) rather than in the cloud. Key advantages: lower latency, data privacy, no network dependency. Key constraints: limited memory (KiB–GiB vs. tens of GB), limited compute (MFLOP/s–TFLOP/s vs. hundreds of TFLOP/s), and strict power budgets (milliwatts–watts vs. hundreds of watts).
- Efficiency (η)
- See Utilization. In MLSys·im, the `efficiency` parameter (default 0.5) represents the fraction of theoretical peak FLOP/s actually achieved. Calibration guidance: η ≈ 0.35–0.55 for well-optimized training, η ≈ 0.25–0.45 for inference. See Model Accuracy for detailed ranges.
- Energy per Inference
- The energy consumed to process a single inference, typically measured in millijoules (mJ) or joules (J). Calculated as latency × power draw (TDP). A critical metric for battery-powered edge devices where the total inference budget is bounded by battery capacity.
F
- Forward Pass / Backward Pass
- In neural network training, the forward pass runs input data through the model to produce a prediction. The backward pass (backpropagation) then computes gradients — the direction and magnitude of change needed for each parameter to reduce error. After each backward pass, distributed systems must synchronize these gradients across all GPUs.
- FLOPs (Floating-Point Operations)
- A count of arithmetic operations (multiplies, adds, etc.) required to process a single inference or training step. Not the same as FLOP/s (the rate). A ResNet-50 inference requires ~8 GFLOPs; a GPT-3 forward pass requires ~350 GFLOPs per token.
- FLOP/s (Floating-Point Operations per Second)
- The rate at which a device can perform floating-point arithmetic. The A100 achieves 312 TFLOP/s at fp16. Also written as TFLOP/s (tera-) or PFLOP/s (peta-).
H
- HBM (High-Bandwidth Memory)
- The stacked DRAM technology used in modern AI accelerators. Provides far higher bandwidth than GDDR at significantly higher cost; typical capacities are 40–80 GB per device, versus up to 24 GB of GDDR in consumer cards. Used in the A100, H100, MI300X, etc.
I
- Iron Law of ML Systems
- The fundamental performance equation: \(T = \max\left(\frac{\text{FLOPs}}{\text{Peak} \times \eta},\ \frac{\text{Bytes}}{\text{BW}}\right) + \text{Dispatch\_Tax}\). Named by analogy with the Iron Law of processor performance in computer architecture.
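The equation translates directly into code. A minimal sketch (the workload numbers are illustrative; the A100-like figures come from the hardware entries elsewhere in this glossary):

```python
def iron_law_latency_s(flops: float, bytes_moved: float,
                       peak_flops: float, eta: float, bw_bytes_s: float,
                       dispatch_tax_s: float = 0.0) -> float:
    """T = max(FLOPs / (Peak * eta), Bytes / BW) + Dispatch_Tax."""
    compute_time = flops / (peak_flops * eta)
    memory_time = bytes_moved / bw_bytes_s
    return max(compute_time, memory_time) + dispatch_tax_s

# Hypothetical 8 GFLOP / 100 MB workload on an A100-like device
# (312 TFLOP/s fp16, 2 TB/s HBM, eta = 0.5, 50 us dispatch tax):
t = iron_law_latency_s(flops=8e9, bytes_moved=1e8,
                       peak_flops=312e12, eta=0.5, bw_bytes_s=2e12,
                       dispatch_tax_s=5e-5)
print(t * 1e3, "ms")
```

The `max` encodes the roofline: whichever term dominates names the bottleneck, and the dispatch tax is the floor that survives even infinite hardware.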
- ITL (Inter-Token Latency)
- The time to generate each successive token after the first during LLM autoregressive decoding. ITL is almost always memory-bound—each decode step loads the full model weights plus the KV-cache. Measured in ms/token.
K
- KV-Cache
- The cached Key and Value matrices from the transformer attention mechanism, retained across decoding steps to avoid recomputation. Memory footprint grows linearly with sequence length and batch size: \(\text{Bytes} = 2 \times L \times B \times d \times \text{layers} \times \text{bytes\_per\_param}\).
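Plugging representative numbers into the formula shows why the KV-cache dominates serving memory. A minimal sketch (the Llama-2-7B-like shape, d = 4096 with 32 layers, is an assumption for illustration):

```python
def kv_cache_bytes(seq_len: int, batch: int, d_model: int,
                   n_layers: int, bytes_per_param: int = 2) -> int:
    """Bytes = 2 (K and V) x L x B x d x layers x bytes_per_param."""
    return 2 * seq_len * batch * d_model * n_layers * bytes_per_param

# fp16, batch 1, 4096-token context on a 7B-class model:
gb = kv_cache_bytes(4096, 1, 4096, 32) / 1e9
print(round(gb, 2))  # ~2.15 GB, growing linearly with L and B
```

At batch 32 the same context needs ~69 GB of cache alone, which is why PagedAttention and GQA exist.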
L
- Latency
- The wall-clock time to complete one inference or training step. In MLSys·im, latency is the output of the roofline equation. Measured in ms or μs.
- LLM (Large Language Model)
- A transformer-based model trained on large text corpora, typically with billions of parameters. Examples: GPT-4, Llama 3, Gemini. Key serving metrics: TTFT and ITL.
M
- Memory-Bound
- A workload whose performance is limited by the hardware’s memory bandwidth, not its peak FLOP/s. Adding more compute units does not help; you need faster memory, lower precision, or operator fusion. Contrast with Compute-Bound.
- MFU (Model FLOP Utilization)
- The fraction of theoretical peak FLOP/s actually achieved: \(\text{MFU} = \text{Achieved FLOP/s} / \text{Peak FLOP/s}\). Well-optimized training achieves 30–50% MFU; poorly optimized code may achieve <10%.
- MLPerf
- An industry-standard benchmark suite from MLCommons that provides reproducible, audited performance comparisons across hardware platforms for training and inference. MLSys·im validates its analytical predictions against MLPerf Inference results. See mlcommons.org.
- Model Compression
- An umbrella term for techniques that reduce model size and computational cost: quantization (lower numerical precision), pruning (removing weights), knowledge distillation (training a smaller student model), and neural architecture search (finding efficient architectures). Compression ratios compose multiplicatively: \(R_\text{total} = R_\text{quant} \times R_\text{prune}\).
O
- OpEx (Operational Expenditure)
- The ongoing costs of running hardware: electricity, networking, cooling, labor. In cloud pricing, OpEx accumulated over a 3-year period typically exceeds CapEx by 2–5×.
P
- PagedAttention
- A memory management technique for LLM serving that stores KV-cache in non-contiguous fixed-size pages, analogous to virtual memory paging in operating systems. Eliminates internal fragmentation from pre-allocated contiguous KV-cache slots. Introduced by vLLM (Kwon et al., 2023).
- Pipeline Parallelism (PP)
- A distributed training strategy that splits the model’s layers across devices, each device processing a different “stage.” Introduces a pipeline bubble of idle time at the start and end of each batch.
- Pipeline Bubble
- The fraction of time a pipeline-parallel system spends idle waiting for the first microbatch to propagate through all stages. \(\text{Bubble} = \frac{P-1}{P-1+M}\) where \(P\) is pipeline depth and \(M\) is microbatch count.
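The formula makes the depth/microbatch trade-off concrete. A minimal sketch with illustrative values:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Bubble = (P - 1) / (P - 1 + M) for pipeline depth P, microbatches M."""
    return (p - 1) / (p - 1 + m)

# Deeper pipelines need more microbatches to stay busy:
print(bubble_fraction(p=8, m=8))   # ~0.467: nearly half the time idle
print(bubble_fraction(p=8, m=64))  # ~0.099: bubble largely amortized away
```

Because the bubble shrinks only as M grows, pipeline parallelism favors large global batch sizes.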
- Precision
- The numerical format used to represent weights and activations. `fp32` (32-bit float) is most accurate; `fp16`/`bf16` (16-bit) halve memory usage and double throughput on modern tensor cores; `int8` and `int4` further reduce memory at the cost of accuracy.
- Progressive Lowering
- MLSys·im’s architectural principle: workload specifications (demand) are progressively mapped onto hardware specifications (supply) through a chain of analytical transformations. The reverse of how hardware is typically specified—starting from the algorithm, not the chip.
- PUE (Power Usage Effectiveness)
- \(\text{PUE} = \text{Total Facility Power} / \text{IT Equipment Power}\). A PUE of 1.0 is theoretical perfection; hyperscale datacenters achieve 1.1–1.4. Higher PUE means more energy wasted on cooling and facility overhead.
R
- Ridge Point
- The arithmetic intensity at which a workload transitions from memory-bound to compute-bound on a given hardware platform: \(I^* = \text{Peak\_FLOPs} / \text{Memory\_BW}\). For the A100 at fp16: \(I^* = 312 \text{ TFLOP/s} / 2 \text{ TB/s} = 156 \text{ FLOP/byte}\).
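Comparing a workload's arithmetic intensity to the ridge point classifies its regime. A minimal sketch using the A100 figures from this entry (the example intensities are illustrative):

```python
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """I* = Peak FLOP/s / Memory BW, in FLOP/byte."""
    return peak_flops / mem_bw

def regime(ai: float, ridge: float) -> str:
    """Classify a workload by its arithmetic intensity vs. the ridge."""
    return "compute-bound" if ai >= ridge else "memory-bound"

i_star = ridge_point(312e12, 2e12)   # A100 at fp16
print(i_star)                        # 156.0 FLOP/byte
print(regime(10, i_star))            # memory-bound (e.g. LLM decode)
print(regime(300, i_star))           # compute-bound (e.g. large-batch GEMM)
```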
- Roofline Model
- A visual and analytical tool that plots hardware performance ceilings (the “roofline”) and shows where workloads sit relative to them. Introduced by Williams et al. (2009). MLSys·im implements a generalized roofline via the Iron Law.
S
- SRAM (Static Random-Access Memory)
- Fast, on-chip memory used for caches and registers. In GPUs, the L1/L2 cache and shared memory are SRAM. In microcontrollers (e.g., ESP32-S3), SRAM is the primary working memory (typically 256 KiB–2 MiB). SRAM is orders of magnitude faster but smaller than DRAM/HBM.
- SSoT (Single Source of Truth)
- The principle that each specification (chip peak FLOPs, grid carbon intensity, etc.) has exactly one authoritative location—the MLSys Zoo. All computations derive from the Zoo, eliminating inconsistencies from stale copied values.
- Systems Wall
- A physical or logical constraint that bounds ML system performance. MLSys·im identifies 22 such walls organized into 6 domains (Node, Data, Algorithm, Fleet, Ops, Analysis). Each wall is resolved by a dedicated solver. See Wall.
T
- TCO (Total Cost of Ownership)
- The full cost of a system over its lifetime: \(\text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}\). Includes hardware purchase, electricity, cooling, networking, and labor.
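With straight-line amortization the formula reduces to one line. A minimal sketch (the $30k server and $1,200/month OpEx are hypothetical):

```python
def monthly_tco(capex_usd: float, lifetime_months: int,
                monthly_opex_usd: float) -> float:
    """TCO per month = straight-line amortized CapEx + monthly OpEx."""
    return capex_usd / lifetime_months + monthly_opex_usd

# Hypothetical server: $30k purchase, 3-year life, $1,200/month to run:
print(monthly_tco(30_000, 36, 1_200))  # ~2033.33 USD/month
```

Note that OpEx already exceeds amortized CapEx here, consistent with the OpEx entry above.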
- TDP (Thermal Design Power)
- The maximum sustained power a chip is designed to dissipate under load, in Watts. Relevant for datacenter cooling capacity planning. An H100 SXM5 has a TDP of 700 W.
- Tensor Core
- A specialized hardware unit in NVIDIA GPUs designed for matrix-multiply-accumulate operations. Tensor cores achieve much higher throughput than standard CUDA cores for ML workloads. The A100’s 312 TFLOP/s peak (fp16) comes from its tensor cores, not its CUDA cores.
- Tensor Parallelism (TP)
- A distributed training strategy that splits individual matrix multiplications across devices. Requires high-bandwidth intra-node connectivity (NVLink). Used in combination with data and pipeline parallelism in 3D parallelism.
- Throughput
- The number of samples processed per second. \(\text{Throughput} = \text{Batch\_Size} / \text{Latency}\). Note: maximizing throughput often conflicts with minimizing latency.
- TTFT (Time to First Token)
- The latency from receiving a user query to generating the first output token in an LLM serving system. Determined primarily by the pre-fill phase, which is compute-bound. Target: <200 ms for interactive applications.
U
- Utilization (η)
- The fraction of theoretical peak FLOP/s actually achieved in practice. Typical values: 30–50% for well-optimized training, 10–30% for inference. MLSys·im uses η as a parameter; see the hardware registry for per-device defaults.
W
- Wall (Systems Wall)
- A physical or logical constraint that bounds ML system performance. MLSys·im identifies 22 such walls organized into 6 domains. Nearly every wall is resolved by a dedicated solver; Walls 1 and 2 share the SingleNodeModel, so the 22 walls map to 21 resolvers. Examples: the Compute Wall (peak FLOP/s), the Memory Wall (HBM bandwidth), the Serving Wall (TTFT/ITL phases), the Capital Wall (TCO).
- Weight Streaming
- An inference architecture (used in wafer-scale systems like Cerebras) where model weights are streamed from external memory through the compute array rather than stored on-chip. Shifts the bottleneck from HBM bandwidth to injection interconnect bandwidth.
- WUE (Water Usage Effectiveness)
- Liters of water consumed per kilowatt-hour of energy. Relevant for datacenters using evaporative cooling. MLSys·im estimates water usage as \(\text{Water (L)} = \text{Energy (kWh)} \times \text{WUE}\).
Y
- Young-Daly Formula
- The optimal checkpoint interval for fault-tolerant distributed training: \(\tau_\text{opt} = \sqrt{2 \times \delta \times \text{MTBF}_\text{fleet}}\), where \(\delta\) is the time to save one checkpoint and MTBF is the mean time between failures of the fleet. Named after Young (1974) and Daly (2006).
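Combining this with the fleet MTBF rule (\(\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N\), defined under MTBF below) gives a usable checkpoint interval. A minimal sketch using the 1024-node example from the MTBF entry; the 5-minute checkpoint time is an assumption:

```python
import math

def young_daly_interval_h(checkpoint_time_h: float,
                          mtbf_node_h: float, n_nodes: int) -> float:
    """tau_opt = sqrt(2 * delta * MTBF_fleet), MTBF_fleet = MTBF_node / N."""
    mtbf_fleet = mtbf_node_h / n_nodes
    return math.sqrt(2 * checkpoint_time_h * mtbf_fleet)

# 1024 nodes, 100,000 h node MTBF (fleet MTBF ~98 h), 5-minute checkpoints:
tau = young_daly_interval_h(checkpoint_time_h=5 / 60,
                            mtbf_node_h=100_000, n_nodes=1024)
print(round(tau, 2))  # ~4.03 h between checkpoints
```

Checkpointing more often than this wastes time saving state; less often wastes time recomputing lost work.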
Additional Terms
- GQA (Grouped Query Attention)
- A transformer attention variant where multiple query heads share a single key-value head, reducing KV-cache memory without significantly affecting model quality. Used in Llama-3 and other modern LLMs.
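The KV-cache saving is simply the ratio of query heads to KV heads. A minimal sketch (the 32-query-head / 8-KV-head configuration is an assumption, typical of Llama-3-8B-class models):

```python
def gqa_kv_reduction(n_query_heads: int, n_kv_heads: int) -> float:
    """KV-cache shrinks by n_query_heads / n_kv_heads under GQA."""
    return n_query_heads / n_kv_heads

# Assumed Llama-3-8B-like config: 32 query heads sharing 8 KV heads.
print(gqa_kv_reduction(32, 8))  # 4.0x smaller KV-cache than full MHA
```

Multi-query attention (MQA) is the limiting case with a single KV head; GQA trades some of that saving back for quality.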
- Microbatch
- A subdivision of the training batch used in pipeline parallelism. Increasing the number of microbatches \(M\) reduces the pipeline bubble fraction: \(\text{Bubble} = \frac{P-1}{P-1+M}\).
- MTBF (Mean Time Between Failures)
- The average time a component operates before failing. For a fleet of \(N\) identical nodes, \(\text{MTBF}_\text{fleet} = \text{MTBF}_\text{node} / N\). A 1024-node cluster with 100,000-hour node MTBF has a fleet MTBF of about 98 hours.
- NVLink
- NVIDIA’s high-bandwidth interconnect for GPU-to-GPU communication within a server. Provides 900 GB/s bidirectional bandwidth per GPU in DGX H100 systems. Used for tensor parallelism, where low-latency intra-node communication is critical.
- Operator Fusion
- Combining multiple small GPU operations (kernels) into a single larger one to reduce memory transfers between operations. Fusing a matrix multiply followed by an activation function avoids writing and re-reading the intermediate result from HBM.
- SLA (Service Level Agreement)
- A target performance guarantee, typically specifying maximum acceptable latency and minimum throughput. For LLM serving, common SLAs target TTFT < 200 ms and ITL < 50 ms/token.
This glossary is updated with each MLSys·im release. If a term is missing, please open an issue.