Glossary

This glossary defines key terms used throughout this book and gathers them as a compact reference. Terms are organized alphabetically so they can be consulted alongside the chapters.

3

3dmark: Graphics performance benchmark suite that evaluates real-time 3D rendering capabilities, measuring triangle throughput, texture fill rates, and modern features like ray tracing and DLSS performance.

A

a/b testing: A controlled experimental method for comparing two versions of a system or model by randomly dividing users into groups and measuring performance differences between the variants.
ablation study: An experiment that removes, disables, or isolates one component at a time to measure its contribution to model or system behavior.
activation-based pruning: A pruning method that evaluates the average activation values of neurons or filters over a dataset to identify and remove neurons that consistently produce low activations and contribute little information to the network’s decision process.
activation checkpointing: A memory optimization technique that reduces memory usage during backpropagation by selectively discarding and recomputing activations instead of storing all intermediate results.
activation function: A mathematical function applied to the weighted sum of inputs in a neural network neuron to introduce nonlinearity, enabling the network to learn complex patterns beyond simple linear combinations.
activation memory: Memory used to store intermediate layer outputs needed for gradient computation during training, scaling with batch size, sequence length, and model depth.
active learning: An approach that intelligently selects the most informative examples for human annotation based on model uncertainty, reducing the amount of labeled data needed for effective training.
adam optimization: An adaptive learning rate optimization algorithm that combines momentum and RMSprop by maintaining exponentially decaying averages of both gradients and squared gradients for each parameter.
AdamW: A variant of Adam that decouples weight decay from the adaptive gradient update, improving regularization while keeping the optimizer’s adaptive moments.
ai triad: A framework modeling ML systems as three interdependent components: data that guides behavior, algorithms that learn patterns, and computational infrastructure that enables training and inference. Limitations in any component constrain the capabilities of the others.
alerting: Automated notification systems that inform teams when metrics exceed predefined thresholds or anomalies are detected in production ML systems.
AlexNet: A landmark convolutional neural network architecture that won the 2012 ImageNet challenge, reducing top-5 error from roughly 26 percent to roughly 16 percent and sparking the deep learning renaissance.
all-reduce: A collective communication operation in distributed computing where each process contributes data and all processes receive the combined result, commonly used for gradient aggregation in distributed training.
all-to-all communication: A communication pattern where every device or process exchanges data with every other device, common in sharded embedding and expert-parallel workloads.
alphafold: A landmark AI system developed by DeepMind that predicts the three-dimensional structure of proteins from their amino acid sequences, solving the decades-old protein folding problem and demonstrating how large-scale ML systems can accelerate scientific discovery.
Amdahl’s Law: A speedup bound showing that overall improvement is limited by the portion of a workload or pipeline that remains serial or unoptimized.
AOT compilation: Ahead-of-time compilation that converts a model, graph, or program into optimized executable code before deployment or execution, trading flexibility for lower runtime overhead.
apache kafka: A distributed streaming platform that handles real-time data feeds using a publish-subscribe messaging system, commonly used for building ML data pipelines with high throughput and fault tolerance.
apache spark: An open-source distributed computing framework that enables large-scale data processing across clusters of computers, revolutionizing ETL operations with in-memory computing capabilities.
application-specific integrated circuit (ASIC): Custom chips designed for specific computational tasks that offer superior performance and energy efficiency compared to general-purpose processors by abandoning general-purpose flexibility. Examples include Google’s TPUs, Cerebras Wafer-Scale Engine, and Bitcoin mining ASICs.
architectural efficiency: The dimension of model optimization that focuses on how computations are performed efficiently during training and inference by exploiting sparsity, factorizing large components, and dynamically adjusting computation based on input complexity.
arithmetic intensity: The ratio of floating-point operations to bytes of memory accessed (FLOP/byte), used in roofline analysis to determine whether workloads are memory bound or compute bound and to guide optimization priorities.
artificial general intelligence: AI systems capable of matching human-level performance across all cognitive tasks, requiring novel distributed architectures, energy-efficient hardware, and unprecedented infrastructure scale.
artificial intelligence: The field of computer science focused on creating systems that can perform tasks typically requiring human intelligence, such as perception, reasoning, learning, and decision-making.
artificial neural network: A computational model inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that can learn patterns from data through adjustable weights and biases.
artificial neurons: Basic computational units in neural networks that mimic biological neurons, taking multiple inputs, applying weights and biases, and producing an output signal through an activation function.
attention mechanism: A neural network component that computes weighted connections between elements based on their content, allowing dynamic focus on relevant parts of the input rather than fixed architectural connections.
AUC: Area under the ROC curve, a threshold-independent classification metric that compares true positive and false positive rates across decision thresholds.
audit trail: An append-only record of who accessed or changed data, models, or system artifacts and when, supporting accountability, debugging, and compliance.
autograd tape: A transient record of operations and saved values built during forward execution so reverse-mode automatic differentiation can compute gradients during the backward pass.
automatic differentiation: A computational technique that automatically calculates exact derivatives of functions implemented as computer programs by systematically applying the chain rule at the elementary operation level, essential for training neural networks through gradient-based optimization.
automatic mixed precision: A training technique that automatically manages the use of different numerical precisions (FP16, FP32) to optimize memory usage and computational speed while maintaining model accuracy.
automl: Automated Machine Learning that uses machine learning itself to automate model design decisions, including architecture search, hyperparameter optimization, and feature selection to create efficient models without manual intervention.
autoscaling: Dynamic adjustment of compute resources based on workload demand, automatically scaling up during peak usage and scaling down during low usage to optimize costs and performance.
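
As a worked illustration of the Amdahl’s Law entry above, the following minimal sketch computes the speedup bound for a workload in which only part of the runtime is accelerated. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen for this example only, and the 90 percent figure is an assumed input rather than a measurement.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    # The serial part is unchanged; the accelerated part shrinks by `factor`.
    return 1.0 / (serial_fraction + accelerated_fraction / factor)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on speedup as the acceleration factor grows without limit."""
    serial_fraction = 1.0 - accelerated_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Assumed example: 90 percent of the runtime is accelerated 10x.
    print(amdahl_speedup(0.9, 10.0))  # roughly 5.26
    print(amdahl_limit(0.9))          # approaches 10 however large the factor
```

Even a tenfold acceleration of 90 percent of the work yields only about a 5.3x overall gain, because the untouched 10 percent comes to dominate the runtime.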

B

backpropagation: An algorithm that computes gradients of the loss function with respect to network weights by propagating error signals backward through the network layers, enabling systematic weight updates during training.
backward pass: The training phase that propagates gradients from the loss back through the network to compute parameter updates, pairing with the forward pass.
bandwidth: The maximum rate of data transfer across a communication channel or memory interface, typically measured in bytes per second and critical for optimizing data movement in AI accelerators.
batch inference: The process of running inference on a large dataset in bulk, typically as a scheduled job, as opposed to real-time inference on individual requests. Enables higher throughput by amortizing overhead across many inputs.
batch ingestion: A data processing pattern that collects and processes data in groups or batches at scheduled intervals, suitable for scenarios where real-time processing is not critical.
batch normalization: A technique that normalizes the inputs to each layer to zero mean and unit variance across a mini-batch and then applies a learnable scale and shift to preserve representational power, which stabilizes training and often allows higher learning rates and faster convergence.
batch processing: The technique of processing multiple data samples simultaneously to amortize computation and memory access costs, improving overall throughput in neural network training and inference.
batch size: The number of training examples processed simultaneously during one iteration of neural network training, affecting both computational efficiency and gradient estimation quality.
batch throughput optimization: Techniques for maximizing the number of samples processed per unit time when handling multiple inputs simultaneously, using parallelism and batching efficiencies.
batched operations: Matrix computations that process multiple inputs simultaneously, converting matrix-vector operations into more efficient matrix-matrix operations to improve hardware utilization.
benchmark engineering: The systematic design and development of performance evaluation frameworks, involving test harness creation, metric selection, and result interpretation methodologies.
benchmark harness: Systematic infrastructure component that controls test execution, manages input delivery, and collects performance measurements under controlled conditions to ensure reproducible evaluations.
benchmarking: Systematic evaluation of compute performance, algorithmic effectiveness, and data quality in machine learning systems to optimize performance across diverse workloads and ensure reproducibility.
BERT: Bidirectional Encoder Representations from Transformers, a transformer-based language model introduced by Google in 2018 that revolutionized natural language processing through masked language modeling pretraining.
BF16: A 16-bit floating-point format developed by Google Brain that maintains the same dynamic range as FP32 but with reduced precision, making it particularly suitable for deep learning training.
bias: A learnable parameter added to the weighted sum in each neuron that shifts the activation function, allowing neurons to activate even when all inputs are zero and providing additional flexibility for the network to fit complex patterns.
binarization: An extreme quantization technique that reduces neural network weights and activations to binary values (typically -1 and +1), achieving maximum compression but often requiring specialized training procedures and hardware support.
biological neuron: A cell in the nervous system that receives, processes, and transmits information through electrical and chemical signals, serving as inspiration for artificial neural networks.
bitter lesson: Richard Sutton’s 2019 observation that general methods using computation consistently outperform approaches encoding human expertise, suggesting that systems engineering enabling computational scale is central to AI advancement.
black box: A system where you can observe the inputs and outputs but cannot see or understand the internal workings, particularly problematic in AI when systems make important decisions affecting people’s lives without providing explanations for their reasoning.
blas: Basic Linear Algebra Subprograms, a specification for low-level routines that perform common linear algebra operations such as vector addition, scalar multiplication, dot products, and matrix operations, forming the computational foundation of modern ML frameworks.
block sparse matrix: A sparse matrix whose nonzero values appear in dense blocks rather than isolated elements, preserving structure that hardware kernels can exploit efficiently.
blue-green deployment: A deployment strategy using two production environments so traffic can switch to a validated new version while the old version remains available for rollback.
bounding box: A rectangular annotation that identifies object locations in images by drawing a box around each object of interest, commonly used in computer vision training datasets.
brittleness: The tendency of rule-based AI systems to fail completely when encountering inputs that fall outside their programmed scenarios, no matter how similar those inputs might be to what they were designed to handle.

C

caching: A technique for storing frequently accessed data in high-speed storage systems to reduce retrieval latency and improve system performance in ML pipelines.
caching allocator: A framework memory manager that reuses device-memory blocks instead of calling the hardware allocator for every tensor, reducing allocation latency and fragmentation.
calibration: The process in post-training quantization of analyzing a representative dataset to determine optimal quantization parameters, including scale factors and zero points, that minimize accuracy loss when converting from high to low precision.
canary deployment: A deployment strategy where new model versions receive a small percentage of production traffic to validate behavior before full rollout, enabling early detection of issues with minimal user impact.
CAP theorem: A distributed systems principle stating that a distributed data store can provide at most two of three guarantees at the same time: Consistency, Availability, and Partition Tolerance.
carbon footprint: The total greenhouse gas emissions, typically measured in CO\(_2\) equivalent, produced directly and indirectly by training and operating an ML system.
carbon intensity: The emissions produced per unit of energy, usually measured as kg CO2e per kWh, used to convert ML energy use into carbon impact.
cerebras wafer-scale engine: In the CS-2 generation, a single-wafer processor containing 2.6 trillion transistors and 850,000 cores, designed to eliminate inter-device communication bottlenecks in large-scale machine learning training.
chain rule: The calculus rule enabling computation of derivatives for composite functions, fundamental to backpropagation.
channelwise quantization: A quantization granularity approach where each channel in a layer uses its own set of quantization parameters, providing more precise representation than layerwise quantization while maintaining hardware efficiency.
ci/cd pipelines: Continuous Integration and Continuous Delivery automated workflows that streamline model development by integrating testing, validation, and deployment processes.
circuit breaker pattern: A resilience pattern that stops or redirects calls to an unhealthy service after repeated failures to prevent cascading failure and enable fallback behavior.
classification labels: Simple categorical annotations that assign specific tags or categories to data examples, representing the most basic form of supervised learning annotation.
cloudsuite: Benchmark suite developed at EPFL that addresses modern data center workloads including web search, data analytics, and media streaming, measuring end-to-end performance across network, storage, and compute dimensions.
cold-start performance: Time required for a system to transition from idle state to active execution, particularly important in serverless environments where models are loaded on demand.
collaborative filtering: A technique used in recommendation systems that predicts user preferences by identifying patterns in interactions from many users, using the collective behavior of the crowd rather than just item properties.
communication tax: The performance penalty incurred in distributed systems due to the latency and bandwidth costs of synchronizing state between nodes.
compound ai systems: AI architectures that combine multiple specialized models and components working together, rather than relying on a single monolithic model, enabling modularity, specialization, and improved interpretability.
compressed sparse row (CSR): A memory-efficient storage format for sparse matrices that uses three arrays (values, column indices, row pointers) to store only nonzero elements, reducing memory from \(\mathcal{O}(N^2)\) (for an \(N{\times}N\) matrix) to \(\mathcal{O}(K + N)\) where \(K\) is the number of nonzeros.
computational graph: A directed graph representing the sequence of operations in neural network computation, enabling automatic differentiation.
computer engineering: An engineering discipline that emerged in the late 1960s to address the growing complexity of integrating hardware and software systems, combining expertise from electrical engineering and computer science to design and build complex computing systems.
concept drift: The phenomenon where the statistical relationship between input features and target outputs changes over time, causing model performance to degrade. It is distinct from data drift, where only input distributions change.
conditional computation: A dynamic optimization technique where different parts of a neural network are selectively activated based on input characteristics, reducing computational load by skipping unnecessary computations for specific inputs.
consensus labeling: A quality control approach that collects multiple annotations for the same data point to identify controversial cases and improve label reliability through inter-annotator agreement.
containerization: Packaging applications and their dependencies into portable, isolated containers using tools like Docker to ensure consistent execution across different environments.
continuous batching: An LLM serving technique that adds and removes requests from an inference batch between token-generation steps as sequences start and finish.
continuous integration: A software development practice where code changes are automatically integrated, tested, and validated multiple times per day to detect issues early in the development cycle.
convolution: A mathematical operation that slides a filter (kernel) across input data to extract features such as edges, textures, or patterns. Fundamental to convolutional neural networks and particularly effective for processing images and spatial data.
convolutional neural network: A specialized neural network architecture designed for processing grid-like data such as images, using convolutional layers that apply filters to detect local features.
coreset: A small subset of a dataset selected to preserve important statistical or training properties of the full dataset while reducing training or evaluation cost.
correction cascade: An ML technical-debt pattern where fixing one component creates new failures or redesign needs elsewhere because data and model dependencies propagate through the system.
covariate shift: A distribution shift where input feature distributions change while the relationship between features and labels remains stable.
cp decomposition: CANDECOMP/PARAFAC decomposition that expresses a tensor as a sum of rank-one components, used to compress neural network layers by reducing the number of parameters while preserving computational functionality.
credit assignment problem: The challenge of determining which weights in a multi-layer network contributed to prediction errors.
crisp-dm: Cross-Industry Standard Process for Data Mining, a structured methodology developed in 1996 that defines six phases for data projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
critical batch size: The batch size beyond which larger batches provide diminishing optimization benefit and may reduce generalization, limiting simple linear learning-rate scaling.
cross-entropy loss: A loss function commonly used in classification tasks that measures the difference between predicted probability distributions and true class labels, providing strong gradients for effective learning.
crowdsourcing: A collaborative data collection approach that uses distributed individuals via the internet to perform annotation tasks, enabling scalable dataset creation through platforms like Amazon Mechanical Turk.
cublas: NVIDIA’s CUDA Basic Linear Algebra Subprograms library that provides GPU-accelerated implementations of standard linear algebra operations, enabling high-performance matrix computations on NVIDIA graphics processing units.
CUDA (Compute Unified Device Architecture): NVIDIA’s parallel computing platform and programming model that enables general-purpose computing on graphics processing units (GPUs), allowing machine learning frameworks to use massive parallelism for accelerated tensor operations.
CUDA stream: An ordered queue of GPU work used to schedule kernels, memory transfers, and synchronization events so independent operations can overlap.
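
To make the compressed sparse row (CSR) entry above concrete, the sketch below builds the three arrays it names (values, column indices, row pointers) from a small dense matrix and uses them for a matrix-vector product. It is a plain-Python illustration under assumed inputs, not a production kernel; `dense_to_csr` and `csr_matvec` are hypothetical helper names.

```python
def dense_to_csr(matrix):
    """Convert a dense list-of-lists matrix into CSR arrays."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for col, entry in enumerate(row):
            if entry != 0:
                values.append(entry)
                col_indices.append(col)
        # row_ptr[i + 1] marks where row i's nonzeros end in `values`.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, vector):
    """Multiply a CSR matrix by a dense vector, touching only nonzeros."""
    result = []
    for row in range(len(row_ptr) - 1):
        total = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * vector[col_indices[k]]
        result.append(total)
    return result


if __name__ == "__main__":
    dense = [
        [5, 0, 0, 0],
        [0, 0, 3, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 2],
    ]
    values, col_indices, row_ptr = dense_to_csr(dense)
    print(values)       # [5, 3, 1, 2]
    print(col_indices)  # [0, 2, 0, 3]
    print(row_ptr)      # [0, 1, 2, 2, 4]
    print(csr_matvec(values, col_indices, row_ptr, [1, 2, 3, 4]))  # [5, 9, 0, 9]
```

The empty third row costs only one repeated entry in the row-pointer array, which is where the \(\mathcal{O}(K + N)\) storage bound comes from: \(K\) values, \(K\) column indices, and \(N + 1\) row pointers.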

D

dam taxonomy: A diagnostic framework that classifies ML system bottlenecks into three mutually exclusive and collectively exhaustive (MECE) categories: Data (information flow, bounded by bandwidth), Algorithm (mathematical logic, bounded by total operations), and Machine (physical execution, bounded by peak throughput).
dartmouth conference: The eight-week workshop held at Dartmouth College in 1956, organized by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon, that is widely regarded as the founding event of AI as a field and whose proposal introduced the term artificial intelligence.
data as source code: The Software 2.0 principle that training data defines ML system behavior, so dataset changes act like behavior-changing code changes.
data augmentation: The process of artificially expanding training datasets by creating modified versions of existing data through transformations like rotation, scaling, or noise injection.
data cascades: Systemic failures where data quality issues compound over time, creating downstream negative consequences such as model failures, costly rebuilding, or project termination.
data center: A facility that houses computer systems and associated components such as telecommunications and storage systems, typically containing thousands of servers for cloud computing operations.
data-centric approach: A machine learning paradigm that prioritizes improving data quality, diversity, and curation rather than solely focusing on model architecture improvements to achieve better performance.
data-centric computing: Systems optimized for the efficient ingestion of data and iterative refinement of model parameters, where the programmer’s job is to curate data.
data contracts: Explicit agreements between data producers and consumers defining schema, quality expectations, and service levels to prevent downstream breakage.
data curation: The process of selecting, organizing, and maintaining high-quality datasets by removing irrelevant information, correcting errors, and ensuring data meets specific standards for machine learning applications.
data debt: Accumulated data quality, documentation, schema, lineage, or freshness liabilities that compound over time and degrade model behavior.
data drift: The phenomenon where the statistical properties of input data change over time, causing machine learning model performance to degrade even when the underlying code remains unchanged.
data echoing: A training input-pipeline technique that reuses batches, often with fresh augmentation, when data preparation is slower than accelerator consumption.
data governance: The framework of policies, procedures, and technologies that ensure data security, privacy, compliance, and ethical use throughout the machine learning pipeline.
data gravity: The architectural pull created by large data volumes, transfer time, bandwidth limits, and egress cost, which tends to draw computation toward the data.
data ingestion: The process of collecting and importing raw data from various sources into a system where it can be stored, processed, and prepared for machine learning applications.
data lake: A storage repository that holds structured, semi-structured, and unstructured data in its native format, using schema-on-read approaches for flexible data analysis.
data lakehouse: A storage architecture that combines data lake flexibility with warehouse-style query semantics, schema management, and transactional guarantees.
data lineage: The documentation and tracking of data flow through various transformations and processes, providing visibility into data origins and modifications for compliance and debugging.
data locality: The principle that data and computation should be colocated when transfer latency, bandwidth, or cost dominates remote processing.
data parallelism: A distributed training strategy that splits the dataset across multiple devices while each device maintains a complete copy of the model, enabling parallel computation of gradients.
data pipeline: The infrastructure and workflows that automate the movement and transformation of data from sources through processing stages to final storage or consumption.
data quality: The degree to which data meets requirements for accuracy, completeness, consistency, and timeliness, directly impacting machine learning model performance.
data quality multiplier: The concept that improvements in data quality (like cleaner labels) yield multiplicative rather than additive gains in model performance compared to model tweaks.
data validation: The systematic verification that collected data meets quality standards, is properly formatted, and contains accurate information suitable for machine learning model training and evaluation.
data versioning: The practice of tracking and managing different versions of datasets over time, similar to code versioning, to ensure reproducibility and enable rollback to previous data states when needed.
data warehouse: A centralized repository optimized for analytical queries (OLAP) that stores integrated, structured data from multiple sources in a standardized schema.
dataflow architecture: Specialized computing architecture where instruction execution is determined by data availability rather than a program counter, enabling highly parallel processing of neural network operations.
dataflow challenges: Technical difficulties in managing data movement and dependencies in hardware accelerators, including memory bandwidth limitations and synchronization requirements.
datasheets for datasets: Documentation for training data that captures provenance, collection methodology, demographic composition, and known limitations affecting model behavior.
dead letter queue: A separate storage mechanism for data that fails processing, allowing for later analysis and potential reprocessing of problematic data without blocking the main pipeline.
deduplication: The removal of exact, near-duplicate, or semantically redundant samples from a dataset to reduce storage, training compute, and leakage risk.
deep learning: A subfield of machine learning that uses artificial neural networks with multiple layers to automatically learn hierarchical representations from data without explicit feature engineering.
demographic parity: A fairness criterion requiring that the probability of receiving a positive prediction is independent of group membership across protected attributes.
dennard scaling: The observation that as transistors became smaller, their power density remained constant, enabling higher performance at the same power envelope. Its breakdown around 2005 ended the era of free frequency scaling.
dense layer: A fully-connected neural network layer where each neuron receives input from all neurons in the previous layer, enabling comprehensive information integration across features.
deployment constraints: Operational limitations such as hardware resources, network connectivity, regulatory requirements, and integration requirements that influence how machine learning models are implemented in production environments.
deployment paradigm: A distinct approach to hosting and executing ML models characterized by specific resource constraints and operational properties, such as Cloud ML, Edge ML, Mobile ML, or TinyML.
depthwise separable convolution: A convolution factorization that applies spatial filtering independently per channel and then mixes channels with a pointwise convolution to reduce computation.
devops: Software development practice that combines development and operations teams to shorten development cycles and deliver high-quality software through automation and collaboration.
dhrystone: Integer-based benchmark introduced in 1984 that measures integer and string operations in DMIPS (Dhrystone MIPS), designed to complement floating-point benchmarks with typical programming constructs.
diabetic retinopathy: A diabetes complication that damages blood vessels in the retina, serving as a leading cause of preventable blindness and a key application area for medical AI screening systems.
differential privacy: A mathematical framework for quantifying and limiting the privacy loss when releasing statistical information about datasets, ensuring individual privacy while enabling useful data analysis.
direct memory access (DMA): A mechanism that lets a device move data to or from memory without CPU-managed copying for each transfer, enabling overlap and reducing host overhead.
disaggregated evaluation: The practice of breaking down model performance metrics by demographic groups or other factors to reveal disparities that are hidden by aggregate measures.
dispatch tax: Runtime overhead from launching and orchestrating many small operations through the host framework instead of executing fused or compiled graph regions.
distributed computing: An approach that processes data across multiple machines or processors simultaneously, enabling scalable handling of large datasets through frameworks like Apache Spark.
distributed intelligence: The placement of computational capabilities across multiple devices and locations rather than relying on a single centralized system, enabling local processing and decision-making.
distributed training: A method of training machine learning models across multiple machines or devices to handle larger datasets and models that exceed single-device computational or memory capacity.
distribution shift: A change in the statistical properties of data between training and deployment, or over time during deployment. Types include covariate shift (input distribution changes), label shift (output distribution changes), and concept drift (relationship between inputs and outputs changes).
dlrm: Deep Learning Recommendation Model, an architecture developed by Meta that combines categorical embeddings with a bottom multi-layer perceptron (MLP) and top MLP to handle the massive scale and sparsity of recommendation system workloads.
domain-specific architecture: Hardware designs tailored to optimize specific computational workloads, trading flexibility for improved performance and energy efficiency compared to general-purpose processors.
dropout: A regularization technique that randomly sets a fraction of input units to zero during training to prevent overfitting and improve generalization.
dying relu problem: A failure mode where ReLU neurons become permanently inactive and output zero for all inputs, preventing them from contributing to learning when weighted inputs consistently produce negative preactivations.
dynamic batching: A serving technique that waits briefly to group independent requests into a batch, improving throughput while adding controlled batch-formation latency.
dynamic graph: A computational graph that is built and modified during program execution, allowing for flexible model architectures and easier debugging but potentially limiting optimization opportunities compared to static graphs.
dynamic pruning: A pruning approach that decides which weights, neurons, or layers to skip or remove at runtime, based on the input or on training progress, rather than fixing the pruned structure once before deployment.
dynamic quantization: A quantization approach in which weights are typically converted to lower-bit representations ahead of time while the quantization parameters for activations are computed at runtime from observed value ranges, reducing memory usage and computational requirements without a separate calibration step.
dynamic random access memory (dram): A type of volatile memory that stores data in capacitors and requires periodic refresh cycles, commonly used as main memory in computer systems.
dynamic voltage and frequency scaling (dvfs): Power management technique that adjusts processor voltage and clock frequency based on workload demands to optimize energy consumption while maintaining performance.
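
To make the dropout entry above concrete, the following is a minimal sketch in plain Python. It assumes the common "inverted dropout" variant, in which surviving units are rescaled by 1/(1 - rate) during training so that no rescaling is needed at inference; the glossary entry does not specify this detail, and the function name `dropout` and its parameters are illustrative only.

```python
import random


def dropout(units, rate, training=True, rng=None):
    """Inverted dropout (an assumed variant): zero each unit with probability
    `rate` during training and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time the input passes through unchanged.
        return list(units)
    if rng is None:
        rng = random.Random()
    keep_prob = 1.0 - rate
    return [u / keep_prob if rng.random() < keep_prob else 0.0 for u in units]


# Usage: roughly half of the units are zeroed, the rest are doubled.
activations = [1.0] * 8
print(dropout(activations, rate=0.5, rng=random.Random(0)))
print(dropout(activations, rate=0.5, training=False))
```

Scaling during training rather than at inference keeps the expected activation the same in both modes, which is why many implementations take this form.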

E

eager execution: An execution mode where operations are evaluated immediately as they are called in the code, providing an intuitive debugging and development experience but potentially sacrificing some optimization opportunities available in graph-based execution.
early exit architectures: Neural network designs that include multiple prediction heads at different depths, allowing samples to exit early when confident predictions can be made, reducing average computational cost per inference.
edge computing: A distributed computing paradigm that brings computation and data storage closer to the sources of data, reducing latency and bandwidth usage.
edge deployment: A deployment strategy where machine learning models run locally on devices at the network edge rather than in centralized cloud servers, reducing latency and enabling operation without constant internet connectivity.
efficiency frontier: The optimal trade-off curve between accuracy and computational cost (latency, energy, or memory), where no model exists that is both more accurate and more efficient.
EfficientNet: A family of neural network architectures discovered through Neural Architecture Search that achieves better accuracy-efficiency trade-offs by using compound scaling to balance network depth, width, and input resolution.
EL2N (Error L2-Norm): A sample-scoring method that ranks examples by early prediction error, often using a proxy model, to identify informative boundary cases for data selection.
eliza: One of the first chatbots created by MIT’s Joseph Weizenbaum in 1966 that could simulate human conversation through pattern matching and substitution, notable because people began forming emotional attachments to this simple program.
elt (extract, load, transform): A data processing paradigm that first loads raw data into the target system before applying transformations, providing flexibility for evolving analytical needs.
embedded systems: Computer systems with dedicated functions within larger mechanical or electrical systems, typically designed for specific tasks with real-time computing constraints.
embedding sharding: Splitting large embedding tables across devices or nodes so they fit in aggregate memory, while adding communication to gather embeddings for each batch.
embedding table: A lookup table that maps discrete IDs or tokens to dense vectors, often dominating memory in recommendation models and large vocabularies.
emergent behaviors: Unexpected system-wide patterns or characteristics that arise from the interaction of individual components, often becoming apparent only when systems operate at scale or in real-world conditions.
encoder-decoder: An architectural pattern where an encoder processes input into a compressed representation and a decoder generates output from this representation, commonly used in sequence-to-sequence tasks.
end-to-end benchmarks: Comprehensive evaluation methodology that assesses entire AI system pipelines including data processing, model execution, postprocessing, and infrastructure components.
energy efficiency: The measure of computational work performed per unit of energy consumed, typically expressed as operations per joule and crucial for battery-powered and data center deployments.
energy per inference: The total energy consumed to produce one model prediction, including model execution and relevant system overhead.
epoch: One complete pass through the entire training dataset during neural network training, consisting of multiple batch iterations depending on dataset size and batch size.
equal opportunity: A fairness criterion requiring equal true positive rates across demographic groups, so that qualified individuals are equally likely to receive a positive prediction regardless of group.
equalized odds: A fairness criterion requiring that both true positive and false positive rates are equal across different demographic groups.
etl (extract, transform, load): A traditional data processing paradigm that transforms data before loading it into a data warehouse, resulting in ready-to-query formatted data.
expected calibration error (ECE): A metric measuring the gap between predicted confidence and observed accuracy across confidence bins, used to assess probability calibration.
experiment tracking: The systematic recording and management of machine learning experiments, including hyperparameters, model versions, training data, and performance metrics, to enable comparison and reproducibility.
expert systems: AI systems from the mid-1970s that captured human expert knowledge in specific domains, exemplified by MYCIN for diagnosing blood infections, representing a shift from general AI to domain-specific applications.
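
The expected calibration error entry above can be illustrated with a short sketch. It assumes equal-width confidence bins and weights each bin by its share of the samples, which is one common formulation; the function name and the default of 10 bins are illustrative choices, not something the glossary prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct:     booleans, True where the prediction was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Usage: a model that reports 90% confidence but is right 3 times out of 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A value near zero indicates that reported confidence tracks observed accuracy; the result depends on the number of bins, so comparisons should use the same binning.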

F

fairness laundering: A failure mode where removing explicit protected attributes makes a system appear compliant while correlated proxy variables still reproduce discriminatory outcomes.
false positive rate (FPR): The proportion of actual negative cases incorrectly classified as positive, computed as false positives divided by false positives plus true negatives.
farmbeats: A Microsoft Research project that applies machine learning and IoT technologies to agriculture, using edge computing to collect real-time data on soil conditions and crop health while demonstrating distributed AI systems in challenging real-world environments.
feature engineering: The process of manually designing and extracting relevant features from raw data to improve machine learning model performance, largely automated in deep learning systems.
feature map: The output of a convolutional layer representing the response of learned filters to different spatial locations in the input, capturing detected features at various positions.
feature store: A specialized data storage system that provides standardized, reusable features for machine learning, enabling feature sharing across multiple models and teams.
federated learning: A machine learning approach that trains algorithms across decentralized edge devices or servers holding local data samples, without exchanging the raw data.
feedback loop: A cyclical process where outputs influence future inputs. In ML, this includes both beneficial cycles (where model outputs inform system improvement) and harmful ones (where a model’s predictions influence its own future training data, potentially reinforcing and amplifying initial biases over time).
feedforward network: A neural network architecture where information flows in one direction from input to output layers without cycles, forming the foundation for many deep learning models.
field-programmable gate array (FPGA): A reconfigurable integrated circuit that can be programmed after manufacturing to implement custom digital circuits and specialized computations, offering flexibility between general-purpose processors and ASICs.
fine-tuning: Adapting a pretrained model to a target task or domain by continuing training on task-specific data.
five-pillar framework: An organizational structure for ML systems engineering comprising five interconnected disciplines: Data Engineering, Training Systems, Deployment Infrastructure, Operations and Monitoring, and Ethics and Governance.
FlashAttention: An IO-aware attention algorithm that tiles computation to avoid materializing the full attention matrix in high-bandwidth memory, reducing memory traffic and enabling longer sequences.
floating-point unit (fpu): A specialized processor component designed to perform arithmetic operations on floating-point numbers with high precision and efficiency.
FLOP/s: Floating Point Operations Per Second, a measure of computational throughput that quantifies how many floating-point arithmetic operations a system can perform each second.
forgetting events: Events where a model changes from classifying an example correctly to misclassifying it later in training, marking difficult or informative samples.
forward pass: The process of computing neural network predictions by passing input data through successive layers, applying weights, biases, and activation functions at each stage to produce outputs. Also called forward propagation.
foundation model: A large-scale machine learning model trained on broad data that can be adapted to a wide range of downstream tasks, serving as a base for specialized applications.
four pillars framework: A data engineering framework organizing concerns into Quality, Reliability, Scalability, and Governance to manage the data lifecycle.
FP16: 16-bit floating-point numerical representation that reduces memory usage and accelerates computation on modern hardware accelerators while maintaining acceptable precision for many machine learning applications.
FP32: 32-bit floating-point numerical representation that provides standard precision for mathematical computations but requires more memory and computational resources than lower-precision formats.
FP32 to INT8: A common quantization transformation that converts 32-bit floating point weights and activations to 8-bit integers, achieving roughly 4\(\times\) memory reduction while maintaining acceptable accuracy for many models.
FP8: An 8-bit floating-point format family used on newer accelerators for high-throughput training and inference, requiring careful scaling because of limited precision and range.
framework decomposition: The systematic breakdown of neural network frameworks into hardware-mappable components, enabling efficient distribution of operations across processing elements.
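
As a rough illustration of the FP32 to INT8 entry above, the sketch below quantizes a list of values with a single symmetric per-tensor scale. Symmetric scaling to the range [-127, 127] is an assumption made here for simplicity; real toolchains also use asymmetric schemes with zero points and finer granularities. The helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (assumed scheme): map floats to
    integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Usage: each value now needs 1 byte instead of 4, plus one shared scale.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

The reconstruction error of each value is bounded by about half the scale, which is why tensors with a few large outliers tend to quantize poorly under a single shared scale.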

G

gemm: General Matrix Multiply operations that follow the pattern \(C = \alpha AB + \beta C\), representing the fundamental computational kernel underlying most neural network operations including fully connected layers and convolutional layers.
gemv: General Matrix-Vector multiplication operations that compute the product of a matrix and a vector, commonly used in neural network computations and requiring careful optimization for memory access patterns.
generalization: The ability of a machine learning model to perform well on unseen data that differs from the training set, often improved through diverse and high-quality training data.
generalization gap: The difference between a model’s performance on training data and its performance on unseen real-world data. A large generalization gap indicates the model has memorized training examples rather than learning transferable patterns.
generative ai: A category of artificial intelligence systems capable of creating new content such as text, images, audio, or video based on learned patterns from training data.
Goodhart’s Law: The observation that when a measure becomes an optimization target, it can stop representing the underlying goal.
GPT-2: A 1.5-billion parameter autoregressive language model released by OpenAI in 2019, serving as a primary Lighthouse Model for analyzing memory bandwidth constraints in transformer inference.
graceful degradation: A system design principle where services continue functioning with reduced capabilities when faced with partial failures or data unavailability.
gradient accumulation: A technique that simulates larger batch sizes by accumulating gradients from multiple smaller batches before updating model parameters, enabling training with limited memory.
gradient-based pruning: A pruning method that uses gradient information during training to identify neurons or filters with smaller gradient magnitudes, which contribute less to reducing the loss function and can be safely removed.
gradient clipping: A training stabilization technique that prevents gradient explosion by limiting the magnitude of gradients during backpropagation, typically by rescaling gradients when their norm exceeds a threshold.
gradient compression: A technique used in distributed training to reduce the communication overhead by compressing gradient information exchanged between computing nodes.
gradient descent: An optimization algorithm that iteratively adjusts neural network parameters in the direction that minimizes the loss function, using gradients to determine update directions and magnitudes.
gradient synchronization: The process in distributed training where locally computed gradients are aggregated across devices to ensure all devices update their parameters consistently.
GraNd (Gradient Normed): A data selection score based on the norm of an example’s gradient during early training, identifying samples with larger learning signal.
graph break: A point where a compiler cannot capture part of a program and must fall back to eager execution, splitting optimized graph regions.
graphics processing unit (GPU): A specialized processor originally designed for rendering graphics that provides massive parallel computing capabilities essential for efficient neural network computation, training, and inference.
green ai: A movement in AI research and practice that prioritizes computational efficiency and energy consumption as primary metrics alongside traditional performance metrics like accuracy.
green500: Ranking system that evaluates the world’s most powerful supercomputers based on energy efficiency measured in FLOP/s per watt rather than raw computational performance.
ground truth: The objective reality or actual state of the phenomenon being observed, used as the reference standard (labels) for training and evaluating machine learning models.
gru: Gated Recurrent Unit, a simplified variant of LSTM that uses fewer gates while maintaining the ability to capture long-term dependencies in sequential data.
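
The gradient accumulation entry above can be sketched with a toy example. It assumes a one-parameter linear model with mean squared error, chosen only so the arithmetic is easy to follow; the function names are hypothetical. The point it shows is that summing suitably weighted micro-batch gradients reproduces the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Process the batch in smaller slices, accumulating gradients so the
    result matches a single pass over the full batch."""
    total = len(xs)
    grad = 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        grad += mse_grad(w, mx, my) * len(mx) / total
    return grad


# Usage: only two examples need to be held in memory at a time.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(mse_grad(0.5, xs, ys))
print(accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

In a real training loop the parameter update would be applied once after the final micro-batch, which is what lets a small device emulate a larger effective batch size.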

H

hardware abstraction: The layer in ML frameworks that provides a unified interface to diverse computing hardware (CPUs, GPUs, TPUs, accelerators) while handling device-specific optimizations and memory management behind the scenes.
hardware acceleration: The use of specialized computing hardware to perform certain operations faster and more efficiently than software running on general-purpose processors.
hardware accelerator: Specialized computing hardware designed to efficiently execute specific types of computations, such as GPUs for parallel processing or TPUs for machine learning workloads.
hardware-aware design: The practice of designing neural network architectures specifically optimized for target hardware platforms, considering factors like memory hierarchy, compute units, and data movement patterns to maximize efficiency.
hidden layer: An intermediate layer in a neural network between input and output layers that learns abstract representations by transforming data through learned weights and activation functions.
hidden state: The internal memory of recurrent neural networks that carries information from previous time steps, enabling the network to maintain context across sequential inputs.
hierarchical processing: A multi-tier system architecture where data and intelligence flow between different levels of the computing stack, from sensors to edge devices to cloud systems.
high bandwidth memory (hbm): A stacked DRAM technology that provides much higher bandwidth than conventional DRAM modules by using 3D stacking and wide interfaces, critical for data-intensive AI workloads.
horizontal scaling: Increasing system capacity by adding more machines or instances rather than upgrading existing hardware, providing better fault tolerance and load distribution.
hybrid machine learning: The integration of multiple ML paradigms such as cloud, edge, mobile, and tiny ML to form unified distributed systems that combine complementary strengths.
hybrid parallelism: A distributed training approach that combines data parallelism and model parallelism to use the benefits of both strategies for training very large models.
hyperparameter: A configuration setting that controls the learning process but is not learned from data, such as learning rate, batch size, or network architecture choices.
hyperscale data center: Large-scale data center facilities containing thousands of servers and covering extensive floor space, designed to efficiently support massive computing workloads.

I

I/O bottleneck: A performance limit where storage, decoding, preprocessing, or data transfer cannot supply data fast enough to keep compute resources busy.
idempotent transformation: A data transformation that produces the same output when repeated on the same input, supporting safe retries and reprocessing.
im2col: A convolution-lowering technique that unfolds image patches into matrix columns so convolution can execute as GEMM.
ImageNet: A massive visual database containing over 14 million labeled images across 20,000+ categories, created by Stanford’s Fei-Fei Li starting in 2009, whose annual challenge became instrumental in driving breakthrough advances in computer vision.
imperative programming: A programming paradigm where operations are executed immediately as they are encountered in the code, allowing for natural control flow and easier debugging but potentially limiting optimization opportunities.
inductive bias: A structural assumption built into a model or architecture that restricts the functions it can learn efficiently, such as locality in CNNs.
inference: The operational phase where trained neural networks make predictions on new data using fixed parameters, without weight updates.
InfiniBand: A high-throughput, low-latency networking technology commonly used for multi-node accelerator clusters and distributed training.
information-compute ratio (icr): A metric quantifying the efficiency of data selection, defined as the ratio of model performance gain to the computational cost (FLOPs) required to achieve it. Maximizing ICR is the primary goal of data selection strategies.
infrastructure as code: Practice of managing and provisioning computing infrastructure through machine-readable configuration files rather than manual processes, enabling version control and automation.
INT8: 8-bit integer numerical representation used in quantized neural networks, where model weights and activations are represented using 8-bit integers instead of 32-bit floating point, reducing memory usage by roughly 4\(\times\) and accelerating inference on specialized hardware while attempting to maintain model accuracy.
intermediate representation (IR): A compiler-internal representation between frontend model code and backend hardware code generation, used by ML frameworks for optimization and portability.
internet of things: A network of physical objects embedded with sensors, software, and other technologies that connect and exchange data with other devices and systems over the internet.
intersectional analysis: Evaluation that considers combinations of demographic attributes (for example, race and gender simultaneously) to detect concentrated harms not visible in single-factor analysis.
iops: Input/Output Operations Per Second; a storage performance metric measuring the number of read/write operations a device can handle per second, critical for random access workloads like training data loading.
iron law of ml systems: A quantitative framework decomposing ML system performance into three terms: Data (limited by bandwidth), Compute (limited by FLOP/s), and Latency (limited by overhead). Formulated as \(T = D_{\text{vol}}/\text{BW} + O/(R_{\text{peak}} \cdot \eta_{\text{hw}}) + L_{\text{lat}}\).
iterative pruning: A gradual pruning strategy that removes parameters in multiple stages with fine-tuning between each stage, allowing the model to adapt to reduced capacity and typically achieving better accuracy than one-shot pruning.
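
The formula in the iron law of ml systems entry above can be evaluated directly. The sketch below is a plain transcription of \(T = D_{\text{vol}}/\text{BW} + O/(R_{\text{peak}} \cdot \eta_{\text{hw}}) + L_{\text{lat}}\); the function name, argument names, and the example numbers are hypothetical and chosen only to show how the three terms compare.

```python
def iron_law_time(data_bytes, bandwidth_bytes_per_s, ops,
                  peak_ops_per_s, hw_efficiency, latency_s):
    """Return the data, compute, and latency terms and their sum, in seconds."""
    data_term = data_bytes / bandwidth_bytes_per_s
    compute_term = ops / (peak_ops_per_s * hw_efficiency)
    total = data_term + compute_term + latency_s
    return {"data": data_term, "compute": compute_term,
            "latency": latency_s, "total": total}


# Usage with made-up numbers: 2 GB moved at 1 TB/s, 4 GFLOP of work on a
# device with 100 TFLOP/s peak at 50% efficiency, and 10 microseconds of overhead.
terms = iron_law_time(
    data_bytes=2e9,
    bandwidth_bytes_per_s=1e12,
    ops=4e9,
    peak_ops_per_s=1e14,
    hw_efficiency=0.5,
    latency_s=1e-5,
)
for name, seconds in terms.items():
    print(f"{name}: {seconds:.6f} s")
```

With these illustrative inputs the data term dominates, which is the kind of diagnosis the decomposition is meant to support.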

J

JAX: A numerical computing library developed by Google Research that combines NumPy’s API with functional programming transformations including automatic differentiation, just-in-time compilation, and automatic vectorization for high-performance machine learning research.
jit compilation: Just-In-Time compilation that analyzes and optimizes code at runtime, enabling frameworks to balance the flexibility of eager execution with the performance benefits of graph optimization by compiling frequently used functions.

K

k-anonymity: A privacy technique that ensures each record in a dataset is indistinguishable from at least k-1 other records by generalizing quasi-identifiers.
kernel: A small matrix of learnable weights used in convolutional layers to detect specific features through the convolution operation, also called a filter.
kernel fusion: An optimization technique that combines multiple computational operations into a single kernel to reduce memory transfers and improve performance on parallel processors.
keyword spotting (kws): A technology that detects specific wake words or phrases in audio streams, typically used in voice-activated devices with constraints on power consumption and latency.
knowledge distillation: A model compression technique where a smaller “student” network learns to mimic the behavior of a larger “teacher” network by training on the teacher’s soft output probabilities rather than just hard labels.
Kolmogorov-Smirnov test (K-S test): A nonparametric test that measures the maximum distance between two cumulative distributions, often used to compare feature distributions for drift.
Kullback-Leibler divergence (KL divergence): An asymmetric information-theoretic measure of how one probability distribution differs from another, used in monitoring and drift analysis.
KV cache: Stored key and value tensors from previous transformer tokens used during autoregressive inference to avoid recomputing attention history.
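
The asymmetry mentioned in the Kullback-Leibler divergence entry above is easy to see numerically. The following sketch assumes discrete distributions given as lists of probabilities and uses the natural logarithm, so results are in nats; the function name is illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Usage: compare a reference feature distribution with a live one.
reference = [0.9, 0.1]
live = [0.5, 0.5]
print(kl_divergence(reference, live))
print(kl_divergence(live, reference))
```

Because the two directions give different values, a drift monitor should fix which distribution plays the role of the reference and keep that choice consistent.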

L

L0 norm: The count of nonzero elements in a vector or tensor, used in pruning to express sparsity or a target budget for remaining weights.
label quality drift: Degradation in the reliability or consistency of labels over time, even when the underlying input distribution remains stable.
label shift: A distribution shift where label frequencies change while the feature patterns conditioned on labels remain stable.
LAPACK: Linear Algebra Package that extends BLAS with higher-level linear algebra operations including matrix decompositions, eigenvalue problems, and linear system solutions, providing essential mathematical foundations for machine learning computations.
latency: The time delay between a request for data and the delivery of that data, critical in real-time applications where immediate responses are required.
latency constraints: Real-time requirements that limit the maximum acceptable delay for model inference, driving optimization decisions in deployment scenarios where response time is critical.
layer normalization: A normalization technique that normalizes inputs across the features dimension for each sample, commonly used in transformer architectures to stabilize training.
layerwise quantization: A quantization granularity where all parameters within a single layer share the same quantization parameters, providing computational efficiency but potentially limiting representational precision compared to finer-grained approaches.
learning rate: A hyperparameter that determines the step size for weight updates during gradient descent optimization, critically affecting training stability and convergence speed.
learning rate scheduling: The systematic adjustment of learning rates during training, using strategies like step decay, exponential decay, or cosine annealing to improve convergence and final model performance.
lifecycle coherence: The principle that all stages of ML development should align with overall system objectives, maintaining consistency in data handling, model architecture, and evaluation criteria.
linear scaling failure: The phenomenon where increasing computing resources (for example, doubling accelerators) results in sub-linear performance gains due to communication overhead and synchronization costs.
LINPACK: Benchmark developed at Argonne National Laboratory that measures system performance by solving dense systems of linear equations, famous for its use in Top500 supercomputer rankings.
Little’s Law: A queuing relationship stating that average concurrency equals arrival rate times average time in the system, linking throughput, latency, and capacity.
Llama: Large Language Model Meta AI, a family of open foundation models that popularized efficient architectural choices like RMSNorm and SwiGLU, serving as a modern reference point for large-scale transformer efficiency.
load balancing: Distribution of incoming requests across multiple server instances to prevent bottlenecks, improve response times, and ensure high availability.
locality-sensitive hashing (LSH): A hashing family that maps similar items to the same buckets with high probability, enabling scalable approximate similarity search.
logits: The raw, unnormalized scores output by the last layer of a neural network before the activation function (like Softmax) converts them into probabilities.
loss function: A mathematical function that quantifies the difference between neural network predictions and true labels, providing the optimization objective for training algorithms.
loss scaling: A technique used in mixed-precision training that multiplies the loss by a large factor before backpropagation to prevent gradient underflow in reduced precision formats.
lottery ticket hypothesis: The theory that large neural networks contain sparse subnetworks that, when trained in isolation from their original initialization, can achieve comparable accuracy to the full network while being significantly smaller.
low-rank factorization: A matrix decomposition technique that approximates large weight matrices as products of smaller matrices, reducing the number of parameters and computational operations required for neural network layers.
LSTM: Long Short-Term Memory, a type of recurrent neural network architecture designed to handle long-term dependencies through gating mechanisms that control information flow.
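
The relationship in the Little’s Law entry above lends itself to a quick capacity estimate. The sketch below assumes a steady-state serving system; the function names, the notion of "slots per replica", and the example numbers are hypothetical.

```python
import math


def average_concurrency(arrival_rate_per_s, avg_time_in_system_s):
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return arrival_rate_per_s * avg_time_in_system_s


def replicas_needed(arrival_rate_per_s, avg_time_in_system_s, slots_per_replica):
    """Smallest number of replicas whose combined slots cover the concurrency."""
    in_flight = average_concurrency(arrival_rate_per_s, avg_time_in_system_s)
    return math.ceil(in_flight / slots_per_replica)


# Usage: 200 requests per second, 50 ms per request, 4 concurrent slots per replica.
print(average_concurrency(200, 0.05))
print(replicas_needed(200, 0.05, slots_per_replica=4))
```

This gives an average-case figure only; provisioning for bursts or tail latency would need headroom beyond what the law itself states.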

M

machine learning: A subset of artificial intelligence that enables systems to automatically improve performance on tasks through experience and data rather than explicit programming.
machine learning framework: A software platform that provides tools and abstractions for designing, training, and deploying machine learning models, bridging user applications with infrastructure through computational graphs, hardware optimization, and workflow orchestration. Examples include TensorFlow, PyTorch, and JAX.
machine learning lifecycle: A structured, iterative process that encompasses all stages involved in developing, deploying, and maintaining machine learning systems, from problem definition through ongoing monitoring and improvement.
machine learning operations (MLOps): The engineering discipline and set of tools focused on operationalizing machine learning models through automation, monitoring, and management of the entire ML pipeline from development to production.
machine learning systems engineering: The engineering discipline focused on building reliable, efficient, and scalable AI systems across computational platforms, spanning the entire AI lifecycle from data acquisition through deployment and operations with emphasis on resource-awareness and system-level optimization.
macro benchmarks: Evaluation methodology that assesses complete machine learning models to understand how architectural choices and component interactions affect overall system behavior and performance.
magnitude-based pruning: The most common pruning method that removes parameters with the smallest absolute values, based on the assumption that weights with smaller magnitudes contribute less to the model’s output.
masking: An anonymization technique that alters or obfuscates sensitive values so they cannot be directly traced back to the original data subject.
membership inference attack: A privacy attack that tries to determine whether a specific record was included in a model’s training data.
memory bandwidth: Rate at which data can be read from or written to memory, measured in bytes per second, which often becomes a bottleneck in memory-intensive machine learning workloads.
memory hierarchy: The organization of memory systems with different access speeds and capacities, from fast on-chip caches to slower off-chip main memory.
memory wall: The widening performance gap between processor speed and memory bandwidth, where computational capacity outpaces the rate at which data can be delivered to the processor, becoming a primary bottleneck for large ML models.
metadata: Descriptive information about datasets that includes details about data collection, quality metrics, validation status, and other contextual information essential for data management.
micro-batch: A smaller slice of an effective batch processed sequentially for gradient accumulation or pipeline parallelism to reduce peak memory.
micro benchmarks: Specialized evaluation tools that assess individual components or specific operations within machine learning systems, such as tensor operations or neural network layers.
microcontroller: A small computer on a single integrated circuit containing a processor core, memory, and programmable input/output peripherals, commonly used in embedded systems.
MinHash: A sketching method that approximates set similarity with compact signatures, commonly used for scalable near-duplicate detection.
mini-batch gradient descent: A training approach that computes gradients and updates weights using a small subset of training examples simultaneously, balancing computational efficiency with gradient estimation quality.
mini-batch processing: An optimization approach that computes gradients over small batches of examples, balancing the hardware efficiency of batched computation against the memory cost of processing the full dataset at once.
mixed-precision computing: A technique that uses different numerical precisions at various stages of computation, such as FP16 for matrix multiplications and FP32 for accumulations.
mixed-precision training: A training methodology that combines different numerical precisions (typically FP16 and FP32) to optimize memory usage and computational speed while maintaining training stability.
ML node: A single production ML application treated as an operational unit, including data pipelines, feature computation, model training, serving infrastructure, and monitoring.
ml systems: Integrated computing systems comprising three core components: data that guides algorithmic behavior, learning algorithms that extract patterns from data, and computing infrastructure that enables both training and inference processes.
ml systems spectrum: The range of machine learning system deployments from cloud-based systems with abundant resources to tiny embedded devices with severe constraints, each requiring different optimization strategies and trade-offs.
ML Test Score: A production-readiness rubric for ML systems that evaluates data, model, infrastructure, and monitoring tests.
MLCommons: Organization that develops and maintains industry-standard benchmarks for machine learning systems, including the MLPerf suite for training and inference evaluation.
MLPerf: Industry-standard benchmark suite that provides standardized tests for training and inference across various deep learning workloads, enabling fair comparisons of machine learning systems.
MLPerf execution scenarios: Standard MLPerf inference traffic scenarios that model sequential, synchronized, server, batch, and interactive workloads with different latency and throughput requirements.
MLPerf Inference: Benchmark framework that evaluates machine learning inference performance across different deployment environments, from cloud data centers to mobile devices and embedded systems.
MLPerf Mobile: Specialized benchmark that extends MLPerf evaluation to smartphones and mobile devices, measuring latency and responsiveness under strict power and memory constraints.
MLPerf Power: An MLPerf methodology for measuring power and energy efficiency of ML workloads alongside performance.
MLPerf Tiny: Benchmark designed for embedded and ultra-low-power AI systems such as IoT devices, wearables, and microcontrollers operating with minimal processing capabilities.
MLPerf Training: Standardized benchmark that evaluates machine learning training performance by measuring time-to-accuracy, throughput, and resource utilization across different hardware platforms.
mobile machine learning: The execution of machine learning models directly on portable, battery-powered devices like smartphones and tablets, enabling personalized and responsive applications.
model cards: A standardized format for documenting machine learning models, capturing information essential for responsible deployment, including intended use, performance factors, and ethical considerations.
model-centric ai: A research paradigm where the dataset is treated as fixed and engineering effort focuses on optimizing model architecture.
model compression: Techniques used to reduce the size and computational requirements of machine learning models while preserving accuracy, enabling deployment on resource-constrained devices.
model deployment: The process of integrating trained machine learning models into production systems where they can make predictions on new data and provide value to end users.
model drift: The degradation of machine learning model performance over time due to changes in data patterns, user behavior, or environmental conditions that differ from the original training conditions.
model evaluation: The systematic assessment of machine learning model performance using various metrics and validation techniques to determine whether the model meets requirements and is ready for deployment.
model FLOPs utilization (MFU): The ratio of achieved floating-point operations per second to the hardware’s theoretical peak, measuring how effectively a training or inference workload uses the available compute. Often treated as the headline efficiency metric for large-scale training.
model optimization: The systematic refinement of machine learning models to enhance their efficiency while maintaining effectiveness, balancing trade-offs between accuracy, computational cost, memory usage, latency, and energy efficiency.
model parallelism: A distributed training strategy that splits a neural network model across multiple devices, with each device responsible for computing a portion of the network.
model quantization: The process of reducing the precision of numerical representations in machine learning models, typically from 32-bit floating point to 8-bit integers, to decrease model size and increase inference speed.
model registry: Centralized repository for storing, versioning, and managing trained machine learning models with associated metadata, facilitating model governance and deployment.
model serving: Infrastructure and systems that expose deployed machine learning models through APIs to handle prediction requests at scale with appropriate latency and throughput.
model training: The process of using machine learning algorithms to learn patterns from training data, adjusting model parameters to minimize prediction errors and create a functional predictive system.
model validation: The process of testing machine learning models on independent datasets to assess their generalization ability and ensure they perform reliably on unseen data.
model versioning: The systematic tracking and management of different versions of machine learning models, including their parameters, training data, and performance metrics, to enable comparison and rollback capabilities.
momentum: An optimization technique that accumulates a velocity vector across iterations to help gradient descent navigate through local minima and accelerate convergence in consistent gradient directions.
monitoring: The continuous observation and measurement of machine learning system performance, data quality, and operational metrics in production to detect issues and trigger maintenance actions.
multi-head attention: An attention mechanism that uses multiple parallel attention heads, each focusing on different aspects of the input to capture diverse types of relationships simultaneously.
multilayer perceptron (MLP): A feedforward neural network with one or more hidden layers between input and output layers, capable of learning nonlinear mappings through dense connections and activation functions.
multiply-accumulate (MAC): The operation of multiplying two values and accumulating the result into a running sum, dominating linear layers and accelerator datapaths.
mycin: One of the first large-scale expert systems, developed at Stanford in the 1970s to diagnose blood infections, representing the shift toward capturing human expert knowledge in specific domains rather than pursuing general artificial intelligence.
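
To make the MinHash entry in this section concrete, the following Python sketch builds signatures for two token sets and estimates their Jaccard similarity from the fraction of matching signature positions. The helper names (`minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b hashes in place of random permutations, and the default of 128 hash functions are illustrative assumptions, not a reference implementation.

```python
import hashlib


def _seeded_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed (assumed scheme)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes: int = 128):
    """Return one minimum hash value per seeded hash function."""
    items = set(tokens)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, t) for t in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox leaps over the lazy dog".split()
similarity = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated Jaccard similarity: {similarity:.2f}")
```

The estimate is approximate by design: more hash functions reduce its variance at the cost of a larger signature.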

N

N:M sparsity: A structured sparsity pattern where exactly N of every M values are nonzero, giving hardware more regular zero patterns to exploit.
neural architecture search: An automated approach that uses machine learning algorithms to discover optimal neural network architectures by searching through possible combinations of layers, connections, and hyperparameters for specific constraints.
neural network: A computational model consisting of interconnected nodes organized in layers that can learn to map inputs to outputs through adjustable connection weights.
neural processing unit (npu): A specialized processor designed specifically for accelerating neural network operations and machine learning computations, optimized for parallel processing of AI workloads.
neuromorphic computing: A computing approach that mimics the structure and function of biological neural networks, potentially offering more energy-efficient processing for AI applications.
NoSQL: A category of database systems designed to handle large volumes of unstructured or semi-structured data with flexible schemas, often used in big data applications.
numerical precision optimization: The dimension of model optimization that addresses how numerical values are represented and processed, including quantization techniques that map high-precision values to lower-bit representations.
NVLink: NVIDIA’s high-bandwidth GPU interconnect for intra-node communication, useful for gradient synchronization and model-parallel activation transfers.
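
The N:M sparsity entry in this section can be illustrated with a small magnitude-based pruning sketch. The function name `prune_n_m` and the rule of keeping the largest-magnitude values in each group are assumptions chosen for illustration; real toolchains may select the surviving values differently.

```python
def prune_n_m(values, n: int = 2, m: int = 4):
    """Keep the n largest-magnitude entries in every group of m; zero the rest."""
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.6]
print(prune_n_m(weights))  # 2:4 pattern: two survivors per group of four
```

Because every group of four holds the same number of zeros, hardware can store the survivors compactly alongside small position indices.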

O

observability: Comprehensive monitoring approach that provides insight into system behavior through metrics, logs, and traces, enabling understanding of internal states from external outputs.
olap (online analytical processing): A database approach optimized for complex analytical queries across large datasets, typically used in data warehouses for business intelligence.
oltp (online transaction processing): A database approach optimized for frequent, short transactions and real-time processing, commonly used in operational applications.
on-chip memory: Fast memory integrated directly onto the processor chip, including caches and scratchpad memory, providing high bandwidth and low latency data access.
on-device learning: The capability for machine learning models to adapt and learn directly on edge devices without requiring data transmission to external servers.
one-hot encoding: A representation where categorical labels become vectors with a single 1 and remaining 0s.
one-shot pruning: A pruning strategy where a large fraction of parameters is removed in a single step, typically followed by fine-tuning to recover accuracy, offering simplicity but potentially requiring more aggressive fine-tuning.
online inference: Real-time prediction serving that processes individual requests with low latency, suitable for interactive applications requiring immediate responses.
ONNX: Open Neural Network Exchange, a standardized format for representing machine learning models that enables interoperability between different frameworks, allowing models trained in one framework to be deployed using another.
ONNX Runtime: Cross-platform inference engine that optimizes machine learning models through techniques like operator fusion and kernel tuning to improve inference speed and reduce computational overhead.
optimizer: An algorithm that adjusts model parameters during training to minimize the loss function, with common examples including SGD (Stochastic Gradient Descent), Adam, and RMSprop, each with different strategies for parameter updates.
optimizer state: Auxiliary values maintained by an optimizer in addition to model parameters, such as Adam’s momentum and variance tensors, often dominating training memory for large models.
orchestration: Coordination and management of complex workflows and distributed computing tasks, often using platforms like Kubernetes for container management.
outlier detection: The process of identifying data points that significantly deviate from normal patterns, which may represent errors, anomalies, or valuable rare events.
overfitting: A phenomenon where a model learns specific details of training data so well that it fails to generalize to new, unseen examples, typically indicated by high training accuracy but poor validation performance.
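
As a rough illustration of the optimizer state entry in this section, the sketch below estimates the memory held by weights, gradients, and Adam's two moment tensors. It assumes every tensor is stored in FP32 (4 bytes per value) and ignores activations and any mixed-precision master copies; the function name and the one-billion-parameter model are hypothetical.

```python
def adam_training_memory_bytes(num_params: int, bytes_per_value: int = 4):
    """Estimate static training memory under an all-FP32 assumption."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    # Adam keeps two extra tensors per parameter: first and second moments.
    optimizer_state = 2 * num_params * bytes_per_value
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_state": optimizer_state,
        "total": weights + gradients + optimizer_state,
    }


estimate = adam_training_memory_bytes(1_000_000_000)  # hypothetical 1B-parameter model
for name, size in estimate.items():
    print(f"{name}: {size / 1e9:.0f} GB")
```

Under these assumptions the optimizer state alone is twice the size of the weights, which is why it is often the largest static term in the training memory budget.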

P

padding: A technique in convolutional networks that adds zeros or other values around the input borders to control the spatial dimensions of the output feature maps.
PagedAttention: A KV-cache memory-management technique that stores attention context in fixed-size pages rather than contiguous per-request blocks to reduce fragmentation.
paradigm shift: A fundamental change in scientific approach, like the shift from symbolic reasoning to statistical learning in AI during the 1990s, and from shallow to deep learning in the 2010s, requiring researchers to abandon established methods for radically different approaches.
parallelism: The simultaneous execution of multiple computational tasks or operations, fundamental to achieving high performance in neural network processing.
parameter: A learnable component of a neural network, including weights and biases, that gets adjusted during training to minimize the loss function.
Pareto frontier: The set of configurations where improving one objective requires worsening another, used to reason about accuracy, latency, memory, and energy trade-offs.
partitioning: A database technique that divides large datasets into smaller, manageable segments based on specific criteria to improve query performance and system scalability.
perceptron: The fundamental building block of neural networks, consisting of weighted inputs, a bias term, and an activation function that produces a single output.
performance insights: Analytical observations derived from monitoring production machine learning systems that reveal opportunities for improvement in model accuracy, system efficiency, or user experience.
performance-per-data (ppd): A metric measuring the accuracy gain per training sample, used to evaluate the quality of a dataset or selection strategy. High PPD indicates a dataset rich in information and low in redundancy.
pinned memory: Page-locked host memory that enables direct, asynchronous transfers between CPU and GPU, improving data-loading throughput at the cost of reserving system memory.
pipeline jungle: Anti-pattern where complex, interdependent data processing pipelines become difficult to maintain, debug, and modify, leading to technical debt and operational complexity.
pipeline parallelism: A form of model parallelism where different layers of a model are placed on different devices and data flows through them in a pipeline fashion, allowing multiple batches to be processed simultaneously.
point-in-time correctness: The guarantee that training examples use only feature values that would have been available at the historical prediction time, preventing leakage.
pooling: A downsampling operation in convolutional networks that reduces spatial dimensions while retaining important features, commonly using max or average operations over local regions.
population stability index (PSI): A binned distribution-shift metric that compares current feature distributions against a baseline to detect population drift.
positional encoding: A method used in transformer architectures to inject information about the position of tokens in a sequence, since transformers lack inherent sequential processing.
post-training quantization: A quantization approach applied to already-trained models without modifying the training process, typically involving calibration on representative data to determine optimal quantization parameters.
power usage effectiveness (pue): Metric used in data centers to measure energy efficiency, calculated as the ratio of total facility power consumption to IT equipment power consumption.
power wall: The technological barrier reached around 2005 where increasing processor frequency no longer yielded performance gains without unsustainable increases in power density and heat generation, forcing a shift to parallel and specialized architectures.
precision: In numerical computing, the number of bits used to represent numbers, affecting both computational accuracy and resource requirements in machine learning systems.
prefetching: A system optimization technique that loads data into memory before it is needed, overlapping data loading with computation to reduce idle time and improve training throughput.
problem definition: The initial stage of machine learning development that involves clearly specifying objectives, constraints, success metrics, and operational requirements to guide all subsequent development decisions.
processing element (PE): A repeated accelerator building block containing compute units and local storage, with arrays of PEs providing parallel tensor execution.
programmable logic controller (PLC): An industrial control computer used in manufacturing and IoT environments that can be integrated with ML models for automated decision-making in operational technology contexts.
protein folding problem: The scientific challenge of predicting the three-dimensional structure of proteins from their amino acid sequences, a problem that puzzled scientists for decades until systems like AlphaFold achieved breakthrough accuracy using deep learning approaches.
proxy variable: A feature that indirectly carries signal about a protected or sensitive attribute even when that attribute is removed.
pseudonymization: A privacy technique that replaces direct identifiers with artificial identifiers while maintaining the ability to trace records for analysis purposes.
PyTorch: A deep learning framework developed by Facebook’s AI Research lab that emphasizes dynamic computational graphs, eager execution, and intuitive Python integration, particularly popular for research and experimentation.
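
The population stability index (PSI) entry in this section corresponds to the standard formula \(\sum_i (a_i - e_i)\ln(a_i / e_i)\), where \(e_i\) and \(a_i\) are the baseline and current fractions in bin \(i\). The sketch below is a minimal version under stated assumptions: the function name, the pre-binned counts, and the small `eps` floor used to avoid division by zero are all illustrative choices.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """PSI between a baseline and a current histogram over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each fraction so empty bins do not produce log(0) or division by zero.
        e_frac = max(e / expected_total, eps)
        a_frac = max(a / actual_total, eps)
        psi += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return psi


baseline = [100, 300, 400, 200]  # hypothetical training-time bin counts
current = [50, 200, 400, 350]    # hypothetical production bin counts
print(f"PSI = {population_stability_index(baseline, current):.3f}")
```

Every term in the sum is non-negative, so the index is zero only when the two binned distributions match and grows as they diverge.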

Q

quantization: A model compression technique that reduces the precision of model parameters and activations from higher precision formats (like 32-bit floats) to lower precision (like 8-bit integers), significantly reducing memory usage and computational requirements.
quantization-aware training: A training approach where quantization effects are simulated during the training process, allowing the model to adapt to reduced precision and typically achieving better accuracy than post-training quantization.
quantization granularity: The level at which quantization parameters are applied, ranging from per-tensor (coarsest) to per-channel or per-group (finer), with finer granularity typically preserving more accuracy but requiring more storage.
queries per second (qps): Performance metric that measures how many inference requests a system can process in one second, commonly used to evaluate throughput in production deployments.
query key value: The three components of attention mechanisms where queries determine what to look for, keys represent what is available, and values contain the actual information to be weighted and combined.
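
The query key value entry in this section can be sketched for a single query in plain Python. The function names and the scaling of scores by the square root of the query dimension follow the common scaled dot-product formulation, which is assumed here rather than taken from this glossary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Two identical keys match the query equally, so the output averages the values.
print(attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

When one key matches the query far better than the others, its weight approaches 1 and the output approaches that key's value.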

R

real-time processing: The processing of data as it becomes available, with guaranteed response times that meet strict timing constraints for immediate decision-making.
receptive field: The region of the input that influences a particular neuron’s output, determining the spatial extent of patterns that can be detected by that neuron.
recurrent neural network: A type of neural network designed for sequential data processing, featuring connections that create loops allowing information to persist across time steps.
red ai: AI research and development that prioritizes maximizing accuracy or performance without regard for the increasing computational and environmental costs required.
regularization: Techniques used to prevent overfitting in neural networks by adding constraints or penalties, including methods like dropout, weight decay, and data augmentation.
ReLU: Rectified Linear Unit activation function defined as \(f(x) = \max(0, x)\) that introduces nonlinearity while maintaining computational efficiency and mitigating vanishing gradient problems.
residual connection: A skip connection that adds the input of a layer to its output, enabling the training of very deep networks by mitigating the vanishing gradient problem.
ResNet: Residual Network, a deep convolutional architecture that introduced skip connections, enabling the training of networks with hundreds of layers and achieving breakthrough performance.
ResNet-50: A specific 50-layer variant of the Residual Network architecture that serves as the canonical Lighthouse Model for compute-bound workloads, balancing depth and computational cost for vision benchmarks.
retinal fundus photographs: Medical images of the interior surface of the eye, including the retina, optic disc, and blood vessels, commonly used for diagnosing eye diseases and training medical AI systems.
reverse-mode differentiation: An automatic differentiation technique that computes gradients by traversing the computational graph in reverse order, highly efficient for functions with many inputs and few outputs, making it ideal for neural network training.
ridge point: The arithmetic intensity at which a workload transitions from memory-bound to compute-bound on a given hardware platform, calculated as peak FLOP/s divided by memory bandwidth. It is the inflection point on the Roofline Model diagram.
Ring AllReduce: A bandwidth-efficient AllReduce algorithm where devices pass chunks around a ring and accumulate results for distributed gradient synchronization.
RMSProp: An adaptive learning rate optimization algorithm that maintains a moving average of squared gradients to automatically adjust learning rates for each parameter during training.
role-based access control (RBAC): An access-control model that grants permissions according to user, service, or team roles rather than ad hoc individual grants.
rollback: Process of reverting to a previous stable version of a model or system when issues are detected in production, ensuring service continuity.
roofline analysis: A performance modeling technique that plots attainable performance against operational intensity, bounded by peak compute throughput and memory bandwidth, to identify whether a workload is memory-bound or compute-bound, guiding optimization efforts.
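
The ridge point and roofline analysis entries in this section reduce to a short calculation. The sketch below uses hypothetical hardware numbers (100 TFLOP/s peak compute and 2 TB/s memory bandwidth) and illustrative function names; it is meant only to show the arithmetic.

```python
def ridge_point(peak_flops: float, memory_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofline limits meet."""
    return peak_flops / memory_bandwidth


def attainable_flops(intensity: float, peak_flops: float, memory_bandwidth: float) -> float:
    """Roofline bound: limited by bandwidth below the ridge, by compute above it."""
    return min(peak_flops, intensity * memory_bandwidth)


def classify(intensity: float, peak_flops: float, memory_bandwidth: float) -> str:
    """Label a workload by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, memory_bandwidth):
        return "memory-bound"
    return "compute-bound"


PEAK = 100e12      # hypothetical peak: 100 TFLOP/s
BANDWIDTH = 2e12   # hypothetical bandwidth: 2 TB/s

print(ridge_point(PEAK, BANDWIDTH))           # 50.0 FLOP/byte
print(classify(10.0, PEAK, BANDWIDTH))        # memory-bound
print(classify(200.0, PEAK, BANDWIDTH))       # compute-bound
```

A kernel at 10 FLOP/byte on this hypothetical device could reach at most 20 TFLOP/s, so raising its arithmetic intensity matters more than adding compute.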

S

scalability: The ability of machine learning systems to handle increasing amounts of data, users, or computational demands without significant degradation in performance or user experience.
scale and zero-point: Quantization parameters that map real-valued tensors to integer values and back, where scale controls step size and zero-point maps real zero into the quantized range.
schema: The structure and format definition of data that specifies data types, field names, and relationships, essential for data validation and processing consistency.
schema evolution: The process of modifying data schemas over time while maintaining backward compatibility and ensuring continued functionality of dependent systems and applications.
schema-on-read: An approach used in data lakes where data structure is defined and enforced at the time of reading rather than when storing, providing flexibility for diverse data types.
scratchpad memory: Fast software-managed local memory used by accelerators to stage predictable data movement under compiler or runtime control.
segmentation maps: Detailed annotations that classify objects at the pixel level, providing the most granular labeling information but requiring significantly more storage and processing resources.
self-attention: An attention mechanism where queries, keys, and values all come from the same sequence, allowing each position to attend to all positions including itself.
semi-supervised learning: A machine learning approach that uses both labeled and unlabeled data for training, using structural assumptions to improve model performance with limited labels.
sensitivity: The proportion of actual positive cases correctly identified by a classification model, also known as true positive rate or recall; critical in medical applications where missing positive cases has severe consequences.
serverless: Cloud computing model where infrastructure is automatically managed by the provider, allowing code execution without server management concerns.
service level agreement (sla): Formal contract specifying minimum performance standards and uptime guarantees for production services, with penalties for noncompliance.
service level objective (slo): Internal targets for service reliability and performance metrics such as latency, error rates, and availability that guide operational decisions.
shadow deployment: A deployment strategy where a new model runs in parallel with the production model, making predictions that are logged but not served to users, enabling validation without user impact.
shallow learning: Machine learning approaches that use algorithms with limited complexity, such as support vector machines and decision trees, which rely on carefully engineered features and cannot automatically discover hierarchical representations as deep learning methods do.
SHAP values: Feature-attribution scores based on Shapley values that estimate each feature’s contribution to a model prediction.
sigmoid: An activation function that maps input values to a range between 0 and 1, historically popular but prone to vanishing gradient problems in deep networks.
silent bias: Model unfairness that produces valid-looking but discriminatory outputs, evading traditional error monitoring and requiring disaggregated evaluation to detect.
silent degradation: The gradual decline in ML system performance that occurs without triggering errors, exceptions, or alerts. Unlike traditional software that crashes observably, ML systems can continue operating while producing increasingly inaccurate predictions.
silent failure: A system failure mode where an ML model continues to produce plausible-looking outputs that are gradually less accurate or contextually relevant without triggering conventional error alerts.
SIMD (Single Instruction, Multiple Data): A parallel computing architecture that applies the same operation to multiple data elements simultaneously, effective for regular data-parallel computations.
SIMT (Single Instruction, Multiple Thread): An extension of SIMD that enables parallel execution across multiple independent threads, each maintaining its own state and program counter.
singular value decomposition: A matrix factorization technique that decomposes a matrix into the product of three matrices, commonly used in low-rank approximations to compress neural network layers by retaining only the most significant singular values.
skip connection: A direct connection that bypasses one or more layers, allowing gradients to flow more easily through deep networks and enabling better training of very deep architectures.
soft targets: Teacher-model probability distributions used as supervision in knowledge distillation, carrying class-similarity information beyond hard labels.
softmax: An activation function that converts logits into a probability distribution where outputs sum to 1, used in multi-class classification.
sparse Tensor Core: A Tensor Core variant that accelerates supported structured sparsity patterns such as 2:4 sparsity by skipping constrained zero values.
sparsity: The property of neural networks where many weights are zero or near-zero, which can be exploited for computational efficiency through specialized hardware support and algorithms designed for sparse operations.
SPEC CPU: Standardized benchmark suite developed by the Standard Performance Evaluation Corporation (SPEC) that measures processor performance using real-world applications rather than synthetic tests.
spec power: Benchmark methodology that measures server energy efficiency across varying workload levels, enabling direct comparisons of power-performance trade-offs in computing systems.
special function unit (SFU): Dedicated hardware for operations such as exponentials, roots, activations, and reductions that do not map cleanly to matrix multiply units.
specificity: The proportion of actual negative cases correctly identified by a classification model, also known as true negative rate; important for avoiding overwhelming referral systems with false positives.
speculative decoding: An optimization technique for autoregressive language models where a smaller model generates draft tokens that are then verified by a larger model, accelerating inference while maintaining quality.
speed of light (latency): The physical speed limit of information transmission (approx. 200,000 km/s in fiber), creating an irreducible lower bound on network latency that necessitates edge computing for applications requiring sub-10 ms response times over long distances.
state dict: A framework serialization mapping from module or optimizer names to tensors, used to save, load, and checkpoint models and optimizer state.
static graph: A computational graph that is defined completely before execution begins, enabling comprehensive optimization and efficient deployment but requiring all operations to be specified upfront, limiting runtime flexibility.
static quantization: A quantization approach where quantization parameters are determined once during calibration and remain fixed during inference, providing computational efficiency but less adaptability than dynamic approaches.
statistical learning: The era of machine learning that emerged in the 1990s, shifting focus from rule-based symbolic AI to algorithms that could learn patterns from data, laying the groundwork for modern data-driven approaches to artificial intelligence.
stochastic gradient descent: A variant of gradient descent that estimates gradients using individual training examples or small batches rather than the entire dataset, reducing memory requirements and enabling online learning.
Straight-Through Estimator (STE): A gradient approximation that lets training pass gradients through nondifferentiable operations such as rounding, often used in quantization-aware training.
stream ingestion: A data processing pattern that handles data in real-time as it arrives, essential for applications requiring immediate processing and low-latency responses.
stream processing: Real-time data processing approach that handles continuous flows of data as it arrives, enabling immediate responses to events and pattern detection.
stride: The step size by which a convolutional filter moves across the input, controlling the spatial dimensions of the output and the degree of overlap between filter applications.
structured pruning: A pruning approach that removes entire computational units such as neurons, channels, or layers, producing smaller dense models that are more hardware-friendly than the sparse matrices created by unstructured pruning.
supervised learning: A machine learning approach where models learn from labeled training examples to make predictions on new, unlabeled data.
symbolic ai: An approach to artificial intelligence that uses high-level symbolic representations of problems, logic, and search algorithms, dominant before the deep learning revolution and characterized by expert systems and rule-based reasoning.
synthetic benchmark: Artificial test program designed to measure specific aspects of system performance, as opposed to benchmarks based on real-world applications and workloads.
synthetic data: Artificially generated data created using algorithms, simulations, or generative models to supplement real-world datasets, addressing limitations in data availability or privacy concerns.
system entropy: The tendency of ML systems to degrade over time as the world changes (drift) or as hidden dependencies accumulate (technical debt), requiring active energy (ops) to maintain order.
system-on-chip (SoC): An integrated circuit that incorporates most or all components of a computer or electronic system, including CPU, GPU, memory, and specialized processors on a single chip. Commonly used in mobile devices and embedded systems for space and power efficiency.
systems co-design: Joint design of algorithms, software, and hardware so model structure and execution fit physical capabilities and constraints.
systems integration: The process of combining various components and subsystems into a unified, functional system that operates efficiently and reliably as a whole.
systems thinking: An approach to understanding complex systems by considering how individual components interact and affect the whole system, particularly important in ML where data, algorithms, hardware, and deployment environments must work together effectively.
systolic array: A specialized hardware architecture that efficiently performs matrix operations by streaming data through a grid of processing elements, minimizing data movement and energy consumption.
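
The scale and zero-point entry in this section can be illustrated with a small affine quantization sketch for unsigned 8-bit integers. The function names, the min/max calibration rule, and the example range of -1.0 to 3.0 are assumptions for illustration; production frameworks offer several calibration strategies.

```python
def choose_qparams(x_min: float, x_max: float, qmin: int = 0, qmax: int = 255):
    """Derive scale and zero-point from an observed real-valued range."""
    # Widen the range to include zero so real 0.0 maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        raise ValueError("range must have nonzero width")
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x: float, scale: float, zero_point: int, qmin: int = 0, qmax: int = 255) -> int:
    """Map a real value to the nearest integer level, clamped to the valid range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer level back to an approximate real value."""
    return (q - zero_point) * scale


scale, zero_point = choose_qparams(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```

Here the scale sets the step between adjacent integer levels, and the zero-point is the integer that represents real zero, so zero survives the round trip exactly.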

T

tail latency: Response times at the slow end of a system's latency distribution, typically measured as 95th or 99th percentile latency, important for understanding system reliability under peak load conditions.
tailored inference benchmarks: Specialized performance tests designed for specific deployment environments or use cases, accounting for unique constraints and optimization requirements.
tanh: Hyperbolic tangent activation function that maps inputs to (-1, 1), providing zero-centered outputs.
technical debt: Long-term maintenance cost accumulated from expedient design decisions during development, particularly problematic in ML systems due to data dependencies and model complexity.
telemetry: Automated collection and transmission of performance data and metrics from distributed systems, enabling remote monitoring and analysis.
tensor: A multi-dimensional array used to represent data in neural networks, generalizing scalars (0D), vectors (1D), and matrices (2D) to higher dimensions.
Tensor Core: A specialized GPU matrix unit for accelerated mixed-precision multiply-accumulate operations on tensor tiles.
tensor decomposition: The extension of matrix factorization to higher-order tensors, used to compress neural network layers by representing weight tensors as combinations of smaller tensors with fewer parameters.
tensor parallelism: A model parallelism strategy where individual tensor operations (like matrix multiplication) are split across multiple devices, reducing memory per device and latency for large layers.
tensor processing unit (TPU): Google’s custom application-specific integrated circuit designed specifically for machine learning workloads, optimized for matrix operations and featuring systolic array architecture.
TensorFlow: A comprehensive machine learning framework developed by Google that provides tools for the entire ML pipeline from research to production, featuring both eager execution and graph-based computation with extensive ecosystem support.
TensorRT: NVIDIA’s inference optimization library that applies techniques like operator fusion and precision reduction to accelerate deep learning inference on GPU hardware.
ternarization: An extreme quantization technique that constrains weights to three values (typically -1, 0, +1), providing significant compression while maintaining more representational capacity than binary quantization.
thermal design power (TDP): The sustained power and heat level a device or cooling system is designed to dissipate under typical maximum workload.
thermal throttling: A protective mechanism in mobile and embedded devices that reduces processor clock speed and performance to prevent overheating, often limiting the sustained performance of on-device ML inference.
throughput: The rate at which a system can process data or complete operations, typically measured in operations per second and crucial for training large models.
time per output token (TPOT): The per-token latency during autoregressive decode after the first token, reflecting decode throughput and memory-bandwidth limits.
time-to-accuracy: The wall-clock time required to train a model to a specified validation accuracy, the ultimate metric for training system performance.
time-to-first-token (TTFT): The time from request arrival to the first generated token in streaming LLM serving, including prefill, scheduling, and initial response latency.
TinyML: A field focused on deploying machine learning models on ultra-constrained embedded devices such as microcontrollers and sensors, operating in the milliwatt to sub-watt power range, with severe limitations on memory, power, and computational capacity. Also called tiny machine learning.
TOPS: Tera Operations Per Second, an operation-rate unit, typically quoted for integer arithmetic, indicating how many trillion operations a system can execute in one second.
total cost of ownership (TCO): A comprehensive financial metric for ML systems encompassing training, inference, and operational costs over the system’s entire lifecycle.
train-serve split: A hybrid architecture pattern where computationally intensive model training occurs on powerful cloud infrastructure, while the trained model is optimized and deployed for inference on resource-constrained edge or mobile devices.
training: The process of adjusting neural network parameters using labeled data and optimization algorithms to minimize prediction errors and improve performance.
training-serving skew: A mismatch between how features or data are computed during model training and how they are computed when serving in production, causing model performance to degrade despite unchanged code. Common causes include different preprocessing pipelines, feature computation timing, or data sources between training and inference.
transfer learning: A machine learning technique that uses knowledge gained from pretrained models on related tasks, allowing faster training and better performance on new tasks with limited data by reusing learned features and representations.
transformer: A neural network architecture based entirely on attention mechanisms, eliminating recurrence and convolution while delivering substantial accuracy gains over RNN/CNN baselines across language, vision, and multimodal tasks.
translation equivariance: A property where shifting the input causes a corresponding shift in the output feature map, distinct from invariance which discards position.
translation invariance: The property of convolutional networks to recognize patterns regardless of their position in the input, achieved through weight sharing and pooling operations.
tucker decomposition: A tensor decomposition method that generalizes singular value decomposition to higher-order tensors using a core tensor and factor matrices, commonly used for compressing convolutional neural network layers.
tv white spaces: Unused broadcasting frequencies that can be repurposed for internet connectivity, as employed by systems like FarmBeats to extend network access to remote agricultural sensors and IoT devices.

U

uniform quantization: A quantization approach where the range of values is divided into evenly spaced intervals, providing simple implementation but potentially suboptimal for nonuniform value distributions.
universal approximation theorem: A theoretical result showing that feedforward neural networks with sufficient width and a suitable nonlinear activation function can approximate any continuous function on a compact domain to arbitrary accuracy.
unreasonable effectiveness of data: The empirical observation that for many problems, adding more data is more effective than improving algorithms, driving the data-centric AI paradigm.
unstructured pruning: A pruning approach that removes individual weights while preserving the overall network architecture, creating sparse weight matrices that require specialized hardware support to realize computational benefits.
unstructured sparsity: A form of model sparsity where individual weights are set to zero without following any particular pattern, creating irregular sparsity patterns that require specialized hardware support to realize computational benefits.

V

validation issues: Problems identified during model testing that indicate poor performance, overfitting, data quality problems, or other issues that must be resolved before deployment.
vanishing gradient problem: A problem in deep neural networks where gradients become exponentially smaller as they propagate backward through layers, making it difficult for early layers to learn effectively.
vector operations: Computational operations that process multiple data elements simultaneously, enabling efficient parallel execution of element-wise transformations in neural networks.
verification gap: The mismatch between finite test-set coverage and the much larger input space an ML system may encounter in production.
versioning: The practice of tracking changes to datasets, models, and pipelines over time, enabling reproducibility, rollback capabilities, and audit trails in ML systems.
virtuous cycle: The self-reinforcing process in deep learning where improvements in data availability, algorithms, and computing power each enable further advances in the other areas, accelerating overall progress.
von neumann bottleneck: The performance limitation caused by the shared bus between processor and memory in traditional computer architectures, where data movement becomes more expensive than computation.

W

warehouse-scale computer (WSC): A data center operated as a single programmable computer, treating compute, storage, networking, and failures as system-level resources.
warp: A group of threads (typically 32 on NVIDIA GPUs) that execute the same instruction in lock-step; the fundamental unit of scheduling and execution on GPUs.
waymo: A subsidiary of Alphabet Inc. that represents a leading deployment of machine learning systems in autonomous vehicle technology, demonstrating how ML systems can span from embedded systems to cloud infrastructure in safety-critical environments.
weak supervision: An approach that uses lower-quality labels obtained more efficiently through heuristics, distant supervision, or programmatic methods rather than manual expert annotation.
web scraping: An automated technique for extracting data from websites to build custom datasets, requiring careful consideration of legal, ethical, and technical constraints.
weight: A learnable parameter that determines the strength of connection between neurons in different layers, adjusted during training to minimize the loss function.
weight matrix: An organized collection of weights connecting one layer to another in a neural network, enabling efficient computation through matrix operations.
weight sharing: The practice of using the same parameters across different spatial locations, as in convolutional networks, reducing the number of parameters while maintaining pattern detection capabilities.
whetstone: Early benchmark developed at the UK National Physical Laboratory by Curnow and Wichmann, first published as an ALGOL 60 program in 1972 and later as the canonical FORTRAN version in 1976. It measured floating-point arithmetic performance in MWIPS (millions of Whetstone instructions per second), becoming one of the first widely adopted standardized performance tests.
workflow orchestration: Automated coordination and management of complex ML pipeline sequences, ensuring proper execution order, dependency management, and error handling across distributed systems.

X

XLA: Accelerated Linear Algebra, a domain-specific compiler for linear algebra operations that optimizes TensorFlow and JAX computations by generating efficient code for various hardware platforms including CPUs, GPUs, and TPUs.