Acceleration Fundamentals

Hardware Acceleration

Purpose

Why does moving data cost more than computing it?

The central surprise of modern computing is that arithmetic is nearly free while memory access is expensive. In the time it takes to fetch a single value from main memory, a processor could perform thousands of calculations. This inversion, the “memory wall,” is not an engineering limitation awaiting a fix; it is a physical consequence of the speed of light and the energy cost of moving electrons across silicon. It explains why specialized accelerators exist: they are not merely faster at math but architected specifically to hide, amortize, and minimize the cost of moving data, using deep memory hierarchies, massive parallelism, and specialized data paths to do so. Concretely, hardware acceleration is how many large AI workloads sustain growth when general-purpose processor scaling alone is insufficient. It explains why some optimizations that reduce theoretical computation fail to improve actual runtime: if the operation was already memory bound, computing less changes nothing because the bottleneck was never computation. It also explains why hardware selection cannot be reduced to comparing peak FLOP/s: what matters is whether a workload’s data movement patterns align with what the hardware was actually designed to accelerate. For the engineer choosing hardware, the binding question is therefore not which chip is fastest, but which chip’s memory system best matches the model’s access patterns. A model with large embedding tables and irregular lookups needs a very different accelerator than one performing dense matrix multiplications over compact weight tensors. Getting this match right is the difference between running at a fraction of theoretical peak and approaching the hardware’s practical ceiling. In D·A·M terms, the accelerator is not a generic math engine; it is the machine constraint made concrete, dictating which algorithms survive in production.

Learning Objectives

Explain hardware acceleration as machine-axis specialization for tensor workloads, data reuse, and performance per watt
Calculate arithmetic intensity and roofline ceilings to classify kernels as compute bound or memory bound
Diagnose memory-wall bottlenecks using bandwidth, cache hierarchy, host-device transfer, and energy-movement costs
Compare Tensor Cores, systolic arrays, SIMD/SIMT units, and sparse execution for ML compute primitives
Select dataflow, tiling, and mapping strategies that maximize reuse under memory-capacity constraints
Analyze compiler and runtime optimizations that fuse kernels, plan memory, and schedule accelerators
Evaluate accelerator choices across throughput, latency, power, cost, and deployment-context constraints

Hardware acceleration turns on the machine axis.

Reducing parameters, precision, or operations only matters when the machine can execute the resulting representation efficiently. Data selection reduced the data term, and compression reduced the algorithm’s work; hardware acceleration asks what the machine can actually deliver. The answer starts with the memory wall: arithmetic is cheap, but moving data is expensive. In the time it takes a single value to arrive from main memory, a modern accelerator can complete on the order of a thousand floating-point operations. Specialized hardware matters because it raises compute throughput while organizing memory, dataflow, and parallelism so those arithmetic units stay fed.

Definition 1.1: Hardware acceleration

Hardware Acceleration is the practice of replacing general-purpose processor logic with domain-specific silicon optimized for the regular tensor operations of ML workloads, trading programmability for the compute density $(R_{\text{peak}})$ and performance-per-watt gains that data-parallel matrix multiplication can exploit.

Significance: The throughput gain is orders of magnitude. An A100 GPU delivers 312 TFLOP/s for FP16/BF16 matrix multiplication, while a server-class CPU delivers roughly 1–2 TFLOP/s for the same operation, a 156–312× gap achieved by spending a budget of roughly 54 billion transistors on parallel arithmetic units rather than on branch predictors, out-of-order schedulers, and large caches (NVIDIA Corporation 2020; Choquette et al. 2021).
Distinction: Unlike a general-purpose CPU, which is optimized to minimize latency for any single instruction in an arbitrary serial program, an accelerator is optimized to maximize throughput for a specific operation class—meaning it achieves its gains only when the workload presents enough parallel work to keep all arithmetic units busy simultaneously.
Common pitfall: A frequent misconception is that an accelerator’s advertised peak throughput is the throughput a workload receives. Delivered performance is the lower of the compute ceiling and what memory bandwidth can feed, the roofline constraint: a low-arithmetic-intensity kernel can sit at a small fraction of peak FLOP/s no matter how fast the silicon is rated, because it starves for data rather than for arithmetic.
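The roofline constraint in this pitfall can be sketched in a few lines. The 312 TFLOP/s ceiling is the A100 figure quoted above; the 2.0 TB/s bandwidth and the two example intensities are illustrative assumptions rather than measured values, and the function name is hypothetical.

```python
def attainable_tflops(intensity_flop_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tb_per_s: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the bandwidth-fed rate."""
    # TB/s multiplied by FLOP/byte gives TFLOP/s.
    memory_fed_tflops = bandwidth_tb_per_s * intensity_flop_per_byte
    return min(peak_tflops, memory_fed_tflops)

PEAK_TFLOPS = 312.0     # A100 FP16/BF16 figure from the text
BANDWIDTH_TB_S = 2.0    # assumed memory bandwidth, for illustration only

low = attainable_tflops(2.0, PEAK_TFLOPS, BANDWIDTH_TB_S)     # low-intensity kernel
high = attainable_tflops(500.0, PEAK_TFLOPS, BANDWIDTH_TB_S)  # high-intensity kernel
print(f"low intensity:  {low:.1f} TFLOP/s ({low / PEAK_TFLOPS:.1%} of peak)")
print(f"high intensity: {high:.1f} TFLOP/s ({high / PEAK_TFLOPS:.1%} of peak)")
```

Under these assumptions the low-intensity kernel is capped by bandwidth at a few TFLOP/s regardless of the rated peak, while the high-intensity kernel reaches the compute ceiling.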

The preceding definition frames the chapter’s central engineering trade-off. General-purpose processors devote substantial silicon area to branch prediction, speculative execution, and complex cache coherence protocols. Accelerators strip away that generality, filling the die with arithmetic units tuned to the regular, data-parallel patterns that characterize neural network computation. The result is order-of-magnitude improvements in throughput per watt for the workloads that match these patterns.

Hardware alone, however, cannot achieve these gains. The algorithms must be designed to exploit what the hardware offers, and the hardware must be built to accelerate the operations algorithms actually use. This symbiosis motivates a complementary principle: hardware-software co-design.

Definition 1.2: Hardware-software co-design

Hardware-Software Co-design is the ML accelerator development methodology that intentionally violates traditional hardware-software abstraction layers, allowing algorithm constraints to inform silicon design and hardware capabilities to directly shape algorithm formulation.

Significance: Co-design unlocks gains unavailable to either layer acting alone. INT8 quantization can deliver multi-fold throughput improvement not because 8-bit arithmetic is faster in the abstract, but because modern tensor-core datapaths pack lower-precision operations more densely than FP32 operations; the algorithm change pays off only when the hardware was co-designed to exploit it (NVIDIA Corporation 2020; Dally et al. 2021; Dally 2023).
Distinction: Unlike layered abstraction (where software calls a hardware API without knowing the silicon details), co-design exposes hardware constraints directly to algorithm and compiler authors: data alignment requirements, precision formats, and memory access patterns all become visible inputs to global cross-layer optimization.
Common pitfall: A frequent misconception is that co-design is a one-time hardware design choice. In practice, co-design is a continuous feedback loop: NVIDIA Tensor Cores were designed for FP16 matrix multiply, then upgraded to support TF32 and INT8 after observing that ML workloads demanded them, then extended again to sparse 2:4 patterns after algorithmic pruning research demonstrated structured sparsity was trainable (NVIDIA 2017; NVIDIA Corporation 2020).
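As a rough illustration of the 2:4 pattern mentioned above, the sketch below keeps the two largest-magnitude values in every group of four weights and zeroes the rest. The function name `prune_2_of_4` and the sample weights are hypothetical; this shows the pattern only, not how any particular library or Tensor Core implements it.

```python
def prune_2_of_4(weights: list[float]) -> list[float]:
    """Zero the two smallest-magnitude values in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(row))
```

The regularity is the point: every group of four holds exactly two nonzeros, so the location of surviving values stays predictable in a way that arbitrary (unstructured) sparsity does not.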

Co-design explains why the compression techniques introduced in Model Compression deliver real speedups. The quantization techniques in Quantization and Precision show why converting FP32 to INT8 yields 2–4$\times$ acceleration: not because of fewer bits in the abstract, but because accelerators pack roughly 4$\times$ more low-precision operations into the same silicon and move fewer bytes per value (NVIDIA Corporation 2020). Structured pruning improves performance while unstructured pruning often does not, because structured patterns preserve the regular memory access patterns that hardware can optimize. The analysis now follows the path from workload to silicon: compute primitives, memory systems, roofline diagnosis, mapping and dataflow, then compiler and runtime support. The recurring question is why some promising algorithmic optimizations survive contact with hardware while others remain paper savings.

Theorem 1.1: The fundamental limit of acceleration (Amdahl's Law)

Hardware acceleration does not speed up the entire system; it only speeds up the parallelizable fraction ($p$). This is governed by Amdahl’s Law for AI (Amdahl 1967), formalized in equation 1: \[ \text{Speedup} = \frac{1}{(1 - p) + \frac{p}{G_{\text{accel}}}} \tag{1}\]

Parallel fraction ($p$): The matrix multiplications (typically 90–99 percent of an ML workload).
Accelerator gain ($G_{\text{accel}}$): The raw speed advantage of the GPU or Tensor Processing Unit (TPU) over the CPU for the accelerated portion of the workload.
Serial fraction ($1-p$): Data loading, Python overhead, and kernel launch latency.

Pitfall: Serial work caps total accelerator speedup. If data loading takes 10 percent of the time ($p=0.9$), even an infinite speed accelerator ($G_{\text{accel}}=\infty$) can only achieve a 10$\times$ total speedup. The serial component dominates the parallel accelerator component once the latter is sufficiently fast.
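The ceiling in this pitfall follows from equation 1 by letting the accelerator gain grow without bound; the worked value uses the same $p = 0.9$: \[ \lim_{G_{\text{accel}} \to \infty} \frac{1}{(1 - p) + \frac{p}{G_{\text{accel}}}} = \frac{1}{1 - p}, \qquad p = 0.9 \;\Rightarrow\; \frac{1}{1 - 0.9} = 10\times \]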

Amdahl, Gene M. 1967. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” Proceedings of the April 18–20, 1967, Spring Joint Computer Conference (AFIPS ’67 Spring), 483–85. https://doi.org/10.1145/1465482.1465560.

¹ Amdahl’s Law: Maps directly onto the iron law’s additive terms. Even if hardware drives the computation term $(O/(R_{\text{peak}} \cdot \eta_{\text{hw}}))$ to near zero, total time is still bounded below by the serial data-loading $(D_{\text{vol}}/\text{BW})$ and fixed-latency $(L_{\text{lat}})$ terms, which acceleration cannot touch. This is why large improvements in raw accelerator throughput can produce much smaller end-to-end task speedups when data loading, launch overhead, or preprocessing remains serial.

Hardware acceleration targets specific terms in the iron law of ML systems (Iron Law of ML Systems), which decomposes end-to-end time into data volume $(D_{\text{vol}}/\text{BW})$, computation $(O/(R_{\text{peak}} \cdot \eta_{\text{hw}}))$, and fixed latency $(L_{\text{lat}})$. While data selection reduced the total data and model compression reduced $O$ per sample, hardware acceleration increases the rate at which those operations execute by improving $R_{\text{peak}}$, $\eta_{\text{hw}}$, and $\text{BW}$. Physics of Computing supplies the analytical performance models that diagnose which of these terms dominates a given workload, including the dimensional analysis that confirms each iron law term resolves to seconds. Yet acceleration has a hard ceiling, established by Amdahl’s Law¹.
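A minimal sketch of the additive decomposition described above, assuming the three terms simply sum. The function name and every workload number below are hypothetical and chosen only to show how the terms interact when one of them shrinks.

```python
def iron_law_seconds(data_bytes, bandwidth_bytes_per_s,
                     ops, peak_ops_per_s, hw_efficiency,
                     fixed_latency_s):
    """Return the three additive iron-law terms and their total, in seconds."""
    data_term = data_bytes / bandwidth_bytes_per_s         # D_vol / BW
    compute_term = ops / (peak_ops_per_s * hw_efficiency)  # O / (R_peak * eta_hw)
    total = data_term + compute_term + fixed_latency_s     # + L_lat
    return data_term, compute_term, fixed_latency_s, total

# Hypothetical workload: 4 GB moved, 2 TFLOP of work, 1 ms fixed latency.
base = iron_law_seconds(4e9, 2e12, 2e12, 312e12, 0.5, 1e-3)
# Same workload with 10x the peak compute rate and nothing else changed.
faster = iron_law_seconds(4e9, 2e12, 2e12, 3120e12, 0.5, 1e-3)

print(f"baseline total:     {base[3] * 1e3:.2f} ms")
print(f"10x compute total:  {faster[3] * 1e3:.2f} ms")
print(f"end-to-end speedup: {base[3] / faster[3]:.2f}x")
```

With these assumed numbers, a tenfold increase in peak compute yields well under a fourfold end-to-end gain, because the data-movement and fixed-latency terms are untouched.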

Amdahl’s Law is not merely theoretical: it explains why many GPU upgrades disappoint in practice. The following heatmap (figure 1) visualizes the acceleration wall, the diminishing returns from faster hardware when serial bottlenecks persist. Unless a workload is highly parallelizable ($p > 0.99$), investing in faster hardware yields diminishing returns. The contour values are illustrative ranges for intuition.

Figure 1: **The Iron Law Heatmap**: Total system speedup as a function of accelerator gain ($G_{\text{accel}}$) and parallel fraction ($p$). High speedup appears only near the top-right corner, where both accelerator gain and parallel fraction are high. The acceleration wall is the low-$p$ region: once more than about 10 percent of a workload is serial ($p < 0.9$), total speedup stays below 10$\times$ and increasing hardware speed yields little benefit. Contours span roughly 1$\times$–500$\times$ speedup.

The key intuition to carry into specific hardware architectures is that raw speedups matter only after the serial fraction has been reduced.

The parallel fraction $p$ differs dramatically between workload archetypes running on the same hardware, and at fleet scale these differences determine whether an accelerator investment pays off or stalls at the serial bottleneck.

Checkpoint 1.1: The parallelism gate

Hardware speedups are capped by sequential bottlenecks.

Amdahl’s Reality

Serial bottlenecks: Use Amdahl’s bound to explain why a 1,000$\times$ faster GPU may only speed up training by 5$\times$ when data loading is slow.
Workload variation: Compare the parallelizable fractions of ResNet-50 and MobileNet, then predict which benefits more from accelerator throughput.

The serial bottleneck becomes concrete on real hardware, where the same accelerator that nears its parallel ceiling on one workload can stall on another. Numbers to Know collects the reference $R_{\text{peak}}$ figures across accelerator generations and the latency hierarchy that ground these hardware comparisons in order-of-magnitude terms.

Lighthouse 1.1: Amdahl's Law on H100

ResNet-50 inference on NVIDIA H100:

H100 delivers $G_{\text{accel}}$ = 247× speedup over the baseline CPU assumption for matrix multiply (1979 TOPS INT8 vs. ~8 TOPS on baseline CPU without AMX extensions) (Choquette 2023)
In this worked example, inference has $p$ = 0.95 (95 percent parallelizable, 5 percent serial: data loading, preprocessing, postprocessing) \[ \text{Speedup} = \frac{1}{(1-0.95) + \frac{0.95}{247}} = \frac{1}{0.05 + 0.0038} \approx 18.6\times \]

Despite a 247× hardware advantage, total system speedup is only 18.6×. The 5 percent serial fraction caps practical gains.

Contrast with GPT-2 (autoregressive):

Same H100, but the GPT-2 token-generation scenario uses $p$ = 0.80 (20 percent serial: KV-cache updates, sampling, Python overhead) \[ \text{Speedup} = \frac{1}{(1-0.80) + \frac{0.80}{247}} = \frac{1}{0.20 + 0.0032} \approx 4.9\times \]

The Bandwidth Hog archetype suffers more from serial bottlenecks. Even infinite accelerator speed yields only $1/(1-p)$ = 5× maximum speedup. This is why large language model (LLM) inference optimization focuses on reducing the serial fraction through serving-side techniques such as batching and speculative decoding, where a small draft model proposes tokens for parallel verification, rather than raw hardware speed.
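The two Lighthouse calculations can be reproduced with a short sketch of equation 1. The gain of 247 and the two parallel fractions come from the example; the function name is a hypothetical helper.

```python
def amdahl_speedup(p: float, accel_gain: float) -> float:
    """Equation 1: total speedup for parallel fraction p and accelerator gain."""
    return 1.0 / ((1.0 - p) + p / accel_gain)

G_ACCEL = 247.0  # H100-vs-baseline-CPU gain quoted in the example

for name, p in [("ResNet-50 inference", 0.95), ("GPT-2 generation", 0.80)]:
    achieved = amdahl_speedup(p, G_ACCEL)
    ceiling = 1.0 / (1.0 - p)  # limit as accelerator gain grows without bound
    print(f"{name}: {achieved:.1f}x achieved, {ceiling:.0f}x ceiling")
```

Both workloads already sit close to their serial ceilings at a gain of 247, which is why further hardware speed adds little until the serial fraction itself shrinks.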

² Arithmetic Intensity: The ratio of compute operations performed for each byte of data moved from memory (FLOP/byte). This metric provides the direct, quantitative answer to the text’s central question: workloads with high arithmetic intensity, such as well-tiled convolutions and GEMMs above the hardware ridge point (the intensity at which compute, not memory, becomes the limit), are compute bound and accelerate with more TFLOP/s. Workloads with low intensity, like GPT-2’s attention layers (<10 FLOP/byte), are memory bound, making faster chips irrelevant without more bandwidth.

These examples reveal that hardware optimization turns on whether a workload is limited by compute rate or data movement. That distinction determines which accelerator to choose, which optimizations matter, and whether a 10$\times$ more powerful chip will help. The roofline model provides the analytical framework for making this diagnosis (Williams et al. 2009); the section The roofline model introduces it formally, and section 1.5 applies it to AI workloads. The model plots an operation’s arithmetic intensity², defined as the ratio of floating-point operations to bytes of memory traffic (FLOP/byte), against hardware capabilities, revealing whether performance is capped by compute or bandwidth. A dense matrix multiplication with high arithmetic intensity benefits from more TFLOP/s; a LayerNorm with low arithmetic intensity benefits from more memory bandwidth. High-reuse ResNet-50 convolutions can cross into the compute-bound regime while GPT-2’s attention layers are memory bound, and this distinction is precisely why these architectures require different optimization strategies.
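To make the FLOP/byte definition concrete, the sketch below estimates arithmetic intensity for a dense matrix multiplication and for a simple elementwise operation, then compares each with a ridge point. It assumes ideal reuse (each matrix crosses memory exactly once); the 8 FLOPs per element and both hardware numbers are illustrative assumptions, and the function names are hypothetical.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int) -> float:
    """FLOP/byte for C = A @ B, assuming each matrix crosses memory exactly once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem: int, bytes_per_elem: int) -> float:
    """FLOP/byte for an op that reads one value and writes one value per element."""
    return flops_per_elem / (2 * bytes_per_elem)

# Assumed hardware numbers, for illustration only.
PEAK_FLOPS = 312e12   # FLOP/s
BANDWIDTH = 2.0e12    # bytes/s
ridge = PEAK_FLOPS / BANDWIDTH  # intensity at which the two ceilings meet

for name, ai in [("4096^3 FP16 GEMM", gemm_intensity(4096, 4096, 4096, 2)),
                 ("FP16 elementwise op", elementwise_intensity(8, 2))]:
    bound = "compute bound" if ai >= ridge else "memory bound"
    print(f"{name}: {ai:.1f} FLOP/byte vs ridge {ridge:.0f} -> {bound}")
```

Real kernels rarely achieve the ideal reuse assumed here, so measured intensities are typically lower than this estimate; the classification logic is unchanged.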

The analytical tools are now in place. The chapter builds on them in stages: the historical evolution of domain-specific architectures, from floating-point coprocessors through graphics processors to contemporary AI accelerators; the computational primitives that characterize ML workloads (matrix multiplication, vector operations, and nonlinear activation functions) and how specialized hardware optimizes them through systolic arrays and tensor cores; memory hierarchy design, where data movement energy costs exceeding computation costs by more than 100$\times$ (Horowitz 2014; Sze et al. 2017) make on-chip buffer optimization and high-bandwidth memory interfaces critical; the roofline model that diagnoses whether a given workload is bound by compute or by bandwidth; the mapping and dataflow strategies that turn that diagnosis into an execution plan the silicon runs efficiently; and the software stack, where compiler optimization and runtime system support determine the extent to which theoretical hardware capabilities translate into measurable performance. Throughout, the core analysis stays with single-accelerator and single-node systems; the closing material uses multi-device examples only to show how the same bottleneck diagnoses scale. The history of specialized hardware reveals recurring design patterns that explain why accelerators take their form.

Hardware Specialization

The TPUv1/K80 efficiency shock is the modern AI instance of a recurring hardware pattern: when a workload becomes important and regular enough, general-purpose processors give way to specialized hardware. Machine learning acceleration follows the same trajectory seen in floating-point arithmetic, graphics processing, and digital signal processing. Each era confronted the same constraint introduced in the Purpose section: data movement costs dominate computation costs, and specialization succeeds by minimizing unnecessary data movement.

Modern ML accelerators (DianNao-class neural-network accelerators (Chen et al. 2014), GPUs with tensor cores, Google’s TPUs³, Apple’s Neural Engine) emerged from these established architectural principles. The evolution spans four phases: specialized computing origins, parallel graphics processing, domain-specific architectures, and the emergence of ML-specific hardware. Each phase reveals design principles that remain relevant for understanding and optimizing contemporary AI systems.

Chen, Tianshi, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning.” Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’14), 269–84. https://doi.org/10.1145/2541940.2541967.

³ TPU (Tensor Processing Unit): The first TPU made a deliberately narrow bet, filling the die with a single $256{\times}256$ systolic array for 8-bit matrix multiplication and stripping away the caches, branch predictors, and out-of-order logic on which a general-purpose core spends most of its area (Jouppi et al. 2017). That trade buys extreme compute density on dense matrix multiply at the cost of flexibility: the same chip that excels at neural network inference is poorly suited to irregular or branch-heavy code. The array dimension itself also becomes a design constraint, because layers whose dimensions are not multiples of 256 leave rows and columns of the array idle.
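The idle-array effect noted in the TPU footnote can be estimated with a simple padding model: assume an operand is rounded up to whole $256{\times}256$ tiles and count the fraction of processing elements that hold real data. This is a simplified sketch, not a description of how the TPU compiler actually maps layers; the function name and the example shapes are hypothetical.

```python
import math

def array_utilization(rows: int, cols: int, array_dim: int = 256) -> float:
    """Fraction of processing elements doing useful work when a rows x cols
    operand is padded up to whole array_dim x array_dim tiles."""
    padded_rows = math.ceil(rows / array_dim) * array_dim
    padded_cols = math.ceil(cols / array_dim) * array_dim
    return (rows * cols) / (padded_rows * padded_cols)

for shape in [(256, 256), (512, 768), (300, 300), (100, 100)]:
    print(f"{shape[0]}x{shape[1]}: {array_utilization(*shape):.1%} utilized")
```

Shapes that are exact multiples of 256 fill the array, while a shape only slightly larger than one tile pays for a whole extra row and column of tiles.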

Example 1.1: The TPUv1 vs. K80 efficiency shock

Context: In 2015, Google deployed its first Tensor Processing Unit (TPUv1) and compared it to the dominant GPU of the era, the NVIDIA K80.

Result: The TPUv1 was not just slightly faster; it was 15–30$\times$ faster on inference workloads and achieved 30–80$\times$ better performance-per-watt in Google’s published comparison (Jouppi et al. 2017).

Mechanism: The K80 was a general-purpose processor (good for graphics, physics, diverse math). The TPU was a domain-specific architecture (DSA) built for one thing: 8-bit integer matrix multiplication. It stripped away caches, branch prediction, and out-of-order execution logic to fill the chip with pure arithmetic units (systolic arrays).

Systems lesson: This result marked the end of the “General Purpose” era for AI. It showed that tailoring silicon to the algorithmic primitive (matrix multiplication) yields order-of-magnitude gains that Moore’s Law alone would have needed many years to deliver.

Hardware specialization improves performance by implementing frequent patterns in dedicated circuits, but introduces trade-offs in flexibility, silicon area, and programming complexity. The principles that shaped early floating-point and graphics accelerators now inform AI hardware design.

Specialized computing

Hardware specialization emerges when specific computational patterns become the primary system bottleneck, preventing general-purpose processors from scaling efficiently. Historically, this progression follows three distinct phases: the precision bottleneck (scalar floating-point), the throughput bottleneck (parallel graphics), and the integration bottleneck (memory-compute locality).

The first phase, the precision bottleneck, occurred when scientific and engineering applications required high-precision floating-point math that general-purpose CPUs performed poorly. In the late 1970s, CPUs typically emulated floating-point operations in software, requiring hundreds of cycles for a single multiplication. This scalar inefficiency led to the first major instance of hardware specialization: the mathematics coprocessor.

The Intel 8087 (1980)⁴ addressed this bottleneck by offloading arithmetic-intensive tasks to a dedicated unit. By implementing floating-point logic in hardware rather than software emulation, the 8087 achieved up to 100$\times$ performance gains for scientific workloads (Palmer 1980). This established a core principle: when a specific data type or operation consumes the majority of execution cycles, moving it to specialized silicon provides 10–100$\times$ improvements.

⁴ Intel 8087: The coprocessor implemented floating-point logic directly in silicon, avoiding the CPU’s slow, multi-instruction software emulation for each calculation. This offload strategy was the sole mechanism behind the 100$\times$ performance gain, a result only achievable because scientific workloads spent the vast majority of their cycles on these specific arithmetic operations. The 8087’s success thus provided the canonical proof that specializing hardware for a dominant computational kernel yields performance improvements 10–100$\times$ greater than general-purpose scaling.

As specialized functions like floating-point math proved their value, they followed a recurring pattern of integration. The Intel 486DX (1989) moved the FPU directly onto the CPU die, eliminating the off-chip communication latency and making high-precision math a standard feature rather than an optional accelerator (Patterson and Hennessy 2017). This cycle (specialization to solve a bottleneck, followed by integration into the general-purpose stack) repeats across every era of hardware evolution.

The progression from specialization to integration has shaped modern computing. Each domain (graphics, signal processing, machine learning) introduced specialized architectures that were later absorbed into general-purpose platforms.

Figure 2 traces this recurring cycle of specialization and integration across five eras, each addressing the dominant computational bottleneck of its period: the 1980s floating-point and signal-processing units (Intel 8087, TI TMS32010 DSP), 1990s 3D graphics (NVIDIA GeForce 256), 2000s media and network processing (H.264 codecs, Intel IXP2800), 2010s deep-learning tensor operations (Google TPU v1, NVIDIA Tensor Cores), and 2020s application-specific accelerators (AI engines, wafer-scale ML chips). Capabilities such as real-time translation, recommendations, and on-device inference build directly on principles established in these earlier specialization waves.

Figure 2: **Hardware Specialization Timeline**: Computing architectures progressively incorporate specialized accelerators to address emerging performance bottlenecks, from floating-point units to graphics processors and machine learning accelerators. Each era produced hardware tailored to the dominant computational patterns of its period.

Parallel computing and graphics processing

The principles established through floating-point acceleration provided a blueprint for addressing subsequent computational challenges. As computing applications diversified, new computational patterns emerged that exceeded the capabilities of general-purpose processors, and each domain contributed unique insights to hardware acceleration strategies.

Graphics processing emerged as a primary driver of hardware specialization in the 1990s. Early graphics accelerators focused on specific operations like bitmap transfers and polygon filling. NVIDIA’s GeForce 256 in 1999 represented a milestone in specialized computing. The GeForce 256 implemented hardware-accelerated transform and lighting (T&L), moving these computations from CPU to dedicated silicon. While not yet programmable, these Graphics Processing Units (GPUs) demonstrated how fixed-function parallel architectures could efficiently handle data-parallel workloads such as texture mapping and vertex transformation. The transition to programmable shaders with the GeForce 3 (2001) and unified shader architectures with the GeForce 8 (2006) eventually enabled GPU computing for general-purpose workloads. By 2004, high-end GPUs could process over 100 million polygons per second (Owens et al. 2008).

Lyons, Richard G. 2011. Understanding Digital Signal Processing. 3rd ed. Prentice Hall.

Concurrently, Digital Signal Processing (DSP) processors established parallel data path architectures with specialized multiply-accumulate units and circular buffers optimized for filtering and transform operations. Texas Instruments’ TMS32010 (1983) demonstrated how domain-specific instruction sets could dramatically improve performance for signal processing applications (Lyons 2011).

Network processing introduced additional patterns of specialization. Network processors developed unique architectures to handle packet processing at line rate, incorporating multiple processing cores, specialized packet manipulation units, and tiered memory management systems. Intel’s IXP2800 network processor shows the consequence of one hard constraint: meeting line-rate packet deadlines leaves no slack for cache misses, so the design arranges many parallel cores around tiered on-chip memory to keep data adjacent to compute. That compute-near-memory organization, forced here by packet timing, is the same arrangement ML accelerators later adopt to keep their processing-element grids fed.

Across these domains, a common blueprint emerges: identify the dominant computational patterns, build specialized processing elements and memory hierarchies around them, create tailored programming models, and progressively evolve toward more flexible architectures. This pattern of architectural co-evolution established the foundation for contemporary AI hardware design. DSP innovations in low-power signal processing enabled real-time inference on edge devices, including voice assistants and wearables. Together, these domains informed ML hardware designs and demonstrated that accelerators could be deployed across both cloud and embedded contexts.

A single result proved the GPU’s relevance to AI was not theoretical. AlexNet⁵ (Krizhevsky et al. 2012) won the ImageNet competition by a 10.8-percentage-point margin—on two consumer-grade NVIDIA GTX 580 graphics cards, each with only 3 GB of VRAM. The systems lesson was impossible to ignore: matching a workload’s data parallelism to GPU hardware could yield order-of-magnitude improvements in time-to-train. The era of GPU-centric deep learning had begun.

⁵ AlexNet: Krizhevsky, Sutskever, and Hinton’s 60-million-parameter convolutional neural network (CNN) that won ImageNet 2012 by a 10.8-percentage-point margin on two consumer GTX 580 GPUs with only 3 GB of VRAM each. Because the model exceeded single-GPU memory, Krizhevsky manually partitioned layers across the two cards, choosing which layers communicated across PCIe to minimize the data-transfer bottleneck—an ad-hoc model parallelism that foreshadowed later systematic tensor and pipeline parallelism strategies. Training took five to six days rather than weeks on CPUs, proving that matching a workload’s parallelism to GPU hardware could yield order-of-magnitude reductions in time-to-train.

Emergence of domain-specific architectures

These diverse acceleration patterns converged in a broader architectural shift. The emergence of domain-specific architectures (DSAs)⁶ marks a transition in computer system design, driven by two converging factors: the breakdown of traditional scaling laws (Esmaeilzadeh et al. 2011) and the increasing computational demands of specialized workloads. Moore’s Law⁷ had previously ensured a predictable doubling of transistor density roughly every 18 to 24 months (Moore 1998). Dennard scaling⁸ (Dennard et al. 1974) had permitted frequency increases without corresponding power-density increases; its breakdown removed that path to easy performance gains. Together, these shifts created a performance and efficiency bottleneck in general-purpose computing. As Hennessy and Patterson (2019) noted in the 2017 Turing Lecture, these limitations signaled the onset of a new era in computer architecture centered on domain-specific solutions that optimize hardware for specialized workloads.

⁶ Domain-Specific Architecture (DSA): Silicon optimized for a single application domain, sacrificing general-purpose programmability for efficiency. Google’s TPUv1 achieved 15–30$\times$ better performance and 30–80$\times$ better performance per watt than contemporary CPUs and GPUs on Google’s inference benchmarks by eliminating branch prediction, caches, and out-of-order logic in favor of a systolic array (Jouppi et al. 2017). The trade-off is inflexibility: a DSA that excels at dense matrix multiplication may perform worse than a CPU on irregular workloads like graph traversal, making workload-hardware alignment the central design decision. Hennessy and Patterson’s rule of thumb is that a new architecture must deliver at least 10$\times$ efficiency over the general-purpose alternative to justify the ecosystem cost of adoption (Hennessy and Patterson 2019; Patterson and Hennessy 2017).

Hennessy, John L., and David A. Patterson. 2019. “A New Golden Age for Computer Architecture.” Communications of the ACM 62 (2): 48–60. https://doi.org/10.1145/3282307.

Patterson, David A., and John L. Hennessy. 2017. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann.

⁷ Moore’s Law: The consequence for ML is not just slower hardware improvement but a structurally widening gap: in the illustrative assumptions used by figure 3, model compute demand grows roughly 6.1× per year while accelerator peak supply improves roughly 1.7× per year, widening the demand/supply gap by about 3.5× per year. This divergence, visible in compute-trend analyses from OpenAI and Epoch AI, makes algorithmic efficiency techniques (model compression, quantization, sparsity) structurally necessary rather than optional optimizations (Amodei and Hernandez 2018; Epoch AI 2024).

Moore, G. E. 1998. “Cramming More Components onto Integrated Circuits.” Proceedings of the IEEE 86 (1): 82–85. https://doi.org/10.1109/jproc.1998.658762.

⁸ Dennard Scaling: The 1974 principle that as transistor dimensions shrank, their operating voltage could be lowered to keep power density constant (Dennard et al. 1974). Its breakdown after ~2005 meant that clock speeds could no longer be increased without violating the chip’s thermal design power (TDP) limits, creating the “dark silicon” problem: at advanced nodes, thermal constraints prevent powering more than roughly 30–50 percent of transistors simultaneously (Esmaeilzadeh et al. 2011). This directly forces specialization—only by dedicating powered transistors to narrow workloads (like matrix multiplication) can architects extract useful performance from the available silicon budget.

Dennard, Robert H., Frank H. Gaensslen, Hwa-Nien Yu, Victor L. Rideout, Elias Bassous, and Antoine R. LeBlanc. 1974. “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions.” IEEE J. Solid-State Circuits 9 (5): 256–68. https://doi.org/10.1109/jssc.1974.1050511.

Esmaeilzadeh, Hadi, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. “Dark Silicon and the End of Multicore Scaling.” Proceedings of the 38th Annual International Symposium on Computer Architecture, 365–76. https://doi.org/10.1145/2000064.2000108.

Figure 3: **The Systems Gap**: Relative compute growth (log scale) comparing model demand to hardware supply, normalized to 2012 = 1.0. The gray dotted line (CPU) and blue dashed line (GPU) reflect hardware progress, which lags the exponential red solid line (Model Demand). The purple region is the ‘Systems Gap’ that must be bridged through parallelism and co-design.

The scale of this challenge becomes stark in figure 3, which plots the systems gap: the divergence between what models demand and what hardware naturally provides. In the plotted normalization, GPU supply, often framed as Huang’s Law,⁹ rises about 1.7$\times$ per year while model demand rises about 6$\times$ per year, so the systems gap widens by roughly 3–4$\times$ each year (Amodei and Hernandez 2018; Epoch AI 2024).

⁹ Huang’s Law: The observation that GPU performance for AI workloads historically improved faster than traditional Moore’s Law, a pace achieved through architectural innovations (for example, Tensor Cores) rather than transistor scaling alone. The normalized figure uses a representative GPU-supply curve of about 1.7$\times$ per year and a model-demand curve of about 6$\times$ per year, illustrating a gap that widens by roughly 3–4$\times$ annually unless software and architecture co-design close it (Amodei and Hernandez 2018; Epoch AI 2024; NVIDIA Corporation 2020; Choquette 2023).

Amodei, Dario, and Danny Hernandez. 2018. “AI and Compute.” OpenAI Blog 2.

Epoch AI. 2024. Machine Learning Trends. Epoch AI Research Database.

The plot is normalized to a 2012 baseline to emphasize relative growth. Notice how the purple-shaded region between the curves keeps widening—this gap cannot be closed by waiting for faster chips; it requires architectural innovation.
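The compounding behind that widening region can be sketched in a few lines. The snippet below is an illustration rather than the model behind figure 3: it takes the two representative annual growth rates quoted above (about 6$\times$ for demand, about 1.7$\times$ for GPU supply) and assumes both compound smoothly from a shared baseline of 1.0. The name `systems_gap` is introduced here only for illustration.

```python
# Representative annual growth rates quoted in the text (normalized to 2012 = 1.0).
DEMAND_GROWTH_PER_YEAR = 6.0   # model compute demand
SUPPLY_GROWTH_PER_YEAR = 1.7   # representative GPU supply curve


def systems_gap(years: int,
                demand: float = DEMAND_GROWTH_PER_YEAR,
                supply: float = SUPPLY_GROWTH_PER_YEAR) -> float:
    """Return the demand/supply ratio after `years` of smooth compounding.

    Assumption: both curves start at 1.0 and grow at a constant annual rate.
    """
    return (demand / supply) ** years


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(f"after {years} yr: demand is {systems_gap(years):,.1f}x supply")
```

Because the ratio itself compounds, an annual gap between 3$\times$ and 4$\times$ grows by more than an order of magnitude within three years under these assumptions, which is why waiting for the next chip generation does not close it.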

The technology S-curve: Why we must shift

Every computing paradigm follows a distinct lifecycle of three phases: ferment (initial slow progress), take-off (exponential growth), and saturation (diminishing returns at physical limits). This technology S-curve pattern appears in the two overlapping curves in figure 4: as a general-purpose curve saturates, domain-specific architectures can open a new efficiency curve for workloads with stable computational structure.

Figure 4: **The Twin S-Curves of Specialized Computing**: General-purpose CPUs (gray) enjoyed decades of exponential growth driven by Moore’s Law and Dennard Scaling. As physics constrained this curve around 2010 (Saturation), domain-specific architectures (blue) provided a new efficiency curve. The durable pattern is that large efficiency gains come from specializing hardware for linear algebra, albeit at the cost of general programmability.

The “easy” gains from shrinking transistors are gone. To sustain the exponential growth required by AI models (which are growing 4–10$\times$ faster than Moore’s Law), we cannot wait for the next CPU generation. We must shift to a new curve, one defined not by clock speed but by architecture. To understand how we reached this inflection point, we must first examine the mechanics of the scaling laws that once fueled the general-purpose era.

Historically, improvements in processor performance depended on semiconductor process scaling and increasing clock speeds. As power density limitations restricted further frequency scaling and transistor miniaturization encountered increasing physical and economic constraints, architects explored alternative approaches to sustain computational growth. The result was a shift toward domain-specific architectures, which dedicate silicon resources to optimize computation for specific application domains, trading flexibility for efficiency.

Domain-specific architectures achieve superior performance and energy efficiency when the hardware stops treating the workload as arbitrary code. The first shift is a customized data path: matrix multiplication units in AI accelerators, for example, implement systolic arrays, grid-like networks of processing elements that rhythmically compute and pass data through neighboring units. Once that data path is fixed, the memory hierarchy can be tuned around the reuse pattern the workload actually needs, with cache configurations, prefetching logic, and memory controllers designed for the expected tensor flow.

The same specialization then reduces control overhead. Domain-specific instruction sets encode common operation sequences into single instructions, minimizing decode and dispatch complexity, while fixed-function circuit blocks bypass software interpretation for operations that appear constantly. The result is not one trick but a stack of matching decisions: data movement, memory locality, instruction overhead, and circuit implementation all align around the same computational pattern.
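To make the systolic data path concrete, the following sketch simulates a small array in plain Python. It is a teaching model under stated assumptions, not a description of any particular chip: the design is output-stationary (each processing element keeps one accumulator), operands of `A` flow left to right, operands of `B` flow top to bottom, and inputs are skewed by one cycle per row or column so that matching operands meet at the right element. The name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    A is n x k, B is k x m (lists of lists). PE(i, j) accumulates C[i][j].
    Returns (C, cycles). Illustrative model only.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]   # A operand held in each PE
    b_reg = [[0] * m for _ in range(n)]   # B operand held in each PE
    cycles = n + m + k_dim - 2            # last operands meet at PE(n-1, m-1)

    for t in range(cycles):
        # Pass operands one hop: A moves right, B moves down.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the left and top edges (zero when out of range).
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < k_dim else 0
        for j in range(m):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < k_dim else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, "in", cycles, "cycles")
```

In this model each operand enters the array once at an edge and is then reused by hopping between neighbors, so no processing element issues its own memory request. That neighbor-to-neighbor reuse is the property the customized data path is built around.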

Modern smartphones illustrate these principles compellingly. They can decode high-resolution video within tight power and thermal envelopes even though video processing requires billions of operations per second. This efficiency is achieved through dedicated hardware video codecs¹⁰ that implement industry standards such as H.264/AVC and H.265/HEVC (Sullivan et al. 2012). These specialized circuits can provide order-of-magnitude performance-per-watt gains compared with software decoding on general-purpose processors, with the exact gain depending on codec, resolution, process node, and CPU baseline.

¹⁰ Codec: A portmanteau of “coder-decoder,” reflecting the hardware’s dual function. Encoding (compression) is compute-intensive because it searches for optimal representations, while decoding (decompression) is bandwidth-intensive because it reconstructs full-resolution frames from compressed streams. Dedicated codec silicon implements both paths in fixed-function hardware, so neither path wastes transistors on unrelated general-purpose control logic.

Sullivan, Gary J., Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. “Overview of the High Efficiency Video Coding (HEVC) Standard.” IEEE Transactions on Circuits and Systems for Video Technology 22 (12): 1649–68. https://doi.org/10.1109/tcsvt.2012.2221191.

Shang, Junyang, Gu-Yeon Wang, and Yiran Liu. 2018. “Accelerating Genomic Data Analysis with Domain-Specific Architectures.” IEEE Transactions on Computers 67 (7): 965–78.

¹¹ ASIC (Application-Specific Integrated Circuit): These circuits achieve their extreme efficiency by implementing a single algorithm directly in silicon, often improving performance-per-watt by $10^3\times$ to $10^5\times$. Examples include cryptographic hashing for blockchain mining and sequence alignment for genomics. The trade-off is total inflexibility: if that core algorithm changes, the ASIC cannot be reprogrammed and becomes obsolete, locking the hardware design to the specific problem version it was built to solve.

Bedford Taylor, Michael. 2017. “The Evolution of Bitcoin Hardware.” Computer 50 (9): 58–66. https://doi.org/10.1109/mc.2017.3571056.

Later domains repeat the same response to a bottleneck rather than forming separate anecdotes. Genomics processing benefits from custom accelerators because sequence alignment and variant calling expose stable kernels that specialized silicon can execute with less wasted movement (Shang et al. 2018). Blockchain computation produced application-specific integrated circuits (ASICs)¹¹ for the same reason: cryptographic hashing is fixed enough to justify silicon that trades flexibility for efficiency (Bedford Taylor 2017).

The trajectory yields an engineering rule: the era of “free” performance gains from general-purpose scaling is over. For decades, software engineers could rely on Moore’s Law to accelerate existing code without architectural changes. The breakdown of Dennard scaling forced a decisive change: engineers can no longer wait for faster CPUs to solve computational bottlenecks but must instead design the hardware to fit the algorithm. This necessity of hardware-software co-design is why modern AI engineering requires deep understanding of the underlying silicon. Performance is now determined by how well the algorithm’s memory access patterns and parallelism map to the specialized physical structures of domain-specific architectures.

Machine learning hardware specialization

Machine learning constitutes a computational domain with unique characteristics that have driven the development of specialized hardware architectures. Unlike traditional computing workloads that exhibit irregular memory access patterns and diverse instruction streams, neural networks are characterized by predictable patterns: dense matrix multiplications, regular data flow, and tolerance for reduced precision. These characteristics enable specialized hardware optimizations that would be ineffective for general-purpose computing but provide substantial speedups for ML workloads. The hardware built to exploit these patterns constitutes a class of devices known as ML accelerators, and the economic trigger for specialization appears when those regular neural-network patterns dominate a fleet rather than a benchmark.

War Story 1.1: The TPU capacity cliff

Context: Google had considered an application-specific chip for neural networks as early as 2006 but did not treat it as urgent: existing data-center capacity absorbed the early deep-learning workloads (Jouppi et al. 2017).

Failure mode: In 2013, internal projections changed the calculus. If users adopted voice-search-driven speech recognition for even a few minutes per day, the resulting inference load from deep neural networks would roughly double the number of data centers Google needed to operate. There was no realistic capital plan to absorb that, and conventional CPUs offered no path to close the gap on performance per watt or performance per dollar (Jouppi et al. 2017).

Resolution: Google started a high-priority custom-ASIC effort and, in fifteen months, designed, verified, built, and deployed the first-generation Tensor Processing Unit (TPU) into production data centers in 2015, optimizing for inference latency, cost, and performance per watt rather than general-purpose flexibility (Jouppi et al. 2017).

Systems lesson: Hardware acceleration becomes mandatory when a single workload crosses a fleet-level economic threshold. The decision is not “CPU vs. GPU vs. TPU” in the abstract; it is whether the workload’s arithmetic intensity, latency target, and aggregate volume make general-purpose capacity unaffordable.

Definition 1.3: ML accelerator

Machine Learning Accelerators are domain-specific processors whose silicon is designed primarily for the dense matrix operations and regular data flow of neural networks, achieving high $R_{\text{peak}}$ and memory bandwidth utilization for these workloads by devoting die area to arithmetic units rather than to general-purpose control logic.

Significance: An ML accelerator’s defining feature is not raw arithmetic but a balanced feed of data to that arithmetic. The A100’s 2.04 TB/s of memory bandwidth, roughly a 10$\times$ gap over a server CPU’s 200 GB/s, is what lets its 312 TFLOP/s of FP16/BF16 throughput stay fed rather than starved (NVIDIA Corporation 2020; Choquette et al. 2021). That balance is then specialized by workload: training accelerators size FLOP/s and bandwidth for bidirectional gradient flow and large activation footprints, while inference accelerators trade those for energy efficiency and deterministic single-request latency.
Distinction: The accelerator’s gains are conditional on the data flowing through it being parallel and regular. An ML accelerator processes thousands of independent arithmetic operations at once with predictable memory access, so it is orders of magnitude faster than a CPU on dense matrix multiplication but can be slower on irregular control flow such as tree traversal or dynamic programming, where there is no parallel data stream to feed.
Common pitfall: A frequent misconception is that ML accelerators always accelerate ML. An accelerator only delivers its peak throughput when the workload provides enough parallel work to saturate all arithmetic units simultaneously: a batch-1 autoregressive inference request may use only a small fraction of a large training accelerator’s compute capacity because sequential token generation cannot fill thousands of parallel compute lanes.
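The pitfall can be quantified with a roofline-style bound, a standard performance model that this section does not itself define. The sketch below uses the A100 figures quoted in the definition (312 TFLOP/s and 2.04 TB/s) and makes a deliberately crude assumption: a decode step is dominated by reading each weight once, at two bytes per weight, and performing one multiply-accumulate (two FLOPs) per weight for every request in the batch. Activation and cache traffic are ignored, and all function names are hypothetical.

```python
PEAK_FLOPS = 312e12      # A100 FP16/BF16 throughput in FLOP/s (quoted above)
MEM_BANDWIDTH = 2.04e12  # A100 memory bandwidth in bytes/s (quoted above)


def attainable_flops(intensity: float) -> float:
    """Roofline bound: the lower of peak compute and bandwidth * intensity."""
    return min(PEAK_FLOPS, intensity * MEM_BANDWIDTH)


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOP per byte when each weight is read once and reused across the batch.

    Assumption: weight traffic dominates; activations are ignored.
    """
    return 2.0 * batch_size / bytes_per_weight


def utilization(batch_size: int) -> float:
    """Upper bound on the fraction of peak compute that can be used."""
    return attainable_flops(decode_intensity(batch_size)) / PEAK_FLOPS


if __name__ == "__main__":
    ridge = PEAK_FLOPS / MEM_BANDWIDTH
    print(f"ridge point: {ridge:.0f} FLOP/byte")
    for batch in (1, 16, 256):
        print(f"batch {batch:>3}: at most {utilization(batch):.1%} of peak")
```

Under these assumptions a batch-1 request sits far below the ridge point, so the arithmetic units wait on memory no matter how many of them the chip has. Batching raises the reuse of each fetched weight and is what moves the workload toward the compute-bound region.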

Choquette, Jack, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. “NVIDIA A100 Tensor Core GPU: Performance and Innovation.” IEEE Micro 41 (2): 29–35. https://doi.org/10.1109/mm.2021.3061394.

A DRAM access costs ~100$\times$ a MAC; data movement dominates energy.

Machine learning computational requirements reveal limitations in traditional processors. CPUs reach only 5–10 percent utilization on neural network workloads, delivering approximately 100 GFLOP/s (billions of floating-point operations per second) while consuming hundreds of watts. This inefficiency results from architectural mismatches: CPUs optimize for single-thread performance and irregular memory access, while neural networks require massive parallelism and predictable data streams. The memory bandwidth constraint compounds the problem: a single neural network layer may require accessing gigabytes of parameters, overwhelming CPU cache hierarchies designed for kilobyte-to-megabyte-scale working sets.

The energy economics of data movement influence accelerator design. Accessing data from DRAM can consume on the order of $10^2\times$ more energy than a multiply-accumulate operation (exact values vary by technology node and design), making minimizing data movement a primary optimization target (Horowitz 2014; Sze et al. 2017). This disparity helps explain the progression from repurposed graphics processors to purpose-built neural network accelerators. TPUs and other custom accelerators can sustain high utilization on dense kernels by implementing systolic arrays and other architectures that maximize data reuse while minimizing movement.
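A small energy model shows why data reuse, rather than faster arithmetic, is the lever. The sketch below works in relative units and assumes the order-of-magnitude ratio from the paragraph above (one DRAM access costs about 100 times one multiply-accumulate); real ratios vary by technology node and design. The reuse factor is the number of MACs performed per operand fetched from DRAM, and `data_movement_share` is a name introduced here for illustration.

```python
E_MAC = 1.0     # relative energy of one multiply-accumulate
E_DRAM = 100.0  # relative energy of one DRAM access (order-of-magnitude assumption)


def data_movement_share(reuse: float) -> float:
    """Fraction of total energy spent on DRAM when each fetch feeds `reuse` MACs."""
    if reuse <= 0:
        raise ValueError("reuse must be positive")
    dram_energy_per_mac = E_DRAM / reuse
    return dram_energy_per_mac / (dram_energy_per_mac + E_MAC)


if __name__ == "__main__":
    for reuse in (1, 10, 100, 1000):
        share = data_movement_share(reuse)
        print(f"reuse {reuse:>4}: {share:.0%} of energy goes to DRAM access")
```

With no reuse, nearly all energy goes to moving operands. Only when each fetched value feeds on the order of a hundred MACs does arithmetic account for half of the budget in this model, which is the regime systolic arrays and on-chip buffers are designed to reach.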

IEEE Standards Association. 2019. IEEE 754-2019: Standard for Floating-Point Arithmetic. https://doi.org/10.1109/IEEESTD.2019.8766229.

¹² Latency vs. Throughput in Accelerator Design: Training’s bidirectional data flow and large activation memory footprint favor throughput-oriented designs that use large batches to maximize arithmetic utilization. Inference’s simple forward-pass computation, by contrast, is judged on latency, where single-request response time is the critical metric. This forces a hardware trade-off: a training-optimized architecture built to maximize FLOP/s can introduce pipeline and batching overhead that worsens tail latency for latency-sensitive inference workloads compared with a chip or runtime path optimized for single-request service.

Training and inference present distinct computational profiles that influence accelerator design. Training generally relies on floating-point arithmetic for gradient computation and weight updates: FP32 and FP16 are standardized binary floating-point formats (IEEE Standards Association 2019), while mixed-precision training uses lower-precision tensor operations with higher-precision accumulation when accuracy permits (Micikevicius et al. 2017). Training also requires bidirectional data flow for backpropagation (see Activation memory requirements for activation memory analysis), and large memory capacity for storing activations. Inference can exploit reduced precision (INT8 or INT4), requires only forward computation, and prioritizes latency over throughput¹². These differences drive specialized architectures: training accelerators maximize FLOP/s and memory bandwidth, while inference accelerators optimize for energy efficiency and deterministic latency.
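The effect of numeric format on memory footprint is simple arithmetic, sketched below for the four formats named above. The 7-billion-parameter model size is a hypothetical example chosen for illustration, and the calculation covers weights only: activations, optimizer state, and quantization metadata such as scales are left out.

```python
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params: int, fmt: str) -> int:
    """Bytes needed to store `num_params` weights in the given format."""
    return num_params * BITS_PER_VALUE[fmt] // 8


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size
    for fmt in BITS_PER_VALUE:
        gigabytes = weight_bytes(params, fmt) / 1e9
        print(f"{fmt}: {gigabytes:.1f} GB of weights")
```

Halving the bits per value halves both the capacity needed to hold the weights and the bytes that must cross the memory interface on every pass, which is why reduced precision helps inference even when arithmetic is not the bottleneck.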

Deployment context shapes architectural choices by identifying the binding constraint. In data centers, the constraint is time-to-result for training massive models. An NVIDIA H100 consuming hundreds of watts is economically justified if it reduces a GPT-scale training run from weeks to days, because the cumulative cost of rented accelerator time usually dwarfs the energy bill (Choquette 2023). Google’s TPUv4 makes a similar trade-off, prioritizing raw throughput through massive systolic arrays and high-bandwidth memory (Jouppi et al. 2023), accepting high power consumption because faster iteration reduces both time-to-deploy and total training cost.

At the opposite extreme, edge deployment inverts this priority: the binding constraint is energy per inference, not throughput. A smartphone camera or always-on audio path operating inside a few-watt power budget cannot afford the DRAM-intensive access patterns of a data center accelerator. Instead, edge architectures minimize data movement through local scratchpads, tightly integrated accelerators, dynamic voltage scaling, and event-driven processing when the workload allows it. The systems insight is that the same memory wall principle applies at both extremes: data center chips fight it with bandwidth (terabytes per second of HBM), while edge chips fight it with proximity (keeping data in registers and scratchpads).

The success of application-specific accelerators demonstrates that no single architecture can efficiently address all ML workloads. A massive installed base of edge devices demands architectures optimized for energy efficiency and real-time latency targets, while cloud-scale training continues advancing the boundaries of computational throughput. This diversity drives continued innovation in specialized architectures, each optimized for its specific deployment context and computational requirements. Despite this diversity, all accelerators operate under the same physical constraint: the energy cost of moving data.

Checkpoint 1.2: The accelerator gate

Hardware specialization is driven by energy physics.

The Energy Inversion

Data movement cost: Can you explain why moving data from DRAM costs roughly 100$\times$ more energy than computing on it?
Architectural response: Explain how systolic arrays (TPU) and Tensor Cores (GPU) reduce repeated movement of the same operands.

Selection Logic

Training vs. inference: Why do training chips need massive HBM bandwidth, while inference chips prioritize low latency and INT8 ops?

This historical progression reveals a key pattern: each wave of hardware specialization responded to a specific computational bottleneck. Floating-point coprocessors addressed arithmetic precision; GPUs addressed graphics throughput; AI acceleration targets a qualitatively different constraint, the integration bottleneck examined in section 1.1.6. Table 1 summarizes the key milestones in hardware specialization. The architectural strategies introduced for these earlier specialized workloads (floating-point operations, graphics rendering, media processing) now underpin the design of modern AI accelerators and provide context for understanding how hardware specialization continues to enable scalable, efficient execution of machine learning workloads across diverse deployment environments.

What distinguishes AI acceleration from earlier specialization waves is the scale of integration required. AI accelerators must work seamlessly with frameworks like TensorFlow, PyTorch, and JAX. They require deep compiler support for graph-level transformations, kernel fusion, and memory scheduling. They must also deploy across environments from data centers to mobile devices, each with distinct performance and efficiency requirements. Such system-level transformation requires tight hardware-software coupling, a theme that recurs throughout this chapter.

The integration bottleneck that AI accelerators target shapes every subsequent architectural decision, which is why the next section examines it in detail.

Table 1: Hardware Specialization Trends: Successive computing eras progressively integrate specialized hardware to accelerate prevalent workloads, moving from general-purpose CPUs to domain-specific architectures and ultimately to customizable AI accelerators. Tailoring hardware to computational patterns improves performance and energy efficiency, driving innovation in machine learning systems.

Era	Computational Pattern	Architecture Examples	Characteristics
1980s	Floating-Point & Signal Processing	FPU, DSP	• Single-purpose engines • Focused instruction sets • Coprocessor interfaces
1990s	3D Graphics & Multimedia	GPU, SIMD Units	• Many identical compute units • Regular data patterns • Wide memory interfaces
2000s	Real-time Media Coding	Media Codecs, Network Processors	• Fixed-function pipelines • High throughput processing • Power-performance optimization
2010s	Deep Learning Tensor Operations	TPU, GPU Tensor Cores	• Matrix multiplication units • Massive parallelism • Memory bandwidth optimization
2020s	Application-Specific Acceleration	ML Engines, Smart NICs, Domain Accelerators	• Workload-specific datapaths • Customized memory hierarchies • Application-optimized designs

The integration bottleneck

Machine learning represents a computational domain where the primary performance limit has shifted from arithmetic to integration. While early coprocessors solved the precision bottleneck (8087) and GPUs solved the throughput bottleneck (rasterization), modern AI workloads are constrained by the integration bottleneck: the energy and latency cost of moving massive amounts of data between memory and thousands of parallel compute units.

Three properties of neural networks drive this shift. Their massive parallelism is the first: unlike general-purpose code with complex branching, neural networks execute billions of independent multiply-accumulate operations inside matrix multiplications and convolutions, and this regular structure allows replacing complex CPU control logic with dense arrays of processing elements (systolic arrays). Their data flow is also predictable, mathematically determined by the network’s layers, which enables hardware to “prefetch” data into local scratchpads¹³ and bypass the expensive random-access cache hierarchies of CPUs. Finally, neural networks tolerate reduced precision, remaining robust when selected operations use 8-bit or 4-bit integers instead of 32- or 64-bit floating-point numbers; this flexibility lets architects fit substantially more low-precision compute into the same silicon area and reduce memory traffic per value (Dally et al. 2021; Dally 2023).
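The third property, tolerance for reduced precision, can be illustrated with a minimal quantization sketch. The scheme below is symmetric per-tensor INT8 quantization, one common approach chosen here as an assumption rather than the method of any specific accelerator. The weight and activation values are made up, and the helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """Dot product using integer MACs, rescaled to floating point once at the end."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    integer_sum = sum(w * a for w, a in zip(q_w, q_a))  # integer-only accumulation
    return integer_sum * scale_w * scale_a


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.9]     # made-up example values
    activations = [1.0, 2.0, -1.5, 0.25]  # made-up example values
    exact = sum(w * a for w, a in zip(weights, activations))
    approx = int8_dot(weights, activations)
    print(f"floating point: {exact:.4f}   INT8: {approx:.4f}")
```

The inner loop touches only 8-bit integers, so each operand occupies a quarter of the storage and memory traffic of an FP32 value, and the two scale factors are applied once per dot product rather than once per element.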

¹³ Scratchpad Memory: Because the dataflow for a neural network is mathematically determined, a compiler can schedule the exact data needed into this fast, software-controlled local memory. This bypasses the complex and energy-intensive hardware logic a CPU cache uses to guess at future data needs for unpredictable workloads. For example, Google’s TPU v1 uses a 24 MB software-managed Unified Buffer rather than relying on CPU-style hardware caches for activations, a primary driver of its efficiency on ML workloads (Jouppi et al. 2017).

Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, 1–12. https://doi.org/10.1145/3079856.3080246.

¹⁴ HBM (High Bandwidth Memory): Achieves 2.0–3.4 TB/s bandwidth in current data center accelerators (A100’s HBM2e to H100’s HBM3) through 3D die stacking with thousands of through-silicon vias (TSVs), compared to 760 GB/s for GDDR6X (NVIDIA Corporation 2020; Choquette 2023). This 2.7–4.4$\times$ bandwidth advantage moves memory-bound ML workloads closer to compute-bound operation, which is why high-end data center AI accelerators such as H100, A100, and TPUv4 use HBM (Jouppi et al. 2023). The trade-off is cost: HBM is a dominant cost component in data center AI accelerators, limiting it to applications where the bandwidth-per-dollar justifies the substantial premium over consumer-grade GDDR.

Jouppi, Norm, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, et al. 2023. “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.” Proceedings of the 50th Annual International Symposium on Computer Architecture, 1–14. https://doi.org/10.1145/3579371.3589350.

The primary engineering challenge is no longer maximizing calculation rate but keeping data close to the calculation. In modern accelerators, accessing data from external memory (DRAM) can consume 100$\times$ more energy than the actual arithmetic operation. This disparity is precisely why modern accelerator architectures prioritize high-bandwidth memory (HBM)¹⁴ and large on-chip scratchpads over simply adding more compute units.

To see how accelerators address this integration bottleneck in practice, examine the architectural blueprint in figure 5. Notice how every design decision, from the processing element grid to the multilevel cache hierarchy, targets data movement reduction rather than raw compute multiplication.

Figure 5: **Anatomy of a Modern AI Accelerator**: AI accelerators integrate specialized processing elements containing tensor cores, vector units, and special function units, supported by a hierarchical memory system from high-bandwidth memory down to local caches. This architecture maximizes data reuse and parallel execution while minimizing energy-intensive data movement, which is the foundation for large performance-per-watt improvements over general-purpose processors.

The evolution from the Intel 8087 to the Google TPU reveals a consistent pattern: hardware evolves to fit the algorithm’s dominant bottleneck. Where the 8087 addressed floating-point operations that dominated many scientific workloads, modern AI accelerators address dense matrix and convolution operations that dominate much of neural-network training and inference (Palmer 1980; Goodfellow et al. 2016; Sze et al. 2017; Jouppi et al. 2017). This concentration of demand explains why specialized AI silicon can deliver large performance-per-watt improvements over general-purpose processors on matching workloads.

These same three properties, massive parallelism, predictable data flow, and tolerance for reduced precision, shape every accelerator architecture decision. Before examining the computational primitives that exploit them, we examine the architectural organization that enables their efficient execution. Modern AI accelerators achieve their dramatic performance improvements through a carefully orchestrated hierarchy of specialized components operating in concert.

The processing substrate consists of an array of processing elements (visible as the “PE” grid in figure 5), each containing dedicated computational units optimized for specific operations: tensor cores execute matrix multiplication, vector units perform element-wise operations, and special function units compute activation functions. These processing elements are organized in a grid topology that enables massive parallelism, with dozens to hundreds of units operating simultaneously on different portions of the computation, exploiting the data-level parallelism inherent in neural network workloads.

The memory hierarchy forms an equally critical architectural component. High-bandwidth memory provides the aggregate throughput required to sustain these numerous processing elements, while a multilevel cache hierarchy from shared L2 caches down to per-element L1 caches and scratchpads minimizes the energy cost of data movement. This hierarchical organization embodies a core design principle: in AI accelerators, data movement typically consumes more energy than computation itself, necessitating architectural strategies that prioritize data reuse by maintaining frequently accessed values (including weights and partial results) in proximity to compute units. The machine foundations appendix collects reference specifications for modern accelerators (H100, TPU v5) and summarizes the latency penalties across each memory level.
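The data-reuse principle can be demonstrated by counting off-chip loads in a tiled matrix multiplication. The sketch below is an illustrative model, not a description of a real memory controller: it treats a pair of Python sublists as the scratchpad, charges one DRAM load per element copied into it, and assumes square matrices whose size is a multiple of the tile size. The name `tiled_matmul` is hypothetical.

```python
def tiled_matmul(A, B, tile):
    """Multiply square matrices tile by tile, counting simulated DRAM loads.

    Each tile of A and B is copied into a local "scratchpad" once and then
    reused for tile * tile * tile multiply-accumulates. Returns (C, dram_loads).
    """
    n = len(A)
    if n % tile != 0:
        raise ValueError("matrix size must be a multiple of the tile size")
    C = [[0] * n for _ in range(n)]
    dram_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of each operand in the scratchpad.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                dram_loads += 2 * tile * tile
                # All arithmetic below reads only from the scratchpad.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[i0 + i][j0 + j] += acc
    return C, dram_loads


if __name__ == "__main__":
    n = 8
    A = [[i + j for j in range(n)] for i in range(n)]
    B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
    for tile in (1, 2, 4, 8):
        _, loads = tiled_matmul(A, B, tile)
        print(f"tile {tile}: {loads} DRAM loads")
```

A tile size of 1 corresponds to no reuse, with two loads for every multiply-accumulate. In this model the load count falls in proportion to the tile size while the arithmetic stays identical, which is the trade that larger scratchpads and caches buy.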

The host interface establishes connectivity between the specialized accelerator and the broader computing system, enabling coordination between general-purpose CPUs that manage program control flow and the accelerator that executes computationally intensive neural network operations. This architectural partitioning reflects specialization at the system level: CPUs address control flow, conditional logic, and system coordination, while accelerators focus on the regular, massively parallel arithmetic operations that dominate neural network execution. The data path in figure 5 runs from the host interface through the memory hierarchy and into the processing element grid; that end-to-end integration is what makes the system optimized for AI workloads rather than general computation.

With the accelerator’s physical architecture established, the next step is to explain why these specific components dominate. Tensor cores, vector units, and hierarchical memory do not exist by accident; they exist because neural network computations repeatedly invoke a small set of operations. Understanding these patterns is essential because they explain which algorithmic changes translate to real speedups (those that align with hardware primitives) and which remain purely theoretical.

Self-Check: Question

What recurring structural pattern best explains the specialization path from the Intel 8087 through GPUs to TPUs?
1. A dominant computational bottleneck in each era made general-purpose processors inefficient, prompting a specialized unit that was later absorbed into mainstream silicon as the workload stabilized
2. Each generation became progressively more general-purpose to maximize software portability, so specialization is essentially a transitional phase
3. Clock-frequency scaling drove each transition, with specialization emerging only after the final frequency ceiling was reached
4. Each generation replaced memory hierarchies with larger on-chip arithmetic arrays so data movement stopped constraining performance
Why did domain-specific architectures become structurally necessary (not merely attractive) after Moore’s Law slowed and Dennard scaling ended?
1. Power-density and thermal limits produced dark silicon: architects could no longer power every transistor simultaneously, so dedicating powered transistors to narrow high-value workloads became the only way to keep performance scaling
2. Model compute demand began growing slower than hardware supply, so architects had free transistor budget to devote to specialized units
3. CPUs lost the ability to execute floating-point arithmetic, forcing the workload onto dedicated accelerators
4. Programmers preferred fixed-function hardware because it simplified debugging and deployment pipelines
Explain why machine learning created an “integration bottleneck” rather than merely another arithmetic bottleneck, and why this distinction drives accelerator design choices.
Order the following specialization waves by what each one made architecturally necessary for the next: (1) Domain-specific AI accelerators emerge to exploit ML’s regular dataflow, (2) Floating-point coprocessors establish the pattern of offloading a dominant primitive, (3) Parallel graphics processors prove that thousands of lightweight arithmetic units can be managed coherently, (4) ML-specific units refine DSAs around systolic arrays and mixed precision.
A startup profiles batch-1 inference for a 7-billion-parameter autoregressive model on an A100 and observes 5–10 percent compute utilization. Which diagnosis best matches the section’s analysis?
1. Autoregressive token-by-token generation produces too little parallel work per step to saturate the accelerator’s thousands of arithmetic lanes, and weight reads dominate per-token time
2. The model is too parallel for the hardware, so the scheduler is oversupplying work to the arithmetic units and forcing them to stall
3. The GPU lacks adequate branch-prediction hardware for the control flow of the decoder loop
4. Reduced-precision arithmetic is unavailable during inference, forcing FP64 execution on every kernel


AI Compute Primitives

Regardless of the layer type (fully connected, convolutional, or attention-based), the dominant operation in neural networks is multiplying input values by learned weights and accumulating the results. This multiply-accumulate (MAC) pattern often dominates execution time and can appear billions of times per inference pass. Its regularity is what makes hardware specialization possible: unlike general-purpose code with unpredictable branches and irregular memory access, MACs follow fixed data-flow patterns with predictable reuse, enabling architectures that trade away generality for raw throughput. The transition from CPUs achieving approximately 100 GFLOP/s to accelerators delivering 100,000+ GFLOP/s reflects this architectural bet: eliminating flexibility to optimize for the specific operations that neural networks actually perform.

We call the hardware units that exploit these patterns AI compute primitives: specialized functional blocks, each optimized for a particular class of operation. Three primitives are especially common in accelerators, each targeting a distinct computational pattern found in neural networks.

Listing 1 demonstrates how a dense layer decomposes at the framework level, encapsulating thousands of multiply-accumulate operations in a single high-level call.

Listing 1: Dense Layer Abstraction: High-level framework APIs encapsulate 131,072 MACs (256 inputs times 512 outputs) in a single function call, hiding the computational complexity from developers while enabling automatic hardware optimization.

# Framework abstracts compute-intensive operations
dense = Dense(512)(input_tensor)  # 256 x 512 MACs per sample

This single line of code conceals the computational complexity that accelerators must handle. Listing 2 reveals how the framework expands this high-level call into mathematical operations.

Listing 2: Matrix Operation Expansion: Each dense layer decomposes into matrix multiplication and element-wise operations, exposing the dominant compute pattern that many neural-network kernels are built around.

# Linear transformation: work scales with input_dim x output_dim x batch
output = matmul(input, weights) + bias  # Matrix multiply dominates cost
# Element-wise: work proportional to output_dim x batch
output = activation(output)

The matrix multiplication dominates computation time, but this abstraction still hides the underlying loop structure. At the processor level, listing 3 reveals how nested loops multiply inputs and weights, sum the results, and apply a nonlinear function, exposing the $\mathcal{O}(B \times d_{\text{in}} \times d_{\text{out}})$ complexity that accelerators must handle efficiently.

Listing 3: Processor-Level Execution: Nested loops reveal the $\mathcal{O}(B \times d_{\text{in}} \times d_{\text{out}})$ multiply-accumulate operations that accelerators must execute, about 4.2M MACs for a $B$=32, $d_{\text{in}}$=256, $d_{\text{out}}$=512 configuration.

# Total operations: batch_size × output_size × input_size MACs
for n in range(batch_size):  # Batch dimension: parallelizable
    for m in range(output_size):  # Output neurons: parallelizable
        acc = bias[m]  # Initialize accumulator (avoids shadowing built-in sum)
        for k in range(input_size):  # Reduction dimension: sequential
            acc += input[n, k] * weights[k, m]  # MAC operation
        output[n, m] = activation(acc)  # Nonlinear transformation

This loop structure reveals three distinct computational patterns that recur across all neural network architectures: element-wise operations along vectors (the activation function applied to each output), matrix-level reductions (the weighted sum across all input features), and nonlinear transformations (the activation function itself). Each pattern is frequent enough to justify dedicated silicon, offers orders-of-magnitude speedup when specialized, and has remained stable across decades of neural network evolution, from early perceptrons through transformers. These patterns become hardware blocks: vector units for independent elements, matrix engines for reductions, and special-function units for nonlinear math.
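The operation count behind the nested loops can be checked with a small runnable sketch. It uses plain Python lists rather than a framework, omits the activation, and the function names `dense_layer_macs` and `dense_forward` are illustrative assumptions, not part of any library.

```python
def dense_layer_macs(batch_size: int, input_size: int, output_size: int) -> int:
    """Closed-form MAC count for one dense layer: B * d_in * d_out."""
    return batch_size * input_size * output_size


def dense_forward(inputs, weights, bias):
    """Naive dense layer (no activation) that also counts the MACs it performs."""
    batch_size, input_size, output_size = len(inputs), len(weights), len(bias)
    mac_count = 0
    outputs = []
    for n in range(batch_size):              # Batch dimension
        row = []
        for m in range(output_size):         # Output neurons
            acc = bias[m]                    # Initialize accumulator
            for k in range(input_size):      # Reduction dimension
                acc += inputs[n][k] * weights[k][m]
                mac_count += 1
            row.append(acc)
        outputs.append(row)
    return outputs, mac_count


# Small shapes keep the pure-Python loop fast: B=2, d_in=3, d_out=2
inputs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, -0.5]
outputs, counted = dense_forward(inputs, weights, bias)

# The counted MACs match the closed form, which also gives the
# figure quoted for the B=32, d_in=256, d_out=512 configuration.
full_layer_macs = dense_layer_macs(32, 256, 512)  # 4,194,304
```

The loop count and the closed form agree on the small case, and the same formula yields 4,194,304 MACs for the full configuration, the "4.2M" figure used throughout this section.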

Vector operations

Vector operations provide the first level of hardware acceleration by processing multiple data elements simultaneously. Recall the nested-loop structure exposed in listing 3: a batch of 32 samples through a 256-to-512 dense layer requires about 4.2M multiply-accumulate operations. A traditional scalar processor executes these one at a time, loading an input value and a weight value, multiplying them, and accumulating the result. This sequential approach is far too slow for neural networks that repeat this pattern across millions of parameters.

Vector processing units solve this by operating on multiple data elements simultaneously. RISC-V¹⁵, the fifth generation of reduced instruction set computer (RISC) designs from UC Berkeley (Waterman et al. 2013), provides a useful setting for illustrating this idea. Listing 4 uses vector-style assembly code in which a single instruction processes a vector of data elements in parallel. The loop has five hardware-visible stages:

¹⁵ RISC-V (Reduced Instruction Set Computer V): The open ISA allows hardware teams to add custom ML instructions—vector dot-product, activation functions, sparse tensor ops—without the licensing fees or NDAs required by ARM or x86. The constraint this removes is the 5–10 year wait for proprietary vendors to add ML-specific extensions to their roadmaps. The trade-off is software ecosystem maturity: RISC-V ML accelerators lack the cuDNN/TensorRT equivalents that make GPU programming practical, limiting adoption to edge and embedded inference where the software stack is narrow enough to build from scratch.

Waterman, Andrew, Yunsup Lee, Rimas Avizienis, Henry Cook, David Patterson, and Krste Asanovic. 2013. “The RISC-V Instruction Set.” 2013 IEEE Hot Chips 25 Symposium (HCS), 1–1. https://doi.org/10.1109/hotchips.2013.7478332.

Vector length configuration: Configures the vector units to process 32-bit elements, automatically determining how many operations happen in parallel based on hardware width (VLEN).
Vector initialization: Clears the accumulator vector v0 (containing, for example, eight parallel sums) using an exclusive-OR operation, which is more efficient than a load immediate.
Vector loads: Loads contiguous 32-bit input and weight values from memory into vector registers v1 and v2 in a single instruction, maximizing memory bandwidth utilization.
Fused Multiply-Accumulate: Performs parallel multiply-add operations ($v_0 = v_0 + v_1 \times v_2$). This is the core computational primitive, doubling throughput compared to separate multiply and add instructions.
Pointer arithmetic: Updates memory pointers by the vector byte length to prepare for the next data chunk.

Listing 4: Vectorized Multiply-Accumulate Loop: This illustrative loop shows how vector-style instructions enable efficient batch processing by performing multiple multiply-add operations simultaneously, reducing computational latency in neural network kernels.

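vsetvli t0, a0, e32               # Set vector length for 32-bit elements
loop_batch:                       # Outer-loop bookkeeping omitted for brevity
    loop_neuron:
        vxor.vv v0, v0, v0        # Clear accumulator vector
        loop_feature:
            vle32.v v1, (in_ptr)  # Load eight inputs
            vle32.v v2, (wt_ptr)  # Load eight weights
            vfmacc.vv v0, v1, v2  # v0 += v1 * v2
            addi in_ptr, in_ptr, 32
            addi wt_ptr, wt_ptr, 32
            addi feature_cnt, feature_cnt, -8
            bnez feature_cnt, loop_feature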

The key insight from this assembly sequence is that the fused multiply-accumulate instruction (vfmacc.vv) performs in one instruction what would otherwise take separate multiply and add instructions, while the vector load instructions (vle32.v) amortize memory access overhead across multiple data elements. Assuming eight 32-bit lanes per vector register, as in this example, each load transfers eight values and each vfmacc.vv processes eight input-weight pairs in parallel, reducing both computation time and energy consumption. The instruction count for the multiply-accumulate work falls from 4.2M scalar operations to roughly 524,288 vector operations.
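The instruction-count reduction can be reproduced with a small Python model of the strip-mined loop. This is a sketch under stated assumptions: `LANES = 8` assumes a 256-bit vector register holding eight 32-bit elements, the inner Python loop stands in for lanes that real hardware updates in a single instruction, and the function names are illustrative.

```python
LANES = 8  # assumed: 256-bit vector register holding eight 32-bit elements


def scalar_dot(a, b):
    """Scalar model: one multiply-accumulate per loop iteration."""
    acc, instructions = 0.0, 0
    for x, y in zip(a, b):
        acc += x * y
        instructions += 1
    return acc, instructions


def vector_dot(a, b, lanes=LANES):
    """Strip-mined dot product: each outer iteration models one vfmacc.vv."""
    partial = [0.0] * lanes                  # models accumulator register v0
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]     # models vle32.v v1
        chunk_b = b[start:start + lanes]     # models vle32.v v2
        for lane, (x, y) in enumerate(zip(chunk_a, chunk_b)):
            partial[lane] += x * y           # hardware updates all lanes at once
        instructions += 1
    return sum(partial), instructions        # final horizontal reduction


# One dot product over 256 input features
a = [float(i) for i in range(256)]
b = [1.0] * 256
scalar_result, scalar_instr = scalar_dot(a, b)
vector_result, vector_instr = vector_dot(a, b)

# The full layer needs one such dot product per (sample, output neuron) pair
layer_dots = 32 * 512
scalar_total = scalar_instr * layer_dots
vector_total = vector_instr * layer_dots
```

Each 256-element dot product takes 256 scalar steps but only 32 vector steps, so the layer total drops from 4,194,304 to 524,288 multiply-accumulate instructions. The model counts only the multiply-accumulate instructions, not loads, pointer updates, or branches.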

Key vector operations map directly to common deep learning patterns. Table 2 enumerates how operations such as reduction, gather, scatter, and masked operations appear frequently in pooling, embedding lookups, and attention mechanisms, clarifying the direct mapping between low-level vector hardware and high-level machine learning workloads.

Table 2: Vector Operations: Core vector operations map directly to deep learning primitives: reductions implement pooling layers, gathers enable embedding lookups, scatters update embedding gradients, masked operations handle attention masks, and vector-scalar broadcasts apply biases and scaling. Each operation exploits data-level parallelism to process multiple elements simultaneously, which is why vector units appear in nearly every accelerator design.

Vector Operation	Description	Neural Network Application
Reduction	Combines elements across a vector (for example, sum, max)	Pooling layers, attention score computation
Gather	Loads multiple nonconsecutive memory elements	Embedding lookups, sparse operations
Scatter	Writes to multiple nonconsecutive memory locations	Gradient updates for embeddings
Masked operations	Selectively operates on vector elements	Attention masks, padding handling
Vector-scalar broadcast	Applies scalar to all vector elements	Bias addition, scaling operations
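The semantics of the operations in Table 2 can be written out in a few lines of plain Python. This is a sketch of what each operation computes, not of how hardware executes it; the function names are illustrative assumptions, and a real vector unit would apply each operation across all lanes in parallel.

```python
def reduce_max(vec):
    """Reduction: combine all elements into one value (as in max pooling)."""
    result = vec[0]
    for v in vec[1:]:
        result = max(result, v)
    return result


def gather(table, indices):
    """Gather: read nonconsecutive entries (as in an embedding lookup)."""
    return [table[i] for i in indices]


def scatter_add(table, indices, updates):
    """Scatter: accumulate updates into nonconsecutive entries (embedding gradients)."""
    for i, u in zip(indices, updates):
        table[i] += u
    return table


def masked_fill(vec, mask, fill):
    """Masked operation: keep elements where mask is True, else write `fill`."""
    return [v if keep else fill for v, keep in zip(vec, mask)]


def broadcast_add(vec, scalar):
    """Vector-scalar broadcast: apply one scalar to every element (bias add)."""
    return [v + scalar for v in vec]


pooled = reduce_max([0.2, 1.5, -0.3, 0.9])
rows = gather([10.0, 20.0, 30.0, 40.0], [3, 0, 3])
grads = scatter_add([0.0] * 4, [3, 0, 3], [1.0, 2.0, 0.5])
masked = masked_fill([0.1, 0.7, 0.2], [True, True, False], float("-inf"))
biased = broadcast_add([1.0, 2.0], 0.5)
```

Note that the repeated index 3 in the scatter example accumulates both updates into the same entry, which is the case embedding-gradient updates must handle when one token appears more than once in a batch.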

These efficiency gains extend beyond instruction count reduction. Memory bandwidth utilization improves as vector loads transfer multiple values per operation, and energy efficiency increases because control logic is amortized across many data elements. These improvements compound across the deep layers of modern neural networks, where billions of element-wise operations execute per forward pass. The architectural pattern is not new. The Cray-1¹⁶ pioneered the same approach for scientific computing in 1975 (Jordan 1982), but neural networks have given it unprecedented commercial importance.

¹⁶ Cray-1 Vector Legacy: The Cray-1 (1975) achieved 160 MFLOP/s—1,000$\times$ faster than contemporary computers—by processing 64 elements simultaneously through pipelined vector units, at a cost of $8.8 million ($40–45 million in 2024 dollars). Its architectural template (wide vector registers, pipelined execution, streaming data through arithmetic units) is precisely the design that modern AI accelerators scale to thousands of elements: an H100’s tensor cores are conceptual descendants of Cray’s vector units, operating on matrix tiles rather than vectors.

Jordan, T. L. 1982. “A Guide to Parallel Computation and Some Cray-1 Experiences.” In Parallel Computations. Elsevier. https://doi.org/10.1016/b978-0-12-592101-5.50006-3.

Vector operations excel at element-wise transformations like activation functions, where each output depends only on its corresponding input. Neural networks, however, also require structured computations where each output depends on all inputs—the weighted sums that define layer transformations. These many-to-many operations naturally express themselves as matrix multiplications, our second compute primitive.

Matrix operations

Matrix multiplication dominates neural network computation, transforming high-dimensional data through structured patterns of weights, activations, and gradients (Goodfellow et al. 2016). While vector operations process elements independently, matrix operations orchestrate computations across multiple dimensions simultaneously. These operations reveal patterns that drive hardware acceleration strategies.

Matrix operations in neural networks

Neural network computations decompose into hierarchical matrix operations. Listing 5 captures this hierarchy through a linear layer that transforms input features into output neurons over a batch.

Listing 5: Matrix Operations: Neural networks perform transformations using matrix multiplications and biases to achieve output predictions. Training requires careful management of input batches and activation functions to optimize model performance.

# Layer transforms 256 inputs to 512 outputs
layer = nn.Linear(256, 512)
# Process a batch of 32 samples
output = layer(input_batch)

# Framework internal: core operations (column-batch convention)
# Matrix: transforms [256×32] input to [512×32] output
Z = matmul(weights, input)
# Vector: adds bias to each output independently
Z = Z + bias
# Vector: applies activation to each element independently
output = relu(Z)

This computation demonstrates the scale of matrix operations in neural networks. Each output neuron (512 total) must process all input features (256 total) for every sample in the batch (32 samples). The weight matrix alone contains 256 $\times$ 512 = 131,072 parameters that define these transformations, illustrating why efficient matrix multiplication dominates performance considerations.

Matrix operations appear consistently across modern neural architectures, not only in simple linear layers. Attention mechanisms are built from several matrix multiplications, and convolutions transform into matrix multiplications through the im2col technique¹⁷, enabling efficient execution on matrix-optimized hardware. Listing 6 illustrates these diverse applications.

¹⁷ Im2col (Image-to-Column): Transforms convolution into a matrix multiplication by explicitly duplicating overlapping input regions into the columns of a new, larger matrix. This memory-for-compute trade-off is what enables execution on matrix-optimized hardware. The cost is significant memory amplification: with stride 1, a standard $3{\times}3$ kernel increases the input’s memory footprint by up to roughly 9$\times$ to create the required dense matrix structure.

Listing 6: Matrix Patterns Across Architectures: Linear layers, attention mechanisms, and convolutions all reduce key work to matrix multiplications, making matrix hardware the shared primitive across modern neural architectures.

# Linear Layers - Direct matrix multiply
hidden = matmul(weights, inputs)
# weights: [out_dim x in_dim], inputs: [in_dim x batch]
# Result combines all inputs for each output

# Attention Mechanisms - Multiple matrix operations
Q = matmul(Wq, inputs)
# Project inputs to query space [query_dim x batch]
K = matmul(Wk, inputs)
# Project inputs to key space [key_dim x batch]
attention = matmul(Q.T, K)
# Compare all queries with all keys: one score per pair of
# input columns (tokens); requires query_dim == key_dim

# Convolutions - Matrix multiply after reshaping
patches = im2col(input)
# Convert [H x W x C] image to matrix of patches
output = matmul(kernel, patches)
# Apply kernels to all patches simultaneously
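The convolution-as-matrix-multiply idea can be made concrete with a minimal im2col sketch. It assumes a single-channel image, stride 1, and no padding, and it uses nested Python lists; the helpers `im2col` and `matmul` here are simplified stand-ins written for illustration, not a framework API.

```python
def im2col(image, k=3):
    """Unroll every k x k patch (stride 1, no padding) into one column."""
    height, width = len(image), len(image[0])
    out_h, out_w = height - k + 1, width - k + 1
    columns = []
    for i in range(out_h):
        for j in range(out_w):
            patch = [image[i + di][j + dj] for di in range(k) for dj in range(k)]
            columns.append(patch)
    # Transpose so each patch is a column: shape [k*k, out_h*out_w]
    return [list(row) for row in zip(*columns)]


def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# 6 x 6 single-channel image whose pixel value is its row-major index
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
kernel = [[1.0 / 9.0] * 9]          # one 3 x 3 mean filter, flattened to a row
patches = im2col(image)             # 9 rows x 16 columns
output = matmul(kernel, patches)    # 1 x 16: one value per output position

# Memory amplification: elements in the patch matrix vs. the original image
amplification = (len(patches) * len(patches[0])) / (6 * 6)
```

On this small image the patch matrix holds 144 values against the original 36, a 4$\times$ amplification. The ratio $9(H-2)(W-2)/(HW)$ approaches 9$\times$ as the image grows, because border positions that produce no patch become a negligible fraction of the input.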

Matrix operations hardware acceleration

This pervasive pattern of matrix multiplication has direct implications for hardware design: accelerators need specialized units that can handle these computations at scale. Listing 7 demonstrates a representative dedicated matrix unit that processes an entire $16{\times}16$ block at once, illustrating why matrix instructions and tensor cores can deliver much higher throughput than scalar or vector-only execution paths (NVIDIA 2017; Intel Corporation 2021a).

Listing 7: Matrix Unit Operation: Enables efficient block-wise matrix multiplication and accumulation in hardware-accelerated systems, demonstrating how specialized units streamline computational tasks for AI/ML operations.

mload mr1, (weight_ptr)     # Load e.g., 16x16 block of
                            # weight matrix
mload mr2, (input_ptr)      # Load corresponding input block
matmul.mm mr3, mr1, mr2     # Multiply and accumulate entire
                            # blocks at once
mstore (output_ptr), mr3    # Store computed output block

This matrix processing unit handles $16{\times}16$ blocks of the linear layer computation described earlier, performing 256 multiply-accumulate operations simultaneously compared to the eight possible with the vector unit in listing 4. These matrix operations complement vectorized computation by enabling structured many-to-many transformations. The interplay between matrix and vector operations shapes the efficiency of neural network execution.
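How a larger multiplication decomposes into block operations can be sketched in Python. The version below is a software model only: `TILE = 16` assumes the block size used in the text, each iteration of the innermost block loop stands in for one `matmul.mm` instruction, and the function names are illustrative.

```python
TILE = 16  # assumed block size, matching the 16 x 16 unit in the text


def matmul_naive(a, b):
    """Reference triple-loop matrix multiply on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matmul_tiled(a, b, tile=TILE):
    """Blocked multiply: each (i0, j0, k0) block models one matmul.mm instruction."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    block_ops = 0
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for k0 in range(0, inner, tile):
                block_ops += 1
                for i in range(i0, min(i0 + tile, rows)):
                    for j in range(j0, min(j0 + tile, cols)):
                        acc = out[i][j]
                        for k in range(k0, min(k0 + tile, inner)):
                            acc += a[i][k] * b[k][j]
                        out[i][j] = acc
    return out, block_ops


# Small integer-valued matrices so both loop orders give identical floats
a = [[float((i + k) % 5) for k in range(48)] for i in range(32)]   # 32 x 48
b = [[float((k * j) % 3) for j in range(16)] for k in range(48)]   # 48 x 16
tiled, block_ops = matmul_tiled(a, b)
reference = matmul_naive(a, b)
```

The 32$\times$48 by 48$\times$16 example needs $2 \times 1 \times 3 = 6$ block operations. By the same counting, the [512$\times$256] by [256$\times$32] layer from listing 5 needs $32 \times 2 \times 16 = 1{,}024$ block operations of $16^3 = 4{,}096$ MACs each, which again totals 4,194,304 MACs.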

Like vector processing, matrix acceleration has deep historical roots: DSPs and GPUs were optimized for matrix computations in the 1980s and 1990s for image processing, scientific computing, and 3D rendering (Golub and Loan 1996; Owens et al. 2008; Hwu 2011). Neural networks have made matrix multiplication commercially dominant, driving the development of dedicated tensor cores and TPUs that process these operations at unprecedented scale.

Golub, Gene H., and Charles F. Van Loan. 1996. Matrix Computations. Johns Hopkins University Press.

Owens, John D., Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. 2008. “GPU Computing.” Proceedings of the IEEE 96 (5): 879–99. https://doi.org/10.1109/jproc.2008.917757.

Hwu, Wen-mei W. 2011. “Introduction.” In GPU Computing Gems Emerald Edition. Elsevier. https://doi.org/10.1016/b978-0-12-384988-5.00064-4.

Matrix and vector operations together handle the linear algebra of neural networks. Between every linear transformation, however, sits a nonlinear activation function—and these transcendental computations (exponentials, square roots, trigonometric functions) cannot be efficiently expressed through multiply-accumulate alone. Table 3 contrasts the two primitive types, clarifying which neural network operations map to each.

Table 3: Operation Characteristics: Matrix operations excel at many-to-many transformations common in neural network layers, while vector operations efficiently handle one-to-one transformations like activation functions and normalization. The distinction determines which hardware primitive (tensor core or vector unit) delivers optimal performance for each operation.

Operation Type	Best For	Examples	Key Characteristic
Matrix Operations	Many-to-many transforms	Layer transformations, attention, convolutions	Each output depends on multiple inputs
Vector Operations	One-to-one transforms	Activation functions, layer normalization, element-wise gradients	Each output depends only on corresponding input

Special function units

Special Function Units (SFUs) provide dedicated hardware for these nonlinear computations, completing the trio of core processing primitives. The need for such units is not new: floating-point coprocessors addressed scalar arithmetic bottlenecks (Palmer 1980), and digital signal processing hardware addressed related demands for specialized arithmetic in scientific and signal-processing workloads (Smith 1997). Neural networks have intensified this demand because activation functions, normalization layers, and softmax transformations appear after every linear layer, making them a throughput bottleneck rather than an occasional convenience.

Palmer, John F. 1980. “The Intel 8087 Numeric Data Processor.” Proceedings of the 1980 National Computer Conference (AFIPS), 887–93. https://doi.org/10.1145/1500518.1500674.

Smith, Steven W. 1997. “Digital Signal Processing.” In Digital Signal Processing Demystified. Elsevier. https://doi.org/10.1016/b978-187870716-1/50004-4.

Nonlinear functions

To see why dedicated hardware matters, consider a typical layer sequence (Goodfellow et al. 2016). Listing 8 combines linear transformations with nonlinear activations—operations that appear simple in Python but reveal substantial computational complexity at the hardware level.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Listing 8: Nonlinear Transformations: Neural networks process input data through a sequence of linear transformations followed by nonlinear activations to capture complex patterns. This layer sequence enhances model expressiveness and learning capabilities.

layer = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(), nn.BatchNorm1d(512)
)
output = layer(input_tensor)

This sequence introduces multiple nonlinear transformations that extend beyond simple matrix operations. Listing 9 breaks down these operations into their mathematical components, exposing the computational complexity that hardware must address.

Listing 9: Nonlinear Transformations: Neural networks apply linear and nonlinear operations to transform input data into meaningful features for learning. Machine learning models use these transformations to capture complex patterns in data efficiently.

# Column-batch convention: Z has shape [features x batch]
Z = matmul(weights, input) + bias  # Linear transformation
H = maximum(0, Z)  # ReLU activation (element-wise)
mean = reduce_mean(H, axis=1, keepdims=True)  # BatchNorm statistics per feature
var = reduce_mean((H - mean) ** 2, axis=1, keepdims=True)  # Variance per feature
output = gamma * (H - mean) / sqrt(var + eps) + beta  # Normalization

Hardware implementation of nonlinear functions

The computational complexity of these operations becomes apparent when examining their implementation on traditional processors. These seemingly simple mathematical operations translate into complex sequences of instructions. Consider batch normalization (Ioffe and Szegedy 2015): computing the normalization requires reductions, variance calculation, and a square root, while operations like softmax introduce exponentials whose cost depends on the processor implementation. A rectified linear unit (ReLU) is mathematically simple, but a naive scalar implementation still performs a comparison and selection for every element; optimized ML kernels usually make that step branchless. Listing 10 therefore uses ReLU and batch normalization to show two different sources of overhead: element-wise passes through memory and multi-pass normalization work.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” Proceedings of the 32nd International Conference on Machine Learning (ICML) 37: 448–56.

Listing 10: ReLU and BatchNorm Operations: Neural networks process input data through element-wise selections and multiple normalization passes, highlighting efficiency challenges in naive implementations.

batch_size = 32
for batch in range(batch_size):
    for feature in range(512):
        # ReLU: Naive scalar compare/select; optimized kernels
        # usually implement this branchlessly.
        z = matmul_output[batch, feature]
        h = max(0.0, z)  # Conditional operation

        # BatchNorm statistics: running sums per feature
        mean_sum[feature] += h  # Sum for the mean
        var_sum[feature] += h * h  # Sum of squares for the variance

        temp[batch, feature] = h  # Extra memory storage needed

# Normalization requires complex arithmetic
for feature in range(512):
    mean = mean_sum[feature] / batch_size
    var = (var_sum[feature] / batch_size) - mean * mean

    # Square root computation: multiple iterations
    scale = gamma[feature] / sqrt(var + eps)  # Iterative approximation
    shift = beta[feature] - mean * scale

    # Additional pass over data for final computation
    for batch in range(batch_size):
        output[batch, feature] = temp[batch, feature] * scale + shift

These operations introduce several interrelated inefficiencies that compound across the deep layers of modern networks. Multiple passes over data inflate memory bandwidth requirements, while complex arithmetic operations like square root and exponential demand many instruction cycles each. Element-wise selections such as ReLU are cheap per element but still create extra memory traffic when they are launched as separate kernels, and the need for intermediate storage between passes further increases memory pressure. Vector processing units, designed for regular computations, cannot fully use their width on operations like exponentials and square roots when those functions require specialized or lower-throughput paths.

Each operation contributes to this cost in a different way. Batch normalization conceptually requires multiple passes through the data: one for mean computation, another for variance, and a final pass for output transformation (listing 10 folds the first two into a single loop by accumulating sums and sums of squares). Each pass loads and stores data through the memory hierarchy. Square roots and exponentials expand into many instructions on processors without specialized hardware paths. ReLU generally maps to a compare-and-select or maximum operation, so its standalone cost is dominated less by arithmetic than by the additional read and write if it is not fused with neighboring work.
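The effect of fusion on memory traffic can be illustrated with a simple counting model. The sketch below handles a single feature column, counts element reads and writes rather than bytes or cache behavior, and fuses by recomputing ReLU in the final pass instead of storing it. The `Traffic` class, both function names, and that particular fusion strategy are assumptions chosen for illustration.

```python
class Traffic:
    """Counts element reads and writes as a crude model of memory traffic."""

    def __init__(self):
        self.reads = 0
        self.writes = 0


def unfused(z, gamma, beta, eps, t):
    """ReLU, statistics, and normalization as three separate passes."""
    n = len(z)
    h = []
    for v in z:                       # Pass 1: standalone ReLU kernel
        t.reads += 1
        h.append(max(0.0, v))
        t.writes += 1
    total = total_sq = 0.0
    for v in h:                       # Pass 2: statistics
        t.reads += 1
        total += v
        total_sq += v * v
    mean = total / n
    var = total_sq / n - mean * mean
    scale = gamma / (var + eps) ** 0.5
    out = []
    for v in h:                       # Pass 3: normalize
        t.reads += 1
        out.append((v - mean) * scale + beta)
        t.writes += 1
    return out


def fused(z, gamma, beta, eps, t):
    """ReLU folded into the statistics pass and recomputed when normalizing."""
    n = len(z)
    total = total_sq = 0.0
    for v in z:                       # Pass 1: ReLU + statistics, nothing stored
        t.reads += 1
        h = max(0.0, v)
        total += h
        total_sq += h * h
    mean = total / n
    var = total_sq / n - mean * mean
    scale = gamma / (var + eps) ** 0.5
    out = []
    for v in z:                       # Pass 2: recompute ReLU, then normalize
        t.reads += 1
        out.append((max(0.0, v) - mean) * scale + beta)
        t.writes += 1
    return out


z = [-1.0, 2.0, 0.5, -3.0, 4.0, 1.5, -0.5, 3.0]
t_unfused, t_fused = Traffic(), Traffic()
out_unfused = unfused(z, 1.0, 0.0, 1e-5, t_unfused)
out_fused = fused(z, 1.0, 0.0, 1e-5, t_fused)
```

For $N$ elements the unfused version moves $5N$ values (three reads and two writes per element) and the fused version moves $3N$, while producing the same output. The arithmetic is unchanged apart from one extra comparison per element, which is the trade the text describes: cheap recomputation in exchange for fewer trips through the memory hierarchy.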

SFU hardware implementation

SFUs address these inefficiencies through dedicated hardware implementation. Modern ML accelerators include specialized circuits that transform these complex operations into low-latency, fixed-function computations. Listing 11 demonstrates this efficiency: loading a vector of values allows the accelerator to apply ReLU, sigmoid, and square-root-style operations through dedicated execution paths, eliminating multiple software passes and complex instruction sequences.

Listing 11: Hardware Acceleration: Single-cycle nonlinear operations enable efficient vector processing in ML accelerators, demonstrating how specialized hardware reduces computational latency.

vld.v v1, (input_ptr)    # Load vector of values
vrelu.v v2, v1           # Single-cycle ReLU on entire vector
vsigm.v v3, v1           # Fixed-latency sigmoid computation
vtanh.v v4, v1           # Direct hardware tanh implementation
vrsqrt.v v5, v1          # Fast reciprocal square root

Each SFU implements a specific function through specialized circuitry. For instance, a ReLU unit performs the comparison and selection in dedicated logic, eliminating branching overhead. Square root operations use hardware implementations of algorithms like Newton-Raphson with fixed iteration counts, providing predictable latency bounds. Exponential and logarithmic functions often combine small lookup tables with hardware interpolation circuits. Table 4 summarizes the various hardware implementations and their typical latencies, spanning from single-cycle activations to logarithmic-time reductions.

Table 4: Special Function Units: Dedicated hardware implementations of common mathematical functions (such as ReLU, sigmoid, and reciprocal square root) accelerate machine learning computations by eliminating software overhead and enabling parallel processing of vector data. The latency ranges are representative design targets, not universal product specifications; the important point is that nonlinear primitives have different hardware costs.

Function Unit	Operation	Implementation Strategy	Illustrative Latency
Activation Unit	ReLU, sigmoid, tanh	Piecewise approximation circuits	1–2 cycles
Statistics Unit	Mean, variance	Parallel reduction trees	$\mathcal{O}(\log N)$ cycles
Exponential Unit	exp, log	Table lookup + hardware interpolation	2–4 cycles
Root/Power Unit	sqrt, rsqrt	Fixed-iteration Newton-Raphson	4–8 cycles
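Two of the implementation strategies in Table 4 can be modeled in software to show why their latency is fixed. The sketch below is illustrative only: the iteration count, the table size, the linear seed for the Newton-Raphson iteration, and the function names are all assumptions, and real SFUs operate on hardware number formats rather than Python floats.

```python
import math

NEWTON_STEPS = 4   # assumed fixed iteration count
TABLE_SIZE = 64    # assumed number of lookup-table segments
EXP2_TABLE = [2.0 ** (i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def rsqrt_newton(x: float) -> float:
    """Reciprocal square root of x > 0 using a fixed number of Newton-Raphson steps."""
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, mantissa in [0.5, 1)
    if exponent % 2:                     # Make the exponent even so it halves exactly
        mantissa *= 2.0
        exponent -= 1
    y = 1.25 - 0.3 * mantissa            # Crude linear seed for mantissa in [0.5, 2)
    for _ in range(NEWTON_STEPS):        # Same step count for every input
        y = y * (1.5 - 0.5 * mantissa * y * y)
    return math.ldexp(y, -exponent // 2)


def exp_table(x: float) -> float:
    """exp(x) as 2**k * 2**f, with 2**f read from a table and linearly interpolated."""
    t = x * math.log2(math.e)            # exp(x) == 2**t
    k = math.floor(t)
    f = t - k                            # Fractional part in [0, 1]
    pos = f * TABLE_SIZE
    i = min(int(pos), TABLE_SIZE - 1)    # Table segment index
    frac = pos - i
    mantissa = EXP2_TABLE[i] + frac * (EXP2_TABLE[i + 1] - EXP2_TABLE[i])
    return math.ldexp(mantissa, k)


rsqrt_error = abs(rsqrt_newton(2.0) * math.sqrt(2.0) - 1.0)
exp_error = abs(exp_table(1.0) / math.exp(1.0) - 1.0)
```

Neither function contains a data-dependent loop or branch on the value's magnitude, so every input takes the same number of steps. That predictability, rather than the specific constants chosen here, is the property the table's latency column reflects; accuracy is traded against table size and iteration count.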

Vector operations, matrix operations, and special function units constitute the three core computational primitives, but primitives alone do not determine throughput. The primitives tell us what operations accelerators perform efficiently; the execution models tell us how those operations are parallelized across thousands of processing elements. This distinction matters because the same matrix multiplication can achieve 10 percent or 90 percent of peak performance depending on how it maps to the execution model: a difference driven by thread organization, memory access patterns, and synchronization overhead rather than algorithmic complexity.

Self-Check: Question

Which mapping of neural-network operations to accelerator primitives best matches the architectural argument of the section?
1. Dense projections map to tensor cores or systolic matrix units, element-wise operations (bias add, masking, ReLU) map to vector units, and transcendental activations like exp or sigmoid benefit from dedicated special-function units
2. Dense projections map to special-function units because matrix multiplication is a transcendental operation; softmax maps to tensor cores; element-wise masking runs on systolic arrays
3. Element-wise activations map to tensor cores because they share the same arithmetic shape; matrix multiplications run on vector units because both do multiply-accumulate; reductions run on branch predictors
4. All three operation classes map equally well to the same generic scalar pipeline because modern accelerators unify them into one execution unit
A 128×128 systolic array performs a large matrix multiplication. Explain quantitatively why its energy per multiply-accumulate is dramatically lower than a vector-unit implementation of the same multiplication.
True or False: A 50-percent unstructured pruning pass delivers roughly the same inference speedup as a 2:4 structured-sparsity pass on modern tensor hardware, because both halve the number of nonzero multiplies.
A compiler must map a 4096×4096 matrix multiplication onto a 128×128 systolic array. Why is tiling required rather than optional?
1. The hardware has a fixed physical array size and a bounded on-chip scratchpad, so the logical computation must be partitioned into 128×128 sub-problems that fit and that maximize operand reuse within each block
2. Tiling changes the operation from matrix multiplication to vector reduction once the matrix exceeds a threshold size
3. Tiling is required so FP16 accumulations can be promoted to FP32 inside each tile, without which numerical precision would collapse
4. The accelerator can execute only one row of the output matrix per cycle regardless of how much on-chip memory is available
A convolution layer produces 257 output channels and runs on a 128-wide tensor unit. Two full tiles cover 256 channels, but the 257th channel forces a third tile with only one active lane out of 128. What fraction of the third tile’s compute bandwidth is wasted, and what is the general lesson for dimension selection?
1. 127/128 ≈ 99 percent of the third tile is wasted; because utilization loss grows sharply near tile boundaries, architects and model designers pick output channel counts that are multiples of the tile width (128, 256, 512) to keep the final tile full
2. Approximately 50 percent is wasted because any partial tile halves throughput; the fix is to pad activations to the next batch size
3. No bandwidth is wasted because modern accelerators dynamically resize their tensor units to match odd dimensions
4. Approximately 1/128 is wasted because only the unused lanes consume energy; the shape choice is cosmetic and has no performance consequence
Why does a switch from FP32 to FP16 or INT8 deliver more than the naive “half the bits, half the time” speedup on modern accelerators?
1. Reduced precision attacks both sides of the roofline: more low-precision MAC units fit in a fixed silicon area (raising the compute ceiling), and fewer bytes traverse the memory hierarchy per operation (raising arithmetic intensity)
2. Reduced precision removes the need for on-chip memory entirely because values fit in registers
3. Reduced precision improves model accuracy on large accelerators, so fewer training steps are needed to reach target quality
4. Tensor cores function only on integer operands, so any FP32 path is an emulation that runs orders of magnitude slower

Compute Units and Execution Models

Applying ReLU to a 512-element vector shows why execution models matter: the operation is simple, but throughput depends on whether the hardware treats those 512 comparisons as scalar instructions, SIMD lanes, GPU threads, or tensor-program fragments. Modern AI processors package the three compute primitives into distinct execution units: single instruction, multiple data (SIMD) units, tensor cores, and processing elements that define how computations are structured and exposed to programmers. Understanding this organization reveals both the theoretical capabilities and practical performance characteristics that determine real-world throughput.

Mapping primitives to execution units

The progression from computational primitives to execution units follows a structured hierarchy that reflects the increasing complexity and specialization of AI accelerators:

Vector operations → SIMD/SIMT units that enable parallel processing of independent data elements
Matrix operations → Tensor cores and systolic arrays that provide structured matrix multiplication
Special functions → Dedicated hardware units integrated within processing elements

Each execution unit combines these computational primitives with specialized memory and control mechanisms, optimizing both performance and energy efficiency. This structured packaging allows hardware vendors to expose standardized programming interfaces while implementing diverse underlying architectures tailored to specific workload requirements. The choice of execution unit significantly influences overall system efficiency by determining data locality, compute density, synchronization overhead, and how much of the theoretical peak the workload can actually use.

Evolution from SIMD to SIMT architectures

Imagine applying ReLU activation to a 512-element vector. A scalar processor executes 512 comparison-and-select operations sequentially. A SIMD (Single Instruction, Multiple Data) unit processes 8 or 16 elements per instruction, reducing the work to 64 or 32 instructions, respectively. A SIMT (Single Instruction, Multiple Thread) GPU can launch one lightweight thread per element; the hardware schedules those threads in warps or waves (16 warps of 32 threads for this vector on NVIDIA hardware), completing the vector through multiple parallel groups while hiding latency. This progression reflects two related ideas: Flynn’s taxonomy formalized SIMD as data-parallel execution (Flynn 1966), and GPU SIMT architectures extend that principle to many lightweight threads scheduled in warps (Lindholm et al. 2008; Nickolls et al. 2008).
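The counts in the paragraph above can be reproduced with a few lines of arithmetic. The following is a minimal sketch rather than a performance model: the function name `issue_counts` is hypothetical, and the warp size of 32 is an assumption that matches NVIDIA hardware (other vendors use different group sizes).

```python
import math


def issue_counts(n_elements, simd_widths=(8, 16), warp_size=32):
    """Count instruction issues needed for one element-wise operation."""
    counts = {"scalar": n_elements}
    for width in simd_widths:
        # One SIMD instruction covers `width` elements.
        counts[f"simd_{width}"] = math.ceil(n_elements / width)
    # SIMT: one thread per element, but each instruction issues once per warp.
    counts["simt_warps"] = math.ceil(n_elements / warp_size)
    return counts


relu_counts = issue_counts(512)
print(relu_counts)
```

The SIMT entry counts warps rather than elapsed time: on a GPU those warps can be resident on one or more SMs at once, which is where the latency hiding comes from.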

Flynn, M. J. 1966. “Very High-Speed Computing Systems.” Proceedings of the IEEE 54 (12): 1901–9. https://doi.org/10.1109/proc.1966.5273.

SIMD execution applies identical operations to multiple data elements in parallel, minimizing instruction overhead while maximizing data throughput. This execution model is widely used to accelerate workloads with regular, independent data parallelism, such as neural network computations. The Arm Scalable Vector Extension (SVE) provides a representative example of how modern architectures implement scalable SIMD operations efficiently (Stephens et al. 2017). Listing 12 demonstrates this approach.

Listing 12: Vector Operation: Vector multiplication and addition operations enable efficient parallel processing in machine learning models.

ptrue p0.s              // Activate every 32-bit lane, whatever the vector length
ld1w z0.s, p0/z, [x0]   // Load one vector of inputs x
fmul z1.s, z0.s, z0.s   // Multiply element-wise: x * x
fadd z2.s, z1.s, z0.s   // Add element-wise: x * x + x
st1w z2.s, p0, [x1]     // Store results

The ptrue predicate that opens this sequence is what makes SVE scalable: it activates every lane of whatever vector length the hardware implements, so the same binary can run on a narrow 128-bit vector unit or a wide 2048-bit one without recompilation (Stephens et al. 2017). Intel’s Advanced Matrix Extensions (AMX) are a different kind of specialization: tile registers and matrix instructions expose two-dimensional matrix operations directly to software rather than merely widening a vector lane (Intel Corporation 2021a). Together, SVE and AMX show two hardware-facing paths for ML kernels: vector-length-portable SIMD and fixed tile-based matrix acceleration.

Stephens, Nigel, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, et al. 2017. “The ARM Scalable Vector Extension.” IEEE Micro 37 (2): 26–39. https://doi.org/10.1109/mm.2017.35.

¹⁸ Reduced-Precision ML: The precision-performance trade-off is quantifiable (Dally et al. 2021; Dally 2023): halving the bit-width of an operand quadruples the number of ALUs that fit in the same silicon area and halves the memory bandwidth consumed per element. NVIDIA’s architectural shift from FP64-heavy designs (Fermi, Kepler) to mixed-precision Tensor Cores (Volta, 2017) delivered 125 TFLOP/s of FP16 tensor throughput vs. the prior generation’s 21 TFLOP/s of FP16—roughly 6$\times$ at the same 300 W TDP. This established precision selection as a first-class architectural decision: the correct precision is the lowest one that preserves model accuracy, not the highest one the hardware supports.

Dally, William J., Stephen W. Keckler, and David B. Kirk. 2021. “Evolution of the Graphics Processing Unit (GPU).” IEEE Micro 41 (6): 42–51. https://doi.org/10.1109/mm.2021.3113475.

Dally, Bill. 2023. “Hardware for Deep Learning.” 2023 IEEE Hot Chips 35 Symposium (HCS), 1–58. https://doi.org/10.1109/hcs59251.2023.10254716.

Lindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.

Nickolls, John, Ian Buck, Michael Garland, and Kevin Skadron. 2008. “Scalable Parallel Programming with CUDA: Is CUDA the Parallel Programming Model That Application Developers Have Been Waiting For?” Queue 6 (2): 40–53. https://doi.org/10.1145/1365490.1365500.

¹⁹ Streaming Multiprocessor (SM): The physical hardware engine that implements the SIMT model by using warp schedulers to coordinate the thousands of parallel threads mentioned in the text. The “efficient scaling” of neural networks therefore depends heavily on maintaining enough SM occupancy (the ratio of resident warps to the maximum the SM supports) for the schedulers to hide latency. If occupancy is too low, the schedulers run out of ready warps while memory requests are outstanding, the execution units sit idle, and the kernel becomes latency bound, falling short of the GPU’s peak computational throughput.

²⁰ Warp: The basic execution unit of 32 threads that enables SIMT efficiency by sharing a single instruction fetch and executing in lock-step. The direct trade-off for this efficiency is warp divergence: when threads take different control-flow paths, the hardware must serialize each path’s execution for all 32 threads, potentially cutting throughput by 50 percent or more. This is why ML kernels use branchless predicated operations to maintain full warp efficiency.

SIMD lanes share a single instruction stream, which makes per-element control flow and memory-latency hiding awkward. To address these limitations, SIMT¹⁸ extends SIMD principles by enabling parallel execution across multiple independent threads, each maintaining its own program counter and architectural state (Lindholm et al. 2008; Nickolls et al. 2008). This model maps naturally to matrix computations, where each thread processes a different portion of the workload while still benefiting from shared instruction execution. In NVIDIA’s GPU architectures, each Streaming Multiprocessor (SM)¹⁹ coordinates thousands of threads executing in parallel, allowing for efficient scaling of neural network computations. Threads are organized into warps²⁰, which are the basic execution units that enable SIMT efficiency. Listing 13 shows this parallel processing model in action.

Listing 13: SIMT Execution: Each thread processes a unique output element in parallel, demonstrating how SIMT enables efficient matrix multiplication on GPUs.

__global__ void matrix_multiply(float* C, const float* A,
                                const float* B, int N) {  // CUDA kernel
    // Each thread computes one element of the N x N output matrix C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard threads that fall outside the matrix when N is not a
    // multiple of the block dimensions
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; k++) {
        // All threads in a warp execute this instruction in lock-step
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

The preceding listing shows a CUDA²¹ kernel where SIMT execution allows neural network computations to scale efficiently across thousands of threads while maintaining flexibility for divergent execution paths. Similar execution models appear in AMD’s RDNA and Intel’s Xe architectures, reinforcing SIMT as a core mechanism for AI acceleration.

²¹ CUDA (Compute Unified Device Architecture): Released by NVIDIA in 2006, CUDA eliminated the need to disguise general-purpose computations as graphics operations, opening GPUs to scientific and ML workloads through a C-like programming model. The ecosystem it created—cuBLAS, cuDNN, TensorRT—constitutes a software moat that can lock the ML training stack to NVIDIA hardware: migrating away requires rewriting or replacing thousands of GPU-optimized kernels, a cost that often exceeds the hardware savings of competing platforms. This software lock-in, not raw silicon performance alone, helps explain why many large ML training stacks remain CUDA-centered.

Tensor Cores

Consider a single transformer attention head computing the $Q \times K^T$ product for a 2,048-token sequence with 64-dimensional embeddings. This operation requires multiplying a $2048{\times}64$ matrix by a $64{\times}2048$ matrix: roughly 268.4M MACs, or about 536.9 MFLOP when the multiply and add are counted separately. On a scalar processor executing one FLOP per cycle at 2 GHz, this single attention head would take about 268.4 ms. GPU SIMT execution can distribute this work across many threads, but tensor cores go further by processing entire $16{\times}16$ matrix tiles per instruction; under the illustrative assumption used here, the tiled path completes the same operation in under 0.5 milliseconds, a speedup of roughly 537× or more over scalar execution. This dramatic speedup arises not from faster clock speeds but from a fundamentally different approach to organizing computation around matrix blocks rather than individual elements.
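The arithmetic behind these figures is short enough to check directly. In this sketch the variable names are illustrative, the scalar rate (one FLOP per cycle at 2 GHz) comes from the paragraph above, and the 0.5 ms tiled time is the text's illustrative assumption rather than a measurement.

```python
seq_len, head_dim = 2048, 64

# One multiply-accumulate per (query, key, feature) triple.
macs = seq_len * seq_len * head_dim
flops = 2 * macs                      # multiply and add counted separately

scalar_flops_per_second = 2e9         # 1 FLOP/cycle at 2 GHz (from the text)
scalar_seconds = flops / scalar_flops_per_second

tiled_seconds = 0.5e-3                # illustrative upper bound from the text
speedup = scalar_seconds / tiled_seconds

print(f"{macs / 1e6:.1f} M MACs, {flops / 1e6:.1f} MFLOP")
print(f"scalar: {scalar_seconds * 1e3:.1f} ms, speedup: {speedup:.1f}x")
```

Because the 0.5 ms figure is an upper bound on the tiled time, the computed ratio is a lower bound on the speedup under that assumption.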

While SIMD and SIMT units provide efficient execution of vector operations, neural networks rely heavily on matrix computations that require specialized execution units for structured multi-dimensional processing. The energy economics of matrix operations drive this specialization: traditional scalar processing can require multiple off-chip memory accesses per operation, while tensor cores amortize data movement across entire matrix blocks. Matrix engines extend SIMD and SIMT principles by enabling efficient matrix operations through dedicated hardware blocks (such as tensor cores) that execute matrix multiplications and accumulations on matrix tiles²². In many cases, this shifts the dominant cost from off-chip data movement toward on-chip reuse and arithmetic, depending on the kernel mix and memory behavior.

²² Tensor Core Dimension Alignment: NVIDIA Tensor Cores are most efficient when matrix dimensions are aligned to precision- and architecture-specific multiples, such as multiples of 8 or 16 for common FP16/BF16/INT8 paths. Modern cuBLAS and cuDNN can still use Tensor Cores for many nonaligned dimensions, but poorly aligned shapes may trigger less efficient kernels, require padding, or reduce effective throughput (NVIDIA 2024a; NVIDIA Corporation 2021). This is why model architects often choose embedding and channel dimensions such as 512 rather than 500 and why batch-size-1 inference may fail to reach peak Tensor Core utilization: alignment and arithmetic intensity jointly determine whether the hardware’s matrix engines stay full.

NVIDIA Corporation. 2021. NVIDIA cuDNN Developer Guide.

²³ Tensor Core: A single tensor core instruction executes a complete matrix-multiply-accumulate operation on a small tile of data using a dedicated hardware block (NVIDIA 2017; NVIDIA Corporation 2020). This approach bypasses the overhead of fetching and scheduling dozens of individual arithmetic instructions on general-purpose CUDA cores. Because these blocks constitute a large fraction of a modern accelerator’s advertised tensor throughput, failing to use them can leave most of the chip’s theoretical peak unavailable to the workload.

NVIDIA. 2017. Training with Mixed Precision.

Tensor cores²³ provide an example of this approach: they expose matrix computation through specialized instructions that use dedicated hardware blocks. Listing 14 shows one such instruction.

Listing 14: Tensor Core Operation: Matrix multiplications are performed in parallel across entire matrix blocks, optimizing computational efficiency for neural network training.


Tensor Core Operation (example GPU, PTX warp-level WMMA form):
wmma.mma.sync.aligned.row.col.m16n16k16.f16.f16
  {d0,d1,d2,d3},               // Destination D (packed FP16 registers)
  {a0,a1,a2,a3,a4,a5,a6,a7},   // Source matrix A fragment
  {b0,b1,b2,b3,b4,b5,b6,b7},   // Source matrix B fragment
  {c0,c1,c2,c3};               // Accumulator C

A single tensor core instruction processes an entire matrix block while maintaining intermediate results in local registers, improving computational efficiency compared to implementations based on scalar or vector operations. This structured approach enables hardware to achieve high throughput while reducing the burden of explicit loop unrolling and data management at the software level.
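To make the "entire matrix block per instruction" claim concrete, the sketch below counts how many 16×16×16 tile operations cover a matrix product and how many multiply-accumulates each one replaces. The helper name `tile_instruction_count` is hypothetical, and the uniform 16×16×16 shape is an assumption taken from the listing; real instruction shapes vary by architecture and precision.

```python
def tile_instruction_count(m, n, k, tile=16):
    """Tile-level MMA operations needed for an (m x k) by (k x n) product."""
    def ceil_div(a, b):
        return -(-a // b)

    return ceil_div(m, tile) * ceil_div(n, tile) * ceil_div(k, tile)


# Example shape: a 2048 x 64 matrix times a 64 x 2048 matrix.
tile_ops = tile_instruction_count(2048, 2048, 64)
macs_per_tile_op = 16 * 16 * 16
total_macs = tile_ops * macs_per_tile_op

print(tile_ops, macs_per_tile_op, total_macs)
```

Each tile operation stands in for 4,096 scalar multiply-accumulates, which is the instruction-fetch and scheduling overhead that the dedicated hardware block removes.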

Design priorities determine how matrix engines appear in different processor families. GPU tensor cores preserve programmability while accelerating general-purpose deep learning kernels. TPU-style designs use large-scale matrix units arranged in systolic arrays to maximize sustained training throughput on dense tensor kernels. Mobile NPUs²⁴ shrink the same idea into low-power inference blocks, while server CPUs add matrix instruction extensions (AMX-class tiles) for inference and mixed workloads. Each version changes the same contract: how much flexibility the hardware keeps while reducing movement around dense matrix operations.

²⁴ Neural Processing Unit (NPU): Mobile NPUs achieve low-power inference by implementing common tensor operations in fixed-function or narrowly programmable hardware rather than as fully general GPU kernels. This architectural commitment can deliver large energy-efficiency gains for supported kernels, but it makes deployment dependent on operator coverage: unsupported functions must fall back to a CPU or GPU path that may be far less efficient for that workload (Sze et al. 2017).

Sze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey.” Proceedings of the IEEE 105 (12): 2295–329. https://doi.org/10.1109/jproc.2017.2761740.

The increasing specialization of AI hardware has driven measurable performance improvements in deep learning workloads. To appreciate the magnitude of this shift, trace the curve in figure 6 from left to right: over a single decade, NVIDIA’s advertised single-chip throughput rose roughly 1,000$\times$ (three orders of magnitude) from the K20X’s 3.9 TFLOP/s in FP32 to the H100’s roughly 4,000 TFLOP/s in FP8, as the architecture transitioned from general-purpose floating-point execution units to dedicated tensor-processing cores, lower precision formats, and structured sparsity support (NVIDIA Corporation 2017, 2020, 2024; Choquette 2023). Because the plotted points mix precision formats and FLOP/s against INT8 TOPS, the curve tracks generation-over-generation capability rather than one consistent unit. The later B200 sits even higher.

Figure 6: **GPU Performance Scaling**: NVIDIA advertised single-chip peak throughput increased by more than 1,000$\times$ over roughly a decade, from K20X-era FP32 throughput to H100/B200 tensor-operation peaks. The plotted labels mix precision modes, so the figure should be read as an architectural trend, not an apples-to-apples FP32 comparison. This gain was driven by tensor core acceleration, reduced precision (FP16, INT8, FP8, FP4), and hardware-accelerated structured sparsity (NVIDIA Corporation 2017, 2020, 2024; Choquette 2023).

Processing elements

The highest level of execution unit organization integrates multiple tensor cores with local memory into processing elements (PEs). A processing element serves as the primary building block in many AI accelerators, combining different computational units to efficiently execute neural network operations. Each PE typically includes vector units for element-wise operations, tensor cores for matrix computation, special function units for nonlinear transformations, and dedicated memory resources to optimize data locality and minimize data movement overhead.

Processing element design varies because each architecture chooses a different balance between compute density, local memory, and interconnect distance. Graphcore’s Intelligence Processing Unit (IPU) distributes computation across 1,472 tiles, each containing independent processing elements optimized for fine-grained parallelism (Graphcore 2020). Cerebras extends the same local-compute principle in the CS-2 system, integrating roughly 850,000 AI-optimized cores across a wafer-scale device for deep learning acceleration (Cerebras Systems 2021). Tesla’s D1 processor emphasizes substantial local memory inside its processing elements, optimizing throughput for training the neural networks behind its autonomous-driving software (Tesla, Inc. 2021).

Graphcore. 2020. The Colossus MK2 IPU Processor.

Tesla, Inc. 2021. Tesla Dojo Technology: A Guide to Tesla’s Configurable Floating Point Formats & Arithmetic. Tesla AI Day, Whitepaper.

Across these designs, the binding trade-off is the one the roster illustrates: compute density versus data locality. Packing more cores raises peak throughput only if each one can be kept fed, so a processing element’s delivered efficiency depends as much on interconnect strategy and memory locality as on raw arithmetic capability.

That same dependence on locality governs which algorithmic optimizations the hardware can actually exploit. A regular grid of processing elements accelerates sparsity only when the surviving nonzero values preserve the predictable access patterns the grid depends on, which is precisely the constraint that N:M structured sparsity is designed to satisfy.

N:M structured sparsity mechanics

While unstructured pruning reduces model size, it rarely translates to hardware speedup because memory access becomes irregular. Hardware accelerators solve this with N:M Structured Sparsity²⁵, a pattern-based approach that enforces regularity. The notation “$N{:}M$” specifies that exactly $N$ values must be nonzero within every contiguous block of $M$ values, creating a predictable pattern that hardware can exploit.

²⁵ N:M Structured Sparsity: The 2:4 ratio (50 percent density) used by NVIDIA’s Ampere Sparse Tensor Cores is a hardware-friendly compromise: every contiguous four-value group retains two nonzero values, preserving regular indexing while halving the dense value payload (NVIDIA Corporation 2020). At 2:4, the metadata overhead is compact enough to store alongside the weights without overwhelming the memory-traffic savings, which is the constraint that makes the advertised 2$\times$ tensor-math throughput path plausible when kernels and model weights satisfy the pattern.

NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Whitepaper, V1.0.

NVIDIA. 2020a. Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores.

NVIDIA’s Sparse Tensor Cores implement a concrete instance of this pattern: the 2:4 constraint, which requires that exactly two of every contiguous block of four values be nonzero (equivalently, two must be zero) (NVIDIA Corporation 2020; NVIDIA 2020a). This constraint lets the hardware store the weight matrix in half the dense space plus a small amount of metadata. Execution proceeds in two stages. First, the hardware stores only the two nonzero values and compact position metadata for every four-element block (compression). Second, during matrix multiplication, the Sparse Tensor Core reads the metadata to select the corresponding activations and performs math only on the nonzero weights (compute). The result is a higher effective FLOP/byte ratio and an up-to-2$\times$ tensor-math throughput path over dense matrix multiplication, provided the model is fine-tuned to respect the 2:4 constraint.
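The compression and compute stages described above can be imitated in NumPy. This is a software sketch of the idea, not the hardware's actual data layout: the function names are hypothetical, magnitude-based selection is an assumed pruning rule, and the metadata is kept as one small integer per retained value.

```python
import numpy as np


def compress_2_of_4(weights):
    """Keep the two largest-magnitude values in every group of four.

    Returns (values, indices): `values` holds half as many entries as the
    input, and `indices` records each kept value's position (0-3) in its group.
    """
    groups = weights.reshape(-1, 4)
    # Positions of the two largest magnitudes, sorted to keep original order.
    top_two = np.argsort(-np.abs(groups), axis=1)[:, :2]
    keep = np.sort(top_two, axis=1)
    values = np.take_along_axis(groups, keep, axis=1)
    return values, keep.astype(np.uint8)


def sparse_dot(values, indices, activations):
    """Dot product that reads only the activations the metadata selects."""
    act_groups = activations.reshape(-1, 4)
    selected = np.take_along_axis(act_groups, indices.astype(np.int64), axis=1)
    return float(np.sum(values * selected))


w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.8, -0.6, 0.1])
x = np.arange(1.0, 9.0)

vals, idx = compress_2_of_4(w)
result = sparse_dot(vals, idx, x)
print(vals, idx, result)
```

The dot product touches four weights and four activations instead of eight of each, which is the software analogue of the hardware skipping the zeroed positions.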

To understand why structured patterns are required for hardware speedup, consider how sparse matrices are actually stored in memory. The section on sparse matrix formats treats CSR and block-sparse storage formally; these formats make the constraint visible, since indices must be stored alongside values. Compare the storage layouts in figure 7. If the sparsity is random, the index overhead and irregular access patterns erase the gains from skipping zeros. Structured sparsity, whether at the large block scale or the fine-grained N:M scale, makes this indexing predictable and compact, allowing hardware to fetch data efficiently.
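A rough byte count shows how much the index overhead matters. The sketch below compares dense storage, CSR at 50 percent random density, and a 2:4 layout. All sizes are assumptions chosen for illustration: FP16 values, 32-bit CSR indices, and 2 bits of position metadata per retained 2:4 value. The function names are hypothetical.

```python
def dense_bytes(rows, cols, value_bytes=2):
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density, value_bytes=2, index_bytes=4):
    """CSR: one value and one column index per nonzero, plus row pointers."""
    nnz = int(rows * cols * density)
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


def two_four_bytes(rows, cols, value_bytes=2, index_bits=2):
    """2:4: half the values plus a small position index per kept value."""
    kept = rows * cols // 2
    return kept * value_bytes + kept * index_bits // 8


rows = cols = 4096
dense = dense_bytes(rows, cols)
csr_ratio = csr_bytes(rows, cols, density=0.5) / dense
nm_ratio = two_four_bytes(rows, cols) / dense

print(f"CSR / dense: {csr_ratio:.3f}")
print(f"2:4 / dense: {nm_ratio:.4f}")
```

Under these assumptions CSR at 50 percent density takes more space than the dense matrix it replaces, while the 2:4 layout stays close to half, which is why the structured pattern rather than the zero count is what the hardware can exploit.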

Figure 7: **Sparse Storage Formats**: Hardware efficiency depends on how sparse matrices are stored. The four panels show dense storage (simple but wasteful for zeros), CSR, block sparse, and the block-sparse BSR layout, which compress the matrix by storing only nonzero values plus the index of each stored block. The separate Non-zero Block Indices column is that index overhead: it is the price paid for skipping zeros, and structured sparsity (like N:M or blocks) keeps it predictable and compact so hardware can fetch data efficiently.

The 2:4 pattern illustrates a broader principle: hardware achieves efficiency not by computing zeros faster, but by never loading them in the first place. This insight connects sparsity to the memory wall, since structured patterns reduce memory traffic, which is where the real cost lies.

Beyond structured sparsity optimizations, different hardware architectures implement matrix operations through distinct computational structures. Systolic arrays represent one such approach that has proven particularly effective for AI workloads.

Systolic arrays

While tensor cores package matrix operations into structured computational units, systolic arrays provide an alternative approach optimized for continuous data flow and operand reuse. The core motivation for systolic architectures stems from the same energy constraint that drives accelerator design: minimizing the cost of memory accesses through architectural design. A simple energy comparison reveals why this architecture has become central to modern AI accelerators.

The systolic architecture improves energy efficiency by keeping operands local as work pulses through the array.

Napkin Math 1.1: The energy advantage of pulsing data

Scenario: The “Systolic” (heartbeat) metaphor is not just about timing; it reflects a decisive energy efficiency advantage. We can quantify the energy advantage of systolic dataflow over traditional vector units using the energy corollary:

Vector unit: Loads $A$, $B$, and $C$, computes $A \times B + C$, writes $C$.
- Data movement: 3 loads + 1 write = 4 DRAM accesses (per operation).
- Energy: ≈ 4 $\times$ 640 pJ + 1 pJ (compute) = 2561 pJ/op.
Systolic Array (128 $\times$ 128 size): Loads A and B once at the edges. Data “pulses” through 128 processing elements.
- Data movement: 2 loads per 128 operations = 0.016 DRAM accesses (per operation).
- Energy: ≈ 0.016 $\times$ 640 pJ + 1 pJ (compute) ≈ 11 pJ/op.

Systems insight: In this worked energy model, a systolic array is 232.8× more energy-efficient than a naive vector unit for large matrix multiplications.

Concretely, a $128{\times}128$ array can sustain up to 16,384 MACs/cycle with a large energy dividend by pulsing data through processing elements instead of repeatedly loading it from DRAM (Horowitz 2014; Jouppi et al. 2023).
This efficiency is what allows a Google TPU to pack 100,000+ MAC units into a single chip without melting.
Limitation: This “Energy Dividend” only pays out if the matrix is large enough to fill the array. For small matrices (common in real-time inference), the array is underused, and the energy efficiency drops back toward the vector unit baseline.
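The worked energy model above reduces to one formula, sketched below with the text's constants (640 pJ per DRAM access, 1 pJ per MAC). The function name and the `reuse=8` case are illustrative assumptions, the latter added to show the limitation for small matrices.

```python
DRAM_PJ = 640.0   # energy per DRAM access used in the text
MAC_PJ = 1.0      # energy per multiply-accumulate used in the text


def energy_per_op(dram_accesses_per_op):
    """Energy in pJ for one MAC plus its share of DRAM traffic."""
    return dram_accesses_per_op * DRAM_PJ + MAC_PJ


vector_pj = energy_per_op(4)            # 3 loads + 1 write per operation
systolic_pj = energy_per_op(2 / 128)    # 2 edge loads shared by 128 PEs
advantage = vector_pj / systolic_pj

# Assumed small-matrix case: operands reused across only 8 PEs.
underfilled_pj = energy_per_op(2 / 8)

print(vector_pj, systolic_pj, round(advantage, 1), underfilled_pj)
```

With only 8-fold reuse the cost rises from 11 pJ to 161 pJ per operation, which is the "energy dividend only pays out if the array is full" caveat expressed in numbers.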

A systolic array arranges processing elements in a grid pattern, where data flows rhythmically between neighboring units in a synchronized manner, enabling each operand to participate in multiple computations as it propagates through the array. This structured movement minimizes external memory accesses by maximizing local data reuse. A single weight value can contribute to dozens of operations as it moves through the processing elements, transforming the energy profile from memory-bound to compute-efficient execution.

Kung and Leiserson²⁶ (Kung and Leiserson 1979) first introduced systolic arrays, formalizing their use in parallel computing architectures for efficient matrix operations (Kung 1982). Unlike general-purpose execution units, systolic arrays exploit spatial and temporal locality by reusing operands as they propagate through the grid. Google’s TPU exemplifies this architectural approach: in the TPUv4, a $128{\times}128$ systolic array of multiply-accumulate units processes matrix operations by streaming data through the array in a pipelined manner (Jouppi et al. 2023). Figure 8 follows these data paths: a control unit feeds input buffers that stream data horizontally into the array, while the partial sums each cell produces flow vertically down to the accumulator chain at the bottom, which collects the finished results. Each processing element performs one multiply-accumulate per cycle and passes its operands to its neighbors, so a value loaded once is reused across an entire row or column rather than refetched from memory.

²⁶ Systolic Array: From Greek sustole (“contraction”), borrowed from cardiology where it describes the heart’s rhythmic pumping cycle. Kung and Leiserson chose the name because data pulses through the processing grid much as blood pulses through the circulatory system: each element contracts (computes) and pushes results to its neighbor in lock-step. This rigid rhythmic data path is the architecture’s core trade-off. It excels at dense matrix multiplication, where a single weight is reused for all 128 MAC operations in a TPUv4 array column and more than a hundred individual memory accesses are eliminated, but it proves inflexible for irregular workloads that cannot keep the fixed data path full.

Kung, Hsiang Tsung, and Charles E Leiserson. 1979. “Systolic Arrays (for VLSI).” Sparse Matrix Proceedings 1978 1: 256–82.

Kung, H. T. 1982. “Why Systolic Architectures?” Computer 15 (1): 37–46. https://doi.org/10.1109/mc.1982.1653825.

Figure 8: **Systolic Array Dataflow**: A control unit feeds input data streams into a grid of processing elements, each performing multiply-accumulate operations. Data flows horizontally and vertically through the array in a pipelined manner, maximizing operand reuse and minimizing memory access, as exemplified by Google’s TPUv4.

The tiling principle: Bridging graph and silicon

A fundamental mismatch exists between the computational graph (which sees a single 4,096 $\times$ 4,096 matrix multiplication) and the physical silicon (which possesses a fixed 128 $\times$ 128 systolic array). Bridging this gap requires tiling: the process of partitioning large tensor operations into “tiles” that fit exactly into the hardware’s fast local memory (SRAM or Scratchpad).

To process our 4,096-wide worked-example layer on a 128-wide systolic array, the compiler must decompose the output into 1,024 individual tiles (32 $\times$ 32), each accumulated over 32 steps along the shared dimension. This is not merely a software convenience; it is a physical requirement. Each tile is fetched from slow HBM, “staged” in fast SRAM, and then “pulsed” through the systolic array. Algorithm 1 states the loop nest a compiler emits for this decomposition: stream tiles of $A$ and $B$ on chip and accumulate their product into a tile of $C$ before writing it back.

The tile sizes are the lever. A larger tile reuses each loaded byte across more multiply-accumulate operations, raising the kernel’s arithmetic intensity and pushing it toward the compute-bound side of the roofline; the ceiling is how much of $A$, $B$, and $C$ fits in fast on-chip memory at once. This tiling pattern is the central mechanism behind high-performance ML systems. It allows the hardware to maintain high system efficiency $(\eta_{\text{hw}})$ by ensuring that for every byte loaded from main memory, the data is reused 128× within the systolic grid. An engineer who understands tiling understands the “silicon contract”: if a layer’s dimensions are not multiples of the tile size (for example, a width of 129 on a 128 array), the system pays a fringe tax in underutilized silicon, where 127 units sit idle while one unit finishes the “remainder” tile.
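The link between tile size and arithmetic intensity can be sketched for a square tile. This is a simplified model under stated assumptions: FP16 operands (2 bytes each), one $A$ tile and one $B$ tile loaded per block product, and the $C$ tile held on chip so its traffic is ignored. The function name is hypothetical.

```python
def tile_arithmetic_intensity(tile, bytes_per_element=2):
    """FLOP per byte loaded for one tile x tile x tile block product."""
    flops = 2 * tile ** 3                               # multiply + add per MAC
    bytes_loaded = 2 * tile ** 2 * bytes_per_element    # one A tile, one B tile
    return flops / bytes_loaded


for t in (16, 32, 64, 128):
    print(t, tile_arithmetic_intensity(t))
```

In this model intensity grows linearly with the tile edge, so doubling the tile doubles the work done per byte fetched, until the tiles no longer fit in on-chip memory.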

\begin{algorithm}
\caption{Tiled (blocked) matrix multiply}
\begin{algorithmic}
\Require $A \in \mathbb{R}^{M\times K}$, $B \in \mathbb{R}^{K\times N}$; tile sizes $T_M, T_N, T_K$
\Ensure $C = AB$
\For{each row tile $i_0$ (size $T_M$)}
  \For{each column tile $j_0$ (size $T_N$)}
    \State initialize the $C$-tile accumulator to zero on chip
    \For{each $k_0$ over $K$ in steps of $T_K$}
      \State load $A$-tile $[i_0,k_0]$ and $B$-tile $[k_0,j_0]$ on chip
      \State accumulate the tile product on chip \Comment{reuse $T_N$/$T_M$ per byte}
    \EndFor
    \State write the $C$-tile back to memory
  \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
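The loop nest in Algorithm 1 translates almost line for line into NumPy. The sketch below is a functional model only: NumPy slices stand in for on-chip tiles, the function name `tiled_matmul` and the test shapes are illustrative, and no claim is made about how a real compiler schedules the loads.

```python
import numpy as np


def tiled_matmul(A, B, tile_m=128, tile_n=128, tile_k=128):
    """Blocked matrix multiply following the row / column / k tile loop nest."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.float32)
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            # Accumulator for one C tile (smaller at the matrix fringe).
            acc = np.zeros((min(tile_m, M - i0), min(tile_n, N - j0)),
                           dtype=np.float32)
            for k0 in range(0, K, tile_k):
                a_tile = A[i0:i0 + tile_m, k0:k0 + tile_k]   # "load" A tile
                b_tile = B[k0:k0 + tile_k, j0:j0 + tile_n]   # "load" B tile
                acc += a_tile @ b_tile                       # accumulate on chip
            C[i0:i0 + tile_m, j0:j0 + tile_n] = acc          # write C tile back
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((300, 200)).astype(np.float32)
B = rng.standard_normal((200, 260)).astype(np.float32)
C = tiled_matmul(A, B)

# Output-tile count for the 4,096 x 4,096 example on 128 x 128 tiles.
output_tiles = (4096 // 128) ** 2
print(np.allclose(C, A @ B, atol=1e-3), output_tiles)
```

The test shapes are deliberately not multiples of 128, so the fringe tiles are exercised as well as the full ones.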

One extra dimension past the tile width tips utilization off a cliff.
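The size of that cliff is easy to quantify. The sketch below assumes a simple model in which a dimension is covered by whole tiles and any lane in a partially filled tile is idle. The function name is hypothetical.

```python
import math


def lane_utilization(dim, tile_width=128):
    """Fraction of allocated lanes doing useful work along one dimension."""
    tiles = math.ceil(dim / tile_width)
    return dim / (tiles * tile_width)


for dim in (128, 129, 256, 257, 512):
    print(dim, f"{lane_utilization(dim):.3f}")
```

Going from 128 to 129 drops utilization from 100 percent to about 50 percent, and from 256 to 257 it drops to about 67 percent, even though the useful work grew by less than 1 percent in each case.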

The systolic array architecture achieves computational efficiency through synchronized data movement across a structured grid of processing elements. Systolic arrays organize computation around four components:

Control unit: Coordinates timing and data distribution across the array, maintaining synchronized operation throughout the computational grid.
Data streams: Input matrices propagate through coordinated pathways where matrix A elements traverse horizontally while matrix B elements flow vertically through the processing grid.
Processing element grid: Individual processing elements execute multiply-accumulate operations on streaming data, generating partial results that accumulate toward the final computation.
Output collection: Results aggregate at designated output boundaries where accumulated partial sums form complete matrix elements.

Systems Perspective 1.1: Matching architecture to workload

The architects’ dilemma: Systolic arrays must choose which data to keep stationary (in registers) to minimize movement. This choice hard-codes the hardware’s preference for certain model types. Table 5 previews the three stationary-operand strategies and the workloads each favors; section 1.7 develops each one in full as a general mapping decision.

Table 5: Systolic-Array Dataflow Strategies: Three stationary-operand choices for systolic arrays, the reuse pattern each maximizes, and the workload class that benefits, showing how a fixed dataflow choice hard-codes an accelerator’s affinity for specific models.

Strategy	Stationary Item	Optimized For	Example Workload
Weight-Stationary	Weights ($W$)	High Reuse of Weights	CNNs (Conv2D): Filters are small and reused across the entire image.
Output-Stationary	Partial Sums ($C$)	High Reuse of Accumulators	Large Batch MatMul: Accumulating results for many inputs against a large weight matrix.
Input-Stationary	Inputs ($A$)	High Reuse of Activations	Transformers: The same activations feed many weight matrices across attention heads.

There is no “perfect” accelerator. A chip optimized for Weight-Stationary flow (like early TPUs) excels at CNNs where filters are small and heavily reused, but faces challenges with LLM inference at small batch sizes, where the weight matrix is read once per token with minimal reuse, pushing architectures toward output-stationary or hybrid dataflow patterns.

Because a systolic array physically fixes how data flows through the grid, the stationary-operand choice is an architectural commitment rather than an implementation detail: the decision made at chip design time determines which neural network operations will achieve high utilization and which will be starved for data.
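The three strategies in table 5 can be read as three loop orders over the same matrix multiplication. The sketch below is a toy model, not a description of any real chip: it assumes a single processing element with one stationary register and counts how often each operand must be brought into that element. The function name `matmul_dataflow` and the `loads` counters are hypothetical, introduced only for illustration.

```python
def matmul_dataflow(A, W, stationary):
    """Compute C = A @ W on one toy PE and count operand movements.

    A is M x K (inputs), W is K x N (weights), C is M x N (partial sums).
    The operand named by `stationary` is loaded once per element and held;
    the other two are moved on every multiply-accumulate.
    """
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    loads = {"A": 0, "W": 0, "C": 0}

    if stationary == "weight":
        for k in range(K):
            for n in range(N):
                w = W[k][n]              # weight held while all M inputs stream past
                loads["W"] += 1
                for m in range(M):
                    loads["A"] += 1
                    loads["C"] += 1
                    C[m][n] += A[m][k] * w
    elif stationary == "output":
        for m in range(M):
            for n in range(N):
                acc = 0                  # partial sum stays in the PE until complete
                loads["C"] += 1
                for k in range(K):
                    loads["A"] += 1
                    loads["W"] += 1
                    acc += A[m][k] * W[k][n]
                C[m][n] = acc
    elif stationary == "input":
        for m in range(M):
            for k in range(K):
                a = A[m][k]              # activation held while all N weights stream past
                loads["A"] += 1
                for n in range(N):
                    loads["W"] += 1
                    loads["C"] += 1
                    C[m][n] += a * W[k][n]
    else:
        raise ValueError(f"unknown strategy: {stationary}")
    return C, loads


# Hypothetical 4x3 inputs and 3x2 weights.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
W = [[1, 0], [0, 1], [1, 1]]
for strategy in ("weight", "output", "input"):
    C, loads = matmul_dataflow(A, W, strategy)
    print(strategy, loads)
```

All three orders produce the same result; only the movement counts differ. In this toy model the stationary operand is touched once per element while the other two are touched once per multiply-accumulate, which is the asymmetry a fixed dataflow bakes into silicon.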

The synchronized data flow ensures that matrix element $A[i,k]$ encounters corresponding $B[k,j]$ elements at precise temporal intervals, executing the multiply-accumulate operations required for matrix multiplication $C[i,j] = \sum_k A[i,k]\times B[k,j]$. This systematic reuse of operands across multiple processing elements substantially reduces memory bandwidth requirements by eliminating redundant data fetches from external memory subsystems.

Consider the multiplication of $2{\times}2$ matrices A and B within a systolic array. During the first computational cycle, element $A[0,0]=2$ propagates horizontally while $B[0,0]=1$ moves vertically, converging at processing element $\text{PE}(0,0)$ to execute the multiplication $2 \times 1 = 2$. In the subsequent cycle, the same $A[0,0]=2$ advances to $\text{PE}(0,1)$ where it encounters $B[0,1]=3$, computing $2 \times 3 = 6$. Concurrently, $A[0,1]=4$ enters $\text{PE}(0,0)$ to engage with the next B matrix element. This coordinated data movement enables systematic operand reuse across multiple computational operations, eliminating redundant memory accesses and exemplifying the efficiency principle underlying systolic array architectures.

Each processing element in the array performs a multiply-accumulate operation in every cycle. In the configuration shown here (matching the preceding example, where matrix $A$ flows horizontally and $B$ flows vertically):

Receives a weight value from the left (the $A$ matrix, flowing horizontally)
Receives an input activation from above (the $B$ matrix, flowing vertically)
Multiplies these values and adds to its running sum
Passes the weight value rightward and the input activation downward to neighboring elements

Actual data flow directions vary across implementations; some architectures reverse these roles or use weight-stationary configurations where weights are preloaded rather than streamed.
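The processing-element behavior listed above can be sketched as a cycle-by-cycle simulation. This is a minimal model of the output-stationary configuration in the running example, assuming that row $i$ of $A$ enters the left edge delayed by $i$ cycles and column $j$ of $B$ enters the top edge delayed by $j$ cycles. The function name `systolic_matmul` is hypothetical, and the second rows of both matrices are made-up values, since the text specifies only $A[0,0]$, $A[0,1]$, $B[0,0]$, and $B[0,1]$.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    With the skewed injection below, A[i][k] and B[k][j] meet at PE(i, j)
    on cycle i + j + k (cycles are numbered from 0).
    """
    n, depth, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]        # running sum held in each PE
    a_reg = [[None] * m for _ in range(n)]   # A value currently in each PE
    b_reg = [[None] * m for _ in range(n)]   # B value currently in each PE
    trace = []                               # (cycle, i, j, a, b) for every MAC

    for cycle in range(n + m + depth - 2):
        next_a = [[None] * m for _ in range(n)]
        next_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                   # left edge: inject skewed row of A
                    k = cycle - i
                    next_a[i][j] = A[i][k] if 0 <= k < depth else None
                else:                        # interior: receive A from the left
                    next_a[i][j] = a_reg[i][j - 1]
                if i == 0:                   # top edge: inject skewed column of B
                    k = cycle - j
                    next_b[i][j] = B[k][j] if 0 <= k < depth else None
                else:                        # interior: receive B from above
                    next_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = next_a, next_b

        # Every PE holding a valid pair performs one multiply-accumulate.
        for i in range(n):
            for j in range(m):
                a, b = a_reg[i][j], b_reg[i][j]
                if a is not None and b is not None:
                    acc[i][j] += a * b
                    trace.append((cycle, i, j, a, b))
    return acc, trace


A = [[2, 4], [1, 3]]   # first row from the text; second row is hypothetical
B = [[1, 3], [5, 7]]   # first row from the text; second row is hypothetical
C, trace = systolic_matmul(A, B)
print(C)
for event in trace[:3]:
    print(event)
```

The trace reproduces the walkthrough: on cycle 0, PE(0,0) multiplies 2 by 1, and on cycle 1 the same 2 has moved right to PE(0,1) to meet 3 while 4 enters PE(0,0). Each operand is fetched from the array edge once and then reused as it passes through neighboring elements.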

This structured computation model minimizes data movement between global memory and processing elements, improving both efficiency and scalability. As systolic arrays operate in a streaming fashion, they are particularly effective for high-throughput workloads such as deep learning training and inference.

While figure 8 captures the core dataflow principle, systolic architectures vary significantly across different accelerator designs in practice. Training-focused architectures like Google’s TPU employ large arrays ($128{\times}128$ or larger) optimized for high computational throughput, while inference-oriented designs found in edge devices prioritize energy efficiency with smaller configurations ($8{\times}8$ to $32{\times}32$).

The underlying principle remains consistent: data flows systematically through processing elements, with inputs moving horizontally and vertically to compute partial sums in a synchronized fashion. However, as detailed in section 1.4.1, practical effectiveness is ultimately constrained by memory bandwidth bottlenecks.

A $128{\times}128$ systolic array capable of 16,384 multiply-accumulate operations per cycle requires a continuous data feed to maintain utilization. Each cycle demands fresh input activations and weight parameters that must traverse from off-chip memory through on-chip buffers to the array edges. The TPUv4’s 1,200 GB/s HBM2 bandwidth enables high utilization, but even this substantial bandwidth becomes limiting when processing large transformer models where memory requirements exceed on-chip capacity.

The quantization techniques in Quantization and Precision reduce model memory footprint by converting FP32 weights to INT8 representations. This optimization directly addresses the memory bandwidth constraints identified here. Converting 32-bit floating-point weights to 8-bit integers can reduce weight traffic by 4$\times$; whether that changes a kernel from bandwidth bound to compute bound depends on the operation’s original arithmetic intensity, the accelerator’s INT8 ridge point (the intensity threshold at which its INT8 compute saturates), and the overhead of quantization and dequantization. Similarly, structured pruning removes entire rows or columns of weight matrices, reducing both the data volume that must traverse memory hierarchies and the computation required. These algorithmic optimizations prove valuable precisely because they target the memory bottleneck that limits accelerator performance in practice.
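To see why the 4$\times$ traffic reduction does not by itself settle the question, consider a rough sketch that computes the arithmetic intensity of a single linear layer at two precisions and compares it against a ridge point. Everything here is an illustrative assumption: the $4096{\times}4096$ layer, the batch sizes, the two ridge values, and the simplified traffic model in which each weight and activation crosses the memory interface exactly once and quantization overhead is ignored.

```python
def linear_layer_intensity(batch, in_features, out_features, bytes_per_value):
    """Arithmetic intensity (FLOP/byte) of y = x @ W for one linear layer.

    Toy traffic model: weights, inputs, and outputs each move once, all at
    the same width. Quantize/dequantize overhead is not modeled.
    """
    flops = 2 * batch * in_features * out_features
    weight_bytes = in_features * out_features * bytes_per_value
    activation_bytes = batch * (in_features + out_features) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def regime(intensity, ridge):
    """Classify a kernel relative to a ridge point given in FLOP/byte."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


RIDGE_FP32 = 300.0   # hypothetical ridge point, FLOP/byte
RIDGE_INT8 = 600.0   # hypothetical: INT8 peak is higher, so its ridge is too

for batch in (1, 512):
    fp32 = linear_layer_intensity(batch, 4096, 4096, bytes_per_value=4)
    int8 = linear_layer_intensity(batch, 4096, 4096, bytes_per_value=1)
    print(f"batch={batch}: "
          f"FP32 {fp32:.1f} FLOP/byte ({regime(fp32, RIDGE_FP32)}), "
          f"INT8 {int8:.1f} FLOP/byte ({regime(int8, RIDGE_INT8)})")
```

Under these assumptions, quantization multiplies intensity by four in both cases, yet at batch size 1 the layer stays far below either ridge, while at batch size 512 it crosses from memory bound to compute bound. The outcome depends on where the kernel started, not only on the size of the reduction.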

Numerics in AI acceleration

Systolic arrays and tensor cores achieve their efficiency partly through specialized support for reduced-precision arithmetic. This connection is direct: the 2$\times$ speedup of FP16 over FP32 is not merely a matter of “using fewer bits”; it reflects that accelerators can physically pack roughly 2$\times$ as many FP16 multiply-accumulate units into the same silicon area. Building on the quantization and mixed-precision techniques established in Model Compression, reduced precision becomes a hardware design decision: the numerical format determines the balance among accuracy, throughput, energy consumption, and data movement across SIMD and SIMT units, tensor cores, and systolic arrays.

Precision trade-offs

Lower precision is not free: each step down, from FP32 to FP16 to INT8, trades dynamic range and mantissa bits for the throughput and bandwidth gains just described. Hardware architects balance that trade-off when designing accelerator datapaths.

The evolution of AI hardware reflects this co-design between software optimization and hardware capability. Early GPU architectures supported only FP32 for deep learning workloads, but the precision-reduction strategies in Precision reduction strategies showed that reduced precision could maintain model accuracy, so hardware vendors responded by adding native support for FP16, BF16, and integer formats. This hardware evolution enables software optimizations to translate directly into performance gains, as reduced-precision operations execute on dedicated circuits optimized for those specific formats.

The transition from high-precision to lower-precision formats is deeply integrated into hardware execution models. As detailed in section 1.3.2, SIMD and SIMT units provide flexible support for multiple precisions. Tensor cores (section 1.3.3) accelerate computation using reduced-precision arithmetic, while systolic arrays (section 1.3.6) combine operand reuse with low-precision formats to reduce memory bandwidth demand.

Despite the advantages of reduced precision, deep learning models cannot always rely solely on low-bit representations. To address this challenge, modern AI accelerators implement mixed-precision computing, where different numerical formats are used at different stages of execution. These precision choices affect numerical reliability: matrix multiplications may be performed in FP16 or BF16, while accumulations are maintained in FP32 to prevent precision loss. Similarly, inference engines use INT8 arithmetic while preserving key activations in higher precision when necessary.

Mixed-precision computing

Modern AI accelerators increasingly support mixed-precision execution, allowing different numerical formats to be used at various stages of computation. Training workloads often use FP16 or BF16 for matrix multiplications, while maintaining FP32 accumulations to preserve precision (Micikevicius et al. 2017; Mellempudi et al. 2019). The software implementation of mixed-precision training, including loss scaling techniques and framework support, is covered in Mixed-precision training. Inference workloads, by contrast, optimize for INT8 or even INT4, achieving high efficiency while retaining acceptable accuracy.
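The reason accumulators stay in FP32 can be shown with a small experiment. The sketch below emulates rounding with Python's `struct` module (format `"e"` is IEEE 754 half precision, `"f"` is single precision) rather than running on real tensor-core hardware, so it illustrates the numerical effect only. The helper names and the choice of 10,000 addends of about 0.01 are assumptions made for the example.

```python
import struct


def round_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def round_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_accumulator):
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_accumulator(total + v)
    return total


# Inputs are FP16 in both cases; only the accumulator width changes.
values = [round_fp16(0.01)] * 10_000
reference = sum(values)                      # double-precision reference

fp16_total = accumulate(values, round_fp16)  # FP16 inputs, FP16 accumulator
fp32_total = accumulate(values, round_fp32)  # FP16 inputs, FP32 accumulator

print(f"reference : {reference:.4f}")
print(f"FP16 accum: {fp16_total:.4f}")
print(f"FP32 accum: {fp32_total:.4f}")
```

Once the FP16 running total grows large enough that the spacing between representable values exceeds twice the addend, each addition rounds back to the same number and the sum stops growing. The FP32 accumulator has enough mantissa bits to keep absorbing the small products, which is the behavior mixed-precision hardware relies on.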

Micikevicius, Paulius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, et al. 2017. “Mixed Precision Training.” arXiv Preprint arXiv:1710.03740.

Mellempudi, Naveen, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul. 2019. “Mixed Precision Training with 8-Bit Floating Point.” arXiv Preprint arXiv:1905.12334.

NVIDIA. 2020b. TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x.

NVIDIA Corporation. 2018. NVIDIA Tesla T4 Tensor Core GPU. NVIDIA product documentation.

The shift toward precision diversity is evident in the evolution of AI hardware. Early architectures such as NVIDIA Volta provided limited support for lower precision beyond FP16, whereas later architectures, including Turing and Ampere, expanded the range of supported formats. Table 6 traces this progression: Ampere GPUs introduced TF32 as a hybrid between FP32 and FP16 (NVIDIA 2020b), alongside broader support for BF16, INT8, and INT4 (NVIDIA Corporation 2017, 2018, 2020).

Newer architectures incorporate a growing diversity of numerical formats because different workloads bind at different points on the accuracy-throughput-energy trade-off. Precision support is therefore another form of workload matching, not a generic feature checklist.

The precision format used in hardware design has cascading implications across the entire system. Reducing from FP32 to FP16 cuts memory traffic in half, which matters far more than it might seem: because memory access dominates energy consumption, halving memory traffic can substantially reduce energy per inference when data movement is the bottleneck (Horowitz 2014). Simultaneously, tensor cores and systolic arrays can pack more lower-precision multiply-accumulate units into the same silicon area, raising peak throughput (Dally et al. 2021; Dally 2023). Integer formats push this further—INT8 arithmetic requires roughly 30$\times$ less energy than FP32 per operation, which is why inference-focused accelerators like the TPUv1 were built around INT8 from the start (Jouppi et al. 2017). The systems insight is that reduced precision does not merely “save bits”: it simultaneously relieves the memory bandwidth bottleneck and increases compute density, attacking both sides of the roofline at once.

Table 6: Precision Support Evolution: GPU architectures progressively expanded support for lower-precision data types, enabling performance gains and efficiency improvements in AI workloads. Early architectures primarily used FP32, while later generations incorporated FP16, BF16, INT8, and INT4 to accelerate both training and inference tasks.

Architecture	Year	Supported Tensor Core Precisions	Supported CUDA Core Precisions
Volta	2017	FP16	FP64, FP32, FP16
Turing	2018	FP16, INT8, INT4, INT1	FP64, FP32, FP16, INT8
Ampere	2020	FP64, TF32, BF16, FP16, INT8, INT4	FP64, FP32, FP16, BF16, INT8

As AI models continue to scale, precision support connects the compute primitive discussion back to the memory wall: lower-bit formats matter when they reduce the bytes moved and keep the hardware’s matrix engines fed. The remaining architectural question is how these execution units, precision formats, and memory paths integrate into complete accelerator systems. Architectural integration determines how efficiently computational primitives become usable accelerator throughput. SIMD lanes, tensor cores, and systolic arrays are building blocks, but their full-chip organization varies significantly across AI processors; the choice of execution units, their numerical precision support, and their connectivity shape how effectively hardware can scale for deep learning workloads.

Intra-node interconnects: Scaling the stack

Mastery of the single-machine stack requires understanding how bits move between GPUs and the CPU. In the 1–8 GPU regime, scaling is achieved through high-speed intra-node interconnects such as NVLink and host-to-device PCIe transfers that mitigate the memory wall. These links form a bandwidth taper: data-movement speed falls at each step away from the compute units, from on-package HBM through the GPU-to-GPU NVLink bridge down to the host PCIe link. The PCIe step is much slower than the accelerator-local memory and inter-GPU fabric, so any data path that touches the CPU can become a performance hazard, the “PCIe Wall” that NVLink exists to avoid (C. NVIDIA 2020). Section 1.4.5.1 develops this hierarchy quantitatively, where host-accelerator communication is the operative concern.

Intel Corporation. 2021a. “Intel Advanced Matrix Extensions (Intel AMX).” Intel Architecture Instruction Set Extensions Programming Reference.

Modern AI processors exhibit a range of design trade-offs based on their intended applications, and comparing their configurations reveals how deployment constraints drive architectural divergence. A training-optimized accelerator like the NVIDIA A100 packs many Streaming Multiprocessors with wide SIMD units and FP16 tensor cores because training throughput scales with aggregate multiply-accumulate capacity (NVIDIA Corporation 2020). Google’s TPUv4 makes a radically different bet: just two cores per chip, each containing massive BF16 systolic arrays, a design that trades programmer flexibility for efficiency on dense matrix multiplications (Jouppi et al. 2023). At the inference end, Intel’s Sapphire Rapids dedicates Advanced Matrix Extensions (AMX) tile engines to INT8 and BF16, reflecting the insight from Model Compression that inference models tolerate reduced precision (Intel Corporation 2021a). Mobile neural engines take this further by shrinking matrix engines into low-power SoC blocks, prioritizing energy efficiency per operation over peak throughput. Table 7 compares these architectural configurations.

The pattern across these configurations reveals a consistent engineering principle: each design sacrifices generality to optimize for its target workload’s dominant operation and precision. Training chips invest silicon in wide floating-point datapaths; inference chips trade precision for throughput; mobile chips trade throughput for energy efficiency. No single design dominates across all workloads, which is precisely why hardware selection depends on workload analysis rather than headline specifications.

Table 7: AI Processor Configurations: Modern AI processors prioritize different execution unit characteristics for specific workloads: NVIDIA A100 uses wide SIMD and tensor cores for training, Google TPUv4 emphasizes high-throughput BF16 matrix multiplication, Intel Sapphire Rapids focuses on INT8-optimized inference, and mobile NPUs prioritize low-power execution. These variations in SIMD width, tensor core size, and processing element count reflect the growing diversity in AI hardware architectures.

Processor	SIMD Width	Tensor Core Size	Processing Elements	Primary Workloads
NVIDIA A100	1024-bit	$4{\times}4{\times}4$ FP16	108 SMs	Training, HPC
Google TPUv4	128-wide	$128{\times}128$ BF16	2 cores/chip	Training
Intel Sapphire Rapids	512-bit AVX	$32{\times}32$ INT8/BF16	56 cores	Inference
Mobile NPU	CPU/GPU/DSP vectors	Small matrix tiles	Integrated NPU blocks	Mobile inference

Cost-performance analysis

Architectural specifications define computational potential, but practical deployment decisions require understanding cost-performance trade-offs across accelerator options. Raw computational metrics alone give an incomplete picture: the dominant constraint in modern AI acceleration is not compute capacity but data movement efficiency.

The energy differential established earlier (where memory access costs dominate computation) drives the entire specialized hardware revolution. This disparity helps explain why many accelerators achieve only a fraction of peak compute on memory-bound workloads, while architectures that maximize data reuse (for example, systolic arrays on dense matrix kernels) can sustain substantially higher utilization under favorable conditions.

Consider an organization choosing between “more of an older accelerator” vs. “fewer of a newer accelerator.” Peak FLOP/s can be misleading for transformer-style workloads with low arithmetic intensity, where training is often memory-bandwidth bound rather than compute bound. In such cases, bandwidth per dollar and achievable utilization can matter more than headline compute, so a newer accelerator with substantially higher bandwidth can deliver materially better sustained performance even if peak FLOP/s improves by a smaller factor.

These dynamics help explain the rapid adoption of newer accelerators despite higher unit prices. For memory-bound workloads, improvements in effective bandwidth (and the software stack’s ability to use it) can dominate real-world performance. Cloud deployment further complicates the analysis, as rental pricing, utilization, and operational overheads can change the break-even point between purchasing and renting hardware.

Table 8 provides representative cost-performance data for common accelerators. These figures are approximate and vary by vendor, region, and purchase volume; the key insight is the trend rather than the absolute numbers. The cost per TFLOP/s has improved substantially from V100 to newer accelerators, even as the absolute power requirement (TDP) has climbed to nearly 1,000 W for flagship units, reflecting the industry’s shift toward density over raw unit cost. The trend is not strictly monotonic at every generation under all precision modes: under the representative TF32 calculation shown here, H100’s price per TFLOP/s runs slightly above A100’s because list price scaled faster than TF32 throughput.

Table 8: Accelerator Cost-Performance Comparison: Hardware costs evaluated against representative peak computational capabilities for optimal deployment strategy selection. The precision modes differ by row, so price/performance entries are useful for trend intuition but are not an apples-to-apples FP16 comparison. Newer accelerators offer better price-performance ratios, though total cost of ownership includes power consumption, cooling requirements, and infrastructure costs. Prices are approximate list prices and vary by region and volume; TPU pricing estimated from cloud rates.

Accelerator	List Price	Representative Peak Throughput (precision shown)	Memory Bandwidth	Price/Performance
NVIDIA V100	~$10,000	125 TFLOP/s (FP16)	900 GB/s	$80/(TFLOP/s)
NVIDIA A100	~$15,000	312 TFLOP/s (FP16)	2,039 GB/s	$48.1/(TFLOP/s)
NVIDIA H100	~$25,000–30,000	494 TFLOP/s (TF32)	3,350 GB/s	~$50.6/(TFLOP/s)
Google TPUv4	~$8,000*	275 TFLOP/s (BF16)	1,200 GB/s	~$29.1/(TFLOP/s)
Intel Gaudi 2	~$12,000	865 TFLOP/s (FP8)	2,450 GB/s	$13.9/(TFLOP/s)

The table reveals several important patterns. First, price-performance generally improves across generations, though not monotonically under every price and precision assumption. Second, memory bandwidth often improves faster than the price-performance ratio suggests, making newer accelerators disproportionately valuable for memory-bound workloads. Third, the “best” accelerator depends heavily on workload characteristics: a transformer training workload that is memory-bandwidth bound may benefit more from H100’s 3,350 GB/s bandwidth than from raw FLOP/s improvements. Bandwidth consistently emerges as the deciding economic factor, which leads directly to the physical origin of the AI memory wall.
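The price/performance column of table 8, and the bandwidth-per-dollar view that the discussion above favors, can be recomputed directly from the table's own entries. The sketch below assumes the low end of the H100 price range and treats the approximate list prices as exact; the dictionary and function names are hypothetical. Because the precision modes differ by row, the outputs carry the same caveat as the table.

```python
# (list price in USD, peak TFLOP/s, memory bandwidth in GB/s), copied from Table 8.
ACCELERATORS = {
    "NVIDIA V100": (10_000, 125, 900),
    "NVIDIA A100": (15_000, 312, 2_039),
    "NVIDIA H100": (25_000, 494, 3_350),    # low end of the listed price range
    "Google TPUv4": (8_000, 275, 1_200),    # price estimated from cloud rates
    "Intel Gaudi 2": (12_000, 865, 2_450),
}


def dollars_per_tflops(price_usd, peak_tflops):
    """List price divided by peak throughput, in $/(TFLOP/s)."""
    return price_usd / peak_tflops


def bandwidth_per_kilodollar(price_usd, bandwidth_gbs):
    """Memory bandwidth bought per $1,000 of list price, in GB/s."""
    return bandwidth_gbs / (price_usd / 1_000)


for name, (price, tflops, bandwidth) in ACCELERATORS.items():
    print(f"{name:14s} "
          f"${dollars_per_tflops(price, tflops):5.1f}/(TFLOP/s)  "
          f"{bandwidth_per_kilodollar(price, bandwidth):6.1f} GB/s per $1k")
```

Ranking by bandwidth per dollar can reorder the candidates relative to ranking by price per TFLOP/s, which is the practical meaning of the third pattern above: the metric should follow the workload's bottleneck.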

Framework selection significantly impacts these economic decisions. Detailed hardware-framework optimization strategies are covered in ML Frameworks, while performance evaluation methodologies are discussed in Benchmarking.

The preceding sections revealed impressive computational machinery: vector units achieving 8$\times$ parallelism through SIMD execution, matrix operations processing 256 elements simultaneously, and tensor cores executing $16{\times}16{\times}16$ fused multiply-accumulate blocks as dedicated tile operations. An NVIDIA A100’s tensor cores can execute 312 TFLOP/s, and newer accelerators extend this trend with FP8 support for lower-precision deep learning workloads (NVIDIA Corporation 2020; Kuzmin et al. 2022; Micikevicius et al. 2022). At these rates, the pure arithmetic for a ResNet-50 forward pass could complete in microseconds.

Kuzmin, Andrey, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. 2022. “FP8 Quantization: The Power of the Exponent.” Advances in Neural Information Processing Systems 35, 14651–62. https://doi.org/10.52202/068431-1065.

Micikevicius, Paulius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, et al. 2022. “FP8 Formats for Deep Learning.” arXiv Preprint arXiv:2209.05433.

NVIDIA’s Blackwell (B200) architecture extends this trend by introducing native FP4 support, with NVIDIA reporting up to 9 PFLOP/s (dense) or 18 PFLOP/s (sparse) peak throughput in FP4 per chip (NVIDIA Corporation 2024). This continues the reduced-precision trend: as models grow, hardware adapts by trading precision for massive parallelism, requiring systems engineers to master progressively lower-bit numerics (FP8, FP4) to unlock the silicon’s full potential.

Yet real ResNet-50 inference takes milliseconds, not microseconds. The gap between theoretical capability and practical performance reveals the chapter’s central tension, first posed in the Purpose section: computational capability has outpaced our ability to feed data to processors. Moving data from memory costs orders of magnitude more energy than arithmetic, and memory bandwidth has improved more slowly than tensor arithmetic throughput. This disparity determines whether those 312 TFLOP/s translate into low sustained utilization or high sustained utilization on a particular workload (Horowitz 2014; Gholami et al. 2024).
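A back-of-envelope estimate makes the size of this gap concrete. The sketch below assumes the commonly quoted figure of roughly 4.1 billion multiply-accumulates for one ResNet-50 forward pass at $224{\times}224$ resolution, which is not stated in this chapter, and a hypothetical measured latency of 1 ms chosen only to represent "milliseconds". The 312 TFLOP/s peak is the A100 figure cited above.

```python
RESNET50_MACS = 4.1e9        # assumption: approximate MACs per forward pass
PEAK_FLOPS = 312e12          # A100 tensor-core peak cited in the text
MEASURED_LATENCY_S = 1e-3    # hypothetical stand-in for "milliseconds"


def arithmetic_only_time(macs, peak_flops):
    """Lower bound on runtime if arithmetic were the only cost (2 FLOP per MAC)."""
    return 2 * macs / peak_flops


ideal_s = arithmetic_only_time(RESNET50_MACS, PEAK_FLOPS)
implied_utilization = ideal_s / MEASURED_LATENCY_S

print(f"arithmetic-only time: {ideal_s * 1e6:.1f} microseconds")
print(f"implied utilization at 1 ms: {implied_utilization:.1%}")
```

Under these assumptions the arithmetic alone needs tens of microseconds, so a millisecond-scale latency corresponds to single-digit-percent use of peak compute. The remainder is spent waiting on data movement, kernel launches, and other overheads rather than on math.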

Understanding why this gap exists, and what architectural innovations address it, requires examining the memory systems that feed data to the compute primitives analyzed earlier. The memory hierarchy is not merely a supporting subsystem; it is the primary determinant of whether accelerators achieve their theoretical potential.

AI Memory Systems

ResNet-50 can expose the gap between accelerator arithmetic and accelerator memory: tensor cores may offer enormous low-precision throughput, but convolution weights, activations, and intermediate results still have to arrive on time. The execution units examined in previous sections (SIMD units, tensor cores, and systolic arrays) provide impressive computational throughput, with modern accelerators reaching hundreds of TFLOP/s or more for low-precision neural-network operations (NVIDIA Corporation 2020, 2024; Choquette 2023). Those theoretical capabilities remain unrealized when memory subsystems cannot supply data at sufficient rates. This constraint, termed the AI memory wall, is also the physical core of the systems gap that figure 3 charted: of all the ways model demand outruns hardware supply, the lag of memory bandwidth behind arithmetic throughput is the dominant component.

Unlike conventional workloads, ML models require frequent access to large volumes of parameters, activations, and intermediate results, leading to substantial memory bandwidth demands. This challenge intersects with the data management strategies covered in Data Engineering. Modern AI hardware addresses these demands through advanced memory hierarchies, efficient data movement techniques, and compression strategies that promote efficient execution.

Understanding the AI memory wall

The AI memory wall represents the primary bottleneck constraining modern accelerator performance: the growing disparity between computational throughput and memory bandwidth that prevents accelerators from achieving their theoretical capabilities. While compute units can execute trillions of operations per second through specialized primitives like vector operations and matrix multiplications, they depend critically on memory systems to supply the continuous stream of weights, activations, and intermediate results these operations require.

Definition 1.4: AI memory wall

The AI Memory Wall is the ML accelerator performance constraint that arises when arithmetic throughput $(R_{\text{peak}})$ outpaces memory bandwidth $(\text{BW})$.

Significance: It dictates that system performance is no longer bounded by FLOP/s, but by the energy and latency cost of moving data. Within the iron law, it is the point where the $\frac{D_{\text{vol}}}{\text{BW}}$ term dominates the total execution time $(T)$.
Distinction: Unlike a general-purpose memory wall, which affects all computing, the AI memory wall is driven by the massive model state and activation storage required by deep learning.
Common pitfall: A frequent misconception is that the memory wall is “fixed” by more memory. In reality, it is a bandwidth-latency gap: even with infinite capacity, the speed of moving data between memory and compute remains the fundamental physical bottleneck.

The underlying cause of this wall is physical: the Von Neumann²⁷ Bottleneck, which has constrained computing since 1945, makes moving data cost orders of magnitude more energy than processing it, and figure 9 shows why AI accelerators must prioritize data locality over raw arithmetic throughput.

²⁷ Von Neumann Bottleneck: The physical separation of the processor from its memory forces all instructions and data to traverse an energy-intensive bus. This distance is the direct cause of the high energy cost of data movement; every byte must be fetched, paying a physical tax. Accessing a value from external DRAM can cost over 20,000$\times$ more energy than performing an 8-bit integer operation on that value (Horowitz 2014).

Horowitz, Mark. 2014. “1.1 Computing’s Energy Problem (and What We Can Do about It).” 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14. https://doi.org/10.1109/isscc.2014.6757323.

Figure 9: **The Energy Hierarchy**: Energy cost per operation (Log Scale) based on the ‘Horowitz Numbers.’ Fetching data from off-chip DRAM costs ~128$\times$ more energy than an SRAM access and ~20,000$\times$ more than an INT8 addition. This stark physical disparity dictates that AI accelerators must prioritize data locality (keeping weights in SRAM/Registers) over raw arithmetic throughput to remain within power budgets.

Quantifying the compute-memory performance gap

The energy disparity that figure 9 captures grows more severe with each hardware generation. Over the past two decades, peak computational capabilities have grown substantially faster than DRAM bandwidth (Gholami et al. 2024). This divergence creates a widening gap where accelerators possess massive computational power but cannot access data quickly enough to use it. Representative high-end accelerators can deliver on the order of $10^3$ TFLOP/s of peak tensor throughput (for example, NVIDIA H100 delivering 989 TFLOP/s in FP16 or nearly 2,000 TFLOP/s in FP8) while providing approximately 3.35 TB/s of memory bandwidth (Choquette 2023). This implies that on the order of $10^2$ FLOP of work per byte moved is required to fully use the compute, which can exceed the arithmetic intensity of many practical neural network workloads.

Gholami, Amir, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, and Kurt Keutzer. 2024. “AI and Memory Wall.” IEEE Micro 44 (3): 33–39. https://doi.org/10.1109/mm.2024.3373763.

The memory wall manifests through three critical constraints. First, the energy disparity: accessing DRAM can consume orders of magnitude more energy than a multiply-accumulate operation (Horowitz 2014; Sze et al. 2017), which often shifts bottlenecks from raw compute to power and data movement. Second, the bandwidth limitation: even TB/s memory systems may not feed large parallel compute arrays continuously on memory-bound workloads, leaving compute underutilized. Third, the latency hierarchy: off-chip memory access can require hundreds of cycles, creating pipeline stalls that cascade through parallel execution units.

Hardware balance ($I_{\text{ridge}}$): The paradigm partition

Different paradigms inhabit different regions of this “memory wall.” We quantify this using the hardware balance $(I_{\text{ridge}})$, defined as the arithmetic intensity required to hide the cost of fetching one byte of data; the roofline literature calls this threshold the ridge point: \[ I_{\text{ridge}} = \frac{R_{\text{peak}}}{\text{BW}} \]

This ratio partitions the deployment spectrum into two distinct regimes. High-end accelerators like the NVIDIA H100 have a balance of $\approx 150$–$300$ FLOP/byte, making them “Bandwidth-Hungry” giants where the challenge is moving data fast enough to saturate the ALUs. In contrast, TinyML microcontrollers often have a balance of $< 10$ FLOP/byte, making them “Compute-Starved” but relatively bandwidth-efficient. This explains why an architecture that is efficient in the cloud (where we optimize for $\text{BW}$ limits) can be a disaster at the edge: the hardware balance has shifted under the model, transforming a memory-bound success into a compute-bound failure.
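The hardware balance can be evaluated directly from the H100 figures quoted in this chapter. The sketch below applies the standard roofline bound, attainable throughput $= \min(R_{\text{peak}},\ I \times \text{BW})$, to show how a workload's arithmetic intensity $I$ places it on either side of $I_{\text{ridge}}$. The function names and the two sample intensities are hypothetical choices for illustration.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Hardware balance I_ridge = R_peak / BW, in FLOP/byte."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline bound: limited by compute or by bandwidth times intensity."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


H100_BW = 3.35e12        # bytes/s, from the text
H100_FP16 = 989e12       # FLOP/s, dense FP16 figure from the text
H100_TF32 = 494e12       # FLOP/s, TF32 figure from Table 8

print(f"H100 FP16 ridge: {ridge_point(H100_FP16, H100_BW):.0f} FLOP/byte")
print(f"H100 TF32 ridge: {ridge_point(H100_TF32, H100_BW):.0f} FLOP/byte")

for intensity in (10, 1_000):   # hypothetical low-reuse and high-reuse kernels
    bound = attainable_flops(intensity, H100_FP16, H100_BW)
    share = bound / H100_FP16
    print(f"I = {intensity:5d} FLOP/byte -> {bound / 1e12:6.1f} TFLOP/s "
          f"({share:.0%} of FP16 peak)")
```

The two ridge values bracket the 150–300 range given above. A kernel at 10 FLOP/byte is capped by bandwidth at a small fraction of peak, while one at 1,000 FLOP/byte sits on the flat part of the roofline, where only additional compute helps.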

The divergence between these two scaling rates is sketched in figure 10. The gap between the compute curve and the bandwidth curve widens year over year, which is why memory bandwidth, not compute, is so often the binding constraint in AI acceleration. The plotted values are illustrative, chosen to emphasize the divergence trend.

Figure 10: **The Compute-Bandwidth Divergence**: Compute throughput (FLOP/s) and memory bandwidth (GB/s) plotted on a log scale (2000–2025). While arithmetic throughput has grown exponentially, bandwidth has improved more slowly. Values are illustrative to show the widening AI memory wall.

The imbalance has a direct architectural consequence visible in figure 11: the hardware ridge point has climbed sharply and remains high, pushing sparse and low-reuse operations further into the memory-bound regime on modern accelerators.

Figure 11: **The Rising Ridge**: Hardware arithmetic intensity (FLOP/byte) over time using dense FP16 tensor peaks and memory bandwidth from the local hardware constants. As compute capability grows faster than memory bandwidth, the ridge point rises from V100 through H100 and remains high on B200. This trend explains why architectures with high data reuse flourish while low-reuse workloads face a growing hardware tax.

Beyond performance limitations, memory access imposes a steep energy cost. Fetching data from off-chip DRAM consumes far more energy than performing arithmetic operations (Horowitz 2014). This inefficiency is particularly evident in machine learning models, where large parameter sizes, frequent memory accesses, and nonuniform data movement patterns exacerbate memory bottlenecks. The energy differential drives architectural decisions: Google’s TPUv1 achieved 30–80$\times$ better performance per watt than contemporary CPUs and GPUs on Google’s inference benchmarks by minimizing data movement through systolic arrays and large on-chip memory (Jouppi et al. 2017). These design choices demonstrate that energy constraints, not computational limits, often determine practical deployment feasibility.

Memory access patterns in ML workloads

To make these energy costs concrete, we can trace a single tensor through every level of the memory hierarchy during a real inference pass.

Lighthouse 1.2: Life of a tensor: GPU-hosted KWS

Recall the Keyword Spotting lighthouse summarized in Five lighthouse models. If the same one-second audio clip is batched on a GPU-hosted inference path, its physical journey through the memory hierarchy looks like this:

1. DRAM (HBM): The tensor starts here.
- Size: 16,000 samples $\times$ 2 bytes (FP16) = 32 KB.
- Latency: Fetching this from off-chip memory takes ~300 ns (plus queuing delay).
- Energy: Cost is ~20 pJ/bit. High cost.
2. L2 cache: The GPU’s DMA engine pulls it here.
- Latency: ~4 ns.
- Access: Shared across multiple Streaming Multiprocessors (SMs).
3. L1 cache/shared memory: A specific SM claims a tile of the audio.
- Latency: ~1 ns.
- Locality: Critical step. If the data leaves this level, we pay the “HBM Tax” again.
4. Registers: The Tensor Core operates here.
- Latency: ~0 ns (single cycle).
- Throughput: 312 TFLOP/s.
- Energy: Cost is ~0.1 pJ/bit.

Systems insight: The “Speed of Light” limit means we cannot compute faster than we can move data from Step 1 to Step 4. The roofline is determined by the bandwidth of the Step 1 $\rightarrow$ Step 2 link.
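To put numbers on the energy gap in this walkthrough, the following Python sketch multiplies the tensor size by the per-bit energy figures quoted above for DRAM and for registers. It is a back-of-the-envelope estimate that assumes the whole tensor is touched exactly once at each level; `access_energy_uj` is a hypothetical helper name.

```python
SAMPLES = 16_000             # one second of audio, from the walkthrough
BYTES_PER_SAMPLE = 2         # FP16
DRAM_PJ_PER_BIT = 20.0       # off-chip HBM access, from the walkthrough
REGISTER_PJ_PER_BIT = 0.1    # register access, from the walkthrough

tensor_bytes = SAMPLES * BYTES_PER_SAMPLE
tensor_bits = tensor_bytes * 8


def access_energy_uj(bits, pj_per_bit):
    """Energy in microjoules to touch `bits` bits once at a given cost per bit."""
    return bits * pj_per_bit * 1e-6  # 1 pJ = 1e-6 uJ


dram_uj = access_energy_uj(tensor_bits, DRAM_PJ_PER_BIT)
register_uj = access_energy_uj(tensor_bits, REGISTER_PJ_PER_BIT)
ratio = dram_uj / register_uj

print(f"tensor size: {tensor_bytes} bytes")
print(f"DRAM read:     {dram_uj:.4f} uJ")
print(f"register read: {register_uj:.4f} uJ")
print(f"DRAM costs about {ratio:.0f}x more energy per pass")
```

With the quoted per-bit costs, every avoidable trip back to HBM costs about two hundred times the energy of keeping the same data in registers.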

Beyond raw computational throughput, an accelerator’s efficiency depends on its ability to continuously supply data to processing units without stalls. Neural networks impose three concurrent demands on this data supply. Model parameters (weights and biases) may number in the billions, requiring efficient storage and streaming to maintain throughput. Intermediate activations produced at each layer must be temporarily held for subsequent operations, contributing to memory overhead in deep architectures. During training, backpropagation adds a third demand: storing and accessing gradients for every parameter, further increasing data movement volume between compute units and memory.

As models increase in size and complexity, improvements in memory capacity and bandwidth become increasingly important. Although specialized compute units accelerate operations like matrix multiplications, their overall performance depends on the continuous, efficient delivery of data to the processing elements. In large-scale applications such as natural language processing and computer vision, models often incorporate millions to billions of parameters (Brown et al. 2020), and achieving high performance requires minimizing delays and stalls caused by inefficient data movement between memory and compute units (Narayanan et al. 2021; Kwon and Rhu 2018).

Kwon, Youngeun, and Minsoo Rhu. 2018. “TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning.” Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 740–53. https://doi.org/10.1145/3352460.3358284.

One way to quantify this challenge is by comparing the data transfer time with the time required for computations. To do this, the following variables are defined: $D_{\text{vol}}$ is the total data volume (bytes), $\text{BW}$ is the available memory bandwidth (bytes/s), $O$ is the number of floating-point operations, $R_{\text{peak}}$ is the peak hardware throughput (FLOP/s), and $\eta_{\text{hw}}$ is the realized hardware utilization.

We can express the memory transfer time $T_{\text{mem}}$ and compute time $T_{\text{compute}}$ as: \[\begin{gather*} T_{\text{mem}} = \frac{D_{\text{vol}}}{\text{BW}} \\ T_{\text{compute}} = \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} \end{gather*}\]

When $T_{\text{mem}} > T_{\text{compute}}$, the system becomes memory bound. This imbalance forces the processing elements to spend more time waiting for data than performing computations, demonstrating the need for memory-optimized architectures and efficient data movement strategies to sustain high performance.
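The comparison between $T_{\text{mem}}$ and $T_{\text{compute}}$ can be sketched in a few lines of Python. The example workload (a matrix-vector product over a 4096 by 4096 FP16 weight matrix), the 2 TB/s bandwidth, the 312 TFLOP/s peak, and the 50 percent utilization are illustrative assumptions, and the function names are hypothetical.

```python
def transfer_time(d_vol_bytes, bw_bytes_per_s):
    """T_mem = D_vol / BW, in seconds."""
    return d_vol_bytes / bw_bytes_per_s


def compute_time(ops, r_peak_flops, eta_hw):
    """T_compute = O / (R_peak * eta_hw), in seconds."""
    return ops / (r_peak_flops * eta_hw)


def classify(d_vol_bytes, bw_bytes_per_s, ops, r_peak_flops, eta_hw):
    """Return the regime label together with both time estimates."""
    t_mem = transfer_time(d_vol_bytes, bw_bytes_per_s)
    t_compute = compute_time(ops, r_peak_flops, eta_hw)
    regime = "memory bound" if t_mem > t_compute else "compute bound"
    return regime, t_mem, t_compute


# Illustrative matrix-vector product: weight traffic dominates D_vol.
n = 4096
d_vol = n * n * 2   # FP16 weights, 2 bytes each
ops = 2 * n * n     # one multiply and one add per weight

regime, t_mem, t_compute = classify(d_vol, 2e12, ops, 312e12, 0.5)
print(f"T_mem = {t_mem * 1e6:.2f} us, T_compute = {t_compute * 1e6:.2f} us")
print(f"regime: {regime}")
```

In this assumed setting each weight is used for only two operations, so the transfer time exceeds the compute time by a wide margin and the kernel is memory bound.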

Figure 12 quantifies this disparity for specific public-count models and hardware generations, showing how model parameter counts have outpaced memory bandwidth improvements. The gap between these curves, from AlexNet’s 60 million parameters to publicly disclosed hundred-billion-parameter models, represents the engineering challenge that drives accelerator memory system design (Krizhevsky et al. 2012; Brown et al. 2020; Chowdhery et al. 2022; Dubey et al. 2024). Even high-bandwidth accelerators like NVIDIA’s B200 and AMD’s MI300X/MI325X-class devices cannot close this gap by bandwidth alone: bandwidth has improved far less than public frontier-model parameter counts over the same period (NVIDIA Corporation 2024; AMD 2023). Parameter counts for proprietary systems such as GPT-4 and Gemini are not officially disclosed, so they are not plotted as factual data points.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311.

Dubey, Abhimanyu, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

NVIDIA Corporation. 2024. NVIDIA Blackwell Architecture. NVIDIA product documentation.

AMD. 2023. AMD Instinct MI300X Accelerators. AMD product documentation.

Figure 12: **Model Size vs. Hardware Bandwidth**: Publicly disclosed model parameter counts and hardware memory bandwidth plotted from 2012 to 2025, showing how model growth from AlexNet to hundred-billion-parameter models has far outpaced bandwidth improvements across GPU and TPU generations.

Irregular memory access

Many ML workloads combine regular dense kernels with irregular memory pressure from sparsity, embedding lookups, variable sequence lengths, attention/KV-cache traffic, and small batches. The dense parts are exactly why accelerators work so well; the irregular parts are where standard caching mechanisms and memory hierarchies struggle, leading to increased memory latency and inefficient bandwidth utilization.

Comparing ML memory access patterns against traditional computing workloads reveals the scale of the challenge. Traditional workloads, such as scientific computing, general-purpose CPU applications, and database processing, typically exhibit well-defined memory access characteristics that benefit from standard caching and prefetching techniques. ML workloads, on the other hand, introduce highly dynamic access patterns (table 9) that challenge conventional memory optimization strategies.

Table 9: Memory Access Characteristics: Traditional workloads exhibit predictable, sequential memory access benefiting from standard caching, while machine learning workloads introduce irregular and dynamic patterns due to sparsity and data dependencies. These differences inform the design of memory systems that efficiently support modern AI applications.

Feature	Traditional Computing Workloads	Machine Learning Workloads
Memory Access Pattern	Regular and predictable (e.g., sequential reads, structured patterns)	Irregular and dynamic (e.g., sparsity, attention mechanisms)
Cache Locality	High temporal and spatial locality	Often low locality, especially in large models
Data Reuse	Structured loops with frequent data reuse	Sparse and dynamic reuse depending on layer type
Data Dependencies	Well-defined dependencies allow efficient prefetching	Variable dependencies based on network structure
Workload Example	Scientific computing (e.g., matrix factorizations, physics simulations)	Neural networks (e.g., CNNs, Transformers, sparse models)
Memory Bottleneck	DRAM latency, cache misses	Off-chip bandwidth constraints, memory fragmentation
Impact on Energy Consumption	Moderate, driven by FLOP-heavy execution	High, dominated by data movement costs

One key source of irregularity in ML workloads stems from batch size and execution order. The way input data is processed in batches directly affects memory reuse, creating a complex optimization challenge. Small batch sizes decrease the likelihood of reusing cached activations and weights, resulting in frequent memory fetches from slower, off-chip memory. Larger batch sizes can improve reuse and amortize memory access costs, but simultaneously place higher demands on available memory bandwidth, potentially creating congestion at different memory hierarchy levels. This delicate balance requires careful consideration of model architecture and available hardware resources.
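The effect of batch size on reuse can be made concrete with a small Python sketch that estimates the arithmetic intensity of a fully connected layer. It assumes an idealized case in which weights, inputs, and outputs each cross the memory interface exactly once, so it is an upper bound on reuse rather than a measurement; `fc_arithmetic_intensity` is a hypothetical helper name.

```python
def fc_arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOP per byte for a dense layer, assuming each tensor moves once."""
    flops = 2 * batch * d_in * d_out          # multiply and add per weight use
    elements_moved = d_in * d_out + batch * d_in + batch * d_out
    return flops / (bytes_per_elem * elements_moved)


intensities = {
    batch: fc_arithmetic_intensity(batch, 4096, 4096)
    for batch in (1, 8, 64, 256)
}
for batch, value in intensities.items():
    print(f"batch {batch:>3}: ~{value:.1f} FLOP/byte")
```

At batch size 1 the layer performs about one operation per byte of weights loaded. Larger batches reuse each loaded weight across more inputs, which raises the intensity and moves the layer toward the compute-bound regime.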

Different neural network layers interact with memory in distinct ways beyond batch size considerations. Convolutional layers benefit from spatial locality, as neighboring pixels in an image are processed together, enabling efficient caching of small weight kernels. Conversely, fully connected layers must stream large weight matrices with little reuse per element at small batch sizes, an access pattern that standard caching policies cannot exploit. Transformers introduce additional complexity, as attention mechanisms demand accessing large key-value pairs stored across varied memory locations. The dynamic nature of sequence length and attention span renders traditional prefetching strategies ineffective, resulting in unpredictable memory latencies.

Another factor contributing to irregular memory access is sparsity²⁸ in neural networks. Many modern ML models employ techniques such as weight pruning, activation sparsity, and structured sparsity to reduce computational overhead. However, these optimizations often lead to nonuniform memory access, as sparse representations necessitate fetching scattered elements rather than sequential blocks, making hardware caching less effective. Models that incorporate dynamic computation paths, such as Mixture of Experts and Adaptive Computation Time, introduce highly nondeterministic memory access patterns, where the active neurons or model components can vary with each inference step. This variability challenges efficient prefetching and caching strategies.

²⁸ Sparsity and Memory Irregularity: This irregularity arises because techniques like pruning and dynamic activations force memory controllers to gather scattered nonzero elements via indirect addressing, breaking the sequential access patterns that hardware caches and prefetchers depend on. The resulting trade-off is severe, as the latency penalty from these random, unpredictable memory accesses can easily negate the computational savings from performing fewer operations. Without specialized hardware support for structured sparsity, an unstructured sparse model can become entirely memory bound and run slower than its dense counterpart, even with over 90 percent of its weights removed.

These irregularities have measurable consequences. ML workloads often experience reduced cache efficiency, as activations and weights may not be accessed in predictable sequences. This leads to increased reliance on off-chip memory traffic, which slows down execution and consumes more energy. Irregular access patterns contribute to memory fragmentation, where the way data is allocated and retrieved results in inefficient use of available memory resources. The combined effect is that ML accelerators frequently encounter memory bottlenecks that limit their ability to fully use available compute power.

The irregular access patterns and memory wall constraints examined earlier create formidable challenges, but they also reveal optimization opportunities. Although individual memory accesses may appear unpredictable, ML workloads exhibit structured reuse patterns at a higher level: the same weights are applied across batch elements, the same kernels slide across spatial dimensions, and the same attention patterns recur across sequence positions. Hardware designers exploit these regularities through carefully structured memory hierarchies that maintain frequently accessed data close to compute units, even when the specific access sequence varies.

Memory hierarchy

Modern AI accelerators exploit these structured reuse patterns through multilevel memory hierarchies: rather than treating memory as a monolithic resource, they organize storage into distinct tiers optimized for different access patterns, reuse distances, and energy costs. While general-purpose computing contends with unpredictable memory access, ML workloads exhibit structured reuse that can be optimized through careful data organization across multiple memory levels.

At the top of the hierarchy, closest to the compute units, high-speed registers and caches ensure that operands can be accessed with minimal latency. At the bottom, large-capacity but slow storage devices provide long-term model storage. Between these extremes, intermediate memory levels, such as scratchpad memory, high-bandwidth memory, and off-chip DRAM, offer trade-offs between performance and capacity.

The key pattern in table 10 is that each step down the hierarchy trades roughly an order of magnitude in latency for an order of magnitude in capacity—and the energy cost of a memory access at any level dwarfs the energy cost of the arithmetic it feeds.

Table 10: Memory Hierarchy Trade-Offs: AI accelerators use a multilevel memory hierarchy to balance performance and capacity. Each level provides distinct latency, bandwidth, and capacity characteristics that dictate how neural network components (weights, activations, and intermediate results) should be allocated to minimize bottlenecks and maximize throughput.

Memory Level	Approx. Latency	Bandwidth	Capacity	Example Use in Deep Learning
Registers	~1 cycle	Highest	Few values	Storing operands for immediate computation
L1/L2 Cache (SRAM)	~1–10 ns	High	KB–MB	Caching frequently accessed activations and small weight blocks
Scratchpad Memory	~5–20 ns	High	MB	Software-managed storage for intermediate computations
High-Bandwidth Memory (HBM)	~100 ns	Very High	GB	Storing large model parameters and activations for high-speed access
Off-Chip DRAM (DDR, GDDR, LPDDR)	~50–150 ns	Moderate	GB–TB	Storing entire model weights that do not fit on-chip
Flash Storage (SSD/NVMe)	~100 µs–1 ms	Low	TB	Storing pretrained models and checkpoints for later loading

The hierarchy invites an apparently simple solution: build larger, faster off-chip memory and eliminate the need for on-chip SRAM entirely. That shortcut fails for a reason rooted in physics: signal propagation within and between chips imposes a hard latency floor.

Napkin Math 1.2: The speed of light limit

Problem: Why is on-chip SRAM necessary instead of fetching all data from HBM?

Physics:

Distance: On an H100-class 814 mm² die, signals travel ~20 mm.
Speed: Signals in silicon travel at $\approx 0.5c$ (half speed of light).
Latency: 20 mm takes $\approx 130 \text{ ps}$.
Clock cycle: At 2 GHz, a cycle is $500 \text{ ps}$.
DRAM: Off-chip HBM sits millimeters away on the package, but DRAM access latency plus protocol overhead = 100+ cycles.

Systems insight: Data cannot be fetched from DRAM in a single cycle. It is physically impossible. Local registers and SRAM (L1) are required to feed compute units at 2 GHz. The “memory wall” is partially a distance wall—and for transformer models this is the direct reason weights must be staged in SRAM in tiles rather than read in one pass from HBM: the round-trip latency to HBM is too long to sustain the systolic array’s pipeline. It is also why reading an entire KV-cache from HBM on every token generation step collapses inference throughput—the access pattern cannot be hidden behind arithmetic the way a tiled matmul can.
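The arithmetic in this napkin calculation can be reproduced with a short Python sketch. It uses the same assumptions as the text (signals at half the speed of light, a 20 mm path, a 2 GHz clock); the helper names are hypothetical.

```python
C_MM_PER_PS = 0.299792458  # speed of light in vacuum, millimeters per picosecond


def propagation_ps(distance_mm, fraction_of_c=0.5):
    """Time in picoseconds for a signal to cover `distance_mm`."""
    return distance_mm / (C_MM_PER_PS * fraction_of_c)


def cycle_ps(clock_ghz):
    """Duration of one clock cycle in picoseconds."""
    return 1000.0 / clock_ghz


wire_ps = propagation_ps(20.0)
clock_ps = cycle_ps(2.0)
print(f"cross-die propagation: ~{wire_ps:.0f} ps")
print(f"clock cycle at 2 GHz:  {clock_ps:.0f} ps")
print(f"fraction of a cycle spent on wire delay: {wire_ps / clock_ps:.2f}")
```

Under these assumptions, crossing the die consumes roughly a quarter of a clock cycle before any DRAM access or protocol overhead is counted, which is why operands must already sit in nearby registers or SRAM.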

On-chip memory

On-chip memory is the fast local storage located on or near the accelerator die, including registers, SRAM caches, and software-managed scratchpads. Each level of the memory hierarchy serves a distinct role in AI acceleration, with different trade-offs in speed, capacity, and accessibility. Registers, located within compute cores, provide the fastest access but can only store a few operands at a time. These are best used for immediate computations, where the operands needed for an operation can be loaded and consumed within a few cycles. However, because register storage is so limited, frequent memory accesses are required to fetch new operands and store intermediate results.

To reduce the need for constant data movement between registers and external memory, small but fast caches serve as an intermediary buffer. These caches store recently accessed activations, weights, and intermediate values, ensuring that frequently used data remains available with minimal delay. However, the size of caches is limited, making them insufficient for storing full feature maps or large weight tensors in machine learning models. As a result, only the most frequently used portions of a model’s parameters or activations can reside here at any given time.

For larger working sets, many AI accelerators include scratchpad memory, which offers more storage than caches but with a key difference: it allows explicit software control over what data is stored and when it is evicted. Unlike caches, which rely on hardware-based eviction policies, scratchpad memory enables machine learning workloads to retain key values such as activations and filter weights for multiple layers of computation. This capability is useful in models like convolutional neural networks, where the same input feature maps and filter weights are reused across multiple operations. By keeping this data in scratchpad memory rather than reloading it from external memory, accelerators can significantly reduce unnecessary memory transfers and improve overall efficiency (Chen, Emer, et al. 2017).

On NVIDIA GPUs, the hardware exposes scratchpad memory to programmers as shared memory: a fast, software-managed SRAM region that all threads in a thread block can read and write, distinct from the hardware-managed L1/L2 caches. Custom ML kernels written in CUDA or Triton control this memory explicitly. FlashAttention achieves its substantial throughput gains for transformer attention layers precisely by exploiting this mechanism: rather than materializing the full $S{\times}S$ attention score matrix in HBM, it tiles queries, keys, and values through SRAM/shared memory and writes only the final output back to HBM (Dao et al. 2022). The reduction in HBM round-trips, not a reduction in arithmetic operations, is the primary source of the speedup.
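A rough size estimate shows why the full attention score matrix cannot stay on-chip while a single tile can. The Python sketch below assumes FP16 scores, a 128 by 128 tile, and a 192 KB per-SM shared-memory budget; all three are illustrative assumptions, and the helper names are hypothetical. It also ignores the query, key, and value tiles and the running softmax statistics that a real tiled kernel must hold alongside the scores.

```python
def score_matrix_bytes(seq_len, bytes_per_elem=2):
    """Bytes needed to materialize one head's full S x S score matrix."""
    return seq_len * seq_len * bytes_per_elem


def score_tile_bytes(block_q, block_k, bytes_per_elem=2):
    """Bytes needed for one block_q x block_k tile of scores."""
    return block_q * block_k * bytes_per_elem


SRAM_BUDGET_BYTES = 192 * 1024  # assumed per-SM shared-memory budget

for seq_len in (1024, 4096, 16384):
    full_mib = score_matrix_bytes(seq_len) / 2**20
    print(f"S = {seq_len:>5}: full score matrix ~{full_mib:.0f} MiB per head")

tile_bytes = score_tile_bytes(128, 128)
print(f"128 x 128 tile: {tile_bytes // 1024} KiB, "
      f"fits in SRAM budget: {tile_bytes <= SRAM_BUDGET_BYTES}")
```

The full matrix grows quadratically with sequence length and quickly reaches hundreds of megabytes per head, while a tile stays at a fixed size that fits in shared memory regardless of sequence length.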

Example 1.2: The Tensor Core contract

Scenario: A transformer workload moves from older GPUs to NVIDIA A100 GPUs, expecting a large speedup from Tensor Cores.

Failure mode: Profiling shows that the Tensor Cores are barely active. The workload uses precision formats, dimensions, or custom kernels that do not match the hardware’s accelerated tensor-operation paths. Tensor Cores on A100s engage only for specific precision formats (such as FP16, BF16, or TF32). Because the code forces FP32 accumulation in a way the hardware does not support for acceleration, it falls back to the standard CUDA cores, which deliver roughly $1/16$ of the throughput.

Systems insight: Hardware features are brittle contracts. If the workload does not present supported data types and tile shapes, the accelerator falls back to generic execution. Hardware cannot be exploited without conforming to its contracts (NVIDIA Corporation 2020).

Off-chip memory

Once a model’s working set outgrows on-chip SRAM, the design question is no longer how fast each tier is but which off-chip tier the data lands in, because every step down table 10 trades latency for capacity. Model size sets the answer: weights that fit in HBM stream at a few TB/s; weights that overflow into commodity DRAM pay higher access latency on every fetch; and weights that live only on flash must be staged into faster memory before the accelerator can produce a single result. The tiers below describe what each level offers so that this capacity-versus-latency choice can be made deliberately rather than by default.

Beyond on-chip memory, high-bandwidth memory provides rapid access to larger model parameters and activations that do not fit within caches or scratchpad buffers. HBM achieves its high performance by stacking multiple memory dies and using wide memory interfaces, allowing it to transfer far more data per second than traditional DRAM at a broadly similar access latency (table 10). Because of its high bandwidth, HBM is often used to store entire layers of machine learning models that must be accessed quickly during execution. However, its cost and power consumption limit its use primarily to high-performance AI accelerators, making it less common in power-constrained environments such as edge devices.

When a machine learning model exceeds the capacity of on-chip memory and HBM, it must rely on off-chip DRAM, such as DDR, GDDR, or LPDDR. While DRAM offers significantly greater storage capacity, its access latency is higher, meaning that frequent retrievals from DRAM can introduce execution bottlenecks. To make effective use of DRAM, models must be structured so that only the necessary portions of weights and activations are retrieved at any given time, minimizing the impact of long memory fetch times.

At the bottom of the hierarchy, farthest from the compute units, flash storage and solid-state drives (SSDs) store large pretrained models, datasets, and checkpointed weights. These storage devices offer large capacities but are too slow for real-time execution, requiring models to be loaded into faster memory tiers before computation begins. For instance, in training scenarios, checkpointed models stored on SSDs must be loaded into DRAM or HBM before resuming computation, as direct execution from SSDs would be too slow to maintain efficient accelerator utilization (Narayanan et al. 2021).

The memory hierarchy thus balances competing objectives of speed, capacity, and energy efficiency. However, moving data through multiple memory levels introduces bottlenecks that limit accelerator performance. Data transfers between memory levels incur latency costs, particularly for off-chip accesses. Limited bandwidth restricts data flow between memory tiers. Memory capacity constraints force constant data movement as models exceed local storage. These constraints make memory bandwidth the primary determinant of real-world accelerator performance, a topic we examine next.

Memory bandwidth and architectural trade-offs

Advertised memory bandwidth is only a ceiling; achievable bandwidth depends on access pattern, batching, locality, and the host interface that feeds the accelerator. Modern accelerators exhibit distinct bandwidth-capacity trade-offs that directly shape which workloads they can serve efficiently. Representative data center accelerators provide memory bandwidth on the order of a few TB/s, often paired with tens of GB of high-bandwidth memory. Transformer attention, convolution, and fully connected layers each realize a different fraction of that peak because their reuse, tiling, and access regularity differ. Fully connected layers, for example, approach peak bandwidth only when batch sizes are large enough to amortize the cost of loading weight matrices, which connects directly to the batch-size sensitivity discussed in the following roofline analysis. The practical consequence is that an accelerator’s effective bandwidth for a specific workload may be well below its advertised peak, making achieved bandwidth per dollar on the target workload a more reliable purchasing metric than peak bandwidth alone.

As established earlier, an on-chip memory access typically costs from a few picojoules to a few tens of picojoules, while an external DRAM access can cost on the order of hundreds of picojoules, an orders-of-magnitude energy penalty. AI accelerators minimize DRAM access through three key strategies: weight stationarity (keeping model parameters in on-chip memory), input stationarity (buffering input activations locally), and output stationarity (accumulating partial sums on-chip).
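Output stationarity is the easiest of the three strategies to illustrate in software. The Python sketch below is a toy model, not an accelerator implementation: it computes the same matrix product twice and counts how often a partial sum would be written to off-chip memory when the accumulator is held locally versus when every update is spilled. The function names and the write counter are hypothetical.

```python
def matmul_output_stationary(a, b):
    """Accumulate each output locally; write it to 'DRAM' exactly once."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    dram_writes = 0
    for i in range(m):
        for j in range(n):
            acc = 0.0  # partial sum stays in a local register
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
            dram_writes += 1
    return c, dram_writes


def matmul_spilling(a, b):
    """Update the output in 'DRAM' on every multiply-accumulate."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    dram_writes = 0
    for p in range(k):
        for i in range(m):
            for j in range(n):
                c[i][j] += a[i][p] * b[p][j]
                dram_writes += 1
    return c, dram_writes


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_stationary, writes_stationary = matmul_output_stationary(a, b)
c_spilling, writes_spilling = matmul_spilling(a, b)
print(c_stationary, writes_stationary, writes_spilling)
```

Both loops produce the same result, but the spilling version writes each output once per step along the reduction dimension, so its off-chip write traffic grows with that dimension while the output-stationary version's does not.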

Memory bandwidth scaling follows different trajectories across accelerator designs. GPU architectures scale bandwidth by adding memory channels, reaching on the order of 1 TB/s in mainstream products and a few TB/s in high-end systems. TPU-class designs achieve their bandwidth efficiency through systolic array dataflow and aggressive on-chip reuse, often trading flexibility for efficiency on dense tensor kernels. Mobile system on chip (SoC) designs face the tightest constraints, typically delivering tens of GB/s of unified memory bandwidth within a few-watt power envelope, which demands careful workload scheduling and thermal management.

HBM provides far higher bandwidth than commodity DDR memory, but at substantially higher cost and packaging complexity. High-bandwidth accelerators therefore trade higher memory-system cost for higher sustained performance on bandwidth-bound workloads. Edge accelerators often sacrifice bandwidth to meet tight cost and power targets while maintaining sufficient performance for inference workloads.

These bandwidth characteristics directly influence deployment decisions: cloud training prioritizes raw bandwidth for maximum model capacity, edge inference optimizes bandwidth efficiency for energy constraints, and mobile deployment balances bandwidth with cost limitations. Beyond the accelerator’s internal memory system, however, data must also flow between the host CPU and the accelerator, introducing another potential bottleneck. This host-accelerator interface often becomes the unexpected chokepoint: even with 2 TB/s of HBM bandwidth on the accelerator, data must first traverse a PCIe link that provides only 64 GB/s, a 30$\times$ bandwidth reduction that can dominate total latency for small, frequent transfers (NVIDIA Corporation 2020; C. NVIDIA 2020).

Host-accelerator communication

Machine learning accelerators, such as GPUs and TPUs, achieve high computational throughput through parallel execution. However, their efficiency is often constrained by host-accelerator data movement between the CPU and accelerator memory. Compared to many traditional workloads that keep most data within a single memory domain, AI workloads can require frequent transfers between CPU memory and accelerator memory, introducing latency, consuming bandwidth, and affecting overall performance.

Bandwidth tapers steeply as data moves farther from the accelerator.

Host-accelerator data movement follows a structured sequence, shown in figure 13 for a GPU as the concrete accelerator (its “Memory for GPU” lane is the accelerator’s memory). Before computation begins, data is copied from CPU memory to the accelerator’s memory (step 1). The CPU then issues execution instructions (step 2), and the accelerator processes the data in parallel (step 3). Once computation completes, the accelerator writes its output to accelerator memory (the “Store results” arrow), and that result is copied back to the CPU (step 4). Consider the latency cost at every arrow: each transfer represents a potential bottleneck that must be managed to optimize end-to-end performance.

Figure 13: **Host-Accelerator Data Transfer**: AI workloads require frequent data movement between CPU memory and accelerators. The four sequential steps of copying input data, issuing execution instructions, parallel computation, and transferring results each introduce potential performance bottlenecks.

The key challenges in host-accelerator data movement include latency, bandwidth constraints, and synchronization overheads. The efficiency of ML accelerators depends not only on their computational power but also on the continuous supply of data. Even high-performance GPUs and TPUs remain underutilized if data transfers are inefficient. Host and accelerator memory exist as separate domains, requiring explicit transfers over interconnects such as PCIe, NVLink, or proprietary links. Ineffective data movement causes execution stalls, making transfer optimization a priority.

Node-level interconnect topology

To optimize data movement, we must understand the physical topology of the compute node. A typical AI server is not a flat mesh of connected devices but a hierarchy of bandwidths that tapers as we move away from the chip.

At node level, three links define the bandwidth taper:

Device-device interconnect (NVLink/Infinity Fabric): Modern multi-GPU nodes use specialized high-speed bridges like NVLink²⁹ to connect accelerators directly, bypassing the host CPU. Bandwidth ranges from 600 GB/s to 900 GB/s per GPU (C. NVIDIA 2020; Choquette 2023). This link matters whenever tensors must move between accelerators within one server, including model partitioning, activation exchange, and gradient synchronization during training. The hardware lesson for this chapter is the boundary: traffic that stays on the accelerator fabric is far cheaper than traffic that falls back through the host.
Host-device interconnect (PCIe): The link between the CPU and the accelerator. Bandwidth ranges from 32 to 64 GB/s (PCIe Gen4/Gen5). This link represents the “Data Loading Bottleneck”: all training data must pass through this thin pipe. Even with eight GPUs providing 5 TB/s of aggregate compute bandwidth, the system is fed by a single ~64 GB/s PCIe switch.
Node-network interconnect (NIC): The link to the outside world, connecting to other nodes. Bandwidth ranges from 25 to 50 GB/s (200 Gb/s to 400 Gb/s Ethernet/InfiniBand³⁰). This interconnect is the first step from single-node hardware reasoning into the next frontier: scale. Here, the point is that leaving the node moves traffic onto a much narrower and higher-latency path.

²⁹ NVLink (NVIDIA Link): This direct GPU-to-GPU interconnect exists to keep accelerator-to-accelerator traffic off PCIe when tensors must move inside a server. Its 600–900 GB/s aggregate bandwidth is close to an order of magnitude more, per direction, than the standard PCIe bus, so workloads with frequent cross-device tensor movement can remain in the fast part of the bandwidth taper (C. NVIDIA 2020; Choquette 2023). The training algorithms that determine how much tensor traffic must cross this link are developed later; the hardware fact needed here is the bandwidth gap.

NVIDIA, Corporation. 2020. “NVLink: Scalable High-Performance Interconnect.” NVIDIA Technical Report 2020.

Choquette, Jack. 2023. “NVIDIA Hopper H100 GPU: Scaling Performance.” IEEE Micro 43 (3): 9–17. https://doi.org/10.1109/mm.2023.3256796.

³⁰ InfiniBand: Its key feature for multi-node scaling is RDMA (Remote Direct Memory Access), which allows a GPU in one node to access memory in another directly, bypassing the host CPU. Without RDMA, host involvement and protocol overhead can become part of the gradient-synchronization path. RDMA reduces that overhead so scaling is more often constrained by raw physical bandwidth, topology, and collective implementation rather than CPU-managed packet processing.

These three levels produce a characteristic bandwidth taper:

\[\begin{aligned} \text{HBM (3350 GB/s)} &\gg \text{NVLink (900 GB/s)} \\ &\gg \text{PCIe (64 GB/s)} > \text{Network (50 GB/s)} \end{aligned}\]

System efficiency depends on keeping data as high up this hierarchy as possible. Once data drops to PCIe or network speeds, it encounters a 30–100$\times$ slowdown, so placement and scheduling decisions must prevent avoidable host and network crossings.
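The taper can be turned into transfer times with a short Python sketch. It uses the four bandwidth figures from the inequality above and assumes ideal, fully sustained bandwidth with no latency or protocol overhead, so real transfers will be slower; the helper names are hypothetical.

```python
# Bandwidths in GB/s, taken from the taper above.
LINK_BANDWIDTH_GBS = {
    "HBM": 3350.0,
    "NVLink": 900.0,
    "PCIe": 64.0,
    "Network": 50.0,
}


def transfer_ms(size_gb, bandwidth_gbs):
    """Idealized time in milliseconds to move `size_gb` gigabytes."""
    return size_gb / bandwidth_gbs * 1e3


def slowdown_vs_hbm(link):
    """How many times slower a link is than HBM at peak bandwidth."""
    return LINK_BANDWIDTH_GBS["HBM"] / LINK_BANDWIDTH_GBS[link]


for link, bandwidth in LINK_BANDWIDTH_GBS.items():
    print(f"{link:>7}: 1 GB in {transfer_ms(1.0, bandwidth):6.2f} ms "
          f"({slowdown_vs_hbm(link):.1f}x slower than HBM)")
```

With these figures, moving the same gigabyte takes well under a millisecond from HBM but tens of milliseconds once it must cross PCIe or the network.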

The host-accelerator sequence in figure 13 begins with step (1), where data is copied from CPU memory to accelerator memory, as GPUs cannot directly access host memory at high speeds. A direct memory access (DMA)³¹ engine typically handles this transfer without consuming CPU cycles. In step (2), the CPU issues execution commands via APIs like CUDA, ROCm, or OpenCL. Step (3) involves parallel execution on the accelerator, where stalls can occur if data is not available when needed. Finally, in step (4), computed results are copied back to CPU memory for further processing.

³¹ DMA (Direct Memory Access): A dedicated hardware unit that manages the data copy (step 1) without direct CPU management, freeing the CPU to immediately issue computation commands (step 2). This concurrency is critical: without it, the accelerator can idle between compute batches, especially when host-to-device movement is on the critical path.

Latency and bandwidth limitations directly impact AI workloads. PCIe-class host interconnects are typically much slower than an accelerator’s on-package high-bandwidth memory, so large transfers can become bottlenecks, particularly in deep learning tasks. Synchronization overheads compound this problem when computation must wait for data transfers to complete. Efficient scheduling and overlapping transfers with execution are necessary to mitigate these inefficiencies.

Transfer optimization

The bandwidth taper described earlier creates a clear optimization hierarchy. Practitioners have two complementary strategies for mitigating transfer overheads: asynchronous data movement and unified memory abstraction.

DMA engines enable the first strategy by offloading data transfers from the CPU entirely. While computation proceeds on the accelerator, a DMA engine copies the next batch of training data from host memory into accelerator memory in the background. This overlap of computation and communication is essential for maintaining high utilization: without it, the accelerator can idle during transfers whenever the input pipeline or host link becomes the critical path.
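A simple timing model shows why the overlap matters. The sketch below compares a serial schedule (copy, then compute, for every batch) with a double-buffered schedule in which the copy of the next batch runs during the compute of the current one. The function names and per-batch times are hypothetical, and the model assumes a single copy engine and constant per-batch costs.

```python
def serial_time(n_batches: int, t_copy: float, t_compute: float) -> float:
    """Every batch waits for its own copy before computing."""
    return n_batches * (t_copy + t_compute)


def overlapped_time(n_batches: int, t_copy: float, t_compute: float) -> float:
    """Double buffering: the copy of batch i+1 overlaps the compute of batch i.

    The first copy and the last compute cannot be hidden; in between,
    each step costs whichever of the two stages is slower.
    """
    if n_batches == 0:
        return 0.0
    return t_copy + (n_batches - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical per-batch costs in milliseconds.
n, t_copy, t_compute = 100, 4.0, 10.0
print(serial_time(n, t_copy, t_compute))      # 1400.0
print(overlapped_time(n, t_copy, t_compute))  # 1004.0
```

When `t_copy` exceeds `t_compute`, the `max` term is set by the host link instead of the accelerator, which is the case the paragraph above describes as the transfer sitting on the critical path.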

Unified Memory provides the second strategy, offering a single address space accessible by both CPU and accelerator. Rather than requiring explicit copies, the runtime migrates memory pages on demand when either processor accesses them. The programming model simplifies dramatically (a single malloc replaces complex staging logic), but introduces performance unpredictability. Page migrations triggered by access patterns can cause latency spikes, and small or scattered accesses may thrash pages back and forth across the interconnect. For this reason, production training workloads typically use explicit DMA-based transfers for predictable performance, while Unified Memory finds its niche in prototyping and workloads where development speed outweighs absolute throughput.

These overheads (interconnect latency, bandwidth taper, and synchronization delays) are not merely implementation details. They directly shape how neural network architectures interact with hardware, because different model types create dramatically different memory pressure patterns. A convolutional layer processing images exhibits regular spatial locality that maps well to tiled prefetching, while a transformer’s attention mechanism requires accessing distant tokens across long sequences, stressing bandwidth in qualitatively different ways.

Model memory pressure

Model architecture determines which memory term binds. While multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and transformer networks each require large parameter sets, their distinct access patterns create different pressure on weights, activations, bandwidth, and host transfers, so each demands a different accelerator optimization strategy.

To ground this analysis, we return to the Lighthouse Models introduced in Lighthouse Models as Reference Workloads: ResNet-50 represents CNN workloads with high spatial reuse, GPT-2/Llama exemplifies transformer memory pressure, DLRM illustrates sparse embedding lookups that stress memory systems differently than dense operations, and MobileNetV2 demonstrates efficiency-optimized architectures with depthwise convolutions. These examples will recur throughout the remainder of this chapter as we analyze how memory characteristics translate to hardware utilization.

Multilayer perceptrons

MLPs, also referred to as fully connected networks, are among the simplest neural architectures. Each layer consists of a dense matrix multiplication, requiring every neuron to interact with all neurons in the preceding layer. This results in high memory bandwidth demands, particularly for weights, as every input activation contributes to a large set of computations.

From a memory perspective, MLPs rely on large, dense weight matrices that frequently exceed on-chip memory capacity, necessitating off-chip memory accesses. Since accelerators cannot directly access host memory at high speed, data transfers must be explicitly managed via interconnects such as PCIe or NVLink. These transfers introduce latency and consume bandwidth, affecting execution efficiency.

Despite their bandwidth-heavy nature, MLPs exhibit regular and predictable memory access patterns, making them amenable to optimizations such as prefetching and streaming memory accesses. Dedicated AI accelerators mitigate transfer overhead by staging weight matrices in fast SRAM caches and overlapping data movement with computation through direct memory access engines, reducing execution stalls. These optimizations allow accelerators to sustain high throughput even when handling large parameter sets (Chen, Emer, et al. 2017).

Convolutional neural networks

Convolutional Neural Networks (CNNs) are widely used in image processing and computer vision tasks. Unlike MLPs, which require dense matrix multiplications, CNNs process input feature maps using small filter kernels that slide across the image. This localized computation structure results in high spatial data reuse, where the same input pixels contribute to multiple convolutions.

CNN accelerators benefit from on-chip memory optimizations, as convolution filters exhibit extensive reuse, allowing weights to be stored in fast local SRAM instead of frequently accessing off-chip memory. However, activation maps require careful management due to their size. Since accessing main memory over interconnects like PCIe introduces latency and bandwidth bottlenecks, CNN accelerators employ tiling techniques to divide feature maps into smaller regions that fit within on-chip buffers. This minimizes costly external memory transfers, improving overall efficiency (Chen, Emer, et al. 2017).

Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. 2017. “Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators.” IEEE Micro 37 (3): 12–21. https://doi.org/10.1109/mm.2017.54.

While CNN workloads are more memory-efficient than MLPs, managing intermediate activations remains a challenge. Accelerators use hierarchical caching strategies and DMA engines to optimize memory movement, ensuring that computations are not stalled by inefficient host-accelerator data transfers. These memory optimizations help CNN accelerators maintain high throughput by reducing reliance on off-chip memory bandwidth. Pioneering architectures like Eyeriss introduced row-stationary dataflows to maximize data reuse for convolutional workloads (Chen, Krishna, et al. 2017). Row-stationary is a convolution-specific hybrid in the broader dataflow taxonomy of section 1.7: it keeps rows and partial sums local when that pattern gives better reuse than a purely weight-, input-, or output-stationary mapping.

Transformer networks

The transformer architectures introduced in Transformers: Parallel Sequence Processing have become the dominant architecture for natural language processing and are increasingly used in other domains such as vision and speech recognition. Unlike CNNs, which rely on local computations, transformers use global attention³² mechanisms, in which each token in an input sequence can interact with all other tokens.

³² Attention Mechanism: Introduced to neural networks by Bahdanau, Cho, and Bengio in 2014, attention allows each token to interact with every other token in the input sequence. The hardware consequence is quadratic memory growth: attention scores for a sequence of length $S$ require an $S{\times}S$ matrix, so doubling sequence length quadruples memory consumption. This scaling drives both the KV-cache bottleneck in inference (see Memory and KV cache) and the development of memory-efficient alternatives like FlashAttention, which tiles the computation to avoid materializing the full attention matrix in HBM.

These models are particularly challenging for accelerators because global token interaction creates large attention state while GPT-3-scale language models (Brown et al. 2020) can exceed on-chip memory capacity through sheer parameter count. As a result, frequent movement between HBM, caches, and compute units creates substantial latency and bandwidth pressure. If the model spills beyond accelerator memory or uses host offload, PCIe or NVLink transfers add another bottleneck. Unified Memory architectures can mitigate some programming complexity by handling movement between host and device memory at runtime, but they introduce additional latency when page migrations occur unpredictably. These pressures make high-bandwidth memory, tensor tiling, and memory partitioning central accelerator design concerns for transformer workloads.
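The growth of attention state with sequence length can be checked with a short calculation. The sketch below counts the bytes needed to materialize one layer's $S{\times}S$ score matrices; the function name and the batch, head, and precision values are illustrative assumptions, not figures from a specific model.

```python
def attention_score_bytes(batch: int, heads: int, seq_len: int,
                          bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize one layer's S x S attention scores."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


# Hypothetical configuration; FP16 scores (2 bytes per element).
for s in (512, 1024, 2048):
    mb = attention_score_bytes(batch=32, heads=12, seq_len=s) / 1e6
    print(f"S={s:5d}: {mb:8.1f} MB")
```

Each doubling of the sequence length quadruples the footprint, which is why tiled approaches that avoid materializing the full matrix matter more as sequences grow.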

Attention caching mechanisms and specialized tensor layouts further reduce redundant memory fetches, improving execution efficiency. Given the bandwidth limitations of traditional interconnects, NVLink-enabled architectures offer clear advantages for large-scale transformer training, as they provide higher throughput and lower latency compared to PCIe. DMA-based asynchronous memory transfers enable overlapping computation with data movement, reducing execution stalls (Narayanan et al. 2021).

Narayanan, Deepak, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, et al. 2021. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–15. https://doi.org/10.1145/3458817.3476209.

Accelerator design implications

The diverse memory requirements of MLPs, CNNs, and transformers highlight the need for workload-specific accelerator design. Table 11 reveals how memory access patterns vary dramatically across model types.

Table 11: ML Model Memory Access: Different machine learning models exhibit distinct memory access patterns and bottlenecks due to variations in weight size, activation reuse, and data movement. Standard dense transformers demand high bandwidth and capacity because large weights, KV caches, and attention traffic dominate memory pressure; sparse MoE or pruned variants add further routing and sparsity considerations. CNNs benefit from spatial locality and high activation reuse, reducing memory pressure.

Model Type	Weight Size	Activation Reuse	Memory Access Pattern	Primary Bottleneck
MLP (Dense)	Large, dense	Low	Regular, sequential (streamed)	Bandwidth (off-chip)
CNN	Small, reused	High	Spatial locality	Feature map movement
Transformer	Large, usually dense; sparse in MoE/pruned variants	Low-to-medium	Mostly regular GEMM plus KV-cache/attention traffic	Memory capacity + bandwidth

Each model type presents unique challenges that directly impact accelerator design. MLPs benefit from fast streaming access to dense weight matrices, making memory bandwidth a critical factor in performance, especially when transferring large weights from host memory to accelerator memory. CNNs, with their high activation reuse and structured memory access patterns, can exploit on-chip caching and tiling strategies to minimize off-chip memory transfers. Transformers, however, impose heavy demands on both bandwidth and capacity: attention mechanisms require frequent access to large key-value matrices, generating high interconnect traffic and substantial memory pressure.

To address these challenges, modern AI accelerators incorporate multi-tier memory hierarchies that balance speed, capacity, and energy efficiency. On-chip SRAM caches and scratchpad memories store frequently accessed data, while high-bandwidth external memory provides scalability for large models. Efficient interconnects, such as NVLink, help alleviate host-accelerator transfer bottlenecks, particularly in transformer workloads where memory movement constraints can dominate execution time.

As ML workloads continue to grow in complexity, memory efficiency becomes as critical as raw compute power. The analysis reveals how memory systems dominate accelerator performance: DRAM access has 100$\times$ or higher energy cost than on-chip arithmetic, carefully structured memory hierarchies can improve effective bandwidth substantially, and different neural network architectures create distinct memory pressure patterns. These constraints (bandwidth limitations, energy costs, and communication overheads) determine whether theoretical computational capabilities translate into real-world performance. The remaining question is whether a specific workload is limited by compute or memory on a given accelerator. The memory wall analysis establishes why memory matters, but practitioners need a quantitative framework to predict which operations will bottleneck on a specific hardware configuration. Without such a framework, optimization becomes guesswork: engineers might spend weeks optimizing compute throughput for an operation that was memory-bound all along.

Self-Check: Question

What is the central quantitative claim of the “AI memory wall” as the section frames it?
1. Compute throughput has scaled faster than memory bandwidth for multiple accelerator generations, so an increasing share of ML workload time and energy is spent moving data rather than computing on it
2. Accelerators have too few arithmetic units to keep up with model demand, so the primary investment direction is adding more MAC units per chip
3. Adding HBM capacity automatically resolves memory-bound workloads by giving models more storage to work with
4. Only CPUs suffer from memory bottlenecks; accelerators avoid them architecturally by integrating compute and memory on one die

Why is on-chip SRAM indispensable on an accelerator that already has multi-TB/s HBM?
1. HBM access takes tens to hundreds of nanoseconds of round-trip latency, too many cycles for arithmetic units to wait on directly; SRAM delivers operands at single-cycle latency, bridging the gap between HBM’s bandwidth and the arithmetic units’ cycle-scale demand
2. HBM cannot hold model parameters, so SRAM is needed to store weights that do not fit elsewhere
3. HBM is only used by CPUs and bypassed by accelerator kernels, which run entirely from SRAM
4. SRAM provides more total capacity than HBM, making it the correct tier for bulk storage

Compare the memory-pressure profile that CNNs impose on an accelerator with the profile that large transformers impose, and explain how this comparison changes accelerator-selection priorities.

Order the following storage and communication tiers from fastest operand delivery (cycles) to slowest (hundreds of thousands of cycles or more) for an accelerator executing a tensor workload: (1) HBM device memory, (2) L2 or shared on-chip cache, (3) register file, (4) PCIe transfer from host memory.

An eight-GPU node uses NVLink between GPUs and multi-TB/s HBM per GPU. The team observes that their input pipeline spends most of its time transferring data from the host CPU rather than computing on the GPUs. Which link is most likely the bottleneck?
1. The PCIe host-device link, which is roughly one to two orders of magnitude slower than HBM and slower than NVLink, so host-fed pipelines often saturate PCIe before any on-device link is stressed
2. The HBM interface, because device-local memory is structurally slower than host DRAM on modern systems
3. The NVLink fabric between GPUs, because inter-GPU links always represent the slowest communication tier in a node
4. The register file, because registers cannot sustain streaming input data for deep-learning batches

Which model family most strongly stresses memory capacity plus interconnect bandwidth rather than compute throughput at inference time?
1. Large transformers with tens to hundreds of billions of parameters and linearly growing KV cache, where per-token work is dominated by reading weights across the hierarchy
2. Small CNNs with tight spatial locality and filter reuse, which fit comfortably in on-chip buffers
3. Standard image-classification CNNs on moderate input resolutions, which achieve high arithmetic intensity
4. Dense GEMM workloads at very large batch size, which amortize weight reads across many samples and become compute bound


Roofline Model

Low arithmetic intensity pins the workload in the memory-bound regime.

The roofline model answers this question by plotting arithmetic intensity against attainable performance, revealing whether each operation hits a compute ceiling or a memory bandwidth ceiling. Rather than relying on peak FLOP/s figures, which reflect marketing rather than achievable throughput, the roofline model maps a workload onto a specific hardware platform and exposes the binding constraint.

The roofline model³³ (Williams et al. 2009) provides the standard framework for understanding whether workloads are compute bound or memory bound, directly connecting the memory wall discussion to practical performance analysis. This model enables quantitative reasoning about accelerator utilization and guides optimization decisions.

³³ Roofline Model: Introduced by Williams et al. (2009) at UC Berkeley, building on earlier I/O complexity work from the 1980s. Their specific contribution was making the compute vs. bandwidth trade-off visual and actionable: the characteristic roofline plot immediately reveals whether a kernel is compute bound (hitting the flat ceiling) or memory bound (hitting the sloped bandwidth line) and quantifies the gap to hardware limits. A kernel operating at only 50 percent of its ceiling has a clear 2$\times$ utilization gap to close, making this the standard diagnostic tool for accelerator optimization.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://doi.org/10.1145/1498765.1498785.

Performance is bounded by two ceilings, as equation 2 formalizes. Here, attainable performance $R_{\text{attain}}$ and peak compute $R_{\text{peak}}$ are in FLOP/s (often reported as TFLOP/s), peak bandwidth $\text{BW}$ is in bytes/s (often TB/s), and arithmetic intensity $I$ is in FLOP/byte: \[R_{\text{attain}} = \min(R_{\text{peak}}, \text{BW} \times I) \tag{2}\]
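Equation 2 translates directly into a one-line function. The sketch below evaluates it for an A100-class part using the 312 TFLOP/s and 2.04 TB/s figures quoted elsewhere in this section; the function name and the sample intensities are illustrative choices.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes: float,
                     intensity: float) -> float:
    """Equation 2: the lower of the compute and bandwidth ceilings."""
    return min(peak_flops, bandwidth_bytes * intensity)


PEAK = 312e12  # FLOP/s (FP16/BF16)
BW = 2.04e12   # bytes/s

for intensity in (1.5, 31, 200, 762):
    r = attainable_flops(PEAK, BW, intensity)
    regime = "compute" if r >= PEAK else "memory"
    print(f"I={intensity:6.1f} FLOP/byte -> {r / 1e12:6.1f} TFLOP/s ({regime}-bound)")
```

The output of `min` tells you which ceiling binds: if it returns the peak, more bandwidth would not help, and if it returns the bandwidth product, more compute would not help.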

The key metric that determines which ceiling a workload hits is arithmetic intensity, the ratio of computation to memory traffic.

Definition 1.5: Arithmetic intensity

Arithmetic Intensity is the ratio of floating-point operations to bytes of memory traffic for a given computation ($\text{FLOP}/\text{byte}$), determining whether the workload is limited by compute throughput $(R_{\text{peak}})$ or memory bandwidth $(\text{BW})$ on a given accelerator.

Significance: The intensity threshold separating memory-bound from compute-bound regimes is the roofline ridge point: $R_{\text{peak}} / \text{BW}$. For an A100 (312 TFLOP/s FP16/BF16, 2.04 TB/s), the ridge point is roughly 153 FLOP/byte. Matrix multiplications around 100–200 FLOP/byte straddle that threshold, while a larger well-tiled 1024 $\times$ 1024 matrix multiply reaches about 341.3 FLOP/byte and is compute bound; a pointwise ReLU performs around 0.125 FLOP/byte under a read-plus-write traffic model (memory bound), placing these operations in different optimization regimes on the same hardware.
Distinction: Unlike total FLOPs (a count of operations), arithmetic intensity is a ratio that characterizes the shape of a workload’s hardware demand. Two kernels with identical FLOPs but different memory access patterns have different arithmetic intensities and will be bottlenecked by different hardware resources.
Common pitfall: A frequent misconception is that arithmetic intensity is a fixed property of an operation. In practice, it depends on implementation details: a naive matrix-multiply that reloads operands from DRAM for each output element has low arithmetic intensity; a blocked (tiled) implementation that reuses data from fast SRAM achieves high arithmetic intensity—the same mathematical operation, orders of magnitude apart in hardware efficiency.

Arithmetic intensity (AI) measures floating-point operations per byte of memory traffic. The operation count $O$ is a dimensionless count of floating-point operations and the data volume $D_{\text{vol}}$ is measured in bytes, so AI has units of FLOP/byte, defined by equation 3: \[I = \frac{O}{D_{\text{vol}}} \tag{3}\]
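The figures in the definition can be reproduced from equation 3. The sketch below assumes FP16 operands for the matrix multiply and 4-byte elements for ReLU (the assumption implied by 8 bytes of read-plus-write traffic per operation); the function names are illustrative. The naive variant models the pitfall described above, in which every output element reloads a full row and column.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Equation 3: operations divided by bytes of memory traffic."""
    return flops / bytes_moved


def square_matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """Ideal tiling: A, B, and C each cross the memory interface once."""
    flops = 2 * n**3                      # one multiply and one add per term
    traffic = 3 * n * n * bytes_per_elem  # read A, read B, write C
    return arithmetic_intensity(flops, traffic)


def naive_matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """No reuse: every output element reloads a full row and column."""
    flops_per_output = 2 * n
    traffic_per_output = (2 * n + 1) * bytes_per_elem  # row + column + one store
    return arithmetic_intensity(flops_per_output, traffic_per_output)


def relu_intensity(bytes_per_elem: int = 4) -> float:
    """One operation per element; read the input, write the output."""
    return arithmetic_intensity(1, 2 * bytes_per_elem)


print(round(square_matmul_intensity(1024), 1))  # 341.3
print(round(naive_matmul_intensity(1024), 2))   # 0.5
print(relu_intensity())                         # 0.125
```

The two matrix-multiply variants perform identical arithmetic, yet their intensities differ by almost three orders of magnitude under these traffic assumptions.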

The roofline visualization shows performance (TFLOP/s) on the vertical axis and arithmetic intensity (FLOP/byte) on the horizontal axis. At low arithmetic intensity, performance increases linearly with intensity (memory-bound region). Above the ridge point, performance saturates at peak compute (compute-bound region). Bottleneck diagnostic maps each regime to the optimizations that pay off and the ones that waste effort, so that classifying a workload as memory-bound or compute-bound tells the engineer directly whether a faster accelerator or more bandwidth is the binding investment.

Hardware ridge points

The ridge point, the hardware balance $I_{\text{ridge}}$ established in section 1.4.2, is the arithmetic-intensity threshold at which a workload on a given accelerator turns from memory-bound to compute-bound. Table 12 shows representative compute-to-bandwidth ratios and the resulting ridge points for several accelerator generations:

Table 12: Hardware Ridge Points: Representative ridge point ranges for different accelerator generations, determined by their compute-to-bandwidth ratios. Values shown are order-of-magnitude approximations; actual ridge points vary by precision mode and specific SKU. Higher ridge points require higher FLOP/byte intensity to achieve peak utilization.

Accelerator	Peak FP16	Bandwidth	Ridge Point
GPU (2017-era)	$\sim 10^2$ TFLOP/s	$\sim 1$ TB/s	$\sim 10^2$ FLOP/byte
GPU (2020-era)	$\sim 10^2$ TFLOP/s	$\sim 1$–$2$ TB/s	$\sim 10^2$ FLOP/byte
GPU (2023-era)	$\sim 10^3$ TFLOP/s	a few TB/s	$\sim 10^2$ FLOP/byte
TPU-class (2023-era)	$\sim 10^2$ to $\sim 10^3$ TFLOP/s	$\sim 1$ TB/s	$\sim 10^2$ FLOP/byte

These ridge point values all sit on the order of $10^2$ FLOP/byte, but the exact figures reveal a surprising trend: as hardware has become more powerful, keeping it fully utilized has become harder. A ridge-point comparison makes that trend concrete.

Napkin Math 1.3: The utilization gap

The utilization physics: Why is it harder to get 100 percent utilization on an H100 than a V100?

Metric: The ridge point $I_{\text{ridge}} = R_{\text{peak}} / \text{BW}$ (FLOP/byte) from section 1.4.2: how many math operations the hardware must perform for every byte of data loaded to keep the compute units busy.

Evolution:

V100 (2017): 125 TFLOP/s / 0.9 TB/s ≈ 138.9 FLOP/byte.
A100 (2020): 312 TFLOP/s / 2.04 TB/s ≈ 153 FLOP/byte.
H100 (2023): 989 TFLOP/s / 3.35 TB/s ≈ 295.2 FLOP/byte.

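Systems insight: The “bar” for compute intensity has roughly doubled. An algorithm with $I$ = 200 FLOP/byte was compute-bound (good) on A100 but is bandwidth-bound (bad) on H100, where it is capped at 3.35 TB/s × 200 FLOP/byte = 670 TFLOP/s, about 2.1× its A100 throughput. Code that is bandwidth-bound on both chips fares worse, which explains why “legacy” code often sees only a 1.6× speedup on H100 (the bandwidth ratio) instead of the advertised 3.2× (the FLOPs ratio).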

Practical examples: A standard ReLU performs 1 operation for every 8 bytes moved (a 4-byte read plus a 4-byte write, or 0.125 FLOP/byte), placing it 2,361.8× below the H100 ridge point. A well-tiled 1024 $\times$ 1024 dense MatMul reaches about 341.3 FLOP/byte, making it compute bound even on H100. Most operations fall short of the ridge point, which is why kernel fusion is one of the most important optimizations, as explored in section 1.7.1.3.
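The ridge points above can be reproduced directly, and the same roofline bound shows what speedup a kernel at a given intensity would see when moving from A100 to H100. This is a sketch that uses only the peak and bandwidth figures quoted in this box; the helper names and the three sample intensities are illustrative.

```python
# (peak FLOP/s, bandwidth bytes/s) from the figures above.
GPUS = {
    "V100": (125e12, 0.9e12),
    "A100": (312e12, 2.04e12),
    "H100": (989e12, 3.35e12),
}


def ridge_point(peak: float, bw: float) -> float:
    """Intensity at which the bandwidth ceiling meets the compute ceiling."""
    return peak / bw


def attainable(peak: float, bw: float, intensity: float) -> float:
    """Roofline bound: the lower of the two ceilings."""
    return min(peak, bw * intensity)


for name, (peak, bw) in GPUS.items():
    print(f"{name}: ridge = {ridge_point(peak, bw):.1f} FLOP/byte")

# Speedup from A100 to H100 for kernels at several intensities.
for intensity in (1.5, 200, 400):
    speedup = (attainable(*GPUS["H100"], intensity)
               / attainable(*GPUS["A100"], intensity))
    print(f"I={intensity}: {speedup:.2f}x")
```

Only the kernel that is compute-bound on both chips sees the full FLOPs ratio; the kernel that is bandwidth-bound on both sees the bandwidth ratio, and the kernel between the two ridge points lands in between.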

Depthwise convolution, embedding lookup, LayerNorm, and softmax are useful low-intensity reference points because they spend more time moving bytes than doing arithmetic. Table 13 maps common neural network operations to the Roofline model.

Table 13: Operations on the Roofline: Neural network layers span a wide range of arithmetic intensities. Large, well-tiled convolutions and batched GEMMs can be compute-bound, while small-batch dense projections, MobileNet depthwise layers, attention softmax, normalization, and DLRM embeddings are often memory-bound.

Operation	Arithmetic Intensity	Classification	Lighthouse Example
Conv2D (Dense)	50–200 FLOP/byte	Straddles ridge; high-reuse cases compute-bound	ResNet-50
Dense MatMul (large batch, well-tiled)	64–256+ FLOP/byte	Often compute-bound at large batch	GPT-2 (batched projections)
Depthwise Conv	10–20 FLOP/byte	Memory-bound	MobileNet
Attention Softmax	<5 FLOP/byte	Memory-bound	GPT-2 (Generation)
LayerNorm	1–2 FLOP/byte	Memory-bound	GPT-2/Llama
Embedding lookup	<1 FLOP/byte	Memory-bound	DLRM

To see how these intensity values translate into real performance predictions, a transformer layer provides a complete arithmetic-intensity calculation across its major sub-operations.

Napkin Math 1.4: Transformer layer analysis

For a transformer with hidden_dim = 768, heads = 12, batch = 32, seq = 512, and FP16 (2-byte) tensors:

Attention QKV projection:

FLOPs: 2 $\times$ 3 $\times$ 32 $\times$ 512 $\times$ 768 $\times$ 768 = 58 GFLOP
Bytes: (input + weights + output) = (32 $\times$ 512 $\times$ 768 + 3 $\times$ 768 $\times$ 768 + 32 $\times$ 512 $\times$ 768 $\times$ 3) $\times$ 2 ≈ 104.2 MB
AI = 58 GFLOP / 104.2 MB = 556.4 FLOP/byte, which is compute bound on A100 (above 153 FLOP/byte threshold)

Softmax:

FLOPs: 32 $\times$ 12 $\times$ 512 $\times$ 512 $\times$ 3 ≈ 302 MFLOP (exp, sum, div)
Bytes: 32 $\times$ 12 $\times$ 512 $\times$ 512 $\times 2 \times 2$ = 402.7 MB
AI = 302 MFLOP / 402.7 MB = 0.75 FLOP/byte, which is memory-bound

This analysis explains why FlashAttention focuses on reducing memory traffic in attention rather than reducing FLOPs.
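The two calculations above can be reproduced in a few lines. This sketch assumes FP16 tensors and 12 attention heads (the factor of 12 in the softmax counts), and it uses decimal units (1 MB = 10^6 bytes) to match the figures in the box; the variable names are illustrative.

```python
B, S, H, HEADS, BYTES = 32, 512, 768, 12, 2  # batch, seq, hidden, heads, FP16

# QKV projection: one (B*S x H) by (H x 3H) matrix multiply.
qkv_flops = 2 * 3 * B * S * H * H
qkv_bytes = (B * S * H + 3 * H * H + 3 * B * S * H) * BYTES
qkv_ai = qkv_flops / qkv_bytes

# Softmax over B*HEADS score matrices of size S x S, about 3 ops per element.
scores = B * HEADS * S * S
softmax_flops = 3 * scores
softmax_bytes = scores * BYTES * 2  # read once, write once
softmax_ai = softmax_flops / softmax_bytes

print(f"QKV:     {qkv_flops / 1e9:.1f} GFLOP, {qkv_bytes / 1e6:.1f} MB, "
      f"{qkv_ai:.1f} FLOP/byte")
print(f"Softmax: {softmax_flops / 1e6:.1f} MFLOP, {softmax_bytes / 1e6:.1f} MB, "
      f"{softmax_ai:.2f} FLOP/byte")
```

The softmax moves roughly four times as many bytes as the projection while performing about 190 times fewer operations, which is the imbalance that tiled attention kernels target.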

These classifications directly inform optimization strategy. Memory-bound operations benefit from reducing data movement through operator fusion, using reduced precision (FP16, INT8), and increasing arithmetic intensity through algorithmic changes like FlashAttention. Compute-bound operations, by contrast, benefit from maximizing hardware utilization through batching and parallelism, exploiting Tensor Cores and specialized compute units, and optimizing compute efficiency through tiling and scheduling.

Calculating memory bandwidth bounds

The roofline model’s memory-bound region is determined by the peak memory bandwidth. For an operation to achieve throughput $R_{\text{ops}}$ (FLOP/s, often expressed in TFLOP/s) in the memory-bound regime, equation 4 gives the required bandwidth: \[\text{BW}_{\text{req}} = \frac{R_{\text{ops}}}{I} \text{ bytes/s} \tag{4}\]

When required bandwidth exceeds peak bandwidth, performance is capped according to equation 5. Here $R_{\text{ops}}$ and $R_{\text{attain}}$ are in FLOP/s and $I$ is in FLOP/byte. \[R_{\text{attain}} = \text{BW} \times I \tag{5}\]
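As a worked instance of equations 4 and 5, consider a kernel with $I \approx 31$ FLOP/byte that tries to run at the full 312 TFLOP/s of an A100-class part with 2.04 TB/s of memory bandwidth, the figures used elsewhere in this section. The rounded intensity is an illustrative choice:

\[\text{BW}_{\text{req}} = \frac{312 \times 10^{12}\ \text{FLOP/s}}{31\ \text{FLOP/byte}} \approx 10.1\ \text{TB/s} > 2.04\ \text{TB/s}, \qquad R_{\text{attain}} = 2.04\ \text{TB/s} \times 31\ \text{FLOP/byte} \approx 63\ \text{TFLOP/s}\]

The kernel would need roughly five times the available bandwidth to reach peak compute, so its throughput is set by the memory system rather than the arithmetic units.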

A convolution layer provides the compute-bound contrast.

Napkin Math 1.5: Convolutional layer analysis

Consider a Conv2D layer with input shape (batch = 32, channels = 128, height = 56, width = 56), output channels = 256, kernel size $3{\times}3$ on an A100 GPU:

Computational requirements:

Output size: $32 \times 256 \times 56 \times 56$ = 25.7M elements
FLOPs per output: $128 \times 3 \times 3 \times 2$ = 2,304 (multiply-add)
Total FLOPs: 25.7M $\times$ 2,304 = 59.2 GFLOP

Memory traffic analysis:

Input: $32 \times 128 \times 56 \times 56 \times 2$ = 25.7 MB (FP16)
Weights: $256 \times 128 \times 3 \times 3 \times 2$ ≈ 0.6 MB (FP16)
Output: $32 \times 256 \times 56 \times 56 \times 2$ = 51.4 MB (FP16)
Total: 77.7 MB

Arithmetic intensity: $I$ = 59.2 GFLOP / 77.7 MB = 762.2 FLOP/byte

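This is well above A100’s ridge point of 153 FLOP/byte, making this operation compute-bound under the assumption that each tensor crosses HBM only once. The roofline ceiling is therefore the ~312 TFLOP/s peak (FP16 with Tensor Cores), which a well-tuned kernel can approach but not exceed.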

The convolutional layer’s high arithmetic intensity arises from its weight reuse pattern: the same $3{\times}3$ kernel is applied across all spatial locations, amortizing the cost of loading weights across millions of output computations. This is the architectural pattern that makes CNNs so efficient on modern accelerators.
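The convolution figures can be checked with a small helper. This sketch assumes stride 1 with same-size output, FP16 tensors, and ideal reuse in which each tensor crosses the memory interface exactly once; the function name is illustrative.

```python
def conv2d_roofline(batch, c_in, c_out, h, w, k, bytes_per_elem=2):
    """FLOPs, bytes, and intensity for a stride-1, same-padded Conv2D.

    Assumes each tensor crosses the memory interface exactly once.
    """
    outputs = batch * c_out * h * w
    flops = outputs * (c_in * k * k * 2)  # multiply + add per kernel tap
    input_bytes = batch * c_in * h * w * bytes_per_elem
    weight_bytes = c_out * c_in * k * k * bytes_per_elem
    output_bytes = outputs * bytes_per_elem
    total_bytes = input_bytes + weight_bytes + output_bytes
    return flops, total_bytes, flops / total_bytes


flops, total_bytes, ai = conv2d_roofline(32, 128, 256, 56, 56, 3)
print(f"{flops / 1e9:.1f} GFLOP, {total_bytes / 1e6:.1f} MB, {ai:.1f} FLOP/byte")
```

The weights account for under 1 percent of the traffic here, so the intensity is governed almost entirely by how many operations each activation byte supports.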

However, not all layers in a neural network exhibit this favorable profile. The fully connected (dense) layers that typically appear at the end of classification networks, or as the projection layers in transformers, have different arithmetic intensity characteristics. A dense layer provides the memory-bound contrast needed to predict where bottlenecks will occur in end-to-end model execution.

Napkin Math 1.6: Dense layer analysis

Consider a fully connected layer: input (batch = 32, features = 2048) → output (batch = 32, features = 2048) on the same A100:

Computational requirements:

Matrix multiply: $(32 \times 2048) \times (2048 \times 2048)$
Total FLOPs: $2 \times 32 \times 2048 \times 2048$ = 268.4 MFLOP

Memory traffic analysis:

Input: $32 \times 2048 \times 2$ = 131.1 KB (FP16)
Weights: $2048 \times 2048 \times 2$ = 8.4 MB (FP16)
Output: $32 \times 2048 \times 2$ = 131.1 KB (FP16)
Total: 8.7 MB

Arithmetic intensity: $I$ = 268.4 MFLOP / 8.7 MB = 31 FLOP/byte

This is below A100’s ridge point of 153 FLOP/byte, making this operation memory-bound. Attainable performance: $R_{\text{attain}}$ = 2,039 GB/s $\times$ 31 FLOP/byte = 63.3 TFLOP/s

This is only 20.3 percent of peak compute capability, demonstrating the memory wall effect for small batch sizes.

The dense layer’s lower arithmetic intensity stems from limited weight reuse: each weight is reused across the batch but lacks the additional spatial reuse of convolutional filters, so small-batch dense layers have much lower arithmetic intensity than convolutions. This difference explains why small-batch transformer inference (dominated by dense projections) is typically memory bound while CNN inference can be compute bound.
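The dense-layer analysis follows the same pattern, with the roofline bound applied at the end. The sketch below assumes FP16 operands and the A100 figures used above (312 TFLOP/s, 2,039 GB/s); the function name is illustrative.

```python
def dense_roofline(batch, m, n, peak=312e12, bw=2.039e12, bytes_per_elem=2):
    """Intensity and attainable FLOP/s for a (batch x m) @ (m x n) layer."""
    flops = 2 * batch * m * n
    total_bytes = (batch * m + m * n + batch * n) * bytes_per_elem
    ai = flops / total_bytes
    attain = min(peak, bw * ai)
    return ai, attain


ai, attain = dense_roofline(32, 2048, 2048)
print(f"I = {ai:.1f} FLOP/byte, attainable = {attain / 1e12:.1f} TFLOP/s "
      f"({100 * attain / 312e12:.1f}% of peak)")
```

Because the weight matrix dominates the byte count, the intensity tracks the batch size closely, which is the handle that batching exploits.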

Napkin Math 1.7: LayerNorm analysis

LayerNorm with input shape (batch = 32, seq = 512, hidden = 768):

Computational requirements:

Elements: $32 \times 512 \times 768$ = 12.6M
Operations per element: mean (1 ADD), variance (1 ADD, 1 MUL), normalize (1 ADD, 1 MUL, 1 DIV) ≈ 6
Total FLOPs: 12.6M $\times$ 6 = 75.5 MFLOP

Memory traffic:

Input: 12.6M $\times$ 2 = 25.2 MB
Parameters (scale, bias): $768 \times 2 \times 2$ = 3.1 KB (negligible)
Output: 12.6M $\times$ 2 = 25.2 MB
Total: 50.3 MB

Arithmetic intensity: $I$ = 75.5 MFLOP / 50.3 MB = 1.5 FLOP/byte

This is severely memory-bound (102× below the A100 ridge point). Performance is limited to: $R_{\text{attain}}$ = 2,039 GB/s $\times$ 1.5 FLOP/byte = 3.1 TFLOP/s

This represents less than 1 percent of A100’s compute capacity, explaining why normalization layers contribute negligible compute time but significant latency.

The LayerNorm result marks the extreme end of the spectrum, and purely element-wise operations such as activation functions sit there as well. These operations perform negligible computation relative to the data they touch: each element is loaded, transformed by a simple formula, and written back, leaving essentially no opportunity for data reuse.
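The LayerNorm figures can be reproduced in a few lines. The sketch assumes FP16 tensors, the approximate count of 6 operations per element used above, and the A100 figures from the earlier examples; the variable names are illustrative.

```python
elements = 32 * 512 * 768
ops_per_element = 6                       # approximate count used above
flops = elements * ops_per_element

activation_bytes = 2 * elements * 2       # FP16 read plus FP16 write
param_bytes = 768 * 2 * 2                 # scale and bias, FP16
traffic = activation_bytes + param_bytes

ai = flops / traffic
attain = min(312e12, 2.039e12 * ai)       # roofline bound on A100

print(f"I = {ai:.1f} FLOP/byte")
print(f"attainable = {attain / 1e12:.1f} TFLOP/s "
      f"({100 * attain / 312e12:.2f}% of peak)")
```

The parameter traffic is a few kilobytes against roughly 50 MB of activations, so nothing about the layer's own weights can change its position on the roofline.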

Optimization by intensity regime

The roofline analysis directly informs optimization priorities, summarized in table 14.

Table 14: Optimization by Arithmetic Intensity Regime: Roofline position determines whether an optimization should chase compute utilization, memory-traffic reduction, or complete elimination of memory round-trips. The lower the arithmetic intensity, the more valuable fusion and data-movement avoidance become.

Intensity regime	Typical operations	Optimization priority	Common techniques	Expected impact
High AI (>200 FLOP/byte)	Large convolutions	Maximize compute utilization.	Tensor Cores, thread-block tuning, and high occupancy.	Can approach 90–95% of peak TFLOP/s.
Medium AI (20–200 FLOP/byte)	Medium-sized dense layers	Balance compute and memory optimization.	Larger batches, register tiling, and fusion with adjacent operations.	Can move from memory-bound to compute-bound execution.
Low AI (2–20 FLOP/byte)	Small dense layers and element-wise operations	Reduce memory traffic.	Aggressive operator fusion, reduced precision (FP16 → INT8), and algorithmic changes.	Fusion alone can yield 2–4$\times$ speedups.
Very low AI (<2 FLOP/byte)	Normalization layers and activation functions	Eliminate memory round-trips.	Mandatory fusion with adjacent operations and in-place computation where possible.	Fusion can yield 10$\times$ speedups; for example, LayerNorm combined with the Gaussian Error Linear Unit (GELU) can become a single fused kernel.
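The reason fusion dominates the bottom rows of the table can be seen by counting bytes. The sketch below uses an idealized traffic model that is an assumption of this example, not a measurement: each unfused element-wise operation reads its input tensor from DRAM and writes its output back, while a fused kernel reads the original input once and writes the final output once. It ignores parameters, multi-pass statistics, and kernel launch overhead; `elementwise_chain_bytes` is a hypothetical helper name.

```python
def elementwise_chain_bytes(num_elements, num_ops, bytes_per_element=2, fused=False):
    """DRAM bytes moved by a chain of element-wise ops over one tensor.

    Unfused: every op performs one full read and one full write.
    Fused:   one read of the original input, one write of the final output.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_element


n = 12_600_000  # activation count used in the LayerNorm example (FP16)
unfused = elementwise_chain_bytes(n, num_ops=2)             # e.g. LayerNorm, then GELU
fused = elementwise_chain_bytes(n, num_ops=2, fused=True)   # single fused kernel
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, "
      f"traffic reduction: {unfused / fused:.1f}x")
```

In this model the traffic reduction grows linearly with the number of operations folded into one kernel, which is why longer fused chains pay off more than pairwise fusion.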

For low-AI operations, operator fusion is often the decisive optimization: LayerNorm combined with the Gaussian Error Linear Unit (GELU), for example, can become a single fused kernel. One of the most accessible levers for moving an operation up and right on the roofline is batching.

Napkin Math 1.8: Batch size and arithmetic intensity

Increasing batch size improves arithmetic intensity for matrix operations by amortizing weight loading. Equation 6 formalizes this relationship for a dense layer $(B{\times}M){\times}(M{\times}N)$: \[I = \frac{2BMN}{2BM + 2MN + 2BN} \approx \frac{2BMN}{2MN} = B \quad (\text{when } 2MN \gg 2B(M+N)) \tag{6}\]

Example: Dense layer with M=N=2048 (FP16)

Batch = 1: AI ≈ 1 FLOP/byte (memory bound)
Batch = 32: AI ≈ 31 FLOP/byte (memory bound)
Batch = 256: AI ≈ 204.8 FLOP/byte (compute bound on A100)
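These three values follow from evaluating equation 6 exactly rather than through the $I \approx B$ approximation. A minimal sketch, assuming FP16 (2 bytes per element) and counting only the input, weight, and output tensors as in the equation; the function name is illustrative:

```python
def dense_layer_intensity(batch, m, n, bytes_per_element=2):
    """Arithmetic intensity (FLOP/byte) of a (B x M) @ (M x N) dense layer."""
    flops = 2 * batch * m * n
    bytes_moved = bytes_per_element * (batch * m + m * n + batch * n)
    return flops / bytes_moved


A100_RIDGE = 153.0  # FLOP/byte, FP16 Tensor Core ridge point used in this chapter

for batch in (1, 32, 256):
    ai = dense_layer_intensity(batch, 2048, 2048)
    regime = "compute bound" if ai > A100_RIDGE else "memory bound"
    print(f"batch={batch:4d}  AI={ai:6.1f} FLOP/byte  ({regime})")
```

The exact values sit slightly below $B$ because the activation terms $2B(M+N)$ are small but not zero; the gap widens as the batch grows relative to $M$ and $N$.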

This explains why batching can produce large throughput improvements in production inference systems, as MLPerf Inference, the standardized benchmark suite covered in Benchmarking, demonstrates by separating bulk-throughput runs from latency-constrained serving runs (Reddi et al. 2019).

Reddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 2019. “MLPerf Inference Benchmark.” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 446–59. https://doi.org/10.1109/isca45697.2020.00045.

The batch size analysis reveals why inference serving systems are designed around batching: it changes the arithmetic intensity regime of memory-bound workloads. However, batching introduces latency trade-offs, since requests must wait in a queue until a batch forms. This tension between throughput (favoring large batches) and latency (favoring small batches) is a central challenge in ML serving systems, explored in depth in Dynamic batching latency-throughput trade-offs.

For workloads where batching is impractical, such as interactive LLM generation where users expect streaming responses, the arithmetic intensity remains inherently low. Understanding this ceiling is essential for setting realistic performance expectations.

A batch-1 GPT-2 calculation makes this bandwidth ceiling concrete.

Napkin Math 1.9: The throughput ceiling

Problem: What is the maximum possible utilization of an NVIDIA A100 when running GPT-2 inference (batch size 1)?

The hardware constraints (the denominators)

Peak compute: 312 TFLOP/s (FP16 Tensor Core).
Peak bandwidth: 2.04 TB/s (HBM2e).
Ridge point $(R_{\text{peak}}/\text{BW})$: 312 TFLOP/s / 2.04 TB/s = 153 FLOP/byte (for FP16 Tensor Core).
- Meaning: Saturating this chip at FP16 precision requires about 153 floating-point operations for every byte loaded. The ridge point varies by precision: FP32 operations (19.5 TFLOP/s peak) have a ridge point of only ~9.6 FLOP/byte.

The workload characteristics (the numerator)

Model: GPT-2 XL (1.5 billion parameters).
Operation: Autoregressive generation (1 token at a time).
Data movement: Must load all weights (3 GB @ FP16) for every token.
Compute: Vector-Matrix multiplication. 2 $\times$ Params ≈ 3 GFLOP.
Arithmetic intensity: 3 GFLOP / 3 GB = 1 FLOP/byte

The prediction (iron law)

Since the actual intensity (1 FLOP/byte) ≪ ridge point (153 FLOP/byte), the system is bandwidth bound.

Maximum throughput: 1 FLOP/byte $\times$ 2.04 TB/s = 2.04 TFLOP/s.
Utilization ceiling: 2.04 TFLOP/s (Actual) / 312 TFLOP/s (Peak) ≈ $0.7\%$

Systems insight: Without batching or caching, a $15,000 GPU runs at less than 1 percent efficiency on LLM inference. This “utilization gap” drives the need for key-value caching and quantization.
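The same calculation can be written as a short script. This is a sketch using only the figures stated in the problem; the tokens-per-second line is a derived consequence of the same assumptions (every token reloads all FP16 weights, and activation and key-value traffic are ignored), not a measured number.

```python
PEAK_TFLOPS = 312.0     # A100 FP16 Tensor Core peak
BANDWIDTH_TBS = 2.04    # A100 HBM2e bandwidth, TB/s

params = 1.5e9                    # GPT-2 XL parameter count
weight_bytes = params * 2         # FP16: 2 bytes per parameter (3 GB)
flops_per_token = 2 * params      # one multiply and one add per weight (3 GFLOP)

intensity = flops_per_token / weight_bytes                    # FLOP/byte
ridge = PEAK_TFLOPS / BANDWIDTH_TBS                           # FLOP/byte
ceiling_tflops = min(PEAK_TFLOPS, BANDWIDTH_TBS * intensity)  # roofline bound
utilization = ceiling_tflops / PEAK_TFLOPS

# Upper bound on decode rate if weight loading is the only cost per token.
tokens_per_second = BANDWIDTH_TBS * 1e12 / weight_bytes

print(f"intensity: {intensity:.1f} FLOP/byte, ridge: {ridge:.0f} FLOP/byte")
print(f"ceiling: {ceiling_tflops:.2f} TFLOP/s ({100 * utilization:.2f}% of peak)")
print(f"bandwidth-limited decode rate: {tokens_per_second:.0f} tokens/s")
```

Halving `weight_bytes` (for example, by quantizing weights to one byte each) doubles both the intensity and the decode-rate bound in this model, which is why quantization helps a kernel that more compute would not.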

As this derivation demonstrates, the Roofline model provides the diagnostic framework for identifying whether operations are compute bound or memory bound. Knowing that a workload is memory bound at 0.7 percent utilization is only the first step; the next challenge is translating this diagnosis into efficient execution plans that exploit accelerator architectures.

Self-Check: Question

On an A100 with a ridge point of about 153 FLOP/byte, which statement correctly classifies whether a kernel is compute bound or memory bound?
1. Kernels with arithmetic intensity above about 153 FLOP/byte sit on the compute ceiling and are compute bound; kernels below that threshold sit on the bandwidth slope and are memory bound
2. Classification depends on the parameter count of the model: larger models are always compute bound
3. Classification depends on whether the workload is training or inference, not on arithmetic intensity
4. Kernels are compute bound if they use tensor cores and memory bound otherwise, regardless of arithmetic intensity
A dense linear layer at batch size 1 sits near 1 FLOP/byte on an A100 (ridge point about 153 FLOP/byte). At batch size 256 the same layer sits at roughly 205 FLOP/byte. Explain the mechanism that shifts the layer across the ridge and quantify which levers moved.
Which operation is most likely to remain severely memory bound on an A100 even at large batch sizes?
1. LayerNorm, which performs on the order of a handful of arithmetic operations per activation element and reads the entire activation tensor to compute per-token normalization statistics
2. A large $3 \times 3$ convolution with 256 input and output channels and heavy spatial filter reuse
3. A large batched dense matrix multiplication using tensor cores at batch size 1024
4. A transformer QKV projection that shares input activations across the three projection matrices
True or False: Buying an H100 with higher peak FLOP/s than an A100 guarantees that every kernel compute-bound on the A100 stays compute-bound on the H100.
A team profiles a GPT-2 batch-1 decode kernel on an A100 and measures 0.8 FLOP/byte of weight and activation traffic against a 153 FLOP/byte ridge. Because the kernel sits roughly 190$\times$ below the ____, the first productive optimizations reduce bytes moved through fusion, layout changes, or lower-precision weights, rather than adding more compute silicon.
A team profiles GPT-2 autoregressive inference at batch 1 on an A100 and finds realized throughput below 1 percent of peak. Which optimization direction is most justified first?
1. Reduce bytes moved per token through operator fusion, quantization of weights to INT8, or increasing batch size — each raises arithmetic intensity toward the ridge
2. Upgrade to a newer accelerator with 2$\times$ the peak FLOP/s, because more compute throughput is the standard remedy for low utilization
3. Convert all operations to FP64 for better numerical stability, which will let the kernel stay on the compute ceiling longer
4. Replace tensor cores with scalar cores to better match the kernel’s low observed utilization

See Answers →

Hardware Mapping

The Roofline analysis taught us to diagnose whether specific operations are compute bound or memory bound on given hardware. We saw that ResNet-50’s convolutions can reach high arithmetic intensity (50–200 FLOP/byte), with the high-reuse cases crossing into the compute-bound regime, while GPT-2’s attention layers achieve only 2–5 FLOP/byte and are severely memory bound. Diagnosis, however, is only half the challenge. Once we know that LayerNorm achieves only 1–2 FLOP/byte on an A100, the challenge becomes executing it efficiently despite this limitation. This is the domain of hardware mapping, the art of translating abstract computational graphs into concrete execution plans that exploit accelerator architectures while respecting their constraints.

The memory system challenges examined in section 1.4.1 established why memory access dominates modern AI systems: as figure 9 quantified, DRAM access consumes 100–200$\times$ more energy than a multiply-accumulate operation (Horowitz 2014). The Roofline model established how to measure whether a workload is compute bound or memory bound. This section addresses the critical follow-up: how to map computations to maximize data reuse and minimize the energy-intensive transfers that the Roofline analysis revealed as the primary bottleneck.

Consider a $3{\times}3$ convolution running on an accelerator tile. The mathematical operation is fixed, but the execution plan is not. One schedule can keep a filter in local registers while many output pixels stream past it; another can advance pixel by pixel and reload the same filter values repeatedly from a slower memory tier. Both schedules compute the same tensor. Only one turns the reuse in the convolution into real bandwidth savings. Hardware mapping is the discipline that makes that difference explicit: it decides where the work runs, where the data lives while it is reused, and when each loop or kernel executes so the accelerator is fed rather than idle.

Definition 1.6: Mapping in AI acceleration

Mapping in AI Acceleration is the accelerator-compiler process of binding the Logical Computation Graph to the Physical Hardware Topology by deciding which operations execute on which processing elements, which data resides in which memory tier, and in what temporal order.

Significance: Within the D·A·M taxonomy, mapping is the machine-axis decision that determines whether the algorithm’s operations run at $R_{\text{peak}}$ or at $\text{BW} \times I$. Specifically, a general matrix multiply (GEMM) with arithmetic intensity $I$ runs at $\min(R_{\text{peak}},\; \text{BW} \times I)$; a poor tiling choice that forces unnecessary DRAM accesses can reduce effective $I$ enough to collapse a compute-bound operation into a bandwidth-bound one and cause a commensurate drop in sustained throughput.
Distinction: Unlike Traditional Compilation (which targets a linear instruction stream on a von Neumann processor), Mapping targets a Dataflow Architecture where the movement of data is as costly as the computation itself: off-chip DRAM access consumes ~200$\times$ more energy than a multiply-accumulate in local registers.
Common pitfall: A frequent misconception is that mapping is automatically handled by frameworks. For general GPU workloads, compilers like Accelerated Linear Algebra (XLA) can find strong mappings for common kernels; for specialized accelerators (systolic arrays, custom ASICs), compiler-generated mappings may still lag hand-tuned schedules because the compiler’s search space is limited by the time budget at compilation.
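The claim that a poor tiling choice can collapse a compute-bound GEMM into a bandwidth-bound one can be illustrated with a simplified traffic model. The sketch below assumes square $T \times T$ output tiles held on chip, with the matching $T \times K$ and $K \times T$ operand slabs streamed from DRAM once per output tile and no reuse across tiles. That model, the helper name `tiled_gemm_intensity`, and the 4096-sized problem are assumptions of this example; real kernels also tile along $K$ and are limited by on-chip capacity.

```python
def tiled_gemm_intensity(m, n, k, tile, bytes_per_element=2):
    """Effective FLOP/byte of C[m,n] = A[m,k] @ B[k,n] with tile x tile output tiles.

    Assumes tile divides m and n, and that each output tile streams its
    A and B slabs from DRAM exactly once before writing the tile back.
    """
    flops = 2 * m * n * k
    num_tiles = (m // tile) * (n // tile)
    elements_moved = num_tiles * (2 * tile * k) + m * n   # operand slabs + output
    return flops / (bytes_per_element * elements_moved)


PEAK_TFLOPS, BANDWIDTH_TBS = 312.0, 2.04   # A100 figures used in this chapter

for tile in (16, 64, 256, 512):
    ai = tiled_gemm_intensity(4096, 4096, 4096, tile)
    attainable = min(PEAK_TFLOPS, BANDWIDTH_TBS * ai)     # TB/s * FLOP/byte = TFLOP/s
    print(f"tile={tile:3d}  I={ai:6.1f} FLOP/byte  attainable={attainable:6.1f} TFLOP/s")
```

The arithmetic is identical in every row; only the mapping changes, and in this model the effective intensity scales roughly with half the tile edge until the roofline's compute ceiling takes over.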

The convolution example exposes the three decisions that recur throughout accelerator compilation. Placement assigns the multiply-accumulate work to processing elements so parallelism does not turn into idle time or interconnect congestion. Allocation keeps weights, activations, and partial sums in the memory tier where their next use will occur, rather than letting reuse spill back to DRAM. Scheduling orders loops and kernels so the chosen placement and allocation remain valid over time. A poor choice in any one dimension can collapse a high-arithmetic-intensity operation back into a bandwidth-bound execution. In practice, these choices are too coupled for developers to manage by hand at model scale, which is why a specialized compiler such as NVIDIA’s NVCC or Google’s XLA takes over: it accepts the high-level model from the framework and searches the mapping space for a good execution plan within the compile-time and hardware budget it is given. Section 1.8 examines that compiler support in detail.

Placement and allocation

Translating a model’s computational graph into efficient hardware execution requires solving two tightly coupled problems. Computation placement determines which operations run on which processing elements, balancing parallelism against communication costs. Memory allocation determines where data resides within the memory hierarchy, trading capacity against access latency. These two decisions interact: placing operations on distant processing elements increases the memory bandwidth required to shuttle data between them, while allocating data to fast but small on-chip memory limits which operations can execute concurrently. Getting either wrong leaves thousands of processing elements idle or starved for data.

Computation placement

Computation placement is the process of strategically assigning operations to an accelerator’s processing elements (PEs) to maximize parallelism, minimize idle time, and reduce unnecessary data movement. Modern accelerators contain enormous numbers of PEs: the NVIDIA H100 has over 16,000 streaming processors and more than 500 tensor cores (Choquette 2023), TPUs use systolic arrays of thousands of multiply-accumulate units (Jouppi et al. 2017), and wafer-scale processors like Cerebras’ CS-2 integrate over 850,000 cores (Systems 2021). At these scales, even small placement inefficiencies compound into measurable performance losses because idle cores and redundant memory transfers waste both time and energy.

The difficulty of placement depends on workload regularity. CNNs exhibit structured, spatially local computation: a $256{\times}256$ image can be tiled across thousands of GPU cores with each tile processed independently, yielding balanced utilization. Transformers are harder because self-attention requires every token to interact with every other, creating nonuniform demands where attention score computation is far heavier than other operations. Graph Neural Networks (GNNs) are harder still, as sparse, dynamically changing graph structures make static partitioning ineffective. Table 15 lists the core challenges placement must address across these workload types. The common thread is regularity: the less regular the computation, the less a static placement can balance load and locality, which is why adaptive, runtime-aware placement becomes mandatory for transformers and GNNs even though it is unnecessary for CNNs.

Table 15: Computation Placement Challenges: Effective neural network deployment requires strategic allocation of computations to processing elements, balancing workload distribution, data movement costs, and hardware constraints to maximize execution efficiency. These challenges guide the design of mapping strategies that optimize resource utilization and minimize communication overhead.

Challenge	Impact on Execution	Key Considerations for Placement
Workload Imbalance	Some processing elements finish early while others remain overloaded, leading to idle compute resources.	Distribute operations evenly to prevent stalls and ensure full utilization of PEs.
Irregular Computation Patterns	Models like transformers and GNNs introduce nonuniform computation demands, making static placement difficult.	Use adaptive placement strategies that adjust execution based on workload characteristics.
Excessive Data Movement	Frequent memory transfers introduce latency and increase power consumption.	Keep frequently used data close to the compute units and minimize off-chip memory accesses.
Limited Interconnect Bandwidth	Poorly placed operations can create congestion, slowing data movement between PEs.	Optimize spatial and temporal placement to reduce communication overhead.
Model-Specific Execution Needs	CNNs, transformers, and GNNs require different execution patterns, making a single placement strategy ineffective.	Tailor placement strategies to match the computational structure of each model type.
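The workload-imbalance row can be quantified with a simple measure. The sketch below assumes that all processing elements must wait for the slowest one before the next step begins, so utilization is total useful work divided by the PE-time reserved; the work values and the helper name `pe_utilization` are hypothetical.

```python
def pe_utilization(work_per_pe):
    """Fraction of reserved PE-time doing useful work when every PE
    waits for the most heavily loaded one to finish."""
    makespan = max(work_per_pe)
    return sum(work_per_pe) / (len(work_per_pe) * makespan)


balanced = [100, 100, 100, 100]   # regular, CNN-like tiling
skewed = [250, 50, 50, 50]        # same total work, one overloaded PE

print(f"balanced placement: {pe_utilization(balanced):.0%} utilization")
print(f"skewed placement:   {pe_utilization(skewed):.0%} utilization")
```

Both placements perform the same total work, but the skewed one takes 2.5 times as long because the step time is set by the most heavily loaded element.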

Because a well-placed workload can reduce latency by 10 to 100 times while a poorly placed one leaves thousands of PEs idle, modern accelerators increasingly rely on runtime-aware scheduling that adapts placement to real-time workload behavior rather than static execution plans. Placement decisions also interact directly with the next concern: where the data those PEs need actually resides in the memory hierarchy.

Memory allocation

While computation placement determines where operations execute, memory allocation defines where data resides and how it flows through the memory hierarchy during execution. The primary goal is to keep frequently accessed data as close as possible to the processing elements, minimizing latency and power consumption. GPUs achieve this through a mix of global memory, shared memory, and registers with careful tiling strategies (NVIDIA Corporation 2020). TPUs use on-chip SRAM scratchpads where activations and weights must be preloaded to sustain systolic array execution (figure 8), with weights streamed in perfect synchronization with input activations to maintain pipelined computation flow (Jouppi et al. 2017). Wafer-scale processors demand careful memory partitioning to avoid excessive interconnect traffic (Systems 2021). Unlike general-purpose computing, where caches abstract memory management, AI accelerators require explicit data placement strategies because poor allocation leads to three compounding penalties: increased memory latency when data must be fetched from higher-latency tiers, higher power consumption from off-chip accesses that cost orders of magnitude more energy than on-chip storage, and reduced computational throughput when processing elements stall waiting for data.

The severity of these penalties varies by workload. CNNs rely on structured, localized access patterns and benefit from well-defined memory layouts that facilitate predictable reuse (Chen, Krishna, et al. 2017). Transformer models require frequent access to large parameter sets and intermediate activations, making them highly sensitive to memory bandwidth constraints. GNNs introduce the greatest challenge, as their irregular and sparse data structures produce unpredictable access patterns that resist static allocation strategies. Table 16 summarizes these allocation challenges. As model sizes continue to grow, accelerators must dynamically manage memory resources rather than relying on static allocation schemes, and memory capacity increasingly dictates how large a model can be deployed on a given accelerator.

Table 16: Memory Allocation Challenges: Efficient memory management in AI accelerators balances data access speed with hardware constraints, mitigating performance bottlenecks caused by latency, bandwidth limitations, and irregular data patterns. Complex models such as transformers and graph networks impose variable and demanding memory requirements that amplify these challenges.

Challenge	Impact on Execution	Key Considerations for Allocation
High Memory Latency	Slow data access delays execution and reduces throughput.	Prioritize placing frequently accessed data in faster memory locations.
Limited On-Chip Storage	Small local memory constrains the amount of data available near compute units.	Allocate storage efficiently to maximize data availability without exceeding hardware limits.
High Off-Chip Bandwidth Demand	Frequent access to external memory increases delays and power consumption.	Reduce unnecessary memory transfers by carefully managing when and how data is moved.
Irregular Memory Access Patterns	Some models require accessing data unpredictably, leading to inefficient memory usage.	Organize memory layout to align with access patterns and minimize unnecessary data movement.
Model-Specific Memory Needs	Different models require different allocation strategies to optimize performance.	Tailor allocation decisions based on the structure and execution characteristics of the workload.
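The limited on-chip storage row translates into a concrete sizing question: how large a tile can stay resident near the compute units? A minimal sketch, assuming three square tiles (two operands and one accumulator) must fit at once; the 192 KiB capacity, the alignment to multiples of 16, and the helper name are hypothetical values chosen for illustration.

```python
import math


def largest_square_tile(on_chip_bytes, bytes_per_element=2, resident_tiles=3, multiple=1):
    """Largest tile edge T (rounded down to `multiple`) such that
    `resident_tiles` tiles of T x T elements fit in on-chip memory."""
    max_elements_per_tile = on_chip_bytes // (resident_tiles * bytes_per_element)
    edge = math.isqrt(max_elements_per_tile)
    return (edge // multiple) * multiple


capacity = 192 * 1024   # hypothetical on-chip scratchpad size in bytes
print("FP16 tile edge:", largest_square_tile(capacity, bytes_per_element=2, multiple=16))
print("FP32 tile edge:", largest_square_tile(capacity, bytes_per_element=4, multiple=16))
```

The capacity limit bounds the tile size, and the tile size in turn bounds how much reuse a mapping can capture before data spills to a slower tier.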

Combinatorial complexity

The small convolution example also explains why hardware mapping becomes a combinatorial search problem. Keeping a filter local improves reuse only if the chosen processing elements have enough nearby storage and if the loop order revisits that filter before the data is evicted. Parallelizing across more processing elements improves throughput only until synchronization and interconnect traffic consume the gain. Table 17 lists these recurring tensions: each row is another way the same three decisions, placement, allocation, and scheduling, constrain one another. Because the rows are coupled trade-offs with no dominant choice, the optimal mapping cannot be picked greedily one row at a time; the decisions must be searched jointly, which is precisely what makes mapping a combinatorial-search problem with no closed-form optimum.

Table 17: Placement-Allocation-Scheduling Trade-Offs: AI accelerator performance depends on mapping computations to hardware, allocating data to memory tiers, and scheduling execution over time. Careful consideration of these interdependent factors is essential for maximizing throughput and minimizing energy consumption.

Dimension	Placement Considerations	Allocation and Scheduling Considerations
Computational Granularity	Fine-grained placement enables greater parallelism but increases synchronization overhead.	Coarse-grained scheduling reduces synchronization overhead but may limit flexibility.
Spatial vs. Temporal Mapping	Spatial placement enhances parallel execution but can lead to resource contention and memory congestion.	Temporal scheduling balances resource sharing but may reduce overall throughput.
Memory and Data Locality	Placing data closer to compute units minimizes latency but may reduce overall memory availability.	Allocating data across multiple memory levels increases capacity but introduces higher access costs.
Communication and Synchronization	Co-locating compute units reduces communication latency but may introduce contention.	Scheduling synchronization mechanisms mitigates stalls but can introduce additional overhead.
Dataflow and Execution Ordering	Static placement simplifies execution but limits adaptability to workload variations.	Dynamic scheduling improves adaptability but adds scheduling complexity.

These interacting factors define a vast combinatorial design space where small variations in mapping decisions lead to large differences in performance and energy efficiency. Unlike traditional workloads with predictable execution patterns, machine learning models introduce diverse computational structures that require mappings adapted to data reuse, parallelization opportunities, and memory constraints. The search space grows combinatorially, making exhaustive search infeasible. Three sources of variation contribute to this complexity:

Ordering computation and execution

Machine learning workloads are often structured as nested loops that iterate over various dimensions of computation. For instance, a matrix multiplication kernel may loop over batch size ($B$), input features ($C_{\text{in}}$), and output features ($C_{\text{out}}$). The order in which these loops execute has a profound effect on data locality, reuse patterns, and computational efficiency.

The number of ways to arrange $n_{\text{loops}}$ loops follows a factorial growth pattern: \[ N_{\text{order}} = n_{\text{loops}}! \] which scales rapidly. A typical convolutional layer may involve up to seven loop dimensions, leading to: \[ 7! = 5{,}040 \text{ possible execution orders.} \]

Mapping choices explode combinatorially as loop dimensions grow.

When considering multiple memory levels, the search space expands as: \[ (n_{\text{loops}}!)^{N_{\text{mem}}} \] where $N_{\text{mem}}$ is the number of memory hierarchy levels. This rapid expansion shows why execution order optimization matters: poor loop ordering can lead to excessive memory traffic, while an optimized order improves cache utilization (Sze et al. 2017).
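The ordering counts can be checked by direct enumeration. The sketch below lists the six orderings of the three matrix-multiplication loops named above and evaluates the factorial formulas; the function name `ordering_count` is illustrative.

```python
from itertools import permutations
from math import factorial

# The three matmul loop dimensions named in the text.
loops = ("B", "C_in", "C_out")
orders = list(permutations(loops))
for order in orders:
    print(" -> ".join(order))   # listed from outermost to innermost loop


def ordering_count(n_loops, n_mem=1):
    """Number of loop orderings when each of n_mem memory levels
    orders the same n_loops loops independently."""
    return factorial(n_loops) ** n_mem


print("7-loop convolution, one level:   ", ordering_count(7))
print("7-loop convolution, three levels:", ordering_count(7, n_mem=3))
```

Enumeration is practical for three loops but not for seven loops across three memory levels, which is the point of the formula.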

Parallelization across processing elements

Modern AI accelerators use thousands of processing elements to maximize parallelism, but determining which computations should be parallelized requires careful analysis. Excessive parallelization can introduce synchronization overheads and increased bandwidth demands, while insufficient parallelization leads to underutilized hardware.

The number of ordered ways to distribute computations among parallel units follows the permutation count: \[ \mathcal{P}_{\text{parallel}} = \frac{n_{\text{loops}}!}{(n_{\text{loops}}-k_{\text{parallel}})!} \] where $n_{\text{loops}}$ is the number of loops, and $k_{\text{parallel}}$ is the number selected for parallel execution. For a six-loop computation where three loops are chosen for parallel execution, the number of valid configurations is: \[ \frac{6!}{(6-3)!} = 120. \]

Even for a single layer, there can be more than a hundred valid parallelization strategies, each affecting data synchronization, memory contention, and overall compute efficiency. Expanding this across multiple layers and model architectures further magnifies the complexity.
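The permutation count is available directly in Python's standard library (`math.perm`, Python 3.8 and later) and can be cross-checked by enumeration. This is a small sketch; the loops are represented by integer indices rather than named dimensions, since the text does not specify which six loops are meant.

```python
from itertools import permutations
from math import perm


def parallel_choices(n_loops, k_parallel):
    """Ordered selections of k_parallel loops out of n_loops: n! / (n - k)!."""
    return perm(n_loops, k_parallel)


# Cross-check the closed form against explicit enumeration for the 6-loop case.
enumerated = sum(1 for _ in permutations(range(6), 3))
print(parallel_choices(6, 3), enumerated)
```

The count is ordered because assigning a loop to one hardware dimension (for example, rows versus columns of a PE array) is treated as a different mapping from assigning it to another.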

Memory placement and data movement

The hierarchical memory structure of AI accelerators introduces additional constraints, as data must be efficiently placed across registers, caches, shared memory, and off-chip DRAM. Data placement impacts latency, bandwidth consumption, and energy efficiency. Frequent access to slow memory creates bottlenecks, while optimized placement reduces costly memory transfers.

The number of ways to allocate data across memory levels follows an exponential growth function: \[ \mathcal{M}_{\text{placement}} = n^{N_{\text{comp}} \times N_{\text{mem}}} \] where:

$n$ = number of placement choices per level,
$N_{\text{comp}}$ = number of computational dimensions,
$N_{\text{mem}}$ = number of memory hierarchy levels.

For a model with:

$N_{\text{comp}} = 5$ computational dimensions,
$N_{\text{mem}} = 3$ memory levels,
$n = 4$ possible placement choices per level,

the number of possible memory allocations is: \[ 4^{5 \times 3} = 4^{15} = 1{,}073{,}741{,}824. \]

Mapping search space

Even a single layer may have over a billion possible memory configurations, making manual optimization impractical. By combining the complexity from computation ordering, parallelization, and memory placement, the total mapping search space can be approximated as: \[ \mathcal{S}_{\text{mapping}} = \left( n^{N_{\text{comp}}} \times n_{\text{loops}}! \times \frac{n_{\text{loops}}!}{(n_{\text{loops}}-k_{\text{parallel}})!} \right)^{N_{\text{mem}}} \] where:

$n^{N_{\text{comp}}}$ represents memory placement choices,
$n_{\text{loops}}!$ accounts for computation ordering choices,
$\frac{n_{\text{loops}}!}{(n_{\text{loops}}-k_{\text{parallel}})!}$ captures parallelization possibilities,
$N_{\text{mem}}$ is the number of memory hierarchy levels.
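The combined expression can be evaluated directly to get a feel for its magnitude. The sketch below reuses the illustrative values from the preceding subsections ($n = 4$, $N_{\text{comp}} = 5$, $N_{\text{mem}} = 3$, and a six-loop nest with three parallel loops); combining those particular values in one layer is an assumption of this example, and `mapping_space` is a hypothetical helper name.

```python
from math import factorial, perm


def mapping_space(n, n_comp, n_loops, k_parallel, n_mem):
    """Approximate mapping search-space size:
    (n^N_comp * n_loops! * n_loops!/(n_loops - k)!) ^ N_mem."""
    per_level = (n ** n_comp) * factorial(n_loops) * perm(n_loops, k_parallel)
    return per_level ** n_mem


memory_only = 4 ** (5 * 3)   # memory-placement term alone
total = mapping_space(n=4, n_comp=5, n_loops=6, k_parallel=3, n_mem=3)

print(f"memory placement alone: {memory_only:,}")
print(f"combined search space:  {total:.3e}")
```

Python integers are arbitrary precision, so the count is exact even though it is far too large to enumerate.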

This equation illustrates the exponential growth of the search space, making brute-force search infeasible for all but the simplest cases.

The combinatorial explosion revealed by this analysis, potentially billions of valid configurations for a single neural network layer, raises a practical question: how do practitioners achieve strong performance despite this vast search space? Exhaustive enumeration is infeasible, yet production systems routinely find useful schedules for common kernels. The answer lies in a small set of principled dataflow patterns that reduce this intractable configuration space to a manageable set of strategic choices. Before turning to those patterns, a concrete example makes the impact of a single mapping choice tangible.

Example 1.3: Loop ordering in a small convolution

Consider a convolution applying 16 filters of size $3{\times}3$ to an $8{\times}8$ single-channel input. The computation can be expressed as five nested loops iterating over output rows ($H_{\text{out}}$), output columns ($W_{\text{out}}$), filter count ($C_{\text{out}}$), filter height ($F_h$), and filter width ($F_w$). The $5! = 120$ possible orderings of these loops all produce the same numerical result, but they generate dramatically different memory traffic.

Ordering A (weight-stationary): Place the filter loops ($C_{\text{out}}$, $F_h$, $F_w$) outermost and the spatial loops ($H_{\text{out}}$, $W_{\text{out}}$) innermost. Each $3{\times}3$ filter is loaded into registers once and then applied across all 36 output positions before the next filter is loaded. Total weight loads: $16 \times 9 = 144$ values, each loaded exactly once.

Ordering B (output-stationary): Place the spatial loops outermost and the filter loops innermost. For every output position, all 16 filters must be loaded, applied, and their partial sums accumulated before advancing to the next position. If the register file cannot hold all 16 filters simultaneously, filters are repeatedly fetched from cache or DRAM. In the worst case, each of the 36 output positions reloads all 144 filter weights, producing $36 \times 144 = 5{,}184$ weight reads.

Systems insight: Ordering A reduces weight traffic by 36$\times$ compared to Ordering B by matching the loop structure to a weight-stationary dataflow. This single reordering decision, one of the 120 possibilities predicted by the $n_{\text{loops}}! = 5! = 120$ formula, determines whether the accelerator spends its memory bandwidth loading fresh data or redundantly re-fetching weights it has already seen.
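The two weight-traffic totals can be reproduced by walking the loop nest and counting loads. The sketch below is a simplified model rather than a hardware simulation: it assumes the register file holds exactly one $3{\times}3$ filter at a time and that every change of filter costs nine weight loads, which corresponds to the worst case described for Ordering B. The names are illustrative.

```python
H_OUT, W_OUT = 6, 6    # 8x8 input, 3x3 filter, stride 1, no padding -> 6x6 output
C_OUT = 16             # number of filters
FILTER_WEIGHTS = 9     # 3 x 3, single input channel


def weight_loads(filter_loop_outermost):
    """Count weight loads when only one filter fits in registers at a time."""
    if filter_loop_outermost:   # Ordering A: filter loops outside, spatial loops inside
        schedule = ((c, h, w) for c in range(C_OUT)
                    for h in range(H_OUT) for w in range(W_OUT))
    else:                       # Ordering B: spatial loops outside, filter loop inside
        schedule = ((c, h, w) for h in range(H_OUT)
                    for w in range(W_OUT) for c in range(C_OUT))
    resident_filter, loads = None, 0
    for c, _h, _w in schedule:
        if c != resident_filter:        # needed filter is not in registers: reload it
            loads += FILTER_WEIGHTS
            resident_filter = c
    return loads


loads_a = weight_loads(filter_loop_outermost=True)    # 144
loads_b = weight_loads(filter_loop_outermost=False)   # 5184
print(loads_a, loads_b, loads_b // loads_a)
```

Both schedules visit the same 576 (filter, position) pairs; only the visiting order differs, and that order alone accounts for the 36-fold gap in weight traffic.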

Self-Check: Question

Which decomposition best captures the three dimensions of hardware mapping for neural-network execution?
1. Operation placement on specific compute resources, tensor allocation across levels of the memory hierarchy, and temporal execution order with synchronization—all decided jointly
2. Choosing the programming language and framework that will implement the model at deployment time
3. Randomly distributing tensor operations across available cores to achieve fair load balancing
4. Pruning or quantizing the model so fewer parameters need to be stored on the accelerator
Explain why computation placement and memory allocation must be decided jointly rather than independently, using a concrete example of how decoupling them degrades performance.
A 2D convolution has loops over output-height, output-width, input-channels, output-channels, filter-height, and filter-width. Why does reordering these loops materially affect performance even though the arithmetic result is identical?
1. Loop order determines which variables become the innermost (fastest-changing) indices, which in turn controls which operands can be reused from a register or cache versus reloaded from a slower tier—turning a high-reuse schedule into a schedule that reloads the same data repeatedly
2. Loop order determines the mathematical value of the convolution, so different orders produce different numerical outputs
3. Loop order only affects code readability and has no effect on runtime performance on modern hardware
4. The hardware accepts exactly one legal loop nesting per convolution kernel, so reordering is impossible rather than inefficient
A team profiles a transformer kernel and finds that its activations are allocated to HBM while the kernel is placed on a PE cluster with large L2 capacity and strong peer bandwidth. HBM utilization saturates at 95 percent while PE utilization sits at 18 percent. Which mapping decision failed?
1. Memory allocation: the activations should have been tiled and staged into L2 so their reuse across attention heads is served by the fast tier rather than pulled from HBM every access
2. Operation placement: the PE cluster is too fast for this kernel, so moving the operation onto a slower cluster would improve apparent utilization
3. Execution order: serializing attention heads would reduce HBM pressure by limiting concurrent accesses
4. Precision choice: switching from FP16 to FP32 would let the kernel use more of the available HBM bandwidth
Why is brute-force search over all legal mappings impractical even for a single layer?
1. The number of legal loop permutations, parallelization decompositions, and memory-placement choices grows combinatorially, so the legal-mapping space for a single convolution can exceed billions of candidates—well beyond any exhaustive evaluation
2. Modern compilers already know the optimal mapping analytically for every accelerator, so search is unnecessary
3. Only graph neural networks have enough operator complexity to make the search space nontrivial
4. Legal mappings number in the low dozens for typical kernels, so search completes in milliseconds

See Answers →

Dataflow Optimization

The mapping strategies from the preceding section establish where computations execute and where data resides, but they do not specify dataflow optimization: how data flows through processing elements during execution. A systolic array might process a matrix multiplication with weights in local memory, but the order in which weights, inputs, and outputs move through the array directly determines memory bandwidth consumption and energy efficiency. The choice among strategies directly impacts whether an accelerator operates in the compute-bound or memory-bound region identified by the Roofline analysis—which is why compilers (section 1.8) and runtime systems (section 1.9) must select appropriate dataflow patterns based on workload characteristics.

Three decisions structure all dataflow optimization:

Locality: Weight-stationary, output-stationary, and input-stationary strategies each make different choices about what to cache near compute units, trading off different memory access patterns.
Organization: Tensor layouts (NHWC vs. NCHW) determine whether memory accesses align with hardware preferences, with performance impacts that can be large when layout conversions or uncoalesced access block the fast path.
Combination: Kernel fusion and tiling restructure computation to minimize memory traffic, often producing large speedups on low-arithmetic-intensity operations by avoiding intermediate writes and reloads.

By mastering these patterns, we can reason about 90 percent of dataflow optimization decisions without exhaustive search. The next sections examine each decision in turn, then show how they combine for specific neural network architectures including ResNet-50, GPT-2, and MLPs.

Building blocks of mapping strategies

The preceding three decisions map to four foundational techniques: data movement patterns (weight-stationary, output-stationary, input-stationary), memory-efficient tensor layouts (row-major vs. channel-major), kernel fusion (combining operations to eliminate intermediate writes), and tiling (partitioning computations into memory-friendly blocks). Together, these building blocks reduce the mapping search space: heuristic and model-driven optimizers can combine them instead of rediscovering the same data-movement choices from scratch.

Data movement patterns

While computational mapping determines where and when operations occur, its success depends heavily on how efficiently data is accessed and transferred across the memory hierarchy. As discussed in section 1.4.2.2, machine learning workloads exhibit irregular access patterns that challenge standard caching mechanisms. This irregularity makes data movement strategy critical to overall system performance.

Even when computational units are mapped efficiently, poor data movement strategies degrade performance by causing frequent memory stalls and leaving hardware resources idle. If data cannot be supplied to processing elements at the required rate, computational units stall, increasing latency, memory traffic, and energy consumption (Chen, Krishna, et al. 2017). Listing 15 illustrates how data movement inefficiencies affect the backbone computation of many machine learning models through a typical matrix multiplication operation.

Listing 15: Matrix Multiplication: Data movement bottlenecks leave hardware resources idle, demonstrating why efficient data flow determines machine learning model performance.

import numpy as np

## Matrix multiplication where:
## weights: [512 × 256] - model parameters
## inputs:  [256 × 32]  - batch of activations
## Z:       [512 × 32]  - output activations
weights = np.random.rand(512, 256)
inputs = np.random.rand(256, 32)
Z = np.zeros((512, 32))

## Computing each output element Z[i, j]:
for i in range(512):
    for j in range(32):
        for k in range(256):
            Z[i, j] += weights[i, k] * inputs[k, j]

This computation reveals several critical dataflow challenges. The first challenge is the number of memory accesses required. Each output $Z[i, j]$ needs one full row of the weight matrix (256 values) and one full column of the input matrix (256 values). With 512 output rows and 32 output columns, every weight row is fetched again for each of the 32 columns, and every input column is fetched again for each of the 512 rows. Without reuse, the same operands cross the memory interface many times, placing a heavy burden on memory bandwidth.

The second challenge comes from weight reuse. The same weights are applied to multiple inputs, meaning that an ideal mapping strategy should maximize weight locality to avoid redundant memory fetches. Without proper reuse, the accelerator would waste bandwidth loading the same weights multiple times (Chen et al. 2018).

The third challenge involves the accumulation of intermediate results. Since each element in $Z[i,j]$ requires contributions from 256 different weight-input pairs, partial sums must be stored and retrieved before the final value is computed. If these intermediate values are stored inefficiently, the system will require frequent memory accesses, further increasing bandwidth demands.
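To put rough numbers on these challenges, the following sketch counts memory operations for the loop nest in Listing 15 under a deliberately pessimistic assumption: no operand is ever reused from a register or cache, so every multiply-accumulate reads one weight and one input from memory. The function name `naive_matmul_traffic` is illustrative, and the no-reuse assumption is a worst case, not a model of any particular processor.

```python
def naive_matmul_traffic(M, K, N):
    """Count memory operations for Z[M,N] = W[M,K] @ X[K,N] with no operand reuse.

    Worst-case assumption: each multiply-accumulate (MAC) reads one weight
    and one input from memory and updates one partial sum.
    """
    macs = M * N * K
    return {
        "operand_reads": 2 * macs,         # one weight + one input per MAC
        "partial_sum_updates": macs,       # Z[i, j] += ... once per MAC
        "unique_operands": M * K + K * N,  # distinct weight + input elements
    }


stats = naive_matmul_traffic(M=512, K=256, N=32)
redundancy = stats["operand_reads"] / stats["unique_operands"]
print(stats)
print(f"redundancy factor: {redundancy:.1f}x")
```

Under these assumptions the loop issues about 8.4 million operand reads to touch roughly 139 thousand distinct values, a redundancy of about 60$\times$. Real hardware recovers part of this through caches, but the gap indicates how much traffic a deliberate reuse strategy can remove.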

One way to mitigate these challenges is to use SIMD and SIMT execution models, which allow multiple values to be fetched in parallel. However, even with these optimizations, data movement remains a bottleneck. The issue is not just how quickly data is retrieved but how often it must be moved and where it is placed within the memory hierarchy (Han et al. 2016).

Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 243–54. https://doi.org/10.1109/isca.2016.30.

Because moving data dominates the energy budget that figure 9 charts, the single most important goal of an accelerator is to minimize memory access. Dataflow strategies achieve this by maximizing data reuse. The central decision is which data is most valuable to keep local. Accelerators answer that decision by determining which data remains fixed in memory and which data streams dynamically: weight-stationary keeps model parameters local, input-stationary maintains activation data, and output-stationary preserves intermediate results. Each approach trades off different memory access patterns to maximize data reuse and minimize the energy-intensive transfers that constitute the primary bottleneck in AI acceleration.

Weight stationary

The weight stationary strategy keeps weights fixed in local memory, while input activations and partial sums are streamed through the system. Weight stationary approaches prove particularly beneficial in CNNs and matrix multiplications, where the same set of weights is applied across multiple inputs. By ensuring weights remain stationary, this method reduces redundant memory fetches, which helps alleviate bandwidth bottlenecks and improves energy efficiency.

A key advantage of weight stationary is that it maximizes weight reuse, reducing the frequency of memory accesses to external storage. Since weight parameters are often shared across multiple computations, keeping them in local memory eliminates unnecessary data movement, lowering the overall energy cost of computation. This makes it particularly effective for architectures where weights represent the dominant memory overhead, such as systolic arrays and custom accelerators designed for machine learning. Listing 16 demonstrates how Weight Stationary execution keeps weights fixed in local memory while streaming inputs and accumulating partial sums.

Listing 16: Weight Stationary Dataflow: Weights stay resident in local memory while inputs and partial sums stream through, minimizing parameter read traffic; best for CNNs and matrix multiplications with heavy weight reuse.

## Weight Stationary Matrix Multiplication
## - Weights remain fixed in local memory
## - Input activations stream through
## - Partial sums accumulate for final output
## (load_to_local, inputs_matching, output_for, and compute are
##  illustrative placeholders, not a real API)

for weight_block in weights:  # Load and keep weights stationary
    load_to_local(weight_block)  # Fixed in local storage
    for input_block in inputs_matching(weight_block):  # Stream inputs
        # Each weight-input pair contributes to exactly one output block
        output_block = output_for(weight_block, input_block)
        output_block += compute(weight_block, input_block)
        # Reuse weights across inputs

In weight stationary execution, each weight block is loaded once into local memory and stays fixed while inputs stream past it, so no weight is fetched twice. Partial sums are updated as each input arrives. Because weights need not be reloaded for each new computation, weight-related bandwidth drops significantly, making this dataflow highly effective for workloads with heavy weight reuse such as CNNs and matrix multiplications.
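The loop structure can be made concrete with a small runnable sketch. It uses plain Python lists in place of accelerator memory, and the `traffic` counters stand in for transfers between external and local storage; the function name, the tile size, and the counter names are all assumptions made for illustration.

```python
def matmul_weight_stationary(W, X, tile=2):
    """Compute Z = W @ X, holding one weight tile local at a time."""
    M, K, N = len(W), len(W[0]), len(X[0])
    Z = [[0] * N for _ in range(M)]
    traffic = {"weight_loads": 0, "input_loads": 0, "psum_updates": 0}
    for i0 in range(0, M, tile):
        for k0 in range(0, K, tile):
            # Load one weight tile into "local memory"; it stays fixed below.
            w_tile = [row[k0:k0 + tile] for row in W[i0:i0 + tile]]
            traffic["weight_loads"] += sum(len(r) for r in w_tile)
            for j in range(N):  # stream input columns past the stationary tile
                x_col = [X[k][j] for k in range(k0, min(k0 + tile, K))]
                traffic["input_loads"] += len(x_col)
                for di, w_row in enumerate(w_tile):
                    # Partial sum for output (i0 + di, j) is updated in place.
                    Z[i0 + di][j] += sum(w * x for w, x in zip(w_row, x_col))
                    traffic["psum_updates"] += 1
    return Z, traffic


W = [[i + k for k in range(4)] for i in range(4)]      # 4 x 4 weights
X = [[k * j + 1 for j in range(3)] for k in range(4)]  # 4 x 3 inputs
Z, traffic = matmul_weight_stationary(W, X)
reference = [[sum(W[i][k] * X[k][j] for k in range(4)) for j in range(3)]
             for i in range(4)]
print(Z == reference, traffic)
```

In this toy run every weight is loaded exactly once (16 loads for 16 weights), while inputs are loaded 24 times for 12 distinct values and each output is revisited once per tile along the reduction dimension. The streamed input traffic and the repeated partial-sum updates are the costs this dataflow accepts in exchange for weight locality.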

However, while this strategy reduces weight-related memory traffic, it introduces trade-offs in input and output movement. Since inputs must be streamed dynamically while weights remain fixed, the efficiency of this approach depends on how well input activations can be delivered to the computational units without causing stalls. Partial sums, which represent intermediate results, must also be carefully accumulated to avoid excessive memory traffic. The total performance gain depends on the size of available on-chip memory, as storing larger weight matrices locally can become a constraint in models with millions or billions of parameters.

The weight stationary strategy is well-suited for workloads where weights exhibit high reuse and memory bandwidth is a limiting factor. It is commonly employed in CNNs, systolic arrays, and matrix multiplication kernels, where structured weight reuse leads to measurable performance improvements. However, for models where input or output reuse is more critical, alternative dataflow strategies, such as output stationary or input stationary, may provide better trade-offs.

Output stationary

Weight stationary keeps weights local and streams inputs through the system. The dominant cost shifts, however, when the bottleneck is not weight loading but the frequent writes of partial sums. In fully connected layers and transformer attention mechanisms, each output element accumulates contributions from hundreds or thousands of weight-input pairs. Writing those intermediate partial sums to external memory after every accumulation step would create a write-bandwidth bottleneck far more severe than the read overhead that weight stationary addresses. The output stationary strategy inverts the priority: it keeps partial sums fixed in local memory while streaming both weights and input activations through the system, so that each output element is written to external memory only once, after all its contributions have been accumulated (Chen, Krishna, et al. 2017).

Chen, Yu-Hsin, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.” IEEE Journal of Solid-State Circuits 52 (1): 127–38. https://doi.org/10.1109/JSSC.2016.2616357.

Listing 17 demonstrates how accumulating partial sums locally minimizes memory writes and enhances efficiency during matrix multiplication. In this implementation, the accumulator buffer stays in local registers or scratchpad throughout the inner loop; weights and inputs stream in, contribute to the running sum, and are discarded. The final result is written out only once per output element, eliminating the repeated write traffic that would otherwise dominate bandwidth.

Listing 17: Output Stationary Dataflow: Partial sums stay resident in local memory while weights and inputs stream through, so each output is written out only once, minimizing accumulation write traffic; best for fully connected layers and attention.

## Output Stationary Matrix Multiplication
## - Partial sums remain in local memory
## - Weights and input activations stream through dynamically
## - Final outputs are written only once
## (operand_pairs, compute, and store_output are illustrative placeholders)

for output_block in outputs:  # Keep partial sums stationary
    accumulator = 0  # Initialize accumulation buffer
    # Stream only the weight-input pairs that contribute to this output
    for weight_block, input_block in operand_pairs(output_block):
        accumulator += compute(weight_block, input_block)
        # Accumulate partial sums
    store_output(accumulator)  # Single write to memory

This approach aligns naturally with systolic arrays, where computation progresses through a grid of processing elements and partial sums can flow along one axis without leaving the chip. The trade-off is that both weights and activations must now be streamed dynamically, so the system must sustain high read bandwidth for two data streams simultaneously. Parallel implementations also require careful synchronization when multiple PEs contribute to the same output element. Output stationary is therefore most effective for workloads where accumulation dominates, such as fully connected layers and attention mechanisms, but less suitable when input reuse is the critical bottleneck.
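A minimal runnable sketch of the same idea follows, again using plain Python lists as a stand-in for accelerator memory. The function and counter names are assumptions for illustration, and the sketch models a single processing element, so the synchronization issue mentioned above does not arise.

```python
def matmul_output_stationary(W, X):
    """Compute Z = W @ X, keeping each partial sum in a local accumulator."""
    M, K, N = len(W), len(W[0]), len(X[0])
    Z = [[0] * N for _ in range(M)]
    traffic = {"weight_loads": 0, "input_loads": 0, "output_writes": 0}
    for i in range(M):
        for j in range(N):
            acc = 0  # partial sum stays local for the whole reduction
            for k in range(K):
                # Both operands stream in and are discarded after use.
                acc += W[i][k] * X[k][j]
                traffic["weight_loads"] += 1
                traffic["input_loads"] += 1
            Z[i][j] = acc  # single write per output element
            traffic["output_writes"] += 1
    return Z, traffic


W = [[i + k for k in range(4)] for i in range(4)]      # 4 x 4 weights
X = [[k * j + 1 for j in range(3)] for k in range(4)]  # 4 x 3 inputs
Z, traffic = matmul_output_stationary(W, X)
reference = [[sum(W[i][k] * X[k][j] for k in range(4)) for j in range(3)]
             for i in range(4)]
print(Z == reference, traffic)
```

Each of the 12 outputs is written exactly once, but both operand streams are read 48 times for only 28 distinct values. That doubled read traffic is the trade-off the paragraph above describes.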

Input stationary

The two strategies examined so far each fix a different operand in local memory: weight stationary fixes weights to reduce read bandwidth for parameters, and output stationary fixes partial sums to reduce write bandwidth for accumulations. The third strategy completes the picture by fixing the remaining operand: input activations. In transformer models, a single input token participates in computations across multiple attention heads and layers; in batch processing, the same activation batch feeds into many different weight matrices. When activation reuse is the dominant memory cost, keeping inputs stationary and streaming weights through the system yields the best energy and bandwidth trade-off. Listing 18 illustrates this approach, maximizing reuse by keeping input activations stationary in local memory while dynamically streaming weights.

Listing 18: Input Stationary Dataflow: Input activations stay resident in local memory while weights stream through, minimizing activation read traffic; best for transformers and large-batch inference where each activation is reused.

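## Input Stationary Matrix Multiplication
## - Input activations remain in local memory
## - Weights stream through dynamically
## - Partial sums accumulate and are written out
## (load_to_local, weights_matching, output_for, and compute are
##  illustrative placeholders, not a real API)

for input_block in inputs:  # Keep input activations stationary
    load_to_local(input_block)  # Fixed in local storage
    for weight_block in weights_matching(input_block):  # Stream weights
        # Each weight-input pair contributes to exactly one output block
        output_block = output_for(weight_block, input_block)
        output_block += compute(weight_block, input_block)
        # Reuse inputs across weights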

Here, input activations are loaded once and held fixed while weights stream through. Partial sums accumulate and are eventually written out, but unlike output stationary, the accumulation buffer is not the primary beneficiary of locality; instead, the input data is.

The trade-off mirrors the other two strategies: weights must now be streamed dynamically, so the system needs sustained read bandwidth for the weight stream, and partial sums require buffering before write-back. Input stationary is most effective in transformers (where each token is reused across attention heads), recurrent networks (where the hidden state participates in repeated computations), and large-batch inference (where the same activation batch feeds many weight matrices).
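The following sketch makes the input stationary loop order runnable on a small example. As before, Python lists stand in for accelerator memory, and the function name, tile size, and `traffic` counters are assumptions introduced only for illustration.

```python
def matmul_input_stationary(W, X, tile=2):
    """Compute Z = W @ X, holding one input tile local at a time."""
    M, K, N = len(W), len(W[0]), len(X[0])
    Z = [[0] * N for _ in range(M)]
    traffic = {"input_loads": 0, "weight_loads": 0, "psum_updates": 0}
    for k0 in range(0, K, tile):
        for j0 in range(0, N, tile):
            # Load one input tile into "local memory"; it stays fixed below.
            x_tile = [row[j0:j0 + tile] for row in X[k0:k0 + tile]]
            traffic["input_loads"] += sum(len(r) for r in x_tile)
            for i in range(M):  # stream weight rows past the stationary tile
                w_seg = W[i][k0:k0 + tile]
                traffic["weight_loads"] += len(w_seg)
                for dj in range(len(x_tile[0])):
                    # Pair each streamed weight with the input row of the same k.
                    Z[i][j0 + dj] += sum(
                        w * x_row[dj] for w, x_row in zip(w_seg, x_tile)
                    )
                    traffic["psum_updates"] += 1
    return Z, traffic


W = [[i + k for k in range(4)] for i in range(4)]      # 4 x 4 weights
X = [[k * j + 1 for j in range(4)] for k in range(4)]  # 4 x 4 inputs
Z, traffic = matmul_input_stationary(W, X)
reference = [[sum(W[i][k] * X[k][j] for k in range(4)) for j in range(4)]
             for i in range(4)]
print(Z == reference, traffic)
```

Here every input element is loaded once (16 loads for 16 values), while the weight stream is read twice over (32 loads for 16 values) and partial sums are revisited once per tile along the reduction dimension.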

Taken together, the three dataflow strategies illustrate a central design choice rather than a hierarchy of quality. Weight stationary minimizes read traffic for parameters and suits CNNs with small, heavily reused filters. Output stationary minimizes write traffic for accumulations and suits fully connected layers with high fan-in. Input stationary minimizes read traffic for activations and suits transformers and batch processing with high activation reuse. No single strategy dominates; the optimal choice depends on which data element has the highest reuse ratio relative to its size, a determination that the compiler and hardware designer must make based on the specific workload and memory hierarchy. Convolution-specific designs add a fourth label, row-stationary (as in Eyeriss), but it is not a separate primitive: it is a hybrid that keeps rows of inputs and partial sums local when that mapping yields higher reuse than fixing any single operand (section 1.4.6.2).
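One rough way to apply the "reuse relative to size" criterion is to tabulate, for a given matrix multiplication, how many elements each operand has and how many times each element is used. The sketch below does this for the shapes in Listing 15. The ranking score (uses per element divided by element count) is a crude heuristic assumed here for illustration; real compilers also weigh buffer capacity, bandwidth, and the surrounding layers.

```python
def operand_reuse(M, K, N):
    """Size and per-element reuse of each operand in Z[M,N] = W[M,K] @ X[K,N]."""
    return {
        "weights": {"elements": M * K, "uses_per_element": N},
        "inputs": {"elements": K * N, "uses_per_element": M},
        "outputs": {"elements": M * N, "uses_per_element": K},
    }


def reuse_per_size(entry):
    # Crude ranking score (an assumption of this sketch): reuse divided by size.
    return entry["uses_per_element"] / entry["elements"]


profile = operand_reuse(M=512, K=256, N=32)
candidate = max(profile, key=lambda name: reuse_per_size(profile[name]))
print(profile)
print("highest reuse relative to size:", candidate)
```

For these shapes the activations are the smallest operand (8,192 elements) and the most heavily reused (512 uses each), so the heuristic favors keeping inputs stationary. A convolution with small filters applied across a large feature map would shift the balance toward weights.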

Memory-efficient tensor layouts

The preceding dataflow strategies determine which data stays close to compute; tensor layouts determine whether that data can be accessed efficiently once it arrives. A perfectly chosen weight-stationary dataflow still suffers if weights are stored in a format that causes scattered memory accesses. Tensor layout is therefore a kernel contract: the physical arrangement of multidimensional data must match the access pattern expected by the selected hardware path, or the accelerator pays in memory stalls, inefficient cache usage, and increased data movement.

In AI accelerators, tensor layout optimization is particularly important because data is frequently accessed in patterns dictated by the underlying hardware architecture. Choosing the right layout ensures that memory accesses align with hardware-friendly access patterns, minimizing overhead from costly memory transactions (NVIDIA Corporation 2021).

While developers can sometimes manually specify tensor layouts, the choice is often determined automatically by machine learning frameworks such as TensorFlow, PyTorch, and JAX, by compilers, or by AI accelerator runtimes. Low-level optimization tools such as cuDNN (for NVIDIA GPUs), XLA (for TensorFlow graphs), and MLIR-based compiler stacks may impose or transform tensor layouts as they lower operations to backend-specific kernels (NVIDIA Corporation 2021; Google 2025; Lattner et al. 2020). In high-level frameworks, layout transformations are typically applied transparently, but developers working with custom kernels or low-level libraries such as CUDA, Metal, or OpenCL may have direct control over tensor format selection.

For example, PyTorch exposes tensor layout operations such as tensor.permute() and tensor.contiguous() for explicit memory-format control (Paszke et al. 2019). TensorFlow often applies layout optimizations internally through the XLA compiler, choosing between NHWC (row-major) and NCHW (channel-major) based on the target hardware (Brain 2022). Hardware-aware libraries such as cuDNN for GPUs and oneDNN for CPUs enforce specific memory layouts to maximize cache locality and SIMD efficiency. The practical rule is to treat layout as part of the selected backend path: the fastest tensor format is the one that avoids conversion overhead and makes the kernel’s memory accesses contiguous.

Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” Advances in Neural Information Processing Systems 32: 8024–35.

Brain, Google. 2022. TensorFlow Documentation.

Row-major layout

Row-major layout is the memory storage convention where multi-dimensional tensor elements are arranged row by row, ensuring that all values in a given row are placed contiguously before moving to the next row. This storage format is widely used in general-purpose CPUs and some machine learning frameworks because it aligns naturally with sequential memory access patterns, making it more cache-efficient for certain types of operations (Intel Corporation 2021b).

To understand how row-major layout works, consider a single RGB image represented as a tensor of shape (Height, Width, Channels). If the image has a size of $3{\times}3$ pixels with 3 channels (RGB), the corresponding tensor is structured as (3, 3, 3). The values are stored in memory as follows: \[\begin{gather*} I(0,0,0), I(0,0,1), I(0,0,2), I(0,1,0), I(0,1,1), \\ I(0,1,2), I(0,2,0), I(0,2,1), I(0,2,2), \ldots \end{gather*}\]

Each row is stored contiguously, meaning all pixel values in the first row are placed sequentially in memory before moving on to the second row. This ordering is advantageous because CPUs and cache hierarchies are optimized for sequential memory access. When data is accessed in a row-wise fashion, such as when applying element-wise operations like activation functions or basic arithmetic transformations, memory fetches are efficient, and cache utilization is maximized (Sodani 2015).
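The ordering above follows from a simple address formula. The sketch below computes the flat offset of each element for a (Height, Width, Channels) tensor in which the channel index varies fastest; the helper name `hwc_offset` is an assumption for illustration.

```python
def hwc_offset(h, w, c, H, W, C):
    """Flat index of I(h, w, c) when channels vary fastest (row-major HWC)."""
    return (h * W + w) * C + c


H, W, C = 3, 3, 3
order = sorted(
    ((h, w, c) for h in range(H) for w in range(W) for c in range(C)),
    key=lambda idx: hwc_offset(*idx, H, W, C),
)
print(order[:5])                    # first five elements in memory order
print(hwc_offset(1, 0, 0, H, W, C))  # offset where the second row begins
```

The three channel values of one pixel sit at adjacent offsets, and moving down one image row advances the offset by `W * C` (9 in this example).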

Sodani, Avinash. 2015. “Knights Landing (KNL): 2nd Generation Intel® Xeon Phi Processor.” 2015 IEEE Hot Chips 27 Symposium (HCS), 1–24. https://doi.org/10.1109/hotchips.2015.7477467.

The efficiency of row-major storage becomes particularly evident in CPU-based machine learning workloads, where operations such as batch normalization, matrix multiplications, and element-wise arithmetic frequently process rows of data sequentially. Since modern CPUs employ cache prefetching mechanisms, a row-major layout allows the next required data values to be preloaded into cache ahead of execution, reducing memory latency and improving overall computational throughput.

However, layout choice becomes subtle for convolutions because logical dimension order and physical memory format are not the same thing. A tensor may be described as NHWC or NCHW, but the backend ultimately cares whether the memory addresses consumed by a kernel are contiguous, aligned, and coalesced for the specific operator and precision mode.

Despite this subtlety, row-major layout remains important in CPU-based machine learning frameworks. TensorFlow, for instance, commonly uses NHWC conventions, while PyTorch commonly exposes NCHW tensors with a separate channels-last memory format option. When targeting GPUs, frameworks and libraries may insert, propagate, or internally perform layout transformations to match the fastest kernel path.

Channel-major layout

In contrast to row-major layout, channel-major layout arranges data in memory such that all values for a given channel are stored together before moving to the next channel. The key insight is that GPUs process data in parallel across threads, and when threads access consecutive memory addresses, the hardware can combine these requests into a single efficient transaction (memory coalescing). Historically, many GPU convolution paths used NCHW effectively, while modern Tensor Core convolution and fusion paths often prefer NHWC or channels-last physical layouts because those layouts align better with vectorized tensor-core kernels.

To understand how channel-major layout works, consider the same RGB image tensor of size (Height, Width, Channels) = (3, 3, 3). Instead of storing pixel values row by row, the data is structured channel-first in memory as follows: \[\begin{gather*} I(0,0,0), I(0,1,0), I(0,2,0), I(1,0,0), I(1,1,0), I(1,2,0), \ldots, \\ I(0,0,1), I(0,1,1), I(0,2,1), \ldots, I(0,0,2), I(0,1,2), I(0,2,2), \ldots \end{gather*}\]

In this format, all red channel values for the entire image are stored first, followed by all green values, and then all blue values. This ordering can allow some hardware accelerators to efficiently load and process data across channels in parallel, which is important for convolution operations and SIMD (Single Instruction, Multiple Data) execution models (Chetlur et al. 2014).
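The channel-first ordering corresponds to a different address formula for the same logical index I(h, w, c). The sketch below computes it; the helper name `chw_offset` is an assumption for illustration.

```python
def chw_offset(h, w, c, H, W, C):
    """Flat index of I(h, w, c) when each channel plane is stored contiguously."""
    return (c * H + h) * W + w


H, W, C = 3, 3, 3
order = sorted(
    ((h, w, c) for h in range(H) for w in range(W) for c in range(C)),
    key=lambda idx: chw_offset(*idx, H, W, C),
)
print(order[:6])                    # first six elements in memory order
print(chw_offset(0, 0, 1, H, W, C))  # offset where the green plane begins
```

Neighboring pixels within a channel are now adjacent in memory, while the three channel values of a single pixel are `H * W` elements apart (9 in this example).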

Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: Efficient Primitives for Deep Learning.” arXiv Preprint arXiv:1410.0759.

The advantage of a given layout becomes clear only relative to a specific backend. Convolutional layers process images by applying a shared set of filters across all channels. Depending on the kernel implementation, NCHW, NHWC, or a blocked internal layout may minimize scattered memory fetches, reduce memory latency, and improve data locality for the lowered matrix multiplications that implement convolution.

Because GPUs and TPUs rely on memory coalescing³⁴, a technique in which consecutive threads fetch contiguous memory addresses, the best layout is the one that makes the kernel’s actual thread-access pattern contiguous. For example, in NVIDIA GPU convolution paths, cuDNN may use or internally convert to NHWC/channels-last for Tensor Core kernels, while other kernels may still perform well with NCHW. The rule is backend- and operator-dependent rather than a universal CPU=NHWC, GPU=NCHW split.

³⁴ Memory Coalescing: The GPU hardware mechanism that fuses memory requests from threads in a warp into a single transaction when those threads access contiguous memory. Tensor layout affects coalescing, but the sign of the effect depends on the kernel: NCHW can be efficient for some convolution implementations, while NHWC/channels-last is often preferred for modern Tensor Core convolution and fusion paths. Poor layout choices can still create multi-fold performance gaps, but the correct fix is to match the layout to the backend rather than memorize one universal format.
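A toy model shows why the preferred layout depends on the kernel's access pattern. The sketch assumes a warp of 32 threads, 4-byte elements, and memory served in aligned 128-byte segments; these are simplifying assumptions, and the feature-map shape is hypothetical. It counts how many segments a warp touches when neighboring threads read neighboring channels versus neighboring pixels.

```python
def transactions(addresses, segment_bytes=128):
    """Count the aligned memory segments touched by a list of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def warp_addresses(stride_elems, threads=32, elem_bytes=4):
    """Byte address read by each thread when neighbors are stride_elems apart."""
    return [t * stride_elems * elem_bytes for t in range(threads)]


H, W, C = 56, 56, 64  # hypothetical feature-map shape

# Stride (in elements) between neighboring threads for two access patterns.
strides = {
    ("across channels", "NHWC"): 1,      # channels are adjacent in NHWC
    ("across channels", "NCHW"): H * W,  # channels are a full plane apart
    ("across width", "NHWC"): C,         # neighboring pixels are C apart
    ("across width", "NCHW"): 1,         # neighboring pixels are adjacent
}
cost = {key: transactions(warp_addresses(s)) for key, s in strides.items()}
for key, n in cost.items():
    print(key, "->", n, "transaction(s)")
```

In this model a kernel whose threads span channels needs one segment under NHWC but 32 under NCHW, and the result reverses for a kernel whose threads span the width dimension. Neither layout wins universally; the kernel's thread-to-data mapping decides.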

Despite its advantages for some accelerator kernels, channel-major layout can introduce inefficiencies when running on general-purpose CPUs or kernels optimized for channels-last access. Since CPUs optimize for sequential memory access and vectorized loops, the most efficient layout depends on the operation, framework convention, and library implementation.

Modern AI frameworks and compilers often transform tensor layouts dynamically depending on the execution environment, but this is not guaranteed for every model or operation³⁵. TensorFlow, XLA, cuDNN, and TensorRT may insert or choose layout conversions internally; PyTorch exposes explicit channels-last conversion and propagation paths. Developers still need to profile layout choices when convolution performance is material.

³⁵ NHWC vs. NCHW: NHWC lists dimensions as batch, height, width, channel; NCHW lists batch, channel, height, width. Physical memory format determines whether adjacent threads read adjacent addresses, so the performance effect is backend-specific. Layout-to-hardware mismatch is not a micro-optimization; it can create multi-fold performance gaps, especially when a conversion prevents Tensor Core convolution or fusion paths from being used.

Comparing row-major and channel-major layouts

Both row-major (NHWC) and channel-major (NCHW) layouts serve distinct purposes in machine learning workloads, with their efficiency largely determined by the hardware architecture, memory access patterns, and computational requirements. The choice of layout directly influences cache utilization, memory bandwidth efficiency, and processing throughput. Table 18 contrasts the performance trade-offs and hardware compatibility between these two approaches.

Table 18: Data Layout Strategies: Row-major (NHWC) and channel-major (NCHW) layouts optimize memory access patterns for different backend kernels. NHWC/channels-last often suits CPU vectorization and modern Tensor Core convolution/fusion paths, while NCHW remains common in PyTorch model code and many GPU kernels. Choosing the appropriate layout directly impacts performance by maximizing cache utilization and memory bandwidth efficiency.

Feature	Row-Major (NHWC)	Channel-Major (NCHW)
Memory Storage Order	Pixels are stored row-by-row, channel interleaved	All values for a given channel are stored together first
Best for	CPU loops, element-wise operations, many channels-last kernels	Many legacy GPU convolution paths and channel-first model code
Cache Efficiency	High cache locality for sequential row/channel-last access	Can improve coalescing for channel-first kernels
Convolution Performance	Often preferred by modern Tensor Core convolution/fusion paths	Efficient for many cuDNN and framework kernels
Memory Fetching	Good when kernels vectorize over adjacent channel data	Good when kernels process channel-major tiles
Default in Frameworks	Common TensorFlow convention; PyTorch channels-last option	Common PyTorch tensor shape convention

The decision to use row-major (NHWC) or channel-major (NCHW) layouts is not always made manually by developers. Instead, machine learning frameworks and AI compilers often determine the optimal layout dynamically based on the target hardware and operation type.

In practice, compilers and libraries such as XLA, cuDNN, TensorRT, and PyTorch compilation paths may perform layout transformations or propagate layout metadata. The result can be high throughput without manual tensor rewrites, but performance-sensitive deployments should still profile both layout and conversion overhead on the target hardware.

Kernel fusion

One of the most impactful optimization techniques in AI acceleration involves reducing the overhead of intermediate data movement between operations. Kernel fusion³⁶ transforms multiple separate computations into unified operations, dramatically improving memory efficiency and execution performance. The memory bottlenecks created by intermediate writes motivate kernel fusion, which eliminates these inefficiencies.

³⁶ Kernel Fusion: The “intermediate data movement” referenced occurs because each separate GPU function, or kernel, must write its result back to high-bandwidth memory (HBM) before the next one begins. By compiling multiple operations into a single kernel, fusion allows intermediate values to live in fast on-chip memory, completely avoiding the HBM write/read cycle. For memory-bound operations common in transformers, this reduction in memory traffic (often 2–3$\times$) translates directly to a proportional increase in performance.

Intermediate memory write

AI model performance is often constrained by memory bandwidth and intermediate memory writes rather than pure arithmetic operations. Every time an operation produces an intermediate result that must be written to memory and later read back, execution stalls from the data movement overhead.

Kernel fusion represents the critical bridge between the software optimization techniques introduced in Operator fusion and the memory bandwidth constraints analyzed in section 1.4.1. Many AI workloads introduce unnecessary intermediate memory writes, increasing memory bandwidth consumption and reducing execution efficiency (NVIDIA Corporation 2017).

NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. NVIDIA Whitepaper.

Listing 19 reveals how each operation becomes a separate kernel in a naïve execution model, forcing intermediate results to be written to memory and then read back for the next operation.

Each operation produces an intermediate tensor that must be written to memory and retrieved for the next operation. On large tensors, this overhead of moving data can outweigh the computational cost of the operations (Shazeer et al. 2018). Table 19 illustrates the memory overhead in a naïve execution model. While only the final result $Y$ is needed, storing multiple intermediate tensors creates unnecessary memory traffic and inefficient memory usage.

Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, et al. 2018. “Mesh-TensorFlow: Deep Learning for Supercomputers.” arXiv Preprint arXiv:1811.02084.

Listing 19: Naïve Execution: Each step writes intermediate results to memory before processing the next, leading to increased bandwidth usage and reduced efficiency.

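import torch
import torch.nn.functional as F

## Input tensor (uses the GPU when one is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(1024, 1024, device=device)

## Step-by-step execution (naïve approach)
X1 = torch.relu(X)  # Intermediate tensor stored in memory
## No running statistics are supplied, so batch statistics are used
X2 = F.batch_norm(X1, None, None, training=True)  # Another intermediate tensor stored
Y = 2.0 * X2 + 1.0  # Final result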

Table 19: Intermediate Tensor Storage: Naïve execution models require substantial memory to store intermediate tensors generated by each operation. For a $1024{\times}1024$ FP32 tensor (4 bytes per element), storing intermediate results (even when only the final output is needed) quadruples the total memory footprint from 4.2 MB to 16.8 MB. Minimizing intermediate data storage is essential for improving memory efficiency.

Tensor	Size (MB) for $1024{\times}1024$ Tensor
X	4.2 MB
X’	4.2 MB
X’’	4.2 MB
Y	4.2 MB
Total Memory	16.8 MB
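The figures in table 19 follow from simple arithmetic. The sketch below reproduces them under two assumptions that the table does not state explicitly: elements are 4-byte FP32 values (the default dtype of `torch.randn`), and megabytes are decimal. The helper name `tensor_mb` is introduced here purely for illustration.

```python
# Footprint of the tensors in the naive pipeline.
BYTES_PER_ELEMENT = 4  # assumption: FP32, the default dtype of torch.randn


def tensor_mb(rows, cols, bytes_per_element=BYTES_PER_ELEMENT):
    """Size of a dense rows x cols tensor in decimal megabytes."""
    return rows * cols * bytes_per_element / 1e6


per_tensor = tensor_mb(1024, 1024)
tensors = ["X", "X'", "X''", "Y"]  # input, two intermediates, output
total = per_tensor * len(tensors)

print(f"per tensor: {per_tensor:.1f} MB, total: {total:.1f} MB")
# prints "per tensor: 4.2 MB, total: 16.8 MB"
```

Switching `BYTES_PER_ELEMENT` to 2 models FP16 storage and halves every entry, which is one reason reduced precision and fusion are often applied together.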

Kernel fusion for memory efficiency

The two intermediate tensors, $X'$ and $X''$, waste both memory capacity and bandwidth, limiting scalability on AI accelerators where data movement dominates execution cost. Kernel fusion addresses this by merging multiple computation steps into a single optimized operation, which eliminates the need to store and reload intermediate tensors and so reduces the memory footprint and bandwidth consumption of machine learning workloads (Jia et al. 2018). Instead of executing each layer or element-wise operation separately, with each step writing its output to memory before the next begins, fusion propagates data directly between operations, keeping computations within high-speed registers or local memory.

Jia, Zhihao, Matei Zaharia, and Alex Aiken. 2018. “Beyond Data and Model Parallelism for Deep Neural Networks.” arXiv Preprint arXiv:1807.05358.

A common machine learning sequence might involve applying a nonlinear activation function (e.g., ReLU), followed by batch normalization, and then scaling the values for input to the next layer. In a naïve implementation, each of these steps generates an intermediate tensor, which is written to memory, read back, and then modified again: \[ \begin{aligned} X' &= \text{ReLU}(X) \\ X'' &= \text{BatchNorm}(X') \\ Y &= \alpha \cdot X'' + \beta \end{aligned} \]

With kernel fusion, these operations are combined into a single computation step, allowing the entire transformation to occur without generating unnecessary intermediate tensors: \[ Y = \alpha \cdot \text{BatchNorm}\big(\text{ReLU}(X)\big) + \beta \]
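A minimal sketch of what the fused form can look like, assuming batch normalization runs at inference time with fixed statistics so that it reduces to a scale and shift. The function names and the scalar, single-channel parameters (`mean`, `var`, `gamma`, `beta_bn`) are illustrative assumptions, and plain Python lists stand in for tensors so the example runs anywhere.

```python
import math


def naive_pipeline(x, mean, var, gamma, beta_bn, alpha, beta, eps=1e-5):
    """Three passes; each list models an intermediate tensor written to memory."""
    x1 = [max(v, 0.0) for v in x]  # X' = ReLU(X)
    inv_std = 1.0 / math.sqrt(var + eps)
    x2 = [gamma * (v - mean) * inv_std + beta_bn for v in x1]  # X'' = BatchNorm(X')
    return [alpha * v + beta for v in x2]  # Y = alpha * X'' + beta


def fused_pipeline(x, mean, var, gamma, beta_bn, alpha, beta, eps=1e-5):
    """One pass; BatchNorm and the affine step fold into one scale and shift."""
    inv_std = 1.0 / math.sqrt(var + eps)
    scale = alpha * gamma * inv_std
    shift = alpha * (beta_bn - gamma * mean * inv_std) + beta
    return [scale * max(v, 0.0) + shift for v in x]


x = [-1.5, -0.2, 0.0, 0.7, 2.3]
params = dict(mean=0.4, var=1.2, gamma=0.9, beta_bn=0.1, alpha=2.0, beta=1.0)
y_naive = naive_pipeline(x, **params)
y_fused = fused_pipeline(x, **params)
```

Both functions compute the same values up to floating-point rounding, but the fused version touches each input element once and never builds `x1` or `x2`. A real fused kernel would do the same work per thread in registers.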

Table 20 quantifies the effect of fusion on memory efficiency. By keeping intermediate results in registers or local memory rather than writing them to main memory, fusion cuts both stored state and memory traffic. This optimization is especially beneficial on highly parallel architectures like GPUs and TPUs, where fewer memory accesses translate directly into higher execution throughput. Compared to the naïve execution model, fused execution never stores the intermediate tensors, which lowers the total memory footprint.

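Table 20: Operation Fusion Benefits: Fused execution reduces memory usage by eliminating the need to store intermediate tensors, directly improving efficiency on bandwidth-constrained hardware like GPUs and TPUs. Memory consumption drops from 16.8 MB in naïve execution to 4.2 MB with fused operations, a 4$\times$ reduction, when the fused kernel can overwrite its input in place; if $X$ must be preserved alongside $Y$, the fused footprint is 8.4 MB, a 2$\times$ reduction.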

Execution Model	Intermediate Tensors Stored	Total Memory Usage (MB)
Naïve Execution	X’, X’’	16.8 MB
Fused Execution	None	4.2 MB
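Footprint is only half of the story; the other half is traffic. The sketch below estimates main-memory traffic under a simplified model in which every kernel reads one full tensor and writes one full tensor. That model assumes batch normalization uses precomputed statistics (a single pass), and the function name `traffic_mb` is hypothetical.

```python
TENSOR_MB = 1024 * 1024 * 4 / 1e6  # one FP32 1024x1024 tensor, about 4.2 MB


def traffic_mb(num_kernels, tensor_mb=TENSOR_MB):
    """Each kernel reads one tensor from main memory and writes one back."""
    reads_and_writes = 2 * num_kernels
    return reads_and_writes * tensor_mb


naive = traffic_mb(num_kernels=3)  # ReLU, BatchNorm, scale-and-shift
fused = traffic_mb(num_kernels=1)  # single fused kernel

print(f"naive: {naive:.1f} MB, fused: {fused:.1f} MB, ratio: {naive / fused:.0f}x")
# prints "naive: 25.2 MB, fused: 8.4 MB, ratio: 3x"
```

Under this model, fusing a chain of $k$ element-wise kernels cuts traffic by a factor of $k$, because only the first read and the last write remain.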

Performance benefits and constraints

Kernel fusion brings several key advantages that enhance memory efficiency and computation throughput. By reducing memory accesses, fused kernels ensure that intermediate values stay within registers instead of being repeatedly written to and read from memory. This significantly lowers memory traffic, which is one of the primary bottlenecks in machine learning workloads. GPUs and TPUs, in particular, benefit from kernel fusion because memory bandwidth, even from high-bandwidth memory (HBM), is scarce relative to compute throughput, so reducing memory transactions leads to better utilization of compute units (NVIDIA Corporation 2020).

However, not all operations can be fused arbitrarily. Element-wise operations, such as ReLU, simple arithmetic transformations, and batch normalization at inference time (where fixed statistics reduce it to a per-channel scale and shift), are ideal candidates for fusion because each output depends on a single input element. Batch normalization during training is harder to fuse, since it must first reduce over the batch to compute its statistics. Matrix multiplications and convolutions constrain fusion because they involve reductions and large data movement; they are often fused with element-wise epilogues such as bias, normalization variants, or activation, but cannot be freely fused with unrelated global operations.

Another major consideration is register pressure. Fusing multiple operations means all temporary values must be kept in registers rather than memory. While this eliminates redundant memory writes, it also increases register demand. If a fused kernel exceeds the available registers per thread, the compiler must spill excess values to local memory, which is thread-private storage backed by off-chip device memory (and cached in L1/L2), introducing additional latency and potentially negating the benefits of fusion. On GPUs, where occupancy (the number of threads resident on each multiprocessor at once) is limited by available registers, excessive fusion can reduce parallelism, leading to diminishing returns.
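The relationship between register demand and occupancy can be sketched with a simple bound. The per-multiprocessor limits below are assumed for illustration (they match published NVIDIA A100 figures), and the model ignores warp-granular register allocation as well as shared-memory and block-count limits, so treat it as an upper bound rather than a profiler substitute.

```python
# Illustrative per-SM limits (assumed; these match published NVIDIA A100 figures).
REGISTERS_PER_SM = 65_536
MAX_THREADS_PER_SM = 2_048


def resident_threads(registers_per_thread):
    """Upper bound on threads resident per SM when registers are the limiter."""
    by_registers = REGISTERS_PER_SM // registers_per_thread
    return min(MAX_THREADS_PER_SM, by_registers)


for regs in (32, 64, 128, 255):
    threads = resident_threads(regs)
    share = threads / MAX_THREADS_PER_SM
    print(f"{regs:>3} regs/thread -> {threads:>4} threads ({share:.0%} occupancy)")
```

Doubling per-thread register use from 32 to 64 halves the bound from 2,048 to 1,024 resident threads, which is the diminishing-returns effect described above: a fused kernel that saves memory traffic may still lose if it leaves too few threads to hide latency.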

Different AI accelerators and compilers handle fusion in distinct ways. NVIDIA GPUs, for example, favor warp-level parallelism, where element-wise fusion is straightforward (NVIDIA Corporation 2020). TPUs, on the other hand, prioritize systolic array execution for dense matrix operations (Jouppi et al. 2017). Compiler and inference stacks such as TVM, XLA, TensorRT, and MLIR apply graph rewrites, lowering passes, or engine-building heuristics to balance memory savings against execution constraints (Chen et al. 2018; Google 2025; NVIDIA 2024b; Lattner et al. 2020).

Despite its advantages, fusion is not always beneficial. Some AI frameworks allow developers to disable fusion selectively, especially when debugging performance issues or making frequent model modifications. The decision to fuse operations must consider trade-offs between memory efficiency, register usage, and hardware execution constraints to ensure that fusion leads to tangible performance improvements.

These fusion decisions are ultimately about data locality, which now ties together the chapter’s core data movement strategies.

Checkpoint 1.3: Data movement and kernel fusion

The dataflow building blocks are now in place: data locality, tensor layout, and kernel fusion. You should be able to reason through each one:

Locality choice: Given weight-stationary, output-stationary, and input-stationary dataflows, which tensor does each hold near the compute units, and how does that choice shift where the memory traffic lands?
Layout match: A kernel runs on a backend expecting NHWC but receives an NCHW tensor. Predict what happens to its memory access pattern and to delivered throughput.
Fusion eligibility: For a Conv2D, a BatchNorm, and a ReLU executed in sequence, decide which pairs are worth fusing and what determines whether fusing them pays off.
Remaining gap: Locality, layout, and fusion are all chosen well, yet the operation still stalls on memory. What is the missing transformation, and why do the other three not address it?

Memory-efficient tiling strategies

While modern AI accelerators offer high computational throughput, their performance is often limited by memory bandwidth rather than raw processing power. If data cannot be supplied to processing units fast enough, execution stalls occur, leading to wasted cycles and inefficient hardware utilization.

Tiling³⁷ mitigates this issue by restructuring computations into smaller, memory-friendly subproblems. The core insight is direct: if we cannot make memory faster, we can at least make fewer trips to it. Instead of processing entire matrices or tensors at once, which leads to excessive memory traffic, tiling partitions computations into smaller blocks (tiles) that fit within fast local memory (for example, caches, shared memory, or registers) (Lam et al. 1991).

³⁷ Tiling (Loop Blocking): This restructuring directly enables the “fewer trips to memory” insight by partitioning a computation into blocks that fit entirely within fast local cache. Instead of fetching an element from slow DRAM $\mathcal{O}(N)$ times in a naive matrix multiply, it is fetched once per tile and then reused while resident in the fast tier (Lam et al. 1991). This reduction in memory traffic is the primary source of the large gap between naive matrix multiplication and optimized GEMM routines.

Lam, Monica D., Edward E. Rothberg, and Michael E. Wolf. 1991. “The Cache Performance and Optimizations of Blocked Algorithms.” Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS-IV, 63–74. https://doi.org/10.1145/106972.106981.

Matrix multiplication, widely used in AI models, demonstrates inefficient memory access when implemented naively. Listing 20 shows how, without tiling, repeated memory accesses for the same data lead to unnecessary bandwidth consumption.

Listing 20: Naïve Matrix Multiplication: Direct implementation without tiling requires $\mathcal{O}(N^3)$ memory accesses for $N{\times}N$ matrices, repeatedly fetching the same elements from slow DRAM memory and limiting performance to a fraction of theoretical peak throughput.

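import numpy as np

N = 64  # Small size so the pure-Python loops finish quickly
A, B = np.random.rand(N, N), np.random.rand(N, N)
C = np.zeros((N, N))

for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i, j] += A[i, k] * B[k, j]  # Repeatedly fetches A[i, k] and B[k, j]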

Each iteration requires loading elements from matrices $A$ and $B$ multiple times from memory, causing excessive data movement. As the size of the matrices increases, the memory bottleneck worsens, limiting performance.

Tiling addresses this problem by ensuring that smaller portions of matrices are loaded into fast memory, reused efficiently, and only written back to main memory when necessary. This technique is especially important in AI accelerators, where memory accesses dominate execution time. Figure 14 labels the product $C = AB$ by its three dimensions: $M$ rows and $K$ columns in $A$, $K$ rows and $N$ columns in $B$, and $M{\times}N$ in the output $C$. Each highlighted tile is the working set that fits in fast memory at one moment: a green $M_{\text{tile}}{\times}K_{\text{tile}}$ row band of $A$ multiplies a pink $K_{\text{tile}}{\times}N_{\text{tile}}$ column band of $B$ to accumulate one blue $M_{\text{tile}}{\times}N_{\text{tile}}$ output block ($\text{Block}_{m,n}$) of $C$. The key insight is that we process all computations for each tile before moving to the next, rather than bouncing between tiles and repeatedly paying the DRAM access penalty.

Figure 14: **Matrix Tiling**: Partitioning large matrices into smaller tiles optimizes data reuse and reduces memory access overhead during computation. This technique improves performance on AI accelerators by enabling efficient loading and processing of data in fast memory, minimizing transfers from slower main memory.

Tiling fundamentals

Tiling is based on a straightforward principle: instead of operating on an entire data structure at once, computations are divided into smaller tiles that fit within the available fast memory. By structuring execution around these tiles, data reuse is maximized, reducing redundant memory accesses and improving overall efficiency.

Consider matrix multiplication, a key operation in machine learning workloads. The operation computes $C = A \times B$ where each element $C[i,j] = \sum_{k} A[i,k] \times B[k,j]$. The naive implementation shown earlier in listing 20 demonstrates the core problem: every iteration of the innermost loop fetches elements from matrices $A$ and $B$ from memory, performs a multiplication, and updates matrix $C$. Because matrices are large, the processor repeatedly reloads the same values from memory, even though they were just used in previous computations.

This data movement overhead is expensive: fetching from DRAM is 100–1,000$\times$ slower than accessing on-chip cache or registers (Horowitz 2014; Sze et al. 2017). The solution is tiling.

Performance benefits of tiling

Instead of computing one element at a time and constantly moving data in and out of slow memory, tiling processes submatrices (tiles) at a time, keeping frequently used values in fast memory. The idea is to divide the matrices into smaller blocks that fit within the processor’s cache or shared memory, ensuring that once a block is loaded, it is reused multiple times before moving to the next one. Listing 21 demonstrates cache-friendly loop blocking: the loop bounds partition the matrices into tiles, and the hardware cache hierarchy keeps recently used tile data close to the compute units when access order has locality.

Listing 21: Cache-Blocked Matrix Multiplication: This high-level loop blocking approach divides matrices into smaller index ranges so hardware caches can reuse data within each tile, improving computational efficiency without explicit scratchpad loads.

TILE_SIZE = 32  # Choose a tile size based on hardware constraints

# Cache blocking: partition data via loop bounds.
# Loads are implicit through the hardware cache hierarchy.
# Assumes A, B, and C are N x N arrays; min() handles N not divisible by TILE_SIZE.
for i in range(0, N, TILE_SIZE):
    for j in range(0, N, TILE_SIZE):
        for k in range(0, N, TILE_SIZE):
            # Accumulate this k-block's contribution to the C tile at (i, j)
            for ii in range(i, min(i + TILE_SIZE, N)):
                for jj in range(j, min(j + TILE_SIZE, N)):
                    for kk in range(k, min(k + TILE_SIZE, N)):
                        C[ii, jj] += A[ii, kk] * B[kk, jj]

This restructuring significantly improves performance through three reinforcing effects. Memory reuse improves because the approach visits a small tile repeatedly while it is likely to remain in cache before moving on to the next tile, rather than fetching elements from $A$ and $B$ repeatedly from slow memory, which minimizes redundant memory accesses. Memory bandwidth usage drops as a direct consequence: since each tile is used multiple times before being evicted, most required data is available in L1/L2 cache or shared memory rather than DRAM, so traffic falls and execution speeds up. Compute efficiency rises in turn, because processors spend less time waiting for data and more time performing useful work; in architectures like GPUs and TPUs, where thousands of parallel processing units operate simultaneously, tiling keeps data read and processed in a structured manner that avoids unnecessary stalls.

This technique is particularly effective in AI accelerators, where machine learning workloads consist of large matrix multiplications and tensor transformations. Without tiling, these workloads quickly become memory bound, meaning performance is constrained by how fast data can be retrieved rather than by the raw computational power of the processor.
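The bandwidth saving can be estimated with an idealized traffic model. The sketch assumes that a loaded tile stays resident for its entire tile-level step, counts only loads of $A$ and $B$ (not updates to $C$), and ignores cache-line effects; the function names are hypothetical.

```python
def naive_loads(n):
    """Element loads from A and B when nothing is reused: 2 per inner iteration."""
    return 2 * n**3


def tiled_loads(n, tile):
    """Element loads when each tile pair is fetched once per tile-level step.

    There are (n / tile)**3 tile-level steps, and each loads one tile of A and
    one tile of B (tile * tile elements apiece). Assumes tile divides n.
    """
    steps = (n // tile) ** 3
    return steps * 2 * tile * tile


N, TILE = 1024, 32
reduction = naive_loads(N) / tiled_loads(N, TILE)
print(f"{reduction:.0f}x fewer element loads")  # prints "32x fewer element loads"
```

In this model the reduction factor equals the tile edge length, which is why larger tiles help until they no longer fit in fast memory.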

Tiling methods

The general principle of tiling stays the same across variants: partition a large computation into smaller subproblems to improve memory reuse. How that principle is applied depends on the structure of the computation and on hardware constraints. The two primary strategies are spatial tiling and temporal tiling. They optimize different aspects of computation and memory access and, in practice, are often combined to achieve the best performance.

Spatial tiling partitions data structures into smaller blocks that fit within fast memory. The tiled matrix multiplication in listing 21 demonstrates cache-friendly loop blocking: the code does not issue explicit scratchpad loads, but the tile-shaped access pattern lets hardware caches reuse nearby values before they are evicted. This strategy is particularly beneficial for large tensors that exceed fast memory capacity—by breaking computations into smaller tiles, data movement between memory levels is minimized, keeping operations localized within cache hierarchies.

Temporal tiling complements spatial tiling by explicitly staging data in shared memory or registers and reorganizing the computation order around that staged data. Many ML workloads access the same data repeatedly across iterations—without temporal tiling, this results in redundant memory fetches. Temporal tiling restructures the computation to ensure that frequently used data stays in fast memory for as long as possible before the next computation begins.

A classic example where temporal tiling is beneficial is convolutional operations, where the same set of weights is applied to multiple input regions. Without temporal tiling, these weights might be loaded from memory multiple times for each computation. With temporal tiling, the computation is reordered so that the weights remain in fast memory across multiple inputs, reducing unnecessary memory fetches and improving overall efficiency. Listing 22 illustrates explicit tile staging: the code loads blocks of $A$ and $B$ into temporary fast storage, then reuses them across multiple inner-loop operations.

Listing 22: Explicit Tile Staging: Reduces redundant memory accesses by loading tiles into fast temporary storage and reusing them across multiple inner-loop operations.

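# Explicit tile staging: load data into fast
# temporary storage before the inner loops.
# Assumes A, B, and C are N x N NumPy arrays and TILE_SIZE is already set.
for i in range(0, N, TILE_SIZE):
    for j in range(0, N, TILE_SIZE):
        for k in range(0, N, TILE_SIZE):
            # Explicitly load tiles into fast memory; .copy() stands in for
            # the scratchpad load, since a plain NumPy slice is only a view
            A_tile = A[i:i+TILE_SIZE, k:k+TILE_SIZE].copy()
            B_tile = B[k:k+TILE_SIZE, j:j+TILE_SIZE].copy()

            # Reuse loaded tiles for all inner iterations
            # (tile shapes shrink at the edges when N is not a multiple)
            for ii in range(A_tile.shape[0]):
                for jj in range(B_tile.shape[1]):
                    for kk in range(A_tile.shape[1]):
                        C[i+ii, j+jj] += A_tile[ii, kk] * B_tile[kk, jj]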

Explicit tile staging improves performance by ensuring that the data loaded into fast memory is used multiple times before being evicted. In this implementation, small tiles of matrices $A$ and $B$ are explicitly loaded into temporary storage before performing computations, reducing memory fetch overhead. This restructuring allows the computation to process an entire tile before moving to the next, thereby reducing the number of times data must be loaded from slower memory.

This technique is particularly useful in workloads where certain values are used repeatedly, such as convolutions, recurrent neural networks (RNNs), and self-attention mechanisms in transformers. By applying temporal tiling, AI accelerators can significantly reduce memory stalls and improve execution throughput.

Tiling challenges and trade-offs

Tiling improves performance only when the tile matches the locality budget of the hardware. If the tile is too small, memory fetches still dominate execution time because reuse is too limited. If the tile is too large, it exceeds fast memory and causes cache thrashing or scratchpad spills. Selecting the right tile size therefore directly determines computational efficiency and memory bandwidth usage.
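A first-order way to pick a tile size is to ask how large a square tile can be while its working set still fits in fast memory. The sketch below assumes three resident tiles (one each of $A$, $B$, and $C$) and rounds down to a power of two; the 48 KiB budget and the function name `max_square_tile` are hypothetical, not properties of any particular device.

```python
import math


def max_square_tile(fast_memory_bytes, bytes_per_element, resident_tiles=3):
    """Largest power-of-two T such that resident_tiles T x T tiles fit in fast memory."""
    limit = math.isqrt(fast_memory_bytes // (resident_tiles * bytes_per_element))
    tile = 1
    while tile * 2 <= limit:
        tile *= 2
    return tile


# Hypothetical budget: 48 KiB of scratchpad, FP32 elements, tiles of A, B, and C resident.
print(max_square_tile(48 * 1024, bytes_per_element=4))  # prints 64
```

With 48 KiB and FP32, three $64{\times}64$ tiles occupy exactly 49,152 bytes, so 64 is the largest edge that fits. Production autotuners refine this kind of estimate by measuring, since register limits and bank conflicts also shape the best choice.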

The tile choice also controls load balance. In architectures such as GPUs and TPUs, computations execute in parallel across thousands of processing units. If tiles are not evenly distributed, some units remain idle while others are overloaded, leading to suboptimal utilization of computational resources. Effective tile scheduling keeps parallel execution balanced and efficient.
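The load-balance effect can be illustrated with a simple wave model. It assumes that each processing unit handles one output tile at a time, that all tiles cost the same, and that tiles are issued in full waves; the unit count of 108 is borrowed from the NVIDIA A100's SM count purely as an example, and `tile_utilization` is a hypothetical helper.

```python
import math


def tile_utilization(m, n, tile, num_units):
    """Fraction of unit-slots doing useful work when output tiles run in waves."""
    tiles = math.ceil(m / tile) * math.ceil(n / tile)
    waves = math.ceil(tiles / num_units)
    return tiles, waves, tiles / (waves * num_units)


# Hypothetical device with 108 parallel units (the SM count of an NVIDIA A100).
for tile in (256, 128, 64):
    tiles, waves, util = tile_utilization(1024, 1024, tile, num_units=108)
    print(f"tile={tile:>3}: {tiles:>3} tiles, {waves} wave(s), {util:.0%} utilization")
```

For a $1024{\times}1024$ output, $256{\times}256$ tiles yield only 16 tiles and leave most units idle, while $64{\times}64$ tiles yield 256 tiles across three waves. Smaller tiles balance load better but offer less reuse per tile, which is the tension the surrounding text describes.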

Data movement remains the limiting cost even after tiling. Although tiling reduces the number of slow memory accesses, transferring tiles between hierarchy levels still incurs latency and energy cost, especially when data is evicted from cache or scratchpad and must be fetched again from DRAM. Efficient memory prefetching and scheduling strategies minimize this residual movement and ensure that data is available when needed.
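One common prefetching scheme is double buffering, in which the next tile loads while the current tile computes. The timeline model below is a sketch under idealized assumptions: perfect overlap, constant per-tile costs, and exactly two buffers. The per-tile costs are made-up numbers in microseconds.

```python
def serial_time(num_tiles, load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return num_tiles * (load + compute)


def double_buffered_time(num_tiles, load, compute):
    """Tile i+1 loads while tile i computes (two buffers, ideal overlap)."""
    if num_tiles == 0:
        return 0
    # The first load and the last compute are exposed; every step in
    # between is paced by whichever of load or compute is slower.
    return load + (num_tiles - 1) * max(load, compute) + compute


# Hypothetical per-tile costs in microseconds.
print(serial_time(100, load=4, compute=5))           # prints 900
print(double_buffered_time(100, load=4, compute=5))  # prints 504
```

When compute per tile exceeds load time, nearly all transfer latency is hidden; when load time dominates, the kernel remains memory bound and overlap only removes the compute portion from the critical path.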

Hybrid tiling combines spatial and temporal strategies when neither dimension alone captures the workload’s reuse pattern. Some AI accelerators use spatial tiling for matrix multiplications while employing temporal tiling for weight reuse in convolutional layers, dynamically adjusting tile sizes or reordering computations based on real-time execution conditions.

Register blocking, double buffering, and hierarchical tiling extend the same locality principle at smaller and larger memory tiers. AI compilers and runtime systems such as TensorFlow XLA, TVM, and MLIR automatically select these tiling strategies based on hardware constraints, enabling fine-tuned performance optimization without manual intervention. Table 21 provides a comparative overview of spatial, temporal, and hybrid tiling approaches, highlighting their respective benefits and trade-offs.

Table 21: Tiling Strategies: Spatial, temporal, and hybrid tiling optimize memory access patterns for improved performance. Spatial tiling maximizes data reuse within fast memory, temporal tiling exploits loop structure for reduced accesses, and hybrid tiling combines both approaches. AI compilers and runtime systems use these techniques to automatically optimize model execution on diverse hardware.

Aspect	Spatial Tiling (Data Tiling)	Temporal Tiling (Loop Blocking)	Hybrid Tiling
Primary Goal	Reduce memory accesses by keeping data in fast memory longer	Increase data reuse across loop iterations	Adapt dynamically to workload constraints
Optimization Focus	Partitioning data structures into smaller, memory-friendly blocks	Reordering computations to maximize reuse before eviction	Balancing spatial and temporal reuse strategies
Memory Usage	Improves cache locality and reduces DRAM access	Keeps frequently used data in fast memory for multiple iterations	Minimizes data movement while ensuring high reuse
Common Use Cases	Matrix multiplications, CNNs, self-attention in transformers	Convolutions, recurrent neural networks (RNNs), iterative computations	AI accelerators with hierarchical memory, mixed workloads
Performance Gains	Reduced memory bandwidth requirements, better cache utilization	Lower memory fetch latency, improved data locality	Maximized efficiency across multiple hardware types
Challenges	Requires careful tile size selection, inefficient for workloads with minimal spatial reuse	Can increase register pressure, requires loop restructuring	Complexity in tuning tile size and execution order dynamically
Best When	Data is large and needs to be partitioned for efficient processing	The same data is accessed multiple times across iterations	Both data partitioning and iteration-based reuse are important

As machine learning models grow in size and complexity, tiling remains a critical tool for improving hardware efficiency, helping AI accelerators operate near their practical potential. While manual tiling strategies can provide substantial benefits, compilers and hardware-aware optimization techniques further enhance performance by automatically selecting effective tiling strategies for a given workload.

Applying mapping strategies to neural networks

While these foundational mapping techniques apply broadly, their effectiveness varies based on the computational structure, data access patterns, and parallelization opportunities of different neural network architectures. Each architecture imposes distinct constraints on data movement, memory hierarchy, and computation scheduling, requiring tailored mapping strategies to optimize performance.

A structured approach to mapping is required to address the combinatorial explosion of choices that arise when assigning computations to AI accelerators. Rather than treating each model as a separate optimization problem, we recognize that the same principles apply across different architectures; only their priority shifts based on workload characteristics. The goal is to systematically select and apply mapping strategies that maximize efficiency for different types of machine learning models.

These principles apply to three representative AI workloads, each characterized by distinct computational demands. CNNs benefit from spatial data reuse, making weight-stationary execution and the application of tiling techniques especially effective. In contrast, transformers are inherently memory bound and rely on strategies such as efficient KV-cache management, fused attention mechanisms, and highly parallel execution to mitigate memory traffic. MLPs, which involve substantial matrix multiplication operations, demand the use of structured tiling, optimized weight layouts, and memory-aware execution to enhance overall performance.

Despite their differences, each of these models follows a common set of mapping principles, with variations in how optimizations are prioritized. Table 22 summarizes the suitability of different optimization strategies for CNNs, transformers, and MLPs.

Table 22: Architecture-Specific Mapping Strategies: Each neural network architecture benefits from different optimization priorities based on its computational and memory characteristics.

Optimization Technique	CNNs	Transformers	MLPs	Rationale
Dataflow Strategy	Weight Stationary	Input Stationary	Weight Stationary	CNNs reuse filters across spatial locations; transformers reuse activations (KV-cache) under the input-stationary strategy introduced earlier; MLPs reuse weights across batches.
Memory-Aware Tensor Layouts	Backend-dependent (often NCHW for cuDNN convolutions, channels-last on Tensor Cores)	Backend-dependent (row-major activations typical)	Row-major typical	Layout choice depends on backend kernel path and precision mode (see the layout discussion above); the entries name common defaults, not universal prescriptions.
Kernel Fusion	Convolution + Activation	Fused Attention	GEMM Fusion	CNNs optimize convolution+activation fusion; Transformers fuse attention mechanisms; MLPs benefit from fused matrix multiplications.
Tiling for Memory Efficiency	Spatial Tiling	Temporal Tiling	Blocked Tiling	CNNs tile along spatial dimensions; Transformers use loop blocking to improve sequence memory efficiency; MLPs use blocked tiling for large matrix multiplications.

With the mapping strategies summarized in table 22, we now examine why each architecture maps the way it does. The table captures the specific strategy choices; the following subsections explain the architectural insight behind each one.

Convolutional neural networks (ResNet-50)

For ResNet-50 and similar CNNs, the defining characteristic from a hardware mapping perspective is spatial weight reuse. A single small filter is applied to every spatial location in the input feature map, meaning the same weights participate in hundreds or thousands of multiply-accumulate operations before the next filter is needed. This reuse pattern makes weight stationary execution the natural choice: pinning filter weights in fast on-chip memory and streaming activations through the compute units avoids repeatedly fetching the same weights from slower external memory. The result is high arithmetic intensity with modest bandwidth demand, which is precisely the profile that tensor cores and systolic arrays are designed to exploit.
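The amount of reuse is easy to estimate: in a standard convolution, each filter weight contributes once to every output position. The sketch below uses the output feature-map sizes of the first and last ResNet-50 residual stages at a $224{\times}224$ input; the function name is hypothetical.

```python
def weight_reuse(out_height, out_width, batch=1):
    """Times each filter weight is used: once per output position per image."""
    return out_height * out_width * batch


# Output feature-map sizes for ResNet-50 stages at 224x224 input resolution.
print(weight_reuse(56, 56))           # first residual stage: 3136 uses per weight
print(weight_reuse(7, 7))             # last residual stage: 49 uses per weight
print(weight_reuse(7, 7, batch=32))   # batching restores reuse: 1568
```

Reuse is highest in the early, high-resolution layers and falls as feature maps shrink, so later layers lean more on batching to keep weights busy once they are on chip.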

This spatial regularity also enables aggressive fusion and tiling. Because convolution, batch normalization, and activation are applied at every spatial position in lockstep, compilers can fuse the sequence into kernels that avoid unnecessary intermediate writes. Spatial tiling then partitions the feature map into subregions sized to fit within on-chip SRAM, so the fused kernel processes each tile entirely from fast memory before moving to the next. The combination of weight stationarity, kernel fusion, and spatial tiling is what makes CNNs among the most hardware-friendly architectures.

Transformer architectures (GPT-2/Llama)

Where CNNs are defined by weight reuse, transformers are defined by the memory pressure of the key-value (KV) cache. During attention computation, every query vector must access stored key and value pairs across the entire sequence length. As sequences grow, the full KV cache usually lives in HBM or device DRAM, while attention kernels tile active blocks through SRAM, registers, or shared memory. This access pattern motivates input-stationary (activation-stationary) execution: keep the currently used KV tiles close to compute while streaming queries through them, rather than repeatedly materializing large attention intermediates in external memory.
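To see why the KV cache ends up in HBM, it helps to compute its size. The sketch below assumes standard multi-head attention (one key and one value vector of width `hidden_size` per token per layer) and FP16 storage; it does not model grouped-query attention or cache quantization. The configuration is GPT-2 small.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_element=2, batch=1):
    """Keys plus values: one hidden_size vector each, per token, per layer."""
    return 2 * num_layers * hidden_size * seq_len * bytes_per_element * batch


# GPT-2 small: 12 layers, hidden size 768, context length 1024; FP16 assumed.
size = kv_cache_bytes(num_layers=12, hidden_size=768, seq_len=1024)
print(f"{size / 1e6:.1f} MB per sequence")  # prints "37.7 MB per sequence"
```

Even this small model needs tens of megabytes per sequence, which exceeds typical on-chip SRAM, and the total grows linearly with sequence length, layer count, and batch size.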

The memory-bound nature of attention also explains why fused attention kernels, such as FlashAttention (Dao et al. 2022), deliver outsized performance gains for transformers. By fusing the query-key dot product, softmax normalization, and value-weighted summation into a single kernel that tiles along the sequence dimension, these implementations avoid materializing the full attention matrix in main memory. This temporal tiling approach processes sequence blocks that fit within on-chip SRAM, substantially reducing HBM traffic while preserving the $\mathcal{O}(S^2)$ attention computation. For transformers, the mapping strategy is primarily an exercise in memory management rather than compute scheduling.
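The reason sequence-dimension tiling is possible at all is that softmax can be computed incrementally. The sketch below illustrates this online-softmax idea for a single query in plain Python. It is a simplified illustration of the principle, not the FlashAttention kernel itself: it omits the $1/\sqrt{d}$ scaling, masking, and all GPU-specific staging, and the function names are hypothetical.

```python
import math


def attention_full(q, keys, values):
    """Reference: materializes every score for the query before normalizing."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def attention_blocked(q, keys, values, block=2):
    """Streams key/value blocks, keeping only a running max, sum, and output."""
    dim = len(values[0])
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # shrinks earlier partial results
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[i] for w, v in zip(weights, v_blk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out_full = attention_full(q, keys, values)
out_blocked = attention_blocked(q, keys, values, block=2)
```

The blocked version never holds more than `block` scores at once, yet matches the reference up to rounding. That bounded working set is what lets a fused kernel keep everything in on-chip memory regardless of sequence length.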

Multilayer perceptrons (DLRM)

MLPs present the most straightforward mapping problem because their computation reduces largely to dense General Matrix Multiplication (GEMM)³⁸. Each fully connected layer multiplies an activation matrix by a weight matrix. The weight matrix is fixed across all samples in a batch, so weight-stationary execution allows the accelerator to load weights once and reuse them across every batch element, with reuse scaling linearly with batch size. This makes MLPs highly sensitive to batching: a batch size of one uses each loaded weight only once, while large batches push arithmetic intensity into the compute-bound regime where accelerators operate most efficiently. The section General matrix multiply (GEMM) derives the linear scaling that governs this sensitivity: a square $n{\times}n$ GEMM in FP16 performs $2n^3$ FLOPs while moving $3n^2$ two-byte elements, for an arithmetic intensity of only $n/3$ FLOP/byte, so a small $64{\times}64$ multiply (about 21 FLOP/byte) falls far below the accelerator’s ridge point.
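The batch sensitivity can be checked numerically. The sketch below computes arithmetic intensity for a general GEMM under the idealized assumption that each matrix crosses the memory interface exactly once, with FP16 elements; the $4096{\times}4096$ layer size is an arbitrary illustrative choice and `gemm_intensity` is a hypothetical helper.

```python
def gemm_intensity(m, k, n, bytes_per_element=2):
    """FLOP per byte for C[m, n] = A[m, k] @ B[k, n], each matrix moved once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


print(gemm_intensity(64, 64, 64))  # square case: n / 3, about 21.3 FLOP/byte

# A 4096 x 4096 fully connected layer at several batch sizes.
for batch in (1, 32, 512):
    print(batch, round(gemm_intensity(batch, 4096, 4096), 1))
```

At batch size 1 the layer performs roughly one FLOP per byte moved, because the weight matrix dominates traffic and is used only once. Intensity then rises almost linearly with batch size until activation traffic becomes comparable to weight traffic.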

³⁸ GEMM (General Matrix Multiplication): The operation $C = \alpha AB + \beta C$ is the dense linear-algebra primitive that many deep-learning layers lower to. Optimized GEMM libraries such as cuBLAS and oneDNN use register blocking, vectorization, and hierarchical tiling to approach hardware limits on favorable shapes (NVIDIA 2024a; Intel Corporation 2021b). Modern AI accelerators are heavily specialized for GEMM-like tiles: tensor cores, systolic arrays, and matrix extensions all exist to accelerate this primitive, which is why GEMM performance is an important predictor of end-to-end throughput across architectures.

NVIDIA. 2024a. cuBLAS: CUDA Basic Linear Algebra Subprograms.

Intel Corporation. 2021b. oneDNN: Intel’s Deep Learning Neural Network Library.

Because MLP layers are typically followed by activation functions and bias additions, GEMM fusion combines these steps into a single kernel, avoiding intermediate memory writes. Blocked tiling partitions the large matrix multiplications into sub-blocks sized for the accelerator’s shared memory, ensuring high cache utilization throughout computation. The simplicity of the MLP mapping, dominated by a single primitive with predictable access patterns, is precisely why hardware vendors optimize GEMM libraries so aggressively: gains in GEMM performance translate directly to MLP throughput.

Hybrid mapping strategies

The preceding subsections treat each architecture in isolation, but real models rarely consist of a single layer type. A vision transformer, for example, combines a patch embedding stage, self-attention layers, and MLP blocks (Dosovitskiy et al. 2021). Those layers create different reuse patterns: the embedding stage can benefit from weight-stationary mapping, attention emphasizes activation movement and tiling, and MLP blocks demand blocked GEMM tiling and fusion. No single dataflow strategy is optimal across all these layers, so hardware mapping becomes hybrid and layer-specific.

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations.

Hybrid mapping addresses this heterogeneity by allowing the accelerator to switch strategies at layer boundaries. Each layer presents a different balance of compute intensity, data reuse, and memory access pattern, and the optimal mapping must shift accordingly (Sze et al. 2017). Rather than committing to one dataflow for the entire model, hybrid approaches select weight-stationary execution for layers with high weight reuse, input-stationary (activation-stationary) execution for attention layers with large KV caches, and output-stationary execution for layers where minimizing write traffic matters most.
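As a toy illustration of layer-specific selection, the sketch below maps each layer of a vision-transformer-style block to a dataflow using a fixed rule table that mirrors the paragraph above. Everything here is hypothetical: the `Layer` type, the layer names, and the rules are assumptions for illustration, and real compilers choose mappings with cost models rather than a lookup on layer type.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str  # "conv", "linear", "attention", or anything else


def choose_dataflow(layer):
    """Toy rule table mirroring the text; real compilers use cost models."""
    if layer.kind in ("conv", "linear"):
        return "weight-stationary"   # high weight reuse
    if layer.kind == "attention":
        return "input-stationary"    # keep active KV tiles near compute
    return "output-stationary"       # otherwise accumulate outputs in place


vit_block = [
    Layer("patch_embed", "conv"),
    Layer("self_attn", "attention"),
    Layer("mlp_fc1", "linear"),
    Layer("mlp_fc2", "linear"),
]
plan = {layer.name: choose_dataflow(layer) for layer in vit_block}
for name, dataflow in plan.items():
    print(f"{name:>12}: {dataflow}")
```

The point of the sketch is structural: the mapping decision is made per layer, so a single model can use several dataflows in one forward pass.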

Modern accelerators provide the architectural features needed to realize hybrid mapping in practice. TPU-style systolic arrays, NVIDIA GPUs, and tile-based accelerator designs expose different combinations of local memory, tensor layouts, fusion, and scheduling controls, allowing compilers and runtimes to choose layer-specific strategies rather than one global dataflow for the whole model (Jouppi et al. 2023; NVIDIA Corporation 2020; Chen et al. 2018). These implementations require programmable memory hierarchies, efficient interconnects, and specialized execution pipelines, reinforcing the hardware-software co-design principle.

However, hybrid mapping remains a design-time optimization. In production workloads, execution conditions change dynamically due to varying input sizes, memory contention, and hardware resource availability. Machine learning compilers and runtime systems extend these static mapping choices by introducing dynamic scheduling, memory optimizations, and automatic tuning, ensuring that deep learning workloads operate efficiently across diverse accelerators and deployment environments.

The mapping strategies and dataflow optimizations examined in preceding sections represent the “what” of efficient execution: which data to keep local, how to tile computations, and which parallelization strategies to employ. Determining optimal configurations for specific hardware and workloads, however, requires systematic automation. Machine learning compilers address this gap by transforming abstract mapping principles into concrete execution plans tailored to target accelerators.

Self-Check: Question

A CNN uses 3×3 filters reused across thousands of spatial positions per input image. Which stationary dataflow best matches this reuse profile?
1. Weight-stationary, because filter weights are read once at the array edge and reused across every spatial application while activations stream through
2. Output-stationary, because CNNs generate more output positions than the hardware can accumulate in place
3. Input-stationary, because filters are too large to fit in local memory and must stream in from DRAM
4. No stationary choice matters because dataflow has no effect on memory traffic once arithmetic is executed
A GPU team is deciding between NHWC and NCHW tensor layouts for a convolution-heavy inference pipeline. Why is the channels-last layout (NHWC) typically preferred for the forward pass on most modern GPUs?
1. NHWC places values from the same spatial position but different channels at contiguous addresses, so hardware processing adjacent channels for one pixel can read memory efficiently in a single block rather than scattered reads
2. NHWC maximizes branch-prediction efficiency, which dominates convolution performance on GPUs
3. NHWC is the universally fastest layout on CPUs, so GPU engineers inherit the convention
4. NHWC eliminates the need for any tensor layout transformations in any framework
Fusing Conv2D, BatchNorm, and ReLU into a single kernel does not change the total floating-point operation count. Explain quantitatively why fusion can still double or triple throughput on a bandwidth-bound inference segment.
What is the main purpose of tiling in accelerator execution?
1. To partition a large tensor computation into blocks sized for fast local memory so each block’s operands can be reused many times before eviction, raising arithmetic intensity by reducing redundant traffic from slower memory tiers
2. To change the numerical order of operations enough that the final result becomes slightly more accurate
3. To eliminate parallelism by serializing work into smaller sequential pieces that are easier to schedule
4. To guarantee that every operation becomes compute bound regardless of its underlying arithmetic intensity
True or False: A single stationary dataflow is optimal across CNNs, transformers, and MLPs, so hybrid mapping is mostly implementation overhead without real performance benefit.
A compiler maps a transformer composed of attention blocks and MLP layers onto an A100. Which mapping strategy best matches the section’s argument?
1. Use memory-aware attention kernels (fused softmax and masking with on-chip tiling) for attention blocks while using blocked GEMM-style tiling with operator fusion for MLP layers—matching each layer class to the dataflow that serves its reuse pattern
2. Use the same weight-stationary scheme for every layer because consistency across the graph is more important than per-layer locality
3. Prefer row-major layouts uniformly because they are cache-friendly on CPUs, regardless of the target accelerator
4. Disable fusion and tiling globally so runtime decisions remain maximally flexible


Compiler Support

Machine learning compilers automate the translation of dataflow strategies into executable code, addressing a critical challenge: the mapping decisions analyzed earlier must be instantiated differently for each hardware target. The gap between “knowing what optimizations exist” and “applying them correctly” is vast: a single convolution can be implemented with dozens of valid tiling strategies, kernel variants, and memory layouts, most of which perform poorly on any given hardware. Compilers navigate this complexity systematically. Compiling ResNet-50 for GPU inference exemplifies the process:

Graph optimization fuses repeated Conv2D-BatchNorm-ReLU patterns into fewer kernels, eliminating intermediate writes that would otherwise consume bandwidth
Kernel selection chooses Tensor Core implementations for compatible convolutions, exploiting the high arithmetic intensity calculated in the Roofline analysis
Memory planning determines whether intermediate activations fit in accelerator memory and whether buffers can be reused safely
Computation scheduling overlaps memory transfers with computation when dependencies allow, hiding part of the transfer latency

In the worked example, inference time drops from approximately 47 ms (naive execution) to approximately 8 ms (optimized), roughly a 5.9× improvement from compilation alone, before any algorithmic changes to the model. The values are illustrative, but the mechanism is the same one used by production compiler and inference stacks: graph rewrites, kernel selection, memory planning, and scheduling translate dataflow strategies such as fusion (section 1.7.1.3) and tiling (section 1.7.1.4) into real performance (Chen et al. 2018; NVIDIA 2024b).
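The bandwidth argument behind the fusion step can be made concrete with a small traffic count. The sketch below is illustrative only: the function name, the FP16 feature-map shape, and the simplification that only activation traffic after the convolution is counted are assumptions, not measurements from the worked example.

```python
def post_conv_traffic_bytes(activation_bytes, num_epilogue_ops, fused):
    """Off-chip bytes moved for a conv output plus its elementwise epilogue.

    Simplifying assumption: the convolution's own input and weight reads
    are identical in both cases and are therefore left out.
    """
    if fused:
        # Epilogue ops run on values still held on chip; only the
        # final result is written back.
        return activation_bytes
    # Unfused: the conv writes its output once, then every epilogue op
    # reads the whole tensor and writes it back.
    return activation_bytes * (1 + 2 * num_epilogue_ops)


# Illustrative FP16 feature map: batch 1, 64 channels, 56 x 56 pixels.
activation_bytes = 1 * 64 * 56 * 56 * 2

# Two epilogue ops: BatchNorm and ReLU.
unfused = post_conv_traffic_bytes(activation_bytes, num_epilogue_ops=2, fused=False)
fused = post_conv_traffic_bytes(activation_bytes, num_epilogue_ops=2, fused=True)
traffic_ratio = unfused / fused
print(f"unfused: {unfused} B, fused: {fused} B, ratio: {traffic_ratio:.1f}x")
```

Under these assumptions the post-convolution activation traffic shrinks by the factor `1 + 2 * num_epilogue_ops`. The end-to-end gain on a real layer is smaller, because the convolution's input and weight reads remain in both cases.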

This process exemplifies the hardware-software co-design principle established in Acceleration Fundamentals, where machine learning compilers bridge high-level model representations with low-level hardware execution. The compiler optimizes models by restructuring computations, selecting efficient execution kernels, and maximizing hardware utilization (Chen et al. 2018). Unlike traditional compilers designed for general-purpose computing, ML compilers need specialized approaches for tensor computations and parallel execution.

ML compiler design

Machine learning compilers differ from traditional compilers because ML workloads are expressed as computation graphs that describe large-scale tensor operations rather than primarily sequential or multi-threaded program flow. These graphs require specialized optimizations that traditional compilers cannot efficiently apply (Li et al. 2021).

Li, Mingzhen, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. 2021. “The Deep Learning Compiler: A Comprehensive Survey.” IEEE Transactions on Parallel and Distributed Systems 32 (3): 708–27. https://doi.org/10.1109/tpds.2020.3030548.

The distinction is not merely quantitative (more parallelism) but qualitative: where traditional compilers optimize individual instructions within sequential control flow, ML compilers optimize entire dataflow graphs in which the dominant cost is data movement rather than computation. Table 23 highlights this divergence across input representation, execution model, optimization priorities, and compilation output—notice how every row reflects the shift from instruction-centric to data-movement-centric optimization.

Table 23: Compiler Optimization Priorities: Traditional and machine learning compilers diverge in their optimization targets: traditional compilers prioritize efficient execution of sequential code, while ML compilers focus on optimizing tensor operations within computation graphs for specialized hardware. ML compilers incorporate domain-specific transformations such as kernel fusion and memory-aware scheduling, unlike the instruction scheduling and register allocation techniques used in conventional compilation.

Aspect	Traditional Compiler	Machine Learning Compiler
Input Representation	Linear program code (C, Python)	Computational graph (ML models)
Execution Model	Sequential or multi-threaded execution	Massively parallel tensor-based execution
Optimization Priorities	Instruction scheduling, loop unrolling, register allocation	Graph transformations, kernel fusion, memory-aware execution
Memory Management	Stack and heap memory allocation	Tensor layout transformations, tiling, memory-aware scheduling
Target Hardware	CPUs (general-purpose execution)	GPUs, TPUs, and custom accelerators
Compilation Output	CPU-specific machine code	Hardware-specific execution plan (kernels, memory scheduling)

The table explains why compiler configuration can change performance even when model code is unchanged. ML compilers own the hidden layer that maps graph-level tensor operations onto hardware-specific kernels, layouts, and schedules; when that mapping is poor, the model leaves arithmetic units idle or moves the same bytes repeatedly.

Systems Perspective 1.2: The hidden optimization layer

Most practitioners never interact directly with ML compilers, yet compiler quality often determines whether a model achieves a low or high fraction of hardware peak performance. Calling model.compile() in Keras, torch.compile() in PyTorch, or deploying through TensorRT invokes multi-stage optimization pipelines. These pipelines perform four hidden optimizations:

Fuse operations never explicitly combined by the developer (Conv2D + BatchNorm + ReLU → single kernel)
Reorder computations to improve memory locality (tiling large matrix multiplies)
Select kernels from libraries containing hundreds of hand-tuned implementations
Transform tensor layouts between what the model definition expects and what hardware prefers

This matters practically: the same model definition can run substantially faster or slower depending on compilation backend, graph lowering, kernel selection, and runtime configuration. When performance does not meet expectations, compiler configuration and backend selection are often the first optimization levers, requiring no changes to model architecture or training procedure.

ML compilation pipeline

Machine learning models, as defined in modern frameworks, are initially represented in a high-level computation graph that describes operations on tensors. However, these representations are not directly executable on hardware accelerators such as GPUs, TPUs, and custom AI chips. To achieve efficient execution, models must go through an ML compilation pipeline that transforms them into optimized execution plans suited for the target hardware (Chen et al. 2018; Google 2025; Lattner et al. 2020).

The machine learning compilation workflow proceeds through five stages that progressively lower abstraction. Graph optimization restructures the computation graph to eliminate inefficiencies. Kernel selection then maps each operation to a hardware-specific implementation optimized for the target accelerator. Memory planning optimizes tensor layouts and access patterns to reduce bandwidth consumption. Computation scheduling distributes workloads across parallel processing elements to maximize hardware utilization. Finally, code generation translates the optimized plan into machine-specific instructions for execution.

At each stage, the compiler applies the optimizations developed in section 1.7: kernel fusion, tiling, data movement strategies, and computation placement. These optimizations are systematically incorporated into the final execution plan, which is why machine learning acceleration depends as much on compiler-driven software optimization as on hardware improvements.

Graph optimization

AI accelerators provide specialized hardware to speed up computation, but raw model representations are not inherently optimized for execution on these accelerators. Machine learning frameworks define models using high-level computation graphs, where nodes represent operations (such as convolutions, matrix multiplications, and activations), and edges define data dependencies. However, if executed as defined, these graphs often contain redundant operations, inefficient memory access patterns, and suboptimal execution sequences that can prevent the hardware from operating at peak efficiency.

For example, transformer self-attention can create large intermediate score and probability matrices. A naïve implementation that materializes and rereads those intermediates from high-bandwidth memory pays excessive memory traffic, while IO-aware attention kernels tile the computation through fast memory to avoid that traffic (Dao et al. 2022). Similarly, in a CNN, applying batch normalization and activation functions as separate operations after each convolution leads to unnecessary intermediate memory writes, increasing memory bandwidth usage. These inefficiencies are addressed during graph optimization, where the compiler restructures the computation graph to eliminate unnecessary operations and improve memory locality (Chen et al. 2018).

Jia, Zhihao, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2019. “Optimizing DNN Computation with Relaxed Graph Substitutions.” Proceedings of Machine Learning and Systems (MLSys).

Graph optimization transforms this high-level computation graph into an optimized execution plan before hardware mapping. Rather than requiring manual optimization, the compiler systematically applies transformations that improve data movement, reduce redundant computations, and restructure operations for efficient parallel execution (Chen et al. 2018; Jia et al. 2019). At this stage, the compiler works at a hardware-agnostic level, focusing on high-level restructuring before hardware-specific optimizations are applied in later phases.

Graph optimization first removes traffic that the high-level graph representation would otherwise create. Kernel fusion merges consecutive operations to eliminate unnecessary memory writes and reduce the number of kernel launches, which is particularly effective in convolutional neural networks where convolution, batch normalization, and activation functions appear in fixed sequences. Computation reordering adjusts execution order to improve data locality and parallel execution; in transformer models, this reordering enables reuse of cached key-value pairs rather than repeated memory reloads.
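A fusion rewrite can be sketched as pattern matching over the operation sequence. The version below is a deliberately simplified assumption: it treats the model as a linear chain of operation names, whereas a production compiler matches on a dependency graph and must also check that each intermediate has no other consumer. The operation names and the `fused_conv2d_bn_relu` label are hypothetical.

```python
FUSIBLE_PATTERN = ("conv2d", "batch_norm", "relu")


def fuse_conv_bn_relu(ops):
    """Replace each conv2d -> batch_norm -> relu run with one fused op."""
    fused = []
    i = 0
    while i < len(ops):
        window = tuple(ops[i:i + len(FUSIBLE_PATTERN)])
        if window == FUSIBLE_PATTERN:
            fused.append("fused_conv2d_bn_relu")
            i += len(FUSIBLE_PATTERN)
        else:
            fused.append(ops[i])
            i += 1
    return fused


# A residual-style fragment: the last conv/bn pair is followed by an add,
# so it does not match the pattern and is left untouched.
ops = [
    "conv2d", "batch_norm", "relu", "max_pool",
    "conv2d", "batch_norm", "relu",
    "conv2d", "batch_norm", "add", "relu",
]
optimized = fuse_conv_bn_relu(ops)
print(len(ops), "kernels before,", len(optimized), "after")
```

Each fused entry stands for one kernel launch in place of three, which removes two intermediate tensors that would otherwise be written to and reread from memory.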

Redundant computation elimination serves the same goal from the compute side. By identifying and removing duplicate or unnecessary operations, the compiler avoids repeated work in models with residual connections where common subexpressions might otherwise be recomputed. Memory-aware dataflow adjustments then refine tensor layouts and optimize movement; for example, tiling matrix multiplications to meet the structural requirements of systolic arrays in TPUs aligns the graph with the accelerator’s strengths. Together, these techniques prepare the model for acceleration by minimizing overhead and balancing computation against memory resources.
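Redundant computation elimination is commonly implemented as common subexpression elimination over the graph. The following is a minimal sketch under stated assumptions: nodes arrive in topological order, every operation is deterministic and free of side effects, and the node and operation names are hypothetical.

```python
def eliminate_common_subexpressions(nodes):
    """Drop nodes that recompute an (op, inputs) pair already in the graph.

    nodes: list of (name, op, input_names) in topological order.
    Returns the surviving nodes and a map from removed node to its survivor.
    """
    canonical = {}  # (op, inputs) -> name of the first node computing it
    replaced = {}   # removed node name -> surviving node name
    kept = []
    for name, op, inputs in nodes:
        if not inputs:
            # Leaves (graph inputs, parameters) are never merged.
            kept.append((name, op, ()))
            continue
        # Rewire inputs that pointed at an already-removed duplicate.
        resolved = tuple(replaced.get(i, i) for i in inputs)
        key = (op, resolved)
        if key in canonical:
            replaced[name] = canonical[key]
        else:
            canonical[key] = name
            kept.append((name, op, resolved))
    return kept, replaced


nodes = [
    ("x", "input", ()),
    ("norm_a", "layer_norm", ("x",)),
    ("norm_b", "layer_norm", ("x",)),  # same op on the same input
    ("q", "project_q", ("norm_a",)),
    ("k", "project_k", ("norm_b",)),
    ("scores", "matmul", ("q", "k")),
]
kept, replaced = eliminate_common_subexpressions(nodes)
for node in kept:
    print(node)
```

The sketch does not canonicalize commutative operations, so `add(a, b)` and `add(b, a)` would be kept as distinct nodes; it also must not be applied to stochastic operations such as dropout.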

Modern AI compilers implement these rewrites through automated pattern recognition and structured rules, but the core responsibilities are the same across compiler stacks: find fusible patterns, choose layouts that match the target memory hierarchy, and preserve the model’s mathematical meaning while exposing hardware-specific optimization opportunities. XLA, TVM, TensorRT, and MLIR are representative systems that emphasize different target constraints, from graph-level fusion to tensor-layout search and multi-stage lowering. The systems lesson is not the product list; it is that compiler restructuring turns a framework graph into an execution plan the accelerator can sustain. Without this restructuring, a large transformer model on an edge device may suffer excessive memory stalls; with it, reduced bandwidth consumption and latency can make real-time inference feasible on resource-constrained devices.

With the computation graph now fully optimized, the next step in compilation is kernel selection, where the compiler determines which hardware-specific implementation to use for each operation. Kernel selection translates the structured execution plan into optimized low-level instructions for the target accelerator.

Kernel selection

Kernel selection turns the optimized graph into a hardware contract. A kernel is a specialized implementation of a computational operation designed to run efficiently on a particular hardware architecture. Most accelerators, including GPUs, TPUs, and custom AI chips, provide multiple kernel implementations for the same operation, each optimized for different execution scenarios. Choosing the right kernel determines whether the accelerator maximizes computational throughput, avoids memory stalls, and keeps specialized processing elements busy (Chen et al. 2018; Zheng et al. 2020).

Kernel selection builds upon graph optimization, mapping the structured execution plan to the most efficient implementation available for each operation. Poor kernel choices can nullify the benefits of prior optimizations by introducing unnecessary computation overhead or memory bottlenecks (Chen et al. 2018).

In a transformer model, the matrix multiplications that dominate self-attention computations can be executed using different strategies depending on the available hardware. On a CPU, a general-purpose matrix multiplication routine is typically employed, exploiting vectorized execution to improve efficiency. In contrast, on a GPU, the compiler may select an implementation that uses tensor cores to accelerate matrix multiplications using mixed-precision arithmetic. When the model is deployed on a TPU, the operation can be mapped onto a systolic array, ensuring that data flows through the accelerator in a manner that maximizes reuse and minimizes off-chip memory accesses. For inference workloads, an integer arithmetic kernel may be preferable, as it performs computations in INT8 instead of floating-point precision, thereby reducing power consumption without significantly compromising accuracy.

In many cases, the compiler’s decision is which existing implementation to trust rather than whether to generate a kernel from scratch. cuDNN and cuBLAS offer optimized kernels for deep learning on NVIDIA GPUs, oneDNN provides optimized execution for Intel architectures, ACL (Arm Compute Library) targets Arm-based devices, and Eigen and BLIS provide efficient CPU-based implementations. These libraries encode hardware-specific knowledge so the compiler can choose a preoptimized kernel rather than reinventing an execution strategy for each platform.

AI compilers use heuristics³⁹, profiling, and cost models to decide among these options. The selection method depends on how much uncertainty the compiler can tolerate before execution begins.

³⁹ Heuristic in Kernel Selection: A practical rule-of-thumb that finds good solutions quickly without exhaustively searching all possibilities. AI compilers face an exponential search space when selecting kernels: for a single GEMM operation, tile sizes, data layouts, precision modes, and fusion opportunities create thousands of valid configurations. Heuristics encode expert knowledge about hardware behavior (for example, “use tensor cores when matrix dimensions are multiples of 16”) to make fast decisions, though they can miss 10–30 percent of achievable performance compared to autotuning approaches like TVM’s AutoTVM, which profile actual hardware.

Rule-based selection applies predefined heuristics based on known hardware capabilities. For instance, XLA, the compiler used in TensorFlow, automatically selects tensor core-optimized kernels for NVIDIA GPUs when mixed-precision execution is enabled. These predefined rules allow fast, reliable decisions without extensive analysis.
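A rule-based selector reduces to a handful of predicates over shape, precision, and hardware capability. The sketch below is hypothetical: the kernel names are invented labels, and the multiple-of-16 alignment rule is taken from the heuristic quoted in the footnote above rather than from any particular compiler's rule set.

```python
def select_gemm_kernel(m, n, k, dtype, has_tensor_cores):
    """Pick a GEMM kernel label from fixed rules (illustrative only)."""
    dims_aligned = all(dim % 16 == 0 for dim in (m, n, k))
    low_precision = dtype in ("fp16", "bf16", "int8")
    if has_tensor_cores and low_precision and dims_aligned:
        return "tensor_core_gemm"
    if dtype == "int8":
        return "int8_vector_gemm"
    return "generic_gemm"


aligned_choice = select_gemm_kernel(4096, 4096, 4096, "fp16", has_tensor_cores=True)
unaligned_choice = select_gemm_kernel(4096, 4096, 1000, "fp16", has_tensor_cores=True)
fp32_choice = select_gemm_kernel(4096, 4096, 4096, "fp32", has_tensor_cores=True)
print(aligned_choice, unaligned_choice, fp32_choice)
```

The decision costs a few comparisons and needs no profiling, which is the appeal of the approach. Its weakness is visible in the second call: a single unaligned dimension drops the operation onto the slow path even though padding might have been cheaper.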

Profile-guided selection pays more search cost to reduce uncertainty. TVM uses AutoTVM to benchmark kernel options empirically and tune execution strategies based on real execution times, so operations are assigned to implementations that perform well under actual deployment conditions.
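The core loop of profile-guided selection is short: run every candidate on representative inputs, time it, and keep the fastest. The sketch below illustrates that loop in plain Python and is not AutoTVM's API; the two candidate "kernels" are the same matrix multiplication with different loop orders, and all names are assumptions.

```python
import time


def matmul_ijk(a, b):
    """Inner-product loop order."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_ikj(a, b):
    """Row-streaming loop order: walks rows of b contiguously."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(m):
                row_c[j] += a_ip * row_b[j]
    return c


def pick_fastest(candidates, args, repeats=3):
    """Time each candidate and return the name with the best observed run."""
    timings = {}
    for name, kernel in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return min(timings, key=timings.get), timings


size = 48
a = [[(i + j) % 7 for j in range(size)] for i in range(size)]
b = [[(i * j) % 5 for j in range(size)] for i in range(size)]
candidates = {"ijk": matmul_ijk, "ikj": matmul_ikj}
chosen, timings = pick_fastest(candidates, (a, b))
print("selected:", chosen)
```

Which candidate wins depends on the machine, which is the point: the choice is made from measurements on the deployment target rather than from rules written in advance. The search cost grows with the number of candidates and the cost of each trial run.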

Cost model-based selection estimates execution time and memory consumption before profiling every option. MLIR applies this technique to determine effective tiling and memory access strategies (Lattner et al. 2020). By modeling how candidate kernels interact with the accelerator’s compute units and memory hierarchy, the compiler can select a kernel that minimizes execution cost while maximizing performance.
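A cost model can be as simple as a roofline-style estimate evaluated for each candidate. The sketch below chooses a square tile size for an N×N×N GEMM under stated assumptions: blocked execution reads roughly 2N³/T elements from off-chip memory and writes N², three T×T tiles must fit on chip, and runtime is the larger of compute time and memory time. The hardware figures are hypothetical round numbers, not the specification of any real device, and this is not how MLIR or any named compiler implements its model.

```python
def estimate_gemm_seconds(n, tile, bytes_per_elem, peak_flops, bandwidth):
    """Roofline-style runtime estimate for a blocked n x n x n GEMM."""
    flops = 2 * n ** 3
    # Each of the (n/tile)^2 output tiles streams n/tile tile pairs
    # from A and B, then writes its result once.
    elems_moved = 2 * n ** 3 / tile + n ** 2
    compute_seconds = flops / peak_flops
    memory_seconds = elems_moved * bytes_per_elem / bandwidth
    return max(compute_seconds, memory_seconds)


def choose_tile(n, tile_candidates, bytes_per_elem, on_chip_bytes,
                peak_flops, bandwidth):
    """Return the feasible tile size with the lowest estimated runtime."""
    feasible = [
        t for t in tile_candidates
        if n % t == 0 and 3 * t * t * bytes_per_elem <= on_chip_bytes
    ]
    if not feasible:
        raise ValueError("no candidate tile fits in on-chip memory")
    return min(
        feasible,
        key=lambda t: estimate_gemm_seconds(
            n, t, bytes_per_elem, peak_flops, bandwidth),
    )


# Hypothetical accelerator: 100 TFLOP/s, 1 TB/s, 96 KiB of tile storage.
hardware = dict(bytes_per_elem=2, on_chip_bytes=96 * 1024,
                peak_flops=100e12, bandwidth=1e12)
best_tile = choose_tile(4096, [16, 32, 64, 128, 256], **hardware)
print("selected tile:", best_tile)
```

With these numbers the 256-wide tile is rejected because its three tiles exceed the on-chip budget, and the largest feasible tile wins because every feasible candidate is still memory bound. No kernel is executed during the search, which is what distinguishes this approach from profile-guided selection; its accuracy is limited by how well the model matches the hardware.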

Lattner, Chris, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2–14. https://doi.org/10.1109/cgo51591.2021.9370308.

Precision-aware selection adds the numerical constraint to the same decision. Training workloads often prioritize FP32 or BF16 to maintain model accuracy, whereas inference workloads favor FP16 or INT8 to increase speed and reduce power consumption. For example, an NVIDIA GPU running inference with TensorRT can select among FP16 and calibrated INT8 engine profiles that were built for the model’s accuracy constraints and input shapes. This trade-off between precision and performance is a key aspect of kernel selection, especially in resource-constrained environments.
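The accuracy side of this trade-off can be seen in a few lines. The sketch below applies symmetric per-tensor INT8 quantization, a common textbook scheme, to a short list of illustrative weights and measures the round-trip error. It is a simplified assumption for exposition and does not reproduce the calibration procedure of TensorRT or any other toolkit.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.02, -0.75, 0.31, 1.27, -1.0, 0.004]  # illustrative values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("scale:", scale, "max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, and the scale is set by the largest magnitude in the tensor. A single outlier therefore coarsens the grid for every other value, which is one reason a model tuned in FP32 can lose accuracy when moved to an INT8 kernel without calibration.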

Some compilers extend selection into adaptive tuning, where execution strategies adjust to workload and resource conditions. AutoTVM in TVM measures kernel performance across workloads and refines execution strategies; TensorRT applies optimized engine profiles based on batch size, memory constraints, and supported precision; Google’s TPU compiler specializes execution plans for the target TPU topology and shape profile. The consequences of poor kernel selection are significant: a transformer model assigned a non-Tensor-Core kernel for matrix multiplications may execute at only a fraction of possible performance, while a model designed for FP32 execution may lose accuracy if forced onto an INT8-optimized kernel. Kernel selection is therefore as much about numerical correctness as performance.

With kernel selection complete, the next stage in compilation involves memory planning and computation scheduling, where the compiler determines how data is allocated across the memory hierarchy and how kernels are launched for execution. Where kernel selection determines what to execute, these subsequent phases dictate when and how those operations run, which in turn determines how close the accelerator gets to peak efficiency.

Memory planning

The memory planning phase ensures that data is allocated and accessed in a way that minimizes memory bandwidth consumption, reduces latency, and maximizes cache efficiency (Roesch et al. 2018; Chen et al. 2018). Even with the most optimized execution plan, a model can still suffer from severe performance degradation if memory is not managed efficiently.

Machine learning workloads are memory-intensive, requiring frequent movement of large tensors between different levels of the memory hierarchy. The compiler must determine how tensors are stored, how they are accessed, and how intermediate results are handled to prevent memory from becoming the bottleneck.

The memory planning phase optimizes tensor layouts, memory access patterns, and buffer reuse to prevent unnecessary stalls and memory contention during execution. Tensors are arranged in formats that align with hardware access patterns, minimizing format conversions. Memory accesses are structured to reduce cache misses and stalls, lowering overall bandwidth consumption. Buffer reuse reduces redundant memory allocations by managing intermediate results so that completed buffers are reclaimed promptly. Together, these strategies ensure that data is efficiently placed and accessed, enhancing both computational performance and energy efficiency.

Balancing memory availability, reuse, and access efficiency across multiple hierarchy levels makes memory planning one of the most complex compiler problems. AI compilers use several strategies to manage memory effectively and prevent unnecessary data movement.

Tensor layout optimization determines how tensors should be arranged in memory to maximize locality and prevent unnecessary format conversions. As section 1.7.1.2 established, different hardware accelerators favor different physical layouts depending on the backend kernel and precision mode. NVIDIA’s cuDNN convolution path historically expected NCHW for many FP32 kernels, while the channels-last NHWC layout aligns with Tensor Core memory coalescing for FP16 and INT8 paths; TensorFlow/XLA may choose internal layouts during lowering for the target backend. Compiler and library stacks transform tensor layouts based on the kernel and precision selected for the target hardware, ensuring that memory accesses are aligned for maximum efficiency (NVIDIA Corporation 2021; Google 2025).
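The difference between the two layouts is purely a matter of how a logical index (n, c, h, w) maps to a linear memory offset. The sketch below computes both mappings for an illustrative 64-channel, 56×56 feature map; the function names and the shape are assumptions chosen for the example.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear element offset in NCHW: width varies fastest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear element offset in NHWC: channel varies fastest."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 64, 56, 56

# Offsets of the first four channels at one pixel (n=0, h=10, w=20).
nchw_offsets = [offset_nchw(0, c, 10, 20, C, H, W) for c in range(4)]
nhwc_offsets = [offset_nhwc(0, c, 10, 20, C, H, W) for c in range(4)]

print("NCHW:", nchw_offsets)  # consecutive channels are H*W elements apart
print("NHWC:", nhwc_offsets)  # consecutive channels are adjacent
```

In NHWC the channels of one pixel sit at consecutive offsets, so a kernel that consumes several channels of the same pixel reads one contiguous block. In NCHW the same values are separated by a full H×W plane, so the same access pattern touches widely spaced addresses.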

Google. 2025. XLA: Optimizing Compiler for Machine Learning.

Roesch, Jared, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. 2018. “Relay: A New IR for Machine Learning Frameworks.” Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 58–68. https://doi.org/10.1145/3211346.3211348.

Buffer allocation and reuse complements layout optimization: the compiler minimizes memory footprint by reusing intermediate storage whenever possible. Deep learning workloads generate many temporary tensors, such as activations and gradients, which can quickly overwhelm on-chip memory if not carefully managed. Instead of allocating new memory for each tensor, the compiler analyzes the computation graph to identify opportunities for buffer reuse, ensuring that intermediate values are stored and overwritten efficiently (Roesch et al. 2018).
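Buffer reuse follows from liveness: once the last consumer of a tensor has run, its storage can hold a later tensor. The sketch below is a minimal greedy planner under stated assumptions: tensors are listed in production order with known lifetimes, a freed buffer is reused whole, and the tensor names and sizes are hypothetical. Production planners solve a harder packing problem with offsets inside a shared arena.

```python
def plan_buffers(tensors):
    """Assign tensors to reusable buffers.

    tensors: list of (name, size_bytes, first_step, last_step), sorted by
    first_step. A tensor is live on the closed interval [first, last].
    Returns the buffer id per tensor and the total bytes allocated.
    """
    free = []          # (size, buffer_id) pairs available for reuse
    active = []        # (last_step, size, buffer_id) for live tensors
    buffer_sizes = []  # size of each allocated buffer, indexed by id
    assignment = {}
    for name, size, first_step, last_step in tensors:
        # Reclaim buffers whose tensor died before this one is produced.
        still_active = []
        for end, buf_size, buf_id in active:
            if end < first_step:
                free.append((buf_size, buf_id))
            else:
                still_active.append((end, buf_size, buf_id))
        active = still_active

        # Best fit: the smallest free buffer that is large enough.
        fitting = sorted(entry for entry in free if entry[0] >= size)
        if fitting:
            buf_size, buf_id = fitting[0]
            free.remove((buf_size, buf_id))
        else:
            buf_size, buf_id = size, len(buffer_sizes)
            buffer_sizes.append(size)
        assignment[name] = buf_id
        active.append((last_step, buf_size, buf_id))
    return assignment, sum(buffer_sizes)


# A four-layer inference chain: each activation is produced at step i
# and consumed by the next layer at step i + 1.
activations = [
    ("act0", 100, 0, 1),
    ("act1", 100, 1, 2),
    ("act2", 100, 2, 3),
    ("act3", 100, 3, 4),
]
assignment, planned_bytes = plan_buffers(activations)
naive_bytes = sum(size for _, size, _, _ in activations)
print(assignment, planned_bytes, "bytes instead of", naive_bytes)
```

For this chain two buffers alternate, so the footprint stays constant no matter how many layers are added. Training changes the picture because activations must stay live until the backward pass, which lengthens lifetimes and limits reuse.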

Minimizing data movement between hierarchy levels is equally critical. AI accelerators typically have a mix of high-speed on-chip memory (such as caches or shared SRAM) and larger, but slower, external DRAM. If tensor data is repeatedly moved between these memory levels, the model may become memory bound, reducing computational efficiency. To prevent this, compilers use tiling strategies that break large computations into smaller, memory-friendly chunks, allowing execution to fit within fast, local memory and reducing the need for costly off-chip memory accesses. The consequences of neglecting memory planning are concrete: a CNN running on a GPU may achieve high computational efficiency in theory, but if its convolutional feature maps are stored in an incompatible layout that necessitates repeated format conversions, the resulting overhead can negate the gains from graph optimization and kernel selection entirely. With memory allocation determined, the compiler must next decide when and where each computation executes.
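The loop structure that tiling produces is easiest to see in a blocked matrix multiplication. The sketch below is plain Python intended to show the iteration order only, not to be fast; the function names and the small integer matrices are assumptions. The three outer loops pick one tile of work, and the three inner loops touch only the operands of that tile, which on an accelerator would be resident in fast local memory.

```python
def matmul_naive(a, b):
    """Reference result, computed one output element at a time."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile):
    """Blocked matrix multiplication with square tiles of side `tile`."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile row of C
        for j0 in range(0, m, tile):        # tile column of C
            for p0 in range(0, k, tile):    # slice of the reduction axis
                # Work confined to one tile of A, one of B, one of C.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c


# Sizes that are not multiples of the tile exercise the edge handling.
a = [[(3 * i + j) % 11 for j in range(7)] for i in range(5)]
b = [[(i - 2 * j) % 13 for j in range(3)] for i in range(7)]
tiled_result = matmul_tiled(a, b, tile=4)
print(tiled_result == matmul_naive(a, b))
```

Tiling changes the order of the arithmetic but not the set of multiply-accumulate operations. With integer inputs the result is therefore identical; with floating-point inputs the different summation order can change the last bits.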

Computation scheduling

With graph optimization completed, kernels selected, and memory planning finalized, computation scheduling determines when and where each operation executes. This phase fixes the execution order and resource assignment so that workloads are efficiently distributed across available processing elements while avoiding unnecessary stalls and resource contention (Zheng et al. 2020).

Without effective scheduling, massive parallelism goes to waste: computational units sit idle, memory bandwidth goes underutilized, and execution efficiency degrades. Computation scheduling keeps processing elements active, manages execution dependencies correctly, and distributes workloads across the available hardware resources (Chen et al. 2018; Zheng et al. 2020).

The scheduling phase coordinates parallel execution, synchronization, and resource allocation. Task partitioning decomposes computations into units that can be distributed among multiple compute cores. Execution order optimization determines the sequence for launching operations, maximizing hardware performance while reducing stalls. Resource allocation and synchronization ensure that compute cores, memory bandwidth, and shared caches are used without contention.

Implementation in AI compilers

Scheduling strategies are highly dependent on the underlying hardware architecture, since different AI accelerators have unique execution models. AI compilers implement several strategies to optimize scheduling for efficient execution.

Task partitioning divides large computational graphs into smaller units that can execute in parallel. On GPUs, this typically means mapping matrix multiplications and convolutions to thousands of CUDA cores, while on TPUs, tasks are partitioned to fit within systolic arrays that operate on structured data flows (Norrie et al. 2021). On CPUs, partitioning often focuses on breaking computations into vectorized chunks that align with SIMD execution. In each case, the goal is to keep every core active throughout execution.

Norrie, Thomas, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. “The Design Process for Google’s Training Chips: TPUv2 and TPUv3.” IEEE Micro 41 (2): 56–63. https://doi.org/10.1109/mm.2021.3058217.

Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35: 16344–59. https://doi.org/10.52202/068431-1189.

Beyond task partitioning, scheduling involves optimizing execution order to minimize dependencies and maximize throughput. Many AI models include operations that can be computed independently (for example, different batches in a batch processing pipeline) alongside operations that have strict dependencies (for example, recurrent layers in an RNN). AI compilers analyze these dependencies and attempt to rearrange execution where possible, reducing idle time and improving parallel efficiency. In transformer attention, IO-aware kernels make this scheduling problem concrete by loading blocks of queries, keys, and values into fast memory, using them while resident, and evicting them in an order that reduces high-bandwidth-memory traffic (Dao et al. 2022).
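Dependency analysis of this kind can be sketched as a level-by-level topological sort: every operation whose inputs are already complete joins the current wave, and the operations within one wave are free to run in parallel. The graph below is a hypothetical, simplified attention block, and the operation names are assumptions.

```python
def schedule_levels(dependencies):
    """Group operations into waves of mutually independent work.

    dependencies: dict mapping each op to the set of ops it depends on.
    Returns a list of waves; ops within one wave have no dependencies
    on each other.
    """
    remaining = {op: set(deps) for op, deps in dependencies.items()}
    levels = []
    while remaining:
        ready = sorted(op for op, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        for op in ready:
            del remaining[op]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


attention_block = {
    "norm": set(),
    "q_proj": {"norm"},
    "k_proj": {"norm"},
    "v_proj": {"norm"},
    "scores": {"q_proj", "k_proj"},
    "softmax": {"scores"},
    "context": {"softmax", "v_proj"},
    "out_proj": {"context"},
}
levels = schedule_levels(attention_block)
for step, wave in enumerate(levels):
    print(step, wave)
```

Only the three projections share a wave here; the rest of the block is a strict chain. That shape is why parallelism in attention comes mostly from inside each operation (across heads, batch elements, and tiles) rather than from running different operations side by side.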

Resource allocation and synchronization determine how compute cores share memory and coordinate execution. Modern AI accelerators often support overlapping computation and data transfers, meaning that while one task executes, the next task can begin fetching its required data. Compilers take advantage of this by scheduling tasks in a way that hides memory latency, keeping execution compute bound rather than memory bound wherever the workload’s arithmetic intensity allows (Chen et al. 2018). In production inference stacks, optimized runtimes and compiler-generated schedules coordinate kernel launch order, stream execution, and synchronization so the accelerator does not stall between dependent kernels (NVIDIA 2024b; Zheng et al. 2020). Poor scheduling decisions can negate the benefits of all prior compilation phases: a CNN with highly optimized kernels and efficient memory layouts will still suffer reduced throughput if compute units remain idle between kernel launches, and a transformer on a TPU may underperform if attention layers are not scheduled to overlap with memory transfers.
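The benefit of overlapping transfers with computation can be estimated with a small timeline model. The sketch below assumes a double-buffered pipeline: one transfer engine, one compute engine, and two on-chip buffers, so the load for tile i must wait until tile i − 2 has finished computing. The per-tile times are hypothetical units, not measurements.

```python
def serial_time(load, compute):
    """Every tile is loaded, then computed, with no overlap."""
    return sum(load) + sum(compute)


def double_buffered_time(load, compute):
    """Finish time when loads overlap compute using two buffers."""
    load_done = []
    compute_done = []
    for i in range(len(load)):
        # The transfer engine is busy until the previous load finishes.
        start_load = load_done[i - 1] if i >= 1 else 0
        # Tile i reuses the buffer of tile i - 2, which must be done.
        if i >= 2:
            start_load = max(start_load, compute_done[i - 2])
        load_done.append(start_load + load[i])
        # Compute needs its data and a free compute engine.
        previous_compute = compute_done[i - 1] if i >= 1 else 0
        start_compute = max(load_done[i], previous_compute)
        compute_done.append(start_compute + compute[i])
    return compute_done[-1]


tiles = 8
load = [2] * tiles     # time units to fetch one tile
compute = [3] * tiles  # time units to process one tile

no_overlap = serial_time(load, compute)
overlapped = double_buffered_time(load, compute)
print("serial:", no_overlap, "double-buffered:", overlapped)
```

With uniform tiles the overlapped time equals one load, plus (tiles − 1) times the slower of the two stages, plus one compute. Overlap therefore hides whichever stage is shorter; if loads take longer than compute, the pipeline still runs at the speed of the memory system.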

Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, et al. 2018. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 578–94.

Zheng, Lianmin, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, et al. 2020. “Ansor: Generating High-Performance Tensor Programs for Deep Learning.” Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 863–79.

Code generation

With scheduling complete, the final compilation stage translates this optimized execution plan into hardware-specific instructions. Unlike the previous phases, which required AI-specific optimizations, code generation follows many of the same principles as traditional compilers. This process includes instruction selection, register allocation, and final optimization passes, ensuring that execution makes full use of hardware-specific features such as vectorized execution, memory prefetching, and instruction reordering. Crucially, however, instruction selection for ML targets is not generic: the compiler must emit instructions that engage the hardware’s matrix-specific ISA extensions. On NVIDIA GPUs, this means emitting Parallel Thread Execution (PTX) instructions such as mma.sync.aligned to invoke Tensor Cores directly, as shown in listing 14. On Intel CPUs with Advanced Matrix Extensions (AMX), the compiler targets tile-multiply instructions operating on 2D register tiles. On Arm CPUs with the Scalable Matrix Extension, the target is outer-product accumulation across scalable matrix tiles. A code generation backend that emits generic floating-point instructions instead of these extensions leaves the hardware’s primary matrix engines idle, which can reduce effective throughput by an order of magnitude regardless of how well the earlier compilation phases performed. For CPUs and GPUs, AI compilers typically generate machine code or optimized assembly instructions, while for TPUs, FPGAs⁴⁰, and other accelerators, the output may be optimized bytecode or execution graphs that are interpreted by the hardware’s runtime system.

⁴⁰ FPGA (Field-Programmable Gate Array): “Field-programmable” means the logic fabric is configurable after manufacturing, contrasting with fixed-function ASICs. FPGAs can improve performance for latency-sensitive data center services by implementing custom pipelines matched to a particular workload (Putnam et al. 2014). This reconfigurability makes FPGAs attractive for rapidly evolving ML architectures where committing to an ASIC risks obsolescence, but the requirement for hardware description languages (Verilog/VHDL) and compilation times measured in hours creates a productivity barrier that limits adoption to deployments where the efficiency benefit justifies the engineering cost.

Putnam, Andrew, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, et al. 2014. “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services.” ACM SIGARCH Computer Architecture News 42 (3): 13–24. https://doi.org/10.1145/2678373.2665678.

From compilation to runtime

The compiler transforms high-level machine learning models into optimized execution plans tailored to specialized hardware, but that plan is still an assumption about future runtime conditions: batch shape, available memory, accelerator occupancy, and competing workloads. Graph optimization restructures computation, kernel selection maps operations to hardware-efficient implementations, memory planning optimizes data placement, and computation scheduling ensures efficient parallel execution. Together, these phases enable AI models to fully use modern accelerators with high throughput, minimal memory overhead, and efficient execution pipelines.

All compiler optimizations share a critical limitation: they occur before execution begins. This static nature is both a strength, because it enables aggressive whole-program optimization, and a weakness, because the plan cannot adapt when reality diverges from assumptions. The compiler makes decisions based on what it expects to happen, not what actually happens. Graph restructuring, kernel selection, memory planning, and computation scheduling all produce a single, optimized execution plan based on assumptions about batch sizes, dedicated hardware availability, and clean memory state.

Production AI systems inhabit a dynamic world that rarely matches these static assumptions. Batch sizes vary from one (latency-sensitive single requests) to 128 (throughput-oriented batch serving) within the same deployment. GPU memory fragments during long-running inference servers, forcing suboptimal tensor layouts. Multiple workloads compete for accelerator resources in multi-tenant cloud environments. Thermal throttling reduces sustained performance below the peaks observed in short benchmarks. The runtime system bridges static optimization and dynamic reality, continuously adapting execution to actual conditions rather than assumed conditions. The serving chapter later treats batching, admission control, and service-level objectives as end-to-end system problems (Model Serving); here the narrower question is how the accelerator runtime keeps one model execution efficient once a request or batch reaches the hardware.

Self-Check: Question

Why do ML compilers need different optimization priorities from traditional compilers?
1. ML programs are tensor-level computation graphs, so the dominant optimization axes are graph transformations (fusion, layout, partitioning), memory-traffic planning, and accelerator-specific execution—not instruction-level optimizations on sequential scalar code
2. ML programs no longer require any memory management, so compilers can skip allocation and placement entirely
3. Traditional compilers already optimize GPUs and TPUs perfectly, so ML compilers exist only to provide a friendlier API
4. ML models are too small for instruction-level optimization to matter, so any compiler will produce equivalent code
Order the following ML compilation stages by their causal dependencies: (1) graph optimization rewrites the computation graph (fusion, layout), (2) kernel selection picks a concrete implementation (cuBLAS, cuDNN, handwritten) for each operator, (3) memory planning assigns tensor buffers to hierarchy tiers, (4) computation scheduling fixes the temporal execution order across resources, (5) code generation emits device-specific machine code.
A model runs 4× faster under an optimizing backend than in eager execution while the model architecture is unchanged. Explain which compiler transformations are most likely responsible and why they deliver the gain without touching the model’s math.
What is the main goal of graph optimization before hardware-specific code generation?
1. Rewrite the computation graph to eliminate redundant operations, fuse adjacent kernels, adjust tensor layouts, and reduce intermediate activations before committing to kernel and placement choices that depend on the final graph structure
2. Bind every operation permanently to a fixed physical core, ignoring later scheduling or runtime adaptation
3. Convert all arithmetic to FP64 for numerical accuracy regardless of the target hardware’s capabilities
4. Replace vendor-optimized libraries (cuBLAS, cuDNN) with generic scalar implementations for portability
A compiler lowers the same GEMM into code for an A100, a TPU, and a CPU inference server. Why is kernel selection not a trivial library lookup?
1. The same abstract GEMM has many concrete implementations whose best choice varies with matrix shape, precision, batch size, and current hardware state; a cuBLAS call that wins at 4096×4096 FP16 may lose at 128×128 FP32, so the compiler must choose among implementations using cost models or profiling
2. Accelerators execute only one kernel type for the entire model, so kernel selection collapses to a single per-hardware choice
3. Kernel choice is unrelated to tensor shape or memory bandwidth, so a fixed library call suffices in every case
4. Vendors do not provide optimized kernel libraries for ML workloads, so all kernels must be generated from scratch
True or False: A complete compile-time execution plan is sufficient by itself, so runtime systems function as little more than administrative wrappers around compiled binaries.

See Answers →

Runtime Support

AI runtimes solve the production variability problem by extending compile-time optimizations with runtime decisions about memory, kernels, and execution profiles; TensorRT is one representative production inference runtime for this role (NVIDIA 2024b). Unlike traditional compiled programs that execute a fixed instruction sequence, AI workloads require adaptive control over memory allocation, kernel execution, and resource scheduling, continuously monitoring execution conditions and making on-the-fly adjustments to maintain hardware utilization despite changing production conditions.

NVIDIA. 2024b. NVIDIA TensorRT: Programmable Inference Accelerator.

Huang, Y., Y. Cheng, A. Bapna, O. Firat, D. Chen, M. X. Chen, H. Lee, et al. 2019. “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism.” Advances in Neural Information Processing Systems 32: 103–12.

Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. 2017. “Device Placement Optimization with Reinforcement Learning.” International Conference on Machine Learning (ICML) 70: 2430–39.

AI runtimes manage three interrelated aspects of execution. First, kernel execution management: runtimes dynamically select and dispatch computation kernels based on the current system state to minimize latency. Second, memory adaptation: because AI workloads process large tensors with varying footprints, runtimes adjust allocation dynamically to prevent bottlenecks and excessive data movement. Third, execution scaling: runtimes and training systems distribute workloads across multiple accelerators for multi-chip, multi-node, or cloud environments, as in pipeline-parallel systems such as GPipe (which splits layers across accelerators; Huang et al. 2019) and device-placement methods (which automatically assign operations to devices; Mirhoseini et al. 2017).

AI runtimes complement compiler-based optimizations by handling these execution aspects dynamically. Comparing AI runtimes to traditional software runtimes clarifies why machine learning workloads require specialized execution strategies.

ML runtime architecture

Traditional software runtimes are designed for managing general-purpose program execution, primarily handling sequential and multi-threaded workloads on CPUs. These runtimes allocate memory, schedule tasks, and optimize execution at the level of individual function calls and instructions. In contrast, AI runtimes are specialized for machine learning workloads, which require massively parallel computation, large-scale tensor operations, and dynamic memory management.

Table 24 highlights the key differences between traditional and AI runtimes. One of the key distinctions lies in execution flow. Traditional software runtimes operate on a predictable, structured execution model where function calls and CPU threads follow a predefined control path. AI runtimes, however, execute computational graphs, requiring complex scheduling decisions that account for dependencies between tensor operations, parallel kernel execution, and efficient memory access.

Table 24: Runtime Execution Models: Traditional runtimes prioritize sequential or multi-threaded instruction processing, while AI runtimes use massively parallel tensor operations for accelerated computation on machine learning workloads. This divergence necessitates specialized AI runtime architectures designed for efficient parallelization and memory management of large-scale tensor data.

Aspect	Traditional Runtime	AI Runtime
Execution Model	Sequential or multi-threaded execution	Massively parallel tensor execution
Task Scheduling	CPU thread management	Kernel dispatch across accelerators
Memory Management	Fine-grained allocation (stack and heap)	Dynamic tensor allocation, buffer reuse
Optimization Priorities	Low-latency instruction execution	Minimizing memory stalls, maximizing parallel execution
Adaptability	Mostly static execution plan	Adapts to batch size and hardware availability
Target Hardware	CPUs (general-purpose execution)	GPUs, TPUs, and custom accelerators

Memory management is another major differentiator. Traditional software runtimes handle small, frequent memory allocations, optimizing for cache efficiency and low-latency access. AI runtimes, in contrast, must dynamically allocate, reuse, and optimize large tensors, ensuring that memory access patterns align with accelerator-friendly execution. Poor memory management in AI workloads can lead to performance bottlenecks, particularly due to excessive off-chip memory transfers and inefficient cache usage.

AI runtimes are inherently designed for adaptability. While traditional runtimes often follow a mostly static execution plan, AI workloads typically operate in highly variable execution environments, such as cloud-based accelerators or multi-tenant hardware. As a result, AI runtimes must continuously adjust batch sizes, reallocate compute resources, and manage real-time scheduling decisions to maintain high throughput and minimize execution delays. AI runtimes must oversee large-scale tensor execution, multi-device coordination, and real-time workload adaptation, all of which become acutely visible when models move from development to production.

The deepest reason an AI runtime cannot run a fixed plan is that the development environment lies about the production one. The clearest case is thermal: a model tuned on a short benchmark never sees the throttling that a sustained workload triggers, and even the same chip is not the same chip. The A100 SXM operates at 400 W TDP while the A100 PCIe operates at 300 W; these are different form factors with different cooling envelopes, not boost versus sustained states, so a kernel that holds peak throughput in the lab can quietly lose it in production depending on which variant is deployed. The same development-to-production gap shows up in batch size that swings from single requests to bursts, in long-running servers that fragment GPU memory, and in shared accelerators where a neighbor workload steals the bandwidth a latency target depended on. None of these conditions is known when the model is compiled, which is why the runtime, not the compiler, has to absorb them.

To see how these runtime mechanisms work together in practice, consider a concrete scenario: a transformer inference request arrives at a production server. The runtime must adapt execution parameters such as tiling and memory allocation to current conditions (dynamic execution), determine which kernel implementation to use for each operation based on real-time hardware state (kernel selection), and schedule the selected kernels across available compute units to maximize utilization (kernel scheduling). These are not independent systems but three interrelated phases of a single runtime pipeline, and the following subsections examine each phase using this transformer inference request as a running example.

Dynamic kernel execution

While static compilation provides a solid foundation, efficient execution of machine learning workloads requires real-time adaptation to fluctuating conditions. When our transformer inference request arrives, the runtime cannot execute a fixed plan: available memory, input sequence length, and computational load may differ from what the compiler assumed. The runtime continuously adjusts execution strategies to match both hardware constraints and workload characteristics.

Individual computational operations (matrix multiplications, convolutions, activation functions) must be assigned to appropriate processing units, and this mapping is not fixed. As input data, memory availability, and system load change during execution, the runtime makes real-time decisions about execution order and memory management to keep workloads efficient despite shifting conditions.

The same adaptation appears in image classification. If an incoming batch of high-resolution images requires more memory than the compiler assumed, a static execution plan can cause cache thrashing or excessive off-chip memory accesses. A dynamic runtime can adjust tiling strategies during execution, breaking tensor operations into smaller tiles that fit within high-speed on-chip memory. This prevents memory stalls and improves cache utilization.
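A minimal sketch of the tiling adjustment: given an on-chip memory budget, pick the largest power-of-two tile edge such that the working set of a tiled matrix multiplication still fits. The three-tile working set (one tile each for the two inputs and the accumulator), the power-of-two restriction, and the memory budgets below are illustrative assumptions, not properties of any specific accelerator or runtime.

```python
def largest_square_tile(on_chip_bytes, bytes_per_elem, max_dim):
    """Return the largest power-of-two tile edge T such that three T x T tiles
    (input A, input B, and the output accumulator) fit in on-chip memory."""
    tile = 1
    while (tile * 2 <= max_dim
           and 3 * (tile * 2) ** 2 * bytes_per_elem <= on_chip_bytes):
        tile *= 2
    return tile


# Hypothetical budgets: the runtime sees less free on-chip memory under load.
roomy = largest_square_tile(192 * 1024, bytes_per_elem=2, max_dim=4096)
tight = largest_square_tile(48 * 1024, bytes_per_elem=2, max_dim=4096)
print(roomy, tight)
```

When the available budget shrinks, the same operation is simply re-tiled with a smaller edge rather than spilling to off-chip memory.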

For the running transformer inference request, sequence length may vary between calls. A static execution plan optimized for one fixed sequence length can underutilize compute resources on shorter sequences or create excessive memory pressure on longer sequences. Dynamic kernel execution mitigates this by selecting kernel implementations based on the actual sequence length and adjusting memory allocation to maintain efficiency.
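One common way to realize this, sketched here under assumptions, is to prepare a small set of plans for different sequence-length buckets and let the runtime pick the smallest bucket that fits each request. The bucket limits and function names are hypothetical; real runtimes may use shape ranges, profiles, or fully dynamic shapes instead.

```python
from bisect import bisect_left

# Hypothetical buckets: each prepared plan handles sequences up to its limit.
BUCKET_LIMITS = [128, 512, 2048, 8192]


def pick_bucket(seq_len):
    """Return the smallest bucket limit that can hold seq_len tokens."""
    if seq_len < 1:
        raise ValueError("sequence length must be positive")
    idx = bisect_left(BUCKET_LIMITS, seq_len)
    if idx == len(BUCKET_LIMITS):
        raise ValueError(f"sequence length {seq_len} exceeds the largest prepared plan")
    return BUCKET_LIMITS[idx]


def padding_waste(seq_len, limit):
    """Fraction of a plan of size `limit` spent on padding, not real tokens."""
    return (limit - seq_len) / limit


short = 64
print(padding_waste(short, pick_bucket(short)))   # bucketed plan
print(padding_waste(short, BUCKET_LIMITS[-1]))    # single plan sized for the maximum
```

For a 64-token request, the bucketed plan pads half of its capacity, whereas a single plan sized for 8192 tokens would spend more than 99 percent of its capacity on padding.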

Overlapping computation with memory movement mitigates the bottleneck that has governed the chapter: data movement between memory hierarchy levels limits computation speed. AI runtimes implement asynchronous execution and double buffering so computations proceed without waiting for memory transfers to complete. In a large-scale model, a DMA engine transfers the next batch of data over the host-to-device link while the accelerator executes the current batch, maintaining a steady flow of data and avoiding pipeline stalls. Framework APIs expose this pattern through page-locked host buffers and asynchronous copies, but the mechanism is architectural: transfer batch $n{+}1$ while computing batch $n$ so the accelerator is not serialized behind memory movement.
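The benefit of transferring batch $n{+}1$ while computing batch $n$ can be estimated with a simple timing model. This sketch assumes two buffers, constant per-batch copy and compute times, and no other contention; the specific durations are hypothetical.

```python
def serialized_time(n_batches, t_copy, t_compute):
    """Each batch is copied, then computed, with no overlap."""
    return n_batches * (t_copy + t_compute)


def double_buffered_time(n_batches, t_copy, t_compute):
    """Copy of batch n+1 overlaps compute of batch n (two buffers).

    The first copy and the last compute cannot be hidden; every step in
    between costs the slower of the two stages.
    """
    if n_batches == 0:
        return 0
    return t_copy + (n_batches - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical per-batch times in milliseconds.
serial = serialized_time(100, t_copy=2, t_compute=3)
overlapped = double_buffered_time(100, t_copy=2, t_compute=3)
print(serial, overlapped)
```

Under these assumptions the pipeline is limited by the slower stage (compute here), so the transfer cost almost disappears from the total. If the copy were the slower stage, overlap would still help, but the accelerator would remain bound by data movement.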

Convolutional layers on a GPU show the scheduling version of the same problem. When multiple convolution kernels differ in size and compute demand, static scheduling can leave compute units partially occupied. Dynamic scheduling lets AI runtimes prioritize smaller kernels when capacity is available, improving hardware utilization. NVIDIA’s TensorRT runtime can fuse small kernels into larger execution units to avoid launch overhead, optimizing latency-sensitive inference tasks.

Dynamic adjustment of execution strategies in response to real-time system conditions optimizes both training and inference performance across hardware platforms. These adaptations, however, depend on having the right kernel in the first place. Returning to our transformer inference example: before the runtime can adjust tiling or memory allocation for a matrix multiplication, it must first decide which kernel implementation to invoke.

Runtime kernel selection

While compilers perform an initial selection of kernels based on static analysis, AI runtimes may still choose among precompiled or library-provided variants during execution. Real-time factors, such as available memory, hardware utilization, and workload priorities, may differ from the assumptions made during compilation. In our transformer example, the compiler and framework determine the legal precision paths, while the runtime selects the kernel variant that best fits the current sequence length, batch shape, and available hardware resources. Runtime selection adapts execution to changing conditions, but it remains bounded by the numerical formats and kernels the model has been prepared to use.

For instance, consider transformer-based language models, where a significant portion of execution time is spent on matrix multiplications. Mixed-precision transformer systems such as Megatron-LM use FP16 execution on GPU Tensor Cores to increase throughput (Shoeybi et al. 2019). At serving time, the AI runtime must still determine the most efficient kernel variant based on the current system state. If lower precision causes unacceptable numerical instability for a particular operation, the runtime can opt for mixed-precision execution, selectively using FP32 where higher precision is necessary.

Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv Preprint arXiv:1909.08053.

Memory constraints also influence kernel selection. When memory bandwidth is limited, the runtime may adjust its execution strategy, reordering operations or changing the tiling strategy to fit computations into the available cache rather than relying on slower main memory. For example, a large matrix multiplication may be broken into smaller chunks, ensuring that the computation fits into the on-chip memory of the GPU, reducing overall latency.

Batch size also influences kernel selection. For workloads that handle a mix of small and large batches, the AI runtime may choose a latency-optimized kernel for small batches and a throughput-optimized kernel for large-scale batch processing. This adjustment ensures that the model continues to operate efficiently across different execution scenarios, without the need for manual tuning. With the appropriate kernels selected and their execution parameters adapted, the final pipeline stage determines when and where each kernel runs.
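The batch-size trade-off can be sketched as a small cost model: each kernel variant has a fixed launch overhead and a per-sample cost, and the runtime picks whichever variant has the lower estimated time for the observed batch. The variant names and cost figures are hypothetical and chosen only to show a crossover; production runtimes typically rely on profiling or vendor heuristics rather than a two-parameter model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelVariant:
    name: str
    launch_overhead_us: float  # fixed cost paid once per call
    per_sample_us: float       # marginal cost per item in the batch

    def estimated_time_us(self, batch_size: int) -> float:
        return self.launch_overhead_us + self.per_sample_us * batch_size


# Hypothetical variants and costs, for illustration only.
VARIANTS = [
    KernelVariant("latency_optimized", launch_overhead_us=5.0, per_sample_us=4.0),
    KernelVariant("throughput_optimized", launch_overhead_us=40.0, per_sample_us=1.5),
]


def select_kernel(batch_size: int) -> KernelVariant:
    """Pick the variant with the lowest estimated time for this batch size."""
    return min(VARIANTS, key=lambda v: v.estimated_time_us(batch_size))


for batch in (1, 8, 32, 128):
    print(batch, select_kernel(batch).name)
```

With these numbers the latency-optimized variant wins for small batches and the throughput-optimized variant takes over once its fixed overhead has been amortized.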

Kernel scheduling and utilization

Kernel scheduling completes the runtime pipeline by determining how selected kernels execute across available hardware to maximize parallelism and resource utilization. Returning to the transformer inference request: the runtime has selected FP16 kernels for the attention matrix multiplications and adapted tiling to fit the current sequence length. Now the scheduler must distribute these operations across GPU streaming multiprocessors, interleave them with layer normalization and activation kernels, and ensure that intermediate data is prefetched before each operation needs it. Unlike traditional task schedulers that manage CPU threads, AI runtimes coordinate a much larger number of tasks across parallel execution units: GPU cores, TPU systolic arrays (Jouppi et al. 2017), or custom AI accelerators. Keeping these resources fully engaged prevents bottlenecks and maximizes throughput.

In image recognition models that use convolutional layers, the scheduler can distribute filters across multiple processing units so independent work runs concurrently. Batch normalization and activation functions then become scheduling hazards: if they are not interleaved with other computation, they block the pipeline and reduce throughput. Runtime scheduling preserves utilization by keeping these smaller kernels from serializing the larger convolution path.
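A simplified view of this load-balancing problem is a greedy "longest job first" assignment of kernels to compute units. The sketch assumes the kernels are independent of one another (for example, drawn from different branches or requests) and that their costs are known in advance; both are simplifying assumptions, and the kernel names and costs are hypothetical.

```python
import heapq


def assign_kernels(kernel_costs, n_units):
    """Greedy longest-first assignment of independent kernels to compute units.

    Returns (makespan, plan), where plan[u] lists the kernels given to unit u.
    """
    heap = [(0.0, unit) for unit in range(n_units)]  # (busy-until time, unit id)
    heapq.heapify(heap)
    plan = [[] for _ in range(n_units)]
    for name, cost in sorted(kernel_costs.items(), key=lambda kv: -kv[1]):
        busy_until, unit = heapq.heappop(heap)       # least-loaded unit so far
        plan[unit].append(name)
        heapq.heappush(heap, (busy_until + cost, unit))
    makespan = max(t for t, _ in heap)
    return makespan, plan


# Hypothetical per-kernel costs in microseconds.
costs = {"conv_a": 90, "conv_b": 60, "conv_c": 50, "batchnorm": 20, "relu": 10}
makespan, plan = assign_kernels(costs, n_units=2)
print(makespan, plan)
```

Running the five kernels back to back on one unit would take 230 microseconds; spreading them across two units brings the finish time close to the ideal of 115, with the small kernels filling gaps left by the large convolutions.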

Real-time memory management reinforces the same scheduling goal. AI runtimes preload intermediate data, such as feature maps in deep neural networks, into cache before kernels need it. This proactive movement prevents delays from slower memory tiers and keeps execution continuous.

Together, kernel selection, dynamic execution adaptation, and scheduling form a tightly coupled runtime pipeline. For our transformer inference request, the pipeline determined the best kernel for each operation, adapted tiling and precision to current memory and hardware conditions, and distributed work across compute units to sustain high utilization. These three phases operate continuously and interdependently: a scheduling decision may trigger re-selection of a different kernel, which in turn requires new execution parameter adaptation.

The compiler and runtime systems examined thus far optimize execution within a single accelerator. Compiler optimization, dataflow selection, fusion, and memory planning can achieve impressive single-chip results, yet the largest AI workloads exceed what any one chip can deliver, however well optimized.

Consider the scale of training GPT-3, which required approximately $3.14 \times 10^{23}$ floating-point operations (Brown et al. 2020), a number so large it defies intuition without concrete comparison. To grasp this magnitude: even at the H100’s peak FP8 throughput of 1.98 PFLOP/s, completing this computation on a single accelerator would require about 5 years of continuous operation at theoretical peak, and roughly 8.4–12.6 years under the utilization assumptions used in this worked example (Choquette 2023). High-volume inference services create the same pressure from the opposite direction: each request is smaller than a full training run, but global request volume can exceed what any single accelerator can serve. These computational requirements necessitate scaling beyond single-chip systems, introducing different engineering challenges from those we have examined.
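The arithmetic behind these figures is short enough to check directly. The total operation count and peak throughput are the values quoted above; the 60 and 40 percent utilization levels are assumptions chosen here because they reproduce the quoted 8.4 to 12.6 year range.

```python
TOTAL_FLOPS = 3.14e23        # GPT-3 training compute quoted in the text
PEAK_FLOPS_PER_S = 1.98e15   # H100 peak FP8 throughput quoted in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_on_one_chip(utilization):
    """Years of continuous operation at a given fraction of peak throughput."""
    return TOTAL_FLOPS / (PEAK_FLOPS_PER_S * utilization) / SECONDS_PER_YEAR


# Assumed utilization levels: 100% (theoretical peak), 60%, and 40%.
for u in (1.0, 0.6, 0.4):
    print(f"utilization {u:.0%}: {years_on_one_chip(u):.1f} years")
```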

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–901. https://doi.org/10.48550/arxiv.2005.14165.

Self-Check: Question

Why do AI systems need specialized runtime support even after a model has been fully compiled?
1. Production conditions—variable batch size, memory fragmentation, multi-tenant contention, thermal throttling—routinely diverge from the fixed assumptions baked into compile-time plans, so execution must adapt to actual state rather than follow a static script
2. Compilation cannot produce executable machine code for accelerators, so a runtime is required to translate the plan into instructions
3. Runtimes exist only to provide a user interface for monitoring accelerator temperatures and utilization
4. Compiled execution plans are mathematically incomplete and produce wrong results without runtime correction
A transformer inference service receives requests with sequence lengths ranging from 64 to 8192 tokens. Explain how runtime dynamic kernel execution improves performance compared to a single compile-time plan built for an average sequence length.
A runtime observes a batch shape that would benefit from FP16 tensor cores and replaces a compiler-selected FP32 matrix kernel with an FP16 tensor-core kernel for that step. Which justification best matches the section’s argument?
1. Runtime kernel selection can exploit actual hardware state and workload shape: if current conditions permit reduced precision (precision budget, tensor-core availability, input range), the runtime substitutes a faster implementation that the compile-time plan conservatively rejected
2. The runtime swaps kernels only to reduce accuracy for easier debugging, never for performance
3. Compiled kernels are unexecutable on real hardware until the runtime rewrites the entire graph
4. FP16 tensor-core substitution is a CPU-only optimization and has no effect on accelerators
True or False: An inference service that benchmarks at a stable 10 ms p99 latency in isolation will deliver roughly the same p99 under multi-tenant production load because the compiled execution plan and the hardware are unchanged.
What is the main role of kernel scheduling inside the AI runtime?
1. Determine when and where selected kernels execute on available resources so compute, data movement, and host-device transfers are overlapped and accelerators stay continuously fed
2. Rewrite the model architecture to reduce parameter count before execution begins
3. Write training checkpoints to SSD at fixed intervals during inference
4. Guarantee identical wall-clock latency for every kernel, regardless of operand shape or resource contention

See Answers →

Multi-Chip Scaling

A single H100 delivers a peak of about 1.98 PFLOP/s of FP8 throughput, yet training a very large language model can still require thousands of such chips working in concert. The techniques covered in previous sections (dataflow optimization, kernel fusion, memory hierarchy exploitation, and compiler optimization) remain the foundation for efficient execution even in multi-chip scaling; each individual accelerator must still be optimized using these principles. The new lesson is that communication becomes another memory hierarchy. Moving data from registers to SRAM, HBM, NVLink, and then a data center fabric gets progressively slower and more expensive, so scaling is no longer only a question of adding chips. The goal here is to understand the hardware boundary where communication starts to dominate.

When single-accelerator capacity proves insufficient, the design problem shifts from feeding one chip to choosing which communication boundary the workload can tolerate. Practitioners encounter these boundaries in production even when most optimization work remains inside a single accelerator.

Multi-chip scaling approaches

Large AI systems scale beyond individual accelerators by moving the communication boundary outward, and each boundary changes the dominant trade-off. The sequence begins inside the package. Chiplet-based architectures partition large designs into smaller, modular dies interconnected within one package, bypassing manufacturing limits of monolithic chips while preserving relatively low communication latency. The next boundary is the node: multi-accelerator servers connect several chips through board- or server-level interconnects. Each accelerator has dedicated memory and compute resources, so workloads split through data parallelism (each accelerator processes different batches) or model parallelism (different accelerators handle different network layers). High-bandwidth intra-node interconnects can enable efficient gradient synchronization, though realized performance depends on topology and collective communication efficiency.

Beyond the node, the boundary expands into the cluster. Purpose-built data center fabrics coordinate hundreds of accelerators, making topology and collective communication algorithms central determinants of scaling efficiency; near-linear scaling is achievable on some workloads when communication overhead is controlled. Wafer-scale integration is the counter-move: instead of pushing the boundary outward, it collapses more computation back into one large device. Platforms such as Cerebras WSE-class systems integrate extremely large numbers of transistors and cores on a single device, reducing inter-chip communication overhead while introducing their own challenges in thermal dissipation, fault tolerance, and manufacturing yield.

Why scaling introduces new constraints

The transition from single-chip to multi-chip architectures introduces communication overhead and other qualitatively different constraints that reshape system optimization. Communication overhead emerges as the primary limit on scaling efficiency. Amdahl’s Law⁴¹ quantifies how communication during gradient synchronization creates sequential bottlenecks. For hundred-billion-parameter-scale models, AllReduce operations⁴² can require exchanging hundreds of gigabytes of gradients per training step.

⁴¹ Amdahl’s Law (Scaling Limit): The section “Amdahl’s Law and Gustafson’s Law” formalizes Amdahl’s Law, which bounds speedup by the serial fraction of a workload. In multi-accelerator training, gradient synchronization can become a dominant serial fraction: at just 5 percent exposed synchronization overhead, maximum speedup is capped at 20$\times$ regardless of how many accelerators are added. This hard ceiling explains why distributed-training systems focus so heavily on reducing, hiding, or overlapping communication.

⁴² AllReduce: A collective operation from MPI that aggregates values across processes (the “reduce”) and distributes the result back to every process (the “all”). In this chapter, the term appears only to identify why accelerator-to-accelerator bandwidth matters for training workloads. At scale, algorithms, topology choices, and runtime protocols determine how costly this synchronization becomes.
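The 20$\times$ ceiling in the Amdahl’s Law note follows from the standard formula, speedup $= 1 / (s + (1 - s)/N)$ for serial fraction $s$ and $N$ accelerators. A minimal sketch, treating the exposed synchronization overhead as the serial fraction:

```python
def amdahl_speedup(serial_fraction, n_accelerators):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_accelerators)


# With 5% exposed synchronization, adding accelerators approaches but never reaches 20x.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 1))
```

Going from 64 to 1024 accelerators multiplies the hardware by 16 but moves the bound only from roughly 15$\times$ to just under 20$\times$, which is why reducing the serial fraction matters more than adding chips.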

The first-order quantity is the gradient payload. For a model with $P$ parameters, that payload is roughly the parameter count multiplied by the bytes stored for each gradient element before optimizer state, padding, and protocol overhead enter. Scaling therefore improves only when the saved compute time exceeds the time to move this payload through the chosen interconnect and collective algorithm.
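The payload estimate and the break-even condition above can be written as a short calculation. This is a rough sketch under stated assumptions: 2 bytes per gradient element (FP16 or BF16), a hypothetical 300 GB/s link, and a naive transfer-time model that ignores the traffic multiplier of the chosen collective algorithm, latency, and any overlap with computation.

```python
def gradient_payload_bytes(num_params: float, bytes_per_grad: int = 2) -> float:
    """First-order gradient payload: parameter count times bytes per element."""
    return num_params * bytes_per_grad


def transfer_time_s(payload_bytes: float, link_bytes_per_s: float) -> float:
    """Naive lower bound: one pass of the payload over one link."""
    return payload_bytes / link_bytes_per_s


def scaling_helps(compute_saved_s: float, comm_added_s: float) -> bool:
    """Adding devices pays off only if saved compute exceeds added communication."""
    return compute_saved_s > comm_added_s


payload = gradient_payload_bytes(100e9)        # 100B parameters, 2 bytes each
comm_s = transfer_time_s(payload, 300e9)       # hypothetical 300 GB/s link
print(f"payload: {payload / 1e9:.0f} GB, naive transfer time: {comm_s:.2f} s")
print("worth scaling:", scaling_helps(compute_saved_s=1.0, comm_added_s=comm_s))
```

A 100-billion-parameter model at 2 bytes per gradient yields a 200 GB payload, consistent with the "hundreds of gigabytes" figure for models at this scale.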

This communication overhead explains why scaling to very large accelerator counts can show diminishing returns unless the system reduces the amount of data exchanged, hides communication behind useful computation, or chooses a parallelization strategy with a better communication pattern. The same expansion also changes the memory model. Memory coherence becomes expensive because ensuring that all processors see consistent views of shared memory adds protocol traffic and latency. For AI accelerators with thousands of cores, this overhead can become prohibitive, forcing explicit memory management where programmers control data placement and synchronization manually.

Once computation spans many links, chips, and memory stacks, reliability and energy become part of the same scaling story. Large-scale systems must handle component failures gracefully because the probability of at least one failure rises with system size. For this chapter, the hardware lesson is enough: TPU Pods and wafer-scale systems both need hardware-level redundancy and communication paths that tolerate component failures (Jouppi et al. 2023; Cerebras Systems 2021). Data movement also grows more expensive with distance, transforming distributed training into a careful balance between computation parallelism and communication efficiency.

Cerebras Systems. 2021. “Wafer-Scale Deep Learning Acceleration with the Cerebras CS-2.” Cerebras Technical Paper 2021.
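The claim that the probability of at least one failure rises with system size can be made concrete with a small calculation. The sketch assumes independent failures and a hypothetical per-device daily failure probability of 0.1 percent; both are illustrative assumptions, not measured values.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """P(at least one failure) = 1 - P(no component fails), assuming independence."""
    return 1.0 - (1.0 - p_component) ** n_components


P_DAILY = 0.001  # hypothetical per-device failure probability per day
for n in (1, 64, 512, 4096):
    print(f"{n:5d} devices -> P(any failure today) = {prob_any_failure(P_DAILY, n):.3f}")
```

With the assumed 0.1 percent daily failure probability per device, a 64-device system sees at least one failure on about 6 percent of days, while a 4,096-device system sees one on about 98 percent of days.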

Data center scaling and edge deployment represent opposite ends of a deployment spectrum, yet they share the same core principles. Data center scaling coordinates many high-throughput accelerators, while edge scaling fits useful AI into a few constrained watts. Both cases require matching workload characteristics to hardware capabilities while minimizing data movement. The principles of compute specialization, memory hierarchy optimization, and workload mapping apply at both scales; only the constraints differ. Data centers optimize for aggregate throughput within power budgets measured in megawatts; edge devices optimize for responsiveness within tight battery and thermal envelopes. The same vision model that runs comfortably in a data center may need a radically different mapping strategy on a smartphone or microcontroller.

Self-Check: Question

A training run scales from 8 to 64 accelerators and sees only a 5.2$\times$ speedup rather than 8$\times$. Which factor best explains the diminishing return?
1. Gradient synchronization, parameter broadcast, and inter-node coordination create a communication overhead that acts as Amdahl’s serial fraction—growing as more devices must participate in every all-reduce, eventually capping speedup regardless of added compute
2. Accelerators lose arithmetic capability when connected together through NVLink or Ethernet, so per-device throughput drops as the cluster grows
3. Memory capacity decreases as more accelerators are added, forcing smaller batches that scale worse
4. All-reduce removes the need for per-device computation once two or more devices are present
Which scaling approach most directly reduces inter-chip communication overhead by treating an entire wafer as one compute fabric?
1. Wafer-scale integration, which keeps communication on a single large silicon die and avoids package boundaries, PCIe links, and network switches for most cross-compute-unit traffic
2. Chiplet packaging across standard CPU sockets, which routes most communication through motherboard interconnect
3. A conventional PCIe-attached multi-GPU workstation, which uses the host PCIe bus for inter-GPU traffic
4. A cluster linked only by commodity Ethernet, which amortizes communication across TCP/IP
Explain why memory coherence becomes a materially harder problem on multi-chip AI systems than on single-chip accelerators, and give one concrete consequence for parallelism design.
A team scales a training cluster from 64 to 4,096 GPUs. Which factor grows sharply and forces fault tolerance into the design’s first-class concerns?
1. Component mean-time-to-failure is roughly constant per GPU, so aggregate failure rate scales linearly with device count; at 4,096 GPUs a single failure that would lose 15 minutes of progress at 64 devices now occurs many times per day, requiring checkpointing and automatic recovery
2. Software-stack complexity grows but hardware failure rates stay negligible, so fault tolerance is still optional
3. Network topology becomes irrelevant at scale because all-reduce self-adapts to any shape
4. Thermal and power limits disappear at scale because heat dissipates across a larger footprint


Heterogeneous SoC Design

Mobile, automotive, and IoT deployments operate under much tighter power, thermal, and latency constraints than data center hardware. These constraints force specialized compute units, memory hierarchy optimization, and workload mapping strategies to operate under dramatically different rules. The result is heterogeneous System-on-Chip (SoC) architectures that integrate CPU cores, GPU shaders, digital signal processors (DSPs), and dedicated neural processing units (NPUs) within a single chip. Orchestrating these diverse processors to achieve optimal performance under strict power, thermal, and latency requirements demands wholly different approaches than data center deployments.

Example 1.4: Heterogeneous microcontrollers

Context: The Smart Doorbell (Wake Vision) pushes heterogeneity to its logical limit. Unlike a smartphone SoC with a multi-watt budget, a doorbell camera often runs on a microcontroller with a milliwatt budget.

Insight: To achieve real-time person detection (30 FPS) within this envelope, MCUs can adopt the same heterogeneous strategy as their larger mobile cousins but at a micro-scale. A typical architecture pairs a general-purpose core (for example, Cortex-M) for system logic with a dedicated micro-NPU (for example, Ethos-U) for CNN acceleration.

Systems lesson: The micro-NPU can execute the Wake Vision MobileNetV2-style workload far more energy efficiently than the CPU alone when its operators map to the accelerator. Without specialized acceleration, the always-on promise of the Smart Doorbell would be difficult to meet within a microcontroller-class energy budget.

Mobile SoC architecture evolution

Modern mobile AI engines exemplify heterogeneous computing by coordinating CPU cores, GPU shaders, DSPs, and dedicated NPUs⁴³ across a shared memory hierarchy. Workload distribution lets computer vision kernels execute on GPU or NPU paths, audio processing use DSP arithmetic units, and matrix-heavy neural-network layers use NPU-optimized engines when the operator set is supported. This coordination requires careful scheduling to meet real-time constraints while managing thermal throttling and battery life.

⁴³ NPU (Neural Processing Unit): The NPU’s specialized matrix engines are optimized for dense tensor operations, providing the hardware basis for the workload distribution described. This specialization creates a critical constraint for the scheduler: any AI operator not mapped to the NPU’s supported data paths must “fall back” to a GPU or CPU. This fallback can erase the NPU’s energy-efficiency advantage, complicate real-time latency budgets, and contribute to thermal pressure on mobile devices.

Some mobile SoC designs emphasize diverse processor specialization, while vertically integrated strategies show how hardware-software co-design can enable tightly coordinated heterogeneous execution. Unified memory architectures can reduce explicit data copying overhead, and different compute blocks can be scheduled for different operator types (for example, matrix-heavy layers on an NPU, convolutional operators on a GPU, and control flow on the CPU). This coordination supports interactive on-device experiences, though realized latency depends on the full pipeline and device thermal conditions.

Beyond vertically integrated solutions, IP licensing models allow SoC designers to customize processor combinations based on target applications, mixing CPU, GPU, DSP, and NPU blocks. This modular flexibility allows automotive SoCs to emphasize deterministic real-time processing while smartphone SoCs optimize for interactive performance and battery efficiency.

Strategies for dynamic workload distribution

With multiple specialized processors available on heterogeneous SoCs, the critical challenge becomes intelligently distributing neural network operations across these resources to maximize performance while respecting power and latency constraints. Consider a concrete example: an engineer deploying a real-time object detection pipeline on a mobile SoC with a CPU, GPU, and NPU. The pipeline has three stages: a MobileNet backbone for feature extraction, nonmaximum suppression (NMS) for postprocessing, and a display overlay for rendering bounding boxes. The backbone consists of depthwise separable convolutions with regular, predictable access patterns and low compute cost, making it a good fit for an NPU when the operator set is supported, even though depthwise layers are often memory-bound rather than high-arithmetic-intensity kernels. NMS, by contrast, involves conditional branching over variable-length candidate lists, with irregular memory access that maps poorly to the NPU’s fixed dataflow. The CPU handles NMS more efficiently because its branch predictor and large caches accommodate the unpredictable control flow. Finally, the display overlay involves pixel-level compositing across the entire frame, a massively parallel but arithmetically simple workload that maps naturally to the GPU’s shader cores. This three-way split, NPU for the backbone, CPU for NMS, GPU for the overlay, achieves lower latency and lower power than running the entire pipeline on any single processor.

This example illustrates the general principle: neural networks require intelligent partitioning across heterogeneous processors based on operation characteristics and current system state. Convolutional layers with regular data access patterns typically execute efficiently on GPU shader cores or NPU matrix engines, while operations with irregular sparsity patterns or conditional control flow may perform better on general-purpose CPU cores with large caches. Attention mechanisms in transformers benefit from NPU matrix engines when sequences are long, but may execute more efficiently on CPU when sequence lengths are small due to the NPU setup overhead.
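The static part of this partitioning can be sketched as a small rule-based mapper. The `Stage` fields and the `assign_processor` heuristic are hypothetical simplifications of the reasoning in the object detection example; a production runtime would also weigh operator support tables, measured latency, and current system state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    regular_access: bool  # dense, predictable memory access
    branch_heavy: bool    # data-dependent control flow
    npu_supported: bool   # every operator maps to the NPU's data paths


def assign_processor(stage: Stage) -> str:
    """Hypothetical static heuristic mirroring the three-way split in the text."""
    if stage.branch_heavy:
        return "CPU"  # branch prediction and caches absorb irregular control flow
    if stage.regular_access and stage.npu_supported:
        return "NPU"  # fixed dataflow suits regular, supported operators
    return "GPU"      # parallel but not NPU-supported: use the shader cores


PIPELINE = [
    Stage("mobilenet_backbone", regular_access=True, branch_heavy=False, npu_supported=True),
    Stage("nms", regular_access=False, branch_heavy=True, npu_supported=False),
    Stage("display_overlay", regular_access=True, branch_heavy=False, npu_supported=False),
]

PLACEMENT = {stage.name: assign_processor(stage) for stage in PIPELINE}
print(PLACEMENT)
```

Encoding the decision as data about each stage, rather than hard-coding stage names, keeps the mapper usable when the pipeline or the supported operator set changes.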

Beyond static operation-to-processor mapping, the optimal assignment can change moment to moment. Returning to the object detection example: during battery operation, the system might shift the MobileNet backbone from the NPU to lower-power DSP cores, accepting higher latency to extend battery life. Thermal state introduces another dimension: when approaching thermal limits, the runtime may reduce frame rate, switch to a smaller model, down-clock the NPU, or move unsupported branch-heavy stages to the CPU rather than assuming the CPU is always the more efficient neural-network engine. Safety-critical automotive applications add latency requirements that prioritize deterministic execution over peak throughput. Finally, concurrent workload interference from multiple AI applications may require load balancing across available processors to maintain quality of service.

Compounding the processor selection challenge, shared memory architectures require arbitration when multiple processors access LPDDR simultaneously. Mobile memory controllers may prioritize real-time camera or display paths over background AI tasks, forcing neural-network runtimes to adapt their execution patterns to available bandwidth. This arbitration becomes critical during memory-intensive operations like large language model inference, where parameter streaming from DRAM must be carefully coordinated across processors.

Power and thermal management

Mobile AI workloads must maintain high performance while operating within strict power budgets and thermal envelopes. These constraints require tight coordination across heterogeneous processors.

Heterogeneous SoCs implement coordinated dynamic voltage and frequency scaling (DVFS) across multiple processors to optimize the power-performance envelope. When one processor increases frequency to meet latency demands, the system may reduce voltage on other processors to maintain total power budget. This coordination becomes complex in AI workloads where computational phases may shift rapidly between processors. The system must predict upcoming workload transitions to preemptively adjust operating points while avoiding voltage/frequency oscillations that degrade efficiency.
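Why trading voltage and frequency between processors is effective can be seen from the standard first-order CMOS model, in which dynamic power scales as $C_{\text{eff}} V^2 f$. The text does not state this model, so it is an assumption of the sketch below, as are the illustrative capacitance and operating points; leakage power is ignored.

```python
def dynamic_power_w(c_eff_farads: float, voltage_v: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power: P = C_eff * V^2 * f (leakage ignored)."""
    return c_eff_farads * voltage_v ** 2 * freq_hz


C_EFF = 1e-9  # hypothetical effective switched capacitance, in farads

nominal_w = dynamic_power_w(C_EFF, voltage_v=1.0, freq_hz=2.0e9)
scaled_w = dynamic_power_w(C_EFF, voltage_v=0.8, freq_hz=1.6e9)  # both reduced by 20 percent

print(f"nominal: {nominal_w:.3f} W, scaled: {scaled_w:.3f} W")
print(f"power ratio: {scaled_w / nominal_w:.3f}")
```

Under this model, reducing voltage and frequency by 20 percent each cuts dynamic power to about 51 percent of nominal, which is the headroom a coordinated DVFS policy can hand to another processor.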

When DVFS alone cannot maintain the power envelope, mobile SoCs implement thermal throttling through a mixture of frequency reduction, model adaptation, and task migration. When the NPU approaches thermal limits during intensive neural network processing, the runtime can shift selected operators to another supported processor, lower inference frequency, or choose a smaller model profile. This approach preserves service availability during thermal events, though it requires detailed workload characterization to predict execution time and power consumption across different processors.

Beyond real-time power and thermal management, mobile AI systems must also adapt their computational strategies based on battery state and charging status. During low battery conditions, the system may switch from high-accuracy models to efficient approximations, migrate workloads from power-hungry NPU to energy-efficient DSP, or reduce inference frequency while maintaining application responsiveness. Conversely, during charging, the system can enable higher-performance models and increase processing frequency to deliver enhanced user experiences.

Automotive heterogeneous AI systems

Automotive applications introduce unique heterogeneous computing challenges that combine mobile-style power efficiency with hard real-time latency requirements and functional safety requirements. This combination demands distinct architectural approaches.

Automotive SoCs aim to provide deterministic inference latency for safety-critical functions while supporting advanced driver assistance systems (ADAS). Redundant processing elements support functional safety objectives while high-performance accelerators handle perception, planning, and control algorithms. This architecture requires temporal isolation between safety-critical and convenience functions, implemented through hardware partitioning and time-triggered scheduling. The concrete ML workload motivation is bandwidth contention: a natural-language voice assistant running a large language model on the same SoC can consume substantial memory bandwidth during decoding, interfering with the bandwidth required by a safety-critical 3D object detection network that must complete its next inference within a hard deadline. Temporal isolation prevents convenience workloads from appearing in the object detection network’s scheduling window. Similarly, sensor fusion models that ingest radar, lidar, and camera streams simultaneously must meet strict per-frame deadlines coordinated by the time-triggered scheduler; any bandwidth or compute interference from a lower-priority AI service can push these models past their deadline and trigger a safety fallback.

These safety requirements become even more complex when considering that modern vehicles integrate multiple AI-enabled SoCs for different domains. Vision processing SoCs handle camera-based perception, radar processing SoCs manage RF sensor data, while central compute platforms coordinate high-level decision-making. These distributed systems must maintain temporal coherence across sensor modalities, requiring specialized inter-SoC communication protocols and distributed synchronization mechanisms.

Extending beyond the vehicle’s internal sensors, vehicle-to-everything (V2X) communication adds another layer of heterogeneous processing where AI algorithms must coordinate local sensor processing with information received from other vehicles and infrastructure. This requires low-latency processing chains where modems, AI accelerators, and control systems operate under strict timing and functional-safety requirements.

Software stack challenges

The architectural sophistication of heterogeneous SoCs turns software into the coordination layer for power, thermal state, determinism, and operator fallback. Programming heterogeneous SoCs requires frameworks that abstract processor differences while still exposing performance-critical optimization opportunities. OpenCL and Vulkan provide cross-processor execution, but optimal performance still depends on processor-specific tuning that complicates portability. Modern ML frameworks such as TensorFlow Lite and PyTorch Mobile implement automatic processor selection, but developers still need to understand heterogeneous execution patterns because unsupported operators may fall back to less efficient processors.

Shared memory architectures compound the coordination problem. Memory management must account for processor-specific caching behavior, memory access patterns, and coherency requirements. CPU caches may interfere with GPU memory access patterns, while NPU direct memory access (DMA) operations must be synchronized with CPU cache operations to maintain data consistency.

Some heterogeneous SoC runtimes address this complexity through machine learning-based optimization that learns from execution patterns to improve processor selection, thermal management, and power allocation. These systems collect telemetry on workload characteristics, processor utilization, and power consumption to predict suitable execution strategies for new workloads.

No single processor architecture can optimally handle the diverse computational patterns in AI applications, so heterogeneous acceleration becomes a coordination problem rather than a hardware inventory. Efficient mobile AI systems deliver high performance only when processor assignment, memory coherence, thermal limits, and latency constraints are managed together.

The same coordination problem also has an energy consequence. If hardware selection determines how much data moves, how often accelerators stall, and how efficiently arithmetic maps to silicon, then it also determines how much energy the deployment consumes for each useful prediction.

Self-Check: Question

Why do modern mobile and edge SoCs integrate CPUs, GPUs, DSPs, and NPUs rather than rely on a single general-purpose processor?
1. Different tasks in a mobile AI pipeline (convolution backbone, audio DSP, control logic, display rendering) have different compute, control-flow, latency, and power profiles; specialized blocks execute each task at a fraction of the energy-per-op that a general-purpose core would need
2. Heterogeneous chips eliminate the need for any runtime scheduling because each workload has one obvious home
3. CPUs are physically incapable of executing AI-related code, so alternative processors are required
4. A single general-purpose processor would always exceed legal transistor limits at the silicon density required for mobile devices
A mobile object-detection pipeline has three stages: a convolutional backbone, non-maximum suppression (NMS), and display overlay rendering. Justify a sensible partition of these stages across an NPU, CPU, and GPU and explain what the energy cost would be of running all three on any single processor.
A mobile SoC under sustained AI inference approaches its thermal envelope. Which response best matches the section’s thermal-management argument?
1. Migrate work across processors and coordinate DVFS—moving one pipeline stage from a saturated NPU to an idle GPU while reducing voltage on the hottest block—keeping the aggregate within the thermal envelope while sustaining acceptable frame latency
2. Keep all work on the hottest accelerator and accept short-term throttling, since migration costs exceed benefits
3. Disable the memory hierarchy, because cache activity is the dominant heat source on mobile SoCs
4. Force every processor into its maximum-performance state simultaneously to clear the work queue faster
True or False: Automotive heterogeneous AI can treat latency optimization the same way smartphone inference does, because both are power-constrained edge deployments.
What is a unique software-stack challenge heterogeneous SoCs pose that a single-accelerator system does not?
1. The software must coordinate processor-specific execution models, manage shared-memory interactions across heterogeneous cores (processor-specific caching behavior, DMA synchronization), and schedule across processors whose instruction sets and precision formats differ—all while presenting a usable programming model
2. Models no longer need tensor layout decisions because each processor has a universal layout
3. Compilation becomes unnecessary because runtime processor selection replaces every optimization pass
4. Power and thermal management move entirely into hardware, so software concerns decrease at scale


Hardware Sustainability

At fleet scale, the energy a chip burns per inference stops being a footnote and becomes a primary hardware-selection criterion. A single high-volume inference workload, replicated across thousands of servers, turns a modest per-operation efficiency gap into hundreds of metric tons of CO₂ per year. Performance per watt therefore governs an AI service’s operational carbon footprint and electricity bill, not only battery life on a phone. The napkin math below makes the stake concrete, comparing a generic-CPU fleet against specialized accelerators on the same billion-inference-per-day workload.

Napkin Math 1.10: The carbon ROI of specialized silicon

Problem: Should an inference fleet run on generic CPUs or invest in specialized NPUs (Neural Processing Units)?

Physics: Specialized hardware allocates a larger fraction of its transistors to arithmetic units and fewer to control logic.

CPU inference: 100 W for 1 TFLOP/s (efficiency = 0.01 TFLOP/s/W).
NPU inference: 5 W for 10 TFLOP/s (efficiency = 2 TFLOP/s/W).
The gap: The NPU is 200× more energy-efficient per operation.

Math:

Workload: 1 billion inferences per day.
CPU fleet energy: 1,000 CPU servers $\times$ 100 W $\times$ 24 h $\approx$ 2,400 kWh/day.
NPU fleet energy: 100 NPU chips $\times$ 5 W $\times$ 24 h $\approx$ 12 kWh/day.
Carbon savings: At 0.429 kg/kWh, switching to NPUs saves ~373.9 t of CO₂ per year.

Systems insight: Specialized accelerators can materially reduce the energy use of matched inference workloads. For workloads with stable arithmetic patterns and high deployment volume, the per-operation energy gains compound into substantial reductions in operational carbon footprint relative to general-purpose hardware.
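The arithmetic in this napkin math can be reproduced in a few lines. The sketch assumes continuous operation for 24 hours a day, 365 days a year, and uses the fleet sizes, power figures, and grid carbon intensity quoted above; the function and variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365            # assumption: the fleet runs every day of the year
GRID_KG_CO2_PER_KWH = 0.429    # carbon intensity quoted in the napkin math


def fleet_kwh_per_day(n_devices: int, watts_each: float) -> float:
    """Daily energy for a fleet running continuously at constant power."""
    return n_devices * watts_each * HOURS_PER_DAY / 1000.0


cpu_kwh = fleet_kwh_per_day(1000, 100.0)   # 1,000 CPU servers at 100 W
npu_kwh = fleet_kwh_per_day(100, 5.0)      # 100 NPU chips at 5 W

saved_tonnes = (cpu_kwh - npu_kwh) * DAYS_PER_YEAR * GRID_KG_CO2_PER_KWH / 1000.0

print(f"CPU fleet: {cpu_kwh:.0f} kWh/day, NPU fleet: {npu_kwh:.0f} kWh/day")
print(f"CO2 avoided: {saved_tonnes:.1f} t/year")
```

Both fleets deliver the same aggregate 1,000 TFLOP/s, so the 200$\times$ gap in daily energy comes entirely from per-operation efficiency.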

The sustainability perspective reinforces a theme that has recurred throughout this chapter: hardware selection is never a purely technical decision. Performance per watt, carbon cost, and total cost of ownership must all enter the decision framework alongside peak FLOP/s and memory bandwidth. With scaling, heterogeneity, and sustainability now in view, the remaining step is to identify the misconceptions that cause teams to choose the wrong hardware path.

Self-Check: Question

Why does the section treat performance per watt as a first-class metric for AI hardware rather than a concern limited to battery-powered devices?
1. At fleet scale, every watt spent per inference multiplies by billions of inferences per day—so efficiency sets annual operating cost, carbon footprint, and datacenter capacity planning, making joule-per-op a system-level constraint comparable to latency and throughput
2. Wattage matters only on battery devices; the chapter uses it metaphorically when discussing datacenters
3. Carbon impact is unrelated to inference fleet design because training dominates total lifecycle emissions
4. Specialized silicon always consumes more power than general-purpose hardware, so perf-per-watt is a constraint on specialization rather than a motivation
Explain how investing in specialized inference silicon can simultaneously improve system performance and sustainability at fleet scale, using a concrete energy comparison.
True or False: The most sustainable hardware choice is always the cheapest device to purchase, because amortized upfront cost dominates environmental impact over the device’s service life.


Fallacies and Pitfalls

Hardware acceleration involves counterintuitive performance characteristics where impressive specifications mask underlying bottlenecks. The fallacies and pitfalls here capture hardware selection and optimization errors that waste expensive accelerator resources and lead to deployments that achieve only 10–30 percent of theoretical performance.

Fallacy: More specialized hardware always provides better performance than general-purpose alternatives.

Engineers assume specialized accelerators automatically outperform general-purpose processors for all AI workloads. In reality, specialized hardware achieves peak performance only when workloads match architectural assumptions, the core of algorithm-machine co-design. As demonstrated in section 1.5, operations must exceed the accelerator’s ridge point to be compute bound; an A100 GPU has a ridge point of 153 FLOP/byte, meaning operations with arithmetic intensity below this threshold are memory bound regardless of the accelerator’s 312 TFLOP/s peak compute. A transformer attention softmax with AI = 2 FLOP/byte–5 FLOP/byte achieves only 4.1 TFLOP/s–10.2 TFLOP/s (1.3 percent–3.3 percent utilization) on an A100. CPUs with ridge points around 10 FLOP/byte–20 FLOP/byte still treat this kernel as memory-bound, but the same AI range corresponds to about 10 percent–50 percent of a CPU’s lower peak. Models with irregular memory access, small batch sizes, or dynamic computation graphs may perform better on flexible processors. Effective hardware selection requires matching workload arithmetic intensity to architectural ridge points, not assuming specialization always wins.
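The ridge-point reasoning above reduces to one `min` expression. The sketch below assumes an A100 memory bandwidth of 2.039 TB/s, the value implied by the 312 TFLOP/s peak and the 153 FLOP/byte ridge point quoted in the text; the function names are illustrative.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a kernel becomes compute bound."""
    return peak_tflops / bandwidth_tb_s


def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float, intensity: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_tflops, bandwidth_tb_s * intensity)


A100_PEAK = 312.0   # TFLOP/s, FP16, from the text
A100_BW = 2.039     # TB/s, assumed (implied by the 153 FLOP/byte ridge point)

print(f"ridge point: {ridge_point(A100_PEAK, A100_BW):.0f} FLOP/byte")
for ai in (2.0, 5.0, 153.0, 400.0):
    ceiling = attainable_tflops(A100_PEAK, A100_BW, ai)
    print(f"AI = {ai:5.1f} -> {ceiling:6.1f} TFLOP/s ({100 * ceiling / A100_PEAK:.1f} percent of peak)")
```

For the softmax range of 2 to 5 FLOP/byte, the ceiling is about 4.1 to 10.2 TFLOP/s, which is roughly 1.3 to 3.3 percent of peak.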

Pitfall: Ignoring memory bandwidth limitations when selecting acceleration strategies.

Practitioners focus on peak TFLOP/s without analyzing whether their workloads can achieve compute-bound performance. As quantified in section 1.4.1, the energy model used here assigns about 640 pJ to a DRAM access versus 0.5 pJ for an on-chip SRAM access, creating orders-of-magnitude energy penalties. An accelerator advertising 300 TFLOP/s with 2 TB/s bandwidth has a ridge point of 150 FLOP/byte; LayerNorm operations with AI = 1.5 FLOP/byte achieve only 3 TFLOP/s (1 percent utilization) in this worked example. Organizations can deploy expensive high-compute accelerators for memory-bound workloads and still see low utilization if bandwidth, not compute, is the bottleneck. Teams must calculate workload arithmetic intensity and compare against hardware ridge points before purchasing accelerators.

Fallacy: Hardware acceleration benefits scale linearly with additional accelerators.

Teams expect eight GPUs to train 8$\times$ faster than one GPU. Multi-accelerator scaling introduces communication overhead that violates linear scaling assumptions. As noted in section 1.10, AllReduce operations for gradient synchronization can require exchanging large gradient payloads for large models. With NVLink delivering 600 GB/s bidirectional (half that, per direction, for a one-way gradient stream), synchronizing 1 GB of gradients requires 3.33 ms; for a 50 ms training step, this represents 6.7 percent of step time. Without compute-communication overlap, this worked eight-GPU scenario achieves about 7.5× speedup (93.8 percent efficiency) before load imbalance, synchronization barriers, and insufficient parallel work reduce scaling further.
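The eight-GPU figures above follow from a simple additive model. The sketch assumes the 50 ms step is pure compute, that the 3.33 ms synchronization is fully exposed (no overlap), and that the one-way link rate is 300 GB/s; it ignores the collective algorithm's traffic multiplier, load imbalance, and barriers.

```python
def sync_time_ms(payload_gb: float, link_gb_per_s: float) -> float:
    """Time to stream the gradient payload one way over the link."""
    return payload_gb / link_gb_per_s * 1000.0


def scaling_efficiency(compute_ms: float, comm_ms: float) -> float:
    """Fraction of each step spent on useful compute when communication is exposed."""
    return compute_ms / (compute_ms + comm_ms)


N_GPUS = 8
comm_ms = sync_time_ms(payload_gb=1.0, link_gb_per_s=300.0)   # half of 600 GB/s bidirectional
efficiency = scaling_efficiency(compute_ms=50.0, comm_ms=comm_ms)
speedup = N_GPUS * efficiency

print(f"sync: {comm_ms:.2f} ms, efficiency: {100 * efficiency:.2f} percent, speedup: {speedup:.2f}x")
```

Overlapping communication with computation shrinks the exposed `comm_ms` term, which is why it is the first optimization most distributed-training systems pursue.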

Pitfall: Planning accelerator capacity from peak FLOP/s specifications.

Vendors advertise peak FLOP/s as the definitive measure of accelerator capability, but real-world performance equals Peak FLOP/s $\times$ Utilization, where utilization is dictated by the roofline model (section 1.5). An A100 advertises 312 TFLOP/s at FP16, yet the representative budgeting scenario here sustains only 120 TFLOP/s–180 TFLOP/s (roughly 38 percent–58 percent utilization) for transformer training because memory-bound operations such as attention and LayerNorm drag down the average throughput. The recommender-system scenario fares even worse, reaching only 10 TFLOP/s–30 TFLOP/s (3.2 percent–9.6 percent utilization) because sparse, irregular memory access patterns leave compute units idle. Engineers should budget projects based on sustained throughput, measured or estimated via the roofline model, rather than peak marketing specifications.

Fallacy: Any FLOP/s rating can estimate a low-precision workload.

Accelerators have separate datapaths for different precisions, and the peak throughput varies dramatically across them. An H100 delivers roughly 1,000 TFLOP/s in FP16 tensor operations but only about 67 TFLOP/s in FP32 CUDA-core operations: a roughly 15$\times$ gap within the same chip (Choquette 2023). Estimating training time with the FP32 number when the workload actually uses BF16 produces utilization figures that look catastrophic for no reason, and matching against the wrong roofline misclassifies kernels as compute-bound when they are memory-bound (or vice versa). Always match the peak constant to the precision the workload actually issues, and quote precision explicitly when reporting MFU.
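The cost of picking the wrong peak constant shows up directly in a training-time estimate. The sketch below uses the approximate H100 figures quoted above and assumes, for illustration only, a hypothetical job of $10^{21}$ FLOP at 40 percent utilization, with the BF16 tensor peak taken to equal the FP16 tensor peak.

```python
SECONDS_PER_DAY = 86_400

# Approximate peaks from the text, in TFLOP/s; BF16 is assumed equal to FP16 here.
PEAK_TFLOPS = {
    "fp16_tensor": 1000.0,
    "fp32_cuda": 67.0,
}


def training_days(total_flop: float, precision: str, utilization: float) -> float:
    """Estimated wall-clock days at a sustained fraction of the named peak."""
    sustained_flop_per_s = PEAK_TFLOPS[precision] * 1e12 * utilization
    return total_flop / sustained_flop_per_s / SECONDS_PER_DAY


TOTAL_FLOP = 1e21   # hypothetical training budget
right = training_days(TOTAL_FLOP, "fp16_tensor", utilization=0.4)
wrong = training_days(TOTAL_FLOP, "fp32_cuda", utilization=0.4)

print(f"matched peak: {right:.1f} days, mismatched peak: {wrong:.1f} days")
print(f"estimate inflated by {wrong / right:.1f}x")
```

Keying the peak by precision name forces every estimate to state which datapath it assumes, which is the habit the paragraph above recommends.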

Pitfall: Deploying small-batch inference workloads on high-compute accelerators.

Teams deploy high-throughput training accelerators (A100, H100) for latency-sensitive inference with batch size 1–4. As the roofline model (section 1.5) predicts, small batches severely reduce arithmetic intensity: a dense layer with M=N=2048 achieves AI = 1 FLOP/byte at batch=1 vs. AI = 204.8 FLOP/byte at batch=256. At batch=1, the memory-bound roofline ceiling is only about 2.04 TFLOP/s on A100 and 0.3 TFLOP/s on T4. The T4’s peak is 65 TFLOP/s (FP16 Tensor Core) with a ridge point of 203.1 FLOP/byte (65 TFLOP/s / 320 GB/s). Small-batch inference remains memory bound on both accelerators, so the T4’s lower cost can make it more economical despite its much lower peak compute. Training-class accelerator instances often rent for several times more than inference-class instances for little latency gain in this regime. Inference deployments should match batch size to accelerator characteristics, using high-compute accelerators only for batched serving where arithmetic intensity exceeds ridge points.
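The batch-size effect on arithmetic intensity can be reproduced with a simple traffic model. The sketch assumes FP16 storage (2 bytes per element) and that weights, inputs, and outputs each cross the memory interface exactly once; this model reproduces the 1 and 204.8 FLOP/byte figures above, but real kernels may move more data. The A100 bandwidth of 2.039 TB/s is an assumed value consistent with the ceiling quoted in the text.

```python
def dense_layer_intensity(batch: int, m: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOP per byte for a batch x m input times an m x n weight matrix."""
    flops = 2 * batch * m * n                                       # multiply + add per weight per sample
    bytes_moved = bytes_per_elem * (m * n + batch * m + batch * n)  # weights + inputs + outputs
    return flops / bytes_moved


BANDWIDTH_TB_S = {"A100": 2.039, "T4": 0.320}   # A100 value assumed; T4 from the text

for batch in (1, 4, 256):
    ai = dense_layer_intensity(batch, 2048, 2048)
    ceilings = ", ".join(f"{gpu}: {ai * bw:.2f} TFLOP/s" for gpu, bw in BANDWIDTH_TB_S.items())
    print(f"batch={batch:3d}  AI={ai:6.1f} FLOP/byte  memory-bound ceiling -> {ceilings}")
```

At batch=1 the weight matrix dominates traffic and is used only once per load, so intensity sits near 1 FLOP/byte; at batch=256 each loaded weight is reused 256 times and intensity reaches 204.8 FLOP/byte.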

Fallacy: Vendor-specific optimizations have no long-term portability cost.

Organizations optimize exclusively for specific vendors to maximize performance without considering system flexibility. As discussed in section 1.8, deep integration with vendor-specific libraries (CUDA, TensorRT, XLA) and custom kernels creates lock-in. A codebase with many hand-written accelerator kernels can require substantial engineering effort to port to a different vendor, delaying hardware upgrades and preventing multi-vendor deployments. Vendor-specific optimizations should therefore be isolated behind hardware abstraction layers. Maintaining portable code paths enables vendor competition, hardware flexibility, and faster adoption of emerging accelerators while still capturing most performance benefits through framework-level optimizations.

Pitfall: Embedding vendor-specific kernels directly into application logic.

The portability cost becomes hardest to manage when accelerator-specific code is scattered through model code, preprocessing, build scripts, and serving paths. A cleaner design localizes vendor kernels behind capability checks, framework dispatch layers, or narrow operator libraries, so the system can keep a fast path without making every higher-level component vendor-aware. The engineering goal is not to avoid specialization; it is to keep specialization replaceable when the hardware fleet changes.
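
One way to picture such a boundary is a small operator registry guarded by capability checks. The sketch below is a minimal illustration in plain Python, not the dispatch mechanism of any real framework: `register`, `dispatch`, `vendor_runtime_present`, and both kernels are hypothetical names, and the vendor kernel is a stand-in that is never selected here.

```python
from typing import Callable, Dict, List, Tuple

Kernel = Callable[[List[float], List[float]], List[float]]

# Registry: operator name -> list of (capability check, kernel), fastest first.
_REGISTRY: Dict[str, List[Tuple[Callable[[], bool], Kernel]]] = {}


def register(op: str, is_available: Callable[[], bool]):
    """Decorator that files a kernel under `op`, guarded by a capability check."""
    def wrap(kernel: Kernel) -> Kernel:
        _REGISTRY.setdefault(op, []).append((is_available, kernel))
        return kernel
    return wrap


def dispatch(op: str) -> Kernel:
    """Return the first registered kernel whose capability check passes."""
    for is_available, kernel in _REGISTRY[op]:
        if is_available():
            return kernel
    raise RuntimeError(f"no usable kernel for {op!r}")


def vendor_runtime_present() -> bool:
    """Hypothetical probe; a real system would query the driver or framework."""
    return False


@register("add", vendor_runtime_present)
def add_vendor(a, b):
    # Stand-in for a hand-tuned vendor kernel (never selected in this sketch).
    raise NotImplementedError("vendor fast path not available here")


@register("add", lambda: True)
def add_portable(a, b):
    # Portable fallback that runs anywhere.
    return [x + y for x, y in zip(a, b)]


# Application code names the operator, never the vendor.
print(dispatch("add")([1.0, 2.0], [3.0, 4.0]))
```

The design choice worth noting is that the vendor dependency lives in one registration site. Swapping the hardware fleet means changing the probe and the registered kernel, not every caller.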

The fallacies above reduce to a concrete procurement test.

Checkpoint 1.4: Feasibility assessment: Can you run it?

Before procuring hardware, validate the three hard constraints, then place the workload on the roofline.

Memory capacity: Compute $M_{\text{req}} = \text{Weights} + \text{KV Cache} + \text{Activation Buffer}$ and verify $M_{\text{req}} < M_{\text{device}}$ for a 7-billion-parameter Llama model with 14 GB FP16 weights on a 16 GB GPU.
Bandwidth: Compute $T_{\text{token}} = D_{\text{vol}} / \text{BW}$ for a 70-billion-parameter model (140 GB) on 1 TB/s memory bandwidth, then compare the result with a 50 ms latency target.
Compute: Compute $T_{\text{process}} = O / (R_{\text{peak}} \cdot \eta_{\text{hw}})$ and compare it with the throughput target. Processing video at 30 FPS requires completing inference within 33.3 ms.
Roofline placement: For an accelerator at roughly 2 TFLOP/s peak and 200 GB/s bandwidth, estimate the ridge-point arithmetic intensity, then place a 10 FLOP/byte kernel against it: is it compute- or memory-bound, and would a faster clock help it at all?
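
The checks above can be sketched as a few one-line functions. The model sizes, bandwidths, and latency targets come from the checklist; the KV-cache and activation sizes, the per-frame operation count, the peak rate, and the efficiency in the compute check are assumed values for illustration only.

```python
GB = 1e9


def fits_in_memory(weights_gb, kv_cache_gb, activations_gb, device_gb):
    """Memory capacity: M_req = weights + KV cache + activation buffer < M_device."""
    m_req = weights_gb + kv_cache_gb + activations_gb
    return m_req, m_req < device_gb


def token_latency_s(data_volume_bytes, bandwidth_bytes_per_s):
    """Bandwidth: T_token = D_vol / BW when every token streams the weights."""
    return data_volume_bytes / bandwidth_bytes_per_s


def process_time_s(ops, peak_ops_per_s, efficiency):
    """Compute: T_process = O / (R_peak * eta_hw)."""
    return ops / (peak_ops_per_s * efficiency)


def ridge_point(peak_flops_per_s, bandwidth_bytes_per_s):
    """Roofline ridge: arithmetic intensity where the two ceilings meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


# 7B model with 14 GB FP16 weights on a 16 GB GPU.
# KV cache (1.5 GB) and activation buffer (1.0 GB) are assumed sizes.
m_req, ok = fits_in_memory(14.0, 1.5, 1.0, 16.0)
print(f"memory: need {m_req:.1f} GB of 16 GB -> {'fits' if ok else 'does not fit'}")

# 70B model: 140 GB streamed per token at 1 TB/s, against a 50 ms target.
t_token = token_latency_s(140 * GB, 1e12)
print(f"bandwidth: {t_token * 1e3:.0f} ms per token vs 50 ms target")

# Video at 30 FPS. Operation count, peak rate, and efficiency are assumed.
t_proc = process_time_s(ops=50e9, peak_ops_per_s=8e12, efficiency=0.4)
print(f"compute: {t_proc * 1e3:.1f} ms per frame vs {1e3 / 30:.1f} ms budget")

# 2 TFLOP/s peak and 200 GB/s bandwidth, kernel at 10 FLOP/byte.
ridge = ridge_point(2e12, 200e9)
print(f"roofline: ridge at {ridge:.0f} FLOP/byte; kernel at 10 FLOP/byte")
```

Two results are worth reading closely. The 14 GB of weights fit on their own, but the margin disappears once any realistic KV cache and activation buffer are added. The 10 FLOP/byte kernel sits exactly on the ridge, so a faster clock would raise the compute ceiling without raising the bandwidth slope and the kernel would gain nothing.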

This checklist synthesizes the principles developed throughout this chapter, translating theoretical understanding into practical engineering decisions. Together, these fallacies reduce the chapter’s machinery to a diagnostic habit: start from the workload, choose the bottleneck metric, match the hardware path to that metric, and budget for the portability, scaling, and energy consequences of that choice.

Self-Check: Question

Which purchasing mistake does this section most directly criticize?
1. Choosing hardware on peak FLOP/s alone without first checking whether the target workload’s arithmetic intensity places it above or below the accelerator’s ridge point
2. Comparing deployment cost and sustained power draw across candidate accelerators
3. Estimating sustained (realizable) throughput on the actual workload rather than advertised peak
4. Using arithmetic intensity as one input to hardware selection alongside cost and latency
Why is deploying batch-1 inference on a top-tier training accelerator often economically poor compared to using a cheaper inference-class chip?
1. Batch-1 inference has very low arithmetic intensity, so it sits far below the training accelerator’s ridge point; the expensive compute silicon runs largely idle while latency is set by memory bandwidth, which the inference-class chip supplies at far lower cost
2. Batch-1 inference raises arithmetic intensity enough to saturate tensor cores, so the expensive chip’s extra compute delivers a proportional latency improvement
3. Training-class accelerators cannot execute inference kernels at all, so the investment is structurally wasted
4. Reducing batch size eliminates memory traffic, so the expensive chip’s bandwidth advantage disappears
Explain why expecting linear speedup from adding more GPUs is a fallacy even when the workload is highly data-parallel, and quantify the mechanism.
True or False: Vendor-specific kernel optimizations are always the right choice because portability has little practical value once a team commits to a hardware vendor.
A team buys a specialized ML accelerator for a workload with irregular memory access, arithmetic intensity near 1 FLOP/byte, and batch size 1, then measures only 1.2× speedup over a CPU baseline. Which diagnosis best matches the chapter?
1. The workload is mismatched to the accelerator’s assumptions: regular dataflow, high arithmetic intensity, and parallel work. The operation mix sits far below the ridge point, the irregular access defeats the array’s structured-memory path, and batch 1 starves the array’s parallelism—the specialized device cannot exercise its peak compute
2. The workload is likely compute bound because specialized accelerators make most workloads compute bound regardless of input shape
3. The SSD is probably too slow to deliver input tensors, so the fix is a faster storage tier
4. The accelerator needs more branch predictors to match CPU performance on irregular access patterns

Summary

Hardware acceleration is the force that transformed machine learning from academic curiosity to practical reality, reshaping how we design both computational systems and the algorithms that run on them. The evolution from general-purpose processors to specialized AI accelerators reflects a shift toward domain-specific computing where hardware and software are co-designed to optimize specific computational patterns. The progression from CPUs through GPUs to specialized TPUs, NPUs, and wafer-scale systems demonstrates how understanding workload characteristics drives architectural innovation, creating opportunities for orders-of-magnitude performance improvements through targeted specialization.

Key Takeaways: Moving data costs more than computing it

The Roofline model identifies performance bottlenecks: Plotting arithmetic intensity against throughput reveals whether workloads are memory bound (attention, embeddings) requiring bandwidth optimization, or compute bound (high-reuse convolutions, GEMMs) requiring FLOP/s optimization.
Memory bandwidth constrains performance: GPU compute capacity has grown orders of magnitude faster than memory bandwidth over the past two decades. Most inference workloads are memory bound, making data movement optimization the primary concern.
Hardware-software co-design compounds performance: Matching algorithm patterns to architectural capabilities (systolic arrays for dense GEMM, sparse accelerators for pruned models) can produce large improvements and typically outperforms raw hardware upgrades.
Tensor Cores require specific conditions: A supported precision format (TF32, BF16, FP16, or INT8 depending on architecture), appropriate tensor dimensions, and sufficient batch size are necessary for peak utilization. Batch size directly affects arithmetic intensity and determines whether workloads reach the compute-bound regime.
Arithmetic intensity determines optimization strategy: Operations with low arithmetic intensity (1–2 FLOP/byte, like LayerNorm) are memory bound; operations around 50–200 FLOP/byte, like convolutions, straddle modern accelerator ridge points and become compute bound only when reuse and tiling push them above the threshold. The ridge point (for example, 153 FLOP/byte for A100) marks the transition.

The technical challenges of AI acceleration span multiple layers of the computing stack, from low-level memory hierarchy optimization to high-level compiler transformations and runtime orchestration. Memory bandwidth limitations create bottlenecks that require targeted techniques like data tiling, kernel fusion, and hierarchy-aware scheduling to overcome. Mapping neural network computations to hardware involves complex trade-offs between different dataflow patterns, memory allocation strategies, and execution scheduling approaches that must balance computational efficiency with resource utilization. Multi-chip and distributed acceleration extend the same logic outward, adding communication overhead, memory coherence, and workload partitioning to the system-level optimization problem.

Engineers who internalize the Roofline model and arithmetic intensity analysis gain a diagnostic framework: when inference runs slower than expected, they can immediately determine whether the bottleneck lies in compute throughput, memory bandwidth, or software overhead, and then select the appropriate optimization strategy. This systems-level understanding transforms hardware selection from vendor comparison into principled engineering.

What’s Next: From optimization to validation

We have now optimized the full D·A·M stack: data selection minimized training requirements, model compression reduced algorithmic complexity, and hardware acceleration maximized machine throughput. Optimization without measurement, however, is guesswork. In Benchmarking, we move from theoretical FLOPs to measured latency, applying the roofline model and statistical methods to validate our optimization claims against reality.

An accelerator is easy to mistake for a faster computer. It is better understood as a machine built to move less data per useful operation, because the cost it fights is fixed by physics; moving a byte takes far more time and energy than computing on it. Every technique in this chapter (tiling, kernel fusion, the memory hierarchy, the systolic array) exists to raise arithmetic intensity, the ratio of computation to data movement, and the roofline is the bookkeeping of that single fact. None of them repeals the cost; they only relocate the bottleneck between bandwidth and compute. This is the machine constraint made concrete: the accelerator cannot make a byte cheaper to move, only ensure that fewer bytes must move for each result worth keeping.

Self-Check: Question

Which statement best captures the chapter’s overall systems lesson about hardware acceleration?
1. Modern ML performance is governed by matching workload data-movement patterns and arithmetic intensity to the hardware’s memory system and execution model—peak FLOP/s is a ceiling, not a prediction of realized throughput
2. The fastest accelerator is always the one with the highest advertised peak FLOP/s, because FLOP/s scales linearly with throughput on modern workloads
3. Compiler and runtime support matter only after hardware has been chosen correctly, so performance engineering splits cleanly into a hardware phase and a software phase
4. Once a model is quantized, hardware choice becomes essentially irrelevant because low-precision execution is uniformly fast across devices
A production model runs at a small fraction of its accelerator’s advertised peak throughput. Using the chapter’s diagnostic framework, walk through the first three questions an engineer should ask and what each answer would imply for the next optimization step.
A team is deciding between two accelerators: chip X with 2× the peak FLOP/s of chip Y, and chip Y with 1.5× the HBM bandwidth of chip X. Their target workload is a transformer inference service at batch 1 with arithmetic intensity around 1 FLOP/byte. Which chip is the better choice and why?
1. Chip Y, because at 1 FLOP/byte the workload sits two orders of magnitude below either chip’s ridge point and is bandwidth bound—realized throughput tracks HBM bandwidth, not peak FLOP/s, so the bandwidth advantage translates directly to lower latency
2. Chip X, because more FLOP/s always produces better latency regardless of workload intensity
3. Either chip will deliver identical performance because both support tensor cores
4. Chip X, because FLOP/s per dollar is the only metric that matters in production

Self-Check Answers

Self-Check: Answer

What recurring structural pattern best explains the specialization path from the Intel 8087 through GPUs to TPUs?
1. A dominant computational bottleneck in each era made general-purpose processors inefficient, prompting a specialized unit that was later absorbed into mainstream silicon as the workload stabilized
2. Each generation became progressively more general-purpose to maximize software portability, so specialization is essentially a transitional phase
3. Clock-frequency scaling drove each transition, with specialization emerging only after the final frequency ceiling was reached
4. Each generation replaced memory hierarchies with larger on-chip arithmetic arrays so data movement stopped constraining performance
Answer: The correct answer is A. The 8087 solved floating-point cost, GPUs solved graphics throughput, and TPUs solved ML data movement—each followed the same pattern: specialized unit emerges to attack the dominant bottleneck, then mainstream silicon absorbs the successful features. The clock-frequency story is historically inverted: most specialization momentum came after Dennard scaling stalled, not before. The “more general-purpose” framing gets the arrow wrong: the trend is toward more specialization, not less.

Learning Objective: Identify the bottleneck-then-specialization pattern that produced each wave of domain-specific hardware.
Why did domain-specific architectures become structurally necessary (not merely attractive) after Moore’s Law slowed and Dennard scaling ended?
1. Power-density and thermal limits produced dark silicon: architects could no longer power every transistor simultaneously, so dedicating powered transistors to narrow high-value workloads became the only way to keep performance scaling
2. Model compute demand began growing slower than hardware supply, so architects had free transistor budget to devote to specialized units
3. CPUs lost the ability to execute floating-point arithmetic, forcing the workload onto dedicated accelerators
4. Programmers preferred fixed-function hardware because it simplified debugging and deployment pipelines
Answer: The correct answer is A. Once power density made it impossible to switch every transistor on a die simultaneously, generic scaling stopped delivering performance; dedicating the powered subset to workloads that use it intensely became the only remaining path. The “demand slowed” claim is the opposite of the systems-gap argument the chapter makes: ML compute demand outpaced hardware supply, which is precisely why DSAs became urgent rather than optional.

Learning Objective: Explain how dark silicon and the end of Dennard scaling forced the shift to domain-specific architecture.
Explain why machine learning created an “integration bottleneck” rather than merely another arithmetic bottleneck, and why this distinction drives accelerator design choices.

Answer: ML workloads combine massive parallelism, predictable dataflow, and precision tolerance, so the binding constraint becomes keeping thousands of arithmetic units fed with weights and activations rather than adding more arithmetic units. A single DRAM access can cost 100–1000× the energy of a multiply-accumulate, so a design that doubles MACs without improving locality spends more energy on data movement than on compute. The consequence is that modern accelerators prioritize on-chip scratchpads, operand reuse structures (systolic arrays, tensor cores), and hierarchical memory staging—design choices that would be wasteful if arithmetic itself were the bottleneck.

Learning Objective: Analyze why data movement rather than arithmetic throughput became the dominant design axis for ML accelerators.
Order the following specialization waves by what each one made architecturally necessary for the next: (1) Domain-specific AI accelerators emerge to exploit ML’s regular dataflow, (2) Floating-point coprocessors establish the pattern of offloading a dominant primitive, (3) Parallel graphics processors prove that thousands of lightweight arithmetic units can be managed coherently, (4) ML-specific units refine DSAs around systolic arrays and mixed precision.

Answer: The correct order is: (2) Floating-point coprocessors establish the pattern of offloading a dominant primitive, (3) Parallel graphics processors prove that thousands of lightweight arithmetic units can be managed coherently, (1) Domain-specific AI accelerators emerge to exploit ML’s regular dataflow, (4) ML-specific units refine DSAs around systolic arrays and mixed precision. Each wave made the next possible: FPUs normalized the offload-a-primitive playbook; GPUs proved mass parallelism was manageable; that management experience enabled DSAs to target ML dataflow; mature DSA techniques in turn enabled tensor-core and systolic refinement. Swapping GPUs ahead of FPUs breaks the enabling chain—mass parallelism without an offload playbook had no economic path to market.

Learning Objective: Order the four specialization waves causally and explain why each era’s lessons enabled the next.
A startup profiles batch-1 inference for a 7-billion-parameter autoregressive model on an A100 and observes 5–10 percent compute utilization. Which diagnosis best matches the section’s analysis?
1. Autoregressive token-by-token generation produces too little parallel work per step to saturate the accelerator’s thousands of arithmetic lanes, and weight reads dominate per-token time
2. The model is too parallel for the hardware, so the scheduler is oversupplying work to the arithmetic units and forcing them to stall
3. The GPU lacks adequate branch-prediction hardware for the control flow of the decoder loop
4. Reduced-precision arithmetic is unavailable during inference, forcing FP64 execution on every kernel
Answer: The correct answer is A. It names the two conditions the chapter identifies: insufficient parallel work per step plus weight-read dominance at batch 1. The “too parallel” framing is self-contradictory given the hardware’s own parallelism budget. The branch-prediction claim imports a CPU-centric concern that does not fit accelerators optimized around regular dataflow; the control flow of autoregressive decoding is trivial, but the per-token weight traffic is the actual bottleneck.

Learning Objective: Diagnose why an ML accelerator remains severely underutilized on small-batch autoregressive inference.

Self-Check: Answer

Which mapping of neural-network operations to accelerator primitives best matches the architectural argument of the section?
1. Dense projections map to tensor cores or systolic matrix units, element-wise operations (bias add, masking, ReLU) map to vector units, and transcendental activations like exp or sigmoid benefit from dedicated special-function units
2. Dense projections map to special-function units because matrix multiplication is a transcendental operation; softmax maps to tensor cores; element-wise masking runs on systolic arrays
3. Element-wise activations map to tensor cores because they share the same arithmetic shape; matrix multiplications run on vector units because both do multiply-accumulate; reductions run on branch predictors
4. All three operation classes map equally well to the same generic scalar pipeline because modern accelerators unify them into one execution unit
Answer: The correct answer is A. The three patterns decompose the workload: regular dense reductions (tensor cores or systolic arrays), regular element-wise work (vector units), and irregular transcendentals (special-function units). The “dense projections on special-function units” claim inverts the taxonomy: transcendentals are the specialty path, not the bulk matrix path. The “activations on tensor cores” mapping confuses operation shape with arithmetic shape: ReLU has no multiply-accumulate reduction structure to exploit.

Learning Objective: Match common neural-network operation classes to the accelerator primitives that execute them efficiently.
A 128×128 systolic array performs a large matrix multiplication. Explain quantitatively why its energy per multiply-accumulate is dramatically lower than a vector-unit implementation of the same multiplication.

Answer: The systolic array loads each operand once at the edge and pulses it through the grid, so each operand participates in 128 multiply-accumulates before being discarded—one DRAM fetch amortizes across 128 MACs rather than one. With 16,384 processing elements firing per cycle, the array delivers ~16,000 MACs/cycle while moving bytes mostly on short nearest-neighbor wires instead of the long global wires a vector-unit implementation would traverse. Because on-chip wire energy scales with distance and DRAM energy is two to three orders of magnitude above register energy, cutting off-chip traffic by two orders of magnitude translates directly into an energy-per-MAC reduction of the same order. The practical consequence: systolic wins decisively once the operand matrix is large enough to keep the array full.

Learning Objective: Quantify how systolic operand reuse converts memory-traffic reduction into energy-per-MAC savings relative to a vector-unit baseline.
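
The amortization argument in the answer above can be put into a toy energy model. The sketch below works in relative units (one MAC costs 1), takes the 100–1000× DRAM-to-MAC energy range from this chapter as its only input, and ignores on-chip wire and SRAM energy. The function name and the model itself are illustrative assumptions, not measured data.

```python
def energy_per_mac(dram_to_mac_ratio: float, reuse: int) -> float:
    """Relative energy per MAC: one unit of arithmetic plus an amortized DRAM fetch.

    Units are multiples of one MAC's energy. On-chip wire and SRAM energy are
    ignored, so this is a lower bound on the real cost.
    """
    return 1.0 + dram_to_mac_ratio / reuse


ARRAY_DIM = 128
print("MACs per cycle:", ARRAY_DIM * ARRAY_DIM)

# 100x and 1000x bracket the DRAM-versus-MAC energy range quoted in the chapter.
for ratio in (100.0, 1000.0):
    no_reuse = energy_per_mac(ratio, reuse=1)
    systolic = energy_per_mac(ratio, reuse=ARRAY_DIM)
    print(f"DRAM/MAC = {ratio:6.0f}x: no reuse {no_reuse:7.1f}, "
          f"128-way reuse {systolic:5.2f}, saving {no_reuse / systolic:5.1f}x")
```

In this simplified model the saving can never exceed the reuse factor of 128, and it approaches that bound only when DRAM energy dominates arithmetic energy.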
True or False: A 50-percent unstructured pruning pass delivers roughly the same inference speedup as a 2:4 structured-sparsity pass on modern tensor hardware, because both halve the number of nonzero multiplies.

Answer: False. Structured 2:4 sparsity encodes the nonzero pattern in a compact metadata format, letting tensor cores skip zeros while maintaining predictable memory access. Unstructured sparsity scatters nonzeros irregularly, so index overhead and irregular access kill performance and the theoretical FLOP reduction rarely converts to a hardware speedup.

Learning Objective: Distinguish sparsity patterns that reduce parameter count from sparsity patterns that accelerate hardware execution.
A compiler must map a 4096×4096 matrix multiplication onto a 128×128 systolic array. Why is tiling required rather than optional?
1. The hardware has a fixed physical array size and a bounded on-chip scratchpad, so the logical computation must be partitioned into 128×128 sub-problems that fit and that maximize operand reuse within each block
2. Tiling changes the operation from matrix multiplication to vector reduction once the matrix exceeds a threshold size
3. Tiling is required so FP16 accumulations can be promoted to FP32 inside each tile, without which numerical precision would collapse
4. The accelerator can execute only one row of the output matrix per cycle regardless of how much on-chip memory is available
Answer: The correct answer is A. Tiling exists because graph-level operations are unbounded while silicon is not: the physical 128×128 array can only multiply blocks of that shape, and the scratchpad holds only a finite working set per block. The precision claim confuses tiling with mixed-precision accumulation, which is an orthogonal optimization. The “one row at a time” distractor misdescribes the hardware: systolic arrays specifically exploit full two-dimensional parallelism within each tile.

Learning Objective: Explain why tiling is required to bridge arbitrarily large tensor operations with fixed-size accelerator hardware.
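
As a rough illustration of that partitioning, the sketch below multiplies matrices one tile-sized block at a time and counts how many block products a 4096×4096 multiply needs on a 128×128 array. It is plain Python for clarity, assumes square matrices whose size is a multiple of the tile, and does not model the scratchpad or the array's timing. The function names are hypothetical.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices (lists of lists) one tile x tile block at a time."""
    n = len(a)
    assert n % tile == 0, "this sketch assumes the size is a multiple of the tile"
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):          # output tile row
        for j0 in range(0, n, tile):      # output tile column
            for k0 in range(0, n, tile):  # reduction step: one block product
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def tile_counts(n, tile):
    """Output tiles and block products needed for an n x n by n x n multiply."""
    per_dim = n // tile
    return per_dim * per_dim, per_dim ** 3


# Small check against the untiled definition (8x8 matrices, 4x4 tiles).
n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
naive = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert tiled_matmul(a, b, tile=4) == naive

print(tile_counts(4096, 128))  # (1024, 32768)
```

The three outer loops are the part a compiler must schedule: 1,024 output tiles, each accumulated over 32 block products, with the loop order deciding which operands stay resident between steps.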
A convolution layer produces 257 output channels and runs on a 128-wide tensor unit. Two full tiles cover 256 channels, but the 257th channel forces a third tile with only one active lane out of 128. What fraction of the third tile’s compute bandwidth is wasted, and what is the general lesson for dimension selection?
1. 127/128 ≈ 99 percent of the third tile is wasted; because utilization loss grows sharply near tile boundaries, architects and model designers pick output channel counts that are multiples of the tile width (128, 256, 512) to keep the final tile full
2. Approximately 50 percent is wasted because any partial tile halves throughput; the fix is to pad activations to the next batch size
3. No bandwidth is wasted because modern accelerators dynamically resize their tensor units to match odd dimensions
4. Approximately 1/128 is wasted because only the unused lanes consume energy; the shape choice is cosmetic and has no performance consequence
Answer: The correct answer is A. When one lane is active out of 128, 127/128 ≈ 99 percent of that tile’s arithmetic capacity runs empty; the cost grows sharply at tile boundaries, which is why production architectures choose channel counts that align with hardware tile width. The “dynamically resize” claim misdescribes silicon: the array is physically 128 wide. The “1/128 wasted” answer inverts the fraction—used versus unused—and misses the design lesson about picking aligned shapes.

Learning Objective: Quantify the utilization loss from non-aligned tensor dimensions and justify why aligned channel counts are a co-design choice.
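
The same arithmetic generalizes to any channel count. The sketch below assumes a fixed tile width of 128 lanes and that every tile costs the same whether or not its lanes are full; `tile_utilization` is an illustrative name.

```python
import math


def tile_utilization(channels: int, tile_width: int = 128):
    """Return (tiles needed, idle fraction of the last tile, overall utilization)."""
    tiles = math.ceil(channels / tile_width)
    last_active = channels - (tiles - 1) * tile_width  # lanes used in the final tile
    last_idle = 1.0 - last_active / tile_width
    overall = channels / (tiles * tile_width)
    return tiles, last_idle, overall


for channels in (256, 257, 384):
    tiles, idle, overall = tile_utilization(channels)
    print(f"{channels} channels: {tiles} tiles, last tile {idle:.1%} idle, "
          f"overall utilization {overall:.1%}")
```

Under these assumptions, adding a single channel to a 256-channel layer drops overall utilization from 100 percent to roughly two thirds, because the third tile costs as much as the first two while carrying one channel.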
Why does a switch from FP32 to FP16 or INT8 deliver more than the naive “half the bits, half the time” speedup on modern accelerators?
1. Reduced precision attacks both sides of the roofline: more low-precision MAC units fit in a fixed silicon area (raising the compute ceiling), and fewer bytes traverse the memory hierarchy per operation (raising arithmetic intensity)
2. Reduced precision removes the need for on-chip memory entirely because values fit in registers
3. Reduced precision improves model accuracy on large accelerators, so fewer training steps are needed to reach target quality
4. Tensor cores function only on integer operands, so any FP32 path is an emulation that runs orders of magnitude slower
Answer: The correct answer is A. The roofline has two axes and reduced precision moves favorably on both: the FLOP/s ceiling rises because more arithmetic units fit per mm², and the per-operation byte count falls, which raises arithmetic intensity and shifts the operating point rightward on the roofline. The memory-hierarchy-disappears claim contradicts the chapter’s central argument about data movement. The tensor-cores-only-on-integers claim misdescribes the hardware, which supports multiple precision modes natively.

Learning Objective: Analyze how reduced precision acts simultaneously on the compute ceiling and the arithmetic-intensity axis of the roofline.

Self-Check: Answer

What is the central quantitative claim of the “AI memory wall” as the section frames it?
1. Compute throughput has scaled faster than memory bandwidth for multiple accelerator generations, so an increasing share of ML workload time and energy is spent moving data rather than computing on it
2. Accelerators have too few arithmetic units to keep up with model demand, so the primary investment direction is adding more MAC units per chip
3. Adding HBM capacity automatically resolves memory-bound workloads by giving models more storage to work with
4. Only CPUs suffer from memory bottlenecks; accelerators avoid them architecturally by integrating compute and memory on one die
Answer: The correct answer is A. The memory wall is a growth-rate gap: compute has scaled faster than bandwidth, so the ratio of bytes-per-FLOP the hardware can deliver shrinks each generation. The “more capacity fixes it” claim inverts the bottleneck: bandwidth and latency, not total storage, are the binding constraints. The “only CPUs” claim contradicts the existence of HBM, NVLink, and on-chip scratchpads, which exist specifically because accelerators face the wall most acutely.

Learning Objective: Explain the AI memory wall as a compute-bandwidth growth-rate gap rather than a capacity gap.
Why is on-chip SRAM indispensable on an accelerator that already has multi-TB/s HBM?
1. HBM access takes tens to hundreds of nanoseconds of round-trip latency, too many cycles for arithmetic units to wait on directly; SRAM delivers operands at single-cycle latency, bridging the gap between HBM’s bandwidth and the arithmetic units’ cycle-scale demand
2. HBM cannot hold model parameters, so SRAM is needed to store weights that do not fit elsewhere
3. HBM is only used by CPUs and bypassed by accelerator kernels, which run entirely from SRAM
4. SRAM provides more total capacity than HBM, making it the correct tier for bulk storage
Answer: The correct answer is A. The argument is about latency, not bandwidth: even a 2 TB/s HBM link cannot directly feed a MAC array that demands operands every cycle, because a round trip to HBM already costs hundreds of cycles. SRAM closes that latency gap by sitting physically next to the arithmetic units. The capacity claim inverts the hierarchy: HBM has more bytes, SRAM has faster bytes; they serve different roles precisely because neither can substitute for the other.

Learning Objective: Justify why cycle-scale operand delivery requires on-chip memory even in the presence of high-bandwidth off-chip memory.
Compare the memory-pressure profile that CNNs impose on an accelerator with the profile that large transformers impose, and explain how this comparison changes accelerator-selection priorities.

Answer: CNNs reuse small convolutional filters across many spatial positions, giving high operand-reuse ratios and letting tiled execution keep weights resident in on-chip SRAM; the binding constraint is compute throughput once the working set fits. Large transformers read billions of parameters with minimal reuse at batch 1 and add KV-cache growth that scales linearly with sequence length, so each decoded token pulls most of the model’s weights across the memory hierarchy. The consequence for hardware selection is that CNN accelerators can lean on dense compute ratios and modest HBM, while transformer inference accelerators must prioritize HBM bandwidth, KV-cache-aware memory layouts, and interconnect bandwidth for sharded weights—making peak-FLOP/s benchmarks actively misleading for transformer deployment.

Learning Objective: Compare CNN and transformer memory-pressure profiles and derive the implications for accelerator-selection priorities.
Order the following storage and communication tiers from fastest operand delivery (cycles) to slowest (hundreds of thousands of cycles or more) for an accelerator executing a tensor workload: (1) HBM device memory, (2) L2 or shared on-chip cache, (3) register file, (4) PCIe transfer from host memory.

Answer: The correct order is: (3) register file, (2) L2 or shared on-chip cache, (1) HBM device memory, (4) PCIe transfer from host memory. Registers deliver operands within a clock cycle; on-chip caches take tens of cycles; HBM costs hundreds of cycles; PCIe sits at the narrowest point of the hierarchy’s taper, at hundreds of thousands of cycles or more once queueing and protocol overhead are included. Swapping HBM and PCIe would ignore the bandwidth-taper argument the chapter develops: PCIe is one to two orders of magnitude slower than HBM, which is precisely why host-to-device transfers often dominate end-to-end wall-clock time.

Learning Objective: Order the accelerator memory hierarchy by operand-delivery latency and justify why PCIe sits at the slowest end.
An eight-GPU node uses NVLink between GPUs and multi-TB/s HBM per GPU. The team observes that their input pipeline spends most of its time transferring data from the host CPU rather than computing on the GPUs. Which link is most likely the bottleneck?
1. The PCIe host-device link, which is roughly one to two orders of magnitude slower than HBM and slower than NVLink, so host-fed pipelines often saturate PCIe before any on-device link is stressed
2. The HBM interface, because device-local memory is structurally slower than host DRAM on modern systems
3. The NVLink fabric between GPUs, because inter-GPU links always represent the slowest communication tier in a node
4. The register file, because registers cannot sustain streaming input data for deep-learning batches
Answer: The correct answer is A. The section’s bandwidth-taper ordering puts PCIe at the narrow end: HBM >> NVLink >> PCIe, and once a pipeline is host-fed, PCIe becomes the binding link. The HBM-slower-than-host-DRAM claim inverts reality (HBM is roughly 10–50× faster than typical DDR). The “NVLink always slowest” claim contradicts its design purpose as a high-bandwidth peer link. The register-file framing confuses tier with role: registers feed MAC units, not input pipelines.

Learning Objective: Identify the likely bottleneck link in a host-fed multi-GPU pipeline using the bandwidth-taper hierarchy.
Which model family most strongly stresses memory capacity plus interconnect bandwidth rather than compute throughput at inference time?
1. Large transformers with tens to hundreds of billions of parameters and linearly growing KV cache, where per-token work is dominated by reading weights across the hierarchy
2. Small CNNs with tight spatial locality and filter reuse, which fit comfortably in on-chip buffers
3. Standard image-classification CNNs on moderate input resolutions, which achieve high arithmetic intensity
4. Dense GEMM workloads at very large batch size, which amortize weight reads across many samples and become compute bound
Answer: The correct answer is A. Transformer inference pulls most of the weight matrix per token at small batch and compounds that with KV-cache growth, putting the binding constraint on HBM bandwidth and, at large model sizes, on the interconnect that links sharded weights. The three other choices describe regimes the chapter explicitly identifies as compute-bound or reuse-friendly, so they do not match the capacity-plus-interconnect pressure profile.

Learning Objective: Classify which model family most strongly stresses memory capacity and interconnect bandwidth.

Self-Check: Answer

On an A100 with a ridge point of about 153 FLOP/byte, which statement correctly classifies whether a kernel is compute bound or memory bound?
1. Kernels with arithmetic intensity above about 153 FLOP/byte sit on the compute ceiling and are compute bound; kernels below that threshold sit on the bandwidth slope and are memory bound
2. Classification depends on the parameter count of the model: larger models are always compute bound
3. Classification depends on whether the workload is training or inference, not on arithmetic intensity
4. Kernels are compute bound if they use tensor cores and memory bound otherwise, regardless of arithmetic intensity
Answer: The correct answer is A. The ridge point is the arithmetic-intensity threshold where attainable performance switches from bandwidth-limited to compute-limited; a kernel’s position relative to that threshold is what determines the regime. The parameter-count framing confuses capacity with intensity. The training-versus-inference framing ignores that the same operation can be on either side of the ridge depending on its FLOP-per-byte ratio. The tensor-core framing inverts cause and effect: using tensor cores raises the compute ceiling, but it does not determine which ceiling the workload sits under.

Learning Objective: Classify a kernel as compute bound or memory bound using its arithmetic intensity relative to the accelerator’s ridge point.
A dense linear layer at batch size 1 sits near 1 FLOP/byte on an A100 (ridge point about 153 FLOP/byte). At batch size 256 the same layer sits at roughly 205 FLOP/byte. Explain the mechanism that shifts the layer across the ridge and quantify which levers moved.

Answer: At batch size 1 each output coordinate reads one row of the weight matrix, does one dot product, and discards the weights. FLOP/byte is roughly $\mathcal{O}(1)$ because each weight loaded from memory is used in only one multiply-accumulate. At batch size 256 the same weight row is multiplied against 256 input vectors before being discarded, so weight reads are amortized across many outputs: total FLOPs grow 256$\times$ while bytes moved grow only modestly, because the weight matrix is still read once per tile and only the activation traffic scales with batch size. Crossing from about 1 FLOP/byte to about 205 FLOP/byte moves the layer past the A100’s roughly 153 FLOP/byte ridge, so the bottleneck flips from HBM bandwidth to tensor-core compute. The practical implication: batching is the primary lever for converting a bandwidth-bound inference kernel into a compute-bound one.
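
The amortization argument can be checked numerically. The sketch below assumes a square 2048 by 2048 layer stored in FP16 (2 bytes per element) with weights, inputs, and outputs each crossing HBM once; those dimensions are an assumption, chosen because they land near the 1 and 205 FLOP/byte figures in the question.

```python
A100_RIDGE = 153.0  # FLOP/byte, as quoted in the question


def dense_layer_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of y = x @ W for a batch of inputs.

    Counts one multiply and one add per weight per sample, and assumes the
    weights, inputs, and outputs are each moved through HBM exactly once.
    """
    flops = 2 * batch * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_elem
    activation_bytes = batch * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


for batch in (1, 256):
    ai = dense_layer_intensity(batch, d_in=2048, d_out=2048)
    regime = "compute bound" if ai > A100_RIDGE else "memory bound"
    print(f"batch={batch:4d}  intensity={ai:7.2f} FLOP/byte  -> {regime}")
```

The weight term in the denominator stays fixed while the numerator scales with batch size, which is the lever the answer identifies; the activation term grows with batch and is what eventually caps the gain.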

Learning Objective: Quantify how increasing batch size raises arithmetic intensity and can move a kernel across the ridge point.
Which operation is most likely to remain severely memory bound on an A100 even at large batch sizes?
1. LayerNorm, which performs only a handful of arithmetic operations per activation element and reads the entire activation tensor to compute per-token statistics
2. A large $3 \times 3$ convolution with 256 input and output channels and heavy spatial filter reuse
3. A large batched dense matrix multiplication using tensor cores at batch size 1024
4. A transformer QKV projection that shares input activations across the three projection matrices
Answer: The correct answer is A. LayerNorm has arithmetic intensity around 1 to 2 FLOP/byte — roughly two orders of magnitude below the A100’s 153 FLOP/byte ridge — so no realistic batching or tiling can pull it across the ridge; the operation is structurally bandwidth bound. The convolution and QKV projection cases have enough reuse to approach or exceed the ridge once tiled and batched, and the large batched GEMM is the canonical compute-bound kernel on tensor cores.

Learning Objective: Identify a canonical low-intensity normalization operation that remains memory bound regardless of batching.
True or False: Buying an H100 with higher peak FLOP/s than an A100 guarantees that every kernel compute-bound on the A100 stays compute-bound on the H100.

Answer: False. Newer accelerators raise peak FLOP/s faster than HBM bandwidth, which pushes the ridge point upward — the A100 sits near 153 FLOP/byte while the H100 sits near 295 FLOP/byte. A kernel with arithmetic intensity of 200 FLOP/byte is compute bound on the A100 but falls below the H100 ridge and becomes memory bound on the upgraded hardware, delivering less speedup than the peak-FLOP/s ratio would suggest.
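
A minimal roofline sketch makes the shortfall visible. The peak and bandwidth figures below are assumptions (published dense FP16 tensor-core peaks and HBM bandwidths for the SXM parts), used here because they reproduce the ridge points of roughly 153 and 295 FLOP/byte.

```python
# Assumed dense FP16 tensor-core peak (TFLOP/s) and HBM bandwidth (TB/s).
SPECS = {
    "A100": {"peak_tflops": 312.0, "bandwidth_tbs": 2.039},
    "H100": {"peak_tflops": 989.0, "bandwidth_tbs": 3.35},
}


def ridge_point(spec):
    """Arithmetic intensity (FLOP/byte) where the two roofline segments meet."""
    return spec["peak_tflops"] / spec["bandwidth_tbs"]


def attainable_tflops(intensity, spec):
    """Roofline bound: the lower of the compute ceiling and the bandwidth slope."""
    return min(spec["peak_tflops"], intensity * spec["bandwidth_tbs"])


intensity = 200.0  # FLOP/byte, the kernel from the answer
for name, spec in SPECS.items():
    bound = attainable_tflops(intensity, spec)
    regime = "compute bound" if intensity >= ridge_point(spec) else "memory bound"
    print(f"{name}: ridge={ridge_point(spec):6.1f}  attainable={bound:6.1f} TFLOP/s  ({regime})")

speedup = attainable_tflops(intensity, SPECS["H100"]) / attainable_tflops(intensity, SPECS["A100"])
peak_ratio = SPECS["H100"]["peak_tflops"] / SPECS["A100"]["peak_tflops"]
print(f"roofline speedup {speedup:.2f}x vs peak-FLOP/s ratio {peak_ratio:.2f}x")
```

With these assumed specs the kernel's roofline speedup tracks the bandwidth ratio once it falls under the H100 ridge, so it lands well short of the peak-FLOP/s ratio.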

Learning Objective: Recognize that rising ridge points on newer accelerators can convert compute-bound kernels into memory-bound ones.
A team profiles a GPT-2 batch-1 decode kernel on an A100 and measures 0.8 FLOP/byte of weight and activation traffic against a 153 FLOP/byte ridge. Because the kernel sits roughly 190$\times$ below the ____, the first productive optimizations reduce bytes moved through fusion, layout changes, or lower-precision weights, rather than adding more compute silicon.

Answer: ridge point. It is the hardware-specific arithmetic-intensity threshold at which the roofline transitions from the bandwidth slope to the compute ceiling; a kernel sitting far below that threshold cannot use added compute until its byte traffic is cut.

Learning Objective: Infer from a measured FLOP/byte ratio that a kernel sits far below the ridge point and select the appropriate optimization family.
A team profiles GPT-2 autoregressive inference at batch 1 on an A100 and finds realized throughput below 1 percent of peak. Which optimization direction is most justified first?
1. Reduce bytes moved per token through operator fusion, quantization of weights to INT8, or increasing batch size — each raises arithmetic intensity toward the ridge
2. Upgrade to a newer accelerator with 2$\times$ the peak FLOP/s, because more compute throughput is the standard remedy for low utilization
3. Convert all operations to FP64 for better numerical stability, which will let the kernel stay on the compute ceiling longer
4. Replace tensor cores with scalar cores to better match the kernel’s low observed utilization
Answer: The correct answer is A. The 1-percent utilization signature plus batch-1 autoregressive decode is the canonical bandwidth-bound regime: each token pulls the full weight matrix for one multiply-accumulate per parameter, so the kernel sits far below the ridge and gains come from cutting bytes moved. The newer-accelerator answer adds compute the kernel cannot use: the bottleneck is bandwidth, and the higher ridge point on newer chips pushes utilization even lower. The FP64 path at least doubles bytes moved per value and worsens the bottleneck. The scalar-core answer abandons the array parallelism that exists for exactly this workload shape.

Learning Objective: Select the optimization family matching a batch-1 autoregressive-decode profile with bandwidth-bound signature.

Self-Check: Answer

Which decomposition best captures the three dimensions of hardware mapping for neural-network execution?
1. Operation placement on specific compute resources, tensor allocation across levels of the memory hierarchy, and temporal execution order with synchronization—all decided jointly
2. Choosing the programming language and framework that will implement the model at deployment time
3. Randomly distributing tensor operations across available cores to achieve fair load balancing
4. Pruning or quantizing the model so fewer parameters need to be stored on the accelerator
Answer: The correct answer is A. Mapping binds a logical computation to physical resources along three axes: where computation runs, where data lives, and when each piece executes. The language-choice framing confuses deployment tooling with hardware mapping. The random-distribution claim omits locality, which is the central concern. The pruning claim describes an algorithmic compression step, not hardware mapping.

Learning Objective: Classify placement, memory allocation, and execution order as the three joint dimensions of hardware mapping for a neural-network workload.
Explain why computation placement and memory allocation must be decided jointly rather than independently, using a concrete example of how decoupling them degrades performance.

Answer: Placement decides which processing element runs a kernel, and allocation decides which tier of the memory hierarchy holds its operands; neither decision is meaningful without the other, because operand-delivery latency is a function of both. Consider a kernel placed on a fast PE whose operand tensors were allocated to off-chip HBM: the PE idles on every access because HBM latency is hundreds of cycles, so the fast PE’s advantage is wasted. Conversely, operands in on-chip SRAM bound to a distant PE still incur traversal cost across the on-chip network. The system consequence is that mapping must co-optimize both axes to close the latency gap between where compute happens and where data lives, a point that becomes central to dataflow-strategy selection in the next section.

Learning Objective: Justify the joint-optimization coupling between operation placement and memory allocation using the operand-delivery-latency mechanism.
A 2D convolution has loops over output-height, output-width, input-channels, output-channels, filter-height, and filter-width. Why does reordering these loops materially affect performance even though the arithmetic result is identical?
1. Loop order determines which variables become the innermost (fastest-changing) indices, which in turn controls which operands can be reused from a register or cache versus reloaded from a slower tier—turning a high-reuse schedule into a schedule that reloads the same data repeatedly
2. Loop order determines the mathematical value of the convolution, so different orders produce different numerical outputs
3. Loop order only affects code readability and has no effect on runtime performance on modern hardware
4. The hardware accepts exactly one legal loop nesting per convolution kernel, so reordering is impossible rather than inefficient
Answer: The correct answer is A. Loop ordering preserves the kernel’s semantics but changes the reuse footprint: putting the spatial loops (output-height and output-width) innermost keeps one filter weight resident across many MACs; putting input-channels innermost keeps the accumulator in a register while different weights and activations stream past. The mathematical-result claim is false: the accumulation is associative, so every legal order produces the same sum up to floating-point rounding. The readability-only claim misses the central lesson. The one-legal-nesting claim contradicts the entire compiler-scheduling discussion.
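
A toy example shows the effect in isolation. The sketch below uses a 1D convolution rather than the full six-loop 2D nest, and models a single weight register that must be reloaded whenever the tap index changes; both simplifications are assumptions made to keep the example short.

```python
def conv1d(signal, taps, taps_outer):
    """1D valid convolution that also counts weight-register reloads.

    taps_outer=True  iterates taps in the outer loop and positions in the inner loop.
    taps_outer=False iterates positions in the outer loop and taps in the inner loop.
    """
    n_out = len(signal) - len(taps) + 1
    if taps_outer:
        order = [(k, x) for k in range(len(taps)) for x in range(n_out)]
    else:
        order = [(k, x) for x in range(n_out) for k in range(len(taps))]

    out = [0] * n_out
    loads, resident = 0, None
    for k, x in order:
        if k != resident:  # the needed weight is not in the register: reload it
            loads += 1
            resident = k
        out[x] += taps[k] * signal[x + k]
    return out, loads


signal = list(range(10))
taps = [1, 2, 3]
out_a, loads_a = conv1d(signal, taps, taps_outer=True)
out_b, loads_b = conv1d(signal, taps, taps_outer=False)
print("same result:", out_a == out_b)
print("weight loads with positions innermost:", loads_a)
print("weight loads with taps innermost:", loads_b)
```

Both orders produce the same output, but keeping the position loop innermost reuses each weight across every output before moving on, while keeping the tap loop innermost reloads a weight on every step.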

Learning Objective: Analyze how loop ordering changes operand reuse and memory traffic without changing a kernel’s semantics.
A team profiles a transformer kernel and finds that its activations are allocated to HBM while the kernel is placed on a PE cluster with large L2 capacity and strong peer bandwidth. HBM utilization saturates at 95 percent while PE utilization sits at 18 percent. Which mapping decision failed?
1. Memory allocation: the activations should have been tiled and staged into L2 so their reuse across attention heads is served by the fast tier rather than pulled from HBM every access
2. Operation placement: the PE cluster is too fast for this kernel, so moving the operation onto a slower cluster would improve apparent utilization
3. Execution order: serializing attention heads would reduce HBM pressure by limiting concurrent accesses
4. Precision choice: switching from FP16 to FP32 would let the kernel use more of the available HBM bandwidth
Answer: The correct answer names the allocation failure: placement was reasonable, but activations were not staged into the fast tier that would serve their reuse pattern, so the PEs starve on HBM reads (95 percent saturation) while arithmetic capacity runs at 18 percent. The placement-too-fast answer inverts the diagnosis: the bottleneck is HBM, not PE capacity. Serializing heads reduces concurrency but does not fix the allocation; the same bytes still traverse HBM. Raising precision doubles the byte traffic and makes the bottleneck worse.

Learning Objective: Diagnose a mapping failure by identifying whether placement, memory allocation, or execution order was misconfigured given a saturated-HBM profile.
Why is brute-force search over all legal mappings impractical even for a single layer?
1. The number of legal loop permutations, parallelization decompositions, and memory-placement choices grows combinatorially, so the legal-mapping space for a single convolution can exceed billions of candidates—well beyond any exhaustive evaluation
2. Modern compilers already know the optimal mapping analytically for every accelerator, so search is unnecessary
3. Only graph neural networks have enough operator complexity to make the search space nontrivial
4. Legal mappings number in the low dozens for typical kernels, so search completes in milliseconds
Answer: The correct answer is A. Loop-ordering permutations, tile-size choices, parallel-axis decompositions, and memory-level placement multiply into a space that is factorial in some axes and exponential in others, pushing the candidate count far past any exhaustive budget. The compiler-knows-analytically claim contradicts the existence of autotuners and cost models. The GNN-only claim misattributes the problem to a specific architecture; dense convolutions alone produce the combinatorial explosion. The “low dozens” claim contradicts the chapter’s factorial-space analysis.
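
A rough count illustrates how quickly the factors multiply. Every number below is a hypothetical simplification: the layer shape, the restriction of tile sizes to exact divisors, the three memory levels, and the on/off parallelization choice per loop are all assumptions, and real mapping spaces include further axes.

```python
import math

# Hypothetical convolution loop bounds (assumed for illustration).
LOOP_BOUNDS = {
    "out_h": 56, "out_w": 56,
    "in_c": 64, "out_c": 64,
    "k_h": 3, "k_w": 3,
}
NUM_TENSORS = 3        # inputs, weights, outputs
NUM_MEMORY_LEVELS = 3  # e.g. registers, on-chip SRAM, HBM


def num_divisors(n):
    """Number of tile sizes that evenly divide a loop bound."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


loop_orders = math.factorial(len(LOOP_BOUNDS))               # permutations of the nest
tilings = math.prod(num_divisors(b) for b in LOOP_BOUNDS.values())
placements = NUM_MEMORY_LEVELS ** NUM_TENSORS                # one level per tensor
parallel_choices = 2 ** len(LOOP_BOUNDS)                     # parallelize each loop or not

total = loop_orders * tilings * placements * parallel_choices
print(f"loop orders={loop_orders}, tilings={tilings}, "
      f"placements={placements}, parallel choices={parallel_choices}")
print(f"candidate mappings: {total:,}")
```

Even this deliberately coarse model yields more than ten billion candidates for one layer, which is why practical compilers rely on cost models, heuristics, or guided search.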

Learning Objective: Explain why accelerator mapping requires heuristics or structured strategies rather than exhaustive search.

Self-Check: Answer

A CNN uses 3×3 filters reused across thousands of spatial positions per input image. Which stationary dataflow best matches this reuse profile?
1. Weight-stationary, because filter weights are read once at the array edge and reused across every spatial application while activations stream through
2. Output-stationary, because CNNs generate more output positions than the hardware can accumulate in place
3. Input-stationary, because filters are too large to fit in local memory and must stream in from DRAM
4. No stationary choice matters because dataflow has no effect on memory traffic once arithmetic is executed
Answer: The correct answer is A. The CNN reuse pattern—same small filter, many spatial applications—is exactly the case weight-stationary was designed for: keep filter weights in the local tier and stream activations past them. The output-stationary framing inverts which operand is reused; it matches layers where partial sums accumulate over many input contractions, not layers where the same weights apply across many outputs. The input-stationary claim misreads the size asymmetry (filters are the small operand). The “no effect” claim contradicts the chapter’s central argument.

Learning Objective: Select a stationary dataflow strategy by matching it to a workload’s operand-reuse pattern.
A GPU team is deciding between NHWC and NCHW tensor layout for a convolution-heavy inference pipeline. Why is channel-major (NHWC on most modern GPUs) typically preferred for the forward pass?
1. NHWC places values from the same spatial position but different channels at contiguous addresses, so hardware processing adjacent channels for one pixel can read memory efficiently in a single block rather than scattered reads
2. NHWC maximizes branch-prediction efficiency, which dominates convolution performance on GPUs
3. NHWC is the universally fastest layout on CPUs, so GPU engineers inherit the convention
4. NHWC eliminates the need for any tensor layout transformations in any framework
Answer: The correct answer is A. Contiguous memory access—parallel compute units reading contiguous addresses in one operation—is a core hardware preference, and channel-major layout aligns the spatial iteration with that preference. The branch-prediction framing imports a CPU-centric concern that is minor on GPUs. The CPU-always-faster claim is false; CPUs often prefer NCHW for their own cache geometry. The no-transformations claim contradicts the need for framework compilation passes to match different hardware preferences.
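
The contiguity claim follows directly from how each layout linearizes a 4D index. The sketch below uses the standard row-major offset formulas for the two layouts; the tensor shape is a small hypothetical one chosen to keep the output readable.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) when width varies fastest."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) when channel varies fastest."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 4, 3, 3  # hypothetical small tensor
pixel = (1, 1)     # (h, w) of one spatial position in image n=0

nchw = [offset_nchw(0, c, *pixel, C, H, W) for c in range(C)]
nhwc = [offset_nhwc(0, c, *pixel, C, H, W) for c in range(C)]
print("NCHW offsets across channels:", nchw)
print("NHWC offsets across channels:", nhwc)
```

In NHWC the channels of one pixel sit at consecutive addresses, so a group of parallel lanes can fetch them in one contiguous block; in NCHW the same values are separated by a stride of H times W elements.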

Learning Objective: Explain why channel-major tensor layout aligns with hardware memory-access preferences to improve convolution throughput.
Fusing Conv2D, BatchNorm, and ReLU into a single kernel does not change the total floating-point operation count. Explain quantitatively why fusion can still double or triple throughput on a bandwidth-bound inference segment.

Answer: Unfused, each operator writes its output tensor to HBM and the next operator reads it back, so three operators move the activation tensor six times through HBM (three writes plus three reads). Fusion keeps the intermediate activation resident in registers or on-chip SRAM across the three stages, cutting the HBM traffic for that segment from six tensor traversals to two (one input read plus one output write). On a bandwidth-bound segment with arithmetic intensity well below the ridge, wall-clock time is proportional to bytes moved, so cutting bytes by roughly 3× yields a proportional speedup. The system consequence: fusion is the primary compiler lever for bandwidth-bound model segments and is most effective precisely where the arithmetic intensity is lowest.
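
The traffic accounting above can be written out as a short calculation. This sketch assumes every intermediate tensor has the same size as the input, ignores weight and normalization-parameter traffic, and uses a hypothetical FP16 activation shape; it is a simplified model, not a measurement.

```python
def hbm_traversals(num_ops, fused):
    """Number of times an activation-sized tensor crosses HBM for a chain of ops.

    Unfused: each operator reads its input from HBM and writes its output back.
    Fused: one read of the segment input and one write of the segment output.
    """
    return 2 if fused else 2 * num_ops


# Hypothetical activation tensor: 1 x 64 x 56 x 56 in FP16 (2 bytes per element).
tensor_bytes = 1 * 64 * 56 * 56 * 2

unfused_bytes = hbm_traversals(3, fused=False) * tensor_bytes
fused_bytes = hbm_traversals(3, fused=True) * tensor_bytes

# On a bandwidth-bound segment, runtime scales with bytes moved.
speedup = unfused_bytes / fused_bytes
print(f"unfused: {unfused_bytes:,} bytes, fused: {fused_bytes:,} bytes")
print(f"expected bandwidth-bound speedup: {speedup:.1f}x")
```

The ratio depends only on the number of fused operators under this model, so longer chains of elementwise operators benefit more, up to the point where the fused kernel stops being bandwidth bound.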

Learning Objective: Quantify the memory-traffic reduction delivered by operator fusion on a bandwidth-bound segment and relate it to realized speedup.
What is the main purpose of tiling in accelerator execution?
1. To partition a large tensor computation into blocks sized for fast local memory so each block’s operands can be reused many times before eviction, raising arithmetic intensity by reducing redundant traffic from slower memory tiers
2. To change the numerical order of operations enough that the final result becomes slightly more accurate
3. To eliminate parallelism by serializing work into smaller sequential pieces that are easier to schedule
4. To guarantee that every operation becomes compute bound regardless of its underlying arithmetic intensity
Answer: The correct answer is A. Tiling shapes work to fit in fast memory so operands are reused intensively before being evicted; the reused operands do not need to be re-fetched from slow memory, which directly raises arithmetic intensity. The accuracy claim confuses tiling with mixed-precision accumulation. The serialization claim inverts the purpose: tiles are precisely the unit of parallel execution on arrays. The compute-bound guarantee is false; tiling raises realized intensity toward the limit set by a kernel’s inherent reuse, but it cannot push a structurally low-intensity kernel across the ridge.

Learning Objective: Explain tiling as a locality-optimizing partitioning that raises arithmetic intensity by fitting working sets into fast memory.
True or False: A single stationary dataflow is optimal across CNNs, transformers, and MLPs, so hybrid mapping is mostly implementation overhead without real performance benefit.

Answer: False. CNNs reuse small filters across spatial positions and favor weight-stationary; attention blocks stress bandwidth through large KV reads and favor cache-aware or output-staging strategies; MLP blocks look like large GEMMs and prefer blocked weight-reuse with fusion. Each reuse pattern is different enough that forcing one strategy on all leaves 2–3× performance on the table in at least one layer class, which is why production compilers and runtimes implement hybrid mapping.

Learning Objective: Recognize that layer-class reuse-pattern diversity makes hybrid mapping a real performance lever rather than implementation overhead.
A compiler maps a transformer composed of attention blocks and MLP layers onto an A100. Which mapping strategy best matches the section’s argument?
1. Use memory-aware attention kernels (fused softmax and masking with on-chip tiling) for attention blocks while using blocked GEMM-style tiling with operator fusion for MLP layers—matching each layer class to the dataflow that serves its reuse pattern
2. Use the same weight-stationary scheme for every layer because consistency across the graph is more important than per-layer locality
3. Prefer row-major layouts uniformly because they are cache-friendly on CPUs, regardless of the target accelerator
4. Disable fusion and tiling globally so runtime decisions remain maximally flexible
Answer: The correct answer is A. Attention and MLP stress different reuse patterns—attention has KV-cache reads with limited weight reuse, while MLPs look like large GEMMs with heavy weight reuse—so the dominant bottlenecks differ and so should the dataflow strategy. The uniform-weight-stationary answer sacrifices attention performance for simplicity. The row-major-CPU answer imports the wrong hardware’s convention. The disable-all answer forfeits the two largest optimization levers the chapter has identified.

Learning Objective: Design a layer-aware hybrid mapping strategy for a transformer by matching attention and MLP layers to their best dataflow.

Self-Check: Answer

Why do ML compilers need different optimization priorities from traditional compilers?
1. ML programs are tensor-level computation graphs, so the dominant optimization axes are graph transformations (fusion, layout, partitioning), memory-traffic planning, and accelerator-specific execution—not instruction-level optimizations on sequential scalar code
2. ML programs no longer require any memory management, so compilers can skip allocation and placement entirely
3. Traditional compilers already optimize GPUs and TPUs perfectly, so ML compilers exist only to provide a friendlier API
4. ML models are too small for instruction-level optimization to matter, so any compiler will produce equivalent code
Answer: The correct answer is A. The optimization axis shift from instruction stream to tensor graph is the core reason ML compilers diverge: fusion, layout, and partitioning have no counterpart in a C compiler. The memory-management-disappears claim is the opposite of the truth; tensor placement and movement are central. The “traditional compilers already optimize GPUs” claim ignores that vendor stacks exist precisely because generic compilers cannot target accelerator pipelines. The small-models claim contradicts the existence of trillion-parameter training jobs.

Learning Objective: Compare the optimization axes of ML compilers with those of traditional compilers.
Order the following ML compilation stages by their causal dependencies: (1) graph optimization rewrites the computation graph (fusion, layout), (2) kernel selection picks a concrete implementation (cuBLAS, cuDNN, handwritten) for each operator, (3) memory planning assigns tensor buffers to hierarchy tiers, (4) computation scheduling fixes the temporal execution order across resources, (5) code generation emits device-specific machine code.

Answer: The correct order is: (1) graph optimization rewrites the computation graph, (2) kernel selection picks a concrete implementation for each operator, (3) memory planning assigns tensor buffers to hierarchy tiers, (4) computation scheduling fixes the temporal execution order, (5) code generation emits device-specific machine code. Each stage has a data dependency on the prior one: the compiler cannot choose kernels for operators until fusion has settled the graph; it cannot plan memory until kernel choice reveals working-set sizes; it cannot schedule before memory placement is known; it cannot emit code before the schedule is fixed. Swapping scheduling ahead of memory planning forces the scheduler to reason about access costs it does not yet know—the same mistake early ML compilers made before adopting this pipeline.
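
The dependency argument can be expressed as a small graph and sorted mechanically. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the stage names are labels invented for this example, not identifiers from any particular compiler.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose output it needs.
STAGE_DEPENDENCIES = {
    "graph_optimization": set(),
    "kernel_selection": {"graph_optimization"},        # needs the settled graph
    "memory_planning": {"kernel_selection"},           # needs working-set sizes
    "computation_scheduling": {"memory_planning"},     # needs placement and access costs
    "code_generation": {"computation_scheduling"},     # needs the fixed schedule
}

order = list(TopologicalSorter(STAGE_DEPENDENCIES).static_order())
for position, stage in enumerate(order, start=1):
    print(f"({position}) {stage}")
```

Because the dependencies form a single chain, the topological order is unique; any other ordering would place a stage ahead of information it requires.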

Learning Objective: Order the causal dependencies among the five ML compilation stages and justify why reordering them breaks the pipeline.
A model runs 4× faster under an optimizing backend than in eager execution while the model architecture is unchanged. Explain which compiler transformations are most likely responsible and why they deliver the gain without touching the model’s math.

Answer: Four transformations typically account for most of the gap. First, operator fusion collapses adjacent kernels (Conv-BN-ReLU or softmax-masking-matmul) into a single launch, cutting kernel-launch overhead and intermediate HBM traffic. Second, kernel selection substitutes vendor-optimized implementations (cuDNN, cuBLAS) for the generic reference path. Third, layout transformation permutes tensor memory order to enable coalesced access on the target accelerator. Fourth, memory planning folds buffers onto reused allocations, eliminating redundant transfers. None changes the model’s arithmetic output; all change the bytes-moved and launch-cost profile, and on a bandwidth-bound workload the wall-clock delta can easily be 3–5×.

Learning Objective: Explain how compiler transformations deliver multi-x speedups without changing model semantics, naming the specific transforms involved.
What is the main goal of graph optimization before hardware-specific code generation?
1. Rewrite the computation graph to eliminate redundant operations, fuse adjacent kernels, adjust tensor layouts, and reduce intermediate activations before committing to kernel and placement choices that depend on the final graph structure
2. Bind every operation permanently to a fixed physical core, ignoring later scheduling or runtime adaptation
3. Convert all arithmetic to FP64 for numerical accuracy regardless of the target hardware’s capabilities
4. Replace vendor-optimized libraries (cuBLAS, cuDNN) with generic scalar implementations for portability
Answer: The correct answer is A. Graph optimization is the high-level restructuring stage: rewrite the graph into a form the downstream stages can compile efficiently. The core-binding answer overreaches into placement and scheduling decisions that happen after graph optimization. The FP64 answer inverts the trend toward reduced precision. The generic-libraries answer describes the opposite of what graph optimization aims at: vendor libraries are the target of kernel selection, not competition.

Learning Objective: Describe the role of graph optimization as the high-level restructuring stage that precedes kernel and placement commitments.
A compiler lowers the same GEMM into code for an A100, a TPU, and a CPU inference server. Why is kernel selection not a trivial library lookup?
1. The same abstract GEMM has many concrete implementations whose best choice varies with matrix shape, precision, batch size, and current hardware state; a cuBLAS call that wins at 4096×4096 FP16 may lose at 128×128 FP32, so the compiler must choose among implementations using cost models or profiling
2. Accelerators execute only one kernel type for the entire model, so kernel selection collapses to a single per-hardware choice
3. Kernel choice is unrelated to tensor shape or memory bandwidth, so a fixed library call suffices in every case
4. Vendors do not provide optimized kernel libraries for ML workloads, so all kernels must be generated from scratch
Answer: The correct answer is A. Shape, precision, and batch regime each pick different winning implementations, which is why cost models and autotuners exist. The single-kernel-per-chip claim contradicts the existence of per-operator implementations in every production stack. The shape-and-bandwidth-unrelated claim denies the roofline framing the chapter just developed. The no-libraries claim contradicts the existence of cuBLAS, cuDNN, and oneDNN, the vendor libraries that kernel selection chooses among.

Learning Objective: Analyze why kernel selection depends on workload shape, precision, and hardware state rather than a fixed library lookup.
True or False: A complete compile-time execution plan is sufficient by itself, so runtime systems function as little more than administrative wrappers around compiled binaries.

Answer: False. Compile-time plans assume fixed conditions—batch shape, memory state, no contention—that production workloads violate routinely. Runtime systems adapt to variable batch sizes, fragmenting device memory, multi-tenant interference, and thermal throttling, and they can override compile-time kernel choices when hardware state changes. The compile-then-run boundary is a handoff, not a closure.

Learning Objective: Distinguish compile-time plan generation from runtime adaptation and identify what compile-time planning cannot capture.

Self-Check: Answer

Why do AI systems need specialized runtime support even after a model has been fully compiled?
1. Production conditions—variable batch size, memory fragmentation, multi-tenant contention, thermal throttling—routinely diverge from the fixed assumptions baked into compile-time plans, so execution must adapt to actual state rather than follow a static script
2. Compilation cannot produce executable machine code for accelerators, so a runtime is required to translate the plan into instructions
3. Runtimes exist only to provide a user interface for monitoring accelerator temperatures and utilization
4. Compiled execution plans are mathematically incomplete and produce wrong results without runtime correction
Answer: The correct answer is A. The runtime adapts to exactly the conditions compile-time cannot see: dynamic shapes, memory state, contention, and hardware state. The compilation-can’t-produce-code claim contradicts the existence of every AOT-compiled binary. The UI-only framing trivializes the runtime’s scheduling and kernel-selection responsibilities. The mathematically-incomplete claim confuses correctness with adaptation; compiled plans are correct under their assumptions but suboptimal when conditions change.

Learning Objective: Explain why runtime systems are required even when compile-time optimization has already occurred.
A transformer inference service receives requests with sequence lengths ranging from 64 to 8192 tokens. Explain how runtime dynamic kernel execution improves performance compared to a single compile-time plan built for an average sequence length.

Answer: A single compiled plan picks one tile size, one memory layout, and one kernel variant optimized for a fixed assumed shape, so it under-utilizes the accelerator on short sequences (too much reserved KV-cache and launch overhead per token) and overruns memory or slows catastrophically on long sequences (tile too small, forcing many launches). A dynamic runtime measures the actual shape at request time and selects among pre-compiled kernel variants or rebuilds tile parameters: short requests use a compact kernel with low launch overhead, long requests use a larger tiled kernel with staged KV-cache. The consequence is that realized latency tracks each request’s inherent cost rather than the average case, so p99 latency and mean utilization both improve.
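
One simple way to picture shape-aware dispatch is a lookup over pre-compiled variants keyed by sequence-length bucket. Everything in the sketch below is hypothetical: the bucket boundaries, the variant names, and the `select_kernel` helper are invented for illustration and do not correspond to any specific runtime's API.

```python
import bisect

# Hypothetical pre-compiled variants: (largest sequence length served, variant name).
KERNEL_VARIANTS = [
    (256, "compact_low_launch_overhead"),
    (2048, "tiled_medium"),
    (8192, "tiled_large_staged_kv"),
]
_UPPER_BOUNDS = [bound for bound, _ in KERNEL_VARIANTS]


def select_kernel(seq_len):
    """Pick the smallest variant whose bucket covers the observed sequence length."""
    if seq_len < 1:
        raise ValueError("sequence length must be positive")
    index = bisect.bisect_left(_UPPER_BOUNDS, seq_len)
    if index == len(KERNEL_VARIANTS):
        raise ValueError(f"no compiled variant covers seq_len={seq_len}")
    return KERNEL_VARIANTS[index][1]


for seq_len in (64, 256, 1500, 8192):
    print(f"seq_len={seq_len:5d} -> {select_kernel(seq_len)}")
```

A real runtime would likely also weigh memory headroom and current load when choosing a variant; the point of the sketch is only that the decision is made per request from the measured shape rather than fixed at compile time.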

Learning Objective: Explain how runtime kernel adaptation reduces per-request latency variance for variable-shape inference workloads.
A runtime observes a batch shape that would benefit from FP16 tensor cores and replaces a compiler-selected FP32 matrix kernel with an FP16 tensor-core kernel for that step. Which justification best matches the section’s argument?
1. Runtime kernel selection can exploit actual hardware state and workload shape: if current conditions permit reduced precision (precision budget, tensor-core availability, input range), the runtime substitutes a faster implementation that the compile-time plan conservatively rejected
2. The runtime swaps kernels only to reduce accuracy for easier debugging, never for performance
3. Compiled kernels are unexecutable on real hardware until the runtime rewrites the entire graph
4. FP16 tensor-core substitution is a CPU-only optimization and has no effect on accelerators
Answer: The correct answer names the actual mechanism: the runtime has information the compiler lacked—current shape, current precision budget, current tensor-core availability—and exercises a pre-authorized alternate path when that information favors a faster kernel. The accuracy-for-debugging claim reverses the motivation. The “unexecutable until rewritten” claim contradicts standard AOT deployment. The CPU-only claim contradicts every production inference stack.

Learning Objective: Analyze why runtime kernel selection may override compile-time choices by leveraging information unavailable at compile time.
True or False: An inference service that benchmarks at a stable 10 ms p99 latency in isolation will deliver roughly the same p99 under multi-tenant production load because the compiled execution plan and the hardware are unchanged.

Answer: False. Multi-tenant production introduces HBM bandwidth contention, L2 cache pollution from neighboring kernels, and thermal headroom changes that can raise p99 latency by 2–5× relative to an isolated benchmark—even with the identical compiled plan on identical silicon. The runtime’s ability to detect and react to these conditions is exactly the adaptation the section motivates.

Learning Objective: Distinguish stable-benchmark performance from multi-tenant production behavior using specific contention mechanisms.
What is the main role of kernel scheduling inside the AI runtime?
1. Determine when and where selected kernels execute on available resources so compute, data movement, and host-device transfers are overlapped and accelerators stay continuously fed
2. Rewrite the model architecture to reduce parameter count before execution begins
3. Write training checkpoints to SSD at fixed intervals during inference
4. Guarantee identical wall-clock latency for every kernel, regardless of operand shape or resource contention
Answer: The correct answer is A. Scheduling exists to keep the accelerator’s arithmetic, memory, and host-device paths all working in parallel so no resource idles while another waits. The model-rewriting answer describes graph optimization, which is a compiler role. The checkpoint answer is unrelated to inference scheduling. The identical-latency guarantee is physically impossible because kernels differ in shape, dependencies, and resource demand.

Learning Objective: Describe the runtime scheduler’s role in overlapping compute and data movement to sustain accelerator utilization.

Self-Check: Answer

A training run scales from 8 to 64 accelerators and sees only a 5.2× speedup rather than 8×. Which factor best explains the diminishing return?
1. Gradient synchronization, parameter broadcast, and inter-node coordination create a communication overhead that acts as Amdahl’s serial fraction—growing as more devices must participate in every all-reduce, eventually capping speedup regardless of added compute
2. Accelerators lose arithmetic capability when connected together through NVLink or Ethernet, so per-device throughput drops as the cluster grows
3. Memory capacity decreases as more accelerators are added, forcing smaller batches that scale worse
4. All-reduce removes the need for per-device computation once two or more devices are present
Answer: The correct answer is A. Communication behaves like the serial fraction in Amdahl’s Law: a fraction s of the work that does not shrink as devices are added caps speedup at 1/s, so a 20 percent serial fraction would cap end-to-end speedup at 5× no matter how much compute is added. The “lose arithmetic capability” claim has no hardware basis. The capacity-decreases claim inverts the effect of adding devices. The all-reduce-removes-computation claim describes the opposite of what all-reduce does.

Learning Objective: Explain how communication overhead acts as an Amdahl-style serial fraction that caps multi-accelerator scaling.
Which scaling approach most directly reduces inter-chip communication overhead by treating an entire wafer as one compute fabric?
1. Wafer-scale integration, which keeps communication on a single large silicon die and avoids package boundaries, PCIe links, and network switches for most cross-compute-unit traffic
2. Chiplet packaging across standard CPU sockets, which routes most communication through motherboard interconnect
3. A conventional PCIe-attached multi-GPU workstation, which uses the host PCIe bus for inter-GPU traffic
4. A cluster linked only by commodity Ethernet, which amortizes communication across TCP/IP
Answer: The correct answer is A. Wafer-scale integration eliminates the package-and-board traversal that dominates cross-chip latency, keeping most communication on-die. Chiplets still cross package boundaries. PCIe-attached multi-GPU workstations move cross-GPU traffic through PCIe, the slowest tier in the taper. Ethernet-only clusters add network-stack overhead on top of the physical link.

Learning Objective: Identify wafer-scale integration as the approach that most aggressively minimizes inter-chip communication overhead.
Explain why memory coherence becomes a materially harder problem on multi-chip AI systems than on single-chip accelerators, and give one concrete consequence for parallelism design.

Answer: On a single die, cache coherence is managed by on-chip protocols operating at nanosecond latencies. Once many chips must share a consistent view of weights, gradients, or activations, coherence traffic must traverse package boundaries and interconnect fabric, each access costing hundreds to thousands of cycles and consuming coordination bandwidth that competes with useful data. In practice this drives distributed-training systems to replace shared-memory assumptions with explicit memory management where programmers control data placement and synchronization manually. The consequence is that multi-chip AI systems trade the programming convenience of shared memory for the predictability of explicit communication-aware parallelism.

Learning Objective: Explain why coherence costs grow superlinearly in multi-chip AI systems and justify the shift to explicit communication-aware parallelism.
A team scales a training cluster from 64 to 4,096 GPUs. Which factor grows sharply and forces fault tolerance into the design’s first-class concerns?
1. Component mean-time-to-failure is roughly constant per GPU, so aggregate failure rate scales linearly with device count; a failure that costs 15 minutes of progress and was rare at 64 devices now occurs many times per day at 4,096 GPUs, requiring checkpointing and automatic recovery
2. Software-stack complexity grows but hardware failure rates stay negligible, so fault tolerance is still optional
3. Network topology becomes irrelevant at scale because all-reduce self-adapts to any shape
4. Thermal and power limits disappear at scale because heat dissipates across a larger footprint
Answer: The correct answer is A. Per-device reliability is roughly fixed, so fleet failure frequency scales with device count: 64× more devices means 64× more failures, and a weekly failure at 64 devices becomes one every two to three hours at 4,096. The software-complexity-only claim ignores hardware-induced failure frequency. The topology-irrelevant claim contradicts the networking material in Volume 2. The thermal-disappears claim inverts the physics: larger systems face more aggregate heat, not less.

Learning Objective: Identify the quantitative growth of failure frequency with device count and justify fault tolerance as a first-class concern at scale.
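
The linear scaling of failure frequency can be checked with a few lines of arithmetic. The sketch below takes one failure per week at 64 devices as its starting assumption and further assumes independent, identical per-device failure rates; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24


def fleet_mtbf_hours(reference_mtbf_hours: float,
                     reference_devices: int,
                     devices: int) -> float:
    """Scale fleet mean time between failures with device count.

    Assumes every device fails independently at the same constant rate,
    so the fleet failure rate is proportional to the number of devices.
    """
    return reference_mtbf_hours * reference_devices / devices


# Assumed baseline: one failure per week across 64 devices.
mtbf_4096 = fleet_mtbf_hours(HOURS_PER_WEEK, 64, 4096)
failures_per_day_4096 = 24 / mtbf_4096

print(f"Fleet MTBF at 4,096 devices: {mtbf_4096:.2f} hours")
print(f"Expected failures per day:   {failures_per_day_4096:.1f}")
```

Under these assumptions the interval between failures shrinks by the same factor of 64 that the device count grows, which is what pushes checkpointing and automatic recovery into the core design.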

Self-Check: Answer

Why do modern mobile and edge SoCs integrate CPUs, GPUs, DSPs, and NPUs rather than rely on a single general-purpose processor?
1. Different tasks in a mobile AI pipeline (convolution backbone, audio DSP, control logic, display rendering) have different compute, control-flow, latency, and power profiles; specialized blocks execute each task at a fraction of the energy-per-op that a general-purpose core would need
2. Heterogeneous chips eliminate the need for any runtime scheduling because each workload has one obvious home
3. CPUs are physically incapable of executing AI-related code, so alternative processors are required
4. A single general-purpose processor would always exceed legal transistor limits at the silicon density required for mobile devices
Answer: The correct answer is A. Heterogeneity matches workload shape to hardware structure: dense regular MACs belong on an NPU, branch-heavy control stays on the CPU, parallel pixel shading goes to the GPU, and audio/signal streams go to the DSP. The no-scheduling claim misses that the scheduler is exactly what makes heterogeneity usable. The CPU-can’t-run-AI claim is false; CPUs can, just at much higher energy cost. The transistor-limit claim invents a non-existent regulatory constraint.

Learning Objective: Explain why heterogeneous SoCs match workload characteristics to specialized processor types for energy and latency efficiency.
A mobile object-detection pipeline has three stages: a convolutional backbone, non-maximum suppression (NMS), and display overlay rendering. Justify a sensible partition of these stages across an NPU, CPU, and GPU and explain what the energy cost would be of running all three on any single processor.

Answer: The convolutional backbone maps to the NPU because it is dense, regular, multiply-accumulate-heavy work that NPU systolic arrays or tensor units execute at a fraction of CPU energy per MAC. NMS belongs on the CPU because it is branch-heavy, data-dependent sorting and filtering whose irregular control flow would idle an NPU’s lanes; CPU branch predictors and scalar execution fit this pattern. Display overlay rendering goes to the GPU because its graphics pipeline is built precisely for parallel pixel operations. Running the full pipeline on the NPU would crawl on NMS because its irregular control flow produces near-zero NPU utilization; running it on the CPU would consume 10–50× more energy for the backbone; running it on the GPU would waste power on NMS’s non-parallel branches. The partitioned pipeline delivers lower total energy and lower latency than any uniform choice.

Learning Objective: Design a workload partition across heterogeneous SoC processors by matching each stage’s compute pattern to the most energy-efficient execution unit.
A mobile SoC under sustained AI inference approaches its thermal envelope. Which response best matches the section’s thermal-management argument?
1. Migrate work across processors and coordinate DVFS—moving one pipeline stage from a saturated NPU to an idle GPU while reducing voltage on the hottest block—keeping the aggregate within the thermal envelope while sustaining acceptable frame latency
2. Keep all work on the hottest accelerator and accept short-term throttling, since migration costs exceed benefits
3. Disable the memory hierarchy, because cache activity is the dominant heat source on mobile SoCs
4. Force every processor into its maximum-performance state simultaneously to clear the work queue faster
Answer: The correct answer is A. Coordinated DVFS plus task migration is how heterogeneous systems honor thermal budgets without catastrophic latency collapse: redistribute work away from the hot block and reduce its frequency while moving the load to cooler silicon. The “keep it all on the hottest block” claim accepts throttling the section explicitly warns against. The disable-memory-hierarchy claim inverts the cause—memory activity is not the dominant heat source—and would be self-defeating on memory-bound workloads. The max-performance-simultaneously claim accelerates throttling, which is exactly the failure mode to avoid.

Learning Objective: Analyze how coordinated DVFS and task migration manage thermal constraints without collapsing latency.
True or False: Automotive heterogeneous AI can treat latency optimization the same way smartphone inference does, because both are power-constrained edge deployments.

Answer: False. Automotive systems add hard real-time deadlines (a perception pipeline that misses its 33 ms frame budget is a safety failure, not a UX blemish), safety partitioning (temporal isolation between safety domains), redundancy (redundant processing elements), and deterministic scheduling that smartphone inference never needs. The constraints are qualitatively different: phones optimize average user experience under a soft power cap; automotive systems must guarantee worst-case behavior within a certified deadline.

Learning Objective: Distinguish automotive functional-safety requirements from best-effort mobile inference.
What is a unique software-stack challenge heterogeneous SoCs pose that a single-accelerator system does not?
1. The software must coordinate processor-specific execution models, manage shared-memory interactions across heterogeneous cores (processor-specific caching behavior, DMA synchronization), and schedule across processors whose instruction sets and precision formats differ—all while presenting a usable programming model
2. Models no longer need tensor layout decisions because each processor has a universal layout
3. Compilation becomes unnecessary because runtime processor selection replaces every optimization pass
4. Power and thermal management move entirely into hardware, so software concerns decrease at scale
Answer: The correct answer is A. Heterogeneity multiplies software concerns: different ISAs, different precision support, different memory visibility, and cross-processor scheduling all compound. The no-layouts claim is false; different processors have different preferred layouts. The no-compilation claim contradicts every mobile stack from Qualcomm to Apple. The hardware-handles-thermals claim inverts reality; software DVFS and scheduling policies are what actually manage heat in response to workload signals.

Learning Objective: Explain the software-orchestration complexity introduced by heterogeneous SoC execution across multiple processor types.

Self-Check: Answer

Why does the section treat performance per watt as a first-class metric for AI hardware rather than a concern limited to battery-powered devices?
1. At fleet scale, every watt spent per inference multiplies by billions of inferences per day—so efficiency sets annual operating cost, carbon footprint, and datacenter capacity planning, making joule-per-op a system-level constraint comparable to latency and throughput
2. Wattage matters only on battery devices; the chapter uses it metaphorically when discussing datacenters
3. Carbon impact is unrelated to inference fleet design because training dominates total lifecycle emissions
4. Specialized silicon always consumes more power than general-purpose hardware, so perf-per-watt is a constraint on specialization rather than a motivation
Answer: The correct answer is A. Once inference runs billions of times per day, the per-operation energy number determines the utility bill, the carbon footprint, and how many inferences a given datacenter footprint can sustain. The metaphorical-usage claim trivializes what is a direct cost. The training-dominates claim contradicts published lifecycle analyses that show inference can exceed training emissions over a model’s service life. The specialized-consumes-more claim inverts the well-established result that domain-specific silicon improves joules per operation by one to two orders of magnitude.

Learning Objective: Explain why performance per watt is a first-class systems metric at fleet scale rather than a mobile-device concern.
Explain how investing in specialized inference silicon can simultaneously improve system performance and sustainability at fleet scale, using a concrete energy comparison.

Answer: A specialized NPU can execute the same inference at 10–100× lower joules per operation than a CPU because its silicon is allocated to arithmetic and locality structures rather than control logic. At a billion inferences per day, a drop from 5 joules per inference on CPU to 0.05 joules per inference on NPU translates to about 5,000 MJ/day versus 50 MJ/day, or 1,825,000 MJ/year versus 18,250 MJ/year—two orders of magnitude lower electricity use and a proportional reduction in grid-electricity carbon emissions. The system consequence is that accelerator choice is not only a latency-and-cost decision but also a major lever on annual carbon output, so sustainability and performance engineering align rather than conflict.

Learning Objective: Analyze how accelerator specialization lowers joules per operation and quantify the fleet-scale energy and carbon consequences.
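
The fleet-scale figures in this answer follow from straightforward multiplication, reproduced below as a small sketch. The per-inference energies and the inference volume are the ones quoted in the answer; the function name is illustrative.

```python
JOULES_PER_MEGAJOULE = 1e6


def fleet_energy_mj(joules_per_inference: float,
                    inferences_per_day: float,
                    days: int = 1) -> float:
    """Total energy in megajoules for a fleet over the given number of days."""
    return joules_per_inference * inferences_per_day * days / JOULES_PER_MEGAJOULE


INFERENCES_PER_DAY = 1e9

cpu_daily = fleet_energy_mj(5.0, INFERENCES_PER_DAY)         # about 5,000 MJ
npu_daily = fleet_energy_mj(0.05, INFERENCES_PER_DAY)        # about 50 MJ
cpu_yearly = fleet_energy_mj(5.0, INFERENCES_PER_DAY, 365)   # about 1,825,000 MJ
npu_yearly = fleet_energy_mj(0.05, INFERENCES_PER_DAY, 365)  # about 18,250 MJ

print(f"CPU: {cpu_daily:,.0f} MJ/day, {cpu_yearly:,.0f} MJ/year")
print(f"NPU: {npu_daily:,.0f} MJ/day, {npu_yearly:,.0f} MJ/year")
print(f"Energy ratio: {cpu_yearly / npu_yearly:.0f}x")
```
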
True or False: The most sustainable hardware choice is always the cheapest device to purchase, because amortized upfront cost dominates environmental impact over the device’s service life.

Answer: False. Operational energy cost and carbon emissions accumulate over years of continuous service, so a device with higher upfront cost but much better joules per inference can be dramatically more sustainable at fleet scale. For a heavily utilized fleet, a 5× upfront cost premium paired with 20× better energy efficiency can be recovered in electricity bills and lifecycle carbon within months rather than years.

Learning Objective: Distinguish upfront-purchase cost from lifecycle energy cost when evaluating sustainability of accelerator choices.

Self-Check: Answer

Which purchasing mistake does this section most directly criticize?
1. Choosing hardware on peak FLOP/s alone without first checking whether the target workload’s arithmetic intensity places it above or below the accelerator’s ridge point
2. Comparing deployment cost and sustained power draw across candidate accelerators
3. Estimating sustained (realizable) throughput on the actual workload rather than advertised peak
4. Using arithmetic intensity as one input to hardware selection alongside cost and latency
Answer: The correct answer names the chapter’s central critique: peak FLOP/s ignores the ridge-point test that actually determines realizable throughput on the target workload. The three remaining options are all remedies the chapter endorses—checking sustained throughput, cross-checking power, and using arithmetic intensity—so treating any of them as a mistake inverts the guidance.

Learning Objective: Identify peak FLOP/s as an unreliable standalone metric for accelerator selection.
Why is deploying batch-1 inference on a top-tier training accelerator often economically poor compared to using a cheaper inference-class chip?
1. Batch-1 inference has very low arithmetic intensity, so it sits far below the training accelerator’s ridge point; the expensive compute silicon runs largely idle while latency is set by HBM bandwidth, which an inference-class chip can provide at far lower cost
2. Batch-1 inference raises arithmetic intensity enough to saturate tensor cores, so the expensive chip’s extra compute delivers a proportional latency improvement
3. Training-class accelerators cannot execute inference kernels at all, so the investment is structurally wasted
4. Reducing batch size eliminates memory traffic, so the expensive chip’s bandwidth advantage disappears
Answer: The correct answer captures the economics: at batch 1, HBM bandwidth sets latency and the expensive FLOP/s are unused, so a cheaper chip with comparable bandwidth delivers comparable latency at a fraction of the cost. The raises-intensity claim contradicts the section’s roofline analysis. The can’t-run-inference claim is false. The no-memory-traffic claim inverts the mechanism; smaller batches actually starve weight reuse and increase memory traffic per FLOP.

Learning Objective: Evaluate why small-batch inference makes high-compute training accelerators an economically poor choice.
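
A rough estimate shows why batch 1 lands near 1 FLOP/byte. The sketch below assumes a single dense layer with FP16 operands (2 bytes each), counts one read of the weights and inputs plus one write of the outputs, and ignores caching effects; the 4096-wide layer and the function name are illustrative choices, not figures from the chapter.

```python
def dense_layer_intensity(batch: int, d_in: int, d_out: int,
                          bytes_per_element: int = 2) -> float:
    """Approximate arithmetic intensity (FLOP/byte) of a dense layer.

    FLOPs: one multiply and one add per weight per batch item.
    Bytes: weights read once, inputs read once, outputs written once.
    """
    flops = 2 * batch * d_in * d_out
    elements_moved = d_in * d_out + batch * d_in + batch * d_out
    return flops / (bytes_per_element * elements_moved)


# Hypothetical 4096 x 4096 layer in FP16.
intensity_batch_1 = dense_layer_intensity(1, 4096, 4096)
intensity_batch_256 = dense_layer_intensity(256, 4096, 4096)

print(f"batch 1:   {intensity_batch_1:.2f} FLOP/byte")
print(f"batch 256: {intensity_batch_256:.1f} FLOP/byte")
```

Because the weights dominate the bytes moved, intensity grows roughly in proportion to batch size until activation traffic becomes significant, which is why batch-1 serving stays bandwidth bound while large-batch training can approach the ridge point.
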
Explain why expecting linear speedup from adding more GPUs is a fallacy even when the workload is highly data-parallel, and quantify the mechanism.

Answer: As device count grows, gradient synchronization and parameter broadcast impose communication cost that scales at best logarithmically and often linearly in device count, even on fast interconnects like NVLink or InfiniBand. If gradient all-reduce takes 5 percent of step time at 8 GPUs, it can take 20–30 percent at 256 GPUs because each step requires many more bytes coordinated across many more links. Amdahl’s Law makes the limit explicit: a fraction s of the work that does not shrink with device count caps speedup at 1/s, so a 20 percent serial fraction would cap scaling at 5× regardless of how much compute is added. Even when most communication overlaps with compute and the exposed fraction is far smaller, the realized 256-GPU speedup may be 120× rather than 256×. The system consequence is that efficient large-scale training requires explicit compute-communication overlap to mask communication behind compute rather than simply adding more devices.

Learning Objective: Quantify how communication overhead as a serial fraction under Amdahl’s Law prevents linear multi-GPU scaling.
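
Amdahl's Law can be evaluated directly to see how sensitive scaling is to the serial fraction. The sketch below treats the serial fraction as the share of single-device work that does not parallelize, which is an assumption about how the fraction is defined; both function names are illustrative.

```python
def amdahl_speedup(serial_fraction: float, devices: int) -> float:
    """Speedup over one device when serial_fraction of the work never parallelizes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / devices)


def implied_serial_fraction(speedup: float, devices: int) -> float:
    """Invert Amdahl's Law: the serial fraction consistent with an observed speedup."""
    return (1.0 / speedup - 1.0 / devices) / (1.0 - 1.0 / devices)


# A 20 percent serial fraction approaches, but never reaches, a 5x speedup.
speedup_at_256 = amdahl_speedup(0.20, 256)

# Serial fraction that would yield a 120x speedup on 256 devices.
fraction_for_120x = implied_serial_fraction(120.0, 256)

print(f"s = 0.20, 256 devices: {speedup_at_256:.2f}x")
print(f"120x on 256 devices implies s = {fraction_for_120x:.4f}")
```

Under this definition, reaching 120× on 256 devices requires the non-overlapped fraction to stay below half a percent, which is why overlapping communication with compute matters more than adding devices.
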
True or False: Vendor-specific kernel optimizations are always the right choice because portability has little practical value once a team commits to a hardware vendor.

Answer: False. Vendor-specific optimizations can deliver 2–3× performance on the current generation but create migration cost when the vendor’s next-generation chip, a competitor’s chip, or a changed cost model later requires rewriting every custom kernel. Production teams can face 6–12 months of engineering work to port from a vendor-specific stack, which may cost more than the original performance gain was worth.

Learning Objective: Recognize the trade-off between peak vendor-specific performance and long-term portability cost.
A team buys a specialized ML accelerator for a workload with irregular memory access, arithmetic intensity near 1 FLOP/byte, and batch size 1, then measures only 1.2× speedup over a CPU baseline. Which diagnosis best matches the chapter?
1. The workload is mismatched to the accelerator’s assumptions: regular dataflow, high arithmetic intensity, and parallel work. The operation mix sits far below the ridge point, the irregular access defeats the array’s structured-memory path, and batch 1 starves the array’s parallelism—the specialized device cannot exercise its peak compute
2. The workload is likely compute bound because specialized accelerators make most workloads compute bound regardless of input shape
3. The SSD is probably too slow to deliver input tensors, so the fix is a faster storage tier
4. The accelerator needs more branch predictors to match CPU performance on irregular access patterns
Answer: The correct answer names the three mismatch axes the chapter identifies: intensity, regularity, and batch parallelism. The specialized-makes-everything-compute-bound claim is precisely the fallacy the section refutes. The SSD-bottleneck answer invents a bottleneck not suggested by the profile; the actual symptoms point to the accelerator’s assumptions, not storage. The branch-predictor answer misunderstands the accelerator’s design: it omits branch predictors by design, because its target workloads involve little irregular branching.

Learning Objective: Diagnose a workload-hardware mismatch using arithmetic intensity, access regularity, and batch parallelism.

Self-Check: Answer

Which statement best captures the chapter’s overall systems lesson about hardware acceleration?
1. Modern ML performance is governed by matching workload data-movement patterns and arithmetic intensity to the hardware’s memory system and execution model—peak FLOP/s is a ceiling, not a prediction of realized throughput
2. The fastest accelerator is always the one with the highest advertised peak FLOP/s, because FLOP/s scales linearly with throughput on modern workloads
3. Compiler and runtime support matter only after hardware has been chosen correctly, so performance engineering splits cleanly into a hardware phase and a software phase
4. Once a model is quantized, hardware choice becomes essentially irrelevant because low-precision execution is uniformly fast across devices
Answer: The correct answer states the chapter’s central argument: realized performance comes from the match between workload intensity and hardware ridge point, not from peak FLOP/s alone. The FLOP/s-predicts-throughput claim is exactly the fallacy the roofline analysis refutes. The hardware-then-software ordering contradicts the co-design emphasis the chapter repeatedly develops. The quantization-makes-hardware-irrelevant claim inverts the co-design point: INT8 helps only on hardware built to exploit it.

Learning Objective: Synthesize the chapter’s central principle for selecting and optimizing ML accelerators.
A production model runs at a small fraction of its accelerator’s advertised peak throughput. Using the chapter’s diagnostic framework, walk through the first three questions an engineer should ask and what each answer would imply for the next optimization step.

Answer: First, what is the kernel’s arithmetic intensity relative to the accelerator’s ridge point? If intensity sits far below the ridge, the workload is bandwidth bound and the next lever is reducing bytes moved through fusion, tiling, or lower precision; if intensity is near or above the ridge, the workload is compute bound and the next lever is improving tensor-core utilization or choosing a better kernel. Second, is the profile consistent with CPU or host-to-device bottlenecks rather than the accelerator itself? High PCIe utilization or idle accelerator time points to pipeline fixes (prefetching, multi-worker DataLoader) before any kernel optimization. Third, are compiler and runtime choices realizing the optimal dataflow and layout for the target hardware? A missing fusion pass or a suboptimal layout can leave 2–3× on the table without any hardware change. The practical consequence is that diagnosis begins with bottleneck classification—not with assuming the hardware is underpowered.

Learning Objective: Apply the chapter’s bottleneck-classification framework to systematically diagnose a performance shortfall.
A team is deciding between two accelerators: chip X with 2× the peak FLOP/s of chip Y, and chip Y with 1.5× the HBM bandwidth of chip X. Their target workload is a transformer inference service at batch 1 with arithmetic intensity around 1 FLOP/byte. Which chip is the better choice and why?
1. Chip Y, because at 1 FLOP/byte the workload sits two orders of magnitude below either chip’s ridge point and is bandwidth bound—realized throughput tracks HBM bandwidth, not peak FLOP/s, so the bandwidth advantage translates directly to lower latency
2. Chip X, because more FLOP/s always produces better latency regardless of workload intensity
3. Either chip will deliver identical performance because both support tensor cores
4. Chip X, because FLOP/s per dollar is the only metric that matters in production
Answer: The correct answer applies the chapter’s central diagnostic: the workload’s 1 FLOP/byte intensity places it far below any ridge point, making the kernel memory bound; realized latency scales with bandwidth, not compute. The more-FLOP/s-always-wins claim is the exact fallacy the roofline analysis refutes. The identical-performance claim ignores that bandwidth differs by 1.5×. The FLOP/s-per-dollar answer substitutes a cost metric for the realizable-throughput analysis the workload actually demands.

Learning Objective: Synthesize the chapter’s roofline framework into a concrete accelerator-selection decision under realistic workload intensity.
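
The chip X versus chip Y comparison can be reproduced with the roofline bound, where attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. In the sketch below, only the 2× and 1.5× ratios and the 1 FLOP/byte intensity come from the question; the absolute peak and bandwidth numbers are hypothetical placeholders.

```python
def attainable_tflops(peak_tflops: float,
                      bandwidth_tb_per_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up.
    """
    return min(peak_tflops, bandwidth_tb_per_s * intensity_flop_per_byte)


INTENSITY = 1.0  # FLOP/byte, from the question

# Hypothetical absolute values chosen to match the stated ratios.
chip_x = attainable_tflops(peak_tflops=200.0, bandwidth_tb_per_s=1.0,
                           intensity_flop_per_byte=INTENSITY)
chip_y = attainable_tflops(peak_tflops=100.0, bandwidth_tb_per_s=1.5,
                           intensity_flop_per_byte=INTENSITY)

print(f"Chip X attainable: {chip_x:.1f} TFLOP/s")
print(f"Chip Y attainable: {chip_y:.1f} TFLOP/s")
print(f"Y over X: {chip_y / chip_x:.2f}x")
```

Both chips sit on the bandwidth-limited slope of the roofline at this intensity, so chip X's extra peak compute never enters the result.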

Accelerator	Peak FP16	Bandwidth	Ridge Point
GPU (2017-era)	\(\sim 10^2\) TFLOP/s	\(\sim 10^3\) GB/s	\(\sim 10^2\) FLOP/byte
GPU (2020-era)	\(\sim 10^2\) TFLOP/s	\(\sim 10^3\) GB/s to \(\sim 10^0\) TB/s	\(\sim 10^2\) FLOP/byte
GPU (2023-era)	\(\sim 10^3\) TFLOP/s	a few TB/s	\(\sim 10^2\) FLOP/byte
TPU-class (2023-era)	\(\sim 10^2\) to \(\sim 10^3\) TFLOP/s	\(\sim 1\) TB/s	\(\sim 10^2\) FLOP/byte
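
The ridge-point column follows from dividing the peak-throughput column by the bandwidth column. As an order-of-magnitude check of the 2017-era row, writing \(P_{\text{peak}}\) for peak throughput and \(BW\) for memory bandwidth (both symbols are shorthand introduced here for illustration, not notation taken from the table):

\[
I_{\text{ridge}} = \frac{P_{\text{peak}}}{BW} \approx \frac{10^{2}\ \text{TFLOP/s}}{10^{3}\ \text{GB/s}} = \frac{10^{14}\ \text{FLOP/s}}{10^{12}\ \text{byte/s}} = 10^{2}\ \text{FLOP/byte}
\]

A workload at roughly 1 FLOP/byte therefore sits about two orders of magnitude below the ridge point on every row of the table.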
