Algorithm Foundations

Purpose

What algorithmic building blocks determine both what neural networks compute and how efficiently systems can run them?

Many performance problems that look hardware-bound are actually rooted in the algorithm’s structure: a model may be dominated by matrix multiplies, but its shapes, layouts, and sparsity patterns decide whether those multiplies hit fast matrix units or fall back to slow kernels. Likewise, training instabilities often trace back to the mechanics of differentiation and the memory footprint of activations. This appendix collects the compact linear algebra and learning mechanics that show up repeatedly in profiling traces and back-of-the-envelope estimates: general matrix multiply intensity, tensor shapes and strides, sparse storage overheads, and the true components of training memory. In D·A·M terms, these algorithmic structures are the algorithm axis made concrete, because they determine how much work the machine must execute and how much data must move.

How to Use This Appendix

This appendix is designed as a reference. Reach for it when translating a profiler symptom (“slow matmul,” “shape mismatch,” “OOM during training”) into a concrete computational or memory cause.

Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).

When GEMMs are slow: Use section 1.1.4 and compare intensity to the hardware’s ridge point.
When memory blows up in training: Use section 1.3 and the training memory decomposition.
When tensor code “should work” but does not: Use section 1.2.2 and section 1.2.3.
When sparsity is proposed as a fix: Use section 1.1.5 to check density and metadata overhead.

This appendix covers the mathematical and computational machinery that powers neural networks. From the linear algebra at the heart of every layer to the backpropagation algorithm that enables learning, these foundations explain how models compute and why certain implementation choices affect performance. The concepts here support the deep learning foundations in Neural Computation, the framework internals in ML Frameworks, and the training strategies in Model Training. The algorithm ecosystem is often a zoo of competing architectures, but these mathematical foundations remain constant.

Linear Algebra

Deep learning systems are, at their core, engines for transforming massive matrices. Frameworks like PyTorch abstract away the raw math, but performance engineering still depends on the underlying linear algebra. Building on how numbers are stored in Machine Foundations, this section focuses on how they are manipulated.

Systems Perspective 1.1: Why this matters

Many modern dense neural networks, especially transformers and large CNNs, spend much of their compute time in matrix multiplication or GEMM-like kernels. A self-attention block performs the Q, K, V, and output projection GEMMs plus two attention matrix multiplications: one for the attention scores and one for the attention-weighted values. The layer’s feed-forward block adds more GEMMs. Understanding GEMM performance characteristics explains why batch size affects throughput, why certain layer dimensions are “better” than others, and how to interpret profiler output. Reasoning about matrix dimensions and arithmetic intensity predicts whether a dense workload is compute bound or memory bound before any profiler trace runs.

Tensor operations and notation

We use Einstein summation¹ notation throughout this book because it makes complex operations explicit (implemented as torch.einsum in PyTorch and np.einsum in NumPy). Matrix multiplication \(C = AB\) becomes: \[ C_{ij} = \sum_k A_{ik} B_{kj} \]

¹ Einstein Summation Convention: Introduced by Albert Einstein in 1916 to simplify the notation of general relativity. The convention states that repeated indices in a product are implicitly summed over, eliminating explicit summation signs. ML frameworks adopted it because it concisely expresses arbitrary tensor contractions in a single string.

In einsum notation, this is ik,kj->ij. The notation extends naturally to the multi-dimensional operations in attention mechanisms. For example, batched multi-head attention is bhid,bhjd->bhij (batch, head, sequence indices).

Memory layouts and performance

Data layout in memory (row-major vs. column-major) directly affects cache efficiency. When iterating over a matrix, accessing contiguous memory locations is dramatically faster than strided access. The difference can be 10\(\times\) to 100\(\times\) in effective bandwidth.

A common optimization pattern is to transpose tensors once before repeated operations to ensure contiguous access in the hot loop. The one-time transpose cost is amortized across many subsequent operations.

The dot product as similarity

The dot product \(\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i\) is geometrically equivalent to \(|\mathbf{a}| |\mathbf{b}| \cos \phi\), which makes it a natural measure of similarity between two vectors: a large positive result means they point in the same direction, zero means they are orthogonal (unrelated), and a negative result means they oppose each other.

This geometric interpretation is why dot products appear everywhere in modern architectures. In attention mechanisms, query (\(Q\)) and key (\(K\)) vectors are dot-produced to compute a similarity score that determines how much each token attends to every other token. The resulting attention weights are then used to form a weighted combination of value (\(V\)) vectors—making the dot product the foundation of the transformer’s ability to model long-range dependencies.

General matrix multiply (GEMM)

GEMM² is the computational workhorse of deep learning. For matrices of size \(M{\times}K\) and \(K{\times}N\), GEMM performs \(2MNK\) floating-point operations (multiply-accumulate counts as two operations).

² General Matrix Multiply (GEMM): The name comes from the BLAS (Basic Linear Algebra Subprograms) library specification, first standardized in 1979 (Lawson et al. 1979). The “GE” prefix stands for “general” (as opposed to symmetric, triangular, or banded matrices). GEMM computes \(C = \alpha AB + \beta C\) and is the single most performance-critical routine in deep learning. See Model Training for how GEMM shapes determine training throughput.

Lawson, Charles L., Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. 1979. “Basic Linear Algebra Subprograms for Fortran Usage.” ACM Transactions on Mathematical Software 5 (3): 308–23. https://doi.org/10.1145/355841.355847.

The arithmetic intensity of GEMM scales linearly with matrix dimension. For square \(n{\times}n\) matrices in FP16 (2 bytes/element):

\[\text{Intensity} = \frac{O}{D_{\text{vol}}} = \frac{2n^3}{3n^2 \times 2} = \frac{n}{3}\text{ FLOP/byte}\]

This explains several important phenomena:

Larger batches improve efficiency: Batching increases the effective matrix dimensions, pushing workloads toward the compute-bound region of the roofline.
Aligned dimensions help: Hardware tensor cores are optimized for precision- and architecture-specific tile multiples. Dimensions that align with these multiples avoid padding overhead and improve kernel efficiency, but they do not need to be powers of two.
Small matrices are inefficient: A square GEMM with \(n =\) 64 has intensity 64/3 ≈ 21.3 FLOP/byte, well below the ridge point (153.0 FLOP/byte), achieving only ~13.9 percent of peak throughput.

Sparse matrix formats

When most elements in a matrix are zero, specialized storage formats avoid wasting memory on zeros and enable computations that skip them entirely.

The Compressed Sparse Row (CSR) format uses three arrays:

Values: The nonzero elements, stored in row order
Col_Idx: The column index of each nonzero element
Row_Ptr: The starting position in Values for each row (length = num_rows + 1)

CSR is essential for recommendation systems (sparse embedding tables) and pruned models. For a matrix with \(N\) elements, \(R\) rows, and \(K\) nonzeros, CSR uses \(\mathcal{O}(K + R)\) storage instead of \(\mathcal{O}(N)\); when \(K\) is large relative to \(R\), this is often summarized as \(\mathcal{O}(K)\).

To see the trade-off concretely, consider a vocabulary embedding matrix with 100,000 rows and 10,000 columns (1B parameters):

Dense (FP32): 1B parameters \(\times\) 4 bytes each = 4 GB.
Sparse (1 percent density): Storing only nonzeros requires roughly 10M stored values \(\times\) (4 value + 4 index) ≈ 80 MB.
Result: A 50× reduction in memory footprint, fitting a model that would otherwise OOM (Out of Memory).

Linear algebra tells us what to compute; the next question is how to express those computations in code. Tensor programming primitives—shapes, strides, and broadcasting—bridge the gap between mathematical notation and the array operations that actually execute on hardware.

Tensor Programming Primitives

A shape mismatch crash, a silently wrong broadcast, a kernel running at 5 percent of peak because of a noncontiguous tensor—these common ML engineering failures all trace back to the same layer of abstraction. Tensor programming translates the abstract math of linear algebra into concrete array manipulations that run on hardware.

Systems Perspective 1.2: Why this matters

The logic may be correct, yet the code crashes with a shape mismatch error—or worse, runs and produces garbage because dimensions were broadcasted incorrectly. Mastering tensor shapes, strides, and broadcasting is the literacy of ML engineering.

Computational complexity cheat sheet

Table 1 provides a quantitative reference for the most common building blocks. Use these formulas for napkin-math estimation of model size and compute requirements before hardware is provisioned. Given the layer type and input shape, the formulas predict whether a model will fit in memory. The attention row is the key warning: sequence length contributes an \(S^2\) term, while hidden dimension contributes \(d^2\) projection work, so long-context models can shift the dominant cost without changing the parameter count.

Shapes and strides

A tensor is a view over an underlying storage buffer, described by shape, stride, dtype, and offset metadata. Only contiguous tensors lay their logical elements out as one adjacent block.

Shape: The dimensions of the tensor (for example, (3, 4)).
Stride: The number of elements to skip in memory to move to the next element in a dimension.

Table 1: Deep Learning Tensor Primitives: Summary of shapes, parameters, and FLOP counts. Note: \(B\) is batch size, \(S\) is sequence length, \(K\) is kernel size. The Attention FLOPs include QKV projections and the \(S^2\) attention matrix interactions. The \(S^2\) term dominates for long sequences (explaining why LLM inference slows with context length); the \(d^2\) term dominates for large hidden dimensions.

Layer Type	Output Shape	Parameters (\(P\))	FLOPs (per Forward Pass)
Linear	\((B, N_{\text{out}})\)	\((N_{\text{in}} + 1) \times N_{\text{out}}\)	\(2 \times B \times N_{\text{in}} \times N_{\text{out}}\)
Conv2D	\((B, C_{\text{out}}, H', W')\)	\(K^2 \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}\) if bias is enabled	\(2 \times B \times H' \times W' \times K^2 \times C_{\text{in}} \times C_{\text{out}}\)
Attention (Single Head)	\((B, S, d_{\text{model}})\)	\(4 \times d_{\text{model}}^2\)	\(B \times (4 S^2 d_{\text{model}} + 8 S d_{\text{model}}^2)\)
LayerNorm	\((B, S, d_{\text{model}})\)	\(2 \times d_{\text{model}}\)	\(\mathcal{O}(B \times S \times d_{\text{model}})\)

Operations like transpose() or view() often just change the strides, not the data in memory. This is fast (\(\mathcal{O}(1)\)) but can lead to noncontiguous tensors that fail in optimized kernels. When a kernel requires contiguous data—and most optimized BLAS routines do—calling contiguous() forces a memory copy to realign data, which is an \(\mathcal{O}(N)\) operation that can dominate runtime if triggered repeatedly inside a loop.

Broadcasting

Broadcasting allows arithmetic operations on tensors of different shapes. The rule is: compare dimensions from the last to the first. Two dimensions are compatible if:

They are equal.
One of them is 1.

The dimension with size one is “stretched” to match the other, as illustrated in figure 1. This stretching is virtual: the data is not copied in memory. Instead, the stride for that dimension is set to 0, allowing the hardware to read the same value repeatedly with \(\mathcal{O}(1)\) memory overhead.

Figure 1: **Tensor Broadcasting Rules**: Visualization of how tensors (3,1) and (1,4) expand to a shared (3,4) result. This stretching is a virtual operation that modifies strides without allocating new memory.

Consider a concrete case: tensor A has shape (32, 1, 64) and tensor B has shape (1, 128, 64). Comparing dimensions right to left, sixty-four matches sixty-four, then one stretches to 128, then one stretches to thirty-two, yielding result shape (32, 128, 64). Visualizing this expansion prevents silent logic bugs that accidentally allocate a large tensor (for example, a (Batch, Batch) matrix instead of an element-wise (Batch) vector).

Shapes, strides, and broadcasting govern how tensors flow through a model’s forward pass. However, training a model requires more than forward computation—it requires learning from errors. The next section examines the algorithm that makes learning possible: backpropagation, along with the memory costs it imposes.

Mechanics of Learning

With valid tensor programs, we can construct the training loops that power learning. Backpropagation is the algorithm that orchestrates these tensors to compute gradients, transforming a forward prediction into a backward learning signal.

Systems Perspective 1.3: Why this matters

When training fails—loss goes to NaN, gradients explode, or memory runs out—understanding what backpropagation actually does is essential for diagnosing the problem. The backward graph supplies the mental model for reasoning about gradient flow and memory usage during training.

The chain rule and automatic differentiation

For a composed function \(y = f(g(x))\), the derivative is \(\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}\). In a neural network, \(f\) and \(g\) are layers, and the composition can be many levels deep. For a three-layer network \(y = f_3(f_2(f_1(x)))\), the chain rule extends to:

\[ \frac{\partial y}{\partial x} = \frac{\partial f_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial x} \]

Each factor in this product is a local derivative—computed at one layer using only that layer’s inputs and outputs. This locality is what makes the algorithm tractable: the entire network never needs to be differentiated as a monolithic function. Instead, each layer computes its own local derivative during the backward pass and multiplies it by the gradient flowing in from the layer above.

Modern frameworks use reverse-mode automatic differentiation, which computes gradients for all \(P\) parameters in a single backward pass. The key insight is that starting from the output and working backward (reverse mode) requires one pass regardless of the number of parameters, whereas starting from each input and working forward (forward mode) would require \(P\) passes—one per parameter. This is why a training step is a small constant multiple of inference, commonly about 2–3\(\times\) a forward pass for dense networks, rather than \(P\) passes.

The backpropagation algorithm

Backpropagation³ implements the chain rule efficiently through two passes: forward to compute outputs, backward to compute gradients. Figure 2 illustrates this process for a simple two-layer network, with the forward pass (black arrows) computing outputs and the backward pass (red dashed arrows) propagating gradients.

³ Backpropagation: Short for “backward propagation of errors.” The algorithm was independently discovered multiple times—by Werbos (1974) and Linnainmaa (1970) for reverse-mode differentiation and by Rumelhart et al. (1986) for neural network training. Its key insight is that computing gradients for all parameters requires only one backward pass through the graph, making training cost roughly 2–3\(\times\) inference rather than \(P\times\) (once per parameter).

Werbos, Paul. 1974. “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.” PhD thesis, Harvard University.

Linnainmaa, Seppo. 1970. “The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion of the Local Rounding Errors.” Master's thesis, University of Helsinki.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.

The diagram lays out a simple two-layer network with both passes annotated, providing a roadmap for the forward and backward passes that follow.

Forward pass

Start at \(x\), the input. Multiply by \(W_1\) to get hidden activation \(h\). Cache \(h\) because the backward pass will need it later. Multiply \(h\) by \(W_2\) to get output \(y\). Cache \(y\). Compare \(y\) to the target label to compute loss \(\mathcal{L}\).

At this point, the loss has been computed and memory contains the input \(x\), the cached activation \(h\), the cached output \(y\), and the loss \(\mathcal{L}\). For a large model, these cached activations dominate memory usage.

Backward pass

Now trace backward from \(\mathcal{L}\). The loss function provides \(\frac{\partial \mathcal{L}}{\partial y}\), the gradient of loss with respect to the prediction. This is where the error signal enters the network.

Since \(y = h \cdot W_2\), the chain rule gives us two gradients at this layer:

\[ \frac{\partial \mathcal{L}}{\partial W_2} = h^T \cdot \frac{\partial \mathcal{L}}{\partial y} \qquad \text{(weight gradient—used to update } W_2\text{)} \]

\[ \frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial y} \cdot W_2^T \qquad \text{(input gradient—passed backward to the previous layer)} \]

Notice that computing \(\frac{\partial \mathcal{L}}{\partial W_2}\) requires the cached activation \(h\) from the forward pass. This is the fundamental reason activations must be stored: every layer’s weight gradient depends on that layer’s input.

Now continue backward to the first layer. Since \(h = x \cdot W_1\), the same pattern gives: \[ \frac{\partial \mathcal{L}}{\partial W_1} = x^T \cdot \frac{\partial \mathcal{L}}{\partial h} \]

Each step backward requires two things: the gradient flowing in from above \(\left(\frac{\partial \mathcal{L}}{\partial h}\right)\) and the activation cached during the forward pass (\(x\)). This is why the backward pass costs roughly 2\(\times\) the forward pass in compute: at each layer, it performs two matrix multiplications (one for the weight gradient, one for the input gradient) vs. one in the forward pass.

The true cost of training memory

A common mistake is to assume that training memory equals model size. This assumption leads to immediate OOM errors because it ignores three other components that often dwarf the weights themselves. The actual memory footprint of training is: \[ M_{\text{total}} = M_{\text{weights}} + M_{\text{gradients}} + M_{\text{optimizer}} + M_{\text{activations}} \]

For a standard Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba 2014) in mixed precision (Micikevicius et al. 2017):

Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” ICLR in press.

Micikevicius, Paulius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, et al. 2017. “Mixed Precision Training.” arXiv Preprint arXiv:1710.03740.

Weights: 2 bytes (FP16/BF16) or 4 bytes (FP32).
Gradients: Same size as weights.
Optimizer State: Adam moments require 8 bytes per parameter; an FP32 master copy adds another 4 bytes, giving 8–12 bytes per parameter.
Activations: The hidden giant. A recomputation- or tiled-attention-friendly estimate scales as \(\mathcal{O}(B \times S \times N_L \times d)\); a full no-recompute attention implementation also materializes attention-score tensors that scale with \(S^2\).

To see how these components interact in practice, consider a concrete model.

Napkin Math 1.1: Worked example: GPT-2 (1.5B) training memory

The model: GPT-2 XL has \(P =\) \(1.5 \times 10^{9}\) parameters, 48 layers, hidden dimension \(d =\) 1600.

Model state (fixed per step):

Weights (BF16): \(1.5 \times 10^{9}\) \(\times\) 2 bytes = 3 GB
Gradients (BF16): \(1.5 \times 10^{9}\) \(\times\) 2 bytes = 3 GB
Optimizer (FP32 master + momentum + variance): \(1.5 \times 10^{9}\) \(\times\) 12 bytes = 18 GB
Total model state: 24 GB—fits on a 80 GB-class A100/H100 (85.9 GB in decimal units), but leaves only 61.9 GB for activations.

Activations (scale with batch):

Per-layer retained activations for this quick estimate are approximately \(12 \times B \times S \times d\) BF16 elements, or \(12 \times B \times S \times d \times 2\) bytes, where \(B\) is batch size and \(S\) is sequence length. The factor twelve accounts for the major intermediate tensors retained for backpropagation: input activations, QKV projections (\(3d\)), attention output, FFN intermediate (\(4d\)), and layer norm/dropout masks. This estimate assumes the implementation does not retain the full \(B \times H \times S^2\) attention-score tensor, as in recomputation- or tiled-attention-friendly accounting. With 48 layers, batch size 8, and sequence length 1024:

\[48 \times 12 \times 8 \times 1024 \times 1600 \times 2\text{ bytes } \approx 15.1 GB\]

Total: ~39.1 GB under this activation-accounting convention, which fits on one 85.9 GB accelerator. However, increase the batch to 64 and activations grow to ~120.8 GB, exceeding the remaining capacity. This is the threshold where gradient checkpointing (trading ~33 percent more compute for \(\mathcal{O}(\sqrt{N_L})\) activation memory) becomes necessary.

Activation explosion

While weights are fixed (\(\mathcal{O}(P)\)), activations grow linearly with batch size and sequence length. As the worked example shows, activation memory can quickly exceed the room left after model state. Gradient checkpointing⁴ reduces this pressure by storing only selected activations and recomputing the rest during backpropagation (Chen et al. 2016); techniques such as FlashAttention address related attention-memory bottlenecks by tiling attention to reduce memory round-trips (Dao et al. 2022).

⁴ Gradient Checkpointing: Trades compute for memory. Instead of storing all activations, only a subset (checkpoints) is kept, and the missing ones are recomputed during the backward pass. This reduces memory usage from storing every layer activation to roughly the square root of the layer count at the cost of ~33 percent more compute.

Chen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. “Training Deep Nets with Sublinear Memory Cost.” arXiv Preprint arXiv:1604.06174.

Dao, T., D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35 35: 16344–59. https://doi.org/10.52202/068431-1189.

The same decomposition provides a quick check on whether a proposed training run is feasible.

Checkpoint 1.1: Training memory estimation

A model has one billion parameters and is trained with Adam in mixed precision (FP16 weights, FP32 optimizer states). Without activations, how many GB of memory do the weights, gradients, and optimizer states require?
If the same model processes batch size 32 with sequence length 2048 and 24 layers of hidden dimension 1024, would you expect activations to be larger or smaller than the nonactivation memory? Why?
How does gradient checkpointing reduce activation memory, and what is the trade-off?

Computational graphs and optimization

The dependency structure exposed by backpropagation also gives compilers something to optimize. ML compilers represent models as directed acyclic graphs (DAGs), and that representation enables hardware-independent transformations.

Static single assignment

Compilers transform graphs into static single-assignment (SSA) form, where each variable is assigned exactly once. This makes data dependencies explicit, enabling safe optimizations—most importantly, operator fusion.

Operator fusion

Without fusion, each operation in a chain like MatMul → Add (bias) → ReLU produces an intermediate tensor that is written to High Bandwidth Memory (HBM) and then read back for the next operation. For elementwise operations like Add and ReLU, the compute is trivial (one FLOP per element) but the memory traffic is not (read the tensor, write it back). The arithmetic intensity of unfused elementwise operations is therefore close to zero—deeply memory bound.

Fusion combines consecutive operations into a single kernel that reads the input once, applies all operations in registers or shared memory, and writes the final result once. For a sequence of \(k\) elementwise operations on a tensor of size \(N\) bytes, fusion reduces memory traffic from \(2kN\) bytes (each op reads and writes) to \(2N\) bytes (one read, one write)—a \(k\times\) reduction.

FlashAttention is the most impactful fusion in modern ML: it fuses the entire attention computation (Q\(\cdot\)K\(^T\) scaling, masking, softmax, dropout, and V multiplication) into a single kernel that operates on tiles in SRAM, reducing attention memory from \(\mathcal{O}(S^2)\) to \(\mathcal{O}(S)\) and achieving 2–4\(\times\) wall-clock speedups by eliminating the large intermediate attention matrix that would otherwise be written to and read from HBM.

Together, the linear algebra foundations, tensor programming mechanics, and training memory model covered in this appendix form the algorithmic substrate on which all ML systems are built.

Summary

Key Takeaways: When 'hardware-bound' is really algorithm-bound

GEMM arithmetic intensity: It scales as \(n/3\) for square \(n{\times}n\) matrices, so small matrices are memory bound and waste most of the hardware’s compute capability. Increasing batch size or aligning dimensions to hardware tile sizes are the primary levers for improving throughput.
Sparse storage formats: They reduce memory once density falls below the metadata break-even point (roughly 50 percent density for CSR with FP32 values and INT32 indices), but sparse-kernel performance usually needs much higher sparsity, often 90–95 percent, because irregular index traversal costs throughput.
Tensor layout bugs are silent: Tensor shapes, strides, and broadcasting are the source of many subtle bugs: a noncontiguous tensor can silently degrade kernel performance by orders of magnitude, and an incorrect broadcast can produce a result of the wrong shape without raising an error.
Training memory is activation-sensitive: Training memory is dominated by four components—weights, gradients, optimizer states, and activations—and the last of these grows with batch size and sequence length, often exceeding the model’s weight footprint by 10–50\(\times\).
Backpropagation trades memory for compute: Backpropagation’s efficiency comes from computing all parameter gradients in a single backward pass, but it requires caching forward-pass activations, creating a fundamental memory–compute trade-off that gradient checkpointing partially resolves.
Computational graphs enable compiler optimization: Computational graph representations enable compiler optimizations like operator fusion, which eliminates memory round-trips between consecutive operations and can dramatically improve throughput for memory-bound workloads.