Module 06: Autograd

Gradient computation is the single largest source of memory growth during training. Autograd’s choices about what to cache during the forward pass, what to recompute on the backward, and when to release tensors decide whether a model fits in VRAM. The runtime you build here is the memory manager for the rest of the system.

Note: Module Info

FOUNDATION TIER | Difficulty: ●●●○ | Time: 6-8 hours | Prerequisites: 01-05

You need to be fluent with everything from Modules 01–05:

  • Tensor operations (matmul, broadcasting, reductions)
  • Activation functions (the source of non-linearity)
  • Neural network layers (what gradients will flow through)
  • Loss functions (the scalar gradients flow back from)
  • DataLoader for batched iteration

If you can hand-compute a forward pass through a small network and explain why we minimize loss, you’re ready.


Overview

A neural network learns by nudging every parameter in the direction that lowers the loss. To find that direction you need a gradient — one number per parameter. A modern model has billions of parameters, so deriving those gradients by hand is not just tedious, it is impossible. Every framework you have ever used — PyTorch, TensorFlow, JAX — solves this with the same trick: automatic differentiation.

In this module you build reverse-mode autograd from scratch. The forward pass records each operation into a small graph; loss.backward() walks that graph in reverse, applying the chain rule one operation at a time. When you finish, calling loss.backward() on your tensors does the same thing it does in PyTorch — and you will know exactly why.

This is the conceptually hardest module in the Foundation tier. It is also the one that unlocks everything that follows: optimizers, training loops, and any model that learns from data.

Learning Objectives

Tip: By completing this module, you will:
  • Implement the Function base class that enables gradient computation for all operations
  • Build computation graphs that track dependencies between tensors during forward pass
  • Master the chain rule by implementing backward passes for arithmetic, matrix multiplication, and reductions
  • Understand memory trade-offs between storing intermediate values and recomputing forward passes
  • Connect your autograd implementation to PyTorch’s design patterns and production optimizations

What You’ll Build

Figure 1: TinyTorch Autograd Engine: Reverse-mode automatic differentiation infrastructure.

Implementation roadmap:

Table 1 lays out the implementation in order, one part at a time.

Table 1: Implementation roadmap for the reverse-mode autograd engine.
| Part | What You’ll Implement | Key Concept |
|------|-----------------------|-------------|
| 1 | Function base class | Storing inputs for backward pass |
| 2 | AddBackward, MulBackward, MatmulBackward | Operation-specific gradient rules |
| 3 | backward() method on Tensor | Reverse-mode differentiation |
| 4 | enable_autograd() enhancement | Monkey-patching operations for gradient tracking |
| 5 | Integration tests | Multi-layer gradient flow |

The pattern you’ll enable:

# Automatic gradient computation
x = Tensor([2.0], requires_grad=True)
y = x * 3 + 1  # y = 3x + 1
y.backward()   # Computes dy/dx = 3 automatically
print(x.grad)  # [3.0]

What You’re NOT Building (Yet)

To keep this module focused, you will not implement:

  • Higher-order derivatives (gradients of gradients)—PyTorch supports this with create_graph=True
  • Graph retention across multiple backward passes—your graphs are traversed once and rebuilt on the next forward pass (PyTorch supports reuse with retain_graph=True)
  • GPU kernel fusion—PyTorch’s JIT compiler optimizes backward pass operations
  • Checkpointing for memory efficiency—that’s an advanced optimization technique

You are building the core gradient engine. Advanced optimizations come in production frameworks.

API Reference

This section documents the autograd components you’ll build. These integrate with the existing Tensor class from Module 01.

Function Base Class

Function(*tensors)

Base class for all differentiable operations. Every operation (addition, multiplication, etc.) inherits from Function and implements gradient computation rules.
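A minimal sketch of what this base class can look like: the saved_tensors attribute and apply method match the listings later in this module, while everything else is an illustrative assumption rather than the reference implementation.

```python
class Function:
    """Base class sketch for differentiable operations."""

    def __init__(self, *tensors):
        # Hold hard references to the inputs. This is the "activation
        # pinning" discussed below: these tensors cannot be freed until
        # the graph (and this object) is destroyed.
        self.saved_tensors = tensors

    def apply(self, grad_output):
        # Subclasses return one gradient per saved input.
        raise NotImplementedError
```

Every concrete backward class (AddBackward, MulBackward, and so on) overrides apply with its operation’s local derivative.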

Core Function Classes

Table 2 lists the backward Function classes and the gradient rule each one applies.

Table 2: Backward Function classes and their gradient rules.
| Class | Purpose | Gradient Rule |
|-------|---------|---------------|
| AddBackward | Addition gradients | ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 |
| SubBackward | Subtraction gradients | ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 |
| MulBackward | Multiplication gradients | ∂(ab)/∂a = b, ∂(ab)/∂b = a |
| DivBackward | Division gradients | ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b² |
| MatmulBackward | Matrix multiplication gradients | ∂(A@B)/∂A = grad@B.T, ∂(A@B)/∂B = A.T@grad |
| SumBackward | Reduction gradients | ∂sum(a)/∂a[i] = 1 for all i |
| ReshapeBackward | Shape manipulation | ∂(X.reshape(…))/∂X = grad.reshape(X.shape) |
| TransposeBackward | Transpose gradients | ∂(X.T)/∂X = grad.T |

Additional Backward Classes: The implementation includes backward functions for activations (ReLUBackward, SigmoidBackward, SoftmaxBackward, GELUBackward), losses (MSEBackward, BCEBackward, CrossEntropyBackward), and other operations (PermuteBackward, SliceBackward). These follow the same pattern as the core classes above.

Enhanced Tensor Methods

Your implementation adds these methods to the Tensor class:

Table 3 lists the new methods autograd adds to the Tensor class.

Table 3: Methods added to the Tensor class for autograd.
| Method | Signature | Description |
|--------|-----------|-------------|
| backward | backward(gradient=None) -> None | Compute gradients via backpropagation |
| zero_grad | zero_grad() -> None | Reset gradients to None |

Global Activation

Table 4 lists the global helpers that toggle gradient tracking.

Table 4: Global helper functions for enabling autograd.
| Function | Signature | Description |
|----------|-----------|-------------|
| enable_autograd | enable_autograd(quiet=False) -> None | Activate gradient tracking globally |

Core Concepts

This section covers the fundamental ideas behind automatic differentiation. Understanding these concepts deeply will help you debug gradient issues in any framework, not just TinyTorch.

Computation Graphs

A computation graph is a directed acyclic graph (DAG): nodes are tensors, edges are the operations that produced them. When you write y = x * 3 + 1, you build a graph with three tensor nodes (x, temp, y) and two operation edges (multiply, add). You don’t see this graph because autograd builds it for you, silently, as a side effect of running the forward pass.

The construction trick is small but powerful: every tensor produced by an operation stores a reference to the operation that produced it. That reference — _grad_fn in your implementation, grad_fn in PyTorch — is the entire graph. To traverse the graph backward you just follow _grad_fn pointers until you reach the leaves.

Forward Pass:  x → [Mul(*3)] → temp → [Add(+1)] → y
Backward Pass: grad_x ← [MulBackward] ← grad_temp ← [AddBackward] ← grad_y

Each backward node also has to remember the values it will need later. For z = a * b, the gradient with respect to a is grad_z * b — so the multiply operation must hold on to b from the forward pass. This is the central memory trade-off of autograd: every saved tensor is bytes you cannot reclaim until backward runs, but those saved tensors are exactly what makes the backward pass cheap.

Your implementation tracks graphs with the _grad_fn attribute:

The code in Listing 6.1 makes this concrete.

class AddBackward(Function):
    """Gradient computation for addition."""

    def __init__(self, a, b):
        """Store inputs needed for backward pass."""
        self.saved_tensors = (a, b)

    def apply(self, grad_output):
        """Compute gradients for both inputs."""
        return grad_output, grad_output  # Addition distributes gradients equally

Listing 6.1 — AddBackward stores inputs and distributes the incoming gradient equally to both addends.

When you compute z = x + y, your enhanced Tensor class automatically creates an AddBackward instance and attaches it to z:

result = x.data + y.data
result_tensor = Tensor(result)
result_tensor._grad_fn = AddBackward(x, y)  # Track operation

This simple pattern scales elegantly, enabling arbitrarily complex computation graphs. However, this flexibility masks a profound structural cost: the mere act of recording operations fundamentally alters the memory lifecycle of the tensors involved.

Note (Systems Implication): Activation Pinning

By passing (x, y) into AddBackward(x, y), the computation graph captures a hard reference to the input tensors. To the Python interpreter, this means the Garbage Collector cannot free the memory for x and y until the backward pass is complete and the graph is destroyed. This “Activation Pinning” is the root cause of CUDA Out of Memory (OOM) errors during training, as every layer’s output is kept alive in VRAM for the entirety of the forward pass. In practice, researchers rely on telemetry tools like torch.cuda.memory_allocated() to hunt down subtle computation graph memory leaks and measure the exact overhead of retaining these activations.

Because pinned activations consume VRAM for the entire forward pass, scaling networks to hundreds of layers requires deliberate strategies—such as gradient checkpointing—to avoid memory exhaustion while preserving correct gradient flow.

The Chain Rule

Backpropagation is the chain rule, applied one node at a time. For a composite function z = f(g(x)), the chain rule says dz/dx = (dz/dg) * (dg/dx). Reverse-mode autograd flips this on its head: instead of multiplying derivatives left-to-right, you walk the graph backward and multiply right-to-left, so every intermediate gradient is computed exactly once and reused everywhere it is needed downstream.

When the graph has multiple paths from a parameter to the loss, the gradients along each path add. This is why a shared embedding table — used a hundred times in a transformer — ends up with the sum of contributions from all hundred uses. You get this for free; the recursion described below visits every path naturally.
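A standalone numeric sketch of this path-summing behavior (plain NumPy, no Tensor class, to keep the arithmetic visible):

```python
import numpy as np

# z = (x * 3) + (x * 2): two paths from x to z, so dz/dx = 3 + 2 = 5.
x = np.array([4.0])
grad_z = np.ones_like(x)          # seed: dz/dz = 1

grad_path1 = grad_z * 3.0         # contribution through the x*3 branch
grad_path2 = grad_z * 2.0         # contribution through the x*2 branch

grad_x = grad_path1 + grad_path2  # gradients add at the shared leaf
print(grad_x)                     # [5.]
```

The shared-embedding case in a transformer is this same accumulation, repeated once per use of the table.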

Consider this computation: loss = (x * W + b)²

Forward:  x → [Mul(W)] → z1 → [Add(b)] → z2 → [Square] → loss

Backward chain rule:
  ∂loss/∂z2 = 2*z2              (square backward)
  ∂loss/∂z1 = ∂loss/∂z2 * 1     (addition backward)
  ∂loss/∂x  = ∂loss/∂z1 * W     (multiplication backward)

Each backward function does exactly one thing: multiply the incoming gradient by its own local derivative. It knows nothing about the rest of the graph — and it doesn’t need to. Here’s how MulBackward implements this:

The code in Listing 6.2 makes this concrete.

class MulBackward(Function):
    """Gradient computation for element-wise multiplication."""

    def apply(self, grad_output):
        """
        For z = a * b:
        ∂z/∂a = b → grad_a = grad_output * b
        ∂z/∂b = a → grad_b = grad_output * a

        Uses vectorized element-wise multiplication (NumPy broadcasting).
        """
        a, b = self.saved_tensors
        grad_a = grad_b = None

        if a.requires_grad:
            grad_a = grad_output * b.data  # Vectorized element-wise multiplication

        if b.requires_grad:
            grad_b = grad_output * a.data  # NumPy handles broadcasting automatically

        return grad_a, grad_b

Listing 6.2 — MulBackward applies the product rule with NumPy broadcasting instead of explicit loops.

Each operation knows only its own derivative; the chain rule does the connecting. NumPy handles the element-wise math in optimized C, so no explicit loops are needed.
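A central-difference check is the standard way to validate a hand-written backward rule like this one. Here is a small, self-contained version of that check for the product rule:

```python
import numpy as np

# For z = a * b (element-wise), the analytic gradient of z_i wrt a_i is b_i.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(3), rng.standard_normal(3)
eps = 1e-6

analytic = b                       # what MulBackward would return (grad_output = 1)

numeric = np.empty_like(a)
for i in range(3):
    a_plus = a.copy();  a_plus[i] += eps
    a_minus = a.copy(); a_minus[i] -= eps
    numeric[i] = ((a_plus * b)[i] - (a_minus * b)[i]) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```

The same pattern (perturb one input element, re-run forward, difference the outputs) works for any backward class in this module.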

Backward Pass Implementation

The backward pass walks the computation graph in reverse, computing gradients for every tensor it visits. Your backward() method does this as a recursive tree walk — short enough to read in one sitting, but enough to support arbitrarily deep networks:

The code in Listing 6.3 makes this concrete.

def backward(self, gradient=None):
    """Compute gradients via backpropagation."""
    if not self.requires_grad:
        return

    # Initialize gradient for scalar outputs
    if gradient is None:
        if self.data.size == 1:
            gradient = np.ones_like(self.data)
        else:
            raise ValueError("backward() requires gradient for non-scalar tensors")

    # Accumulate gradient (vectorized NumPy operation)
    if self.grad is None:
        self.grad = np.zeros_like(self.data)
    self.grad += gradient

    # Propagate to parent tensors
    if hasattr(self, '_grad_fn') and self._grad_fn is not None:
        grads = self._grad_fn.apply(gradient)  # Compute input gradients using vectorized ops

        for tensor, grad in zip(self._grad_fn.saved_tensors, grads):
            if isinstance(tensor, Tensor) and tensor.requires_grad and grad is not None:
                tensor.backward(grad)  # Recursive call

Listing 6.3 — Tensor.backward() seeds the output gradient, accumulates into .grad, and recurses through the _grad_fn chain.

For a 100-layer network, loss.backward() triggers 100 recursive calls — one per layer — flowing gradients from output to input. The traversal is recursive Python; the math inside each apply() is vectorized NumPy. That split is why the system stays both readable and fast.

The gradient argument deserves a closer look. For scalar losses (the typical case) you call loss.backward() with no arguments and the method seeds the gradient to 1.0 — because ∂loss/∂loss = 1. For non-scalar outputs you must pass the upstream gradient explicitly; there is no canonical scalar to seed from, and silently picking one would hide bugs.
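A quick NumPy sketch of why the seed matters for non-scalar outputs: backward computes a vector-Jacobian product, so the caller’s choice of upstream gradient directly scales the result.

```python
import numpy as np

# For the non-scalar output y = 2 * x there is no canonical seed.
# The caller supplies an upstream gradient v; backward then returns
# the vector-Jacobian product. The local Jacobian here is 2*I,
# so grad_x = 2 * v.
x = np.array([1.0, 2.0, 3.0])
v = np.array([0.1, 1.0, 10.0])  # upstream gradient, chosen by the caller
grad_x = 2.0 * v
print(grad_x)                   # equivalent to y.backward(v) for y = 2 * x
```

Calling y.sum().backward() is the common shortcut: summing first produces a scalar whose implicit seed of 1.0 corresponds to v = ones.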

Gradient Accumulation

Gradient accumulation is the same feature seen from two sides. Call backward() twice on the same tensor and the gradients add — by design. That’s what lets you split a batch that doesn’t fit in memory into chunks, run them sequentially, and end up with the same gradient as if you’d processed them all at once:

# Large batch (doesn't fit in memory)
for mini_batch in split_batch(large_batch, chunks=4):
    loss = model(mini_batch)
    loss.backward()  # Gradients accumulate in model parameters

# Now gradients equal the sum over the entire large batch
optimizer.step()
model.zero_grad()  # Reset for next iteration

Without this behavior you’d have to store every mini-batch gradient and sum them yourself. With it, the autograd system does the bookkeeping.
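The equivalence can be checked in a few lines with a toy sum-of-squares loss (plain NumPy; the variable names are illustrative, not part of the TinyTorch API):

```python
import numpy as np

# Gradient of loss = sum_i (w * x_i)^2 with respect to w, computed once
# over the whole batch and again accumulated over 4 sequential chunks.
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
w = 0.5

full_grad = (2.0 * w * x**2).sum()            # whole batch at once

accum_grad = 0.0
for chunk in np.split(x, 4):                  # 4 mini-batches of 8
    accum_grad += (2.0 * w * chunk**2).sum()  # each backward() adds in

assert np.isclose(full_grad, accum_grad)
```

The equivalence is exact for sum-reduced losses; with a mean-reduced loss each chunk’s contribution must be rescaled by chunk_size / batch_size.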

The flip side: accumulation becomes a silent bug the moment you forget to call zero_grad() between iterations.

# WRONG: Gradients accumulate across iterations
for batch in dataloader:
    loss = model(batch)
    loss.backward()  # Gradients keep adding!
    optimizer.step()  # Updates use accumulated gradients from all previous batches

# CORRECT: Zero gradients after each update
for batch in dataloader:
    model.zero_grad()  # Reset gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()

Your zero_grad() implementation is simple but crucial:

def zero_grad(self):
    """Reset gradients to None."""
    self.grad = None

Setting to None instead of zeros saves memory: NumPy doesn’t allocate arrays until you accumulate the first gradient.
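A short sketch of this lazy-allocation behavior, mirroring the accumulation step in backward():

```python
import numpy as np

# .grad stays None until the first gradient arrives; only then is a
# full-size buffer allocated, exactly as in backward()'s accumulation step.
grad = None

incoming = np.ones((512, 768), dtype=np.float32)
if grad is None:
    grad = np.zeros_like(incoming)  # first (and only) allocation
grad += incoming

print(grad.nbytes)  # 1572864 bytes, paid only once a gradient exists
```

Between zero_grad() and the next backward(), every parameter’s gradient buffer costs nothing.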

Memory Management in Autograd

Autograd’s memory footprint comes from two sources: stored intermediate tensors and gradient storage. For a forward pass through an N-layer network, you store roughly N intermediate activations. During backward pass, you store gradients for every parameter.

Consider a simple linear layer: y = x @ W + b

Forward pass stores:

  • x (needed for computing grad_W = x.T @ grad_y)
  • W (needed for computing grad_x = grad_y @ W.T)

Backward pass allocates:

  • grad_x (same shape as x)
  • grad_W (same shape as W)
  • grad_b (same shape as b)

For a batch of 32 samples through a (512, 768) linear layer, the memory breakdown is:

Forward storage:
  x:       32 × 512 × 4 bytes =     64 KB
  W:      512 × 768 × 4 bytes =  1,536 KB

Backward storage:
  grad_x:  32 × 512 × 4 bytes =     64 KB
  grad_W: 512 × 768 × 4 bytes =  1,536 KB
  grad_b:       768 × 4 bytes =      3 KB

Total: ~3.1 MB for one layer (2× parameter size + activation size)

Multiply by network depth and you see why memory limits batch size. A 100-layer transformer stores 100× the activations, which can easily exceed GPU memory.
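The per-layer tally above can be reproduced with a few lines of arithmetic (float32, batch 32, a 512-to-768 layer):

```python
# Memory tally for one linear layer y = x @ W + b (float32, batch 32).
FLOAT32 = 4  # bytes per element
batch, d_in, d_out = 32, 512, 768

forward = {"x": batch * d_in, "W": d_in * d_out}
backward = {"grad_x": batch * d_in, "grad_W": d_in * d_out, "grad_b": d_out}

total_bytes = (sum(forward.values()) + sum(backward.values())) * FLOAT32
print(f"{total_bytes / 1024:.0f} KB")  # 3203 KB, roughly 3.1 MB
```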

Production frameworks mitigate this with gradient checkpointing: they discard intermediate activations during forward pass and recompute them during backward pass. This trades compute (recomputing activations) for memory (not storing them). Your implementation doesn’t do this—it’s an advanced optimization—but understanding the trade-off is essential.

The implementation shows this memory overhead clearly in the MatmulBackward class:

The code in Listing 6.4 makes this concrete.

class MatmulBackward(Function):
    """
    Gradient computation for matrix multiplication.

    For Z = A @ B:
    - Must store A and B during forward pass
    - Backward computes: grad_A = grad_Z @ B.T and grad_B = A.T @ grad_Z
    - Uses vectorized NumPy operations (np.matmul, np.swapaxes)
    """

    def apply(self, grad_output):
        a, b = self.saved_tensors  # Retrieved from memory
        grad_a = grad_b = None

        if a.requires_grad:
            # Vectorized transpose and matmul (no explicit loops)
            b_T = np.swapaxes(b.data, -2, -1)
            grad_a = np.matmul(grad_output, b_T)

        if b.requires_grad:
            # Vectorized operations for efficiency
            a_T = np.swapaxes(a.data, -2, -1)
            grad_b = np.matmul(a_T, grad_output)

        return grad_a, grad_b

Listing 6.4 — MatmulBackward turns one forward matmul into two backward matmuls, the source of the 2× backward-to-forward FLOP ratio.

Notice that both a and b must be saved during forward pass. For large matrices, this storage cost dominates memory usage. All gradient computations use vectorized NumPy operations, which are implemented in optimized C/Fortran code under the hood—no explicit Python loops are needed.
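The grad_A rule can be validated the same way as the element-wise rules, with a finite-difference check; using loss = Z.sum() makes grad_Z all ones:

```python
import numpy as np

# Numerically verify grad_A = grad_Z @ B.T for Z = A @ B.
rng = np.random.default_rng(1)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
grad_Z = np.ones((2, 4))          # d(Z.sum())/dZ
analytic = grad_Z @ B.T           # what MatmulBackward computes

eps = 1e-6
numeric = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        numeric[i, j] = ((Ap @ B).sum() - (Am @ B).sum()) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```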

Production Context

Your Implementation vs. PyTorch

Your autograd system and PyTorch’s share the same design: computation graphs built during forward pass, reverse-mode differentiation during backward pass, and gradient accumulation in parameter tensors. The differences are in scale and optimization.

Table 5 places your implementation side by side with the production reference for direct comparison.

Table 5: Feature comparison between TinyTorch autograd and PyTorch autograd.
| Feature | Your Implementation | PyTorch |
|---------|---------------------|---------|
| Graph Building | Python objects, _grad_fn attribute | C++ objects, compiled graph |
| Memory | Stores all intermediates | Gradient checkpointing, memory pools |
| Speed | Pure Python, NumPy backend | C++/CUDA, fused kernels |
| Operations | 10 backward functions | 2000+ optimized backward functions |
| Debugging | Direct Python inspection | torch.autograd.profiler, graph visualization |

Code Comparison

The following comparison shows identical conceptual patterns in TinyTorch and PyTorch. The APIs mirror each other because both implement the same autograd algorithm.

# TinyTorch
from tinytorch import Tensor

# Create tensors with gradient tracking
x = Tensor([[1.0, 2.0]], requires_grad=True)
W = Tensor([[3.0], [4.0]], requires_grad=True)

# Forward pass builds computation graph
y = x.matmul(W)  # y = x @ W
loss = (y * y).sum()  # loss = sum(y²)

# Backward pass computes gradients
loss.backward()

# Access gradients
print(f"x.grad: {x.grad}")  # ∂loss/∂x
print(f"W.grad: {W.grad}")  # ∂loss/∂W
# PyTorch
import torch

# Create tensors with gradient tracking
x = torch.tensor([[1.0, 2.0]], requires_grad=True)
W = torch.tensor([[3.0], [4.0]], requires_grad=True)

# Forward pass builds computation graph
y = x @ W  # PyTorch uses @ operator
loss = (y * y).sum()

# Backward pass computes gradients
loss.backward()

# Access gradients
print(f"x.grad: {x.grad}")
print(f"W.grad: {W.grad}")

Let’s walk through the comparison step by step:

  • Tensor creation: Both frameworks use requires_grad=True to enable gradient tracking. This is an opt-in design: most tensors (data, labels) don’t need gradients; only parameters do.
  • Forward pass: Operations automatically build computation graphs. TinyTorch uses the .matmul() method; PyTorch supports both .matmul() and the @ operator.
  • Backward pass: A single method call triggers reverse-mode differentiation through the entire graph.
  • Gradient access: Both store gradients in the .grad attribute. Gradients have the same shape as the original tensor.
Tip: What’s Identical

Computation graph construction, chain rule implementation, and gradient accumulation semantics. When you debug PyTorch autograd issues, you’re debugging the same algorithm you implemented here.

Why Autograd Matters at Scale

The case for automating differentiation is overwhelming the moment you look at the numbers:

  • GPT-3: 175 billion parameters — that’s 175,000,000,000 gradients per training step.
  • Training cost: each backward pass takes roughly 2× the forward pass time (two matmuls instead of one per linear layer).
  • Memory: storing the computation graph for a transformer can require ~10× the model’s parameter footprint.

Hand-deriving gradients does not scale to any of this. Even a 3-layer MLP with a million parameters would take weeks to differentiate manually and would still contain bugs at the end. Autograd makes training tractable by automating the most error-prone part of deep learning — and it’s the single biggest reason the field moves as fast as it does.

Check Your Understanding

Tip: Check Your Understanding — Autograd

Test yourself with these systems thinking questions. They’re designed to build intuition for autograd’s performance characteristics and design decisions. If any of them feels fuzzy, revisit the Computation Graphs and Memory Management sections before moving on.

Q1: Computation Graph Memory

A 5-layer MLP processes a batch of 64 samples. Each layer stores its input activation for backward pass. Layer dimensions are: 784 → 512 → 256 → 128 → 10. How much memory (in MB) is used to store activations for one batch?

Layer 1 input: 64 × 784 × 4 bytes = 196 KB
Layer 2 input: 64 × 512 × 4 bytes = 128 KB
Layer 3 input: 64 × 256 × 4 bytes =  64 KB
Layer 4 input: 64 × 128 × 4 bytes =  32 KB
Layer 5 input: 64 ×  10 × 4 bytes = 2.5 KB

Total: ~422 KB ≈ 0.41 MB

That is per forward pass through a tiny MLP. A 100-layer transformer stores roughly 100× this — and with much wider layers — which is why gradient checkpointing trades compute for memory by recomputing activations during backward pass.
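The tally generalizes to any stack of layer widths; a few lines reproduce it:

```python
# Activation storage for the 5-layer MLP above (batch 64, float32).
layer_inputs = [784, 512, 256, 128, 10]   # input width stored by each layer
batch, FLOAT32 = 64, 4

total_bytes = sum(batch * width * FLOAT32 for width in layer_inputs)
print(f"{total_bytes / 1024:.1f} KB")     # 422.5 KB
```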

Q2: Backward Pass Complexity

A forward pass through a linear layer y = x @ W (where x is 32×512 and W is 512×256) takes 8ms. How long will the backward pass take?

Forward: 1 matmul (x @ W)

Backward: 2 matmuls

  • grad_x = grad_y @ W.T (32×256 @ 256×512)
  • grad_W = x.T @ grad_y (512×32 @ 32×256)

Backward takes ~2× forward time ≈ 16ms

This is why training (forward + backward) takes roughly 3× inference time. GPU parallelism and kernel fusion can reduce this, but the fundamental 2:1 ratio remains.

Q3: Gradient Accumulation Memory

You have 16GB GPU memory and a model with 1B parameters (float32). How much memory is available for activations and gradients during training?

Model parameters:        1B × 4 bytes =  4 GB
Gradients:               1B × 4 bytes =  4 GB
Optimizer state (Adam):  1B × 8 bytes =  8 GB   (momentum + variance)

Total framework overhead: 16 GB

Available for activations: 0 GB — you’ve already exceeded memory before storing a single activation.

This is why large models reach for gradient accumulation across multiple forward passes before updating parameters, or gradient checkpointing to shrink activation memory. The “2× parameter size” rule (params + grads) is a floor, not a ceiling — optimizers like Adam add more on top.
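The static-memory arithmetic above is worth scripting; here is the same tally as code:

```python
# Static training memory for a 1B-parameter float32 model.
params = 1_000_000_000
bytes_per_param = {
    "weights": 4,      # float32 parameters
    "grads": 4,        # one float32 gradient per parameter
    "adam_state": 8,   # Adam keeps momentum + variance, float32 each
}

total_gb = params * sum(bytes_per_param.values()) / 1e9
print(total_gb)  # 16.0 GB: the entire card, before a single activation
```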

Q4: requires_grad Performance

A typical training batch has: 32 images (input), 10M parameter tensors (weights), 50 intermediate activation tensors. If requires_grad defaults to True for all tensors, how many tensors unnecessarily track gradients?

Tensors that need gradients:

  • Parameters: 10M tensors

Tensors that don’t need gradients:

  • Input images: 32 tensors (data is not learned)
  • Intermediate activations: 50 tensors (needed for backward, but not updated)

32 input tensors would unnecessarily track gradients if requires_grad defaulted to True.

This is why PyTorch defaults requires_grad=False for new tensors and forces an explicit opt-in for parameters. For a batch of 32 images of shape 3×224×224, accidentally tracking gradients on the inputs wastes 4.8M float32 values × 4 bytes = ~18.4 MB per batch — for nothing.

Q5: Graph Retention

You forget to call zero_grad() before each training iteration. After 10 iterations, how do the gradients compare to correct training?

Gradients accumulate across all 10 iterations.

If correct gradient for iteration i is g_i, your accumulated gradient is: grad = g_1 + g_2 + g_3 + ... + g_10

Effects:

1. Magnitude: Gradients are ~10× larger than they should be.
2. Direction: The sum of 10 different gradients, which may not point toward the loss minimum.
3. Learning: Parameter updates use the wrong direction and wrong magnitude.
4. Result: Training diverges or oscillates instead of converging.

Bottom line: Always call zero_grad() at the start of each iteration (or after optimizer.step()).
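The effect is easy to see in a short simulation (plain NumPy standing in for parameter .grad buffers):

```python
import numpy as np

# Ten iterations whose per-step gradient is 0.1 everywhere,
# with zero_grad() never called: backward() keeps adding.
grad = np.zeros(3)
for _ in range(10):
    per_iter_grad = np.full(3, 0.1)
    grad += per_iter_grad          # accumulation, never reset

print(grad)  # roughly [1. 1. 1.], ~10x one iteration's gradient
```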

Key Takeaways

  • Computation graphs are dynamic and implicit: Every tensor produced by an operation stores a _grad_fn pointer; that pointer chain is the graph autograd walks in reverse.
  • Activation pinning is autograd’s memory tax: Saved tensors survive until backward() runs, which is why forward-pass VRAM usage scales linearly with depth and caps batch size long before parameter count does.
  • Backward costs 2× forward: Each linear layer turns one matmul into two (grad_x, grad_W), making training roughly 3× the cost of inference — a constant every capacity plan assumes.
  • zero_grad() is not optional: Gradient accumulation is the feature and the silent bug; skipping the reset compounds 10 gradients into one bad step.

Coming next: You can compute a gradient for every parameter — but a gradient alone is just a direction. Module 07 turns directions into updates by building SGD, Adam, and AdamW, the rules that decide how far to step.

Further Reading

For students who want to understand the academic foundations and mathematical underpinnings of automatic differentiation:

Seminal Papers

  • Automatic Differentiation in Machine Learning: a Survey - Baydin et al. (2018). Comprehensive survey of AD techniques, covering forward-mode, reverse-mode, and mixed-mode differentiation. Essential reading for understanding autograd theory.
    • Systems Implication: This work formalized the foundational memory-compute trade-off: reverse-mode AD minimizes compute at the steep cost of O(N) memory to store the forward activation graph, permanently dictating modern GPU VRAM requirements. arXiv:1502.05767
  • Automatic Differentiation of Algorithms - Griewank (1989). The foundational work on reverse-mode AD that underlies all modern deep learning frameworks. Introduces the mathematical formalism for gradient computation via the chain rule.
    • Systems Implication: Exposed the inherent sequential data dependencies during the backward pass. Because gradients must be propagated backwards sequentially from output to input, this creates a structural bottleneck that prevents perfect parallelization across network layers. Computational Optimization and Applications
  • PyTorch: An Imperative Style, High-Performance Deep Learning Library - Paszke et al. (2019). Describes PyTorch’s autograd implementation and design philosophy. Shows how imperative programming (define-by-run) enables dynamic computation graphs.
    • Systems Implication: By introducing dynamic computation graphs, PyTorch mandated that memory allocation and operator dispatch overhead occur strictly at runtime (eager execution). This paradigm shift required highly optimized C++ backends to aggressively hide Python’s interpretive latency. NeurIPS 2019


What’s Next

You can now compute a gradient for every parameter in any network you build. That gradient tells you which way is downhill — but it does not tell you how big a step to take, or how to dampen oscillations, or how to adapt the step size per-parameter. That is the optimizer’s job.

NoteComing Up: Module 07 — Optimizers

You’ll implement SGD, momentum, and Adam: the rules that turn the param.grad tensors produced by backward() into actual parameter updates. With autograd plus an optimizer, you have the entire machinery a training loop needs.

Preview — how your autograd gets used in the modules that follow:

Table 6 traces how this module is reused by later parts of the curriculum.

Table 6: How autograd feeds into subsequent optimizer and training modules.
| Module | What It Does | Your Autograd In Action |
|--------|--------------|-------------------------|
| 07: Optimizers | Update parameters using gradients | optimizer.step() uses param.grad computed by backward() |
| 08: Training | Complete training loops | loss.backward() → optimizer.step() → repeat |
| 12: Attention | Multi-head self-attention | Gradients flow through Q, K, V projections automatically |

Get Started

Warning: Save Your Progress

Binder sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.


Back to top