Module 06: Autograd
Gradient computation is the single largest source of memory growth during training. Autograd’s choices about what to cache during the forward pass, what to recompute on the backward, and when to release tensors decide whether a model fits in VRAM. The runtime you build here is the memory manager for the rest of the system.
FOUNDATION TIER | Difficulty: ●●●○ | Time: 6-8 hours | Prerequisites: 01-05
You need to be fluent with everything from Modules 01–05:
- Tensor operations (matmul, broadcasting, reductions)
- Activation functions (the source of non-linearity)
- Neural network layers (what gradients will flow through)
- Loss functions (the scalar gradients flow back from)
- DataLoader for batched iteration
If you can hand-compute a forward pass through a small network and explain why we minimize loss, you’re ready.
Overview
A neural network learns by nudging every parameter in the direction that lowers the loss. To find that direction you need a gradient — one number per parameter. A modern model has billions of parameters, so deriving those gradients by hand is not just tedious, it is impossible. Every framework you have ever used — PyTorch, TensorFlow, JAX — solves this with the same trick: automatic differentiation.
In this module you build reverse-mode autograd from scratch. The forward pass records each operation into a small graph; loss.backward() walks that graph in reverse, applying the chain rule one operation at a time. When you finish, calling loss.backward() on your tensors does the same thing it does in PyTorch — and you will know exactly why.
This is the conceptually hardest module in the Foundation tier. It is also the one that unlocks everything that follows: optimizers, training loops, and any model that learns from data.
Learning Objectives
- Implement the Function base class that enables gradient computation for all operations
- Build computation graphs that track dependencies between tensors during forward pass
- Master the chain rule by implementing backward passes for arithmetic, matrix multiplication, and reductions
- Understand memory trade-offs between storing intermediate values and recomputing forward passes
- Connect your autograd implementation to PyTorch’s design patterns and production optimizations
What You’ll Build
Implementation roadmap:
Table 1 lays out the implementation in order, one part at a time.
| Part | What You'll Implement | Key Concept |
|---|---|---|
| 1 | `Function` base class | Storing inputs for backward pass |
| 2 | `AddBackward`, `MulBackward`, `MatmulBackward` | Operation-specific gradient rules |
| 3 | `backward()` method on `Tensor` | Reverse-mode differentiation |
| 4 | `enable_autograd()` enhancement | Monkey-patching operations for gradient tracking |
| 5 | Integration tests | Multi-layer gradient flow |
The pattern you’ll enable:
```python
# Automatic gradient computation
x = Tensor([2.0], requires_grad=True)
y = x * 3 + 1    # y = 3x + 1
y.backward()     # Computes dy/dx = 3 automatically
print(x.grad)    # [3.0]
```
What You're NOT Building (Yet)
To keep this module focused, you will not implement:
- Higher-order derivatives (gradients of gradients) — PyTorch supports this with `create_graph=True`
- Dynamic computation graphs — your graphs are built during the forward pass only
- GPU kernel fusion — PyTorch's JIT compiler optimizes backward pass operations
- Checkpointing for memory efficiency — that's an advanced optimization technique
You are building the core gradient engine. Advanced optimizations come in production frameworks.
API Reference
This section documents the autograd components you’ll build. These integrate with the existing Tensor class from Module 01.
Function Base Class
`Function(*tensors)`

Base class for all differentiable operations. Every operation (addition, multiplication, etc.) inherits from `Function` and implements its gradient computation rule.
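The exact class lives in your Module 06 implementation; a minimal sketch consistent with the listings below (names match the listings, details may differ) looks like this:

```python
class Function:
    """Base class for differentiable operations (minimal sketch)."""

    def __init__(self, *tensors):
        # Saving inputs here is what pins activations in memory until backward
        self.saved_tensors = tensors

    def apply(self, grad_output):
        # Subclasses return one gradient per saved input (None if not required)
        raise NotImplementedError
```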
Core Function Classes
Table 2 lists the backward Function classes and the gradient rule each one applies.
| Class | Purpose | Gradient Rule |
|---|---|---|
| `AddBackward` | Addition gradients | ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 |
| `SubBackward` | Subtraction gradients | ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 |
| `MulBackward` | Multiplication gradients | ∂(ab)/∂a = b, ∂(ab)/∂b = a |
| `DivBackward` | Division gradients | ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b² |
| `MatmulBackward` | Matrix multiplication gradients | ∂(A@B)/∂A = grad @ B.T, ∂(A@B)/∂B = A.T @ grad |
| `SumBackward` | Reduction gradients | ∂sum(a)/∂a[i] = 1 for all i |
| `ReshapeBackward` | Shape manipulation | grad.reshape(X.shape) |
| `TransposeBackward` | Transpose gradients | grad.T |
Additional Backward Classes: The implementation includes backward functions for activations (ReLUBackward, SigmoidBackward, SoftmaxBackward, GELUBackward), losses (MSEBackward, BCEBackward, CrossEntropyBackward), and other operations (PermuteBackward, SliceBackward). These follow the same pattern as the core classes above.
Enhanced Tensor Methods
Your implementation adds these methods to the Tensor class:
Table 3 lists the new methods autograd adds to the Tensor class.
| Method | Signature | Description |
|---|---|---|
| `backward` | `backward(gradient=None) -> None` | Compute gradients via backpropagation |
| `zero_grad` | `zero_grad() -> None` | Reset gradients to `None` |
Global Activation
Table 4 lists the global helpers that toggle gradient tracking.
| Function | Signature | Description |
|---|---|---|
| `enable_autograd` | `enable_autograd(quiet=False) -> None` | Activate gradient tracking globally |
Core Concepts
This section covers the fundamental ideas behind automatic differentiation. Understanding these concepts deeply will help you debug gradient issues in any framework, not just TinyTorch.
Computation Graphs
A computation graph is a directed acyclic graph (DAG): nodes are tensors, edges are the operations that produced them. When you write y = x * 3 + 1, you build a graph with three tensor nodes (x, temp, y) and two operation edges (multiply, add). You don’t see this graph because autograd builds it for you, silently, as a side effect of running the forward pass.
The construction trick is small but powerful: every tensor produced by an operation stores a reference to the operation that produced it. That reference — _grad_fn in your implementation, grad_fn in PyTorch — is the entire graph. To traverse the graph backward you just follow _grad_fn pointers until you reach the leaves.
```
Forward Pass:  x → [Mul(*3)] → temp → [Add(+1)] → y
Backward Pass: grad_x ← [MulBackward] ← grad_temp ← [AddBackward] ← grad_y
```
Each backward node also has to remember the values it will need later. For z = a * b, the gradient with respect to a is grad_z * b — so the multiply operation must hold on to b from the forward pass. This is the central memory trade-off of autograd: every saved tensor is bytes you cannot reclaim until backward runs, but those saved tensors are exactly what makes the backward pass cheap.
Your implementation tracks graphs with the _grad_fn attribute:
The code in Listing 6.1 makes this concrete.

```python
class AddBackward(Function):
    """Gradient computation for addition."""

    def __init__(self, a, b):
        """Store inputs needed for backward pass."""
        self.saved_tensors = (a, b)

    def apply(self, grad_output):
        """Compute gradients for both inputs."""
        return grad_output, grad_output  # Addition distributes gradients equally
```

Listing 6.1 — AddBackward stores inputs and distributes the incoming gradient equally to both addends.
When you compute z = x + y, your enhanced Tensor class automatically creates an AddBackward instance and attaches it to z:

```python
result = x.data + y.data
result_tensor = Tensor(result)
result_tensor._grad_fn = AddBackward(x, y)  # Track operation
```

This simple pattern scales to arbitrarily complex computation graphs. But the flexibility has a structural cost: the act of recording operations changes the memory lifecycle of every tensor involved.
By passing (x, y) into AddBackward(x, y), the computation graph captures a hard reference to the input tensors. The Python garbage collector therefore cannot free the memory for x and y until the backward pass completes and the graph is destroyed. This "activation pinning" is the root cause of CUDA out-of-memory (OOM) errors during training: every layer's output is kept alive in VRAM for the entirety of the forward pass. In practice, researchers rely on telemetry such as torch.cuda.memory_allocated() to hunt down computation-graph memory leaks and measure the exact overhead of retained activations.

Because pinned activations accumulate across the entire forward pass, scaling networks to hundreds of layers requires deliberate strategies, such as gradient checkpointing, to keep memory in check without changing the gradients that get computed.
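You can watch the pinning happen. The sketch below is illustrative, assuming the Tensor and AddBackward classes from this module with autograd enabled; pure-Python objects support weak references out of the box:

```python
import weakref
import numpy as np

x = Tensor(np.ones((1024, 1024)), requires_grad=True)
y = Tensor(np.ones((1024, 1024)), requires_grad=True)
z = x + y                     # z._grad_fn = AddBackward(x, y)

probe = weakref.ref(x)        # observe x without keeping it alive
del x, y                      # drop our only explicit references
print(probe() is not None)    # True: the graph still pins x via z._grad_fn
del z                         # destroying the graph releases the inputs
print(probe() is None)        # True: the activation memory is reclaimable
```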
The Chain Rule
Backpropagation is the chain rule, applied one node at a time. For a composite function z = f(g(x)), the chain rule says dz/dx = (dz/dg) * (dg/dx). Reverse-mode autograd flips this on its head: instead of multiplying derivatives left-to-right, you walk the graph backward and multiply right-to-left, so every intermediate gradient is computed exactly once and reused everywhere it is needed downstream.
When the graph has multiple paths from a parameter to the loss, the gradients along each path add. This is why a shared embedding table — used a hundred times in a transformer — ends up with the sum of contributions from all hundred uses. You get this for free; the recursion described below visits every path naturally.
Consider this computation: loss = (x * W + b)²
```
Forward: x → [Mul(W)] → z1 → [Add(b)] → z2 → [Square] → loss

Backward chain rule:
∂loss/∂z2 = 2*z2              (square backward)
∂loss/∂z1 = ∂loss/∂z2 * 1     (addition backward)
∂loss/∂x  = ∂loss/∂z1 * W     (multiplication backward)
```
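A quick finite-difference check confirms the chain multiplies out correctly. Plain Python, with sample values (x=2, W=3, b=1) chosen for illustration:

```python
W, b = 3.0, 1.0
eps = 1e-6

def loss(x):
    return (x * W + b) ** 2

x = 2.0
analytic = 2 * (x * W + b) * W                         # chain rule: 2*z2 * 1 * W = 42
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)  # central difference
print(analytic, round(numeric, 3))                     # 42.0 42.0
```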
Each backward function does exactly one thing: multiply the incoming gradient by its own local derivative. It knows nothing about the rest of the graph — and it doesn’t need to. Here’s how MulBackward implements this:
The code in Listing 6.2 makes this concrete.

```python
class MulBackward(Function):
    """Gradient computation for element-wise multiplication."""

    def apply(self, grad_output):
        """
        For z = a * b:
            ∂z/∂a = b  →  grad_a = grad_output * b
            ∂z/∂b = a  →  grad_b = grad_output * a
        Uses vectorized element-wise multiplication (NumPy broadcasting).
        """
        a, b = self.saved_tensors
        grad_a = grad_b = None
        if a.requires_grad:
            grad_a = grad_output * b.data  # Vectorized element-wise multiplication
        if b.requires_grad:
            grad_b = grad_output * a.data  # NumPy handles broadcasting automatically
        return grad_a, grad_b
```

Listing 6.2 — MulBackward applies the product rule with NumPy broadcasting instead of explicit loops.
Each operation knows only its own derivative; the chain rule does the connecting. NumPy handles the element-wise math in optimized C, so no explicit loops are needed.
Backward Pass Implementation
The backward pass walks the computation graph in reverse, computing gradients for every tensor it visits. Your backward() method does this as a recursive tree walk — short enough to read in one sitting, but enough to support arbitrarily deep networks:
The code in Listing 6.3 makes this concrete.

```python
def backward(self, gradient=None):
    """Compute gradients via backpropagation."""
    if not self.requires_grad:
        return

    # Initialize gradient for scalar outputs
    if gradient is None:
        if self.data.size == 1:
            gradient = np.ones_like(self.data)
        else:
            raise ValueError("backward() requires gradient for non-scalar tensors")

    # Accumulate gradient (vectorized NumPy operation)
    if self.grad is None:
        self.grad = np.zeros_like(self.data)
    self.grad += gradient

    # Propagate to parent tensors
    if hasattr(self, '_grad_fn') and self._grad_fn is not None:
        grads = self._grad_fn.apply(gradient)  # Compute input gradients using vectorized ops
        for tensor, grad in zip(self._grad_fn.saved_tensors, grads):
            if isinstance(tensor, Tensor) and tensor.requires_grad and grad is not None:
                tensor.backward(grad)  # Recursive call
```

Listing 6.3 — Tensor.backward() seeds the output gradient, accumulates into .grad, and recurses through the _grad_fn chain.
For a 100-layer network, loss.backward() triggers 100 recursive calls — one per layer — flowing gradients from output to input. The traversal is recursive Python; the math inside each apply() is vectorized NumPy. That split is why the system stays both readable and fast.
The gradient argument deserves a closer look. For scalar losses (the typical case) you call loss.backward() with no arguments and the method seeds the gradient to 1.0 — because ∂loss/∂loss = 1. For non-scalar outputs you must pass the upstream gradient explicitly; there is no canonical scalar to seed from, and silently picking one would hide bugs.
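The seeding rule in code, assuming the Tensor built in this module (shapes are illustrative):

```python
import numpy as np

x = Tensor(np.array([1.0, 2.0]), requires_grad=True)
y = x * 3                           # non-scalar output, shape (2,)

# y.backward() would raise ValueError: there is no canonical seed to pick.
# Passing ones as the upstream gradient is equivalent to y.sum().backward():
y.backward(np.ones_like(y.data))
print(x.grad)                       # [3. 3.]
```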
Gradient Accumulation
Gradient accumulation is the same feature seen from two sides. Call backward() twice on the same tensor and the gradients add — by design. That’s what lets you split a batch that doesn’t fit in memory into chunks, run them sequentially, and end up with the same gradient as if you’d processed them all at once:
```python
# Large batch (doesn't fit in memory)
for mini_batch in split_batch(large_batch, chunks=4):
    loss = model(mini_batch)
    loss.backward()     # Gradients accumulate in model parameters

# Now gradients equal the sum over the entire large batch
optimizer.step()
model.zero_grad()       # Reset for next iteration
```

Without this behavior you'd have to store every mini-batch gradient and sum them yourself. With it, the autograd system does the bookkeeping.
The flip side: accumulation becomes a silent bug the moment you forget to call zero_grad() between iterations.
```python
# WRONG: Gradients accumulate across iterations
for batch in dataloader:
    loss = model(batch)
    loss.backward()     # Gradients keep adding!
    optimizer.step()    # Updates use accumulated gradients from all previous batches
```

```python
# CORRECT: Zero gradients after each update
for batch in dataloader:
    model.zero_grad()   # Reset gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()
```

Your zero_grad() implementation is simple but crucial:
```python
def zero_grad(self):
    """Reset gradients to None."""
    self.grad = None
```

Setting to None instead of zeros saves memory: NumPy doesn't allocate an array until the first gradient is accumulated in backward().
Memory Management in Autograd
Autograd’s memory footprint comes from two sources: stored intermediate tensors and gradient storage. For a forward pass through an N-layer network, you store roughly N intermediate activations. During backward pass, you store gradients for every parameter.
Consider a simple linear layer: y = x @ W + b
Forward pass stores:

- x (needed for computing grad_W = x.T @ grad_y)
- W (needed for computing grad_x = grad_y @ W.T)

Backward pass allocates:

- grad_x (same shape as x)
- grad_W (same shape as W)
- grad_b (same shape as b)
For a batch of 32 samples through a (512, 768) linear layer, the memory breakdown is:
```
Forward storage:
  x:      32 × 512 × 4 bytes = 64 KB
  W:      512 × 768 × 4 bytes = 1,536 KB

Backward storage:
  grad_x: 32 × 512 × 4 bytes = 64 KB
  grad_W: 512 × 768 × 4 bytes = 1,536 KB
  grad_b: 768 × 4 bytes = 3 KB

Total: ~3.1 MB for one layer (2× parameter size + activation size)
```
Multiply by network depth and you see why memory limits batch size. A 100-layer transformer stores 100× the activations, which can easily exceed GPU memory.
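The arithmetic is easy to script. A small helper (hypothetical, assuming float32 at 4 bytes per element) reproduces the numbers above:

```python
def tensor_kb(*shape, bytes_per_elem=4):
    """KB consumed by a float32 tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 1024

batch, d_in, d_out = 32, 512, 768
forward_kb = tensor_kb(batch, d_in) + tensor_kb(d_in, d_out)                      # x, W
backward_kb = tensor_kb(batch, d_in) + tensor_kb(d_in, d_out) + tensor_kb(d_out)  # grads
print(f"{(forward_kb + backward_kb) / 1024:.1f} MB per layer")                    # 3.1 MB
```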
Production frameworks mitigate this with gradient checkpointing: they discard intermediate activations during forward pass and recompute them during backward pass. This trades compute (recomputing activations) for memory (not storing them). Your implementation doesn’t do this—it’s an advanced optimization—but understanding the trade-off is essential.
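For reference, PyTorch exposes checkpointing as a one-line wrapper: wrapping a block drops its internal activations on the forward pass and recomputes them during backward (example assumes a recent PyTorch version):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(32, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # activations recomputed on backward
y.sum().backward()                             # same gradients, less memory held
```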
The implementation shows this memory overhead clearly in the MatmulBackward class:
The code in Listing 6.4 makes this concrete.

```python
class MatmulBackward(Function):
    """
    Gradient computation for matrix multiplication.

    For Z = A @ B:
    - Must store A and B during forward pass
    - Backward computes: grad_A = grad_Z @ B.T and grad_B = A.T @ grad_Z
    - Uses vectorized NumPy operations (np.matmul, np.swapaxes)
    """

    def apply(self, grad_output):
        a, b = self.saved_tensors  # Retrieved from memory
        grad_a = grad_b = None
        if a.requires_grad:
            # Vectorized transpose and matmul (no explicit loops)
            b_T = np.swapaxes(b.data, -2, -1)
            grad_a = np.matmul(grad_output, b_T)
        if b.requires_grad:
            a_T = np.swapaxes(a.data, -2, -1)
            grad_b = np.matmul(a_T, grad_output)
        return grad_a, grad_b
```

Listing 6.4 — MatmulBackward turns one forward matmul into two backward matmuls, the source of the 2× backward-to-forward FLOP ratio.
Notice that both a and b must be saved during forward pass. For large matrices, this storage cost dominates memory usage. All gradient computations use vectorized NumPy operations, which are implemented in optimized C/Fortran code under the hood—no explicit Python loops are needed.
Production Context
Your Implementation vs. PyTorch
Your autograd system and PyTorch’s share the same design: computation graphs built during forward pass, reverse-mode differentiation during backward pass, and gradient accumulation in parameter tensors. The differences are in scale and optimization.
Table 5 places your implementation side by side with the production reference for direct comparison.
| Feature | Your Implementation | PyTorch |
|---|---|---|
| Graph Building | Python objects, `_grad_fn` attribute | C++ objects, compiled graph |
| Memory | Stores all intermediates | Gradient checkpointing, memory pools |
| Speed | Pure Python, NumPy backend | C++/CUDA, fused kernels |
| Operations | 10 backward functions | 2000+ optimized backward functions |
| Debugging | Direct Python inspection | torch.autograd.profiler, graph visualization |
Code Comparison
The following comparison shows identical conceptual patterns in TinyTorch and PyTorch. The APIs mirror each other because both implement the same autograd algorithm.
```python
from tinytorch import Tensor
# Create tensors with gradient tracking
x = Tensor([[1.0, 2.0]], requires_grad=True)
W = Tensor([[3.0], [4.0]], requires_grad=True)

# Forward pass builds computation graph
y = x.matmul(W)       # y = x @ W
loss = (y * y).sum()  # loss = sum(y²)

# Backward pass computes gradients
loss.backward()

# Access gradients
print(f"x.grad: {x.grad}")  # ∂loss/∂x
print(f"W.grad: {W.grad}")  # ∂loss/∂W
```

```python
import torch
# Create tensors with gradient tracking
x = torch.tensor([[1.0, 2.0]], requires_grad=True)
W = torch.tensor([[3.0], [4.0]], requires_grad=True)

# Forward pass builds computation graph
y = x @ W             # PyTorch uses @ operator
loss = (y * y).sum()

# Backward pass computes gradients
loss.backward()

# Access gradients
print(f"x.grad: {x.grad}")
print(f"W.grad: {W.grad}")
```

Let's walk through the comparison line by line:
- Lines 3-4 (Tensor creation): Both frameworks use `requires_grad=True` to enable gradient tracking. This is an opt-in design: most tensors (data, labels) don't need gradients; only parameters do.
- Lines 7-8 (Forward pass): Operations automatically build computation graphs. TinyTorch uses the `.matmul()` method; PyTorch supports both `.matmul()` and the `@` operator.
- Line 11 (Backward pass): A single method call triggers reverse-mode differentiation through the entire graph.
- Lines 14-15 (Gradient access): Both store gradients in the `.grad` attribute. Gradients have the same shape as the original tensor.
What's identical: computation graph construction, chain rule implementation, and gradient accumulation semantics. When you debug PyTorch autograd issues, you're debugging the same algorithm you implemented here.
Why Autograd Matters at Scale
The case for automating differentiation is overwhelming the moment you look at the numbers:
- GPT-3: 175 billion parameters — that’s 175,000,000,000 gradients per training step.
- Training cost: each backward pass takes roughly 2× the forward pass time (two matmuls instead of one per linear layer).
- Memory: storing the computation graph for a transformer can require ~10× the model’s parameter footprint.
Hand-deriving gradients does not scale to any of this. Even a 3-layer MLP with a million parameters would take weeks to differentiate manually and would still contain bugs at the end. Autograd makes training tractable by automating the most error-prone part of deep learning — and it’s the single biggest reason the field moves as fast as it does.
Check Your Understanding
Test yourself with these systems thinking questions. They're designed to build intuition for autograd's performance characteristics and design decisions. If any answer feels fuzzy, revisit the Computation Graphs and Memory Management sections before moving on.
Q1: Computation Graph Memory
A 5-layer MLP processes a batch of 64 samples. Each layer stores its input activation for backward pass. Layer dimensions are: 784 → 512 → 256 → 128 → 10. How much memory (in MB) is used to store activations for one batch?
```
Layer 1 input: 64 × 784 × 4 bytes = 196 KB
Layer 2 input: 64 × 512 × 4 bytes = 128 KB
Layer 3 input: 64 × 256 × 4 bytes = 64 KB
Layer 4 input: 64 × 128 × 4 bytes = 32 KB
Layer 5 input: 64 × 10  × 4 bytes = 2.5 KB

Total: ~422 KB ≈ 0.41 MB
```
That is per forward pass through a tiny MLP. A 100-layer transformer stores roughly 100× this — and with much wider layers — which is why gradient checkpointing trades compute for memory by recomputing activations during backward pass.
Q2: Backward Pass Complexity
A forward pass through a linear layer y = x @ W (where x is 32×512 and W is 512×256) takes 8ms. How long will the backward pass take?
```
Forward:  1 matmul (x @ W)
Backward: 2 matmuls
  grad_x = grad_y @ W.T  (32×256 @ 256×512)
  grad_W = x.T @ grad_y  (512×32 @ 32×256)

Backward takes ~2× forward time ≈ 16 ms
```
This is why training (forward + backward) takes roughly 3× inference time. GPU parallelism and kernel fusion can reduce this, but the fundamental 2:1 ratio remains.
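The ratio falls straight out of FLOP counting (one multiply-accumulate counted as 2 FLOPs; names here are illustrative):

```python
m, k, n = 32, 512, 256            # x: m×k, W: k×n

fwd_flops = 2 * m * k * n         # x @ W
bwd_flops = (2 * m * n * k        # grad_x = grad_y @ W.T
             + 2 * k * m * n)     # grad_W = x.T @ grad_y
print(bwd_flops / fwd_flops)      # 2.0
```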
Q3: Gradient Accumulation Memory
You have 16GB GPU memory and a model with 1B parameters (float32). How much memory is available for activations and gradients during training?
```
Model parameters:        1B × 4 bytes = 4 GB
Gradients:               1B × 4 bytes = 4 GB
Optimizer state (Adam):  1B × 8 bytes = 8 GB (momentum + variance)

Total framework overhead: 16 GB
```

Available for activations: 0 GB — you've already exhausted memory before storing a single activation.
This is why large models reach for gradient accumulation across multiple forward passes before updating parameters, or gradient checkpointing to shrink activation memory. The “2× parameter size” rule (params + grads) is a floor, not a ceiling — optimizers like Adam add more on top.
Q4: requires_grad Performance
A typical training batch has: 32 images (input), 10M parameter tensors (weights), 50 intermediate activation tensors. If requires_grad defaults to True for all tensors, how many tensors unnecessarily track gradients?
Tensors that need gradients:
- Parameters: 10M tensors
Tensors that don’t need gradients:
- Input images: 32 tensors (data is not learned)
- Intermediate activations: 50 tensors (needed for backward, but not updated)
32 input tensors would unnecessarily track gradients if requires_grad defaulted to True.
This is why PyTorch defaults requires_grad=False for new tensors and forces an explicit opt-in for parameters. For a batch of 32 images of shape 3×224×224, accidentally tracking gradients on the inputs wastes 4.8M float32 values × 4 bytes = ~18.4 MB per batch — for nothing.
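The waste is easy to quantify (float32, binary megabytes):

```python
batch, c, h, w = 32, 3, 224, 224
values = batch * c * h * w            # 4,816,896 ≈ 4.8M float32 values
print(round(values * 4 / 2**20, 1))   # 18.4 MB per batch, tracked for nothing
```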
Q5: Graph Retention
You forget to call zero_grad() before each training iteration. After 10 iterations, how do the gradients compare to correct training?
Gradients accumulate across all 10 iterations.
If correct gradient for iteration i is g_i, your accumulated gradient is: grad = g_1 + g_2 + g_3 + ... + g_10
Effects:

1. Magnitude: gradients are ~10× larger than they should be
2. Direction: the sum of 10 different gradients, which may not point toward the loss minimum
3. Learning: parameter updates use the wrong direction and the wrong magnitude
4. Result: training diverges or oscillates instead of converging
Bottom line: Always call zero_grad() at the start of each iteration (or after optimizer.step()).
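You can reproduce the failure in a few lines with the Tensor from this module (a sketch; values are illustrative):

```python
w = Tensor([1.0], requires_grad=True)

for _ in range(10):
    loss = (w * 2.0).sum()   # correct per-iteration gradient: 2.0
    loss.backward()          # but w.zero_grad() is never called

print(w.grad)                # [20.0]: ten gradients compounded into one
```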
Key Takeaways
- Computation graphs are dynamic and implicit: every tensor produced by an operation stores a `_grad_fn` pointer; that pointer chain is the graph autograd walks in reverse.
- Activation pinning is autograd's memory tax: saved tensors survive until `backward()` runs, which is why forward-pass VRAM usage scales linearly with depth and caps batch size long before parameter count does.
- Backward costs 2× forward: each linear layer turns one matmul into two (`grad_x`, `grad_W`), making training roughly 3× the cost of inference — a constant every capacity plan assumes.
- `zero_grad()` is not optional: gradient accumulation is the feature and the silent bug; skipping the reset compounds 10 gradients into one bad step.
Coming next: You can compute a gradient for every parameter — but a gradient alone is just a direction. Module 07 turns directions into updates by building SGD, Adam, and AdamW, the rules that decide how far to step.
Further Reading
For students who want to understand the academic foundations and mathematical underpinnings of automatic differentiation:
Seminal Papers
- Automatic Differentiation in Machine Learning: a Survey - Baydin et al. (2018). Comprehensive survey of AD techniques, covering forward-mode, reverse-mode, and mixed-mode differentiation. Essential reading for understanding autograd theory.
- Systems Implication: This work formalized the foundational memory-compute trade-off: reverse-mode AD minimizes compute at the steep cost of O(N) memory to store the forward activation graph, permanently dictating modern GPU VRAM requirements. arXiv:1502.05767
- Automatic Differentiation of Algorithms - Griewank (1989). The foundational work on reverse-mode AD that underlies all modern deep learning frameworks. Introduces the mathematical formalism for gradient computation via the chain rule.
- Systems Implication: Exposed the inherent sequential data dependencies during the backward pass. Because gradients must be propagated backwards sequentially from output to input, this creates a structural bottleneck that prevents perfect parallelization across network layers. Computational Optimization and Applications
- PyTorch: An Imperative Style, High-Performance Deep Learning Library - Paszke et al. (2019). Describes PyTorch’s autograd implementation and design philosophy. Shows how imperative programming (define-by-run) enables dynamic computation graphs.
- Systems Implication: By introducing dynamic computation graphs, PyTorch mandated that memory allocation and operator dispatch overhead occur strictly at runtime (eager execution). This paradigm shift required highly optimized C++ backends to aggressively hide Python’s interpretive latency. NeurIPS 2019
Additional Resources
- Textbook: “Deep Learning” by Goodfellow, Bengio, and Courville - Chapter 6 covers backpropagation and computational graphs with excellent visualizations
- Tutorial: CS231n: Backpropagation, Intuitions - Stanford’s visual explanation of gradient flow through computation graphs
- Documentation: PyTorch Autograd Mechanics - Official guide to PyTorch’s autograd implementation details
What’s Next
You can now compute a gradient for every parameter in any network you build. That gradient tells you which way is downhill — but it does not tell you how big a step to take, or how to dampen oscillations, or how to adapt the step size per-parameter. That is the optimizer’s job.
You’ll implement SGD, momentum, and Adam: the rules that turn the param.grad tensors produced by backward() into actual parameter updates. With autograd plus an optimizer, you have the entire machinery a training loop needs.
Preview — how your autograd gets used in the modules that follow:
Table 6 traces how this module is reused by later parts of the curriculum.
| Module | What It Does | Your Autograd In Action |
|---|---|---|
| 07: Optimizers | Update parameters using gradients | optimizer.step() uses param.grad computed by backward() |
| 08: Training | Complete training loops | loss.backward() → optimizer.step() → repeat |
| 12: Attention | Multi-head self-attention | Gradients flow through Q, K, V projections automatically |
Get Started
- Launch Binder - Run interactively in browser, no setup required
- View Source - Browse the implementation code
Binder sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.