Module 06: Optimizers#
Module Info
FOUNDATION TIER | Difficulty: ●●○○ | Time: 3-5 hours | Prerequisites: 01-05
The prerequisites (Modules 01-05) mean you need:
Tensor operations and parameter storage
Understanding of forward/backward passes (autograd)
Why gradients point toward higher loss
If you understand how loss.backward() computes gradients and why we need to update parameters to minimize loss, you're ready.
Overview#
Optimizers are the engines that drive neural network learning. After your autograd system computes gradients that point uphill toward higher loss, optimizers use those gradients to move parameters downhill toward lower loss. Think of optimization as hiking in dense fog where you can only feel the slope under your feet but can't see where you're going. Different optimizers represent different hiking strategies, from simple gradient descent to sophisticated algorithms that adapt their step size for each parameter.
In this module, you'll build three production-grade optimizers: SGD with momentum (the foundation algorithm), Adam with adaptive learning rates (the workhorse of modern deep learning), and AdamW with decoupled weight decay (the state-of-the-art for transformers). These optimizers differ dramatically in memory usage, convergence speed, and numerical behavior.
By the end, you'll understand not just how optimizers work but also the systems trade-offs between them: SGD uses 2x parameter memory while Adam uses 3x, but Adam often converges in fewer steps.
Learning Objectives#
Tip
By completing this module, you will:
Implement SGD with momentum to reduce oscillations and accelerate convergence in narrow valleys
Master Adam's adaptive learning rate mechanism with first and second moment estimation
Understand memory trade-offs (SGD: 2x memory vs Adam: 3x memory) and computational complexity per step
Connect optimizer state management to checkpointing and distributed training considerations
What You'll Build#
flowchart LR
subgraph "Your Optimizer Classes"
A["Optimizer Base<br/>zero_grad(), step()"]
B["SGD<br/>momentum buffers"]
C["Adam<br/>m, v buffers"]
D["AdamW<br/>decoupled decay"]
end
A --> B --> C --> D
style A fill:#e1f5ff
style B fill:#fff3cd
style C fill:#f8d7da
style D fill:#d4edda
Fig. 11 Your Optimizer Classes#
Implementation roadmap:
| Step | What You'll Implement | Key Concept |
|---|---|---|
| 1 | Optimizer base class | Common interface: zero_grad(), step() |
| 2 | SGD | Velocity buffers to reduce oscillations |
| 3 | Adam | First and second moment estimation with bias correction |
| 4 | AdamW | Decoupled weight decay for proper regularization |
The pattern you'll enable:
# Training loop with optimizer
optimizer = Adam(model.parameters(), lr=0.001)
loss.backward() # Compute gradients (Module 05)
optimizer.step() # Update parameters using gradients
optimizer.zero_grad() # Clear gradients for next iteration
What You're NOT Building (Yet)#
To keep this module focused, you will not implement:
Learning rate schedules (that's Module 07: Training)
Gradient clipping (PyTorch provides this via torch.nn.utils.clip_grad_norm_)
Second-order optimizers like L-BFGS (rarely used in deep learning due to memory cost)
Distributed optimizer sharding (production frameworks use techniques like ZeRO)
You are building the core optimization algorithms. Advanced training techniques come in Module 07.
API Reference#
This section provides a quick reference for the Optimizer classes you'll build. Use this as your guide while implementing and debugging.
Optimizer Base Class#
Optimizer(params: List[Tensor])
Base class defining the optimizer interface. All optimizers inherit from this.
| Method | Signature | Description |
|---|---|---|
| zero_grad | zero_grad() | Clear gradients from all parameters |
| step | step() | Update parameters (implemented by subclasses) |
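To make the interface concrete, here is a minimal sketch of what such a base class can look like. It is an illustrative skeleton, not the module's reference solution (for example, your zero_grad() may reset gradients to zero arrays rather than None):

class Optimizer:
    """Minimal base-class sketch: hold parameters, clear grads, defer updates to subclasses."""

    def __init__(self, params):
        self.params = list(params)   # parameters to update on each step()
        self.step_count = 0          # number of update steps taken so far

    def zero_grad(self):
        # Clear gradients from all parameters before the next backward pass
        for param in self.params:
            param.grad = None

    def step(self):
        # Subclasses (SGD, Adam, AdamW) implement the actual update rule
        raise NotImplementedError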
SGD Optimizer#
SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0)
Stochastic Gradient Descent with optional momentum and weight decay.
Parameters:
params: List of Tensor parameters to optimize
lr: Learning rate (step size, default: 0.01)
momentum: Momentum factor (0.0-1.0, typically 0.9, default: 0.0)
weight_decay: L2 penalty coefficient (default: 0.0)
Update rule:
Without momentum: param = param - lr * grad
With momentum: v = momentum * v + grad; param = param - lr * v
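To see the momentum rule in action, the short scalar walk-through below applies it by hand. The numbers are made up for illustration and are not taken from the module's tests:

lr, momentum = 0.1, 0.9
param, v = 1.0, 0.0
grads = [0.5, 0.4, 0.45]   # a few noisy gradients

for g in grads:
    # With momentum: v = momentum * v + grad; param = param - lr * v
    v = momentum * v + g
    param = param - lr * v

print(param)  # roughly 0.7435 after three momentum steps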
State management methods:
| Method | Signature | Description |
|---|---|---|
| | | Check if optimizer uses momentum (momentum > 0) |
| | | Get momentum buffers for checkpointing |
| | | Restore momentum buffers from checkpoint |
Adam Optimizer#
Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
Adaptive Moment Estimation with per-parameter learning rates.
Parameters:
params: List of Tensor parameters to optimize
lr: Learning rate (default: 0.001)
betas: Tuple of coefficients (β₁, β₂) for computing running averages (default: (0.9, 0.999))
eps: Small constant for numerical stability (default: 1e-8)
weight_decay: L2 penalty coefficient (default: 0.0)
State:
m_buffers: First moment estimates (momentum of gradients)
v_buffers: Second moment estimates (momentum of squared gradients)
AdamW Optimizer#
AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
Adam with decoupled weight decay regularization.
Parameters:
params: List of Tensor parameters to optimize
lr: Learning rate (default: 0.001)
betas: Tuple of coefficients (β₁, β₂) for computing running averages (default: (0.9, 0.999))
eps: Small constant for numerical stability (default: 1e-8)
weight_decay: L2 penalty coefficient (default: 0.01, higher than Adam)
Key difference from Adam: Weight decay is applied directly to parameters after gradient update, not mixed into the gradient.
Core Concepts#
This section covers the fundamental ideas you need to understand optimization deeply. These concepts apply across all ML frameworks and will serve you throughout your career in machine learning systems.
Gradient Descent Fundamentals#
Gradient descent is conceptually simple: gradients point uphill toward higher loss, so we step downhill by moving in the opposite direction. The gradient ∇L tells us the direction of steepest ascent, so -∇L points toward steepest descent.
The basic update rule is: θ_new = θ_old - α * ∇L, where θ represents parameters and α is the learning rate (step size). This simple formula hides important challenges. How large should steps be? What if different parameters need different step sizes? What about noisy gradients or narrow valleys that cause oscillation?
Here's how your SGD implementation handles the basic case without momentum:
def step(self):
    """Perform SGD update step with momentum."""
    for i, param in enumerate(self.params):
        if param.grad is None:
            continue

        # Get gradient data
        grad = param.grad
        if isinstance(grad, Tensor):
            grad_data = grad.data
        else:
            grad_data = grad

        # Apply weight decay if specified
        if self.weight_decay != 0:
            grad_data = grad_data + self.weight_decay * param.data

        # Update parameter: param = param - lr * grad
        param.data = param.data - self.lr * grad_data

    self.step_count += 1
The code reveals the simplicity of basic SGD: subtract learning rate times gradient from each parameter. But this simplicity comes with a cost: plain SGD can oscillate wildly in narrow valleys of the loss landscape.
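The oscillation problem is easy to reproduce on a toy one-dimensional quadratic. The sketch below (illustrative only, not part of the module's code) runs plain gradient descent on f(theta) = 0.5 * k * theta^2 with a step size that is too aggressive for the curvature, so each step overshoots the minimum and flips sign:

# Plain gradient descent on f(theta) = 0.5 * k * theta**2, whose gradient is k * theta.
# With lr close to 2/k, every step overshoots the minimum at theta = 0.
k, lr = 10.0, 0.19
theta = 1.0
for step in range(5):
    grad = k * theta              # gradient points uphill
    theta = theta - lr * grad     # step downhill, but too far
    print(f"step {step}: theta = {theta:+.3f}")
# theta bounces between positive and negative values (+1.0 -> -0.9 -> +0.81 -> ...),
# shrinking only slowly; momentum damps exactly this kind of back-and-forth.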
Momentum and Acceleration#
Momentum solves the oscillation problem by remembering previous update directions. Think of a ball rolling down a hill: it doesn't immediately change direction when it hits a small bump because it has momentum carrying it forward. In optimization, momentum accumulates velocity in directions that gradients consistently agree on, while oscillations in perpendicular directions cancel out.
The momentum update maintains a velocity buffer v for each parameter: v = β * v_prev + grad and then param = param - lr * v. The momentum coefficient β (typically 0.9) controls how much of the previous direction we remember. With β=0.9, each step keeps 90% of the old velocity and adds the full current gradient on top.
Here's how your SGD implementation adds momentum:
# Update momentum buffer
if self.momentum != 0:
    if self.momentum_buffers[i] is None:
        # Initialize momentum buffer on first use
        self.momentum_buffers[i] = np.zeros_like(param.data)

    # Update momentum: v = momentum * v_prev + grad
    self.momentum_buffers[i] = self.momentum * self.momentum_buffers[i] + grad_data
    grad_data = self.momentum_buffers[i]

# Update parameter: param = param - lr * grad
param.data = param.data - self.lr * grad_data
The momentum buffer is initialized lazily (only when first needed) to save memory for optimizers without momentum. Once initialized, each step accumulates 90% of the previous velocity plus the current gradient, creating a smoothed update direction that's less susceptible to noise and oscillation.
Adam and Adaptive Learning Rates#
Adam solves a fundamental problem: different parameters often need different learning rates. Consider a neural network with embedding weights ranging from -0.01 to 0.01 and output weights ranging from -10 to 10. A learning rate that works well for embeddings might cause output weights to explode, while a rate thatâs safe for output weights makes embeddings learn too slowly.
Adam addresses this by maintaining two statistics for each parameter: a first moment m (exponential moving average of gradients) and a second moment v (exponential moving average of squared gradients). The ratio m/√v gives an adaptive step size: parameters with large gradients get smaller effective learning rates, while parameters with small gradients get larger effective rates.
The algorithm tracks: m = β₁ * m_prev + (1-β₁) * grad and v = β₂ * v_prev + (1-β₂) * grad². Then it corrects for initialization bias (m and v start at zero) and updates: param = param - lr * m̂ / (√v̂ + ε), where m̂ and v̂ are bias-corrected moments.
Here's the complete Adam update from your implementation:
def step(self):
    """Perform Adam update step."""
    self.step_count += 1

    for i, param in enumerate(self.params):
        if param.grad is None:
            continue

        grad = param.grad
        if isinstance(grad, Tensor):
            grad_data = grad.data
        else:
            grad_data = grad

        # Initialize buffers if needed
        if self.m_buffers[i] is None:
            self.m_buffers[i] = np.zeros_like(param.data)
            self.v_buffers[i] = np.zeros_like(param.data)

        # Update biased first moment estimate
        self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data

        # Update biased second moment estimate
        self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)

        # Compute bias correction
        bias_correction1 = 1 - self.beta1 ** self.step_count
        bias_correction2 = 1 - self.beta2 ** self.step_count

        # Compute bias-corrected moments
        m_hat = self.m_buffers[i] / bias_correction1
        v_hat = self.v_buffers[i] / bias_correction2

        # Update parameter
        param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
The bias correction terms (1 - β^t) are crucial in the first few steps. Without correction, m and v start at zero and take many steps to reach reasonable values, causing the optimizer to take tiny steps initially. The correction divides by increasingly large values: at step 1, divide by 0.1; at step 2, divide by 0.19; eventually the correction approaches 1.0 and has no effect.
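To see these correction factors concretely, the quick check below (illustrative only) prints 1 - β₁^t for a few step counts with β₁ = 0.9:

# Bias-correction factor for the first moment: 1 - beta1 ** t
beta1 = 0.9
for t in [1, 2, 10, 100]:
    correction = 1 - beta1 ** t
    print(f"step {t:>3}: divide m by {correction:.4f} (scale up by {1 / correction:.2f}x)")
# step   1: divide m by 0.1000 (scale up by 10.00x)
# step   2: divide m by 0.1900 (scale up by 5.26x)
# step  10: divide m by 0.6513 (scale up by 1.54x)
# step 100: divide m by 1.0000 (scale up by 1.00x)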
AdamW and Decoupled Weight Decay#
AdamW fixes a subtle but important bug in Adam's weight decay implementation. In standard Adam, weight decay is added to the gradient before adaptive scaling: grad = grad + λ * param, then proceed with normal Adam. This seems reasonable but creates a problem: the weight decay effect gets scaled by the adaptive learning rate mechanism, making regularization inconsistent across parameters.
Parameters with large gradients get small adaptive learning rates, which also makes their weight decay small. Parameters with small gradients get large adaptive learning rates, which amplifies their weight decay. This is backwards: we want consistent regularization regardless of gradient magnitudes.
AdamW decouples weight decay from the gradient by applying it directly to parameters after the gradient update: first update using pure gradients with Adamâs adaptive mechanism, then separately shrink parameters by a fixed proportion. This ensures regularization strength is consistent across all parameters.
Here's how your AdamW implementation achieves decoupling:
# Update moments using pure gradients (NO weight decay mixed in)
self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data
self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)
# Compute bias correction and bias-corrected moments
bias_correction1 = 1 - self.beta1 ** self.step_count
bias_correction2 = 1 - self.beta2 ** self.step_count
m_hat = self.m_buffers[i] / bias_correction1
v_hat = self.v_buffers[i] / bias_correction2
# Apply gradient-based update
param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
# Apply decoupled weight decay (separate from gradient update)
if self.weight_decay != 0:
    param.data = param.data * (1 - self.lr * self.weight_decay)
Notice that weight decay appears only at the end, multiplying parameters by (1 - lr * weight_decay) to shrink them slightly. This shrinkage happens after the gradient update and is completely independent of gradient magnitudes or adaptive scaling.
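A quick way to feel the difference is to apply both variants to a single scalar parameter. The snippet below is a deliberately simplified sketch (no moment buffers or bias correction, and v is just a fixed made-up value standing in for the running average of squared gradients), isolating where the weight decay term enters:

import numpy as np

lr, weight_decay, eps = 0.1, 0.01, 1e-8
param, grad, v = 1.0, 0.001, 1e-4   # v stands in for the running average of squared gradients

# Adam-style (coupled): decay is folded into the gradient, then adaptively scaled
g_adam = grad + weight_decay * param
param_adam = param - lr * g_adam / (np.sqrt(v) + eps)

# AdamW-style (decoupled): pure gradient update, then a fixed proportional shrink
param_adamw = param - lr * grad / (np.sqrt(v) + eps)
param_adamw = param_adamw * (1 - lr * weight_decay)

print(param_adam, param_adamw)  # ~0.890 vs ~0.989: the coupled decay is amplified by the 1/sqrt(v) scaling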
Learning Rate Selection#
Learning rate is the single most important hyperparameter in optimization. Too large, and parameters oscillate or diverge. Too small, and training takes forever or gets stuck in poor local minima. The optimal learning rate depends on the optimizer, network architecture, dataset, and batch size.
For SGD, learning rates typically range from 0.001 to 0.1. SGD is very sensitive to learning rate choice and often requires careful tuning or learning rate schedules. Momentum helps but doesnât eliminate the sensitivity.
For Adam and AdamW, the default learning rate of 0.001 works well across many problems. The adaptive mechanism provides some robustness to learning rate choice. However, transformers often use smaller rates (0.0001 to 0.0003) with warmup periods where the rate gradually increases from zero.
The relationship between learning rate and batch size matters for distributed training. Larger batches provide less noisy gradients, allowing larger learning rates. A common heuristic is to scale learning rate linearly with batch size: if you double the batch size from 32 to 64, double the learning rate from 0.001 to 0.002.
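As a sketch of that heuristic (a common convention; the function name here is just for illustration and is not part of the module):

def linear_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    return base_lr * (new_batch_size / base_batch_size)

print(linear_scaled_lr(0.001, 32, 64))   # 0.002
print(linear_scaled_lr(0.001, 32, 256))  # 0.008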
Production Context#
Your Implementation vs. PyTorch#
Your TinyTorch optimizers and PyTorch's torch.optim share the same algorithmic foundations and API patterns. The differences lie in implementation details: PyTorch uses optimized C++/CUDA kernels, supports mixed precision training, and includes specialized optimizers for specific domains.
| Feature | Your Implementation | PyTorch |
|---|---|---|
| Backend | NumPy (Python) | C++/CUDA kernels |
| Speed | 1x (baseline) | 10-50x faster |
| Memory | Same asymptotic cost | Same (3x for Adam) |
| State management | Manual buffers | Automatic state_dict() |
| Optimizers | SGD, Adam, AdamW | 10+ algorithms (RMSprop, Adagrad, etc.) |
Code Comparison#
The following comparison shows how optimizer usage looks nearly identical in TinyTorch and PyTorch. This similarity is intentional: by learning TinyTorch's patterns, you're simultaneously learning production PyTorch patterns.
from tinytorch.core.optimizers import Adam

# Create optimizer for model parameters
optimizer = Adam(model.parameters(), lr=0.001)

# Training step
loss = criterion(predictions, targets)
loss.backward()  # Compute gradients
optimizer.step()  # Update parameters
optimizer.zero_grad()  # Clear gradients
import torch.optim as optim

# Create optimizer for model parameters
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training step
loss = criterion(predictions, targets)
loss.backward()  # Compute gradients
optimizer.step()  # Update parameters
optimizer.zero_grad()  # Clear gradients
Let's walk through each line to understand the comparison:
Line 1 (Import): TinyTorch exposes optimizers from tinytorch.core.optimizers; PyTorch uses torch.optim. The namespace structure mirrors production frameworks.
Line 4 (Creation): Both use identical syntax: Adam(model.parameters(), lr=0.001). The model.parameters() method returns an iterable of tensors with requires_grad=True.
Lines 7-8 (Training): The loss computation and backward pass are identical. Your autograd system from Module 05 computes gradients just like PyTorch.
Line 9 (Update): Both call optimizer.step() to update parameters using computed gradients. The update rules are mathematically identical.
Line 10 (Clear): Both call optimizer.zero_grad() to clear gradients before the next iteration. Without this, gradients would accumulate across batches.
Tip
Whatâs Identical
The optimizer API, update algorithms, and memory patterns are identical. When you debug Adam's learning rate or analyze optimizer memory usage in production, you'll understand exactly what's happening because you built these mechanisms yourself.
Why Optimizers Matter at Scale#
To appreciate optimizer importance, consider production training scenarios:
Large language models (175B parameters): Optimizer state alone consumes 1.4 TB with Adam (two extra buffers × 700 GB of float32 parameters), requiring multi-GPU state sharding
Transformer training: AdamW with weight_decay=0.01 is standard, improving generalization over plain Adam by 2-5% accuracy
Convergence speed: Adam typically converges in 30-50% fewer steps than SGD on vision and language tasks, saving hours of GPU time despite higher memory cost
The optimizer choice directly impacts training feasibility. For models that barely fit in memory with SGD, switching to Adam might require distributed training or gradient checkpointing to handle the 1.5x memory increase.
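The arithmetic behind these numbers is easy to script. The helper below is a rough sketch (hypothetical function, float32 only, ignoring gradients and activation memory) of how optimizer state scales with parameter count:

def optimizer_state_gb(num_params, extra_buffers, bytes_per_value=4):
    """Rough optimizer-state footprint: one full-size tensor per extra buffer."""
    return num_params * extra_buffers * bytes_per_value / 1e9

params = 175_000_000_000   # a 175B-parameter model
print(optimizer_state_gb(params, extra_buffers=1))  # SGD momentum buffer: 700.0 GB
print(optimizer_state_gb(params, extra_buffers=2))  # Adam m and v buffers: 1400.0 GB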
Check Your Understanding#
Test yourself with these systems thinking questions designed to build intuition for optimization trade-offs in production ML.
Q1: Memory Calculation
A language model has 10 billion float32 parameters. Using Adam optimizer, how much total memory does optimizer state require? How does this compare to SGD with momentum?
Answer
Parameters: 10B × 4 bytes = 40 GB
Adam state: 2 buffers (m, v) = 2 × 40 GB = 80 GB
Total with Adam: 40 GB (params) + 80 GB (state) = 120 GB
SGD with momentum: 1 buffer (velocity) = 40 GB
Total with SGD: 40 GB (params) + 40 GB (state) = 80 GB
Difference: Adam uses 40 GB more than SGD (50% increase). This might force you to use fewer GPUs or implement optimizer state sharding.
Q2: Convergence Trade-off
If Adam converges in 100,000 steps and SGD needs 200,000 steps, but Adam's per-step time is 1.2x slower due to additional computations, which optimizer finishes training faster?
Answer
Adam: 100,000 steps × 1.2 = 120,000 time units
SGD: 200,000 steps × 1.0 = 200,000 time units
Adam finishes 1.67x faster despite higher per-step cost. The convergence advantage (2x fewer steps) outweighs the computational overhead (1.2x slower steps).
This illustrates why Adam is popular despite higher memory and compute: wall-clock time to convergence often matters more than per-step efficiency.
Q3: Bias Correction Impact
In Adam, bias correction divides the first moment by (1 - β₁^t). At step 1 with β₁=0.9, this correction factor is 0.1. At step 10, it's 0.651. How does this affect early vs late training?
Answer
Step 1: Divide by 0.1 = multiply by 10x (huge correction)
Step 10: Divide by 0.651 = multiply by 1.54x (moderate correction)
Step 100: Divide by 0.9999 ≈ multiply by 1.0x (negligible correction)
Early training: Large corrections amplify small moment estimates to reasonable magnitudes, enabling effective learning from the first step.
Late training: Corrections approach 1.0 and have minimal effect, so the algorithm uses raw moment estimates.
Without correction: First moment m starts at 0, making initial steps tiny (learning rate effectively 0.1x intended). Training would be very slow initially.
Q4: Weight Decay Comparison
Adam adds weight decay to gradients before adaptive scaling. AdamW applies it after. For a parameter with grad=0.001 and param=1.0, which experiences stronger regularization with weight_decay=0.01 and lr=0.1?
Answer
Adam approach:
Modified grad = 0.001 + 0.01 × 1.0 = 0.011
This gradient gets adaptively scaled (divided by √v, which is small for small gradients)
Effective decay is amplified by adaptive scaling
AdamW approach:
Pure gradient update uses grad=0.001 (small adaptive step)
Then param = param × (1 - 0.1 × 0.01) = param × 0.999 (fixed 0.1% shrinkage)
AdamW has consistent 0.1% weight decay regardless of gradient magnitude. Adamâs decay strength varies with adaptive learning rate scaling, making it inconsistent across parameters. AdamWâs consistency leads to better regularization behavior.
Q5: Optimizer State Checkpointing
Youâre training with Adam and checkpoint every 1000 steps. The checkpoint saves parameters and optimizer state (m, v buffers). If you resume from step 5000 but change learning rate from 0.001 to 0.0001, should you restore old optimizer state or reset it?
Answer
Restore state (recommended): The m and v buffers contain valuable information about gradient statistics accumulated over 5000 steps. Resetting loses this and causes the optimizer to "forget" learned gradient scales.
Impact of restoring:
Keeps adaptive learning rates calibrated to parameter-specific gradient magnitudes
Prevents slow re-convergence that happens when resetting
Learning rate change affects step size but not the adaptive scaling
When to reset:
If switching optimizer types (SGD → Adam)
If gradient distribution has fundamentally changed (switching datasets)
If debugging and suspecting corrupted state
Production practice: Always restore optimizer state when resuming training unless you have specific reasons to reset. The state is part of what makes Adam effective.
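As a rough sketch of what saving and restoring that state can look like (using the m_buffers, v_buffers, and step_count attributes described earlier; the checkpoint format here is illustrative, not the module's API):

import numpy as np

def save_adam_state(optimizer, path):
    """Save Adam's moment buffers and step count so training can resume exactly (path should end in .npz)."""
    arrays = {"step_count": np.array(optimizer.step_count)}
    for i, (m, v) in enumerate(zip(optimizer.m_buffers, optimizer.v_buffers)):
        if m is not None:
            arrays[f"m_{i}"] = m
            arrays[f"v_{i}"] = v
    np.savez(path, **arrays)

def load_adam_state(optimizer, path):
    """Restore the buffers saved above; the learning rate can still be changed afterwards."""
    data = np.load(path)
    optimizer.step_count = int(data["step_count"])
    for i in range(len(optimizer.m_buffers)):
        if f"m_{i}" in data:
            optimizer.m_buffers[i] = data[f"m_{i}"]
            optimizer.v_buffers[i] = data[f"v_{i}"]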
Further Reading#
For students who want to understand the academic foundations and mathematical underpinnings of optimization algorithms:
Seminal Papers#
Adam: A Method for Stochastic Optimization - Kingma & Ba (2015). The original Adam paper introducing adaptive moment estimation with bias correction. Explains the motivation and derivation. arXiv:1412.6980
Decoupled Weight Decay Regularization (AdamW) - Loshchilov & Hutter (2019). Identifies the weight decay bug in Adam and proposes the decoupled fix. Shows significant improvements on image classification and language modeling. arXiv:1711.05101
On the Importance of Initialization and Momentum in Deep Learning - Sutskever et al. (2013). Classic paper explaining why momentum works and how it accelerates convergence in deep networks. ICML 2013
Additional Resources#
Tutorial: "An overview of gradient descent optimization algorithms" by Sebastian Ruder - Comprehensive survey covering SGD variants, momentum methods, and adaptive learning rate algorithms
Documentation: PyTorch Optimization Documentation - See how production frameworks organize and document optimization algorithms
What's Next#
See also
Coming Up: Module 07 - Training
Combine optimizers with training loops to actually train neural networks. You'll implement learning rate scheduling, checkpointing, and the complete training/validation workflow that makes everything work together.
Preview - How Your Optimizers Get Used in Future Modules:
| Module | What It Does | Your Optimizers In Action |
|---|---|---|
| 07: Training | Complete training loops | |
| 08: DataLoader | Batch data processing | |
| 09: Spatial (CNNs) | Convolutional networks | |
Get Started#
Tip
Interactive Options
Launch Binder - Run interactively in browser, no setup required
Open in Colab - Use Google Colab for cloud compute
View Source - Browse the implementation code
Warning
Save Your Progress
Binder and Colab sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.