Module 02: Activations#
Module Info
FOUNDATION TIER | Difficulty: ★★☆☆ | Time: 3-5 hours | Prerequisites: 01 (Tensor)
Prerequisites: Module 01 (Tensor) means you need:
Completed Tensor implementation with element-wise operations
Understanding of tensor shapes and broadcasting
Familiarity with NumPy mathematical functions
If you can create a Tensor and perform element-wise arithmetic (x + y, x * 2), you're ready.
Overview#
Activation functions are the nonlinear transformations that give neural networks their power. Without them, stacking multiple layers would be pointless: no matter how many linear transformations you chain together, the result is still just one linear transformation. A 100-layer network without activations is mathematically identical to a single-layer network.
Activations introduce nonlinearity. ReLU zeros out negative values. Sigmoid squashes any input to a probability between 0 and 1. Softmax converts raw scores into a valid probability distribution. These simple mathematical functions are what enable neural networks to learn complex patterns like recognizing faces, translating languages, and playing games at superhuman levels.
In this module, you'll implement five essential activation functions from scratch. By the end, you'll understand why ReLU replaced sigmoid in hidden layers, how numerical stability prevents catastrophic failures in softmax, and when to use each activation in production systems.
Learning Objectives#
Tip
By completing this module, you will:
Implement five core activation functions (ReLU, Sigmoid, Tanh, GELU, Softmax) with proper numerical stability
Understand why nonlinearity is essential for neural network expressiveness and how activations enable learning
Master computational trade-offs between activation choices and their impact on training speed
Connect your implementations to production patterns in PyTorch and real-world architecture decisions
What You'll Build#
flowchart LR
subgraph "Your Activation Functions"
A["ReLU<br/>max(0, x)"]
B["Sigmoid<br/>1/(1+e^-x)"]
C["Tanh<br/>(e^x - e^-x)/(e^x + e^-x)"]
D["GELU<br/>x·Ί(x)"]
E["Softmax<br/>e^xi / ÎŁe^xj"]
end
F[Input Tensor] --> A
F --> B
F --> C
F --> D
F --> E
A --> G[Output Tensor]
B --> G
C --> G
D --> G
E --> G
style A fill:#e1f5ff
style B fill:#fff3cd
style C fill:#f8d7da
style D fill:#d4edda
style E fill:#e2d5f1
Fig. 7 Your Activation Functions#
Implementation roadmap:
| Part | What You'll Implement | Key Concept |
|---|---|---|
| 1 | `ReLU` | Sparsity through zeroing negatives |
| 2 | `Sigmoid` | Mapping to (0,1) for probabilities |
| 3 | `Tanh` | Zero-centered activation for better gradients |
| 4 | `GELU` | Smooth nonlinearity for transformers |
| 5 | `Softmax` | Probability distributions with numerical stability |
The pattern you'll enable:
```python
# Transforming tensors through nonlinear functions
relu = ReLU()
activated = relu(x)  # Zeros negatives, keeps positives

softmax = Softmax()
probabilities = softmax(logits)  # Converts to probability distribution (sums to 1)
```
What You're NOT Building (Yet)#
To keep this module focused, you will not implement:
Gradient computation (that's Module 05: Autograd - `backward()` methods are stubs for now)
Learnable parameters (activations are fixed mathematical functions)
Advanced variants (LeakyReLU, ELU, Swish - PyTorch has dozens; you'll build the core five)
GPU acceleration (your NumPy implementation runs on CPU)
You are building the nonlinear transformations. Automatic differentiation comes in Module 05.
API Reference#
This section provides a quick reference for the activation classes you'll build. Each activation is a callable object with a forward() method that transforms an input tensor element-wise.
Activation Pattern#
All activations follow this structure:
```python
class ActivationName:
    def forward(self, x: Tensor) -> Tensor:
        # Apply mathematical transformation
        pass

    def __call__(self, x: Tensor) -> Tensor:
        return self.forward(x)

    def backward(self, grad: Tensor) -> Tensor:
        # Stub for Module 05
        pass
```
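Calling the object and calling forward() do the same thing. A quick sanity check, as a sketch that assumes your Tensor from Module 01 and a finished ReLU, using the import paths shown in the Production Context comparison later on this page:

```python
from tinytorch.core.activations import ReLU
from tinytorch.core.tensor import Tensor

relu = ReLU()
x = Tensor([[-2.0, 0.0, 3.0]])
print(relu(x).data)          # [[0. 0. 3.]]
print(relu.forward(x).data)  # same result: __call__ simply delegates to forward
```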
Core Activations#
| Activation | Mathematical Form | Output Range | Primary Use Case |
|---|---|---|---|
| `ReLU` | max(0, x) | [0, ∞) | Hidden layers (CNNs, MLPs) |
| `Sigmoid` | 1 / (1 + e^-x) | (0, 1) | Binary classification output |
| `Tanh` | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | RNNs, zero-centered needs |
| `GELU` | x · Φ(x) | ≈ [-0.17, ∞) | Transformers (GPT, BERT) |
| `Softmax` | e^xi / Σe^xj | (0, 1), sums to 1 | Multi-class classification |
Method Signatures#
ReLU
ReLU.forward(x: Tensor) -> Tensor
Sets negative values to zero, preserves positive values.
Sigmoid
Sigmoid.forward(x: Tensor) -> Tensor
Maps any real number to (0, 1) range using logistic function.
Tanh
Tanh.forward(x: Tensor) -> Tensor
Maps any real number to (-1, 1) range using hyperbolic tangent.
GELU
GELU.forward(x: Tensor) -> Tensor
Smooth approximation to ReLU using Gaussian error function.
Softmax
Softmax.forward(x: Tensor, dim: int = -1) -> Tensor
Converts vector to probability distribution along specified dimension.
Core Concepts#
This section covers the fundamental ideas you need to understand activation functions deeply. These concepts explain why neural networks need nonlinearity, how each activation behaves differently, and what trade-offs you're making when you choose one over another.
Why Non-linearity Matters#
Consider what happens when you stack linear transformations. If you multiply a matrix by a vector, then multiply the result by another matrix, the composition is still just matrix multiplication. Mathematically:
f(x) = W₂(W₁x) = (W₂W₁)x = Wx
A 100-layer network of pure matrix multiplications is identical to a single matrix multiplication. The depth buys you nothing.
Activation functions break this linearity. When you insert f(x) = max(0, x) between layers, the composition becomes nonlinear:
f(x) = max(0, W₂ max(0, W₁x))
Now you can't simplify the layers away. Each layer can learn to detect increasingly complex patterns. Layer 1 might detect edges in an image. Layer 2 combines edges into shapes. Layer 3 combines shapes into objects. This hierarchical feature learning is only possible because activations introduce nonlinearity.
Without activations, neural networks are just linear regression, no matter how many layers you stack. With activations, they become universal function approximators capable of learning any pattern from data.
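A few lines of NumPy make both points concrete. This is a standalone sketch with made-up 2×2 matrices, independent of your Tensor class:

```python
import numpy as np

W1 = np.array([[1.0, -2.0],
               [3.0,  0.5]])
W2 = np.array([[0.5, 1.0]])
x = np.array([1.0, 1.0])

two_layers = W2 @ (W1 @ x)              # "deep" network: two stacked linear maps -> [3.]
one_layer = (W2 @ W1) @ x               # the single equivalent map W = W2 @ W1   -> [3.]
with_relu = W2 @ np.maximum(0, W1 @ x)  # ReLU inserted between the layers        -> [3.5]
print(two_layers, one_layer, with_relu)
```

The first two results agree exactly; the third differs, which is the nonlinearity that makes depth worthwhile.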
ReLU and Its Variants#
ReLU (Rectified Linear Unit) is deceptively simple: it zeros out negative values and leaves positive values unchanged. Here's the complete implementation from your module:
```python
class ReLU:
    def forward(self, x: Tensor) -> Tensor:
        """Apply ReLU activation element-wise."""
        result = np.maximum(0, x.data)
        return Tensor(result)
```
This simplicity is ReLU's greatest strength. The operation is a single comparison per element: O(n) with a tiny constant factor. Modern CPUs can execute billions of comparisons per second. Compare this to sigmoid, which requires computing an exponential for every element.
ReLU creates sparsity. When half your activations are exactly zero, computations become faster (multiplying by zero is free) and models generalize better (sparse representations are less prone to overfitting). In a 1000-neuron layer, ReLU typically activates 300-500 neurons, effectively creating a smaller, specialized network for each input.
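You can see the sparsity directly. A raw-NumPy sketch; the exact count varies with the random draw, and a biased linear layer in front of the ReLU pushes it below the roughly 50% you get from pure N(0, 1) inputs:

```python
import numpy as np

x = np.random.randn(1, 1000)               # one input to a 1000-neuron layer, values ~ N(0, 1)
activated = np.maximum(0, x)
print(int(np.count_nonzero(activated)), "of 1000 neurons active")  # roughly 500
```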
The discontinuity at zero is both a feature and a bug. During training (Module 05), you'll discover that ReLU's gradient is exactly 1 for positive inputs and exactly 0 for negative inputs. This prevents the vanishing gradient problem that plagued sigmoid-based networks. But it creates a new problem: dying ReLU. If a neuron's weights shift such that it always receives negative inputs, it will output zero forever, and the zero gradient means it can never recover.
Despite this limitation, ReLU remains the default choice for hidden layers in CNNs and feedforward networks. Its speed and effectiveness at preventing vanishing gradients make it hard to beat.
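For reference, LeakyReLU, one of the variants listed under "What You're NOT Building", addresses the dying-ReLU failure mode by passing a small slope through for negative inputs. The following is a hedged sketch in the same style as the module's classes; the negative_slope name and 0.01 default follow PyTorch's convention and are not a TinyTorch requirement:

```python
import numpy as np
from tinytorch.core.tensor import Tensor

class LeakyReLU:
    def __init__(self, negative_slope: float = 0.01):
        self.negative_slope = negative_slope

    def forward(self, x: Tensor) -> Tensor:
        """Like ReLU, but scale negative inputs by a small slope instead of zeroing them."""
        result = np.where(x.data > 0, x.data, self.negative_slope * x.data)
        return Tensor(result)

    def __call__(self, x: Tensor) -> Tensor:
        return self.forward(x)
```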
Sigmoid and Tanh#
Sigmoid maps any real number to the range (0, 1), making it perfect for representing probabilities:
```python
class Sigmoid:
    def forward(self, x: Tensor) -> Tensor:
        """Apply sigmoid activation element-wise."""
        z = np.clip(x.data, -500, 500)  # Prevent overflow
        result_data = np.zeros_like(z)

        # Positive values: 1 / (1 + exp(-x))
        pos_mask = z >= 0
        result_data[pos_mask] = 1.0 / (1.0 + np.exp(-z[pos_mask]))

        # Negative values: exp(x) / (1 + exp(x))
        neg_mask = z < 0
        exp_z = np.exp(z[neg_mask])
        result_data[neg_mask] = exp_z / (1.0 + exp_z)

        return Tensor(result_data)
```
Notice the numerical stability measures. Computing 1 / (1 + exp(-x)) directly misbehaves for large negative inputs: at x = -1000, exp(1000) overflows to infinity, triggering overflow warnings and an infinite intermediate value. The mathematically equivalent exp(x) / (1 + exp(x)) has the mirror-image problem, overflowing for large positive x and producing NaN from inf / inf. The solution is to choose the formula based on the sign of x, so the exponential's argument is never positive, and to clip extreme values as an extra safeguard.
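A quick demonstration, assuming the Tensor and Sigmoid classes above are importable from the paths used later on this page; the naive formula's overflow warning is suppressed so the script runs cleanly:

```python
import numpy as np
from tinytorch.core.activations import Sigmoid
from tinytorch.core.tensor import Tensor

def naive_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # exp gets a huge positive argument when x is very negative

x = np.array([-1000.0, 0.0, 1000.0])
with np.errstate(over="ignore"):
    print(naive_sigmoid(x))           # [0.  0.5 1. ] -- but exp(1000) overflowed to inf along the way

print(Sigmoid()(Tensor(x)).data)      # [0.  0.5 1. ] with no overflow: exp only sees non-positive arguments
```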
Sigmoid's smooth S-curve makes it interpretable as a probability, which is why it's still used for binary classification outputs. But for hidden layers, it has fatal flaws. When |x| is large, the output saturates near 0 or 1, and the gradient becomes nearly zero. In deep networks, these tiny gradients multiply together as they backpropagate, vanishing exponentially. This is why sigmoid was largely replaced by ReLU for hidden layers around 2012.
Tanh is sigmoid's zero-centered cousin, mapping inputs to (-1, 1):
```python
class Tanh:
    def forward(self, x: Tensor) -> Tensor:
        """Apply tanh activation element-wise."""
        result = np.tanh(x.data)
        return Tensor(result)
```
The zero-centering matters because it means the output has roughly equal numbers of positive and negative values. This can help with gradient flow in recurrent networks, where the same weights are applied repeatedly. Tanh still suffers from vanishing gradients at extreme values, but the zero-centering makes it preferable to sigmoid when you need bounded outputs.
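Tanh and sigmoid are close relatives: tanh(x) = 2·sigmoid(2x) − 1, i.e. tanh is a sigmoid stretched and shifted to be zero-centered. A quick raw-NumPy check of the identity:

```python
import numpy as np

x = np.linspace(-3, 3, 7)
sigmoid_2x = 1.0 / (1.0 + np.exp(-2 * x))
print(np.allclose(np.tanh(x), 2 * sigmoid_2x - 1))   # True
```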
Softmax and Numerical Stability#
Softmax converts any vector into a valid probability distribution. All outputs are positive, and they sum to exactly 1. This makes it essential for multi-class classification:
```python
class Softmax:
    def forward(self, x: Tensor, dim: int = -1) -> Tensor:
        """Apply softmax activation along specified dimension."""
        # Numerical stability: subtract max to prevent overflow
        x_max_data = np.max(x.data, axis=dim, keepdims=True)
        x_max = Tensor(x_max_data, requires_grad=False)
        x_shifted = x - x_max

        # Compute exponentials
        exp_values = Tensor(np.exp(x_shifted.data), requires_grad=x_shifted.requires_grad)

        # Sum along dimension
        exp_sum_data = np.sum(exp_values.data, axis=dim, keepdims=True)
        exp_sum = Tensor(exp_sum_data, requires_grad=exp_values.requires_grad)

        # Normalize to get probabilities
        result = exp_values / exp_sum
        return result
```
The max subtraction is critical. Without it, softmax([1000, 1001, 1002]) would compute exp(1000), which overflows to infinity, producing NaN results. Subtracting the max first gives softmax([0, 1, 2]), which computes safely. Mathematically, this is identical because the max factor cancels out:
exp(x - max) / Σ exp(x - max) = exp(x) / Σ exp(x)
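You can check both claims in raw NumPy: the naive version produces NaN, the shifted version is safe, and the shift leaves the probabilities unchanged. Overflow and invalid-value warnings are suppressed so the NaN shows up as a value:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits))
print(naive)                                   # [nan nan nan] -- exp(1000) overflowed to inf

shifted = logits - np.max(logits)              # [-2. -1.  0.]
stable = np.exp(shifted) / np.sum(np.exp(shifted))
print(stable, stable.sum())                    # ~[0.09 0.245 0.665], sums to 1.0
```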
Softmax amplifies differences. If the input is [1, 2, 3], the output is approximately [0.09, 0.24, 0.67]. The largest input gets 67% of the probability mass, even though it's only 3× larger than the smallest input. This is because exponentials grow superlinearly. In classification, this is desirable: you want the network to commit strongly to its best guess.
But softmax's coupling is a gotcha. When you change one input, all outputs change, because they're normalized by the same sum. This means the gradient involves a Jacobian matrix, not just element-wise derivatives. You'll see this complexity when you implement backward() in Module 05.
Choosing Activations#
Here's the decision tree production ML engineers use:
For hidden layers:
Default choice: ReLU (fast, prevents vanishing gradients, creates sparsity)
Modern transformers: GELU (smooth, better gradient flow, state-of-the-art results; a sketch appears after this list)
Recurrent networks: Tanh (zero-centered helps with recurrence)
Experimental: LeakyReLU, ELU, Swish (variants that fix dying ReLU problem)
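Since GELU has no reference implementation excerpted above, here is a hedged sketch using the common tanh approximation, 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))); the module may instead use the exact erf form, so treat this as illustrative only:

```python
import numpy as np
from tinytorch.core.tensor import Tensor

class GELU:
    def forward(self, x: Tensor) -> Tensor:
        """Apply GELU via the tanh approximation of x * Phi(x)."""
        inner = np.sqrt(2.0 / np.pi) * (x.data + 0.044715 * np.power(x.data, 3))
        result = 0.5 * x.data * (1.0 + np.tanh(inner))
        return Tensor(result)

    def __call__(self, x: Tensor) -> Tensor:
        return self.forward(x)
```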
For output layers:
Binary classification: Sigmoid (outputs a valid probability in (0, 1))
Multi-class classification: Softmax (outputs probability distribution summing to 1)
Regression: None (linear output, no activation)
Computational cost matters:
ReLU: 1× (baseline, just comparisons)
GELU: 4-5× (exponential in approximation)
Sigmoid/Tanh: 3-4× (exponentials)
Softmax: 5×+ (exponentials + normalization)
For a 1 billion parameter model, using GELU instead of ReLU in every hidden layer might increase training time by 20-30%. But if GELU gives you 2% better accuracy, that trade-off is worth it for production systems where model quality matters more than training speed.
Computational Complexity#
All activation functions are element-wise operations, meaning they apply independently to each element of the tensor. This gives O(n) time complexity where n is the total number of elements. But the constant factors differ dramatically:
| Operation | Complexity | Cost Relative to ReLU |
|---|---|---|
| ReLU (`max(0, x)`) | O(n) comparisons | 1× (baseline) |
| Sigmoid/Tanh | O(n) exponentials | 3-4× |
| GELU | O(n) exponentials + multiplies | 4-5× |
| Softmax | O(n) exponentials + O(n) sum + O(n) divisions | 5×+ |
Exponentials are expensive. A modern CPU can execute 1 billion comparisons per second but only 250 million exponentials per second. This is why ReLU is so popular: at scale, a 4Ă speedup in activation computation can mean the difference between training in 1 day versus 4 days.
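A rough way to see the gap on your own machine. This sketch times NumPy's vectorized comparison against a sigmoid-style exponential expression, not the full activation classes; absolute numbers are hardware-dependent:

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

relu_time = timeit.timeit(lambda: np.maximum(0, x), number=100)
sigmoid_time = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
print(f"ReLU-style:    {relu_time:.3f}s for 100 runs")
print(f"Sigmoid-style: {sigmoid_time:.3f}s for 100 runs (a 3-4x gap is typical)")
```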
Memory complexity is O(n) for all activations because they create an output tensor the same size as the input. Softmax requires small temporary buffers for the exponentials and sum, but this overhead is negligible compared to the tensor sizes in production networks.
Production Context#
Your Implementation vs. PyTorch#
Your TinyTorch activations and PyTorch's torch.nn.functional activations implement the same mathematical functions with the same numerical stability measures. The differences are in optimization and GPU support:
| Feature | Your Implementation | PyTorch |
|---|---|---|
| Backend | NumPy (Python/C) | C++/CUDA kernels |
| Speed | 1× (CPU baseline) | 10-100× faster (GPU) |
| Numerical Stability | ✓ Max subtraction (Softmax), clipping (Sigmoid) | ✓ Same techniques |
| Autograd | Stubs (Module 05) | Full gradient computation |
| Variants | 5 core activations | 30+ variants (LeakyReLU, PReLU, Mish, etc.) |
Code Comparison#
The following comparison shows equivalent activation usage in TinyTorch and PyTorch. Notice how the APIs are nearly identical, differing only in import paths and minor syntax.
```python
# TinyTorch
from tinytorch.core.activations import ReLU, Sigmoid, Softmax
from tinytorch.core.tensor import Tensor

# Element-wise activations
x = Tensor([[-1, 0, 1, 2]])
relu = ReLU()
activated = relu(x)  # [0, 0, 1, 2]

# Binary classification output
sigmoid = Sigmoid()
probability = sigmoid(x)  # All values in (0, 1)

# Multi-class classification output
logits = Tensor([[1, 2, 3]])
softmax = Softmax()
probs = softmax(logits)  # [0.09, 0.24, 0.67], sum = 1
```

```python
# PyTorch
import torch
import torch.nn.functional as F

# Element-wise activations
x = torch.tensor([[-1, 0, 1, 2]], dtype=torch.float32)
activated = F.relu(x)  # [0, 0, 1, 2]

# Binary classification output
probability = torch.sigmoid(x)  # All values in (0, 1)

# Multi-class classification output
logits = torch.tensor([[1, 2, 3]], dtype=torch.float32)
probs = F.softmax(logits, dim=-1)  # [0.09, 0.24, 0.67], sum = 1
```
Let's walk through the key similarities and differences:
Imports: TinyTorch imports activation classes; PyTorch uses the functional interface `torch.nn.functional`. Both approaches work; PyTorch also supports class-based activations via `torch.nn.ReLU()`.
ReLU: Identical semantics. Both zero out negative values and preserve positive values.
Sigmoid: Identical mathematical function. Both use numerically stable implementations to prevent overflow.
Softmax: Same mathematical operation. Both require specifying the dimension for multi-dimensional tensors; PyTorch uses the `dim` keyword argument, while TinyTorch defaults to `dim=-1`.
Tip
Whatâs Identical
Mathematical functions, numerical stability techniques (max subtraction in softmax), and the concept of element-wise transformations. When you debug PyTorch activation issues, you'll understand exactly what's happening because you implemented the same logic.
Why Activations Matter at Scale#
To appreciate why activation choice matters, consider the scale of modern ML systems:
Large language models: GPT-3 has 96 transformer layers, each applying GELU inside its feed-forward block. That's on the order of a hundred large GELU operations per forward pass over activations spanning billions of parameters.
Image classification: ResNet-50 has 49 convolutional layers, each followed by ReLU. Processing a batch of 256 images at 224×224 resolution means 12 billion ReLU operations per batch.
Production serving: a model serving 1000 requests per second handles 86 million requests per day, and every one of them runs the network's full stack of activation layers. A 20% speedup from ReLU vs GELU saves hours of compute time.
Activation functions account for 5-15% of total training time in typical networks (the rest is matrix multiplication). But in transformer models with many layers and small matrix sizes, activations can account for 20-30% of compute time. This is why GELU vs ReLU is a real trade-off: slower computation but potentially better accuracy.
Check Your Understanding#
Test yourself with these systems thinking questions. They're designed to build intuition for how activations behave in real neural networks.
Q1: Memory Calculation
A batch of 32 samples passes through a hidden layer with 4096 neurons and ReLU activation. How much memory is required to store the activation outputs (float32)?
Answer
32 × 4096 × 4 bytes = 524,288 bytes ≈ 512 KB
This is the activation memory for ONE layer. A 100-layer network needs 50 MB just to store activations for one forward pass. This is why activation memory dominates training memory usage (you'll see this in Module 05 when you cache activations for backpropagation).
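The same arithmetic as a snippet you can adapt to other layer sizes:

```python
batch, neurons, bytes_per_float32 = 32, 4096, 4
per_layer = batch * neurons * bytes_per_float32
print(per_layer / 1024, "KB per layer")                 # 512.0 KB
print(per_layer * 100 / 1024**2, "MB for 100 layers")   # 50.0 MB
```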
Q2: Computational Cost
If ReLU takes 1ms to activate 1 million neurons, approximately how long will GELU take on the same input?
Answer
GELU is approximately 4-5Ă slower than ReLU due to exponential computation in the sigmoid approximation.
Expected time: 4-5ms
At scale, this matters: if you have 100 activation layers in your model, switching from ReLU to GELU adds 300-400ms per forward pass. For training that requires millions of forward passes, this multiplies into hours or days of extra compute time.
Q3: Numerical Stability
Why does softmax subtract the maximum value before computing exponentials? What would happen without this step?
Answer
Without max subtraction: Computing softmax([1000, 1001, 1002]) requires exp(1000), which overflows to infinity in float32/float64, producing NaN.
With max subtraction: First compute x_shifted = x - max(x) = [0, 1, 2], then compute exp([0, 1, 2]) which stays within float range.
Why this works mathematically:
exp(x - max) / Σ exp(x - max) = [exp(x) / exp(max)] / [Σ exp(x) / exp(max)] = exp(x) / Σ exp(x)
The `exp(max)` factor cancels out, so the result is mathematically identical. But numerically, it prevents overflow. This is a classic example of why production ML requires careful numerical engineering, not just correct math.
Q4: Sparsity Analysis
A ReLU layer processes input tensor with shape (128, 1024) containing values drawn from a normal distribution N(0, 1). Approximately what percentage of outputs will be exactly zero?
Answer
For a standard normal distribution N(0, 1), approximately 50% of values are negative.
ReLU zeros all negative values, so approximately 50% of outputs will be exactly zero.
Total elements: 128 × 1024 = 131,072. Zeros: ≈ 65,536.
This sparsity has major implications:
Speed: Multiplying by zero is free, so downstream computations can skip ~50% of operations
Memory: Sparse formats can compress the output by 2×
Generalization: Sparse representations often generalize better (less overfitting)
This is why ReLU is so effective: it creates natural sparsity without requiring explicit regularization.
Q5: Activation Selection
You're building a sentiment classifier that outputs "positive" or "negative". Which activation should you use for the output layer, and why?
Answer
Use Sigmoid for the output layer.
Reasoning:
Binary classification needs a single probability value in [0, 1]
Sigmoid maps any real number to (0, 1)
Output can be interpreted as P(positive), where 0.8 means "80% confident this is positive"
Decision rule: predict positive if sigmoid(output) > 0.5
Why NOT other activations:
Softmax: Overkill for binary classification (designed for multi-class), though technically works with 2 outputs
ReLU: Outputs unbounded positive values, not interpretable as probabilities
Tanh: Outputs in (-1, 1), not directly interpretable as probabilities
Production pattern:
Input → Linear + ReLU → Linear + ReLU → Linear + Sigmoid → Binary Probability
For multi-class sentiment (positive/negative/neutral), you'd use Softmax instead to get a 3-element probability distribution.
Further Reading#
For students who want to understand the academic foundations and historical development of activation functions:
Seminal Papers#
Deep Sparse Rectifier Neural Networks - Glorot, Bordes, Bengio (2011). The paper that established ReLU as the default activation for deep networks, showing how its sparsity and constant gradient enable training of very deep networks. AISTATS
Gaussian Error Linear Units (GELUs) - Hendrycks & Gimpel (2016). Introduced the smooth activation that powers modern transformers like GPT and BERT. Explains the probabilistic interpretation and why smoothness helps optimization. arXiv:1606.08415
Attention Is All You Need - Vaswani et al. (2017). While primarily about transformers, this paper's use of specific activations (ReLU in the position-wise FFN, Softmax in attention) established patterns still used today. NeurIPS
Additional Resources#
Textbook: "Deep Learning" by Goodfellow, Bengio, and Courville - Chapter 6.3 covers activation functions with mathematical rigor
Blog: Understanding Activation Functions - Amazon's MLU visual explanation of ReLU
Whatâs Next#
See also
Coming Up: Module 03 - Layers
Implement Linear layers that combine your Tensor operations with your activation functions. You'll build the building blocks that stack to form neural networks: weights, biases, and the forward pass that transforms inputs to outputs.
Preview - How Your Activations Get Used in Future Modules:
| Module | What It Does | Your Activations In Action |
|---|---|---|
| 03: Layers | Neural network building blocks | ReLU applied between Linear layers in the forward pass |
| 04: Losses | Training objectives | Softmax probabilities feed into cross-entropy loss |
| 05: Autograd | Automatic gradients | Your `backward()` stubs gain real gradient computation |
Get Started#
Tip
Interactive Options
Launch Binder - Run interactively in browser, no setup required
Open in Colab - Use Google Colab for cloud compute
View Source - Browse the implementation code
Warning
Save Your Progress
Binder and Colab sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.