Module 11: Embeddings
Embeddings are the largest single tensor in most language models (GPT-3's token embedding table alone is 2.3 GB), yet every forward pass touches only a handful of rows. That makes them a row-sparse scatter/gather problem, not a matmul, and it puts them squarely in the memory-bandwidth regime of the roofline. This module builds the lookup table, the positional encodings, and the sparse-gradient backward pass that every transformer input layer depends on.
ARCHITECTURE TIER | Difficulty: ●●○○ | Time: 3-5 hours | Prerequisites: 01-08, 10
Prerequisites (Modules 01-08 and 10) mean you should understand:
- Tensor operations (shape manipulation, matrix operations, broadcasting)
- Training fundamentals (forward/backward, optimization)
- Tokenization (converting text to token IDs, vocabularies)
If you can explain how a tokenizer converts “hello” to token IDs and how to multiply matrices, you’re ready.
Overview
Neural networks operate on vectors. Language is made of tokens. Embeddings are how the two meet: a learnable lookup table that turns each integer token ID into a dense vector, so the rest of the network can do calculus on it.
Your tokenizer from Module 10 produces IDs like [42, 7, 15]. By the end of this module you have built the layer that turns those IDs into geometry — the same layer sitting at the input of every transformer from BERT to GPT-4 — and the positional encodings that tell the network where in the sequence each token lives.
Learning Objectives
- Implement embedding layers that convert token IDs to dense vectors through efficient table lookup
- Master positional encoding strategies including learned and sinusoidal approaches
- Understand memory scaling for embedding tables and the trade-offs between vocabulary size and embedding dimension
- Connect your implementation to production transformer architectures used in GPT and BERT
What You’ll Build
Implementation roadmap:
Table 1 lays out the implementation in order, one part at a time.
| Part | What You’ll Implement | Key Concept |
|---|---|---|
| 1 | Embedding class | Token ID to vector lookup via indexing |
| 2 | PositionalEncoding class | Learnable position embeddings |
| 3 | create_sinusoidal_embeddings() | Mathematical position encoding |
| 4 | EmbeddingLayer class | Complete token + position system |
The pattern you’ll enable:
# Converting tokens to position-aware dense vectors
embed_layer = EmbeddingLayer(vocab_size=50000, embed_dim=512)
tokens = Tensor([[1, 42, 7]]) # Token IDs from tokenizer
embeddings = embed_layer(tokens)  # (1, 3, 512) dense vectors ready for attention
What You’re NOT Building (Yet)
To keep this module focused, you will not implement:
- Attention mechanisms (that’s Module 12: Attention)
- Full transformer architectures (that’s Module 13: Transformers)
- Word2Vec or GloVe pretrained embeddings (you’re building learnable embeddings)
- Subword embedding composition (subword handling happens at the tokenization level, as in Module 10)
You are building the foundation for sequence models. Context-aware representations come next.
API Reference
This section documents the embedding components you’ll build. Use this as your reference while implementing and debugging.
Embedding Class
Embedding(vocab_size, embed_dim)
Learnable embedding layer that maps token indices to dense vectors through table lookup.
Constructor Parameters:
- vocab_size (int): Size of vocabulary (number of unique tokens)
- embed_dim (int): Dimension of embedding vectors
Core Methods:
Table 2 lists the methods on this class.
| Method | Signature | Description |
|---|---|---|
| forward | forward(indices: Tensor) -> Tensor | Lookup embeddings for token indices |
| parameters | parameters() -> List[Tensor] | Return weight matrix for optimization |
Properties:
- weight: Tensor of shape (vocab_size, embed_dim) containing learnable embeddings
PositionalEncoding Class
PositionalEncoding(max_seq_len, embed_dim)
Learnable positional encoding that adds trainable position-specific vectors to embeddings.
Constructor Parameters:
- max_seq_len (int): Maximum sequence length to support
- embed_dim (int): Embedding dimension (must match token embeddings)
Core Methods:
Table 3 lists the methods on this class.
| Method | Signature | Description |
|---|---|---|
| forward | forward(x: Tensor) -> Tensor | Add positional encodings to input embeddings |
| parameters | parameters() -> List[Tensor] | Return position embedding matrix |
Sinusoidal Embeddings Function
create_sinusoidal_embeddings(max_seq_len, embed_dim) -> Tensor
Creates fixed sinusoidal positional encodings using trigonometric functions. No parameters to learn.
Mathematical Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim))
PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim))
EmbeddingLayer Class
EmbeddingLayer(vocab_size, embed_dim, max_seq_len=512,
               pos_encoding='learned', scale_embeddings=False)
Complete embedding system combining token embeddings and positional encoding.
Constructor Parameters:
- vocab_size (int): Size of vocabulary
- embed_dim (int): Embedding dimension
- max_seq_len (int): Maximum sequence length for positional encoding
- pos_encoding (str): Type of positional encoding (‘learned’, ‘sinusoidal’, or None)
- scale_embeddings (bool): Whether to scale by sqrt(embed_dim) (transformer convention)
Core Methods:
Table 4 lists the methods on this class.
| Method | Signature | Description |
|---|---|---|
| forward | forward(tokens: Tensor) -> Tensor | Complete embedding pipeline |
| parameters | parameters() -> List[Tensor] | All trainable parameters |
Core Concepts
Four ideas carry the rest of this chapter: lookup-as-matmul, sparse gradient updates, the learned-vs-fixed positional tradeoff, and the memory cost of embedding dimension. Every modern language model is built from them.
From Indices to Vectors
Neural networks consume continuous vectors; language arrives as discrete IDs. After the tokenizer turns “the cat sat” into [1, 42, 7], you need a way to give those integers geometry.
The embedding layer is that bridge: a learnable matrix where row i is the vector for token i. For a 50,000-token vocabulary with 512-dimensional embeddings, the table is shape (50000, 512), initialized small. Training nudges similar words toward similar rows — semantic relationships emerge as geometric proximity.
The implementation is one line of NumPy:
The code in Listing 11.1 makes this concrete.
def forward(self, indices: Tensor) -> Tensor:
    """Forward pass: lookup embeddings for given indices."""
    # Validate indices are in range
    if np.any(indices.data >= self.vocab_size) or np.any(indices.data < 0):
        raise ValueError(
            f"Index out of range. Expected 0 <= indices < {self.vocab_size}, "
            f"got min={np.min(indices.data)}, max={np.max(indices.data)}"
        )
    # Perform embedding lookup using advanced indexing
    # This is equivalent to one-hot multiplication but much more efficient
    embedded = self.weight.data[indices.data.astype(int)]
    # Create result tensor with gradient tracking
    result = Tensor(embedded, requires_grad=self.weight.requires_grad)
    if result.requires_grad:
        result._grad_fn = EmbeddingBackward(self.weight, indices)
    return result
Listing 11.1 — Embedding forward pass. NumPy fancy indexing replaces a one-hot matmul, making lookup O(1) per token. {#lst-11-embedding-forward}
The work happens in self.weight.data[indices.data.astype(int)]. NumPy’s fancy indexing pulls rows 1, 42, and 7 in a single operation, broadcasting cleanly over batched inputs of any shape. It is mathematically equivalent to multiplying a one-hot matrix against the weight table — but it is orders of magnitude faster and allocates no intermediate tensor.
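To see that equivalence concretely, here is a small self-contained check in plain NumPy (illustrative values, not the module's Tensor class) comparing fancy indexing against an explicit one-hot matmul:
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
weight = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)
indices = np.array([1, 4, 7, 4])              # token IDs, duplicates allowed

gathered = weight[indices]                    # fancy indexing: pull the rows directly
one_hot = np.eye(vocab_size, dtype=np.float32)[indices]
via_matmul = one_hot @ weight                 # one-hot matmul: same values
assert np.allclose(gathered, via_matmul)      # identical results; the gather never builds the one-hot matrix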
Embedding Table Mechanics
The embedding table is a learnable parameter matrix initialized with small random values. For vocabulary size V and embedding dimension D, the table has shape (V, D). Each row represents one token’s learned representation.
Initialization matters for training stability. The implementation uses Xavier initialization:
# Xavier initialization for better gradient flow
limit = math.sqrt(6.0 / (vocab_size + embed_dim))
self.weight = Tensor(
    np.random.uniform(-limit, limit, (vocab_size, embed_dim)),
    requires_grad=True
)
This initialization scale ensures gradients neither explode nor vanish at the start of training. The limit is computed from both vocabulary size and embedding dimension, balancing the fan-in and fan-out of the embedding layer.
During training, gradients flow back through the lookup operation to update only the embeddings that were accessed. If your batch contains tokens [5, 10, 10, 5], only rows 5 and 10 of the embedding table receive gradient updates. This sparse gradient pattern is handled by the EmbeddingBackward gradient function, making embedding updates extremely efficient even for vocabularies with millions of tokens.
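The actual EmbeddingBackward lives in the module source; a minimal NumPy sketch of the scatter-add such a backward has to perform (hypothetical helper name, flat index array, upstream gradient of shape (num_tokens, embed_dim)) looks like this:
import numpy as np

def embedding_backward_sketch(weight_shape, indices, grad_output):
    """Accumulate upstream gradients into only the rows that were looked up."""
    grad_weight = np.zeros(weight_shape, dtype=grad_output.dtype)
    # np.add.at is an unbuffered scatter-add, so repeated indices like [5, 10, 10, 5]
    # accumulate correctly instead of overwriting each other.
    np.add.at(grad_weight, indices, grad_output)
    return grad_weight

grad_w = embedding_backward_sketch((100, 8), np.array([5, 10, 10, 5]),
                                   np.ones((4, 8), dtype=np.float32))
# Rows 5 and 10 each hold two accumulated contributions; every other row stays zero.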
The embedding table is the largest single tensor in most language models — GPT-3’s token embeddings alone consume 2.3 GB — yet a forward pass only touches the handful of rows corresponding to tokens in the current batch. Treating the embedding step as a row-sparse scatter/gather (rather than a dense matmul against a one-hot matrix) is what keeps training tractable: you never materialize the (batch × seq, vocab) one-hot tensor, you never multiply by it, and the backward pass only writes gradients into the rows that were read. Production frameworks go further — they store the embedding matrix in row-major layout so each lookup is one contiguous HBM transaction per token, use sparse gradient tensors in the optimizer (Adam state for only the accessed rows), and in distributed training shard the table across devices so no single GPU has to hold all 2.3 GB. Every one of these optimizations is a direct consequence of the lookup-as-sparse-scatter pattern you just implemented.
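As one illustration of that last point, a row-sparse optimizer step can skip the untouched rows entirely. This is only a sketch of the idea in plain NumPy (plain SGD, hypothetical helper name), not TinyTorch's or PyTorch's actual optimizer:
import numpy as np

def sparse_sgd_step(weight, indices, grad_rows, lr=0.1):
    """Update only the embedding rows that appeared in the batch."""
    unique_ids, inverse = np.unique(indices, return_inverse=True)
    summed = np.zeros((unique_ids.size, weight.shape[1]), dtype=weight.dtype)
    np.add.at(summed, inverse, grad_rows)      # merge gradients for repeated tokens
    weight[unique_ids] -= lr * summed          # touches len(unique_ids) rows, not the whole table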
Learned vs Fixed Embeddings
Positional information can be added to token embeddings in two fundamentally different ways: learned position embeddings and fixed sinusoidal encodings.
Learned positional encoding treats each sequence position as a unique, trainable parameter, just like token embeddings. For a maximum sequence length M and embedding dimension D, you construct an auxiliary embedding table of shape (M, D):
# Initialize position embedding matrix
# Smaller initialization than token embeddings since these are additive
limit = math.sqrt(2.0 / embed_dim)
self.position_embeddings = Tensor(
    np.random.uniform(-limit, limit, (max_seq_len, embed_dim)),
    requires_grad=True
)
During forward pass, you slice position embeddings and add them to token embeddings:
The code in Listing 11.2 makes this concrete.
def forward(self, x: Tensor) -> Tensor:
    """Add positional encodings to input embeddings."""
    batch_size, seq_len, embed_dim = x.shape
    # Slice position embeddings for this sequence length
    # Tensor slicing preserves gradient flow (from Module 01's __getitem__)
    pos_embeddings = self.position_embeddings[:seq_len]
    # Reshape to add batch dimension: (1, seq_len, embed_dim)
    pos_data = pos_embeddings.data[np.newaxis, :, :]
    pos_embeddings_batched = Tensor(pos_data, requires_grad=pos_embeddings.requires_grad)
    # Copy gradient function to preserve backward connection
    if hasattr(pos_embeddings, '_grad_fn') and pos_embeddings._grad_fn is not None:
        pos_embeddings_batched._grad_fn = pos_embeddings._grad_fn
    # Add positional information - gradients flow through both x and pos_embeddings!
    result = x + pos_embeddings_batched
    return result
Listing 11.2 — PositionalEncoding forward pass. Slices position embeddings to current sequence length and adds them to token embeddings while preserving gradient flow. {#lst-11-positional-forward}
The slicing operation self.position_embeddings[:seq_len] preserves gradient tracking because TinyTorch’s Tensor __getitem__ (from Module 01) maintains the connection to the original parameter. This allows backpropagation to update only the position embeddings actually used in the forward pass.
The advantage is flexibility: the model can learn task-specific positional patterns. The disadvantage is memory cost and a hard maximum sequence length.
Fixed sinusoidal encoding uses mathematical patterns requiring no parameters. The formula creates unique position signatures using sine and cosine at different frequencies:
The code in Listing 11.3 makes this concrete.
# Create position indices [0, 1, 2, ..., max_seq_len-1]
position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
# Create dimension indices for calculating frequencies
div_term = np.exp(
    np.arange(0, embed_dim, 2, dtype=np.float32) *
    -(math.log(10000.0) / embed_dim)
)
# Initialize the positional encoding matrix
pe = np.zeros((max_seq_len, embed_dim), dtype=np.float32)
# Apply sine to even indices, cosine to odd indices
pe[:, 0::2] = np.sin(position * div_term)
pe[:, 1::2] = np.cos(position * div_term)
Listing 11.3 — Sinusoidal positional encoding construction. Builds a fixed position matrix from trigonometric functions at geometrically spaced frequencies. {#lst-11-sinusoidal-pe}
This creates a unique vector for each position where low-index dimensions oscillate rapidly (high frequency) and high-index dimensions change slowly (low frequency). The pattern allows the model to learn relative positions through dot products, and crucially, can extrapolate to sequences longer than seen during training.
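Because the table is a pure function of position, rebuilding it with a larger max_seq_len reproduces the original rows exactly. A quick check, wrapping the construction above in a hypothetical helper:
import math
import numpy as np

def sinusoidal_pe(max_seq_len, embed_dim):
    position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embed_dim, 2, dtype=np.float32)
                      * -(math.log(10000.0) / embed_dim))
    pe = np.zeros((max_seq_len, embed_dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

assert np.allclose(sinusoidal_pe(1024, 64)[:512], sinusoidal_pe(512, 64))
# The first 512 positions are identical, so the encoding extends to longer contexts without retraining.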
Positional Encodings
Token embeddings are a bag of vectors — they carry no notion of order. Without positional information, “cat sat on mat” and “mat on sat cat” produce the same set of vectors and a downstream attention layer cannot tell them apart.
Positional encodings fix this by adding a position-specific vector to each token’s embedding. Token 42 at position 0 lands somewhere different in space than token 42 at position 5, and the network finally has access to where, not just what.
The Transformer paper introduced sinusoidal positional encoding with a clever mathematical structure. For position pos and dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim)) # Even dimensions
PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim)) # Odd dimensions
The 10000 base creates different wavelengths across dimensions. Dimension 0 oscillates rapidly (frequency ≈ 1), while dimension 510 changes extremely slowly (frequency ≈ 1/10000). This multiscale structure gives each position a unique “fingerprint” and enables the model to learn relative position through simple vector arithmetic.
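To see the spread of frequencies numerically, the wavelength of the sin/cos pair at index i is 2π · 10000^(2i/embed_dim) positions per full cycle. A quick illustrative calculation for embed_dim=512:
import numpy as np

embed_dim = 512
i = np.array([0, 64, 128, 255])                          # index of the sin/cos pair
wavelength = 2 * np.pi * 10000 ** (2 * i / embed_dim)
print(wavelength)   # ≈ [6.3, 62.8, 628.3, 60614]: from a few positions up to tens of thousands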
At position 0, all sine terms equal 0 and all cosine terms equal 1: [0, 1, 0, 1, 0, 1, ...]. At position 1, the pattern shifts based on each dimension’s frequency. The combination of many frequencies creates distinct encodings where nearby positions have similar (but not identical) vectors, providing smooth positional gradients.
The trigonometric identity enables learning relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) using sine and cosine addition formulas. This allows attention mechanisms to implicitly learn positional offsets (like “the token 3 positions ahead”) through learned weights on the position encodings, without the model needing separate relative position parameters.
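A quick numeric check of that identity for a single sin/cos pair (illustrative frequency, position, and offset): the matrix that maps PE(pos) to PE(pos+k) is a rotation that depends only on the offset k.
import numpy as np

w, p, k = 0.07, 11.0, 5.0                      # one frequency, a position, an offset
pe_p  = np.array([np.sin(w * p), np.cos(w * p)])
pe_pk = np.array([np.sin(w * (p + k)), np.cos(w * (p + k))])

R = np.array([[ np.cos(w * k), np.sin(w * k)],     # rotation by w*k, independent of p
              [-np.sin(w * k), np.cos(w * k)]])
assert np.allclose(R @ pe_p, pe_pk)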
Embedding Dimension Trade-offs
The embedding dimension D controls the capacity of your learned representations. Larger D provides more expressiveness but costs memory and compute. The choice involves several interacting factors.
Memory scaling: Embedding tables scale as vocab_size × embed_dim × 4 bytes (float32). A 50,000-token vocabulary with 512-dimensional embeddings costs 98 MB. Double the dimension to 1024 and the cost doubles to 195 MB. For large vocabularies, the embedding table often dominates total model memory — GPT-3’s 50,257-token vocabulary with 12,288-dimensional embeddings consumes 2.3 GB for token embeddings alone.
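The arithmetic is simple enough to wrap in a throwaway helper (a sketch, not part of the module's API) for sanity-checking the production table below:
def embedding_table_mb(vocab_size, embed_dim, bytes_per_param=4):
    """Embedding table size in MB (1024**2 bytes, matching the figures in this section)."""
    return vocab_size * embed_dim * bytes_per_param / (1024 ** 2)

print(embedding_table_mb(50_000, 512))       # ~97.7, the "98 MB" figure above
print(embedding_table_mb(50_257, 12_288))    # ~2355.8, GPT-3's token embeddings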
Semantic capacity: Higher dimensions allow finer-grained semantic distinctions. With 64 dimensions, you might capture basic categories (animals, actions, objects). With 512 dimensions, you can encode subtle relationships (synonyms, antonyms, part-of-speech, contextual variations). With 1024+ dimensions, you have capacity for highly nuanced semantic features discovered through training.
Computational cost: Every attention head in transformers performs operations over the embedding dimension. Memory bandwidth becomes the bottleneck: transferring embedding vectors from RAM to cache dominates the time to process them. Larger embeddings mean more memory traffic per token, reducing throughput.
Typical scales in production:
Table 5 places your implementation in context against production models.
| Model | Vocabulary | Embed Dim | Embedding Memory |
|---|---|---|---|
| Small BERT | 30,000 | 768 | 88 MB |
| GPT-2 | 50,257 | 1,024 | 196 MB |
| GPT-3 | 50,257 | 12,288 | 2,356 MB |
| Large Transformer | 100,000 | 1,024 | 391 MB |
The embedding dimension typically matches the model’s hidden dimension since embeddings feed directly into the first transformer layer. You rarely see models with embedding dimension different from hidden dimension (though it’s technically possible with a projection layer).
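A sketch of that projection idea in plain NumPy (illustrative sizes; factorized embeddings of this kind appear in models such as ALBERT): embed into a small dimension, then project up to the hidden dimension.
import numpy as np

vocab_size, embed_dim, hidden_dim = 50_000, 128, 768
rng = np.random.default_rng(0)
embed_table = rng.normal(0.0, 0.02, (vocab_size, embed_dim)).astype(np.float32)
proj = rng.normal(0.0, 0.02, (embed_dim, hidden_dim)).astype(np.float32)

tokens = np.array([[1, 42, 7]])
hidden = embed_table[tokens] @ proj           # (1, 3, 768), ready for the first transformer layer
# Parameters: 50,000*128 + 128*768 ≈ 6.5M vs 50,000*768 = 38.4M for a direct table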
Common Errors
These are the errors you’ll encounter most often when working with embeddings. Understanding why they happen will save you hours of debugging.
Index Out of Range
Error: ValueError: Index out of range. Expected 0 <= indices < 50000, got max=50001
Embedding lookup expects token IDs in the range [0, vocab_size-1]. If your tokenizer produces an ID of 50001 but your embedding layer has vocab_size=50000, the lookup fails.
Cause: Mismatch between tokenizer vocabulary size and embedding layer vocabulary size. This often happens when you train a tokenizer separately and forget to sync the vocabulary size when creating the embedding layer.
Fix: Ensure embed_layer.vocab_size matches your tokenizer’s vocabulary size exactly:
tokenizer = Tokenizer(vocab_size=50000)
embed = Embedding(vocab_size=tokenizer.vocab_size, embed_dim=512)
Sequence Length Exceeds Maximum
Error: ValueError: Sequence length 1024 exceeds maximum 512
Learned positional encodings have a fixed maximum sequence length set during initialization. If you try to process a sequence longer than this maximum, the forward pass fails because there are no position embeddings for those positions.
Cause: Input sequences longer than max_seq_len parameter used when creating the positional encoding layer.
Fix: Either increase max_seq_len during initialization, truncate your sequences, or use sinusoidal positional encoding which can handle arbitrary lengths:
# Option 1: Increase max_seq_len
pos_enc = PositionalEncoding(max_seq_len=2048, embed_dim=512)
# Option 2: Use sinusoidal (no length limit)
embed_layer = EmbeddingLayer(vocab_size=50000, embed_dim=512,
                             pos_encoding='sinusoidal')
Embedding Dimension Mismatch
Error: ValueError: Embedding dimension mismatch: expected 512, got 768
When adding positional encodings to token embeddings, the dimensions must match exactly. If your token embeddings are 512-dimensional but your positional encoding expects 768-dimensional inputs, element-wise addition fails.
Cause: Creating embedding components with inconsistent embed_dim values.
Fix: Use the same embed_dim for all embedding components:
embed_dim = 512
token_embed = Embedding(vocab_size=50000, embed_dim=embed_dim)
pos_enc = PositionalEncoding(max_seq_len=512, embed_dim=embed_dim)
Shape Errors with Batching
Error: ValueError: Expected 3D input (batch, seq, embed), got shape (128, 512)
Positional encoding expects 3D tensors with batch dimension. If you pass a 2D tensor (sequence, embedding), the forward pass fails.
Cause: Forgetting to add batch dimension when processing single sequences, or using raw embedding output without reshaping.
Fix: Ensure inputs have batch dimension, even for single sequences:
# Wrong: 2D input
tokens = Tensor([1, 2, 3])
embeddings = embed(tokens) # Shape: (3, 512) - missing batch dim
# Right: 3D input
tokens = Tensor([[1, 2, 3]]) # Added batch dimension
embeddings = embed(tokens)  # Shape: (1, 3, 512) - correct
Production Context
Your Implementation vs. PyTorch
Your TinyTorch embedding system and PyTorch’s torch.nn.Embedding share the same conceptual design and API patterns. The differences are in scale, optimization, and device support.
Table 6 places your implementation side by side with the production reference for direct comparison.
| Feature | Your Implementation | PyTorch |
|---|---|---|
| Backend | NumPy (CPU only) | C++/CUDA (CPU/GPU) |
| Lookup Speed | 1x (baseline) | 10-100x faster on GPU |
| Max Vocabulary | Limited by RAM | Billions (with techniques) |
| Positional Encoding | Learned + sinusoidal | Must implement yourself* |
| Sparse Gradients | Via custom backward | Native sparse gradient support |
| Memory Optimization | Standard | Quantization, pruning, sharing |
*PyTorch provides building blocks but you implement positional encoding patterns yourself (as you did here)
Code Comparison
The following comparison shows equivalent embedding operations in TinyTorch and PyTorch. Notice how the APIs mirror each other closely.
from tinytorch.core.embeddings import Embedding, EmbeddingLayer
# Create embedding layer
embed = Embedding(vocab_size=50000, embed_dim=512)
# Token lookup
tokens = Tensor([[1, 42, 7, 99]])
embeddings = embed(tokens) # (1, 4, 512)
# Complete system with position encoding
embed_layer = EmbeddingLayer(
    vocab_size=50000,
    embed_dim=512,
    max_seq_len=2048,
    pos_encoding='learned'
)
position_aware = embed_layer(tokens)
import torch
import torch.nn as nn
# Create embedding layer
embed = nn.Embedding(num_embeddings=50000, embedding_dim=512)
# Token lookup
tokens = torch.tensor([[1, 42, 7, 99]])
embeddings = embed(tokens) # (1, 4, 512)
# Complete system (you implement positional encoding yourself)
class EmbeddingWithPosition(nn.Module):
    def __init__(self, vocab_size, embed_dim, max_seq_len):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, tokens):
        seq_len = tokens.shape[1]
        positions = torch.arange(seq_len).unsqueeze(0)
        return self.token_embed(tokens) + self.pos_embed(positions)

embed_layer = EmbeddingWithPosition(50000, 512, 2048)
position_aware = embed_layer(tokens)
Let’s walk through each section to understand the comparison:
- Lines 1-2 (Import): Both frameworks provide dedicated embedding modules. TinyTorch packages everything in embeddings; PyTorch uses nn.Embedding as the base class.
- Lines 4-5 (Embedding Creation): Your Embedding class closely mirrors PyTorch’s nn.Embedding. The parameter names differ (vocab_size vs num_embeddings) but the concept is identical.
- Lines 7-9 (Token Lookup): Both use identical calling patterns. The embedding layer acts as a function, taking token IDs and returning dense vectors. Shape semantics are identical.
- Lines 11-20 (Complete System): Your EmbeddingLayer provides a complete system in one class. In PyTorch, you implement this pattern yourself by composing nn.Embedding layers for tokens and positions. The HuggingFace Transformers library implements this exact pattern for BERT, GPT, and other models.
- Lines 22-24 (Forward Pass): Both systems add token and position embeddings element-wise. Your implementation handles this internally; PyTorch requires you to manage position indices explicitly.
The shared patterns are embedding lookup semantics, gradient flow, and the addition of positional information. When you debug PyTorch transformer models, you’ll recognize these exact patterns because you built them yourself.
Why Embeddings Matter at Scale
The numbers behind a modern language model make the design pressures concrete:
- GPT-3 embeddings: 50,257 tokens × 12,288 dimensions = 618 million parameters = 2.3 GB for token embeddings alone (positional encodings extra).
- Lookup throughput: A batch of 32 sequences × 2048 tokens demands 65,536 embedding lookups per step. At 1,000 steps/sec, that is roughly 65 million lookups per second.
- Memory bandwidth: Each lookup pulls 2–4 KB from RAM to cache (512–1024 floats × 4 bytes). At scale, bandwidth — not arithmetic — is the bottleneck.
- Gradient sparsity: A batch touches only a tiny slice of the 50,257-row table. Efficient trainers exploit this and update gradients only for the rows that were accessed.
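A quick upper bound, using the 32 × 128 batch from the self-check questions below, makes that sparsity concrete:
vocab_size = 50_257
batch_size, seq_len = 32, 128
lookups = batch_size * seq_len                  # 4,096 gradient contributions per step
max_rows_updated = min(lookups, vocab_size)     # at most 4,096 distinct rows
print(f"{max_rows_updated / vocab_size:.1%}")   # ~8.2% of the table, even before accounting for repeated tokens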
Embeddings take only 10–15% of training time but consume 30–40% of model memory at large vocabularies. They are cheap to run and expensive to store — exactly the opposite profile of attention.
Before moving on, make sure you can explain the core ideas of this module in your own words. If any of them feels fuzzy, revisit the Core Concepts section (especially Embedding Table Mechanics and Embedding Dimension Trade-offs) first.
Self-Check Questions
Test yourself with these systems thinking questions. They’re designed to build intuition for the performance and memory characteristics you’ll encounter in production.
Q1: Memory Calculation
An embedding layer has vocab_size=50000 and embed_dim=512. How much memory does the embedding table use (in MB)?
50,000 × 512 × 4 bytes = 102,400,000 bytes = 97.7 MB
Calculation breakdown:
- Parameters: 50,000 × 512 = 25,600,000
- Memory: 25,600,000 × 4 bytes (float32) = 102,400,000 bytes
- In MB: 102,400,000 / (1024 × 1024) = 97.7 MB
This is why vocabulary size dominates the deployment budget for small-model, large-vocab systems.
Q2: Positional Encoding Memory
Compare memory requirements for learned vs sinusoidal positional encoding with max_seq_len=2048 and embed_dim=512.
Learned PE: 2,048 × 512 × 4 = 4,194,304 bytes = 4.0 MB (1,048,576 parameters)
Sinusoidal PE: 0 bytes (0 parameters — computed from a closed-form formula)
For large models, learned PE adds real memory. GPT-3’s learned PE costs an additional 96 MB (2048 × 12288 × 4 bytes). Models that need cheaper or longer-context PE often switch to sinusoidal or rotary variants.
Q3: Lookup Complexity
What is the time complexity of looking up embeddings for a batch of 32 sequences, each with 128 tokens?
O(1) per token, so a batch of 32 × 128 = 4,096 tokens costs 4,096 constant-time lookups, i.e. O(batch_size × seq_len) overall.
Lookup is constant time per token because it is direct array indexing: weight[token_id]. For 4,096 tokens you do 4,096 constant-time loads.
Vocabulary size does not affect lookup time. Looking up from a 1,000-word table is the same speed as from a 100,000-word table (modulo cache effects). It is indexing, not search.
Q4: Embedding Dimension Scaling
You have an embedding layer with vocab_size=50000, embed_dim=512 using 98 MB. If you double embed_dim to 1024, what happens to memory?
Memory doubles to 195 MB.
Embedding memory scales linearly with embedding dimension:
- Original: 50,000 × 512 × 4 = 98 MB
- Doubled: 50,000 × 1,024 × 4 = 195 MB
Each doubling of embed_dim doubles both storage and memory bandwidth — the reason production models balance dimension against available memory rather than maxing it out.
Q5: Sinusoidal Extrapolation
You trained a model with sinusoidal positional encoding and max_seq_len=512. Can you process sequences of length 1024 at inference time? What about with learned positional encoding?
Sinusoidal PE: Yes - can extrapolate to length 1024 (or any length)
The mathematical formula creates unique encodings for any position. You simply compute:
pe_1024 = create_sinusoidal_embeddings(max_seq_len=1024, embed_dim=512)
Learned PE: No - cannot handle sequences longer than training maximum
Learned PE creates a fixed embedding table of shape (max_seq_len, embed_dim). For positions beyond 512, there are no learned embeddings. You must either:
- Retrain with a larger max_seq_len
- Interpolate position embeddings (an advanced technique, sketched after this answer)
- Truncate sequences to 512 tokens
This is why many production models use sinusoidal or relative positional encodings that can handle variable lengths.
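One simple way to realize the interpolation option is linear interpolation along the position axis. This is only an illustrative sketch (hypothetical helper, plain NumPy), not a recipe from any particular model:
import numpy as np

def interpolate_position_embeddings(pos_embed, new_len):
    """Stretch a learned (old_len, embed_dim) position table to new_len positions."""
    old_len, embed_dim = pos_embed.shape
    old_x = np.linspace(0.0, 1.0, old_len)
    new_x = np.linspace(0.0, 1.0, new_len)
    # Interpolate each embedding dimension independently along the position axis.
    return np.stack([np.interp(new_x, old_x, pos_embed[:, d]) for d in range(embed_dim)], axis=1)

learned_pe = np.random.default_rng(0).normal(size=(512, 64))
print(interpolate_position_embeddings(learned_pe, 1024).shape)   # (1024, 64)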
Key Takeaways
- Embedding lookup is sparse, not dense: rows are indexed directly (O(1) per token), only accessed rows receive gradients, and no one-hot tensor is ever materialized.
- Positional encoding is additive geometry, not metadata: learned PE is flexible but length-capped; sinusoidal PE extrapolates but is fixed — same shape, very different deployment behavior.
- Memory scales linearly with vocab × dim: doubling either component doubles both storage and memory-bandwidth traffic, which is why dimension choices are always weighed against HBM budget.
- Embeddings are cheap to compute and expensive to store: they take ~10–15% of training time but 30–40% of model memory in large-vocab models — the opposite profile of attention.
Coming next: Module 12 builds scaled dot-product attention, the mechanism that turns the static (B, S, D) embedding tensor from this module into a context-aware representation where every position has been shaped by every other.
Further Reading
While embedding tables may seem like simple lookup dictionaries, their architectural design profoundly impacts system-level performance, memory bandwidth constraints, and distributed training topologies. For students seeking to understand the mathematical underpinnings and the hardware-software co-design of these continuous representations, the following milestones are critical:
Seminal Papers
Word2Vec - Mikolov et al. (2013). Introduced efficient learned word embeddings through context prediction. Though your implementation learns embeddings end-to-end, Word2Vec established the paradigm that semantic similarity should be encoded spatially. Systems Implication: By replacing massive, extremely sparse one-hot encoded vectors with dense, low-dimensional continuous arrays, this shift dramatically compressed the memory footprint. It transformed textual analysis from cache-thrashing sparse lookups into dense matrix multiplications, fundamentally pushing the operations toward the compute-bound ceiling of the Roofline model. arXiv:1301.3781
Attention Is All You Need - Vaswani et al. (2017). Introduced sinusoidal positional encoding and demonstrated that learned embeddings combined with positional information enable powerful, highly parallel sequence models. Systems Implication: Replaced recursive token-by-token processing with self-attention over continuous embeddings, entirely breaking the sequential compute bottleneck of RNNs and allowing massive, unhindered data-parallelization across modern GPU streaming multiprocessors. arXiv:1706.03762
BERT: Pre-training of Deep Bidirectional Transformers - Devlin et al. (2018). Demonstrated how token embeddings perfectly fuse with positional and segment encodings for deep language understanding. Systems Implication: Proved the empirical value of scaling parameter counts to hundreds of millions, thereby exposing the limits of single-chip High Bandwidth Memory (HBM) and necessitating multi-GPU tensor parallel paradigms just to host the vast embedding tables and model states. arXiv:1810.04805
Additional Resources
- Blog Post: “The Illustrated Word2Vec” by Jay Alammar - Visual explanation of learned word embeddings and semantic relationships
- Documentation: PyTorch nn.Embedding - See production embedding implementation
- Paper: “GloVe: Global Vectors for Word Representation” - Pennington et al. (2014) - Alternative embedding approach based on co-occurrence statistics
What’s Next
You now have static geometry: every token ID maps to a fixed vector in space, with positional information layered on top. But the embedding for “bank” is the same whether the next word is “river” or “loan” — the representation has no context. That is the question Module 12 answers.
You’ll build scaled dot-product attention: the mechanism that lets every embedding look at every other embedding in the sequence and re-weight itself accordingly. Your static (B, S, D) embedding tensor becomes a context-aware (B, S, D) tensor where each position has been shaped by the rest of the sentence.
Where these embeddings show up downstream:
Table 7 traces how this module is reused by later parts of the curriculum.
| Module | What It Does | Your Embeddings In Action |
|---|---|---|
| 12: Attention | Context-aware representations | attention(embed_layer(tokens)) produces query, key, value |
| 13: Transformers | Full sequence-to-sequence stack | transformer(embed_layer(src), embed_layer(tgt)) |
Get Started
- Launch Binder - Run interactively in browser, no setup required
- View Source - Browse the implementation code
Binder sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.