Module 11: Embeddings#
Module Info
ARCHITECTURE TIER | Difficulty: ★★★★ | Time: 3-5 hours | Prerequisites: 01-07, 10
Prerequisites (Modules 01-07 and 10) mean you should understand:
Tensor operations (shape manipulation, matrix operations, broadcasting)
Training fundamentals (forward/backward, optimization)
Tokenization (converting text to token IDs, vocabularies)
If you can explain how a tokenizer converts "hello" to token IDs and how to multiply matrices, you're ready.
Overview#
Embeddings are the crucial bridge between discrete tokens and continuous neural network operations. Your tokenizer from Module 10 converts text into token IDs like [42, 7, 15], but neural networks operate on dense vectors of real numbers. Embeddings transform each integer token ID into a learned dense vector that captures semantic meaning. By the end of this module, you'll implement the embedding systems that power modern language models, from basic lookup tables to sophisticated positional encodings.
Every transformer model, from BERT to GPT-4, relies on embeddings to convert language into learnable representations. You'll build the exact patterns used in production, understanding not just how they work but why they're designed this way.
Learning Objectives#
Tip
By completing this module, you will:
Implement embedding layers that convert token IDs to dense vectors through efficient table lookup
Master positional encoding strategies including learned and sinusoidal approaches
Understand memory scaling for embedding tables and the trade-offs between vocabulary size and embedding dimension
Connect your implementation to production transformer architectures used in GPT and BERT
What You'll Build#
flowchart LR
subgraph "Your Embedding System"
A["Embedding<br/>Token â Vector"]
B["PositionalEncoding<br/>Learned Positions"]
C["Sinusoidal PE<br/>Math Patterns"]
D["EmbeddingLayer<br/>Complete System"]
end
A --> D
B --> D
C --> D
style A fill:#e1f5ff
style B fill:#fff3cd
style C fill:#f8d7da
style D fill:#d4edda
Fig. 18 Your Embedding System#
Implementation roadmap:
| Part | What You'll Implement | Key Concept |
|---|---|---|
| 1 | Embedding | Token ID to vector lookup via indexing |
| 2 | PositionalEncoding | Learnable position embeddings |
| 3 | create_sinusoidal_embeddings | Mathematical position encoding |
| 4 | EmbeddingLayer | Complete token + position system |
The pattern you'll enable:
# Converting tokens to position-aware dense vectors
embed_layer = EmbeddingLayer(vocab_size=50000, embed_dim=512)
tokens = Tensor([[1, 42, 7]]) # Token IDs from tokenizer
embeddings = embed_layer(tokens) # (1, 3, 512) dense vectors ready for attention
What You're NOT Building (Yet)#
To keep this module focused, you will not implement:
Attention mechanisms (thatâs Module 12: Attention)
Full transformer architectures (thatâs Module 13: Transformers)
Word2Vec or GloVe pretrained embeddings (you're building learnable embeddings)
Subword embedding composition (PyTorch handles this at the tokenization level)
You are building the foundation for sequence models. Context-aware representations come next.
API Reference#
This section documents the embedding components you'll build. Use this as your reference while implementing and debugging.
Embedding Class#
Embedding(vocab_size, embed_dim)
Learnable embedding layer that maps token indices to dense vectors through table lookup.
Constructor Parameters:
vocab_size (int): Size of vocabulary (number of unique tokens)
embed_dim (int): Dimension of embedding vectors
Core Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(indices) | Lookup embeddings for token indices |
| parameters | parameters() | Return weight matrix for optimization |
Properties:
weight: Tensor of shape (vocab_size, embed_dim) containing learnable embeddings
PositionalEncoding Class#
PositionalEncoding(max_seq_len, embed_dim)
Learnable positional encoding that adds trainable position-specific vectors to embeddings.
Constructor Parameters:
max_seq_len (int): Maximum sequence length to support
embed_dim (int): Embedding dimension (must match token embeddings)
Core Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(x) | Add positional encodings to input embeddings |
| parameters | parameters() | Return position embedding matrix |
Sinusoidal Embeddings Function#
create_sinusoidal_embeddings(max_seq_len, embed_dim) -> Tensor
Creates fixed sinusoidal positional encodings using trigonometric functions. No parameters to learn.
Mathematical Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim))
PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim))
EmbeddingLayer Class#
EmbeddingLayer(vocab_size, embed_dim, max_seq_len=512,
pos_encoding='learned', scale_embeddings=False)
Complete embedding system combining token embeddings and positional encoding.
Constructor Parameters:
vocab_size (int): Size of vocabulary
embed_dim (int): Embedding dimension
max_seq_len (int): Maximum sequence length for positional encoding
pos_encoding (str): Type of positional encoding ('learned', 'sinusoidal', or None)
scale_embeddings (bool): Whether to scale by sqrt(embed_dim) (transformer convention)
Core Methods:
| Method | Signature | Description |
|---|---|---|
| forward | forward(tokens) | Complete embedding pipeline |
| parameters | parameters() | All trainable parameters |
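The forward pass for this class is not reproduced here, but conceptually it chains token lookup, optional scaling, and positional encoding. A minimal sketch under that assumption (attribute names such as token_embedding and pos_encoding are illustrative, not required internals of the module):

```python
import math

def embedding_layer_forward(layer, tokens):
    """Illustrative pipeline: token lookup -> optional scaling -> positional encoding."""
    x = layer.token_embedding(tokens)        # (batch, seq_len, embed_dim)
    if layer.scale_embeddings:
        # Transformer convention: scale token embeddings by sqrt(embed_dim)
        x = x * math.sqrt(layer.embed_dim)
    if layer.pos_encoding is not None:
        x = layer.pos_encoding(x)            # add learned or sinusoidal positions
    return x
```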
Core Concepts#
This section explores the fundamental ideas behind embeddings and positional encoding. These concepts apply to every modern language model, from small research experiments to production systems.
From Indices to Vectors#
Neural networks operate on continuous vectors, but language consists of discrete tokens. After your tokenizer converts "the cat sat" to token IDs [1, 42, 7], you need a way to represent these discrete integers as dense vectors that can capture semantic meaning.
The embedding layer solves this through a simple but powerful idea: maintain a learnable table where each token ID maps to a vector. For a vocabulary of 50,000 tokens with 512-dimensional embeddings, you create a matrix of shape (50000, 512) initialized with small random values. During training, these vectors adjust to capture semantic relationships: similar words learn similar vectors.
Here's the core implementation showing how efficiently this works:
def forward(self, indices: Tensor) -> Tensor:
"""Forward pass: lookup embeddings for given indices."""
# Validate indices are in range
if np.any(indices.data >= self.vocab_size) or np.any(indices.data < 0):
raise ValueError(
f"Index out of range. Expected 0 <= indices < {self.vocab_size}, "
f"got min={np.min(indices.data)}, max={np.max(indices.data)}"
)
# Perform embedding lookup using advanced indexing
# This is equivalent to one-hot multiplication but much more efficient
embedded = self.weight.data[indices.data.astype(int)]
# Create result tensor with gradient tracking
result = Tensor(embedded, requires_grad=self.weight.requires_grad)
if result.requires_grad:
result._grad_fn = EmbeddingBackward(self.weight, indices)
return result
The beauty is in the simplicity: self.weight.data[indices.data.astype(int)] uses NumPy's advanced indexing (also called fancy indexing) to look up multiple embeddings simultaneously. For input indices [1, 42, 7], this single operation retrieves rows 1, 42, and 7 from the weight matrix in one efficient step, automatically handling batched inputs of any shape. While conceptually equivalent to building one-hot vectors and multiplying by the weight matrix, direct indexing is orders of magnitude faster and requires no intermediate allocations.
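To see that equivalence concretely, here is a small, self-contained NumPy check with toy sizes (not module code):

```python
import numpy as np

vocab_size, embed_dim = 10, 4
weight = np.random.uniform(-0.1, 0.1, (vocab_size, embed_dim))
indices = np.array([1, 7, 7, 3])

# Fancy indexing: one gather, no intermediate allocations
gathered = weight[indices]                 # shape (4, 4)

# One-hot route: build a (4, 10) one-hot matrix, then multiply
one_hot = np.eye(vocab_size)[indices]      # shape (4, 10)
multiplied = one_hot @ weight              # shape (4, 4)

assert np.allclose(gathered, multiplied)   # identical results, very different cost
```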
Embedding Table Mechanics#
The embedding table is a learnable parameter matrix initialized with small random values. For vocabulary size V and embedding dimension D, the table has shape (V, D). Each row represents one tokenâs learned representation.
Initialization matters for training stability. The implementation uses Xavier initialization:
# Xavier initialization for better gradient flow
limit = math.sqrt(6.0 / (vocab_size + embed_dim))
self.weight = Tensor(
np.random.uniform(-limit, limit, (vocab_size, embed_dim)),
requires_grad=True
)
This initialization scale ensures gradients neither explode nor vanish at the start of training. The limit is computed from both vocabulary size and embedding dimension, balancing the fan-in and fan-out of the embedding layer.
During training, gradients flow back through the lookup operation to update only the embeddings that were accessed. If your batch contains tokens [5, 10, 10, 5], only rows 5 and 10 of the embedding table receive gradient updates. This sparse gradient pattern is handled by the EmbeddingBackward gradient function, making embedding updates extremely efficient even for vocabularies with millions of tokens.
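The EmbeddingBackward class itself is not shown here, but its core operation is a scatter-add into the weight gradient: every occurrence of a token ID accumulates its upstream gradient into that token's row. A minimal NumPy sketch of the idea (illustrative, not the TinyTorch implementation):

```python
import numpy as np

def embedding_backward_sketch(grad_output, indices, vocab_size, embed_dim):
    """Accumulate gradients only into the embedding rows that were looked up.

    grad_output: (num_tokens, embed_dim) upstream gradient
    indices:     (num_tokens,) token IDs used in the forward pass
    """
    grad_weight = np.zeros((vocab_size, embed_dim), dtype=grad_output.dtype)
    # np.add.at handles repeated indices correctly: rows 5 and 10 in the
    # [5, 10, 10, 5] example each receive the sum of their two gradients.
    np.add.at(grad_weight, indices, grad_output)
    return grad_weight

# Only rows 5 and 10 end up non-zero
grad_w = embedding_backward_sketch(np.ones((4, 8)), np.array([5, 10, 10, 5]),
                                   vocab_size=20, embed_dim=8)
print(np.nonzero(grad_w.any(axis=1))[0])   # [ 5 10]
```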
Learned vs Fixed Embeddings#
Positional information can be added to token embeddings in two fundamentally different ways: learned position embeddings and fixed sinusoidal encodings.
Learned positional encoding treats each position as a trainable parameter, just like token embeddings. For maximum sequence length M and embedding dimension D, you create a second embedding table of shape (M, D):
# Initialize position embedding matrix
# Smaller initialization than token embeddings since these are additive
limit = math.sqrt(2.0 / embed_dim)
self.position_embeddings = Tensor(
np.random.uniform(-limit, limit, (max_seq_len, embed_dim)),
requires_grad=True
)
During forward pass, you slice position embeddings and add them to token embeddings:
def forward(self, x: Tensor) -> Tensor:
"""Add positional encodings to input embeddings."""
batch_size, seq_len, embed_dim = x.shape
# Slice position embeddings for this sequence length
# Tensor slicing preserves gradient flow (from Module 01's __getitem__)
pos_embeddings = self.position_embeddings[:seq_len]
# Reshape to add batch dimension: (1, seq_len, embed_dim)
pos_data = pos_embeddings.data[np.newaxis, :, :]
pos_embeddings_batched = Tensor(pos_data, requires_grad=pos_embeddings.requires_grad)
# Copy gradient function to preserve backward connection
if hasattr(pos_embeddings, '_grad_fn') and pos_embeddings._grad_fn is not None:
pos_embeddings_batched._grad_fn = pos_embeddings._grad_fn
# Add positional information - gradients flow through both x and pos_embeddings!
result = x + pos_embeddings_batched
return result
The slicing operation self.position_embeddings[:seq_len] preserves gradient tracking because TinyTorch's Tensor __getitem__ (from Module 01) maintains the connection to the original parameter. This allows backpropagation to update only the position embeddings actually used in the forward pass.
The advantage is flexibility: the model can learn task-specific positional patterns. The disadvantage is memory cost and a hard maximum sequence length.
Fixed sinusoidal encoding uses mathematical patterns requiring no parameters. The formula creates unique position signatures using sine and cosine at different frequencies:
# Create position indices [0, 1, 2, ..., max_seq_len-1]
position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
# Create dimension indices for calculating frequencies
div_term = np.exp(
np.arange(0, embed_dim, 2, dtype=np.float32) *
-(math.log(10000.0) / embed_dim)
)
# Initialize the positional encoding matrix
pe = np.zeros((max_seq_len, embed_dim), dtype=np.float32)
# Apply sine to even indices, cosine to odd indices
pe[:, 0::2] = np.sin(position * div_term)
pe[:, 1::2] = np.cos(position * div_term)
This creates a unique vector for each position where low-index dimensions oscillate rapidly (high frequency) and high-index dimensions change slowly (low frequency). The pattern allows the model to learn relative positions through dot products, and crucially, can extrapolate to sequences longer than seen during training.
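Wrapping the construction above in a helper makes these properties easy to verify; the function name sinusoidal_pe is just for this sketch:

```python
import math
import numpy as np

def sinusoidal_pe(max_seq_len, embed_dim):
    position = np.arange(max_seq_len, dtype=np.float32)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embed_dim, 2, dtype=np.float32)
                      * -(math.log(10000.0) / embed_dim))
    pe = np.zeros((max_seq_len, embed_dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_seq_len=512, embed_dim=64)

# Position 0 is the alternating [0, 1, 0, 1, ...] pattern
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)

# Nearby positions are more similar than distant ones (cosine similarity)
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos(pe[10], pe[11]))    # high
print(cos(pe[10], pe[200]))   # noticeably lower

# Extrapolation: just generate more rows, no retraining needed
pe_long = sinusoidal_pe(max_seq_len=4096, embed_dim=64)
```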
Positional Encodings#
Without positional information, embeddings have no notion of order. The sentence "cat sat on mat" would have an identical representation to "mat on sat cat" because you'd sum or average the same four token embeddings regardless of order.
Positional encodings solve this by adding position-specific information to each tokenâs embedding. After embedding lookup, token 42 at position 0 gets different positional information than token 42 at position 5, making the model position-aware.
The Transformer paper introduced sinusoidal positional encoding with a clever mathematical structure. For position pos and dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i/embed_dim)) # Even dimensions
PE(pos, 2i+1) = cos(pos / 10000^(2i/embed_dim)) # Odd dimensions
The 10000 base creates different wavelengths across dimensions. Dimension 0 oscillates rapidly (frequency ≈ 1), while dimension 510 changes extremely slowly (frequency ≈ 1/10000). This multi-scale structure gives each position a unique "fingerprint" and enables the model to learn relative position through simple vector arithmetic.
At position 0, all sine terms equal 0 and all cosine terms equal 1: [0, 1, 0, 1, 0, 1, ...]. At position 1, the pattern shifts based on each dimension's frequency. The combination of many frequencies creates distinct encodings where nearby positions have similar (but not identical) vectors, providing smooth positional gradients.
The trigonometric identity enables learning relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) using sine and cosine addition formulas. This allows attention mechanisms to implicitly learn positional offsets (like "the token 3 positions ahead") through learned weights on the position encodings, without the model needing separate relative position parameters.
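Written out for a single frequency, this is the standard angle-addition identity (stated here for completeness; it is not part of the module code): the (sin, cos) pair at position pos + k is a rotation of the pair at position pos, where the rotation depends only on the offset k.

```latex
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix},
\qquad \omega_i = \frac{1}{10000^{2i/\text{embed\_dim}}}
```

Because the rotation matrix does not depend on pos, a learned linear map over the encodings can implement "shift by k" uniformly for every position.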
Embedding Dimension Trade-offs#
The embedding dimension D controls the capacity of your learned representations. Larger D provides more expressiveness but costs memory and compute. The choice involves several interacting factors.
Memory scaling: Embedding tables scale as vocab_size × embed_dim × 4 bytes (for float32). A vocabulary of 50,000 tokens with 512-dimensional embeddings requires 100 MB. Double the dimension to 1024, and memory doubles to 200 MB. For large vocabularies, the embedding table often dominates total model memory. GPT-3's 50,257 token vocabulary with 12,288-dimensional embeddings uses approximately 2.4 GB just for token embeddings.
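The scaling rule is simple enough to encode directly; a throwaway helper (decimal megabytes, matching the table below) for quick sizing estimates:

```python
def embedding_memory_mb(vocab_size, embed_dim, bytes_per_param=4):
    """Embedding-table memory in decimal MB (1 MB = 10**6 bytes), float32 by default."""
    return vocab_size * embed_dim * bytes_per_param / 1e6

print(embedding_memory_mb(50_000, 512))      # 102.4 -> the "~100 MB" figure above
print(embedding_memory_mb(50_000, 1024))     # 204.8 -> doubling embed_dim doubles memory
print(embedding_memory_mb(50_257, 12_288))   # ~2470 -> GPT-3-scale token embeddings (~2.4 GB)
```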
Semantic capacity: Higher dimensions allow finer-grained semantic distinctions. With 64 dimensions, you might capture basic categories (animals, actions, objects). With 512 dimensions, you can encode subtle relationships (synonyms, antonyms, part-of-speech, contextual variations). With 1024+ dimensions, you have capacity for highly nuanced semantic features discovered through training.
Computational cost: Every attention head in transformers performs operations over the embedding dimension. Memory bandwidth becomes the bottleneck: transferring embedding vectors from RAM to cache dominates the time to process them. Larger embeddings mean more memory traffic per token, reducing throughput.
Typical scales in production:
| Model | Vocabulary | Embed Dim | Embedding Memory |
|---|---|---|---|
| Small BERT | 30,000 | 768 | 92 MB |
| GPT-2 | 50,257 | 1,024 | 206 MB |
| GPT-3 | 50,257 | 12,288 | 2,471 MB |
| Large Transformer | 100,000 | 1,024 | 410 MB |
The embedding dimension typically matches the model's hidden dimension since embeddings feed directly into the first transformer layer. You rarely see models with embedding dimension different from hidden dimension (though it's technically possible with a projection layer).
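If you did want an embedding dimension smaller than the hidden dimension, the projection-layer option mentioned above factorizes the table into a narrow lookup followed by a learned linear map. A rough NumPy sketch of the idea (random weights stand in for trained ones):

```python
import numpy as np

vocab_size, small_dim, hidden_dim = 50_000, 128, 768
table = np.random.uniform(-0.05, 0.05, (vocab_size, small_dim))   # narrow lookup table
proj = np.random.uniform(-0.05, 0.05, (small_dim, hidden_dim))    # learned projection

tokens = np.array([[1, 42, 7]])
hidden = table[tokens] @ proj      # (1, 3, 768): ready for a hidden_dim-wide model

# Parameter count: factorized vs. a direct (vocab_size x hidden_dim) table
print(vocab_size * small_dim + small_dim * hidden_dim)   # 6,498,304
print(vocab_size * hidden_dim)                           # 38,400,000
```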
Common Errors#
These are the errors you'll encounter most often when working with embeddings. Understanding why they happen will save you hours of debugging.
Index Out of Range#
Error: ValueError: Index out of range. Expected 0 <= indices < 50000, got max=50001
Embedding lookup expects token IDs in the range [0, vocab_size-1]. If your tokenizer produces an ID of 50001 but your embedding layer has vocab_size=50000, the lookup fails.
Cause: Mismatch between tokenizer vocabulary size and embedding layer vocabulary size. This often happens when you train a tokenizer separately and forget to sync the vocabulary size when creating the embedding layer.
Fix: Ensure embed_layer.vocab_size matches your tokenizer's vocabulary size exactly:
tokenizer = Tokenizer(vocab_size=50000)
embed = Embedding(vocab_size=tokenizer.vocab_size, embed_dim=512)
Sequence Length Exceeds Maximum#
Error: ValueError: Sequence length 1024 exceeds maximum 512
Learned positional encodings have a fixed maximum sequence length set during initialization. If you try to process a sequence longer than this maximum, the forward pass fails because there are no position embeddings for those positions.
Cause: Input sequences longer than max_seq_len parameter used when creating the positional encoding layer.
Fix: Either increase max_seq_len during initialization, truncate your sequences, or use sinusoidal positional encoding which can handle arbitrary lengths:
# Option 1: Increase max_seq_len
pos_enc = PositionalEncoding(max_seq_len=2048, embed_dim=512)
# Option 2: Use sinusoidal (no length limit)
embed_layer = EmbeddingLayer(vocab_size=50000, embed_dim=512,
pos_encoding='sinusoidal')
Embedding Dimension Mismatch#
Error: ValueError: Embedding dimension mismatch: expected 512, got 768
When adding positional encodings to token embeddings, the dimensions must match exactly. If your token embeddings are 512-dimensional but your positional encoding expects 768-dimensional inputs, element-wise addition fails.
Cause: Creating embedding components with inconsistent embed_dim values.
Fix: Use the same embed_dim for all embedding components:
embed_dim = 512
token_embed = Embedding(vocab_size=50000, embed_dim=embed_dim)
pos_enc = PositionalEncoding(max_seq_len=512, embed_dim=embed_dim)
Shape Errors with Batching#
Error: ValueError: Expected 3D input (batch, seq, embed), got shape (128, 512)
Positional encoding expects 3D tensors with batch dimension. If you pass a 2D tensor (sequence, embedding), the forward pass fails.
Cause: Forgetting to add batch dimension when processing single sequences, or using raw embedding output without reshaping.
Fix: Ensure inputs have batch dimension, even for single sequences:
# Wrong: 2D input
tokens = Tensor([1, 2, 3])
embeddings = embed(tokens) # Shape: (3, 512) - missing batch dim
# Right: 3D input
tokens = Tensor([[1, 2, 3]]) # Added batch dimension
embeddings = embed(tokens) # Shape: (1, 3, 512) - correct
Production Context#
Your Implementation vs. PyTorch#
Your TinyTorch embedding system and PyTorch's torch.nn.Embedding share the same conceptual design and API patterns. The differences are in scale, optimization, and device support.
| Feature | Your Implementation | PyTorch |
|---|---|---|
| Backend | NumPy (CPU only) | C++/CUDA (CPU/GPU) |
| Lookup Speed | 1x (baseline) | 10-100x faster on GPU |
| Max Vocabulary | Limited by RAM | Billions (with techniques) |
| Positional Encoding | Learned + sinusoidal | Must implement yourself* |
| Sparse Gradients | Via custom backward | Native sparse gradient support |
| Memory Optimization | Standard | Quantization, pruning, sharing |
*PyTorch provides building blocks but you implement positional encoding patterns yourself (as you did here)
Code Comparison#
The following comparison shows equivalent embedding operations in TinyTorch and PyTorch. Notice how the APIs mirror each other closely.
from tinytorch.core.embeddings import Embedding, EmbeddingLayer
# Create embedding layer
embed = Embedding(vocab_size=50000, embed_dim=512)
# Token lookup
tokens = Tensor([[1, 42, 7, 99]])
embeddings = embed(tokens) # (1, 4, 512)
# Complete system with position encoding
embed_layer = EmbeddingLayer(
vocab_size=50000,
embed_dim=512,
max_seq_len=2048,
pos_encoding='learned'
)
position_aware = embed_layer(tokens)
import torch
import torch.nn as nn
# Create embedding layer
embed = nn.Embedding(num_embeddings=50000, embedding_dim=512)
# Token lookup
tokens = torch.tensor([[1, 42, 7, 99]])
embeddings = embed(tokens) # (1, 4, 512)
# Complete system (you implement positional encoding yourself)
class EmbeddingWithPosition(nn.Module):
def __init__(self, vocab_size, embed_dim, max_seq_len):
super().__init__()
self.token_embed = nn.Embedding(vocab_size, embed_dim)
self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
def forward(self, tokens):
seq_len = tokens.shape[1]
positions = torch.arange(seq_len).unsqueeze(0)
return self.token_embed(tokens) + self.pos_embed(positions)
embed_layer = EmbeddingWithPosition(50000, 512, 2048)
position_aware = embed_layer(tokens)
Let's walk through each section to understand the comparison:
Line 1-2 (Import): Both frameworks provide dedicated embedding modules. TinyTorch packages everything in embeddings; PyTorch uses nn.Embedding as the base class.
Line 4-5 (Embedding Creation): Your Embedding class closely mirrors PyTorch's nn.Embedding. The parameter names differ (vocab_size vs num_embeddings) but the concept is identical.
Line 7-9 (Token Lookup): Both use identical calling patterns. The embedding layer acts as a function, taking token IDs and returning dense vectors. Shape semantics are identical.
Line 11-20 (Complete System): Your EmbeddingLayer provides a complete system in one class. In PyTorch, you implement this pattern yourself by composing nn.Embedding layers for tokens and positions. The HuggingFace Transformers library implements this exact pattern for BERT, GPT, and other models.
Line 22-24 (Forward Pass): Both systems add token and position embeddings element-wise. Your implementation handles this internally; PyTorch requires you to manage position indices explicitly.
Tip
What's Identical
Embedding lookup semantics, gradient flow patterns, and the addition of positional information. When you debug PyTorch transformer models, you'll recognize these exact patterns because you built them yourself.
Why Embeddings Matter at Scale#
To appreciate embedding systems, consider the scale of modern language models:
GPT-3 embeddings: 50,257 token vocabulary × 12,288 dimensions = 618 million parameters = 2.4 GB of memory (just for token embeddings, not counting position embeddings)
Lookup throughput: Processing 32 sequences of 2048 tokens requires 65,536 embedding lookups per batch. At 1000 batches per second (typical training), that's 65 million lookups per second.
Memory bandwidth: Each lookup transfers 512-1024 dimensions × 4 bytes = 2-4 KB from RAM to cache. At scale, memory bandwidth (not compute) becomes the bottleneck.
Gradient sparsity: In a batch with 65,536 tokens, only a small fraction of the 50,257 vocabulary is accessed. Efficient training exploits this sparsity, updating only the accessed embeddings' gradients.
Modern transformer training spends approximately 10-15% of total time in embedding operations (lookup + position encoding). The remaining 85-90% goes to attention and feedforward layers. However, embeddings consume 30-40% of model memory for models with large vocabularies, making them critical for deployment.
Check Your Understanding#
Test yourself with these systems thinking questions. They're designed to build intuition for the performance and memory characteristics you'll encounter in production.
Q1: Memory Calculation
An embedding layer has vocab_size=50000 and embed_dim=512. How much memory does the embedding table use (in MB)?
Answer
50,000 × 512 × 4 bytes = 102,400,000 bytes = 97.7 MB
Calculation breakdown:
Parameters: 50,000 × 512 = 25,600,000
Memory: 25,600,000 × 4 bytes (float32) = 102,400,000 bytes
In MB: 102,400,000 / (1024 × 1024) = 97.7 MB
This is why vocabulary size matters for model deployment!
Q2: Positional Encoding Memory
Compare memory requirements for learned vs sinusoidal positional encoding with max_seq_len=2048 and embed_dim=512.
Answer
Learned PE: 2,048 × 512 × 4 = 4,194,304 bytes = 4.0 MB (1,048,576 parameters)
Sinusoidal PE: 0 bytes (0 parameters - computed mathematically)
For large models, learned PE adds significant memory. GPT-3 uses learned PE with 2048 positions × 12,288 dimensions = 100 MB additional memory. Some models use sinusoidal to save this memory.
Q3: Lookup Complexity
What is the time complexity of looking up embeddings for a batch of 32 sequences, each with 128 tokens?
Answer
O(1) per token, or O(batch_size × seq_len) = O(32 × 128) = O(4096) total
The lookup operation is constant time per token because itâs just array indexing: weight[token_id]. For 4,096 tokens, you perform 4,096 constant-time lookups.
Importantly, vocabulary size does NOT affect lookup time. Looking up tokens from a 1,000 word vocabulary is the same speed as from a 100,000 word vocabulary (assuming cache effects are comparable). The memory access is direct indexing, not search.
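A quick, informal way to convince yourself of this (timings vary by machine and cache behavior):

```python
import timeit
import numpy as np

embed_dim, n_tokens = 512, 4096
table_small = np.random.rand(1_000, embed_dim).astype(np.float32)
table_large = np.random.rand(100_000, embed_dim).astype(np.float32)
tokens_small = np.random.randint(0, 1_000, n_tokens)
tokens_large = np.random.randint(0, 100_000, n_tokens)

t_small = timeit.timeit(lambda: table_small[tokens_small], number=1_000)
t_large = timeit.timeit(lambda: table_large[tokens_large], number=1_000)
print(t_small, t_large)   # similar, up to the cache effects noted above
```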
Q4: Embedding Dimension Scaling
You have an embedding layer with vocab_size=50000, embed_dim=512 using 100 MB. If you double embed_dim to 1024, what happens to memory?
Answer
Memory doubles to 200 MB
Embedding memory scales linearly with embedding dimension:
Original: 50,000 × 512 × 4 = 100 MB
Doubled: 50,000 × 1,024 × 4 = 200 MB
This is why you can't arbitrarily increase embedding dimensions. Each doubling doubles memory and memory bandwidth requirements. Large models carefully balance embedding dimension against available memory.
Q5: Sinusoidal Extrapolation
You trained a model with sinusoidal positional encoding and max_seq_len=512. Can you process sequences of length 1024 at inference time? What about with learned positional encoding?
Answer
Sinusoidal PE: Yes - can extrapolate to length 1024 (or any length)
The mathematical formula creates unique encodings for any position. You simply compute:
pe_1024 = create_sinusoidal_embeddings(max_seq_len=1024, embed_dim=512)
Learned PE: No - cannot handle sequences longer than training maximum
Learned PE creates a fixed embedding table of shape (max_seq_len, embed_dim). For positions beyond 512, there are no learned embeddings. You must either:
Retrain with a larger max_seq_len
Interpolate position embeddings (advanced technique; see the sketch after this answer)
Truncate sequences to 512 tokens
This is why many production models use sinusoidal or relative positional encodings that can handle variable lengths.
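For the interpolation option mentioned above, one simple (hedged) approach is to linearly resample the learned position table to the new length; the helper below is illustrative, not a production recipe:

```python
import numpy as np

def interpolate_position_table(pos_table, new_len):
    """Linearly resample a learned (old_len, embed_dim) table to new_len rows."""
    old_len, embed_dim = pos_table.shape
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    # Interpolate each embedding dimension independently
    return np.stack([np.interp(new_pos, old_pos, pos_table[:, d])
                     for d in range(embed_dim)], axis=1)

learned_pe = np.random.uniform(-0.05, 0.05, (512, 64))    # stand-in for trained weights
stretched = interpolate_position_table(learned_pe, 1024)  # (1024, 64)
print(stretched.shape)
```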
## Further Reading
For students who want to understand the academic foundations and mathematical underpinnings of embeddings and positional encoding:
### Seminal Papers
- **Word2Vec** - Mikolov et al. (2013). Introduced efficient learned word embeddings through context prediction. Though your implementation learns embeddings end-to-end, Word2Vec established the idea that similar words should have similar vectors. [arXiv:1301.3781](https://arxiv.org/abs/1301.3781)
- **Attention Is All You Need** - Vaswani et al. (2017). Introduced sinusoidal positional encoding and demonstrated that learned embeddings combined with positional information enable powerful sequence models. Section 3.5 explains the positional encoding formula you implemented. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)
- **BERT: Pre-training of Deep Bidirectional Transformers** - Devlin et al. (2018). Shows how embeddings combine with positional and segment encodings for language understanding tasks. BERT uses learned positional embeddings rather than sinusoidal. [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)
### Additional Resources
- **Blog Post**: "The Illustrated Word2Vec" by Jay Alammar - Visual explanation of learned word embeddings and semantic relationships
- **Documentation**: [PyTorch nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) - See production embedding implementation
- **Paper**: "GloVe: Global Vectors for Word Representation" - Pennington et al. (2014) - Alternative embedding approach based on co-occurrence statistics
## What's Next
```{seealso} Coming Up: Module 12 - Attention
Implement attention mechanisms that let embeddings interact with each other. You'll build the scaled dot-product attention that enables transformers to learn which tokens should influence each other, creating context-aware representations.
```

Preview - How Your Embeddings Get Used in Future Modules:

| Module | What It Does | Your Embeddings In Action |
|---|---|---|
| 12: Attention | Context-aware representations | |
| 13: Transformers | Complete sequence-to-sequence | |
Get Started#
Tip
Interactive Options
Launch Binder - Run interactively in browser, no setup required
Open in Colab - Use Google Colab for cloud compute
View Source - Browse the implementation code
Warning
Save Your Progress
Binder and Colab sessions are temporary. Download your completed notebook when done, or clone the repository for persistent local work.