# TinyTorch Datasets
Ship-with-Repo Datasets for Fast Learning
Small datasets for instant iteration + standard benchmarks for validation
Purpose: Understand TinyTorch’s dataset strategy and where to find each dataset used in milestones.
## Design Philosophy
TinyTorch uses a two-tier dataset approach:
Shipped Datasets
~350 KB total - Ships with repository
- Small enough to fit in Git (~1K samples each)
- Fast training (seconds to minutes)
- Instant gratification for learners
- Works offline - no download needed
- Perfect for rapid iteration
Downloaded Datasets
~180 MB - Auto-downloaded when needed
- Standard ML benchmarks (MNIST, CIFAR-10)
- Larger scale (~60K samples)
- Used for validation and scaling
- Downloaded automatically by milestones
- Cached locally for reuse
Philosophy: Following Andrej Karpathy’s “~1K samples” approach—small datasets for learning, full benchmarks for validation.
## Shipped Datasets (Included with TinyTorch)

### TinyDigits - Handwritten Digit Recognition
Location: datasets/tinydigits/
Size: ~310 KB
Used by: Milestones 03 & 04 (MLP and CNN examples)
Contents:
- 1,000 training samples
- 200 test samples
- 8×8 grayscale images (downsampled from MNIST)
- 10 classes (digits 0-9)

Format: Python pickle file with NumPy arrays
Why 8×8?
- Fast iteration: Trains in seconds
- Memory-friendly: Small enough to debug
- Conceptually complete: Same challenges as 28×28 MNIST
- Git-friendly: Only 310 KB vs 10 MB for full MNIST
Usage in milestones:

```python
# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
```
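Since the arrays come back as plain NumPy, a typical first step before the MLP milestone is to flatten each 8×8 image into a 64-dimensional vector and scale the pixel values. A minimal sketch (the raw pixel range is not documented here, so this scales by the observed maximum rather than assuming one):

```python
import numpy as np
from datasets.tinydigits import load_tinydigits

X_train, y_train, X_test, y_test = load_tinydigits()

# Flatten (N, 8, 8) images into (N, 64) vectors for an MLP.
X_train_flat = X_train.reshape(len(X_train), -1).astype(np.float32)
X_test_flat = X_test.reshape(len(X_test), -1).astype(np.float32)

# Scale pixels into [0, 1] by the training set's observed maximum,
# avoiding an assumption about the raw range (0-16 vs 0-255).
scale = X_train_flat.max()
X_train_flat /= scale
X_test_flat /= scale

print(X_train_flat.shape)  # (1000, 64)
```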
### TinyTalks - Conversational Q&A Dataset
Location: datasets/tinytalks/
Size: ~40 KB
Used by: Milestone 05 (Transformer/GPT text generation)
Contents:
- 350 Q&A pairs across 5 difficulty levels
- Character-level text data
- Topics: General knowledge, math, science, reasoning
- Balanced difficulty distribution

Format: Plain text files with Q: / A: format
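Because the files are plain text with alternating Q:/A: lines, parsing them takes only a few lines of Python. A hedged sketch (the exact file layout may differ slightly; see the dataset README, and note the path in the comment is hypothetical):

```python
def parse_qa_file(path):
    """Parse a plain-text file of 'Q: ...' / 'A: ...' lines into (question, answer) pairs."""
    pairs, question = [], None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:") and question is not None:
                pairs.append((question, line[2:].strip()))
                question = None
    return pairs

# Hypothetical filename, for illustration only:
# pairs = parse_qa_file("datasets/tinytalks/level1.txt")
```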
Why conversational format?
- Engaging: Questions feel natural
- Varied: Different answer lengths and complexity
- Educational: Difficulty levels scaffold learning
- Practical: Mirrors real chatbot use cases
Example:

```text
Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
```
Usage in milestones:

```python
# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks

dataset = load_tinytalks()
# Returns a list of (question, answer) pairs
```
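Since Milestone 05 works at the character level, the usual next step is to build a character vocabulary over the whole corpus. A minimal sketch using the loader above (the exact prompt formatting the milestone uses may differ):

```python
from datasets.tinytalks import load_tinytalks

pairs = load_tinytalks()

# Join every Q&A pair into one training corpus.
corpus = "\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)

# Character-level vocabulary: each distinct character gets an integer id.
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(f"vocab size: {len(chars)}")
print(decode(encode(corpus[:40])))  # round-trips the first 40 characters
```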
See detailed documentation: datasets/tinytalks/README.md
## Downloaded Datasets (Auto-Downloaded On-Demand)

These standard benchmarks download automatically when you run the relevant milestone scripts:
### MNIST - Handwritten Digit Classification
Downloads to: milestones/datasets/mnist/
Size: ~10 MB (compressed)
Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py
Contents:
- 60,000 training samples
- 10,000 test samples
- 28×28 grayscale images
- 10 classes (digits 0-9)
Auto-download: When you run the MNIST milestone script, it automatically does the following (the download-and-cache pattern is sketched below):
1. Checks if data exists locally
2. Downloads if needed (~10 MB)
3. Caches for future runs
4. Loads data using your TinyTorch DataLoader
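The check/download/cache pattern itself is simple. The sketch below is an illustrative stand-in for what milestones/data_manager.py does, not its actual code; the URL and destination path are placeholders:

```python
import urllib.request
from pathlib import Path

def fetch_cached(url: str, dest: Path) -> Path:
    """Download url to dest unless a cached copy already exists."""
    if dest.exists():
        return dest  # cached from a previous run
    dest.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {url} ...")
    urllib.request.urlretrieve(url, str(dest))
    return dest

# Placeholder URL/path for illustration only:
# fetch_cached("https://example.com/mnist.gz",
#              Path("milestones/datasets/mnist/mnist.gz"))
```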
Purpose: Validate that your framework achieves production-level results (95%+ accuracy target)
Milestone goal: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.
### CIFAR-10 - Natural Image Classification
Downloads to: milestones/datasets/cifar-10/
Size: ~170 MB (compressed)
Used by: milestones/04_1998_cnn/02_lecun_cifar10.py
Contents:
- 50,000 training samples
- 10,000 test samples
- 32×32 RGB images
- 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)
Auto-download: The milestone script handles everything (integrity verification is sketched below):
1. Downloads from the official source
2. Verifies integrity
3. Caches locally
4. Preprocesses for your framework
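Integrity verification (step 2) typically means hashing the downloaded file and comparing against a known digest. A generic sketch, not the actual data_manager.py code; the expected hash below is a placeholder:

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest matches the expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_hex

# Placeholder digest for illustration; the real checksum lives with the download code.
# ok = verify_sha256(Path("milestones/datasets/cifar-10/cifar-10.tar.gz"), "00" * 32)
```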
Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target)
Milestone goal: Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence.
## Dataset Selection Rationale

### Why These Specific Datasets?
TinyDigits (not full MNIST):
- 100× faster training iterations
- Ships with repo (no download)
- Same conceptual challenges
- Perfect for learning and debugging
TinyTalks (custom dataset):
- Designed for educational progression
- Scaffolded difficulty levels
- Character-level tokenization friendly
- Engaging conversational format
MNIST (when scaling up):
- Industry-standard benchmark
- Validates your implementation
- Comparable to published results
- 95%+ accuracy is an achievable milestone
CIFAR-10 (for CNN validation):
- Natural images (harder than digits)
- RGB channels (multi-dimensional)
- Standard CNN benchmark
- 75%+ accuracy with a basic CNN proves it works
## Accessing Datasets

### For Students
You don’t need to manually download anything!
```bash
# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits
python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed
```
The milestones handle all data loading automatically.
### For Developers/Researchers
Direct dataset access:

```python
# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
from datasets.tinytalks import load_tinytalks

X_train, y_train, X_test, y_test = load_tinydigits()
conversations = load_tinytalks()

# Downloaded datasets (through milestones):
# see milestones/data_manager.py for download utilities
```
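If you are experimenting outside the milestones, a shuffled mini-batch iterator over the NumPy arrays is often all you need. A minimal sketch (this helper is illustrative, not part of TinyTorch's documented API):

```python
import numpy as np
from datasets.tinydigits import load_tinydigits

def iterate_minibatches(X, y, batch_size=32, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) mini-batches from parallel NumPy arrays."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

X_train, y_train, _, _ = load_tinydigits()
for xb, yb in iterate_minibatches(X_train, y_train):
    pass  # feed each batch to your model's training step
```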
## Dataset Sizes Summary
| Dataset | Size | Samples | Ships With Repo | Purpose |
|---|---|---|---|---|
| TinyDigits | 310 KB | 1,200 | Yes | Fast MLP/CNN iteration |
| TinyTalks | 40 KB | 350 pairs | Yes | Transformer learning |
| MNIST | 10 MB | 70,000 | Downloads | MLP validation |
| CIFAR-10 | 170 MB | 60,000 | Downloads | CNN validation |
Total shipped: ~350 KB. Total with benchmarks: ~180 MB.
## Why Ship-with-Repo Matters
Traditional ML courses:
- “Download MNIST (10 MB)”
- “Download CIFAR-10 (170 MB)”
- Wait for downloads before starting
- Large files in Git (bad practice)
TinyTorch approach:
- Clone repo → immediately start learning
- Train your first model in under 1 minute
- Full benchmarks download only when scaling
- Git repo stays small and fast
Educational benefit: Students see working models within minutes, not hours.
## Frequently Asked Questions
Q: Why not use full MNIST from the start? A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.
Q: Can I use my own datasets? A: Absolutely! TinyTorch is a real framework; add your own data loading code just as you would with PyTorch (see the sketch below).
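For example, a minimal PyTorch-style dataset wrapper over your own arrays might look like this (the class below is illustrative, not part of TinyTorch's documented API):

```python
import numpy as np

class ArrayDataset:
    """Minimal dataset: index into parallel arrays of inputs and labels."""
    def __init__(self, X: np.ndarray, y: np.ndarray):
        assert len(X) == len(y), "inputs and labels must align"
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

# my_data = ArrayDataset(np.load("my_inputs.npy"), np.load("my_labels.npy"))
```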
Q: Why ship datasets in Git? A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.
Q: Where does CIFAR-10 download from? A: From the official source via milestones/data_manager.py, with integrity verification.
Q: Can I skip the large downloads? A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.