TinyTorch Datasets#

Ship-with-Repo Datasets for Fast Learning

Small datasets for instant iteration + standard benchmarks for validation

Purpose: Understand TinyTorch’s dataset strategy and where to find each dataset used in milestones.

Design Philosophy#

TinyTorch uses a two-tier dataset approach:

Shipped Datasets

~350 KB total - Ships with repository

Small enough to fit in Git (~1K samples each)
Fast training (seconds to minutes)
Instant gratification for learners
Works offline - no download needed
Perfect for rapid iteration

Downloaded Datasets

~180 MB - Auto-downloaded when needed

Standard ML benchmarks (MNIST, CIFAR-10)
Larger scale (~60K samples)
Used for validation and scaling
Downloaded automatically by milestones
Cached locally for reuse

Philosophy: This follows Andrej Karpathy’s “~1K samples” approach, with small datasets for learning and full benchmarks for validation.

Shipped Datasets (Included with TinyTorch)#

TinyDigits - Handwritten Digit Recognition#

Location: datasets/tinydigits/
Size: ~310 KB
Used by: Milestones 03 & 04 (MLP and CNN examples)

Contents:

1,000 training samples
200 test samples
8×8 grayscale images (downsampled from MNIST)
10 classes (digits 0-9)

Format: Python pickle file with NumPy arrays

Why 8×8?

Fast iteration: Trains in seconds
Memory-friendly: Small enough to debug
Conceptually complete: Same challenges as 28×28 MNIST
Git-friendly: Only 310 KB vs 10 MB for full MNIST

Usage in milestones:

# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
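As a rough sketch of a typical next step, the snippet below flattens 8×8 images into 64-element vectors for an MLP and scales them to [0, 1]. It runs on random stand-in arrays with the shapes listed above, not on the real dataset. The `prepare_for_mlp` helper and the 0-255 stand-in pixel range are assumptions made for illustration, not part of TinyTorch.

```python
import numpy as np

def prepare_for_mlp(images: np.ndarray) -> np.ndarray:
    """Flatten (N, 8, 8) images to (N, 64) float32 vectors scaled to [0, 1]."""
    flat = images.reshape(images.shape[0], -1).astype(np.float32)
    peak = flat.max()
    # Guard against an all-zero batch so we never divide by zero.
    return flat / peak if peak > 0 else flat

# Stand-in data with the documented TinyDigits training shapes (not the real dataset).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(1000, 8, 8))
y_train = rng.integers(0, 10, size=(1000,))

X_flat = prepare_for_mlp(X_train)
print(X_flat.shape, X_flat.dtype)
```

A CNN would skip the flattening and keep the (N, 8, 8) layout, usually with an added channel axis.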

TinyTalks - Conversational Q&A Dataset#

Location: datasets/tinytalks/
Size: ~40 KB
Used by: Milestone 05 (Transformer/GPT text generation)

Contents:

350 Q&A pairs across 5 difficulty levels
Character-level text data
Topics: General knowledge, math, science, reasoning
Balanced difficulty distribution

Format: Plain text files with Q: / A: format

Why conversational format?

Engaging: Questions feel natural
Varied: Different answer lengths and complexity
Educational: Difficulty levels scaffold learning
Practical: Mirrors real chatbot use cases

Example:

Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
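To make the Q: / A: layout concrete, here is a minimal sketch of a parser for text in this shape. `parse_qa_text` is a hypothetical helper written for illustration; the shipped `load_tinytalks` may read its files differently.

```python
def parse_qa_text(text: str) -> list[tuple[str, str]]:
    """Parse 'Q: ...' / 'A: ...' lines into (question, answer) pairs."""
    pairs = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # wait for the next question
    return pairs

sample = """Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
"""

pairs = parse_qa_text(sample)
print(pairs[0])  # → ('What is the capital of France?', 'Paris')
```

The sketch silently skips answers with no preceding question, which keeps it tolerant of blank lines and stray text.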

Usage in milestones:

# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
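Because the data is character-level, a list of (question, answer) pairs can be turned into integer sequences with a very small vocabulary. The following is a sketch under that assumption. `build_char_vocab`, `encode`, and `decode` are illustrative names, not TinyTorch APIs, and the single hard-coded pair stands in for the real dataset.

```python
def build_char_vocab(pairs):
    """Map every character that appears in the pairs to an integer id."""
    chars = sorted({ch for q, a in pairs for ch in q + a})
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Convert a string to a list of character ids."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Convert a list of character ids back to a string."""
    return "".join(itos[i] for i in ids)

# Stand-in for the list returned by the loader.
pairs = [("What is the capital of France?", "Paris")]
stoi, itos = build_char_vocab(pairs)
ids = encode("Paris", stoi)
print(decode(ids, itos))  # → Paris
```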

See detailed documentation: datasets/tinytalks/README.md

Downloaded Datasets (Auto-Downloaded On-Demand)#

These standard benchmarks download automatically when you run relevant milestone scripts:

MNIST - Handwritten Digit Classification#

Downloads to: milestones/datasets/mnist/
Size: ~10 MB (compressed)
Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py

Contents:

60,000 training samples
10,000 test samples
28×28 grayscale images
10 classes (digits 0-9)

Auto-download: When you run the MNIST milestone script, it automatically:

Checks if data exists locally
Downloads if needed (~10 MB)
Caches for future runs
Loads data using your TinyTorch DataLoader
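The check, download, and cache steps above can be sketched as a small helper. This is not the code in the milestone script. `ensure_cached`, the injected `fetch` callable, and the file name are hypothetical, and the fake fetcher replaces a real network call so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

def ensure_cached(path: Path, fetch: Callable[[], bytes]) -> Path:
    """Return `path`, calling `fetch` to populate it only if it is missing."""
    if path.exists():
        return path  # cache hit: nothing to download
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(fetch())
    tmp.replace(path)  # rename last so a partial download never looks complete
    return path

calls = []

def fake_fetch() -> bytes:
    """Stand-in for a real download; records how often it is called."""
    calls.append(1)
    return b"stand-in bytes"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "mnist" / "train-images.bin"  # hypothetical file name
    ensure_cached(target, fake_fetch)  # first run: fetches and caches
    ensure_cached(target, fake_fetch)  # second run: served from the cache
    cached_bytes = target.read_bytes()

print(len(calls))  # → 1
```

Passing the fetcher in as a callable keeps the caching logic testable without network access.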

Purpose: Validate that your framework reaches benchmark-level results (95%+ accuracy target)

Milestone goal: Implement backpropagation and reach 95%+ accuracy, matching Rumelhart’s 1986 breakthrough.

CIFAR-10 - Natural Image Classification#

Downloads to: milestones/datasets/cifar-10/
Size: ~170 MB (compressed)
Used by: milestones/04_1998_cnn/02_lecun_cifar10.py

Contents:

50,000 training samples
10,000 test samples
32×32 RGB images
10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

Auto-download: Milestone script handles everything:

Downloads from official source
Verifies integrity
Caches locally
Preprocesses for your framework
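Integrity verification usually means comparing a checksum of the downloaded file against a known value. The source does not say which hash the milestone script uses, so the sketch below assumes SHA-256. `sha256_of` and `verify` are hypothetical helpers, and the archive is a temporary stand-in file, not the real CIFAR-10 download.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected_hex: str) -> bool:
    """Return True when the file's SHA-256 digest matches the expected value."""
    return sha256_of(path) == expected_hex.lower()

with tempfile.TemporaryDirectory() as tmp_dir:
    archive = Path(tmp_dir) / "archive.bin"  # stand-in for a downloaded archive
    archive.write_bytes(b"stand-in archive contents")
    expected = hashlib.sha256(b"stand-in archive contents").hexdigest()
    ok = verify(archive, expected)        # matching checksum
    tampered = verify(archive, "0" * 64)  # deliberately wrong checksum

print(ok, tampered)  # → True False
```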

Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target)

Milestone goal: Build a LeNet-style CNN that reaches 75%+ accuracy, demonstrating that it learns spatial features.

Dataset Selection Rationale#

Why These Specific Datasets?#

TinyDigits (not full MNIST):

100× faster training iterations
Ships with repo (no download)
Same conceptual challenges
Perfect for learning and debugging

TinyTalks (custom dataset):

Designed for educational progression
Scaffolded difficulty levels
Character-level tokenization friendly
Engaging conversational format

MNIST (when scaling up):

Industry standard benchmark
Validates your implementation
Comparable to published results
95%+ accuracy is an achievable milestone

CIFAR-10 (for CNN validation):

Natural images (harder than digits)
RGB channels (multi-dimensional)
Standard CNN benchmark
75%+ accuracy with a basic CNN shows the implementation works

Accessing Datasets#

For Students#

You don’t need to manually download anything!

# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits

python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed

The milestones handle all data loading automatically.

For Developers/Researchers#

Direct dataset access:

# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities

Dataset Sizes Summary#

Dataset	Size	Samples	Ships With Repo	Purpose
TinyDigits	310 KB	1,200	Yes	Fast MLP/CNN iteration
TinyTalks	40 KB	350 pairs	Yes	Transformer learning
MNIST	10 MB	70,000	Downloads	MLP validation
CIFAR-10	170 MB	60,000	Downloads	CNN validation

Total shipped: ~350 KB
Total with benchmarks: ~180 MB

Why Ship-with-Repo Matters#

Traditional ML courses:

“Download MNIST (10 MB)”
“Download CIFAR-10 (170 MB)”
Wait for downloads before starting
Large files in Git (bad practice)

TinyTorch approach:

Clone repo → Immediately start learning
Train first model in under 1 minute
Full benchmarks download only when scaling
Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.

Frequently Asked Questions#

Q: Why not use full MNIST from the start?
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets?
A: Absolutely! TinyTorch is a real framework, so you can add your own data loading code just as you would in PyTorch.

Q: Why ship datasets in Git?
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from?
A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads?
A: Yes! You can work through most milestones using only the shipped datasets. The downloaded datasets are needed only for the validation milestones.
