TinyTorch Datasets

Ship-with-Repo Datasets for Fast Learning

Small datasets for instant iteration + standard benchmarks for validation

Purpose: Understand TinyTorch’s dataset strategy and where to find each dataset used in milestones.

Design Philosophy

TinyTorch uses a two-tier dataset approach:

Shipped Datasets

~350 KB total - Ships with repository

  • Small enough to fit in Git (roughly 1K samples or fewer each)
  • Fast training (seconds to minutes)
  • Instant gratification for learners
  • Works offline - no download needed
  • Perfect for rapid iteration

Downloaded Datasets

~180 MB - Auto-downloaded when needed

  • Standard ML benchmarks (MNIST, CIFAR-10)
  • Larger scale (~60K samples)
  • Used for validation and scaling
  • Downloaded automatically by milestones
  • Cached locally for reuse

Philosophy: This follows Andrej Karpathy’s “~1K samples” approach: small datasets for learning, full benchmarks for validation.

Shipped Datasets (Included with TinyTorch)

TinyDigits - Handwritten Digit Recognition

Location: datasets/tinydigits/
Size: ~310 KB
Used by: Milestones 03 & 04 (MLP and CNN examples)

Contents:

  • 1,000 training samples
  • 200 test samples
  • 8×8 grayscale images (downsampled from MNIST)
  • 10 classes (digits 0-9)

Format: Python pickle file with NumPy arrays

Why 8×8?

  • Fast iteration: Trains in seconds
  • Memory-friendly: Small enough to debug
  • Conceptually complete: Same challenges as 28×28 MNIST
  • Git-friendly: Only 310 KB vs 10 MB for full MNIST
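
As a rough back-of-the-envelope check on the "fast iteration" point, the sketch below counts raw pixel values per image and per pass over the training set, using only the shapes and sample counts quoted in this document. It is arithmetic on input sizes, not a measurement of training time, which also depends on the model and hardware.

```python
# Pixel counts taken from the image shapes quoted in this document.
tiny_pixels = 8 * 8        # TinyDigits image: 64 values
mnist_pixels = 28 * 28     # MNIST image: 784 values

tiny_train = 1_000         # TinyDigits training samples
mnist_train = 60_000       # MNIST training samples

per_image_ratio = mnist_pixels / tiny_pixels
per_epoch_ratio = (mnist_pixels * mnist_train) / (tiny_pixels * tiny_train)

print(f"Pixels per image: {tiny_pixels} vs {mnist_pixels} ({per_image_ratio:.2f}x)")
print(f"Pixel values per epoch: {per_epoch_ratio:.0f}x more for MNIST")
```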

Usage in milestones:

# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)
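
An MLP expects one flat vector per sample, so the (1000, 8, 8) array usually needs reshaping before training. The following is a minimal sketch of that step, not the milestone's actual preprocessing: `prepare_for_mlp` is a hypothetical helper, the stand-in array only mimics the documented shape, and dividing by the observed maximum assumes non-negative pixel values (the document does not state the pixel range).

```python
import numpy as np

def prepare_for_mlp(images):
    """Flatten (N, 8, 8) images to (N, 64) float32 rows scaled by the max value."""
    flat = images.reshape(images.shape[0], -1).astype(np.float32)
    max_value = flat.max()
    # Guard against an all-zero batch so we never divide by zero.
    return flat / max_value if max_value > 0 else flat

# Stand-in data with the documented TinyDigits shape; swap in the real
# X_train from load_tinydigits() when running inside the repository.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 17, size=(1000, 8, 8))
X_flat = prepare_for_mlp(X_train)
print(X_flat.shape)  # (1000, 64)
```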

TinyTalks - Conversational Q&A Dataset

Location: datasets/tinytalks/
Size: ~40 KB
Used by: Milestone 05 (Transformer/GPT text generation)

Contents:

  • 350 Q&A pairs across 5 difficulty levels
  • Character-level text data
  • Topics: General knowledge, math, science, reasoning
  • Balanced difficulty distribution

Format: Plain text files with Q: / A: format

Why conversational format?

  • Engaging: Questions feel natural
  • Varied: Different answer lengths and complexity
  • Educational: Difficulty levels scaffold learning
  • Practical: Mirrors real chatbot use cases

Example:

Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h

Usage in milestones:

# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
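
To make the Q: / A: format and the character-level framing concrete, here is a small sketch that parses such text into pairs and builds a character vocabulary. It is an illustration only: `parse_qa_text` is a hypothetical helper, not the code behind `load_tinytalks`, and the real files may contain details (such as difficulty markers) that this parser ignores.

```python
def parse_qa_text(text):
    """Parse 'Q: ...' / 'A: ...' lines into (question, answer) tuples."""
    pairs = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

sample = """Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h
"""

pairs = parse_qa_text(sample)
# Character-level vocabulary: every distinct character seen in the text.
vocab = sorted(set(sample))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
encoded = [char_to_id[ch] for ch in pairs[0][1]]
print(pairs[0])      # ('What is the capital of France?', 'Paris')
print(len(encoded))  # 5
```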

See detailed documentation: datasets/tinytalks/README.md

Optional Downloaded Datasets

These standard benchmarks are downloaded on demand for scale-up milestones, extensions, and research experiments.

MNIST - Handwritten Digit Classification

Downloads to: milestones/datasets/mnist/
Size: ~10 MB (compressed)
Used by: Optional MLP extensions after the current TinyDigits milestone

Contents:

  • 60,000 training samples
  • 10,000 test samples
  • 28×28 grayscale images
  • 10 classes (digits 0-9)

Auto-download: The dataset manager can:

  1. Check if data exists locally
  2. Download if needed (~10 MB)
  3. Cache for future runs
  4. Load data using your TinyTorch DataLoader
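
The first three steps above (check, download, cache) can be sketched roughly as follows. This is not the code in milestones/data_manager.py: `ensure_cached`, its `fetch` parameter, and the example URL are all hypothetical, and the demo uses a fake fetcher so it runs without network access. Step 4, loading through the DataLoader, is left out.

```python
import tempfile
import urllib.request
from pathlib import Path

def ensure_cached(url, cache_path, fetch=None):
    """Return cache_path, downloading url into it only if it is missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path  # Step 1: already cached, nothing to download
    if fetch is None:
        # Default fetcher: read the whole response body over HTTP(S).
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Steps 2-3: download, then write to a temp name and rename so an
    # interrupted download never leaves a half-written cache file.
    partial = cache_path.with_name(cache_path.name + ".part")
    partial.write_bytes(fetch(url))
    partial.replace(cache_path)
    return cache_path

# Demo with a fake fetcher (hypothetical URL, never contacted).
calls = []
def fake_fetch(u):
    calls.append(u)
    return b"fake-bytes"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mnist" / "train-images.gz"
    ensure_cached("https://example.invalid/train-images.gz", target, fake_fetch)
    ensure_cached("https://example.invalid/train-images.gz", target, fake_fetch)
    cached_bytes = target.read_bytes()
print(len(calls))  # 1: the second call hit the cache
```

Passing the fetcher in as a parameter keeps the caching logic testable offline; a real run would simply omit it.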

Purpose: Optional scale-up benchmark after the TinyDigits milestone

Extension goal: Validate that the same backpropagation stack can scale beyond the shipped TinyDigits dataset.

CIFAR-10 - Natural Image Classification

Downloads to: milestones/datasets/cifar-10/
Size: ~170 MB (compressed)
Used by: milestones/04_1998_cnn/02_lecun_cifar10.py

Contents:

  • 50,000 training samples
  • 10,000 test samples
  • 32×32 RGB images
  • 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

Auto-download: Milestone script handles everything:

  1. Downloads from official source
  2. Verifies integrity
  3. Caches locally
  4. Preprocesses for your framework
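
The document does not say how the milestone verifies integrity. One common approach, shown here purely as an assumed illustration, is to compare a SHA-256 digest of the downloaded archive against a known value. `sha256_of` and `verify_file` are hypothetical helper names, and the payload below is a stand-in, not CIFAR-10 data.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_file(path, expected_sha256):
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.lower()

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive.bin"
    archive.write_bytes(b"example payload")  # stand-in for a downloaded file
    good = hashlib.sha256(b"example payload").hexdigest()
    ok = verify_file(archive, good)
    tampered = verify_file(archive, "0" * 64)
print(ok, tampered)  # True False
```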

Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target)

Milestone goal: Build a LeNet-style CNN that achieves 75%+ accuracy, demonstrating spatial intelligence.

Dataset Selection Rationale

Why These Specific Datasets?

TinyDigits (not full MNIST):

  • 100× faster training iterations
  • Ships with repo (no download)
  • Same conceptual challenges
  • Perfect for learning and debugging

TinyTalks (custom dataset):

  • Designed for educational progression
  • Scaffolded difficulty levels
  • Character-level tokenization friendly
  • Engaging conversational format

MNIST (when scaling up):

  • Industry standard benchmark
  • Validates your implementation
  • Comparable to published results
  • Useful optional scale-up after TinyDigits

CIFAR-10 (for CNN validation):

  • Natural images (harder than digits)
  • RGB channels (multi-dimensional)
  • Standard CNN benchmark
  • 75%+ with basic CNN proves it works

Accessing Datasets

For Students

You don’t need to manually download anything!

# Run the current milestone script
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits

The milestones handle all data loading automatically.

For Developers/Researchers

Direct dataset access:

# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities
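
Once the arrays are loaded, a training loop typically consumes them in mini-batches. The sketch below shows one plain-NumPy way to do that; `iterate_minibatches` is a hypothetical helper rather than TinyTorch's DataLoader, and the zero-filled arrays only stand in for the documented TinyDigits training shapes.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng=None):
    """Yield (X_batch, y_batch) pairs covering the data once."""
    indices = np.arange(len(X))
    if rng is not None:
        rng.shuffle(indices)  # shuffle in place for a random epoch order
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

# Stand-in arrays with the documented TinyDigits training shapes.
X_train = np.zeros((1000, 8, 8), dtype=np.float32)
y_train = np.zeros(1000, dtype=np.int64)

batches = list(iterate_minibatches(X_train, y_train, 64, np.random.default_rng(0)))
print(len(batches))          # 16
print(batches[-1][0].shape)  # (40, 8, 8)
```

The last batch is smaller because 1,000 is not a multiple of 64; whether to keep or drop it is a design choice for the training loop.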

Dataset Sizes Summary

Dataset    | Size   | Samples   | Ships With Repo | Purpose
TinyDigits | 310 KB | 1,200     | Yes             | Fast MLP/CNN iteration
TinyTalks  | 40 KB  | 350 pairs | Yes             | Transformer learning
MNIST      | 10 MB  | 70,000    | No (downloads)  | Optional MLP scale-up
CIFAR-10   | 170 MB | 60,000    | No (downloads)  | CNN validation

Total shipped: ~350 KB
Total with benchmarks: ~180 MB

Why Ship-with-Repo Matters

Traditional ML courses:

  • “Download MNIST (10 MB)”
  • “Download CIFAR-10 (170 MB)”
  • Wait for downloads before starting
  • Large files in Git (bad practice)

TinyTorch approach:

  • Clone repo → Immediately start learning
  • Train first model in under 1 minute
  • Full benchmarks download only when scaling
  • Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.

Frequently Asked Questions

Q: Why not use full MNIST from the start?
A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST is available later as an optional scale-up benchmark.

Q: Can I use my own datasets?
A: Absolutely! TinyTorch is a real framework, so you can add your own data loading code just as you would in PyTorch.
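
As one possible shape for such code, the sketch below follows the map-style convention familiar from PyTorch, where a dataset exposes `__len__` and `__getitem__`. `ArrayDataset` is a hypothetical class written for illustration; whether TinyTorch provides its own base class to inherit from is not stated here, so this version is a plain Python class.

```python
import numpy as np

class ArrayDataset:
    """Map-style dataset: len() gives the size, indexing gives one sample."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

# Hypothetical custom data: 5 samples with 3 features each.
dataset = ArrayDataset(np.arange(15.0).reshape(5, 3), np.array([0, 1, 0, 1, 1]))
x, label = dataset[2]
print(len(dataset), x, label)  # 5 [6. 7. 8.] 0
```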

Q: Why ship datasets in Git?
A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from?
A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads?
A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.
