TinyTorch Datasets#

Ship-with-Repo Datasets for Fast Learning

Small datasets for instant iteration + standard benchmarks for validation

Purpose: Understand TinyTorch’s dataset strategy and where to find each dataset used in milestones.

Design Philosophy#

TinyTorch uses a two-tier dataset approach:

Shipped Datasets

~350 KB total - Ships with repository

  • Small enough to fit in Git (~1K samples each)
  • Fast training (seconds to minutes)
  • Instant gratification for learners
  • Works offline - no download needed
  • Perfect for rapid iteration

Downloaded Datasets

~180 MB - Auto-downloaded when needed

  • Standard ML benchmarks (MNIST, CIFAR-10)
  • Larger scale (~60K samples)
  • Used for validation and scaling
  • Downloaded automatically by milestones
  • Cached locally for reuse

Philosophy: Following Andrej Karpathy’s “~1K samples” approach—small datasets for learning, full benchmarks for validation.

Shipped Datasets (Included with TinyTorch)#

TinyDigits - Handwritten Digit Recognition#

Location: datasets/tinydigits/
Size: ~310 KB
Used by: Milestones 03 & 04 (MLP and CNN examples)

Contents:

  • 1,000 training samples

  • 200 test samples

  • 8×8 grayscale images (downsampled from MNIST)

  • 10 classes (digits 0-9)

Format: Python pickle file with NumPy arrays

Why 8×8?

  • Fast iteration: Trains in seconds

  • Memory-friendly: Small enough to debug

  • Conceptually complete: Same challenges as 28×28 MNIST (a downsampling sketch follows this list)

  • Git-friendly: Only 310 KB vs 10 MB for full MNIST
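
The 8×8 images were produced by downsampling MNIST. The exact preprocessing isn't documented here, but block averaging is one common approach; the sketch below crops 28×28 to 24×24 and averages 3×3 blocks (an illustrative assumption, not necessarily how TinyDigits was built):

import numpy as np

def downsample_28_to_8(images):
    """Downsample (N, 28, 28) images to (N, 8, 8) via crop + block averaging."""
    cropped = images[:, 2:26, 2:26]           # crop 2-pixel border -> (N, 24, 24)
    blocks = cropped.reshape(-1, 8, 3, 8, 3)  # split into an 8x8 grid of 3x3 blocks
    return blocks.mean(axis=(2, 4))           # average each block -> (N, 8, 8)

batch = np.random.rand(16, 28, 28)            # stand-in for MNIST images
print(downsample_28_to_8(batch).shape)        # (16, 8, 8)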

Usage in milestones:

# Automatically loaded by milestones
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()
# X_train shape: (1000, 8, 8)
# y_train shape: (1000,)

TinyTalks - Conversational Q&A Dataset#

Location: datasets/tinytalks/
Size: ~40 KB
Used by: Milestone 05 (Transformer/GPT text generation)

Contents:

  • 350 Q&A pairs across 5 difficulty levels

  • Character-level text data

  • Topics: General knowledge, math, science, reasoning

  • Balanced difficulty distribution

Format: Plain text files with Q: / A: format

Why conversational format?

  • Engaging: Questions feel natural

  • Varied: Different answer lengths and complexity

  • Educational: Difficulty levels scaffold learning

  • Practical: Mirrors real chatbot use cases

Example:

Q: What is the capital of France?
A: Paris

Q: If a train travels 120 km in 2 hours, what is its average speed?
A: 60 km/h

Usage in milestones:

# Automatically loaded by transformer milestones
from datasets.tinytalks import load_tinytalks
dataset = load_tinytalks()
# Returns list of (question, answer) pairs
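
Under the hood, parsing the Q: / A: plain-text format is straightforward. A minimal sketch (illustrative only; the shipped loader and file layout may differ, and the path below is hypothetical):

# Hypothetical parser for Q:/A: plain-text files
def parse_qa_pairs(path):
    pairs = []
    question = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:") and question is not None:
                pairs.append((question, line[2:].strip()))
                question = None
    return pairs

# pairs = parse_qa_pairs("datasets/tinytalks/data.txt")  # illustrative path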

See detailed documentation: datasets/tinytalks/README.md

Downloaded Datasets (Auto-Downloaded On-Demand)#

These standard benchmarks download automatically when you run the relevant milestone scripts:

MNIST - Handwritten Digit Classification#

Downloads to: milestones/datasets/mnist/
Size: ~10 MB (compressed)
Used by: milestones/03_1986_mlp/02_rumelhart_mnist.py

Contents:

  • 60,000 training samples

  • 10,000 test samples

  • 28×28 grayscale images

  • 10 classes (digits 0-9)

Auto-download: When you run the MNIST milestone script, it automatically does the following (a sketch of the pattern appears after this list):

  1. Checks if data exists locally

  2. Downloads if needed (~10 MB)

  3. Caches for future runs

  4. Loads data using your TinyTorch DataLoader
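
The actual logic lives in milestones/data_manager.py; the general check-download-cache pattern looks roughly like this sketch (the URL and cache layout are placeholders, not the real ones used by TinyTorch):

import urllib.request
from pathlib import Path

def fetch_once(url, cache_dir="milestones/datasets/mnist"):
    """Download a file only if it isn't already cached locally."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / url.rsplit("/", 1)[-1]
    if not target.exists():                           # 1. check local cache
        urllib.request.urlretrieve(url, str(target))  # 2. download if missing
    return target                                     # 3. reused on future runs

# fetch_once("https://example.com/train-images-idx3-ubyte.gz")  # placeholder URL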

Purpose: Validate that your framework achieves production-level results (95%+ accuracy target)

Milestone goal: Implement backpropagation and achieve 95%+ accuracy, matching Rumelhart's 1986 breakthrough.

CIFAR-10 - Natural Image Classification#

Downloads to: milestones/datasets/cifar-10/
Size: ~170 MB (compressed)
Used by: milestones/04_1998_cnn/02_lecun_cifar10.py

Contents:

  • 50,000 training samples

  • 10,000 test samples

  • 32×32 RGB images

  • 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)

Auto-download: The milestone script handles everything (an integrity-check sketch follows this list):

  1. Downloads from official source

  2. Verifies integrity

  3. Caches locally

  4. Preprocesses for your framework
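
Step 2's integrity check is typically a checksum comparison. A sketch of the idea (the expected digest below is a placeholder; see milestones/data_manager.py for the real verification):

import hashlib

def verify_md5(path, expected_md5):
    """Compare a file's MD5 digest against a known-good value."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_md5

# verify_md5("cifar-10-python.tar.gz", "<expected-md5-here>")  # placeholder hash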

Purpose: Prove your CNN implementation works on real natural images (75%+ accuracy target)

Milestone goal: Build a LeNet-style CNN achieving 75%+ accuracy, demonstrating spatial intelligence.

Dataset Selection Rationale#

Why These Specific Datasets?#

TinyDigits (not full MNIST):

  • 100× faster training iterations

  • Ships with repo (no download)

  • Same conceptual challenges

  • Perfect for learning and debugging

TinyTalks (custom dataset):

  • Designed for educational progression

  • Scaffolded difficulty levels

  • Character-level tokenization friendly (see the sketch after this list)

  • Engaging conversational format
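
A character-level tokenizer, the kind TinyTalks is designed for, can be just a pair of lookup tables. A minimal sketch (illustrative; not TinyTorch's actual tokenizer API):

text = "Q: What is the capital of France?\nA: Paris"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

ids = [stoi[ch] for ch in text]                # encode
decoded = "".join(itos[i] for i in ids)        # decode
assert decoded == text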

MNIST (when scaling up):

  • Industry standard benchmark

  • Validates your implementation

  • Comparable to published results

  • 95%+ accuracy is achievable milestone

CIFAR-10 (for CNN validation):

  • Natural images (harder than digits)

  • RGB channels (multi-dimensional)

  • Standard CNN benchmark

  • 75%+ with basic CNN proves it works

Accessing Datasets#

For Students#

You don’t need to manually download anything!

# Just run milestone scripts
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py  # Uses shipped TinyDigits

python 02_rumelhart_mnist.py       # Auto-downloads MNIST if needed

The milestones handle all data loading automatically.

For Developers/Researchers#

Direct dataset access:

# Shipped datasets (always available)
from datasets.tinydigits import load_tinydigits
X_train, y_train, X_test, y_test = load_tinydigits()

from datasets.tinytalks import load_tinytalks
conversations = load_tinytalks()

# Downloaded datasets (through milestones)
# See milestones/data_manager.py for download utilities

Dataset Sizes Summary#

| Dataset    | Size   | Samples   | Ships With Repo | Purpose                |
|------------|--------|-----------|-----------------|------------------------|
| TinyDigits | 310 KB | 1,200     | Yes             | Fast MLP/CNN iteration |
| TinyTalks  | 40 KB  | 350 pairs | Yes             | Transformer learning   |
| MNIST      | 10 MB  | 70,000    | Downloads       | MLP validation         |
| CIFAR-10   | 170 MB | 60,000    | Downloads       | CNN validation         |

Total shipped: ~350 KB
Total with benchmarks: ~180 MB

Why Ship-with-Repo Matters#

Traditional ML courses:

  • “Download MNIST (10 MB)”

  • “Download CIFAR-10 (170 MB)”

  • Wait for downloads before starting

  • Large files in Git (bad practice)

TinyTorch approach:

  • Clone repo → Immediately start learning

  • Train first model in under 1 minute

  • Full benchmarks download only when scaling

  • Git repo stays small and fast

Educational benefit: Students see working models within minutes, not hours.

Frequently Asked Questions#

Q: Why not use full MNIST from the start? A: TinyDigits trains 100× faster, enabling rapid iteration during learning. MNIST validates your complete implementation later.

Q: Can I use my own datasets? A: Absolutely! TinyTorch is a real framework—add your data loading code just like PyTorch.
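
For example, a framework-agnostic loader for your own arrays might be as simple as this sketch (the .npz path and 80/20 split are illustrative):

import numpy as np

def load_my_dataset(path="my_data.npz"):      # hypothetical file saved via np.savez
    data = np.load(path)
    X, y = data["X"], data["y"]
    split = int(0.8 * len(X))                 # simple 80/20 train/test split
    return X[:split], y[:split], X[split:], y[split:]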

Q: Why ship datasets in Git? A: 350 KB is negligible (smaller than many images), and it enables offline learning with instant iteration.

Q: Where does CIFAR-10 download from? A: Official sources via milestones/data_manager.py, with integrity verification.

Q: Can I skip the large downloads? A: Yes! You can work through most milestones using only shipped datasets. Downloaded datasets are for validation milestones.