The Datasets Zoo
Reference Dataset Profiles for Data-Scale Reasoning
The Datasets Zoo provides canonical dataset sizes (training and test example counts, image dimensions, and class counts) for back-of-envelope data-pipeline and storage analysis.
| Dataset | Train Examples | Test Examples | Classes | Resolution |
|---|---|---|---|---|
| CIFAR-10 | 50,000 | 10,000 | 10 | 32×32×3 |
| ImageNet-1k | 1,281,167 | 50,000 | 1,000 | 224×224×3 |
| MNIST | 60,000 | 10,000 | 10 | 28×28×1 |
| Multilingual Spoken Words Corpus | 23,400,000 | — | — | — |
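
The table values are enough for a rough storage estimate. The sketch below is standalone Python and does not call `mlsysim`. The names `PROFILES`, `raw_bytes`, and `raw_storage_report` are hypothetical and exist only for this illustration. It assumes one byte (uint8) per channel value and no compression, so it estimates decoded tensor volume rather than on-disk size. ImageNet images in particular are usually stored as variable-size JPEGs, and 224×224×3 describes a typical decoded training input.

```python
# Raw (uncompressed) size estimate from the table values.
# Assumption: 1 byte (uint8) per channel value; real on-disk sizes differ
# because of compression and file-format overhead.

# name: (train_examples, test_examples, (height, width, channels))
PROFILES = {
    "CIFAR-10": (50_000, 10_000, (32, 32, 3)),
    "ImageNet-1k": (1_281_167, 50_000, (224, 224, 3)),
    "MNIST": (60_000, 10_000, (28, 28, 1)),
}


def raw_bytes(num_examples, shape, bytes_per_value=1):
    """Bytes needed to hold num_examples decoded examples of the given shape."""
    height, width, channels = shape
    return num_examples * height * width * channels * bytes_per_value


def raw_storage_report(profiles=PROFILES):
    """Map each dataset name to the raw byte count of train + test combined."""
    return {
        name: raw_bytes(train + test, shape)
        for name, (train, test, shape) in profiles.items()
    }


if __name__ == "__main__":
    for name, size in raw_storage_report().items():
        print(f"{name}: {size / 1e9:.3f} GB (raw, uint8)")
```

Under these assumptions, CIFAR-10 and MNIST fit easily in memory, while decoded ImageNet-1k comes to roughly 200 GB. The Multilingual Spoken Words Corpus is left out because the table lists no per-example dimensions for it.
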
Python Access
import mlsysim
imagenet = mlsysim.Datasets.ImageNet
cifar = mlsysim.Datasets.CIFAR10
mnist = mlsysim.Datasets.MNIST

Pair dataset profiles with DataModel and the Data Engineering textbook chapters when reasoning about ingestion bandwidth and epoch time.
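
As a rough illustration of the ingestion-bandwidth and epoch-time reasoning, the sketch below works directly from the table values instead of the `mlsysim` or `DataModel` APIs, whose interfaces are not shown here. The function names and the bandwidth figures are hypothetical. It assumes uint8 decoded examples and a sustained, constant read bandwidth, so the result is a lower bound on the time to stream one epoch, not a prediction of training time.

```python
def epoch_seconds(num_examples, bytes_per_example, bandwidth_bytes_per_s):
    """Lower bound on the time to stream one full epoch at a sustained bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_examples * bytes_per_example / bandwidth_bytes_per_s


def required_bandwidth(examples_per_s, bytes_per_example):
    """Bytes per second the data pipeline must sustain to feed a given example rate."""
    return examples_per_s * bytes_per_example


if __name__ == "__main__":
    # Assumed shapes from the table, 1 byte per channel value.
    cifar_example = 32 * 32 * 3        # 3,072 bytes
    imagenet_example = 224 * 224 * 3   # 150,528 bytes

    # Hypothetical sustained read bandwidths, chosen only for illustration.
    print(epoch_seconds(50_000, cifar_example, 100e6), "s for CIFAR-10 at 100 MB/s")
    print(epoch_seconds(1_281_167, imagenet_example, 1e9), "s for ImageNet-1k at 1 GB/s")

    # Bandwidth needed to deliver 1,000 decoded ImageNet examples per second.
    print(required_bandwidth(1_000, imagenet_example) / 1e6, "MB/s")
```

Comparing the required bandwidth with what storage or the network can actually deliver gives a quick check on whether a pipeline is likely to be input-bound.
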