The Datasets Zoo

Reference Dataset Profiles for Data-Scale Reasoning

The Datasets Zoo provides canonical dataset profiles (training and test example counts, image dimensions, and class counts) for back-of-envelope data-pipeline and storage analysis.

Dataset                           Train Examples  Test Examples  Classes  Resolution
CIFAR-10                          50,000          10,000         10       32×32×3
ImageNet-1k                       1,281,167       50,000         1,000    224×224×3
MNIST                             60,000          10,000         10       28×28×1
Multilingual Spoken Words Corpus  23,400,000
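
As a rough illustration of the storage analysis these numbers support, the sketch below estimates the uncompressed size of each image training set. It assumes one byte per channel value (uint8) and the resolutions listed in the table. The `PROFILES` dictionary and the `raw_train_bytes` helper are hypothetical names made up for this example and are not part of `mlsysim`. Real on-disk sizes will differ, since datasets are usually stored compressed and ImageNet images vary in native resolution.

```python
# Hypothetical stand-alone profiles copied from the table above;
# this is NOT the mlsysim API.
PROFILES = {
    "CIFAR-10": {"train": 50_000, "shape": (32, 32, 3)},
    "ImageNet-1k": {"train": 1_281_167, "shape": (224, 224, 3)},
    "MNIST": {"train": 60_000, "shape": (28, 28, 1)},
}


def raw_train_bytes(profile, bytes_per_value=1):
    """Uncompressed training-set size; assumes uint8 values by default."""
    height, width, channels = profile["shape"]
    return profile["train"] * height * width * channels * bytes_per_value


for name, profile in PROFILES.items():
    # Report in decimal gigabytes (1 GB = 1e9 bytes).
    print(f"{name}: {raw_train_bytes(profile) / 1e9:.2f} GB")
```

Under these assumptions, CIFAR-10 comes to roughly 0.15 GB and ImageNet-1k at 224×224×3 to roughly 193 GB, which is the kind of order-of-magnitude gap the profiles are meant to expose.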

Python Access

import mlsysim

imagenet = mlsysim.Datasets.ImageNet
cifar = mlsysim.Datasets.CIFAR10
mnist = mlsysim.Datasets.MNIST

Pair dataset profiles with DataModel and the Data Engineering textbook chapters when reasoning about ingestion bandwidth and epoch time.
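
To make the epoch-time reasoning concrete, here is a minimal sketch of a bandwidth-bound estimate: epoch time is at least the bytes read per epoch divided by sustained ingestion bandwidth. The `epoch_seconds` function, the 1 GB/s bandwidth figure, and the uint8 tensor size are all assumptions chosen for illustration; none of them come from `mlsysim` or `DataModel`.

```python
def epoch_seconds(num_examples, bytes_per_example, bandwidth_bytes_per_s):
    """Lower bound on epoch time if ingestion bandwidth is the only limit."""
    return num_examples * bytes_per_example / bandwidth_bytes_per_s


# ImageNet-1k training set as 224x224x3 uint8 tensors (an assumption;
# stored JPEGs are smaller and vary in size).
bytes_per_example = 224 * 224 * 3

# Hypothetical sustained read bandwidth of 1 GB/s.
bandwidth = 1e9

t = epoch_seconds(1_281_167, bytes_per_example, bandwidth)
print(f"Bandwidth-bound epoch time: {t:.0f} s")
```

This ignores decoding, augmentation, and compute time, so it is only a floor on epoch duration rather than a prediction.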
