The Model Zoo
Reference Workloads for Systems Modeling
The Model Zoo defines the Computational Demand placed on the hardware. Every workload is pulled from the mlsysim.Models registry and characterized by its FLOPs, parameter count, and architecture type—independent of any specific hardware.
The key number for roofline analysis is each model’s arithmetic intensity—how many floating-point operations it performs per byte of memory loaded. Models with low arithmetic intensity (small batch, decoder-only inference) tend to be memory-bound on any hardware. Pair these specs with the Silicon Zoo to find your bottleneck.
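The roofline test above can be sketched in a few lines of plain Python. This is not the `mlsysim` API; the hardware numbers (an A100-like 312 TFLOP/s fp16 peak and 2.0 TB/s of HBM bandwidth) and the 7B-parameter decode example are illustrative assumptions:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs executed per byte of memory traffic."""
    return flops / bytes_moved

def bound_by(intensity, peak_flops, peak_bw):
    """Roofline test: compare intensity to the hardware ridge point."""
    ridge = peak_flops / peak_bw  # FLOPs/byte where compute and memory balance
    return "compute" if intensity >= ridge else "memory"

# Batch-1 decoder inference: ~2 FLOPs per parameter, and every fp16
# weight (2 bytes) must be loaded once, so intensity is ~1 FLOP/byte.
params = 7e9  # hypothetical 7B-parameter model
intensity = arithmetic_intensity(2 * params, 2 * params)
print(bound_by(intensity, peak_flops=312e12, peak_bw=2.0e12))  # memory
```

With a ridge point of 156 FLOPs/byte, batch-1 decoding at ~1 FLOP/byte is deep in the memory-bound regime, which is exactly the pattern the paragraph above describes.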
Workload Types
MLSys·im supports five workload architectures, each with distinct scaling characteristics:
| Type | Architecture | Key Characteristic | Example Models |
|---|---|---|---|
| Transformer | Dense attention | 2P FLOPs/token; KV-cache grows with sequence length | GPT-4, LLaMA, BERT |
| CNN | Convolutional | Fixed FLOPs per image; no sequence dependence | ResNet-50, EfficientNet |
| Sparse (MoE) | Mixture-of-Experts | Active params ≪ total params; All-to-All dispatch | Mixtral, GShard |
| SSM (Mamba) | State-space model | O(1) state cache; linear-time sequence processing | Mamba, S4 |
| Diffusion | Iterative denoising | T × FLOPs/step; latency scales with denoising steps | Stable Diffusion, DALL-E |
For Sparse/MoE models, parameters refers to total parameters (used for memory sizing), while active_parameters refers to the subset active per token (used for FLOP counting). This distinction is critical: a 340B MoE model may use only 47B parameters per forward pass.
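The total-vs-active distinction can be made concrete with a short sketch (plain Python, not the `mlsysim` API; the 340B/47B split reuses the example above, and "GB" here means 10^9 bytes):

```python
BYTES_FP16 = 2  # bytes per parameter at fp16

def weight_memory_gb(total_params, bytes_per_param=BYTES_FP16):
    """Memory sizing uses *total* parameters: every expert must be resident."""
    return total_params * bytes_per_param / 1e9

def flops_per_token(active_params):
    """FLOP counting uses *active* parameters (~2 FLOPs per param per token)."""
    return 2 * active_params

# Hypothetical MoE matching the 340B-total / 47B-active example above.
total, active = 340e9, 47e9
print(weight_memory_gb(total))  # 680.0 GB of fp16 weights to hold
print(flops_per_token(active))  # 9.4e10 FLOPs per generated token
```

Note the asymmetry: memory demand is set by the 340B total, while per-token compute is set by the 47B that actually fire.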
Vetted Model Registry
Large Language Models (LLMs)
| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
Vision Models (CNNs)
| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
TinyML Models
| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
How to Read the Model Zoo
Parameters vs. Inference FLOPs
These two numbers tell very different stories:
- Parameters determine memory footprint: at fp16, each parameter is 2 bytes. A 70B-parameter model needs ~140 GB just for weights — more than a single 80 GB A100 can hold.
- Inference FLOPs determine compute time: the total floating-point operations for one forward pass. Higher FLOPs means more work for the GPU’s compute cores.
The ratio of FLOPs to memory accessed (the arithmetic intensity) determines whether a workload is compute-bound or memory-bound. At small batch sizes, most models are memory-bound because the weights must be loaded regardless of batch size.
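The batch-size effect can be sketched directly. In weight-dominated decoding, the same fp16 weights (2 bytes/param, ~2 FLOPs/param per sequence) serve every sequence in the batch, so arithmetic intensity grows roughly linearly with batch size. This is a simplification that ignores activation and KV-cache traffic, and the A100-like ridge point (312 TFLOP/s over 2.0 TB/s) is an assumed figure:

```python
def decode_intensity(batch_size):
    """Weight-dominated decode: each loaded parameter (2 B at fp16) does
    ~2 FLOPs per sequence in the batch, so intensity ~= batch size."""
    return float(batch_size)  # FLOPs/byte

RIDGE = 312e12 / 2.0e12  # assumed A100-like ridge point: 156 FLOPs/byte

for b in (1, 32, 256):
    regime = "compute" if decode_intensity(b) >= RIDGE else "memory"
    print(f"batch {b}: {regime}-bound")
```

Under these assumptions, decoding stays memory-bound until the batch size crosses the ridge point (~156 here), which is why small-batch inference cannot saturate the compute cores.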
Which Model for Which Hardware?
As a rough guide:
- TinyML MCUs (KB-scale memory) — only tiny models fit (MobileNetV2, TinyBERT)
- Edge devices (Jetson, 8-32 GB) — small vision and language models at int8
- Single Cloud GPU (40-80 GB) — models up to ~30B parameters at fp16
- Multi-GPU clusters — 70B+ models require distributed serving or training
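The sizing guide above reduces to a back-of-the-envelope fit check. This is a rough sketch, not an `mlsysim` function: the ~20% headroom factor for activations and KV cache is an assumed fudge, and "GB" means 10^9 bytes:

```python
def fits(params, device_mem_gb, bytes_per_param, overhead=1.2):
    """Rough check: weight memory plus ~20% headroom (assumed) for
    activations and KV cache must fit in device memory."""
    need_gb = params * bytes_per_param * overhead / 1e9
    return need_gb <= device_mem_gb

print(fits(30e9, 80, 2))  # ~30B at fp16 on an 80 GB GPU: fits
print(fits(70e9, 80, 2))  # 70B at fp16: does not fit on one GPU
```

The 70B case failing on a single 80 GB device is the reason the last bullet calls for distributed serving.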
Textbook Connection
The Model Training and Model Serving chapters use these workload profiles to demonstrate roofline analysis and serving cost estimation. The Model Compression chapter shows how quantization reduces both parameter memory and inference FLOPs.
Add Your Own Model
Defining custom workloads is straightforward: you can extend the registry or define a workload object directly in your code. Learn more in the Contributing Guide and the Models API Reference.
Note: For dynamic memory footprint and KV-cache calculations, see the API Reference.