The Model Zoo

Reference Workloads for Systems Modeling

The Model Zoo defines the Computational Demand placed on the hardware. Every workload is pulled from the mlsysim.Models registry and characterized by its FLOPs, parameter count, and architecture type—independent of any specific hardware.

Tip: Arithmetic Intensity = FLOPs ÷ Bytes

The key number for roofline analysis is each model’s arithmetic intensity—how many floating-point operations it performs per byte of memory loaded. Models with low arithmetic intensity (small batch, decoder-only inference) tend to be memory-bound on any hardware. Pair these specs with the Silicon Zoo to find your bottleneck.
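The bottleneck test described above can be sketched in a few lines. The hardware numbers below are illustrative A100-like specs (312 TFLOP/s fp16, 2 TB/s HBM), not values from the Silicon Zoo:

```python
# Roofline check: is a workload compute- or memory-bound on a given chip?
# Hardware specs here are illustrative, not pulled from the mlsysim registry.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs executed per byte of memory traffic."""
    return flops / bytes_moved

def bottleneck(intensity, peak_flops, peak_bandwidth):
    """Compare workload intensity against the hardware ops:byte ratio."""
    ridge_point = peak_flops / peak_bandwidth  # ~156 FLOPs/byte for these specs
    return "compute-bound" if intensity >= ridge_point else "memory-bound"

# Batch-1 decoder inference: each weight byte read performs ~2 FLOPs.
intensity = arithmetic_intensity(flops=2.0, bytes_moved=1.0)
print(bottleneck(intensity, peak_flops=312e12, peak_bandwidth=2.0e12))
# -> "memory-bound"
```

Any workload whose intensity falls below the hardware's ops:byte ratio (the roofline "ridge point") is limited by bandwidth, not compute.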

Workload Types

MLSys·im supports five workload architectures, each with distinct scaling characteristics:

| Type | Architecture | Key Characteristic | Example Models |
|------|--------------|--------------------|----------------|
| Transformer | Dense attention | 2P FLOPs/token; KV-cache grows with sequence length | GPT-4, LLaMA, BERT |
| CNN | Convolutional | Fixed FLOPs per image; no sequence dependence | ResNet-50, EfficientNet |
| Sparse (MoE) | Mixture-of-Experts | Active params ≪ total params; All-to-All dispatch | Mixtral, GShard |
| SSM (Mamba) | State-space model | O(1) state cache; linear-time sequence processing | Mamba, S4 |
| Diffusion | Iterative denoising | T × FLOPs/step; latency scales with denoising steps | Stable Diffusion, DALL-E |
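The transformer row's two rules of thumb (roughly 2P FLOPs per generated token, and a KV cache that grows linearly with sequence length) can be sketched directly. The model shape below is LLaMA-7B-like but purely illustrative:

```python
# Rule-of-thumb transformer costs from the table above.
# Model shape (32 layers, d_model = 4096) is illustrative only.

def decode_flops_per_token(params):
    # Dense transformer: ~2 FLOPs per parameter per generated token
    return 2 * params

def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_elem=2):
    # One K and one V tensor per layer; fp16 (2 bytes) by default
    return 2 * n_layers * d_model * seq_len * bytes_per_elem

print(f"{decode_flops_per_token(7e9):.1e} FLOPs/token")
print(f"{kv_cache_bytes(32, 4096, 2048) / 1e9:.2f} GB KV cache at 2k tokens")
```

Note how the KV cache alone approaches a gigabyte at a 2k-token context, which is why long-context serving is dominated by cache traffic rather than weight traffic.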
Note: Active vs. Total Parameters

For Sparse/MoE models, parameters refers to total parameters (used for memory sizing), while active_parameters refers to the subset active per token (used for FLOP counting). This distinction is critical: a 340B MoE model may use only 47B parameters per forward pass.
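The split matters in practice: memory is sized from total parameters, FLOPs from active parameters. A minimal sketch using the 340B-total / 47B-active figures from the note:

```python
# MoE sizing: memory follows total params, compute follows active params.
# The 340e9 / 47e9 split mirrors the example in the note above.

def weight_memory_gb(total_params, bytes_per_param=2):
    # fp16 weights: 2 bytes per parameter
    return total_params * bytes_per_param / 1e9

def flops_per_token(active_params):
    # Dense-transformer 2P rule applied only to the active subset
    return 2 * active_params

print(f"{weight_memory_gb(340e9):.0f} GB of weights")   # -> 680 GB
print(f"{flops_per_token(47e9):.1e} FLOPs per token")
```

The model needs multi-GPU memory to hold its weights, yet each token costs only as much compute as a 47B dense model.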


Vetted Model Registry

Large Language Models (LLMs)

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|-------|--------------|------------|-----------------|--------|

Vision Models (CNNs)

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|-------|--------------|------------|-----------------|--------|

TinyML Models

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|-------|--------------|------------|-----------------|--------|

How to Read the Model Zoo

Parameters vs. Inference FLOPs

These two numbers tell very different stories:

  • Parameters determine memory footprint: at fp16, each parameter is 2 bytes. A 70B-parameter model needs ~140 GB just for weights — more than any single A100 can hold.
  • Inference FLOPs determine compute time: the total floating-point operations for one forward pass. A higher FLOP count means more work for the GPU’s compute cores.

The ratio of FLOPs to memory accessed (the arithmetic intensity) determines whether a workload is compute-bound or memory-bound. At small batch sizes, most models are memory-bound because the weights must be loaded regardless of batch size.
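This can be made concrete with two latency lower bounds for a 70B fp16 model: the time to stream the weights from memory once, and the time to execute the FLOPs. The hardware numbers are illustrative A100-like specs:

```python
# Two per-token latency lower bounds for batch-1 decoding of a 70B fp16 model.
# Hardware specs (312 TFLOP/s, 2 TB/s) are illustrative, A100-like.

params = 70e9
weight_bytes = params * 2          # fp16: 2 bytes per parameter
flops = 2 * params                 # ~2 FLOPs per parameter per token

t_memory = weight_bytes / 2.0e12   # time to stream the weights once
t_compute = flops / 312e12         # time to execute the FLOPs

print(f"memory bound:  {t_memory * 1e3:.1f} ms/token")   # ~70 ms
print(f"compute bound: {t_compute * 1e3:.2f} ms/token")  # ~0.45 ms
```

The memory bound exceeds the compute bound by more than two orders of magnitude, which is exactly why small-batch decoding is memory-bound: batching amortizes the weight traffic across tokens and closes the gap.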

Which Model for Which Hardware?

As a rough guide:

  • TinyML MCUs (KB-scale memory) — only Tiny models fit (MobileNetV2, TinyBERT)
  • Edge devices (Jetson, 8-32 GB) — small Vision and Language models at int8
  • Single Cloud GPU (40-80 GB) — models up to ~30B parameters at fp16
  • Multi-GPU clusters — 70B+ models require distributed serving or training
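The guide above amounts to a simple fp16 weight-sizing check. A minimal sketch, using the ballpark capacities from the list rather than exact device specs:

```python
# Rough hardware-tier picker based on fp16 weight footprint alone.
# Thresholds are the ballpark figures from the guide above, not device specs.
# Ignores activations, KV cache, and quantization (int8 halves these sizes).

def tier(params):
    gb = params * 2 / 1e9          # fp16 weight memory in GB
    if gb <= 32:
        return "edge or single GPU"
    if gb <= 80:
        return "single cloud GPU"
    return "multi-GPU cluster"

print(tier(7e9))    # 14 GB  -> edge or single GPU
print(tier(70e9))   # 140 GB -> multi-GPU cluster
```

A real sizing pass would also budget for the KV cache and activations, which is why the text's thresholds are deliberately conservative.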

Textbook Connection

The Model Training and Model Serving chapters use these workload profiles to demonstrate roofline analysis and serving cost estimation. The Model Compression chapter shows how quantization reduces both parameter memory and inference FLOPs.


Note: Add your own model

Defining custom workloads is straightforward. You can extend the registry or define a (or ) object directly in your code. Learn more in the Contributing Guide and the Models API Reference.
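As an illustration of the fields a custom workload needs (the class name below is hypothetical; the real class names are in the Models API Reference):

```python
# Illustrative stand-in only: CustomWorkload is NOT the mlsysim API.
# It shows the fields the Model Zoo tables characterize for each workload.
from dataclasses import dataclass

@dataclass
class CustomWorkload:           # hypothetical name for illustration
    name: str
    architecture: str           # e.g. "transformer", "cnn", "moe"
    parameters: float           # total params (memory sizing)
    inference_flops: float      # FLOPs for one forward pass
    layers: int

wl = CustomWorkload("my-llm", "transformer", 1.3e9, 2.6e9, 24)
print(wl.name, f"{wl.parameters * 2 / 1e9:.1f} GB at fp16")
```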

Note: For dynamic memory footprint and KV-cache calculations, see the API Reference.
