The Model Zoo

Reference Workloads for Systems Modeling

The Model Zoo defines the Computational Demand placed on the hardware. Every workload is pulled from the mlsysim.Models registry and characterized by its FLOPs, parameter count, and architecture type—independent of any specific hardware.

Tip: Arithmetic Intensity = FLOPs ÷ Bytes

The key number for roofline analysis is each model’s arithmetic intensity—how many floating-point operations it performs per byte of memory loaded. Models with low arithmetic intensity (small batch, decoder-only inference) tend to be memory-bound on any hardware. Pair these specs with the Silicon Zoo to find your bottleneck.
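This comparison can be sketched in a few lines. The hardware numbers below are illustrative (roughly A100-class) rather than taken from the Silicon Zoo; the 2P-FLOPs-against-2P-bytes estimate for a batch-1 decode step follows the conventions used elsewhere on this page.

```python
# Roofline check: a workload is memory-bound when its arithmetic intensity
# falls below the hardware ridge point (peak FLOP/s ÷ memory bandwidth).
# Hardware numbers are illustrative, roughly A100-class.
PEAK_FLOPS = 312e12       # 312 TFLOP/s (fp16 tensor cores)
MEM_BANDWIDTH = 2.0e12    # 2.0 TB/s HBM bandwidth

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

def bottleneck(intensity: float) -> str:
    ridge = PEAK_FLOPS / MEM_BANDWIDTH   # ≈ 156 FLOPs/byte
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Batch-1 decode step on a 70B model at fp16:
# ~2P FLOPs against ~2P bytes of weights -> intensity ≈ 1 FLOP/byte.
ai = arithmetic_intensity(140e9, 140e9)
print(bottleneck(ai))   # memory-bound
```

An intensity of 1 FLOP/byte sits far below the ~156 FLOPs/byte ridge point, which is why small-batch decoding is memory-bound on essentially any accelerator.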

Workload Types

MLSys·im supports five workload architectures, each with distinct scaling characteristics:

| Type | Architecture | Key Characteristic | Example Models |
|---|---|---|---|
| Transformer | Dense attention | 2P FLOPs/token; KV-cache grows with sequence length | GPT-4, LLaMA, BERT |
| CNN | Convolutional | Fixed FLOPs per image; no sequence dependence | ResNet-50, EfficientNet |
| Sparse (MoE) | Mixture-of-Experts | Active params ≪ total params; All-to-All dispatch | Mixtral, GShard |
| SSM (Mamba) | State-space model | O(1) state cache; linear-time sequence processing | Mamba, S4 |
| Diffusion | Iterative denoising | T × FLOPs/step; latency scales with denoising steps | Stable Diffusion, DALL-E |
Note: Active vs. Total Parameters

For Sparse/MoE models, parameters refers to total parameters (used for memory sizing), while active_parameters refers to the subset active per token (used for FLOP counting). This distinction is critical: a 340B MoE model may use only 47B parameters per forward pass.
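The note above can be made concrete with a toy calculation. The 340B/47B split is the example from the note; the 2-bytes-per-parameter (fp16) and 2-FLOPs-per-active-parameter rules follow the conventions used elsewhere on this page.

```python
def moe_footprint_and_flops(total_params: float, active_params: float,
                            bytes_per_param: int = 2):
    """Memory is sized by *total* parameters; per-token FLOPs by *active* ones."""
    weight_bytes = total_params * bytes_per_param   # every expert stays resident
    flops_per_token = 2 * active_params             # ~2 FLOPs per active parameter
    return weight_bytes, flops_per_token

mem, flops = moe_footprint_and_flops(total_params=340e9, active_params=47e9)
print(f"{mem / 1e9:.0f} GB weights, {flops / 1e9:.0f} GFLOPs/token")
# -> 680 GB weights, 94 GFLOPs/token
```

The asymmetry is the whole point of MoE: the model pays dense-style memory cost but only a fraction of dense compute per token.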


Vetted Model Registry

Large Language Models (LLMs)

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
| BERT-Base | Transformer | 110.0 M | 22.0 GFLOPs | 12 |
| BERT-Large | Transformer | 340.0 M | 72.0 GFLOPs | 24 |
| GPT-2 (1.5B) | Transformer | 1.5 B | 3.0 GFLOPs | 48 |
| GPT-3 (175B) | Transformer | 175.0 B | 350.0 GFLOPs | 96 |
| GPT-4 | Transformer | 1.8 T | 3.5 TFLOPs | 120 |
| Llama-2-70B | Transformer | 70.0 B | 140.0 GFLOPs | 80 |
| Llama-3.1-70B | Transformer | 70.6 B | 141.2 GFLOPs | 80 |
| Llama-3.1-8B | Transformer | 8.0 B | 16.1 GFLOPs | 32 |

Vision Models (CNNs)

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
| AlexNet | CNN | 60.0 M | 1.5 GFLOPs | 8 |
| MobileNetV2 | CNN | 3.5 M | 300.0 MFLOPs | 54 |
| ResNet-50 | CNN | 25.6 M | 4.1 GFLOPs | 50 |
| YOLOv8-Nano | CNN | 3.2 M | 8.7 GFLOPs | 225 |

TinyML Models

| Model | Architecture | Parameters | Inference FLOPs | Layers |
|---|---|---|---|---|
| Anomaly Detector | MLP | 270.0 K | 540.0 KFLOPs | — |
| DS-CNN (KWS) | CNN | 200.0 K | 20.0 MFLOPs | — |
| Wake Vision (Doorbell) | CNN | 250.0 K | 25.0 MFLOPs | — |

How to Read the Model Zoo

Parameters vs. Inference FLOPs

These two numbers tell very different stories:

  • Parameters determine memory footprint: at fp16, each parameter is 2 bytes. A 70B-parameter model needs ~140 GB just for weights, more than a single 80 GB A100 can hold.
  • Inference FLOPs determine compute time: the total floating-point operations for one forward pass. Higher FLOPs means more work for the GPU’s compute cores.

The ratio of FLOPs to memory accessed (the arithmetic intensity) determines whether a workload is compute-bound or memory-bound. At small batch sizes, most models are memory-bound because the weights must be loaded regardless of batch size.
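To see why batch size matters, here is a minimal sketch of the weights-only arithmetic for one decode step (2 FLOPs per parameter per token, fp16 weights streamed once per step; attention and KV-cache traffic are ignored for simplicity):

```python
def decode_step_intensity(params: float, batch: int,
                          bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of one decode step, counting only weight traffic."""
    flops = 2 * params * batch              # every sequence reuses the same weights
    bytes_moved = params * bytes_per_param  # weights are read once per step
    return flops / bytes_moved

for b in (1, 8, 64):
    print(b, decode_step_intensity(70e9, b))  # intensity grows linearly with batch
```

Under these assumptions the intensity equals the batch size (for fp16, 2 FLOPs/param against 2 bytes/param), so batching is the main lever for moving a decoder from memory-bound toward compute-bound.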

Which Model for Which Hardware?

As a rough guide:

  • TinyML MCUs (KB-scale memory) — only TinyML models fit (DS-CNN, Anomaly Detector)
  • Edge devices (Jetson, 8-32 GB) — small Vision and Language models at int8
  • Single Cloud GPU (40-80 GB) — models up to ~30B parameters at fp16
  • Multi-GPU clusters — 70B+ models require distributed serving or training
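The guide above amounts to a weights-only fit check. A sketch of that check follows; the byte-per-parameter table and the 80% headroom factor are illustrative assumptions, and real deployments also need room for activations and the KV-cache.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def fits(params: float, device_mem_gb: float, dtype: str = "fp16",
         headroom: float = 0.8) -> bool:
    """Weights-only check: do the parameters fit in ~80% of device memory?"""
    weight_gb = params * BYTES_PER_PARAM[dtype] / 1e9
    return weight_gb <= headroom * device_mem_gb

print(fits(70e9, 80))          # False: 140 GB of fp16 weights exceed one 80 GB GPU
print(fits(70e9, 80, "int4"))  # True: ~35 GB after 4-bit quantization
```

This is why the Model Compression chapter matters for deployment: dropping from fp16 to int4 moves a 70B model from multi-GPU territory onto a single card.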

Textbook Connection

The Model Training and Model Serving chapters use these workload profiles to demonstrate roofline analysis and serving cost estimation. The Model Compression chapter shows how quantization reduces both parameter memory and inference FLOPs.


Note: Add your own model

Defining custom workloads is straightforward. You can extend the registry or define a TransformerWorkload (or CNNWorkload, SSMWorkload, DiffusionWorkload) object directly in your code. Learn more in the Contributing Guide and the Models API Reference.
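As a hypothetical sketch of what a custom definition might look like, the stand-in class below mirrors the registry's columns. The field names (name, parameters, num_layers) and the derived FLOPs property are assumptions, not the library's actual API — consult the Models API Reference for the real TransformerWorkload constructor.

```python
from dataclasses import dataclass

# Stand-in for mlsysim's TransformerWorkload; field names are assumptions,
# not the library's actual constructor -- see the Models API Reference.
@dataclass
class TransformerWorkload:
    name: str
    parameters: float   # total parameter count
    num_layers: int

    @property
    def inference_flops_per_token(self) -> float:
        # The 2P FLOPs/token rule for dense transformers from the table above.
        return 2 * self.parameters

my_model = TransformerWorkload(name="my-7b", parameters=7e9, num_layers=32)
print(my_model.inference_flops_per_token)   # 1.4e10 (14 GFLOPs/token)
```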

Tip: CLI Access

Browse the Model Zoo from your terminal: mlsysim zoo models

For dynamic memory footprint and KV-cache calculations, see the API Reference.
