The MLSys Zoo
A Single Source of Truth for ML Systems Specifications
The MLSys Zoo is a centralized, vetted registry of specifications used throughout the mlsysim platform. Every entry is strictly typed with pint.Quantity for dimensional correctness, provenance-tracked, and validated against official sources.
A persistent problem in ML systems literature is spec staleness: people cite outdated or incorrect hardware numbers. The MLSys Zoo addresses this by serving as the authoritative source for the mlsysim ecosystem. When a spec changes (e.g., NVIDIA publishes an updated datasheet), it is updated once here and propagates automatically to every solver and tutorial.
The Four Catalogs
| Zoo | Description | Key Data | Python Access |
|---|---|---|---|
| ⚡ Silicon Zoo | AI accelerators from microcontrollers to datacenter GPUs | Peak FLOPs, Memory BW, Capacity, TDP | mlsysim.Hardware.Cloud.A100 |
| 🧠 Model Zoo | Reference ML workloads: transformers, CNNs, TinyML | Parameters, Inference FLOPs, Layers | mlsysim.Models.ResNet50 |
| 🕸️ Fleet Zoo | Multi-node cluster configurations and deployment tiers | Node type, Count, Network Fabric | mlsysim.Systems.Clusters.Frontier_8K |
| 🌍 Infrastructure Zoo | Regional electricity grids and datacenter profiles | Carbon Intensity, PUE | mlsysim.Infra.Grids.Quebec |
System Composition Hierarchy
ML systems are structurally composed of smaller parts. The mlsysim registry reflects this physical reality. Before a workload can be evaluated, the structural components are combined into a coherent system.
Here is how the components in the Zoo relate to each other:
- HardwareNode (Silicon): The fundamental unit of compute (e.g., an H100 GPU or a DGX Spark GB10 superchip). It provides FLOPs and Memory Bandwidth.
- Node: A single server chassis. It contains one or more HardwareNodes connected by a high-speed intra-node bus (like NVLink).
- NetworkFabric: The inter-node networking (e.g., InfiniBand NDR or 100GbE) that allows servers to communicate.
- Fleet (Cluster): A collection of Nodes connected by a NetworkFabric. This is the top-level entity used for distributed training and cluster reliability models.
- Datacenter & GridProfile (Infra/Regions): The physical facility and regional power grid that hosts the Fleet. It dictates the Power Usage Effectiveness (PUE) and the carbon intensity of the electricity consumed.
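The hierarchy above can be sketched with plain dataclasses. This is a minimal illustration and not the mlsysim implementation: the field names, the `hourly_emissions_kg` helper, and every numeric value are assumptions chosen only to show how a quantity at the silicon level rolls up through the node, the fleet, and finally the facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareNode:
    """One accelerator (hypothetical fields, plain floats instead of pint)."""
    name: str
    peak_flops: float        # FLOP/s
    memory_bandwidth: float  # bytes/s
    tdp_watts: float


@dataclass(frozen=True)
class Node:
    """A server chassis holding several identical accelerators."""
    accelerator: HardwareNode
    accelerators_per_node: int
    intra_node_bus: str


@dataclass(frozen=True)
class NetworkFabric:
    """Inter-node network."""
    name: str
    bandwidth_gbps: float


@dataclass(frozen=True)
class Fleet:
    """A cluster: identical nodes joined by one fabric."""
    node: Node
    node_count: int
    fabric: NetworkFabric

    def accelerator_count(self) -> int:
        return self.node.accelerators_per_node * self.node_count

    def peak_flops(self) -> float:
        # Aggregate peak, ignoring communication overhead.
        return self.node.accelerator.peak_flops * self.accelerator_count()

    def accelerator_power_watts(self) -> float:
        # Accelerator TDP only; CPUs, NICs, and fans are left out.
        return self.node.accelerator.tdp_watts * self.accelerator_count()


@dataclass(frozen=True)
class GridProfile:
    """Facility overhead and regional grid carbon intensity."""
    region: str
    pue: float
    carbon_intensity_g_per_kwh: float


def hourly_emissions_kg(fleet: Fleet, grid: GridProfile) -> float:
    """Emissions for one hour at full accelerator TDP."""
    it_energy_kwh = fleet.accelerator_power_watts() / 1000.0  # 1 hour
    facility_energy_kwh = it_energy_kwh * grid.pue
    return facility_energy_kwh * grid.carbon_intensity_g_per_kwh / 1000.0


# Illustrative placeholder values, not vetted Zoo entries.
gpu = HardwareNode("example-gpu", peak_flops=312e12,
                   memory_bandwidth=2e12, tdp_watts=400.0)
server = Node(accelerator=gpu, accelerators_per_node=8, intra_node_bus="NVLink")
fabric = NetworkFabric("example-fabric", bandwidth_gbps=400.0)
fleet = Fleet(node=server, node_count=4, fabric=fabric)
grid = GridProfile("example-region", pue=1.2, carbon_intensity_g_per_kwh=100.0)

emissions = hourly_emissions_kg(fleet, grid)
print(f"{fleet.accelerator_count()} accelerators, "
      f"{fleet.peak_flops():.3e} FLOP/s peak, "
      f"{emissions:.2f} kg CO2 per hour")
```

Keeping each level immutable (`frozen=True`) mirrors the idea of a registry: a fleet is described by referencing existing entries, not by copying and editing their numbers.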
Accessing Zoo Entries in Code
All Zoo entries follow the same registry pattern:
import mlsysim
# Hardware
a100 = mlsysim.Hardware.Cloud.A100
jetson = mlsysim.Hardware.Edge.JetsonAGX
# Models
resnet = mlsysim.Models.ResNet50
llama = mlsysim.Models.Language.Llama3_70B
# Infrastructure
quebec = mlsysim.Infra.Grids.Quebec
us_average = mlsysim.Infra.Grids.US_Average
# Systems (Fleets)
cluster = mlsysim.Systems.Clusters.Frontier_8K
All quantities (FLOPs, bandwidth, capacity) are pint.Quantity objects. You can convert between units, and MLSys·im will catch dimensional errors at runtime:
hw = mlsysim.Hardware.Cloud.A100
hw.compute.peak_flops.to("TFLOPs/s") # → 312.0 TFLOPs/s
hw.memory.bandwidth.to("TB/s") # → 2.0 TB/s
hw.memory.bandwidth.to("FLOP/s") # → pint.DimensionalityError ✓