MLSysIM Data Model
Eight zoos (typed registries) plus support layers. Book LEGO cells and tutorials should prefer zoos + mlsysim.physics.* + explicit operands. Measurement units live in core/units.py, physical constants in physics/constants.py, and domain values in registries; the former core/constants.py shim is deleted (no-backward-compat policy).
This document is the registry-level view. The runtime view — how workloads, hardware, infrastructure, and systems feed the solver layer — is the 5-layer model in architecture.qmd: the zoos below are the registries that populate Layers A–D of that stack.
Zoos
| Zoo | Registry | Role |
|---|---|---|
| Hardware | `Hardware.Cloud.*`, `Hardware.Edge.*`, … | Chip/board/appliance specs (datasheet truth). Canonical paths only; no bare `Hardware.H100`. |
| Models | `Models.*` | Workloads and architectures (parameters, layers, FLOPs). |
| Datasets | `Datasets.*` | Data zoo: ImageNet, MNIST, CIFAR, etc. |
| Platforms | `Platforms.*` | Abstract deployment envelopes (RAM, storage, latency ranges). Replaces `Systems.Tiers`. |
| Infrastructure | `Infrastructure.Grids.*`, `Infrastructure.Datacenters.*`, `Infrastructure.Pricing.*`, `Infrastructure.Capacity.*` | Site/energy/economics layer: utility grid, facility PUE, pricing, and capacity facts. Not GPU fleets or network fabrics. |
| Systems | `Systems.Nodes.*`, `Systems.Racks.*`, `Systems.Fabrics.*`, `Systems.Clusters.*`, `Systems.Pods.*`, `Systems.Storage.*` | Composed physical systems and topology. Fleets live in `Systems.Clusters` (type `Fleet`); rack-level aggregates live in `Systems.Racks`. |
| Ops | `Ops.Monitoring.*`, `Ops.TrainingRunOverheads.*` | Operational policies, thresholds, and goodput-loss profiles. |
| Scenarios | `Scenarios.*` | Executable workload + system + constraint bundles. |
Support (not zoos)
- `mlsysim.core.units`: pint units, byte/bit widths, precision map.
- `mlsysim.physics.*`: physical constants and formulas.
- `Literature.*`: cited appendix scalars (MFU bands, Chinchilla, communication, batch-size anchors).
- `Scenarios.*`: executable workload + system + constraint bundles, suitable for `Scenario.evaluate()`.
- `ReferenceStats.*`: non-executable sourced anchors for real-world scenario and case-study statistics.
- `Systems.Reliability` / `Orchestration`: MTTF, recovery, scheduling assumptions.
- `Ops.Monitoring` / `TrainingRunOverheads`: PSI, KS, drift thresholds, training goodput-loss profiles.
- `mlsysim.engine.calibration`: solver/engine default kwargs (not appendix tables).
- `Infrastructure.Pricing`: cloud, storage, labeling, fleet economics (`PricePoint.rate`).
- Regional carbon / PUE / fleet / fabrics: `Infrastructure.Grids`, `FacilityCooling`, `Systems.Clusters`, `Systems.Fabrics`.
Validation invariants
MLSysIM is used to generate textbook calculations, so registry and CLI data are validated before they reach solver equations.
- Explicit units are required for physical quantities. A capacity must be written as `80 GB` or `80 GiB`, not `80`. A model size must be bytes, a power value must be watts, and a latency value must be time.
- Compute work carries its own dimension. FLOPs (and integer-op rates such as TOPS) live on a dedicated `[flop]` pint dimension (2026-06-06), so `1 TFLOP/s` can never silently add to or convert into `900 GB/s`; pint itself raises `DimensionalityError`. Bytes, counts, parameters, and dollars remain dimensionless aliases, and MLSysIM’s unit-family checks keep those semantically distinct at every schema boundary.
- Precision names are closed vocabulary. Use the precision names in `core.units.PRECISION_MAP`; unsupported values fail instead of silently using FP16 storage.
- Distributed topology must divide exactly. Tensor, pipeline, and expert parallel groups must divide total accelerators without flooring. CLI fleet plans must likewise specify topology that divides cleanly.
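
The invariants above can be illustrated with a minimal, standard-library-only sketch. It is not the MLSysIM implementation: `parse_capacity`, `bytes_per_element`, `data_parallel_degree`, the unit table, and the precision table are all hypothetical stand-ins, and the sketch assumes the tensor, pipeline, and expert degrees multiply into a single group size.

```python
import re

# Hypothetical unit table for this sketch; the real definitions live in core/units.py.
BYTE_UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
}

# Stand-in for core.units.PRECISION_MAP (bytes per element); entries are assumed.
PRECISION_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def parse_capacity(text: str) -> int:
    """Return a capacity in bytes; reject bare numbers such as '80'."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"capacity needs an explicit unit: {text!r}")
    value, unit = match.groups()
    if unit not in BYTE_UNITS:
        raise ValueError(f"unknown capacity unit: {unit!r}")
    return int(float(value) * BYTE_UNITS[unit])


def bytes_per_element(precision: str) -> int:
    """Look up a precision name; fail instead of falling back to FP16."""
    if precision not in PRECISION_BYTES:
        raise ValueError(f"unsupported precision: {precision!r}")
    return PRECISION_BYTES[precision]


def data_parallel_degree(total: int, tp: int, pp: int, ep: int = 1) -> int:
    """Return the data-parallel degree, refusing any split that would floor."""
    group = tp * pp * ep  # assumption: the three degrees multiply
    if group <= 0 or total % group != 0:
        raise ValueError(f"{total} accelerators do not divide into groups of {group}")
    return total // group
```

In this sketch `parse_capacity("80 GB")` and `parse_capacity("80 GiB")` return different byte counts, while `parse_capacity("80")` raises, which is the behaviour the first invariant asks for.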
Relationships
```mermaid
flowchart TB
  subgraph zoos [Zoos]
    Hardware
    Models
    Datasets
    Platforms
    Infrastructure
    Systems
    Ops
    Scenarios
  end
  subgraph support [Support]
    units[core/units.py]
    literature[Literature.*]
    scenarios[Scenarios.*]
    referenceStats[ReferenceStats.*]
    calibration[engine/calibration.py]
    physics[physics.*]
  end
  Hardware --> Systems
  Platforms --> Systems
  Infrastructure --> Systems
  Models --> physics
  Datasets --> physics
  units --> physics
  literature --> physics
  Ops --> physics
  Scenarios --> physics
  calibration --> physics
  Systems --> physics
  Models --> scenarios
  Hardware --> scenarios
  Systems --> scenarios
  referenceStats --> scenarios
```
- Fleet ≠ datacenter: `Systems.Clusters.*` (Fleet) references optional `Infrastructure.Datacenters.*` / grid for carbon and PUE.
- NVL72 is `Hardware.Cloud.GB200_NVL72`, not an Infrastructure rack entry.
- Networks/fabrics: interconnect specs on Hardware; topology instances under `Systems.Fabrics`.
- Scenario ≠ model or hardware: a scenario composes existing model and system facts with local constraints. It does not redefine GPT-4, H100, or a fleet.
- Reference statistics are not scenarios: `ReferenceStats.MobilePower.*` and `ReferenceStats.Workloads.*` are sourced anchors for book calculations, not runnable bundles.
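
The "fleet ≠ datacenter" relationship can be sketched with plain dataclasses. The class names, fields, and numbers below are illustrative assumptions, not the MLSysIM schema; the point is only that a fleet holds an optional reference to a site entry instead of copying its PUE. Falling back to a PUE of 1.0 when no site is attached is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datacenter:
    """Stand-in for an Infrastructure.Datacenters.* entry (hypothetical fields)."""
    name: str
    pue: float


@dataclass(frozen=True)
class Fleet:
    """Stand-in for a Systems.Clusters.* entry (hypothetical fields)."""
    name: str
    accelerators: int
    watts_per_accelerator: float
    datacenter: Optional[Datacenter] = None  # referenced, never redefined here


def facility_power_watts(fleet: Fleet) -> float:
    """IT power scaled by the referenced site's PUE (1.0 if no site is attached)."""
    it_power = fleet.accelerators * fleet.watts_per_accelerator
    pue = fleet.datacenter.pue if fleet.datacenter is not None else 1.0
    return it_power * pue


# Illustrative values only; they are not registry data.
site = Datacenter(name="example-site", pue=1.25)
bare_fleet = Fleet(name="example-fleet", accelerators=1024, watts_per_accelerator=700.0)
sited_fleet = Fleet(name="example-fleet", accelerators=1024,
                    watts_per_accelerator=700.0, datacenter=site)
```

Because the facility fact lives on the site object, swapping the datacenter changes the facility power without touching the fleet's own accelerator count or per-device power.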
Ownership Rule
When adding a number, classify the semantic object before choosing a namespace:
| Question | Home |
|---|---|
| Is this a datasheet fact about a chip, board, appliance, NIC, or storage device? | Hardware.* |
| Is this a composed physical setup such as a node, rack, cluster, fabric, or storage path? | Systems.* |
| Is this a grid, datacenter, price, capacity, or facility-envelope fact? | Infrastructure.* |
| Is this an operational threshold, run-overhead profile, or monitoring policy? | Ops.* |
| Is this a cited scalar from a paper/table used as a literature anchor? | Literature.* |
| Is this a runnable workload + system + constraint bundle? | Scenarios.* |
| Is this a non-executable sourced world statistic or case-study anchor? | ReferenceStats.* |
| Is this a reusable teaching/problem setting that is not a physical system or runnable scenario? | ReferenceStats.* or local LEGO, depending on reuse and provenance |
| Is this a one-off knob that defines a local exercise? | Keep it local in the LEGO cell and label it as a scenario assumption. |
Provenance is metadata attached to entries in any of these homes. It does not decide the namespace; the type of thing being modeled does.
Consumer Conventions
- Use explicit zoo paths for registry operands.
- Use `mlsysim.physics.*` for derived quantities; registries for operands.
- Use `Scenario.evaluate()` when a runnable workload + system + constraint bundle is needed.
- Use `ReferenceStats.*` for non-executable anchors. Do not route reference statistics through `Scenarios.*`.
Migration tiers (QMD)
| Tier | Source | Target |
|---|---|---|
| A | GPU/chip constants (`H100_*`, `NVLINK_*`, …) | `Hardware.*` |
| B | Network/fabric (`INFINIBAND_*`, `ETHERNET_*`, …) | `Hardware.Networks.*` / `Systems.Fabrics.*` |
| C | Model/dataset constants | `Models.*` / `Datasets.*` |
| D | Economics/reliability/ops/literature | `Infrastructure.Pricing.*`, `Systems.Reliability.*`, `Ops.*`, `Literature.*`, `Scenarios.*` |
| Platforms | `Systems.Tiers`, tier latency/RAM strings | `Platforms.*` |
No aliases
Hard-delete migrated symbols from constants.py after parity tests pass. Do not keep Hardware.H100, Infrastructure.Quebec, or Systems.Cloud = … shims.
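
A no-alias check of this kind can be sketched as follows. The contents of `test_registry_parity.py` are not shown in this document, so `find_lingering_aliases` and the toy namespaces below are hypothetical; they only illustrate the idea that a deleted bare name must no longer resolve while the canonical path still does.

```python
from types import SimpleNamespace


def find_lingering_aliases(namespace, deleted_names):
    """Return the deleted names that are still reachable on the namespace."""
    return [name for name in deleted_names if hasattr(namespace, name)]


# Toy registries for illustration; they are not the real Hardware zoo.
cloud = SimpleNamespace(H100=object())
clean_hardware = SimpleNamespace(Cloud=cloud)                      # canonical path only
shimmed_hardware = SimpleNamespace(Cloud=cloud, H100=cloud.H100)   # bare alias kept

# A parity gate would fail whenever this list is non-empty.
clean_report = find_lingering_aliases(clean_hardware, ["H100"])
shimmed_report = find_lingering_aliases(shimmed_hardware, ["H100"])
```

Here `clean_report` is empty and `shimmed_report` names the leftover `H100` shim, so the gate passes only for the registry that exposes the canonical `Hardware.Cloud.H100` path alone.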
Verification gates (every commit)
- L1: pytest, exec affected QMD cells, `lego_focal_verify.py`
- L2: `test_registry_parity.py` for deleted symbols
- L3–L5: fmt, HTML build, `audit_lego_html.py` when QMD touched
- L6: downstream content sign-off before rendered-content commits
See PROVENANCE.md and docs/contributing.qmd for package-side provenance rules.