Contributing to MLSYSIM
Add hardware specs, write solvers, build tutorials, and grow the MLSys Zoo.
MLSYSIM grows stronger with every new hardware spec, solver, tutorial, and bug report. This guide explains how to contribute – whether you are a student who spotted a wrong datasheet number, an instructor designing a teaching scenario, or a researcher who needs a new analytical solver.
MLSYSIM is maintained as part of the ML Systems textbook project. All contributions go through GitHub. If you are not familiar with Git and pull requests, GitHub’s guide is a good starting point.
Repository: harvard-edge/cs249r_book
At a Glance
| Contribution | Difficulty | Impact | Where it lives |
|---|---|---|---|
| Report a bug or wrong spec | Beginner | High – specs affect all users | GitHub Issues |
| Add hardware to the Silicon Zoo | Intermediate | High – expands coverage | mlsysim/hardware/ |
| Add a model to the Model Zoo | Intermediate | Medium – new workloads | mlsysim/models/ |
| Add a fleet or fabric | Intermediate | Medium – new topologies | mlsysim/systems/ |
| Add a grid or rack profile | Intermediate | Medium – new infra | mlsysim/infrastructure/ |
| Write a tutorial | Intermediate | High – improves learning | docs/tutorials/ |
| Add or improve a solver | Advanced | High – new analysis capabilities | engine/solvers/ |
1. Reporting Issues
The fastest way to contribute: open an issue on GitHub.
Good bug reports include:
- Which spec is wrong (e.g., “A100 peak TFLOP/s in Hardware.Cloud.A100”)
- The correct value and your source (official datasheet URL preferred)
- The version of MLSYSIM you are using (python -c "import mlsysim; print(mlsysim.__version__)")
Good feature requests include:
- What hardware, model, or solver you want added and why
- A link to the official specification or paper
2. Adding Hardware to the Silicon Zoo
Every chip in the Silicon Zoo lives in one of five categories: Cloud, Workstation, Mobile, Edge, or Tiny. Each entry follows a strict format with mandatory provenance metadata.
The Hardware Acceleration and Compute Infrastructure slide decks cover the accelerator landscape and datasheet specs that feed into MLSYSIM’s hardware registry.
Step 1: Add a YAML spec under hardware/data/
Chip and board specs live in the Silicon Zoo — there is no shared constants file to park them in (units live in core/units.py, physical constants in physics/constants.py). Add one YAML file under the right tier, for example mlsysim/hardware/data/cloud/MyAccelerator.yaml. The loader validates the file against HardwareNode at import time and exposes it through the same Python API (Hardware.Cloud.MyAccelerator).
```yaml
__key__: MyAccelerator
name: Example Accelerator
release_year: 2026
compute:
  peak_flops: 312 TFLOPs / s
  precision_flops:
    fp32: 19.5 TFLOPs / s
    int8: 624 TOPS
memory:
  capacity: 80 GiB
  bandwidth: 2039 GB / s
interconnect:
  name: PCIe Gen4 x16
  bandwidth: 32 GB / s
tdp: 400 W
metadata:
  provenance:
    kind: datasheet
    ref: Example accelerator datasheet
    url: https://...
    verified: "2026-05-30"
```

Reuse shared records from mlsysim/core/provenance_catalog.py when several entries share one source. See mlsysim/PROVENANCE.md and run python -m mlsysim.tools.audit_provenance --scope all --strict before opening a PR.
The YAML data layer is intentionally strict: duplicate YAML keys, duplicate __key__ values, missing data directories, and unknown schema fields fail loading. If the value is a technology-class reference, use @tech: rather than copying a shared fact into many devices, for example latency: "@tech:Interconnect.NVLink.latency".
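The @tech: indirection can be pictured as a dotted-path lookup into a shared table. The sketch below is not the MLSYSIM loader: the TECH table, its placeholder latency value, and the resolve_tech_ref helper are hypothetical names chosen for illustration, and it assumes references always take the form @tech:<dotted.path>.

```python
# Hypothetical sketch of "@tech:" resolution; not the real MLSYSIM loader.
from functools import reduce

# Placeholder shared-technology table. The value is illustrative only,
# not a measured NVLink latency.
TECH = {"Interconnect": {"NVLink": {"latency": "1 us"}}}

TECH_PREFIX = "@tech:"


def resolve_tech_ref(value, table=TECH):
    """Return the shared fact a "@tech:" string points to, else the value unchanged."""
    if not (isinstance(value, str) and value.startswith(TECH_PREFIX)):
        return value
    path = value[len(TECH_PREFIX):].split(".")
    try:
        return reduce(lambda node, key: node[key], path, table)
    except (KeyError, TypeError) as exc:
        # Follow the strict spirit of the data layer: unknown paths fail loudly.
        raise KeyError(f"unknown technology reference: {value}") from exc


print(resolve_tech_ref("@tech:Interconnect.NVLink.latency"))
print(resolve_tech_ref("32 GB / s"))
```

Keeping the shared fact in one place means a corrected value propagates to every device that references it, which is the motivation the paragraph above gives for preferring @tech: over copying.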
Step 2: Verify with the registry contract tests
The loader contract and duplicate-spec gates run automatically: tests/test_registry_loader_contract.py and tests/test_registry_no_duplicate_specs.py. The legacy core/constants.py junk drawer was deleted outright; tests/test_constants_allowlist.py enforces that neither it nor a compat shim for it ever comes back.
Step 3: Use the canonical path in downstream examples
Examples import Hardware.Cloud.A100, not A100_FLOPS_FP16_TENSOR. Downstream content should use the same canonical registry paths as package tutorials.
Provenance rules
Every registry entry requires metadata.provenance (Provenance model in core/provenance.py). Use kind honestly: datasheet, literature, estimate (needs notes), convention, or illustrative.
- Link to an official primary source (manufacturer datasheet, not a blog post)
- Set verified (YYYY-MM-DD) on datasheet and literature entries
- State which variant in ref or notes (e.g., SXM5 vs. PCIe)
- When a spec varies across SKUs, use the most conservative published value unless the variant is specified in the node name
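The provenance rules above can be read as a small set of checks. The following is a minimal sketch under stated assumptions: ProvenanceSketch and provenance_problems are hypothetical names, and the real Provenance model in core/provenance.py may structure or enforce these rules differently.

```python
# Hypothetical stand-in for the provenance rules; the real model lives in core/provenance.py.
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ALLOWED_KINDS = {"datasheet", "literature", "estimate", "convention", "illustrative"}


@dataclass
class ProvenanceSketch:
    kind: str
    ref: str
    url: Optional[str] = None
    verified: Optional[str] = None  # expected as YYYY-MM-DD
    notes: Optional[str] = None


def _is_iso_day(text: str) -> bool:
    """True only for a zero-padded YYYY-MM-DD string naming a real calendar day."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def provenance_problems(p: ProvenanceSketch) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if p.kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {p.kind!r}")
    if p.kind == "estimate" and not p.notes:
        problems.append("estimate entries need notes")
    if p.kind in {"datasheet", "literature"}:
        if not p.verified:
            problems.append("verified date is required")
        elif not _is_iso_day(p.verified):
            problems.append("verified must be YYYY-MM-DD")
    return problems


good = ProvenanceSketch(kind="datasheet", ref="Example accelerator datasheet (SXM5)",
                        url="https://...", verified="2026-05-30")
bad = ProvenanceSketch(kind="estimate", ref="Back-of-envelope")
print(provenance_problems(good))
print(provenance_problems(bad))
```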
Formatting for QMD and LEGO cells
Display helpers live in mlsysim.fmt — keep physics in mlsysim.physics.*:
| Helper | Use when |
|---|---|
| fmt(q, precision=1) | Human-readable quantities in prose |
| fmt_int(n) | Integer counts without spurious decimals |
| check(condition, message) | Assert invariants in tutorials and LEGO cells |
| MarkdownStr | Inline QMD values that must not be LaTeX-escaped |
Registry operands come from zoos; derived values from mlsysim.physics.calc_* or solvers.
3. Adding Models to the Model Zoo
Models live in category YAML files under mlsysim/models/data/: language.yaml, vision.yaml, tiny.yaml, recommendation.yaml, statespace.yaml, and generativevision.yaml. Each entry includes a __type__ tag selecting the validated workload class (TransformerWorkload, CNNWorkload, SSMWorkload, DiffusionWorkload, SparseTransformerWorkload, or plain Workload).
The Network Architectures and Neural Network Computation slide decks explain the architectural parameters (layers, heads, hidden dimensions) that define each workload.
Transformer workloads
```yaml
Llama3_8B:
  __type__: TransformerWorkload
  name: Llama-3.1-8B
  architecture: Transformer
  parameters: 8030000000 param
  inference_flops: 16060000000 flop
  layers: 32
  hidden_dim: 4096
  heads: 32
  kv_heads: 8
```

For inference_flops, the standard approximation is \(2P\) FLOPs per token for transformer forward passes (multiply-accumulate counted as 2 operations). When a more precise count is available from the paper, use it and note the source in a comment.
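As a quick check of the \(2P\) rule against the Llama3_8B entry, the arithmetic can be written out. The helper name approx_inference_flops is hypothetical; integers are used so the count stays exact.

```python
# Worked check of the 2P rule; approx_inference_flops is a hypothetical helper name.
def approx_inference_flops(parameters: int) -> int:
    """Forward-pass FLOPs per token, counting each multiply-accumulate as 2 operations."""
    return 2 * parameters


llama3_8b_params = 8_030_000_000  # the parameters field of the Llama3_8B entry
flops_per_token = approx_inference_flops(llama3_8b_params)
print(f"{flops_per_token:,} FLOPs per token")  # 16,060,000,000 FLOPs per token
```

The result matches the inference_flops value in the entry, which is a useful sanity check before submitting a new transformer workload.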
CNN workloads
```yaml
ResNet50:
  __type__: CNNWorkload
  name: ResNet-50
  architecture: CNN
  parameters: 25600000 param
  inference_flops: 4100000000 flop
  layers: 50
```

4. Adding Systems and Infrastructure
MLSYSIM models the full deployment stack: individual accelerators compose into nodes, nodes compose into racks and fleets, and fleets connect via network fabrics. Infrastructure captures the grid, datacenter environment, pricing, and capacity facts that those systems run within.
The Network Fabrics and Fleet Orchestration slide decks explain the network topologies and cluster compositions that MLSYSIM models analytically.
Adding a fleet or fabric (systems/registry.py)
```python
# A new reference node
DGX_B200 = Node(
    name="DGX B200",
    accelerator=Hardware.Cloud.B200,
    accelerators_per_node=8,
    intra_node_bw=1800 * ureg.GB / ureg.second,
    nics_per_node=8,
)

# A new cluster built from that node
Training_2K = Fleet(
    name="Training Cluster (2048 GPUs)",
    node=DGX_B200,
    count=256,  # 256 nodes x 8 GPUs = 2048
    fabric=Fabrics.InfiniBand_NDR,
)
```

Adding a grid profile (infrastructure/data/grids.yaml)
Grid profiles capture the carbon intensity, cooling efficiency (PUE), and water usage (WUE) of a datacenter region.
```yaml
Iceland:
  name: Iceland (Geothermal)
  carbon_intensity_g_kwh: 28
  pue: 1.06
  wue: 0.0
  primary_source: geothermal
  metadata:
    provenance: "@prov:IEA_WEO_2023"
```

The Sustainable AI slide deck covers PUE, WUE, and carbon intensity – the exact quantities that GridProfile captures.
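To see how the grid quantities combine, operational carbon is commonly estimated as IT energy times PUE times grid carbon intensity. The sketch below assumes that relationship and uses a hypothetical helper name; it is not the SustainabilityModel API, and it uses plain floats for brevity where MLSYSIM code is expected to carry pint units.

```python
# Sketch only: plain floats for brevity; real MLSYSIM code should carry pint units.
def operational_carbon_kg(it_energy_kwh: float, pue: float,
                          carbon_intensity_g_kwh: float) -> float:
    """Facility energy (IT energy x PUE) times grid carbon intensity, in kg CO2e."""
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * carbon_intensity_g_kwh / 1000.0


# pue and carbon_intensity_g_kwh come from the Iceland profile;
# the 1,000 kWh job is a made-up workload for illustration.
iceland_kg = operational_carbon_kg(1000.0, pue=1.06, carbon_intensity_g_kwh=28)
print(f"{iceland_kg:.2f} kg CO2e")
```

Because the result scales linearly with carbon_intensity_g_kwh, an error in a grid profile carries straight through to every carbon estimate built on it, which is why grid entries need the same provenance care as hardware specs.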
5. Writing a Tutorial
The best tutorials teach one insight through one concrete example. Before writing, answer three questions:
- What is the one thing the reader will understand after this tutorial?
- What would they have guessed incorrectly before reading it?
- What surprising number will they compute?
Tutorial structure
Follow the pattern established in Hello, Roofline, Geography is a Systems Variable, Two Phases of Inference, and Scaling to 1000 GPUs:
```markdown
---
title: "Short, specific title"
subtitle: "Payoff sentence: what you learn in 10 words."
---

[2-3 sentence hook: what problem does this solve?]

By the end of this tutorial you will understand:

- [Concept 1]
- [Concept 2]
- [Concept 3]

::: {.callout-tip}
## Background concept
[1-paragraph intuition before any code]
:::

## 1. Setup
[import block -- path hack MUST be hidden with #| echo: false]

## 2. First Example
[minimal working code + output]

## 3-N. Build Understanding
[progressive complexity, callouts explaining surprising results]

## What You Learned
[bullet list recap]

## Next Steps
[2-3 links to related content]
```

Code style in tutorials
- Use clean imports: Start with import mlsysim. The package is installed with pip in the docs build environment (see .github/workflows/mlsysim-preview-dev.yml), so no path manipulation is needed.
- Use Zoo entries: Pull from mlsysim.Hardware.Cloud.A100, mlsysim.Models.Language.Llama3_70B, etc. – no hardcoded constants in tutorial code
- Print with units: Always use pint’s ~ format spec: f"{value.to('ms'):~.2f}"
- Comment sparingly: Code should be readable without comments; add a callout if explanation is needed
Linking to slide decks
When your tutorial covers a topic with an existing slide deck, link to it so readers can deepen their understanding. Relevant slide decks for common tutorial topics:
| Tutorial topic | Slide deck |
|---|---|
| Roofline analysis | Hardware Acceleration |
| Benchmarking & MFU | Benchmarking, Performance Engineering |
| LLM serving (TTFT/ITL) | Model Serving, Inference at Scale |
| Distributed training | Model Training, Distributed Training |
| Collective communication | Collective Communication |
| Fault tolerance | Fault Tolerance |
| Sustainability & carbon | Sustainable AI |
| Model compression | Model Compression |
| Network architectures | Network Architectures |
6. Adding or Improving a Solver
MLSYSIM ships 32 analytical solvers (see mlsysim.solvers), each grounded in peer-reviewed literature. All inherit from the 3-Tier base classes (ForwardModel, BaseSolver, BaseOptimizer) and implement solve(**kwargs). The six foundational ones:
| Solver | What it computes | Key references |
|---|---|---|
| SingleNodeModel | Roofline bounds, MFU/HFU | Williams et al. (2009) |
| DistributedModel | 3D/4D parallelism, all-reduce, pipeline bubbles | Shoeybi et al. (2019), Narayanan et al. (2019) |
| ReliabilityModel | MTBF, failure probability, Young-Daly checkpointing | Young (1974), Daly (2006) |
| SustainabilityModel | Energy, carbon, water (PUE, WUE) | Patterson et al. (2021) |
| EconomicsModel | CapEx, OpEx, TCO | Barroso et al. (2018) |
| ServingModel | TTFT, ITL, KV-cache pressure | Pope et al. (2023), Patel et al. (2024), Agrawal et al. (2024) |
The Performance Engineering and Distributed Training slide decks cover the analytical models (roofline, all-reduce cost, pipeline bubbles) that the solvers implement.
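For orientation, the roofline bound behind SingleNodeModel can be sketched in a few lines. This is an illustration of the Williams et al. (2009) model and not the SingleNodeModel API: the function names are hypothetical, the peak and bandwidth figures are taken from the example accelerator spec earlier in this guide, and plain SI floats stand in for the pint quantities a real solver would use.

```python
# Roofline sketch after Williams et al. (2009); plain SI floats, not the SingleNodeModel API.
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory bounds meet."""
    return peak_flops / mem_bandwidth


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: min(peak compute, bandwidth x arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


PEAK = 312e12  # FLOP/s, peak_flops from the example YAML spec
BW = 2039e9    # bytes/s, memory bandwidth from the same spec

ridge = ridge_point(PEAK, BW)
print(f"ridge point: {ridge:.1f} FLOP/byte")
for intensity in (10.0, 500.0):  # illustrative intensities, not measured kernels
    bound = attainable_flops(intensity, PEAK, BW)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"I={intensity:>5.0f}: {bound / 1e12:.1f} TFLOP/s ({regime})")
```

A workload whose arithmetic intensity falls below the ridge point is limited by memory bandwidth, so adding peak FLOP/s to the spec would not change its bound.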
Solver contribution checklist
- Place canonical equations in mlsysim/physics/ – every solver equation must be independently callable and unit-tested
- Inherit from BaseSolver and implement solve(**kwargs) -> dict
- Use pint units throughout – all inputs and outputs must carry physical units
- Cite the source paper in a docstring on the solver class
- Add tests (e.g. in tests/test_solver_invariants.py or a new test file) covering at least one known-good result from the paper
- Document the solver by adding a page under docs/api/
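The checklist above can be sketched end to end with a small example. Everything here is an assumption made for illustration: BaseSolver is a local stand-in rather than the class MLSYSIM exports, CheckpointIntervalSolver and young_checkpoint_interval are hypothetical names, and plain floats in seconds replace the pint quantities a real contribution must use. The formula is Young's first-order checkpoint interval, sqrt(2 * C * MTBF).

```python
# Sketch only: BaseSolver is a local stand-in, not the class MLSYSIM exports,
# and plain floats (seconds) replace the pint quantities a real solver must use.
import math
from abc import ABC, abstractmethod


class BaseSolver(ABC):
    """Stand-in for the MLSYSIM base class; only the solve() contract is mirrored."""

    @abstractmethod
    def solve(self, **kwargs) -> dict:
        ...


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * C * MTBF), Young (1974)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class CheckpointIntervalSolver(BaseSolver):
    """Hypothetical solver wrapping the standalone physics function above.

    Reference: Young (1974), first-order approximation to the optimum
    checkpoint interval.
    """

    def solve(self, **kwargs) -> dict:
        interval = young_checkpoint_interval(kwargs["checkpoint_cost_s"], kwargs["mtbf_s"])
        return {"checkpoint_interval_s": interval}


# Illustrative inputs: 50 s to write a checkpoint, 10,000 s mean time between failures.
result = CheckpointIntervalSolver().solve(checkpoint_cost_s=50.0, mtbf_s=10_000.0)
print(result)
```

Keeping the equation in a free function, with the solver as a thin wrapper, is what lets the formula be unit-tested without constructing a solver.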
Example: extending a solver
All formulas live in mlsysim/physics/ so they can be tested in isolation:
```python
# In mlsysim/mlsysim/physics/
def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    """Ring all-reduce latency: 2(N-1)/N * M/B.

    Reference: Thakur et al. (2005), "Optimization of Collective
    Communication Operations in MPICH."
    """
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth
```

7. Running Tests
Before submitting a pull request, ensure the test suite passes:
```bash
# Install development dependencies
pip install -e ".[dev]"

# Run the full test suite (from the mlsysim package root)
pytest tests/ -v

# Run a specific test file
pytest tests/test_solver_invariants.py -v
```

Key areas of the test suite:
| Test file | What it validates |
|---|---|
| test_engine.py | Single-node inference, OOM exceptions, precision switching |
| test_hardware.py | Registry access, ridge point calculation, JSON serialization |
| test_solver_invariants.py / test_solver_module_exports.py | Solver behavior and the public mlsysim.solvers surface |
| test_empirical.py | Empirical validation against published numbers |
| test_registry_loader_contract.py / test_constants_allowlist.py | YAML loader contract and the retired-constants gate |
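A new physics function usually arrives with a few small, isolated tests. The sketch below shows what such a file might look like for the ring all-reduce formula from this guide. The function body is copied from the solver example, while the test names and input values are illustrative assumptions, not tests that exist in the repository.

```python
# Sketch of an isolated physics test; test names and inputs are illustrative.
import math


def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    """Ring all-reduce latency: 2(N-1)/N * M/B."""
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth


def test_single_node_needs_no_communication():
    # With N = 1 the factor 2(N-1)/N is zero.
    assert ring_allreduce_time(1e9, 1, 100e9) == 0.0


def test_known_value():
    # 1 GB over 8 nodes at 100 GB/s: 2 * 7/8 * 0.01 s = 0.0175 s
    assert math.isclose(ring_allreduce_time(1e9, 8, 100e9), 0.0175)


def test_stays_below_twice_transfer_time():
    # For large N the factor 2(N-1)/N approaches 2 from below.
    assert ring_allreduce_time(1e9, 10_000, 100e9) < 2 * 1e9 / 100e9


if __name__ == "__main__":
    test_single_node_needs_no_communication()
    test_known_value()
    test_stays_below_twice_transfer_time()
    print("all checks passed")
```

Saved under tests/, a file like this would be collected by the pytest commands shown above; in the package itself the inputs would carry pint units instead of raw floats.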
8. Submitting a Pull Request
- Fork the repository on GitHub
- Create a branch with a descriptive name: git checkout -b feat/add-b200-hardware
- Make your changes following the patterns in this guide
- Run tests to confirm nothing is broken
- Open a PR against the dev branch with:
  - A clear description of what changed and why
  - A link to the source document for any new spec values
  - Output showing your change working (python -c "..." snippet)
Branch naming conventions
| Type | Pattern | Example |
|---|---|---|
| New feature | feat/<scope> | feat/add-mi300x-hardware |
| Bug fix | fix/<scope> | fix/a100-bandwidth-typo |
| Documentation | docs/<scope> | docs/tutorial-kv-cache |
| Tests | test/<scope> | test/reliability-solver |
9. Architecture Overview
Understanding the module layout helps you find the right file to edit.
```text
mlsysim/
  core/
    constants.py          # Retired shim: re-exports units only (CI allowlist enforced)
    loader.py             # YAML loading, duplicate-key checks, registry generation
    provenance.py         # Sourced values and provenance helpers
    registry/             # Base Registry pattern
    types.py              # Shared schema types
    units.py              # Unit registry and exported Pint units
  engine/                 # Solvers, scenario evaluation, calibration, explainers
  physics/                # Canonical calc_* implementations (roofline, all-reduce, TCO, …)
  hardware/data/          # Hardware.* YAML zoo
  models/data/            # Models.* YAML zoo
  datasets/data/          # Datasets.* YAML zoo
  systems/registry.py     # Systems.Nodes / Racks / Fabrics / Clusters / Storage / Pods
  infrastructure/         # Infrastructure.* grids and pricing
  scenarios/              # Runnable Scenarios.* bundles and ReferenceStats.* anchors
  platforms/registry.py   # Platforms.* zoo
  tests/                  # pytest suite (includes registry migration gates)
  examples/               # Standalone scripts
  docs/                   # Quarto documentation site
```
Community Standards
MLSYSIM is a pedagogical tool used in courses at Harvard and beyond. Contributions should:
- Prioritize accuracy over completeness – a wrong spec is worse than a missing one
- Cite sources – every number needs a URL to an official datasheet or peer-reviewed paper
- Explain the analytical reasoning – a tutorial that teaches why is better than one that shows how
- Use units everywhere – pint prevents dimensional errors; do not bypass it with raw floats
Thank you for helping make MLSYSIM more accurate and useful for the next generation of ML systems engineers.