Contributing to MLSYSIM

Add hardware specs, write solvers, build tutorials, and grow the MLSys Zoo.

MLSYSIM grows stronger with every new hardware spec, solver, tutorial, and bug report. This guide explains how to contribute – whether you are a student who spotted a wrong datasheet number, an instructor designing a teaching scenario, or a researcher who needs a new analytical solver.

Note: Before you start

MLSYSIM is maintained as part of the ML Systems textbook project. All contributions go through GitHub. If you are not familiar with Git and pull requests, GitHub’s guide is a good starting point.

Repository: harvard-edge/cs249r_book


At a Glance

| Contribution | Difficulty | Impact | Where it lives |
|---|---|---|---|
| Report a bug or wrong spec | Beginner | High – specs affect all users | GitHub Issues |
| Add hardware to the Silicon Zoo | Intermediate | High – expands coverage | hardware/ |
| Add a model to the Model Zoo | Intermediate | Medium – new workloads | models/ |
| Add a fleet or fabric | Intermediate | Medium – new topologies | systems/ |
| Add a grid or rack profile | Intermediate | Medium – new infra | infra/ |
| Write a tutorial | Intermediate | High – improves learning | docs/tutorials/ |
| Add or improve a solver | Advanced | High – new analysis capabilities | core/solver.py |

1. Reporting Issues

The fastest way to contribute: open an issue on GitHub.

Good bug reports include:

  • Which spec is wrong (e.g., “A100 peak TFLOP/s in core/constants.py”)
  • The correct value and your source (official datasheet URL preferred)
  • The version of MLSYSIM you are using (python -c "import mlsysim; print(mlsysim.__version__)")

Good feature requests include:

  • What hardware, model, or solver you want added and why
  • A link to the official specification or paper

2. Adding Hardware to the Silicon Zoo

Every chip in the Silicon Zoo lives in one of five categories: Cloud, Workstation, Mobile, Edge, or Tiny. Each entry follows a strict format with mandatory provenance metadata.

Tip: Background reading

The Hardware Acceleration and Compute Infrastructure slide decks cover the accelerator landscape and datasheet specs that feed into MLSYSIM’s hardware registry.

Step 1: Define constants in core/constants.py

Every numerical spec gets a named constant with pint units. Never hardcode values in the registry.

# In mlsysim/core/constants.py
A100_MEM_BW            = Q_(2000, "GB/s")     # HBM2e, SXM4 form factor
A100_FLOPS_FP16_TENSOR = Q_(312, "TFLOP/s")   # Tensor Core, sparsity OFF
A100_FLOPS_FP32        = Q_(19.5, "TFLOP/s")  # CUDA cores
A100_FLOPS_TF32        = Q_(156, "TFLOP/s")   # Tensor Core
A100_FLOPS_INT8        = Q_(624, "TOPS")      # Tensor Core
A100_MEM_CAPACITY      = Q_(80, "GB")         # SXM4 variant
A100_TDP               = Q_(400, "W")         # SXM4 variant
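These two constants are already enough to derive the device's ridge point – the arithmetic intensity at which it crosses from memory-bound to compute-bound – which the test suite exercises (test_hardware.py covers "ridge point calculation"). A quick sketch with plain floats; in MLSYSIM the pint quantities above carry the units for you:

```python
# Ridge point = peak compute / memory bandwidth (FLOP per byte).
# Plain floats here for brevity; the registry uses the pint constants above.
peak_flops = 312e12   # A100 FP16 Tensor Core, FLOP/s
mem_bw = 2000e9       # HBM2e bandwidth, bytes/s

ridge_point = peak_flops / mem_bw
print(f"{ridge_point:.0f} FLOP/byte")  # 156 FLOP/byte
```

Any kernel with lower arithmetic intensity than this is bandwidth-bound on the A100.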

Step 2: Register the node in hardware/registry.py

Import your constants and add a HardwareNode to the appropriate category.

# In mlsysim/hardware/registry.py
A100 = HardwareNode(
    name="NVIDIA A100",
    release_year=2020,
    compute=ComputeCore(
        peak_flops=A100_FLOPS_FP16_TENSOR,
        precision_flops={
            "fp32": A100_FLOPS_FP32,
            "tf32": A100_FLOPS_TF32,
            "int8": A100_FLOPS_INT8,
        },
    ),
    memory=MemoryHierarchy(
        capacity=A100_MEM_CAPACITY,
        bandwidth=A100_MEM_BW,
    ),
    tdp=A100_TDP,
    dispatch_tax=0.015 * ureg.ms,
    metadata={
        "source_url": "https://...",    # REQUIRED: official datasheet
        "last_verified": "2025-03-06",  # REQUIRED: date you checked
    },
)

For mobile and edge devices, include battery_capacity when applicable (e.g., battery_capacity=15 * ureg.Wh for a smartphone).

Step 3: Add a top-level alias (optional)

If the device is commonly referenced, add a convenience alias in the Hardware class at the bottom of hardware/registry.py:

class Hardware(Registry):
    # ...
    MyNewChip = CloudHardware.MyNewChip  # convenient shortcut

Provenance rules

  1. Link to an official primary source (manufacturer datasheet, not a blog post)
  2. Include a last_verified date – specs change across chip revisions and firmware updates
  3. State which variant (e.g., SXM5 vs. PCIe, different memory configs)
  4. When a spec varies across SKUs, use the most conservative published value unless the variant is specified in the node name

3. Adding Models to the Model Zoo

Models live in one of four families: LanguageModels, VisionModels, TinyModels, or RecommendationModels. Transformer-based models use TransformerWorkload; CNNs use CNNWorkload.

Tip: Background reading

The Network Architectures and Neural Network Computation slide decks explain the architectural parameters (layers, heads, hidden dimensions) that define each workload.

Transformer workloads

# In mlsysim/models/registry.py
Llama3_8B = TransformerWorkload(
    name="Llama-3.1-8B",
    architecture="Transformer",
    parameters=LLAMA3_8B_PARAMS,      # defined in core/constants.py
    layers=32,
    hidden_dim=4096,
    heads=32,
    kv_heads=8,                        # GQA: fewer KV heads than query heads
    inference_flops=2 * LLAMA3_8B_PARAMS.magnitude * ureg.flop,
)

For inference_flops, the standard approximation is \(2P\) FLOPs per token for transformer forward passes (multiply-accumulate counted as 2 operations). When a more precise count is available from the paper, use it and note the source in a comment.

CNN workloads

ResNet50 = CNNWorkload(
    name="ResNet-50",
    architecture="CNN",
    parameters=RESNET50_PARAMS,
    layers=50,
    inference_flops=RESNET50_FLOPs,
)

4. Adding Systems and Infrastructure

MLSYSIM models the full deployment stack: individual accelerators compose into nodes, nodes form fleets, and fleets connect via network fabrics. Infrastructure captures the grid (carbon, PUE, WUE) and rack profiles underneath.

Tip: Background reading

The Network Fabrics and Fleet Orchestration slide decks explain the network topologies and cluster compositions that MLSYSIM models analytically.

Adding a fleet or fabric (systems/registry.py)

# A new reference node
DGX_B200 = Node(
    name="DGX B200",
    accelerator=Hardware.B200,
    accelerators_per_node=8,
    intra_node_bw=1800 * ureg.GB / ureg.second,
    nics_per_node=8,
)

# A new cluster built from that node
Training_2K = Fleet(
    name="Training Cluster (2048 GPUs)",
    node=DGX_B200,
    count=256,  # 256 nodes x 8 GPUs = 2048
    fabric=Fabrics.InfiniBand_NDR,
)
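A quick sanity check of the sizing above, plus the cluster's aggregate injection bandwidth, assuming one NDR port (400 Gbit/s ≈ 50 GB/s) per NIC – the per-port rate is a published InfiniBand NDR spec, not something read from the registry:

```python
# Sanity-check the fleet arithmetic in the registry entry above.
nodes = 256
accelerators_per_node = 8
nics_per_node = 8
ndr_gbytes_per_s = 50  # 400 Gbit/s per NDR port

total_gpus = nodes * accelerators_per_node               # 2048
injection_bw = nodes * nics_per_node * ndr_gbytes_per_s  # GB/s into the fabric

print(total_gpus, "GPUs;", injection_bw, "GB/s aggregate injection")
```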

Adding a grid profile (infra/registry.py)

Grid profiles capture the carbon intensity, cooling efficiency (PUE), and water usage (WUE) of a datacenter region.

Iceland = GridProfile(
    name="Iceland (Geothermal)",
    carbon_intensity_g_kwh=28,            # gCO2/kWh
    pue=PUE_LIQUID_COOLED,
    wue=WUE_LIQUID,
    primary_source="geothermal",
    metadata={
        "source_url": "https://...",
        "last_verified": "2025-06-01",
    },
)
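How these fields combine: facility energy is IT energy scaled by PUE, and carbon is facility energy times the grid's carbon intensity. A sketch with plain floats and a hypothetical 1 MWh job (the PUE value is assumed for illustration; the real computation lives in SustainabilityModel with pint units):

```python
# GridProfile arithmetic: IT load -> facility energy -> carbon.
it_energy_kwh = 1000.0   # hypothetical training job, IT load only
pue = 1.1                # assumed liquid-cooled facility
carbon_intensity = 28.0  # gCO2/kWh, the Iceland profile above

facility_kwh = it_energy_kwh * pue
carbon_kg = facility_kwh * carbon_intensity / 1000.0  # g -> kg

print(f"{carbon_kg:.1f} kg CO2")  # 30.8 kg
```

The same job on a coal-heavy grid (~700 gCO2/kWh) would emit roughly 25x more, which is why the grid profile often dominates the sustainability scorecard.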

Tip: Background reading

The Sustainable AI slide deck covers PUE, WUE, and carbon intensity – the exact quantities that GridProfile captures.


5. Writing a Tutorial

The best tutorials teach one insight through one concrete example. Before writing, answer three questions:

  1. What is the one thing the reader will understand after this tutorial?
  2. What would they have guessed incorrectly before reading it?
  3. What surprising number will they compute?

Tutorial structure

Follow the pattern established in Hello World, Sustainability, LLM Serving, and Distributed Training:

---
title: "Short, specific title"
subtitle: "Payoff sentence: what you learn in 10 words."
---

[2-3 sentence hook: what problem does this solve?]

By the end of this tutorial you will understand:
- [Concept 1]
- [Concept 2]
- [Concept 3]

::: {.callout-tip}
## Background concept
[1-paragraph intuition before any code]
:::

## 1. Setup
[import block -- path hack MUST be hidden with #| echo: false]

## 2. First Example
[minimal working code + output]

## 3-N. Build Understanding
[progressive complexity, callouts explaining surprising results]

## What You Learned
[bullet list recap]

## Next Steps
[2-3 links to related content]

Code style in tutorials

  • Hide the path hack: Always wrap the importlib.util setup in #| echo: false
  • Show clean imports: The first visible code block should be import mlsysim
  • Use Zoo entries: Pull from mlsysim.Hardware.Cloud.A100, mlsysim.Models.Language.Llama3_70B, etc. – no hardcoded constants in tutorial code
  • Print with units: Always use pint’s ~ format spec: f"{value.to('ms'):~.2f}"
  • Comment sparingly: Code should be readable without comments; add a callout if explanation is needed

Linking to slide decks

When your tutorial covers a topic with an existing slide deck, link to it so readers can deepen their understanding. Relevant slide decks for common tutorial topics:

| Tutorial topic | Slide deck |
|---|---|
| Roofline analysis | Hardware Acceleration |
| Benchmarking & MFU | Benchmarking, Performance Engineering |
| LLM serving (TTFT/ITL) | Model Serving, Inference at Scale |
| Distributed training | Model Training, Distributed Training |
| Collective communication | Collective Communication |
| Fault tolerance | Fault Tolerance |
| Sustainability & carbon | Sustainable AI |
| Model compression | Model Compression |
| Network architectures | Network Architectures |

6. Adding or Improving a Solver

MLSYSIM ships six analytical solvers, each grounded in peer-reviewed literature. All inherit from BaseSolver and implement solve(**kwargs).

| Solver | What it computes | Key references |
|---|---|---|
| SingleNodeModel | Roofline bounds, MFU/HFU | Williams et al. (2009) |
| DistributedModel | 3D/4D parallelism, all-reduce, pipeline bubbles | Shoeybi et al. (2019), Narayanan et al. (2019) |
| ReliabilityModel | MTBF, failure probability, Young-Daly checkpointing | Young (1974), Daly (2006) |
| SustainabilityModel | Energy, carbon, water (PUE, WUE) | Patterson et al. (2021) |
| EconomicsModel | CapEx, OpEx, TCO | Barroso et al. (2018) |
| ServingModel | TTFT, ITL, KV-cache pressure | Pope et al. (2023), Aminabadi et al. (2022) |

Tip: Background reading

The Performance Engineering and Distributed Training slide decks cover the analytical models (roofline, all-reduce cost, pipeline bubbles) that the solvers implement.

Solver contribution checklist

  1. Place canonical equations in core/formulas.py – every solver equation must be independently callable and unit-tested
  2. Inherit from BaseSolver and implement solve(**kwargs) -> dict
  3. Use pint units throughout – all inputs and outputs must carry physical units
  4. Cite the source paper in a docstring on the solver class
  5. Add tests in tests/test_solvers.py covering at least one known-good result from the paper
  6. Document the solver by adding a page under docs/api/
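As a concrete instance of item 1, the Young-Daly optimal checkpoint interval used by ReliabilityModel reduces to a one-line formula that can live in core/formulas.py and be tested in isolation. A sketch with unit-free arguments – the actual function name and signature in the codebase may differ, and the real version should take pint quantities:

```python
import math

def young_daly_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the optimal checkpoint interval:
    tau = sqrt(2 * delta * M), where delta is the time to write one
    checkpoint and M is the mean time between failures.

    Reference: Young (1974); refined by Daly (2006).
    """
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# e.g., 60 s checkpoints on a fleet with a 4-hour MTBF:
tau = young_daly_interval(60, 4 * 3600)
print(f"checkpoint every {tau / 60:.0f} min")  # ~22 min
```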

Example: extending a solver

All formulas live in core/formulas.py so they can be tested in isolation:

# In mlsysim/core/formulas.py
def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    """Ring all-reduce latency: 2(N-1)/N * M/B.

    Reference: Thakur et al. (2005), "Optimization of Collective
    Communication Operations in MPICH."
    """
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth
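A quick sanity check of that formula with plain floats (in the real module the arguments carry pint units): all-reducing a 1 GB message across 8 nodes at 100 GB/s per link should take 2 · 7/8 · 0.01 s = 17.5 ms.

```python
def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    # Same formula as above, unit-free for illustration.
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth

t = ring_allreduce_time(1e9, 8, 100e9)
print(f"{t * 1e3:.1f} ms")  # 17.5 ms
```

A known-good hand computation like this is exactly what the checklist asks you to encode in tests/test_solvers.py.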

7. Running Tests

Before submitting a pull request, ensure the test suite passes:

# Install development dependencies
pip install -e ".[dev]"

# Run the full test suite
pytest mlsysim/tests/ -v

# Run a specific test file
pytest mlsysim/tests/test_solvers.py -v

The test suite covers four areas:

| Test file | What it validates |
|---|---|
| test_engine.py | Single-node inference, OOM exceptions, precision switching |
| test_hardware.py | Registry access, ridge point calculation, JSON serialization |
| test_solvers.py | Distributed, reliability, and economics solvers |
| test_empirical.py | Empirical validation against published numbers |

8. Submitting a Pull Request

  1. Fork the repository on GitHub
  2. Create a branch with a descriptive name: git checkout -b feat/add-b200-hardware
  3. Make your changes following the patterns in this guide
  4. Run tests to confirm nothing is broken
  5. Open a PR against the dev branch with:
    • A clear description of what changed and why
    • A link to the source document for any new spec values
    • Output showing your change working (python -c "..." snippet)

Branch naming conventions

| Type | Pattern | Example |
|---|---|---|
| New feature | feat/<scope> | feat/add-mi300x-hardware |
| Bug fix | fix/<scope> | fix/a100-bandwidth-typo |
| Documentation | docs/<scope> | docs/tutorial-kv-cache |
| Tests | test/<scope> | test/reliability-solver |

9. Architecture Overview

Understanding the module layout helps you find the right file to edit.

mlsysim/
  core/
    constants.py    # Single source of truth for all specs (pint units)
    formulas.py     # 40+ canonical equations (roofline, all-reduce, TCO, ...)
    solver.py       # 6 analytical solvers (BaseSolver subclasses)
    engine.py       # Roofline engine (Engine.solve())
    scenarios.py    # Scenario bundles + 3-tier evaluation
    evaluation.py   # Feasibility / Performance / Macro scorecard
    config.py       # YAML/JSON configuration loader
    types.py        # Quantity type alias, Metadata
    registry.py     # Base Registry pattern
    exceptions.py   # OOMError, SLAViolation, ThermalThrottleWarning
  hardware/
    types.py        # ComputeCore, MemoryHierarchy, HardwareNode
    registry.py     # 5 categories, 18+ devices
  models/
    types.py        # Workload, TransformerWorkload, CNNWorkload
    registry.py     # 4 families, 15+ models
  systems/
    types.py        # DeploymentTier, Node, Fleet, NetworkFabric
    registry.py     # Tiers, Nodes, Fabrics, Clusters
  infra/
    types.py        # GridProfile, RackProfile, Datacenter
    registry.py     # Grids (carbon/PUE/WUE), Racks
  sim/              # Simulation support (Persona, SystemLedger)
  viz/              # Visualization (roofline plots, scorecards)
  tests/            # pytest suite
  examples/         # Standalone scripts
  docs/             # Quarto documentation site

Community Standards

MLSYSIM is a pedagogical tool used in courses at Harvard and beyond. Contributions should:

  • Prioritize accuracy over completeness – a wrong spec is worse than a missing one
  • Cite sources – every number needs a URL to an official datasheet or peer-reviewed paper
  • Explain the analytical reasoning – a tutorial that teaches why is better than one that shows how
  • Use units everywhere – pint prevents dimensional errors; do not bypass it with raw floats

Thank you for helping make MLSYSIM more accurate and useful for the next generation of ML systems engineers.
