Contributing to MLSYSIM

Add hardware specs, write solvers, build tutorials, and grow the MLSys Zoo.

MLSYSIM grows stronger with every new hardware spec, solver, tutorial, and bug report. This guide explains how to contribute – whether you are a student who spotted a wrong datasheet number, an instructor designing a teaching scenario, or a researcher who needs a new analytical solver.

Note: Before you start

MLSYSIM is maintained as part of the ML Systems textbook project. All contributions go through GitHub. If you are not familiar with Git and pull requests, GitHub’s guide is a good starting point.

Repository: harvard-edge/cs249r_book


At a Glance

| Contribution | Difficulty | Impact | Where it lives |
|---|---|---|---|
| Report a bug or wrong spec | Beginner | High – specs affect all users | GitHub Issues |
| Add hardware to the Silicon Zoo | Intermediate | High – expands coverage | mlsysim/hardware/ |
| Add a model to the Model Zoo | Intermediate | Medium – new workloads | mlsysim/models/ |
| Add a fleet or fabric | Intermediate | Medium – new topologies | mlsysim/systems/ |
| Add a grid or rack profile | Intermediate | Medium – new infra | mlsysim/infrastructure/ |
| Write a tutorial | Intermediate | High – improves learning | docs/tutorials/ |
| Add or improve a solver | Advanced | High – new analysis capabilities | engine/solvers/ |

1. Reporting Issues

The fastest way to contribute: open an issue on GitHub.

Good bug reports include:

  • Which spec is wrong (e.g., “A100 peak TFLOP/s in Hardware.Cloud.A100”)
  • The correct value and your source (official datasheet URL preferred)
  • The version of MLSYSIM you are using (python -c "import mlsysim; print(mlsysim.__version__)")

Good feature requests include:

  • What hardware, model, or solver you want added and why
  • A link to the official specification or paper

2. Adding Hardware to the Silicon Zoo

Every chip in the Silicon Zoo lives in one of five categories: Cloud, Workstation, Mobile, Edge, or Tiny. Each entry follows a strict format with mandatory provenance metadata.

Tip: Background reading

The Hardware Acceleration and Compute Infrastructure slide decks cover the accelerator landscape and datasheet specs that feed into MLSYSIM’s hardware registry.

Step 1: Add a YAML spec under hardware/data/

Chip and board specs live in the Silicon Zoo — there is no shared constants file to park them in (units live in core/units.py, physical constants in physics/constants.py). Add one YAML file under the right tier, for example mlsysim/hardware/data/cloud/MyAccelerator.yaml. The loader validates the file against HardwareNode at import time and exposes it through the same Python API (Hardware.Cloud.MyAccelerator).

__key__: MyAccelerator
name: Example Accelerator
release_year: 2026
compute:
  peak_flops: 312 TFLOPs / s
  precision_flops:
    fp32: 19.5 TFLOPs / s
    int8: 624 TOPS
memory:
  capacity: 80 GiB
  bandwidth: 2039 GB / s
interconnect:
  name: PCIe Gen4 x16
  bandwidth: 32 GB / s
tdp: 400 W
metadata:
  provenance:
    kind: datasheet
    ref: Example accelerator datasheet
    url: https://...
    verified: "2026-05-30"

Reuse shared records from mlsysim/core/provenance_catalog.py when several entries share one source. See mlsysim/PROVENANCE.md and run python -m mlsysim.tools.audit_provenance --scope all --strict before opening a PR.

The YAML data layer is intentionally strict: duplicate YAML keys, duplicate __key__ values, missing data directories, and unknown schema fields fail loading. If the value is a technology-class reference, use @tech: rather than copying a shared fact into many devices, for example latency: "@tech:Interconnect.NVLink.latency".
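
To make the @tech: idea concrete, here is a minimal sketch of how such a reference could be resolved against a shared catalog. It is an illustration only, not the MLSYSIM loader: the function name resolve_tech_ref, the nested-dict catalog, and the placeholder latency string are all assumptions made for this example.

```python
# Hypothetical sketch: resolving an "@tech:" reference against a shared
# technology catalog. This is NOT the actual MLSYSIM loader code.

TECH_PREFIX = "@tech:"


def resolve_tech_ref(value, tech_catalog):
    """Return the catalog entry for an '@tech:' string, or value unchanged."""
    if not (isinstance(value, str) and value.startswith(TECH_PREFIX)):
        return value  # ordinary literal, nothing to resolve

    node = tech_catalog
    for part in value[len(TECH_PREFIX):].split("."):
        if not isinstance(node, dict) or part not in node:
            # Fail loudly, in the same spirit as the strict YAML layer
            raise KeyError(f"unresolved technology reference: {value}")
        node = node[part]
    return node


# Placeholder catalog: the latency string is a stand-in, not a real spec.
catalog = {"Interconnect": {"NVLink": {"latency": "<shared NVLink latency>"}}}

resolved = resolve_tech_ref("@tech:Interconnect.NVLink.latency", catalog)
literal = resolve_tech_ref("32 GB / s", catalog)
```

The point of the indirection is that a shared fact is stored once; every device that references it picks up a correction automatically.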

Step 2: Verify with the registry contract tests

The loader contract and duplicate-spec gates run automatically: tests/test_registry_loader_contract.py and tests/test_registry_no_duplicate_specs.py. The legacy core/constants.py junk drawer was deleted outright; tests/test_constants_allowlist.py enforces that neither it nor a compat shim for it ever comes back.

Step 3: Use the canonical path in downstream examples

Examples import Hardware.Cloud.A100, not A100_FLOPS_FP16_TENSOR. Downstream content should use the same canonical registry paths as package tutorials.

Provenance rules

Every registry entry requires metadata.provenance (Provenance model in core/provenance.py). Use kind honestly: datasheet, literature, estimate (needs notes), convention, or illustrative.

  1. Link to an official primary source (manufacturer datasheet, not a blog post)
  2. Set verified (YYYY-MM-DD) on datasheet and literature entries
  3. State which variant in ref or notes (e.g., SXM5 vs. PCIe)
  4. When a spec varies across SKUs, use the most conservative published value unless the variant is specified in the node name
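
The rules above can be approximated as a small pre-submission check. This is a sketch under stated assumptions: the real validation lives in the Provenance model in core/provenance.py and may differ, and the function name provenance_problems is hypothetical. Only the kind values and the verified/notes requirements come from this guide.

```python
import re
from datetime import date

# Kinds listed in this guide; the checks below are an illustration only.
ALLOWED_KINDS = {"datasheet", "literature", "estimate", "convention", "illustrative"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_iso_date(text):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(text, str) or not DATE_PATTERN.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def provenance_problems(prov):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = []
    kind = prov.get("kind")
    if kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {kind!r}")
    if kind in ("datasheet", "literature") and not _is_iso_date(prov.get("verified")):
        problems.append("verified must be a valid YYYY-MM-DD date")
    if kind == "estimate" and not prov.get("notes"):
        problems.append("estimate entries need notes")
    return problems


ok = provenance_problems(
    {"kind": "datasheet", "ref": "Example accelerator datasheet", "verified": "2026-05-30"}
)
bad = provenance_problems({"kind": "estimate"})
```

Running python -m mlsysim.tools.audit_provenance remains the authoritative check; a helper like this only catches the most common slips early.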

Formatting for QMD and LEGO cells

Display helpers live in mlsysim.fmt — keep physics in mlsysim.physics.*:

| Helper | Use when |
|---|---|
| fmt(q, precision=1) | Human-readable quantities in prose |
| fmt_int(n) | Integer counts without spurious decimals |
| check(condition, message) | Assert invariants in tutorials and LEGO cells |
| MarkdownStr | Inline QMD values that must not be LaTeX-escaped |

Registry operands come from the zoos; derived values come from mlsysim.physics.calc_* functions or from solvers.


3. Adding Models to the Model Zoo

Models live in category YAML files under mlsysim/models/data/: language.yaml, vision.yaml, tiny.yaml, recommendation.yaml, statespace.yaml, and generativevision.yaml. Each entry includes a __type__ tag selecting the validated workload class (TransformerWorkload, CNNWorkload, SSMWorkload, DiffusionWorkload, SparseTransformerWorkload, or plain Workload).

Tip: Background reading

The Network Architectures and Neural Network Computation slide decks explain the architectural parameters (layers, heads, hidden dimensions) that define each workload.

Transformer workloads

Llama3_8B:
  __type__: TransformerWorkload
  name: Llama-3.1-8B
  architecture: Transformer
  parameters: 8030000000 param
  inference_flops: 16060000000 flop
  layers: 32
  hidden_dim: 4096
  heads: 32
  kv_heads: 8

For inference_flops, the standard approximation is \(2P\) FLOPs per token for transformer forward passes (multiply-accumulate counted as 2 operations). When a more precise count is available from the paper, use it and note the source in a comment.
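
As a quick sanity check, the \(2P\) rule reproduces the inference_flops value in the Llama3_8B entry above. The helper name approx_inference_flops is hypothetical and exists only for this worked example.

```python
def approx_inference_flops(num_parameters: int) -> int:
    """Forward-pass FLOPs per token under the 2P approximation.

    Each parameter contributes one multiply-accumulate, counted as 2 operations.
    """
    return 2 * num_parameters


llama3_8b_params = 8_030_000_000  # 'parameters' field of the Llama3_8B entry
flops_per_token = approx_inference_flops(llama3_8b_params)

print(f"{flops_per_token:,} FLOPs per token")  # -> 16,060,000,000 FLOPs per token
```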

CNN workloads

ResNet50:
  __type__: CNNWorkload
  name: ResNet-50
  architecture: CNN
  parameters: 25600000 param
  inference_flops: 4100000000 flop
  layers: 50

4. Adding Systems and Infrastructure

MLSYSIM models the full deployment stack: individual accelerators compose into nodes, nodes compose into racks and fleets, and fleets connect via network fabrics. Infrastructure captures the grid, datacenter environment, pricing, and capacity facts that those systems run within.

Tip: Background reading

The Network Fabrics and Fleet Orchestration slide decks explain the network topologies and cluster compositions that MLSYSIM models analytically.

Adding a fleet or fabric (systems/registry.py)

# Assumes Node, Fleet, Hardware, Fabrics, and ureg are already in scope in systems/registry.py
# A new reference node
DGX_B200 = Node(
    name="DGX B200",
    accelerator=Hardware.Cloud.B200,
    accelerators_per_node=8,
    intra_node_bw=1800 * ureg.GB / ureg.second,
    nics_per_node=8,
)

# A new cluster built from that node
Training_2K = Fleet(
    name="Training Cluster (2048 GPUs)",
    node=DGX_B200,
    count=256,  # 256 nodes x 8 GPUs = 2048
    fabric=Fabrics.InfiniBand_NDR,
)

Adding a grid profile (infrastructure/data/grids.yaml)

Grid profiles capture the carbon intensity, cooling efficiency (PUE), and water usage (WUE) of a datacenter region.

Iceland:
  name: Iceland (Geothermal)
  carbon_intensity_g_kwh: 28
  pue: 1.06
  wue: 0.0
  primary_source: geothermal
  metadata:
    provenance: "@prov:IEA_WEO_2023"

Tip: Background reading

The Sustainable AI slide deck covers PUE, WUE, and carbon intensity – the exact quantities that GridProfile captures.
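
To show how these three quantities combine, here is a minimal sketch using the Iceland values above. It assumes the common definitions (facility energy = IT energy × PUE; WUE in liters per kWh of IT energy) and a hypothetical 1,000 kWh workload. The function names are illustrative, not the SustainabilityModel API.

```python
def operational_carbon_g(it_energy_kwh, pue, carbon_intensity_g_kwh):
    """Grams of CO2 for a workload: IT energy scaled by PUE, times grid intensity."""
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * carbon_intensity_g_kwh


def water_use_liters(it_energy_kwh, wue_l_per_kwh):
    """Liters of water, assuming WUE is expressed per kWh of IT energy."""
    return it_energy_kwh * wue_l_per_kwh


# Values from the Iceland grid profile above
iceland = {"carbon_intensity_g_kwh": 28, "pue": 1.06, "wue": 0.0}

it_energy_kwh = 1_000  # hypothetical workload size
carbon_g = operational_carbon_g(
    it_energy_kwh, iceland["pue"], iceland["carbon_intensity_g_kwh"]
)
water_l = water_use_liters(it_energy_kwh, iceland["wue"])

print(f"{carbon_g / 1000:.2f} kg CO2, {water_l:.1f} L water")
```

In real contributions these quantities should carry pint units rather than raw floats; plain numbers are used here only to keep the sketch dependency-free.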


5. Writing a Tutorial

The best tutorials teach one insight through one concrete example. Before writing, answer three questions:

  1. What is the one thing the reader will understand after this tutorial?
  2. What would they have guessed incorrectly before reading it?
  3. What surprising number will they compute?

Tutorial structure

Follow the pattern established in Hello, Roofline, Geography is a Systems Variable, Two Phases of Inference, and Scaling to 1000 GPUs:

---
title: "Short, specific title"
subtitle: "Payoff sentence: what you learn in 10 words."
---

[2-3 sentence hook: what problem does this solve?]

By the end of this tutorial you will understand:
- [Concept 1]
- [Concept 2]
- [Concept 3]

::: {.callout-tip}
## Background concept
[1-paragraph intuition before any code]
:::

## 1. Setup
[import block -- a plain `import mlsysim`; hide any unavoidable setup code with #| echo: false]

## 2. First Example
[minimal working code + output]

## 3-N. Build Understanding
[progressive complexity, callouts explaining surprising results]

## What You Learned
[bullet list recap]

## Next Steps
[2-3 links to related content]

Code style in tutorials

  • Use clean imports: Start with import mlsysim. The package is pip install-ed in the docs build environment (see .github/workflows/mlsysim-preview-dev.yml), so no path manipulation is needed.
  • Use Zoo entries: Pull from mlsysim.Hardware.Cloud.A100, mlsysim.Models.Language.Llama3_70B, etc. – no hardcoded constants in tutorial code
  • Print with units: Always use pint’s ~ format spec: f"{value.to('ms'):~.2f}"
  • Comment sparingly: Code should be readable without comments; add a callout if explanation is needed

Linking to slide decks

When your tutorial covers a topic with an existing slide deck, link to it so readers can deepen their understanding. Relevant slide decks for common tutorial topics:

| Tutorial topic | Slide deck |
|---|---|
| Roofline analysis | Hardware Acceleration |
| Benchmarking & MFU | Benchmarking, Performance Engineering |
| LLM serving (TTFT/ITL) | Model Serving, Inference at Scale |
| Distributed training | Model Training, Distributed Training |
| Collective communication | Collective Communication |
| Fault tolerance | Fault Tolerance |
| Sustainability & carbon | Sustainable AI |
| Model compression | Model Compression |
| Network architectures | Network Architectures |

6. Adding or Improving a Solver

MLSYSIM ships 32 analytical solvers (see mlsysim.solvers), each grounded in peer-reviewed literature. All inherit from the 3-Tier base classes (ForwardModel, BaseSolver, BaseOptimizer) and implement solve(**kwargs). The six foundational ones:

| Solver | What it computes | Key references |
|---|---|---|
| SingleNodeModel | Roofline bounds, MFU/HFU | Williams et al. (2009) |
| DistributedModel | 3D/4D parallelism, all-reduce, pipeline bubbles | Shoeybi et al. (2019), Narayanan et al. (2019) |
| ReliabilityModel | MTBF, failure probability, Young-Daly checkpointing | Young (1974), Daly (2006) |
| SustainabilityModel | Energy, carbon, water (PUE, WUE) | Patterson et al. (2021) |
| EconomicsModel | CapEx, OpEx, TCO | Barroso et al. (2018) |
| ServingModel | TTFT, ITL, KV-cache pressure | Pope et al. (2023), Patel et al. (2024), Agrawal et al. (2024) |

Tip: Background reading

The Performance Engineering and Distributed Training slide decks cover the analytical models (roofline, all-reduce cost, pipeline bubbles) that the solvers implement.
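
For readers new to the roofline model, the core calculation fits in a few lines. This is a plain-float sketch of the general technique from Williams et al. (2009), not the SingleNodeModel implementation. The function names are hypothetical, and the numbers are the peak_flops and memory bandwidth from the example YAML spec earlier in this guide.

```python
def ridge_point(peak_flops_per_s, mem_bw_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) at which compute and memory bounds meet."""
    return peak_flops_per_s / mem_bw_bytes_per_s


def attainable_flops_per_s(intensity_flop_per_byte, peak_flops_per_s, mem_bw_bytes_per_s):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops_per_s, mem_bw_bytes_per_s * intensity_flop_per_byte)


peak = 312e12       # 312 TFLOP/s, from the example spec
bandwidth = 2039e9  # 2039 GB/s, from the example spec

ridge = ridge_point(peak, bandwidth)
memory_bound = attainable_flops_per_s(10, peak, bandwidth)    # well below the ridge
compute_bound = attainable_flops_per_s(500, peak, bandwidth)  # well above the ridge

print(f"ridge point: {ridge:.1f} FLOP/byte")
print(f"at 10 FLOP/byte: {memory_bound / 1e12:.2f} TFLOP/s")
print(f"at 500 FLOP/byte: {compute_bound / 1e12:.2f} TFLOP/s")
```

A workload whose arithmetic intensity falls below the ridge point is limited by memory bandwidth; above it, by peak compute.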

Solver contribution checklist

  1. Place canonical equations in mlsysim/physics/ – every solver equation must be independently callable and unit-tested
  2. Inherit from BaseSolver and implement solve(**kwargs) -> dict
  3. Use pint units throughout – all inputs and outputs must carry physical units
  4. Cite the source paper in a docstring on the solver class
  5. Add tests (e.g. in tests/test_solver_invariants.py or a new test file) covering at least one known-good result from the paper
  6. Document the solver by adding a page under docs/api/
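
The first five checklist items can be sketched end to end as follows. Everything here is a stand-in: the BaseSolver class is a local placeholder rather than the real MLSYSIM base class, the names young_checkpoint_interval and CheckpointIntervalSolver are hypothetical, and plain floats in seconds replace pint quantities only so the sketch runs without dependencies. A real contribution must use pint units as item 3 requires.

```python
import math
from abc import ABC, abstractmethod


class BaseSolver(ABC):
    """Local stand-in for the MLSYSIM base class (assumption, not the real one)."""

    @abstractmethod
    def solve(self, **kwargs) -> dict:
        ...


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """First-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    Reference: Young (1974).
    """
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


class CheckpointIntervalSolver(BaseSolver):
    """Optimal checkpoint interval from Young's approximation.

    Reference: Young (1974).
    """

    def solve(self, **kwargs) -> dict:
        interval = young_checkpoint_interval(
            kwargs["checkpoint_cost_s"], kwargs["mtbf_s"]
        )
        return {"optimal_interval_s": interval}


def test_known_good_result():
    # sqrt(2 * 50 * 10_000) = sqrt(1_000_000) = 1000 seconds
    result = CheckpointIntervalSolver().solve(checkpoint_cost_s=50, mtbf_s=10_000)
    assert math.isclose(result["optimal_interval_s"], 1000.0)


test_known_good_result()
```

Keeping the equation in a standalone function means it can be unit-tested without constructing a solver, which is the point of checklist item 1.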

Example: extending a solver

All formulas live in mlsysim/physics/ so they can be tested in isolation:

# In mlsysim/mlsysim/physics/
def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    """Ring all-reduce time (bandwidth term only): 2(N-1)/N * M/B.

    Reference: Thakur et al. (2005), "Optimization of Collective
    Communication Operations in MPICH."
    """
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth
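
A quick usage check of the formula, with the function repeated so the snippet runs on its own. Plain floats (bytes and bytes per second) are used here for brevity, and the 1 GB message, 8 nodes, and 100 GB/s link are hypothetical inputs. In the package itself the arguments would be pint quantities.

```python
import math


def ring_allreduce_time(message_bytes, num_nodes, bandwidth):
    """Ring all-reduce time (bandwidth term only): 2(N-1)/N * M/B."""
    return 2 * (num_nodes - 1) / num_nodes * message_bytes / bandwidth


# Hypothetical inputs: 1 GB message, 8 nodes, 100 GB/s per-link bandwidth
t = ring_allreduce_time(message_bytes=1e9, num_nodes=8, bandwidth=100e9)

# 2 * 7/8 = 1.75 message transfers; 1e9 / 100e9 = 0.01 s each
assert math.isclose(t, 0.0175)

# With a single node there is nothing to exchange
assert ring_allreduce_time(1e9, 1, 100e9) == 0.0

print(f"{t * 1e3:.1f} ms")  # -> 17.5 ms
```

As N grows, the factor 2(N-1)/N approaches 2, so the bandwidth cost of a ring all-reduce is nearly independent of cluster size.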

7. Running Tests

Before submitting a pull request, ensure the test suite passes:

# Install development dependencies
pip install -e ".[dev]"

# Run the full test suite (from the mlsysim package root)
pytest tests/ -v

# Run a specific test file
pytest tests/test_solver_invariants.py -v

Key areas of the test suite:

| Test file | What it validates |
|---|---|
| test_engine.py | Single-node inference, OOM exceptions, precision switching |
| test_hardware.py | Registry access, ridge point calculation, JSON serialization |
| test_solver_invariants.py / test_solver_module_exports.py | Solver behavior and the public mlsysim.solvers surface |
| test_empirical.py | Empirical validation against published numbers |
| test_registry_loader_contract.py / test_constants_allowlist.py | YAML loader contract and the retired-constants gate |

8. Submitting a Pull Request

  1. Fork the repository on GitHub
  2. Create a branch with a descriptive name: git checkout -b feat/add-b200-hardware
  3. Make your changes following the patterns in this guide
  4. Run tests to confirm nothing is broken
  5. Open a PR against the dev branch with:
    • A clear description of what changed and why
    • A link to the source document for any new spec values
    • Output showing your change working (python -c "..." snippet)

Branch naming conventions

| Type | Pattern | Example |
|---|---|---|
| New feature | feat/<scope> | feat/add-mi300x-hardware |
| Bug fix | fix/<scope> | fix/a100-bandwidth-typo |
| Documentation | docs/<scope> | docs/tutorial-kv-cache |
| Tests | test/<scope> | test/reliability-solver |

9. Architecture Overview

Understanding the module layout helps you find the right file to edit.

mlsysim/
  core/
    loader.py       # YAML loading, duplicate-key checks, registry generation
    provenance.py   # Sourced values and provenance helpers
    registry/       # Base Registry pattern
    types.py        # Shared schema types
    units.py        # Unit registry and exported Pint units
  engine/           # Solvers, scenario evaluation, calibration, explainers
  physics/          # Canonical calc_* implementations (roofline, all-reduce, TCO, …)
  hardware/data/    # Hardware.* YAML zoo
  models/data/      # Models.* YAML zoo
  datasets/data/    # Datasets.* YAML zoo
  systems/registry.py    # Systems.Nodes / Racks / Fabrics / Clusters / Storage / Pods
  infrastructure/   # Infrastructure.* grids and pricing
  scenarios/        # Runnable Scenarios.* bundles and ReferenceStats.* anchors
  platforms/registry.py  # Platforms.* zoo
  tests/            # pytest suite (includes registry migration gates)
  examples/         # Standalone scripts
  docs/             # Quarto documentation site

Community Standards

MLSYSIM is a pedagogical tool used in courses at Harvard and beyond. Contributions should:

  • Prioritize accuracy over completeness – a wrong spec is worse than a missing one
  • Cite sources – every number needs a URL to an official datasheet or peer-reviewed paper
  • Explain the analytical reasoning – a tutorial that teaches why is better than one that shows how
  • Use units everywhere – pint prevents dimensional errors; do not bypass it with raw floats

Thank you for helping make MLSYSIM more accurate and useful for the next generation of ML systems engineers.
