For Engineers & Researchers

Back-of-envelope estimates before you provision hardware.

MLSys·im gives you quick, type-safe analytical estimates for capacity planning, hardware selection, cost modeling, and sustainability analysis – in seconds, from specifications alone. Equations are documented and tied to the relevant literature; hardware registry specs come from datasheets where available.

Why Use Analytical Models?

Before running expensive benchmarks or provisioning cloud instances, you need directional answers:

Will this model fit in GPU memory? – Check before renting the GPU
Why does training fail when inference fits? – Separate weights, gradients, optimizer state, and activations
What’s the expected TTFT for my LLM? – Estimate before building the serving stack
How many replicas do I need for a P99 target? – Size serving capacity before benchmarking
What is the MoE routing tax? – Sweep expert imbalance before choosing EP topology
How many H100s do I actually need? – Model scaling efficiency before buying the cluster
What will this cost per year? – TCO analysis before signing the contract
How often will my training job crash? – Reliability modeling before committing to a 30-day run
What’s the carbon footprint of this deployment? – Quantify before the sustainability review

MLSys·im answers these in microseconds using first-order equations. It won’t replace profiling, but it tells you where to start profiling.

Theory Behind the Tools

Each solver implements equations from the Math Foundations page. For the full conceptual framework, see the companion slide decks linked in the Solver-to-Slides Map below.

Quick Start: Roofline Analysis

The Engine implements the Roofline Performance Model (Williams et al. 2009) to classify workloads as compute-bound or memory-bound.

import mlsysim
from mlsysim import Engine

# Single-node: Is ResNet-50 memory-bound on A100?
profile = Engine.solve(
    model=mlsysim.Models.Vision.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1, precision="fp16"
)
print(f"{profile.bottleneck}, {profile.latency.to('ms'):~.2f}")
print(f"MFU: {profile.mfu:.1%}, Arithmetic Intensity: {profile.arithmetic_intensity:~.2f}")

The returned PerformanceProfile gives you latency, throughput, bottleneck classification, Model FLOPs Utilization (MFU), arithmetic intensity, energy, and a feasibility flag – everything you need for a first-pass hardware assessment.

LLM Serving Analysis

The ServingModel models the two-phase LLM inference lifecycle: compute-bound pre-fill and memory-bound decoding.

import mlsysim
from mlsysim.solvers import ServingModel

serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"TTFT:       {result.ttft.to('ms'):~.1f}")
print(f"ITL:        {result.itl.to('ms'):~.2f}")
print(f"KV-cache:   {result.kv_cache_size.to('GB'):~.1f}")
print(f"Feasible:   {result.feasible}")
print(f"Mem util:   {result.memory_utilization:.0%}")

The feasibility check tells you immediately whether the model plus its KV-cache fit in device memory – before you discover the OOM at 3 AM in production.

Hardware Sweep Pattern

Compare devices programmatically instead of reading datasheets:

import mlsysim
from mlsysim import Engine

model = mlsysim.Models.Vision.ResNet50

for hw in [mlsysim.Hardware.Cloud.H100,
           mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4,
           mlsysim.Hardware.Edge.JetsonOrinNX]:
    p = Engine.solve(model=model, hardware=hw, batch_size=32, precision="fp16")
    print(f"{hw.name:20s}  {p.bottleneck:16s}  {p.latency.to('ms'):>8.2f~}  {p.throughput:>8.0f} img/s")

Distributed Training Analysis

The DistributedModel models 3D parallelism (data, tensor, pipeline) with communication overhead from ring all-reduce and pipeline bubbles.

import mlsysim
from mlsysim.solvers import DistributedModel

dist = DistributedModel()
result = dist.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    fleet=mlsysim.Systems.Clusters.Research_256,
    batch_size=512, precision="fp16",
    tp_size=8, pp_size=4, microbatch_count=16
)
print(f"Scaling efficiency:  {result.scaling_efficiency:.1%}")
print(f"DP all-reduce:       {result.dp_communication_latency.to('ms'):~.1f}")
print(f"TP overhead:         {result.tp_communication_latency.to('ms'):~.1f}")
print(f"Pipeline bubble:     {result.bubble_fraction:.1%}")
print(f"Step latency:        {result.step_latency_total.to('ms'):~.1f}")

Tune tp_size, pp_size, and microbatch_count to find the parallelism configuration that maximizes scaling efficiency for your cluster topology.

Composing Solvers for Real Questions

The core solvers are designed to chain. Here are three common engineering workflows.

“Can I serve Llama-70B on H100s within budget?”

import mlsysim
from mlsysim.solvers import ServingModel, EconomicsModel

# Step 1: Does it fit and what's the latency?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)

# Step 2: What does that fleet cost?
econ = EconomicsModel()
cost = econ.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    duration_days=365,
    kwh_price=0.08
)
print(f"Annual TCO: ${cost.tco_usd:,.0f}")
print(f"  CapEx:    ${cost.capex_usd:,.0f}")
print(f"  OpEx:     ${cost.total_opex_usd:,.0f}")

“Where should I train to minimize carbon?”

import mlsysim
from mlsysim.solvers import SustainabilityModel

sustain = SustainabilityModel()
for grid in [mlsysim.Infrastructure.Grids.Quebec, mlsysim.Infrastructure.Grids.US_Avg,
             mlsysim.Infrastructure.Grids.Poland]:
    r = sustain.solve(
        fleet=mlsysim.Systems.Clusters.Research_256,
        duration_days=30,
        datacenter=grid
    )
    carbon_tons = r.carbon_footprint_kg / 1000.0
    print(f"{grid.name:12s}  {carbon_tons:8.1f} tCO2e  "
          f"PUE={r.pue:.2f}  Water={r.water_usage_liters:,.0f} L")

For the theory behind PUE, carbon intensity, and the energy hierarchy, see the Sustainable AI slide deck.

“How reliable is a 30-day training run on 256 GPUs?”

import mlsysim
from mlsysim.solvers import ReliabilityModel

rel = ReliabilityModel()
result = rel.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    job_duration_hours=720,       # 30 days
    checkpoint_time_s=120.0       # 2 minutes per checkpoint
)
print(f"Fleet MTBF:              {result.fleet_mtbf.to('hour'):~.1f}")
print(f"P(failure before done):  {result.failure_probability:.1%}")
print(f"Optimal ckpt interval:   {result.optimal_checkpoint_interval.to('minute'):~.1f}")
print(f"Expected failures:       {result.expected_failures:.1f}")

This solver implements the Young-Daly checkpoint model – essential for capacity planning on long training jobs.

Writing Custom Solvers

Follow the built-in solver pattern to create your own analysis:

from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    def solve(self, hardware: HardwareNode) -> dict:
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "flops_per_watt": flops_per_watt.to("TFLOPs/s/kW"),
        }

See Writing a Custom Solver for the full guide.

Type Safety

All quantities are pint.Quantity objects. Unit conversions are explicit, and dimensional errors are caught at runtime:

hw = mlsysim.Hardware.Cloud.A100
hw.compute.peak_flops.to("TFLOPs/s")   # → 312.0 TFLOPs/s
hw.memory.bandwidth.to("TB/s")          # → 2.04 TB/s
hw.memory.bandwidth.to("second")        # → DimensionalityError ✓

Pint catches physical-dimension mismatches (bytes per second can never become seconds) – and since 2026-06, compute work has its own pint dimension: a bare .to("flop/s") on a bandwidth raises DimensionalityError directly, as does adding a FLOP rate to a memory bandwidth:

hw.compute.peak_flops + hw.memory.bandwidth   # → DimensionalityError ✓
hw.memory.bandwidth.to("TFLOPs/s")            # → DimensionalityError ✓

Bytes, counts, parameters, and dollars remain dimensionless aliases, so MLSys·im additionally enforces unit-family checks at every schema boundary (require_unit_family in mlsysim.core.types) to keep those distinct – a field declared as memory bandwidth rejects a count rate, and vice versa.

Solver-to-Slides Map

Each MLSys·im solver maps to specific chapters and slide decks from the Machine Learning Systems textbook. Use these for the full theoretical grounding behind each solver.

MLSys·im Solver	What It Models	Slide Deck
Engine / SingleNodeModel	Roofline analysis, compute vs. memory bottleneck	Hardware Acceleration (Vol I, Ch 11)
ServingModel	TTFT, ITL, KV-cache memory	Model Serving (Vol I, Ch 13) and Inference at Scale (Vol II, Ch 10)
DistributedModel	3D parallelism, all-reduce, pipeline bubbles	Distributed Training (Vol II, Ch 5) and Collective Communication (Vol II, Ch 6)
EconomicsModel	CapEx, OpEx, total cost of ownership	Compute Infrastructure (Vol II, Ch 2)
SustainabilityModel	Energy, carbon footprint, water usage	Sustainable AI (Vol II, Ch 15)
ReliabilityModel	MTBF, Young-Daly checkpointing	Fault Tolerance (Vol II, Ch 7)

Volume I: Foundations (17 decks, 570 slides)

Browse Vol I Decks | Download All (PDF)

Volume II: At Scale (18 decks, 529 slides)

Browse Vol II Decks | Download All (PDF)

Next Steps

Getting Started – Install and run your first analysis
Solver Guide – Which solver for which question
MLSys Zoo – Browse all available hardware, model, and infrastructure specs
API Reference – Full programmatic API documentation
Accuracy & Validation – How analytical bounds compare to empirical measurements
Math Foundations – The equations behind every solver
All Slide Decks – 35 Beamer decks with speaker notes and active learning exercises