For Engineers & Researchers

Back-of-envelope estimates before you provision hardware.

MLSYSIM gives you quick, type-safe analytical estimates for capacity planning, hardware selection, cost modeling, and sustainability analysis – in seconds, from specifications alone. Every equation is grounded in peer-reviewed literature. Every hardware spec comes from a real datasheet.


Why Use Analytical Models?

Before running expensive benchmarks or provisioning cloud instances, you need directional answers:

  • Will this model fit in GPU memory? – Check before renting the GPU
  • What’s the expected TTFT for my LLM? – Estimate before building the serving stack
  • How many H100s do I actually need? – Model scaling efficiency before buying the cluster
  • What will this cost per year? – TCO analysis before signing the contract
  • How often will my training job crash? – Reliability modeling before committing to a 30-day run
  • What’s the carbon footprint of this deployment? – Quantify before the sustainability review

MLSYSIM answers these in microseconds using first-order equations. It won’t replace profiling, but it tells you where to start profiling.

Tip: Theory Behind the Tools

Each solver implements equations from the Math Foundations page. For the full conceptual framework, see the companion slide decks linked in the Solver-to-Slides Map below.


Quick Start: Roofline Analysis

The Engine implements the Roofline Performance Model (Williams et al. 2009) to classify workloads as compute-bound or memory-bound.

import mlsysim
from mlsysim import Engine

# Single-node: Is ResNet-50 memory-bound on A100?
profile = Engine.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1, precision="fp16"
)
print(f"{profile.bottleneck}, {profile.latency.to('ms'):~.2f}")
print(f"MFU: {profile.mfu:.1%}, Arithmetic Intensity: {profile.arithmetic_intensity:~.2f}")

The returned PerformanceProfile gives you latency, throughput, bottleneck classification, Model FLOPs Utilization (MFU), arithmetic intensity, energy, and a feasibility flag – everything you need for a first-pass hardware assessment.
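The classification is simple enough to reproduce by hand. Below is a minimal sketch of the underlying roofline equations, using round numbers close to the A100 fp16 datasheet (312 TFLOP/s peak, ~2 TB/s HBM bandwidth); the FLOP count (~8 GFLOPs for a batch-1 ResNet-50 forward pass, counting multiplies and adds separately) and the 200 MB of memory traffic are illustrative assumptions, not MLSYSIM's datasheet values:

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops: float, mem_bw: float) -> dict:
    ai = flops / bytes_moved                    # arithmetic intensity, FLOP/byte
    ridge = peak_flops / mem_bw                 # ridge point, FLOP/byte
    attainable = min(peak_flops, ai * mem_bw)   # attainable FLOP/s
    return {
        "arithmetic_intensity": ai,
        "bottleneck": "compute-bound" if ai >= ridge else "memory-bound",
        "latency_s": flops / attainable,
    }

# Round numbers close to the A100 fp16 datasheet:
PEAK = 312e12   # FLOP/s
BW = 2.0e12     # bytes/s

# Assumed workload figures, for illustration only:
result = roofline(flops=8.2e9, bytes_moved=200e6, peak_flops=PEAK, mem_bw=BW)
print(result["bottleneck"], f"{result['latency_s'] * 1e3:.3f} ms")
```

An arithmetic intensity of 41 FLOP/byte sits well below the ~156 FLOP/byte ridge point, so the workload lands on the memory-bound side of the roof, matching the kind of verdict the Engine returns.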


LLM Serving Analysis

The ServingModel models the two-phase LLM inference lifecycle: compute-bound pre-fill and memory-bound decoding.

import mlsysim
from mlsysim import ServingModel

serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"TTFT:       {result['ttft'].to('ms'):~.1f}")
print(f"ITL:        {result['itl'].to('ms'):~.2f}")
print(f"KV-cache:   {result['kv_cache_size'].to('GB'):~.1f}")
print(f"Feasible:   {result['feasible']}")
print(f"Mem util:   {result['memory_utilization']:.0%}")

The feasibility check tells you immediately whether the model plus its KV-cache fit in device memory – before you discover the OOM at 3 AM in production.
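That check follows directly from the KV-cache size formula, which is easy to compute by hand. A minimal sketch, using the published Llama-3-70B architecture (80 layers, grouped-query attention with 8 KV heads of head dimension 128) as the assumed configuration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 covers the K and V tensors cached per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-3-70B published architecture: 80 layers, GQA with 8 KV heads, head dim 128.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=4096, batch=1)
weights = 70e9 * 2        # ~70B parameters at 2 bytes each (fp16)
hbm = 80e9                # H100: 80 GB of HBM

print(f"KV-cache: {cache / 1e9:.2f} GB, weights: {weights / 1e9:.0f} GB")
print("fits on one GPU:", cache + weights <= hbm)
```

At fp16 the weights alone exceed a single H100's 80 GB, which is exactly the kind of answer the feasibility flag surfaces before deployment; sharding the model across GPUs (e.g., tensor parallelism) is the standard remedy.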


Hardware Sweep Pattern

Compare devices programmatically instead of reading datasheets:

import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50

for hw in [mlsysim.Hardware.Cloud.H100,
           mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4,
           mlsysim.Hardware.Edge.JetsonAGX]:
    p = Engine.solve(model=model, hardware=hw, batch_size=32, precision="fp16")
    print(f"{hw.name:20s}  {p.bottleneck:16s}  {p.latency.to('ms'):>8.2f~}  {p.throughput:>8.0f} img/s")


Distributed Training Analysis

The DistributedModel models 3D parallelism (data, tensor, pipeline) with communication overhead from ring all-reduce and pipeline bubbles.

import mlsysim
from mlsysim import DistributedModel

dist = DistributedModel()
result = dist.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    fleet=mlsysim.Systems.Clusters.H100_64,
    batch_size=512, precision="fp16",
    tp_size=8, pp_size=4, microbatch_count=16
)
print(f"Scaling efficiency:  {result['scaling_efficiency']:.1%}")
print(f"DP all-reduce:       {result['dp_communication_latency'].to('ms'):~.1f}")
print(f"TP overhead:         {result['tp_communication_latency'].to('ms'):~.1f}")
print(f"Pipeline bubble:     {result['bubble_fraction']:.1%}")
print(f"Step latency:        {result['step_latency_total'].to('ms'):~.1f}")

Tune tp_size, pp_size, and microbatch_count to find the parallelism configuration that maximizes scaling efficiency for your cluster topology.
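Two of the first-order equations behind these numbers are easy to verify by hand: the GPipe-style bubble fraction and the ring all-reduce transfer volume. A sketch, where the 400 GB/s of usable link bandwidth is an illustrative assumption (the solver's actual constants come from the cluster spec):

```python
def bubble_fraction(pp_size: int, microbatches: int) -> float:
    # GPipe-style schedule: (p - 1) of every (m + p - 1) pipeline slots are idle.
    return (pp_size - 1) / (microbatches + pp_size - 1)

def ring_allreduce_seconds(payload_bytes: float, workers: int,
                           link_bw_bytes_s: float) -> float:
    # Ring all-reduce sends 2*(p-1)/p of the payload over each link.
    return 2 * (workers - 1) / workers * payload_bytes / link_bw_bytes_s

# pp=4 with 16 microbatches, as in the example above:
print(f"bubble fraction: {bubble_fraction(4, 16):.1%}")

# fp16 gradients of a 70B-parameter model, reduced across a DP group of 2
# (64 GPUs / (tp=8 x pp=4)), assuming 400 GB/s of usable bandwidth:
grad_bytes = 70e9 * 2
print(f"DP all-reduce: {ring_allreduce_seconds(grad_bytes, 2, 400e9) * 1e3:.0f} ms")
```

The bubble formula makes the tuning trade-off explicit: raising microbatch_count shrinks the bubble, while raising pp_size grows it.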


Composing Solvers for Real Questions

The six solvers are designed to chain. Here are three common engineering workflows.

“Can I serve Llama-70B on H100s within budget?”

import mlsysim
from mlsysim import ServingModel, EconomicsModel

# Step 1: Does it fit and what's the latency?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)

# Step 2: What does that fleet cost?
econ = EconomicsModel()
cost = econ.solve(
    fleet=mlsysim.Systems.Clusters.H100_8,
    duration_days=365,
    kwh_price=0.08
)
print(f"Annual TCO: ${cost['tco_usd']:,.0f}")
print(f"  CapEx:    ${cost['capex_usd']:,.0f}")
print(f"  OpEx:     ${cost['total_opex_usd']:,.0f}")

“Where should I train to minimize carbon?”

import mlsysim
from mlsysim import SustainabilityModel

sustain = SustainabilityModel()
for grid in [mlsysim.Infra.Grids.Quebec, mlsysim.Infra.Grids.US_Average,
             mlsysim.Infra.Grids.Poland]:
    r = sustain.solve(
        fleet=mlsysim.Systems.Clusters.H100_256,
        duration_days=30,
        datacenter=grid
    )
    print(f"{grid.name:12s}  {r['carbon_footprint_kg'].to('metric_ton'):>8.1f~}  "
          f"PUE={r['pue']:.2f}  Water={r['water_usage_liters']:,.0f} L")

For the theory behind PUE, carbon intensity, and the energy hierarchy, see the Sustainable AI slide deck.
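The grid comparison itself reduces to one multiplication: carbon = energy x grid intensity. A sketch with illustrative intensity figures (hydro-heavy Québec is roughly an order of magnitude cleaner than coal-heavy grids; the exact values, like the per-GPU wattage and PUE assumed here, come from MLSYSIM's data in the real solver):

```python
def carbon_kg(n_gpus: int, watts_per_gpu: float, pue: float,
              hours: float, grid_kgco2_per_kwh: float) -> float:
    kwh = n_gpus * watts_per_gpu / 1e3 * pue * hours
    return kwh * grid_kgco2_per_kwh

# 256 GPUs x 700 W x PUE 1.2 for 30 days; intensities are illustrative:
for name, intensity in [("Quebec", 0.03), ("US avg", 0.38), ("Poland", 0.75)]:
    kg = carbon_kg(256, 700, 1.2, 30 * 24, intensity)
    print(f"{name:8s} {kg / 1000:7.1f} t CO2e")
```

The fleet draws the same ~155 MWh wherever it runs; only the grid intensity term changes, which is why siting dominates the carbon result.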

“How reliable is a 30-day training run on 256 GPUs?”

import mlsysim
from mlsysim import ReliabilityModel

rel = ReliabilityModel()
result = rel.solve(
    fleet=mlsysim.Systems.Clusters.H100_256,
    job_duration_hours=720,       # 30 days
    checkpoint_time_s=120.0       # 2 minutes per checkpoint
)
print(f"Fleet MTBF:              {result['fleet_mtbf'].to('hour'):~.1f}")
print(f"P(failure before done):  {result['failure_probability']:.1%}")
print(f"Optimal ckpt interval:   {result['optimal_checkpoint_interval'].to('minute'):~.1f}")
print(f"Expected failures:       {result['expected_failures']:.1f}")

This solver implements the Young-Daly checkpoint model – essential for capacity planning on long training jobs.
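The headline numbers follow from two short formulas: independent failures shrink fleet MTBF linearly with fleet size, and the first-order Young-Daly optimum puts the checkpoint interval at sqrt(2 · C · MTBF). A sketch assuming an illustrative per-node MTBF of 50,000 hours (MLSYSIM derives the real figure from its hardware specs):

```python
import math

def fleet_mtbf_hours(node_mtbf_hours: float, n_nodes: int) -> float:
    # Independent exponential failures: fleet MTBF shrinks linearly.
    return node_mtbf_hours / n_nodes

def young_daly_interval_s(checkpoint_s: float, mtbf_s: float) -> float:
    # First-order Young-Daly optimum: tau = sqrt(2 * C * MTBF)
    return math.sqrt(2 * checkpoint_s * mtbf_s)

mtbf_h = fleet_mtbf_hours(node_mtbf_hours=50_000, n_nodes=256)
tau_s = young_daly_interval_s(checkpoint_s=120.0, mtbf_s=mtbf_h * 3600)
print(f"fleet MTBF: {mtbf_h:.0f} h, optimal interval: {tau_s / 60:.0f} min")
print(f"expected failures in 720 h: {720 / mtbf_h:.1f}")
```

Under these assumptions a 256-node fleet fails every few days, so a 30-day run should expect several interruptions and checkpoint accordingly.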


Writing Custom Solvers

Follow the built-in solver pattern to create your own analysis:

from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    def solve(self, hardware: HardwareNode) -> dict:
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "flops_per_watt": flops_per_watt.to("TFLOPs/s/kW"),
        }
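As a quick sanity check on what such a solver would report, the same ratio can be computed by hand from public datasheet figures (A100 SXM: 312 TFLOP/s dense fp16 at 400 W; H100 SXM: ~989 TFLOP/s dense fp16 at 700 W):

```python
# (name, fp16 dense tensor-core peak in FLOP/s, TDP in W), from public datasheets
devices = [("A100 SXM", 312e12, 400), ("H100 SXM", 989e12, 700)]

for name, peak_flops, tdp_w in devices:
    flops_per_watt = peak_flops / tdp_w
    print(f"{name}: {flops_per_watt / 1e9:,.0f} GFLOP/s per W")
```

Note this is peak efficiency at TDP; delivered efficiency depends on the achieved MFU that the Engine estimates.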

See Extending MLSYSIM for the full guide.


Type Safety

All quantities are pint.Quantity objects. Unit conversions are explicit, and dimensional errors are caught at runtime:

hw = mlsysim.Hardware.Cloud.A100
hw.compute.peak_flops.to("TFLOPs/s")   # → 312.0 TFLOPs/s
hw.memory.bandwidth.to("TB/s")          # → 2.0 TB/s
hw.memory.bandwidth.to("FLOP/s")        # → DimensionalityError ✓

This means you can chain computations across solvers without worrying about unit mismatches – pint catches them for you.


Solver-to-Slides Map

Each MLSYSIM solver maps to specific chapters and slide decks from the Machine Learning Systems textbook. Use these for the full theoretical grounding behind each solver.

| MLSYSIM Solver | What It Models | Slide Deck |
| --- | --- | --- |
| Engine / SingleNodeModel | Roofline analysis, compute vs. memory bottleneck | Hardware Acceleration (Vol I, Ch 11) |
| ServingModel | TTFT, ITL, KV-cache memory | Model Serving (Vol I, Ch 13) and Inference at Scale (Vol II, Ch 9) |
| DistributedModel | 3D parallelism, all-reduce, pipeline bubbles | Distributed Training (Vol II, Ch 5) and Collective Communication (Vol II, Ch 6) |
| EconomicsModel | CapEx, OpEx, total cost of ownership | Compute Infrastructure (Vol II, Ch 2) |
| SustainabilityModel | Energy, carbon footprint, water usage | Sustainable AI (Vol II, Ch 15) |
| ReliabilityModel | MTBF, Young-Daly checkpointing | Fault Tolerance (Vol II, Ch 7) |

Volume I: Foundations (17 decks, 570 slides)

Browse Vol I Decks | Download All (PDF)

Volume II: At Scale (18 decks, 529 slides)

Browse Vol II Decks | Download All (PDF)

