Which Solver Do I Need?

A decision guide for choosing the right MLSys·im analytical tool

MLSys·im provides specialized analytical solvers for different classes of ML systems questions. This page helps you pick the right solver for your question and shows you how to compose solvers for real-world analyses.

Start With Your Question

“How fast will my model run on this GPU?”: Use the SingleNodeModel. It applies the roofline model to determine whether your workload is compute-bound or memory-bound and returns latency, throughput, and bottleneck classification. Lecture slides: Hardware Acceleration (Vol I, Ch 11) · Benchmarking (Vol I, Ch 12)
“How fast will my LLM generate tokens?”: Use the ServingModel. It models the two distinct phases of autoregressive inference: the compute-bound prefill (TTFT) and the memory-bound decode (ITL), plus KV-cache memory pressure, phase splitting, prompt caching, speculative decode, and an optional chunked-prefill stall proxy. Lecture slides: Model Serving (Vol I, Ch 13) · Inference at Scale (Vol II, Ch 10)
“How much memory do I need for training?”: Use the TrainingMemoryModel. It separates weights, gradients, optimizer state, activations, and communication buffers so training memory is not confused with inference memory. Lecture slides: Training (Vol I, Ch 8) · Distributed Training (Vol II, Ch 5)
“How many serving replicas do I need for this SLA?”: Use the ServingCapacityModel. It composes serving latency, continuous-batching capacity, and queueing pressure into a replica-count estimate. Lecture slides: Model Serving (Vol I, Ch 13) · Inference at Scale (Vol II, Ch 10)
“How does performance scale across multiple GPUs?”: Use the DistributedModel. It decomposes workloads using 3D/4D parallelism (DP, TP, PP, EP) and calculates communication overhead, pipeline bubbles, and scaling efficiency. Lecture slides: Distributed Training (Vol II, Ch 5) · Collective Communication (Vol II, Ch 6) · Network Fabrics (Vol II, Ch 3)
“How much does MoE routing imbalance hurt?”: Use the MoERoutingModel. It keeps MoE modeling first-order: total parameters set memory, active parameters set compute, and a routing-imbalance factor inflates expert-parallel all-to-all traffic. Lecture slides: Distributed Training (Vol II, Ch 5) · Inference at Scale (Vol II, Ch 10)
“How much will this cost to run?”: Use the EconomicsModel. It calculates Total Cost of Ownership: CapEx (hardware purchase), OpEx (energy + maintenance), and total TCO over a specified duration. Lecture slides: Compute Infrastructure (Vol II, Ch 2)
“What is the carbon footprint?”: Use the SustainabilityModel. It computes energy consumption (factoring in PUE), carbon emissions (using regional grid intensity), and water usage across datacenter locations. Lecture slides: Sustainable AI (Vol II, Ch 15)
“How often will my cluster fail during training?”: Use the ReliabilityModel. It estimates fleet-wide MTBF, failure probability for a given job duration, and the Young-Daly optimal checkpoint interval. Lecture slides: Fault Tolerance (Vol II, Ch 7)
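
To make the reliability question concrete, the three quantities named in the last entry can be approximated with the standard library alone. This is a sketch, not the ReliabilityModel implementation: it assumes independent nodes with exponentially distributed failures, and the node MTBF, node count, and checkpoint time are made-up inputs. The function names are hypothetical.

```python
import math

def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF, assuming independent nodes with identical failure rates."""
    return node_mtbf_hours / num_nodes

def failure_probability(job_hours: float, mtbf_hours: float) -> float:
    """P(at least one failure during the job) under an exponential failure model."""
    return 1.0 - math.exp(-job_hours / mtbf_hours)

def young_daly_interval_s(checkpoint_time_s: float, mtbf_hours: float) -> float:
    """Young-Daly first-order optimum: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_hours * 3600.0)

# Illustrative inputs (assumptions, not registry values).
mtbf = fleet_mtbf_hours(node_mtbf_hours=50_000, num_nodes=1_000)
p_fail = failure_probability(job_hours=24, mtbf_hours=mtbf)
interval = young_daly_interval_s(checkpoint_time_s=60, mtbf_hours=mtbf)

print(f"Fleet MTBF:          {mtbf:.1f} h")
print(f"P(failure in 24 h):  {p_fail:.1%}")
print(f"Checkpoint interval: {interval / 60:.1f} min")
```

The point of the sketch is the scaling: fleet MTBF shrinks linearly with node count, while the checkpoint interval shrinks only with its square root.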

Quick Reference

Solver	Key Inputs	Key Outputs	Best For
SingleNodeModel	`model`, `hardware`, `batch_size`, `precision`	latency, throughput, bottleneck, MFU	“Is my model memory-bound?”
ServingModel	`model`, `hardware`, `seq_len`, `batch_size`	TTFT, ITL, KV-cache size, decode stall proxy, feasibility	“Can I serve this LLM on this GPU?”
TrainingMemoryModel	`model`, `hardware`, `batch_size`, `seq_len`	memory breakdown, feasibility	“Why does training not fit?”
ServingCapacityModel	`model`, `hardware`, `qps`, `target_p99_latency_ms`	replicas, QPS capacity, queue wait	“How many replicas do I need?”
DistributedModel	`model`, `fleet`, `tp_size`, `pp_size`, `ep_size`	scaling efficiency, communication overhead	“How many GPUs do I actually need?”
MoERoutingModel	sparse `model`, `batch_size`, `seq_len`, `ep_size`	active experts, routed bytes, all-to-all latency	“What is the MoE routing tax?”
EconomicsModel	`fleet`, `duration_days`, `kwh_price`	CapEx, OpEx, total TCO	“What will this cost over 3 years?”
SustainabilityModel	`fleet`, `duration_days`, `datacenter`	energy (kWh), carbon (kg CO₂e), water (L)	“Where should I train to minimize carbon?”
ReliabilityModel	`fleet`, `job_duration_hours`, `checkpoint_time_s`	MTBF, failure probability, checkpoint interval	“Will my training job complete?”

Code Examples

The examples below reach the registries through top-level convenience imports (mlsysim.Models, mlsysim.Hardware, Systems) for readability; these are supported throughout the 0.1.x series. Solver classes are imported from mlsysim.solvers, which makes solver-specific dependencies explicit.

Single-node roofline analysis

import mlsysim
from mlsysim.solvers import SingleNodeModel

solver = SingleNodeModel()
profile = solver.solve(
    model=mlsysim.Models.Vision.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1
)
print(f"Bottleneck: {profile.bottleneck}")   # → Memory
print(f"Latency:    {profile.latency.to('ms'):~.2f}")
print(f"MFU:        {profile.mfu:.1%}")
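
For intuition about what the bottleneck classification means, here is a library-free sketch of a first-order roofline estimate. It is not the SingleNodeModel source: the accelerator figures and both workloads are hypothetical round numbers, and applying the efficiency factor to peak compute only is an assumption.

```python
def roofline(flops, bytes_moved, peak_flops, mem_bw, efficiency=0.5):
    """First-order roofline: the slower of compute and memory sets the latency."""
    compute_s = flops / (peak_flops * efficiency)  # time if compute were the only limit
    memory_s = bytes_moved / mem_bw                # time if memory traffic were the only limit
    bottleneck = "Compute" if compute_s >= memory_s else "Memory"
    return max(compute_s, memory_s), bottleneck

# Hypothetical accelerator (assumed values, not taken from the hardware registry).
PEAK_FLOPS = 300e12  # FLOP/s
MEM_BW = 2e12        # bytes/s

# Hypothetical workloads: (FLOPs, bytes moved) per forward pass.
small_latency, small_bound = roofline(4e9, 200e6, PEAK_FLOPS, MEM_BW)
large_latency, large_bound = roofline(256e9, 400e6, PEAK_FLOPS, MEM_BW)

print(f"small batch: {small_latency * 1e3:.3f} ms, {small_bound}-bound")
print(f"large batch: {large_latency * 1e3:.3f} ms, {large_bound}-bound")
```

With these numbers the small workload is limited by memory traffic and the large one by compute, which is the usual effect of raising the batch size: FLOPs grow faster than bytes moved.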

LLM serving analysis

import mlsysim
from mlsysim.solvers import ServingModel

serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=2048,
    batch_size=1
)
print(f"TTFT: {result.ttft.to('ms'):~.1f}")
print(f"ITL:  {result.itl.to('ms'):~.2f}")
print(f"KV:   {result.kv_cache_size:~.2f}")
print(f"Fits: {result.feasible}")
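
The three headline outputs can be approximated by hand, which is a useful sanity check on the solver. The sketch below is an assumption-laden stand-in, not the ServingModel equations: it uses an 8B-parameter model with 32 layers, 8 KV heads, and head dimension 128 in 16-bit precision, plus round H100-class hardware numbers, and it ignores phase splitting, prompt caching, and speculative decode.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

def prefill_time_s(params, prompt_tokens, peak_flops, efficiency=0.5):
    """Compute-bound prefill: roughly 2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / (peak_flops * efficiency)

def decode_step_time_s(weight_bytes, kv_bytes, mem_bw):
    """Memory-bound decode: each step streams the weights and the KV cache once."""
    return (weight_bytes + kv_bytes) / mem_bw

# Assumed model shape and hardware (illustrative, not registry values).
PARAMS = 8e9
PEAK_FLOPS = 1e15   # FLOP/s
MEM_BW = 3.35e12    # bytes/s

kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=2048, batch_size=1)
ttft = prefill_time_s(PARAMS, prompt_tokens=2048, peak_flops=PEAK_FLOPS)
itl = decode_step_time_s(weight_bytes=2 * PARAMS, kv_bytes=kv, mem_bw=MEM_BW)

print(f"KV cache: {kv / 2**20:.0f} MiB")
print(f"TTFT:     {ttft * 1e3:.1f} ms")
print(f"ITL:      {itl * 1e3:.2f} ms")
```

At batch size 1 the decode step is dominated by streaming the weights; the KV cache term only starts to matter at long contexts or large batches.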

Training memory breakdown

import mlsysim
from mlsysim.solvers import TrainingMemoryModel

memory = TrainingMemoryModel().solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    batch_size=8,
    seq_len=2048,
    zero_stage=2,
    dp_size=8
)
print(f"Total:       {memory.total_memory:~.2f}")
print(f"Weights:     {memory.weights:~.2f}")
print(f"Optimizer:   {memory.optimizer_state:~.2f}")
print(f"Activations: {memory.activations:~.2f}")
print(f"Fits:        {memory.feasible}")
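
To see where a breakdown like this comes from, the model-state part can be reproduced in a few lines. This sketch assumes mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for fp32 master weights plus two Adam moments) and the standard ZeRO sharding stages; it leaves out activations and communication buffers, and it is not the TrainingMemoryModel implementation.

```python
def training_state_bytes(params, zero_stage=0, dp_size=1):
    """Per-GPU bytes for weights, gradients, and Adam state (activations excluded)."""
    weights = 2.0 * params     # 16-bit weights
    gradients = 2.0 * params   # 16-bit gradients
    optimizer = 12.0 * params  # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:        # stage 1 shards optimizer state across data-parallel ranks
        optimizer /= dp_size
    if zero_stage >= 2:        # stage 2 also shards gradients
        gradients /= dp_size
    if zero_stage >= 3:        # stage 3 also shards the weights
        weights /= dp_size
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }

# Assumed 8B-parameter model, matching the shape of the example above.
unsharded = training_state_bytes(8e9)
sharded = training_state_bytes(8e9, zero_stage=2, dp_size=8)

for name, state in [("ZeRO-0", unsharded), ("ZeRO-2, dp=8", sharded)]:
    print(f"{name:14s} total model state: {state['total'] / 1e9:.0f} GB")
```

Under these assumptions the unsharded state is 128 GB per GPU and ZeRO-2 across 8 ranks brings it to 30 GB, which is why the sharding stage often decides feasibility before activations are even counted.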

Serving capacity planning

import mlsysim
from mlsysim.solvers import ServingCapacityModel

capacity = ServingCapacityModel().solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    qps=20,
    target_p99_latency_ms=2000,
    seq_len=1024,
    output_tokens=64
)
print(f"Replicas: {capacity.required_replicas}")
print(f"P99:      {capacity.estimated_p99_latency:~.1f}")
print(f"Util:     {capacity.utilization:.1%}")
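
The replica count is driven by queueing, not just raw throughput. The sketch below swaps the solver's continuous-batching model for something much simpler: each replica is treated as an independent M/M/1 queue with traffic split evenly, so the response time is exponentially distributed with rate (service rate minus arrival rate). The per-replica service rate of 4 requests per second is an assumed input, and both function names are hypothetical.

```python
import math

def p99_latency_s(arrival_rate, service_rate):
    """P99 response time of an M/M/1 queue; infinite if the queue is unstable."""
    if arrival_rate >= service_rate:
        return math.inf
    return math.log(100) / (service_rate - arrival_rate)

def required_replicas(qps, per_replica_qps, target_p99_s, max_replicas=10_000):
    """Smallest replica count whose per-replica P99 meets the target."""
    for replicas in range(1, max_replicas + 1):
        if p99_latency_s(qps / replicas, per_replica_qps) <= target_p99_s:
            return replicas
    raise ValueError("target P99 is not reachable by adding replicas")

# Assumed inputs: 20 QPS offered load, 4 QPS per replica, 2 s P99 target.
replicas = required_replicas(qps=20, per_replica_qps=4, target_p99_s=2.0)
utilization = 20 / (replicas * 4)

print(f"Replicas:    {replicas}")
print(f"Utilization: {utilization:.1%}")
print(f"P99:         {p99_latency_s(20 / replicas, 4):.2f} s")
```

Throughput alone would suggest 5 replicas here; meeting the tail-latency target takes 12, at roughly 42% utilization. That gap is the queueing pressure the solver accounts for.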

MoE routing imbalance

from mlsysim import Systems, ureg
from mlsysim.solvers import MoERoutingModel
from mlsysim.models.types import SparseTransformerWorkload

moe = SparseTransformerWorkload(
    name="Toy-MoE-64B",
    architecture="Sparse Transformer",
    parameters=64e9 * ureg.count,
    active_parameters=8e9 * ureg.count,
    experts=8,
    active_experts_per_token=2,
    layers=32,
    hidden_dim=4096,
    heads=32,
)

routing = MoERoutingModel().solve(
    model=moe,
    batch_size=4,
    seq_len=2048,
    ep_size=8,
    routing_imbalance_factor=1.25,
    fleet=Systems.Clusters.Research_256,
)
print(f"Active experts: {routing.effective_active_experts:.2f}")
print(f"Routed bytes:   {routing.token_dispatch_bytes:~.2f}")
print(f"All-to-all:     {routing.all_to_all_latency:~.2f}")
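
A back-of-the-envelope version of the routing tax helps when reading these outputs. The sketch below is a simplification with assumed inputs, not the MoERoutingModel equations: it counts only the dispatch direction for one MoE layer, assumes a fraction (ep_size - 1) / ep_size of routed activations leaves the local rank, and scales the busiest rank's traffic by the imbalance factor. The 50 GB/s link bandwidth is made up.

```python
def dispatch_bytes(tokens, top_k, hidden_dim, bytes_per_value=2):
    """Activation bytes routed to experts for one MoE layer (dispatch only)."""
    return tokens * top_k * hidden_dim * bytes_per_value

def all_to_all_time_s(payload_bytes, ep_size, link_bw, imbalance=1.0):
    """Transfer time for the busiest rank under a simple bandwidth-only model."""
    remote_bytes = payload_bytes * (ep_size - 1) / ep_size  # share sent off-rank
    return imbalance * remote_bytes / link_bw

# Assumed shapes matching the example above; LINK_BW is a hypothetical value.
TOKENS = 4 * 2048   # batch_size * seq_len
LINK_BW = 50e9      # bytes/s

payload = dispatch_bytes(TOKENS, top_k=2, hidden_dim=4096)
balanced = all_to_all_time_s(payload, ep_size=8, link_bw=LINK_BW)
skewed = all_to_all_time_s(payload, ep_size=8, link_bw=LINK_BW, imbalance=1.25)

print(f"Payload:  {payload / 2**20:.0f} MiB per layer")
print(f"Balanced: {balanced * 1e3:.2f} ms")
print(f"Skewed:   {skewed * 1e3:.2f} ms")
```

In this model the imbalance factor is a pure multiplier on communication time, so a factor of 1.25 costs 25% more all-to-all latency per layer regardless of the other inputs.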

Distributed training at scale

import mlsysim
from mlsysim import Systems
from mlsysim.solvers import DistributedModel

dist = DistributedModel()
result = dist.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    fleet=Systems.Clusters.Frontier_8K,
    batch_size=2048,
    tp_size=8,
    pp_size=4,
    microbatch_count=16
)
print(f"Scaling efficiency: {result.scaling_efficiency:.1%}")
print(f"Bubble fraction:    {result.bubble_fraction:.1%}")
print(f"DP comm latency:    {result.dp_communication_latency.to('ms'):~.2f}")
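
Two of these outputs have well-known closed forms that are easy to check by hand: the pipeline bubble fraction for a GPipe or 1F1B schedule, and the bandwidth term of a ring all-reduce. The sketch below uses those textbook formulas with assumed inputs (a data-parallel degree of 256 and 50 GB/s per link are hypothetical); whether DistributedModel uses exactly these forms is not confirmed by this page.

```python
def bubble_fraction(pp_size, microbatches):
    """Idle fraction of a pipeline: (p - 1) / (m + p - 1)."""
    return (pp_size - 1) / (microbatches + pp_size - 1)

def ring_allreduce_time_s(message_bytes, dp_size, link_bw):
    """Bandwidth term of a ring all-reduce: 2 * (n - 1) / n * bytes / bandwidth."""
    return 2 * (dp_size - 1) / dp_size * message_bytes / link_bw

# Assumed setup: 70B parameters split over tp=8 and pp=4, 16-bit gradients.
params_per_rank = 70e9 / (8 * 4)
grad_bytes = 2 * params_per_rank

bubble = bubble_fraction(pp_size=4, microbatches=16)
allreduce = ring_allreduce_time_s(grad_bytes, dp_size=256, link_bw=50e9)

print(f"Bubble fraction:   {bubble:.1%}")
print(f"Gradient shard:    {grad_bytes / 1e9:.3f} GB")
print(f"DP all-reduce:     {allreduce * 1e3:.1f} ms")
```

The bubble formula shows why the microbatch count matters: doubling it to 32 would cut the idle fraction from about 16% to under 9% without changing the hardware.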

Parameter sweep (manual loop)

MLSys·im does not provide a built-in sweep function. Instead, use a simple Python loop — this keeps the analysis transparent and gives you full control over what you collect:

import mlsysim
from mlsysim.solvers import SingleNodeModel

solver = SingleNodeModel()
targets = [
    mlsysim.Hardware.Cloud.T4,
    mlsysim.Hardware.Cloud.A100,
    mlsysim.Hardware.Cloud.H100,
    mlsysim.Hardware.Cloud.B200,
]

for hw in targets:
    p = solver.solve(model=mlsysim.Models.Vision.ResNet50, hardware=hw, batch_size=32)
    print(f"{hw.name:20s}  {p.latency.to('ms'):>8.2f~}  {p.bottleneck}")
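
Cost and carbon estimates

The EconomicsModel and SustainabilityModel have no call example in this section, so the sketch below illustrates only the arithmetic they are described as performing: CapEx plus energy and maintenance OpEx, and energy scaled by PUE and grid intensity. Every number is an assumption (GPU price, power draw, PUE, maintenance rate, and grid intensities), and the function names are hypothetical, not the library API.

```python
def energy_kwh(num_gpus, gpu_power_kw, duration_days, pue=1.2):
    """Facility energy: IT power times hours, inflated by PUE."""
    return num_gpus * gpu_power_kw * duration_days * 24 * pue

def tco_usd(num_gpus, gpu_price_usd, gpu_power_kw, duration_days,
            kwh_price, pue=1.2, annual_maintenance_rate=0.05):
    """Return (CapEx, OpEx, total) for a fleet over the given duration."""
    capex = num_gpus * gpu_price_usd
    energy_cost = energy_kwh(num_gpus, gpu_power_kw, duration_days, pue) * kwh_price
    maintenance = capex * annual_maintenance_rate * duration_days / 365
    opex = energy_cost + maintenance
    return capex, opex, capex + opex

def carbon_kg(energy_kwh_used, grid_kg_per_kwh):
    """Operational emissions from energy use and regional grid intensity."""
    return energy_kwh_used * grid_kg_per_kwh

# Illustrative fleet: 64 GPUs at $30,000 and 0.7 kW each, run for one year.
capex, opex, total = tco_usd(64, 30_000, 0.7, 365, kwh_price=0.10)
energy = energy_kwh(64, 0.7, 365)

print(f"CapEx: ${capex:,.0f}  OpEx: ${opex:,.0f}  TCO: ${total:,.0f}")
for region, intensity in [("high-carbon grid", 0.40), ("low-carbon grid", 0.02)]:
    print(f"{region:18s} {carbon_kg(energy, intensity):>10,.0f} kg CO2e")
```

The same energy figure feeds both the cost and the carbon calculation, which is why the two solvers compose naturally on a shared fleet and duration.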

Composing Solvers

Real-world questions often require chaining multiple solvers. The output of one solver feeds naturally into the next because all solvers share typed inputs and pint.Quantity-valued outputs.

How to Validate a Result

Use MLSys·im results in three passes:

Check feasibility first. Memory, KV cache, and checkpoint sizes are direct byte counts. If these say the configuration cannot fit, a benchmark will not make it fit.
Check the binding constraint. If the result is memory-bound, compare bandwidth-oriented alternatives; if it is compute-bound, compare FLOP/s and efficiency.
Calibrate before production commitments. For latency and throughput, measure one representative configuration, back-calculate efficiency, then reuse that calibrated value for sweeps. The default efficiency=0.5 is an informed starting point, not a universal constant.
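
The calibration pass amounts to one division and one substitution. The sketch below assumes a compute-bound workload, a hypothetical 300 TFLOP/s peak, and a made-up measurement of 8 ms; it shows how a single benchmark replaces the default efficiency before a sweep.

```python
def ideal_time_s(flops, peak_flops):
    """Time at 100% of peak, i.e. the prediction with efficiency = 1."""
    return flops / peak_flops

def calibrated_efficiency(flops, peak_flops, measured_time_s):
    """Back-calculate the achieved fraction of peak from one measurement."""
    return ideal_time_s(flops, peak_flops) / measured_time_s

def predicted_time_s(flops, peak_flops, efficiency):
    """Reuse the calibrated efficiency for other configurations."""
    return flops / (peak_flops * efficiency)

PEAK_FLOPS = 300e12  # assumed peak, FLOP/s

# One representative measurement (made-up): 1e12 FLOPs took 8 ms.
eff = calibrated_efficiency(flops=1e12, peak_flops=PEAK_FLOPS, measured_time_s=8e-3)
print(f"Calibrated efficiency: {eff:.3f}")

# Sweep other workload sizes with the calibrated value instead of 0.5.
for flops in (2e12, 4e12, 8e12):
    t = predicted_time_s(flops, PEAK_FLOPS, eff)
    print(f"{flops:.0e} FLOPs -> {t * 1e3:.1f} ms")
```

A calibrated value carries over best to configurations that stay in the same regime as the measured one; if a sweep crosses from compute-bound to memory-bound, a second measurement is warranted.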

“Can I serve Llama-70B on 4 H100s within budget?”

ServingModel — check if the model fits in memory and estimate TTFT/ITL.
EconomicsModel — calculate the cost of running that fleet.

“What is the most sustainable way to train GPT-3?”

DistributedModel — find the optimal parallelism configuration.
SustainabilityModel — compare carbon footprint across regions.

“Should I use A100s or H100s for inference?”

SingleNodeModel on A100 — get latency and bottleneck.
SingleNodeModel on H100 — get latency and bottleneck.
EconomicsModel for each — compare cost per query.
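
The last recipe can be sketched as a two-stage pipeline in which the latency from the first stage feeds the cost calculation in the second. To stay runnable without the library, the sketch uses plain dataclasses as stand-ins for solver results; the device names, latencies, and hourly costs are invented, and the real solvers return richer objects than this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    latency_s: float        # stand-in for a SingleNodeModel latency result
    hourly_cost_usd: float  # stand-in for an EconomicsModel cost result

def cost_per_million_queries(candidate: Candidate) -> float:
    """Chain the two results: latency gives queries per hour, cost divides by it."""
    queries_per_hour = 3600.0 / candidate.latency_s
    return candidate.hourly_cost_usd / queries_per_hour * 1e6

# Hypothetical devices: the faster one costs more per hour.
candidates = [
    Candidate("device_a", latency_s=0.020, hourly_cost_usd=2.00),
    Candidate("device_b", latency_s=0.010, hourly_cost_usd=3.50),
]

for c in candidates:
    print(f"{c.name}: ${cost_per_million_queries(c):.2f} per million queries")

best = min(candidates, key=cost_per_million_queries)
print(f"Cheapest per query: {best.name}")
```

In this invented case the more expensive device wins on cost per query because it is twice as fast, which is exactly the kind of conclusion that neither solver produces on its own.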

Textbook Chapter Mapping

Each solver connects to specific chapters in the Machine Learning Systems textbook and corresponding lecture slide decks.

Direct PDF download links for each lecture deck. Full slide portal at mlsysbook.ai/slides.
Solver	Vol I Chapters (Slides)	Vol II Chapters (Slides)
SingleNodeModel	Training · HW Acceleration · Benchmarking	Performance Engineering
ServingModel	Model Serving	Inference at Scale
TrainingMemoryModel	Training	Distributed Training
ServingCapacityModel	Model Serving	Inference at Scale
DistributedModel	—	Distributed Training · Collective Communication · Network Fabrics
MoERoutingModel	—	Distributed Training · Inference at Scale
EconomicsModel	—	Compute Infrastructure
SustainabilityModel	—	Sustainable AI
ReliabilityModel	—	Fault Tolerance

Engine.solve() vs. SingleNodeModel

Engine.solve() is a convenience shortcut that produces the same results as SingleNodeModel().solve(). Use Engine.solve() for quick single-node analysis. Use the individual solver classes (ServingModel, DistributedModel, etc.) when you need specialized analyses beyond the roofline.

Why Analytical Solvers?

MLSys·im is not an empirical profiler (like PyTorch Profiler) or a cycle-accurate simulator (like gem5). It is an analytical modeling platform that computes performance bounds from specifications and first-order equations. This is a deliberate design choice:

Speed. Closed-form equations evaluate in microseconds. You can sweep thousands of hardware × model × parallelism configurations in seconds, which is impractical with empirical profiling.
Intuition. By working from equations rather than opaque traces, students see exactly which physical quantity (bandwidth, compute, memory capacity) creates the bottleneck.
Accessibility. No hardware required. A laptop with pip install mlsysim can analyze a $50,000 GPU cluster that you do not have access to.
Composability. Solvers share typed inputs and outputs, so they can be chained into multi-step analyses.

Solver Architecture

Every solver follows the same three-step pattern:

Takes typed registry objects — HardwareNode, TransformerWorkload, Fleet, GridProfile — as input. These carry physical units (pint.Quantity), so dimensional errors are caught at runtime.
Applies first-order equations from the Math Foundations page.
Returns typed results — either a PerformanceProfile (for SingleNodeModel) or a dict with Quantity-valued fields (for specialized solvers).

The key principle: every .solve() method is a pure function of its inputs. No hidden state, no side effects, no network calls.

Writing a Custom Solver

You can create your own solver by following the same pattern. Here is a “power efficiency” solver that computes peak compute per unit of power, reported in TFLOP/s per kW, for devices in the hardware registry:

import mlsysim
from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    """Compare hardware on performance-per-watt."""

    def solve(self, hardware: HardwareNode) -> dict:
        if hardware.tdp is None:
            raise ValueError(f"{hardware.name}: no TDP specified")

        flops_per_watt = hardware.compute.peak_flops / hardware.tdp

        return {
            "device": hardware.name,
            "peak_flops": hardware.compute.peak_flops,
            "tdp": hardware.tdp,
            "flops_per_watt": flops_per_watt.to("TFLOPs/s/kW"),
        }

# Use it
solver = PowerEfficiencyModel()

for hw in [mlsysim.Hardware.Cloud.H100, mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4, mlsysim.Hardware.Edge.JetsonOrinNX]:
    r = solver.solve(hw)
    print(f"{r['device']:25s}  {r['flops_per_watt']:>10.1f~}")

Use pint.Quantity for all physical calculations so that dimensional mismatches raise an error at runtime instead of silently producing a wrong number. For more complex solvers, see the source code for the built-in solver classes.

For the equations behind each solver, see Math Foundations. For full API details, see the Solver API Reference.