Which Solver Do I Need?
A decision guide for choosing the right MLSYSIM analytical tool
MLSys·im provides specialized analytical solvers for different classes of ML systems questions. This page helps you pick the right solver for your question and shows you how to compose solvers for real-world analyses.
Start With Your Question
- “How fast will my model run on this GPU?”
- Use the SingleNodeModel. It applies the roofline model to determine whether your workload is compute-bound or memory-bound and returns latency, throughput, and bottleneck classification.
- Lecture slides: Hardware Acceleration (Vol I, Ch 11) · Benchmarking (Vol I, Ch 12)
- “How fast will my LLM generate tokens?”
- Use the ServingModel. It models the two distinct phases of autoregressive inference: the compute-bound prefill (TTFT) and the memory-bound decode (ITL), plus KV-cache memory pressure, phase splitting, prompt caching, speculative decode, and an optional chunked-prefill stall proxy.
- Lecture slides: Model Serving (Vol I, Ch 13) · Inference at Scale (Vol II, Ch 10)
- “How much memory do I need for training?”
- Use the TrainingMemoryModel. It separates weights, gradients, optimizer state, activations, and communication buffers so training memory is not confused with inference memory.
- Lecture slides: Training (Vol I, Ch 8) · Distributed Training (Vol II, Ch 5)
- “How many serving replicas do I need for this SLA?”
- Use the ServingCapacityModel. It composes serving latency, continuous-batching capacity, and queueing pressure into a replica-count estimate.
- Lecture slides: Model Serving (Vol I, Ch 13) · Inference at Scale (Vol II, Ch 10)
- “How does performance scale across multiple GPUs?”
- Use the DistributedModel. It decomposes workloads using 3D/4D parallelism (DP, TP, PP, EP) and calculates communication overhead, pipeline bubbles, and scaling efficiency.
- Lecture slides: Distributed Training (Vol II, Ch 5) · Collective Communication (Vol II, Ch 6) · Network Fabrics (Vol II, Ch 3)
- “How much does MoE routing imbalance hurt?”
- Use the MoERoutingModel. It keeps MoE modeling first-order: total parameters set memory, active parameters set compute, and a routing-imbalance factor inflates expert-parallel all-to-all traffic.
- Lecture slides: Distributed Training (Vol II, Ch 5) · Inference at Scale (Vol II, Ch 10)
- “How much will this cost to run?”
- Use the EconomicsModel. It calculates Total Cost of Ownership: CapEx (hardware purchase), OpEx (energy + maintenance), and total TCO over a specified duration.
- Lecture slides: Compute Infrastructure (Vol II, Ch 2)
- “What is the carbon footprint?”
- Use the SustainabilityModel. It computes energy consumption (factoring in PUE), carbon emissions (using regional grid intensity), and water usage across datacenter locations.
- Lecture slides: Sustainable AI (Vol II, Ch 15)
- “How often will my cluster fail during training?”
- Use the ReliabilityModel. It estimates fleet-wide MTBF, failure probability for a given job duration, and the Young-Daly optimal checkpoint interval.
- Lecture slides: Fault Tolerance (Vol II, Ch 7)
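To make the last item concrete, the reliability quantities can be sketched in plain Python. This is a minimal first-order sketch, not the ReliabilityModel implementation: it assumes independent, exponentially distributed node failures, and the function names and input values are hypothetical.

```python
import math

def fleet_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Fleet MTBF, assuming independent nodes with exponential failure times."""
    return node_mtbf_hours / num_nodes

def failure_probability(job_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure somewhere in the fleet during the job."""
    return 1.0 - math.exp(-job_hours / mtbf_hours)

def young_daly_interval_s(checkpoint_time_s: float, mtbf_hours: float) -> float:
    """First-order Young-Daly optimum: sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_hours * 3600.0)

# Hypothetical inputs: 1,000 nodes, 50,000 h MTBF each, a one-week job,
# and 60 s to write one checkpoint.
mtbf = fleet_mtbf_hours(node_mtbf_hours=50_000, num_nodes=1_000)
p_fail = failure_probability(job_hours=24 * 7, mtbf_hours=mtbf)
interval = young_daly_interval_s(checkpoint_time_s=60, mtbf_hours=mtbf)

print(f"Fleet MTBF:          {mtbf:.1f} h")
print(f"P(failure in job):   {p_fail:.1%}")
print(f"Checkpoint interval: {interval / 60:.1f} min")
```

Under these assumptions the fleet fails far more often than any single node, which is why the checkpoint interval depends on fleet size and not only on per-node reliability.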
Quick Reference
| Solver | Key Inputs | Key Outputs | Best For |
|---|---|---|---|
| SingleNodeModel | model, hardware, batch_size, precision | latency, throughput, bottleneck, MFU | “Is my model memory-bound?” |
| ServingModel | model, hardware, seq_len, batch_size | TTFT, ITL, KV-cache size, decode stall proxy, feasibility | “Can I serve this LLM on this GPU?” |
| TrainingMemoryModel | model, hardware, batch_size, seq_len | memory breakdown, feasibility | “Why does training not fit?” |
| ServingCapacityModel | model, hardware, qps, target_p99_latency_ms | replicas, QPS capacity, queue wait | “How many replicas do I need?” |
| DistributedModel | model, fleet, tp_size, pp_size, ep_size | scaling efficiency, communication overhead | “How many GPUs do I actually need?” |
| MoERoutingModel | sparse model, batch_size, seq_len, ep_size | active experts, routed bytes, all-to-all latency | “What is the MoE routing tax?” |
| EconomicsModel | fleet, duration_days, kwh_price | CapEx, OpEx, total TCO | “What will this cost over 3 years?” |
| SustainabilityModel | fleet, duration_days, datacenter | energy (kWh), carbon (kg CO₂e), water (L) | “Where should I train to minimize carbon?” |
| ReliabilityModel | fleet, job_duration_hours, checkpoint_time_s | MTBF, failure probability, checkpoint interval | “Will my training job complete?” |
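As an illustration of the kind of breakdown behind “Why does training not fit?”, here is a minimal sketch of per-GPU model-state memory under ZeRO sharding. It assumes mixed-precision Adam accounting (2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter) and deliberately leaves out activations and communication buffers, which TrainingMemoryModel also tracks. The function name and the round 8e9 parameter count are hypothetical stand-ins, not library values.

```python
def zero_model_state_bytes(params: float, dp_size: int, zero_stage: int,
                           weight_bytes: int = 2, grad_bytes: int = 2,
                           optim_bytes: int = 12) -> dict:
    """Per-GPU model-state memory in bytes for ZeRO stages 0 to 3."""
    weights = params * weight_bytes
    grads = params * grad_bytes
    optim = params * optim_bytes
    if zero_stage >= 1:      # stage 1 shards optimizer state
        optim /= dp_size
    if zero_stage >= 2:      # stage 2 also shards gradients
        grads /= dp_size
    if zero_stage >= 3:      # stage 3 also shards weights
        weights /= dp_size
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer_state": optim,
        "total": weights + grads + optim,
    }

# Roughly 8e9 parameters, 8-way data parallelism, ZeRO stage 2.
breakdown = zero_model_state_bytes(params=8e9, dp_size=8, zero_stage=2)
for name, nbytes in breakdown.items():
    print(f"{name:16s} {nbytes / 1e9:6.1f} GB")
```

Comparing the stage 0 and stage 2 totals shows where the memory goes: without sharding, optimizer state alone is six times the size of the FP16 weights.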
Code Examples
The examples below use top-level convenience imports for readability. These are supported throughout the 0.1.x series; library code can import the same classes from mlsysim.solvers when it wants to make solver-specific dependencies explicit.
Single-node roofline analysis
import mlsysim
from mlsysim import SingleNodeModel
# Roofline analysis of ResNet-50 on a single A100 at batch size 1.
solver = SingleNodeModel()
profile = solver.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1
)
print(f"Bottleneck: {profile.bottleneck}")  # → Memory
print(f"Latency: {profile.latency.to('ms'):~.2f}")
print(f"MFU: {profile.mfu:.1%}")
LLM serving analysis
import mlsysim
from mlsysim import ServingModel
# Prefill (TTFT) and decode (ITL) estimates for Llama-3-8B on one H100.
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=2048,
    batch_size=1
)
print(f"TTFT: {result.ttft.to('ms'):~.1f}")
print(f"ITL: {result.itl.to('ms'):~.2f}")
print(f"KV: {result.kv_cache_size:~.2f}")
print(f"Fits: {result.feasible}")
Training memory breakdown
import mlsysim
from mlsysim import TrainingMemoryModel
# Per-GPU training memory with ZeRO stage 2 across 8 data-parallel ranks.
memory = TrainingMemoryModel().solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    batch_size=8,
    seq_len=2048,
    zero_stage=2,
    dp_size=8
)
print(f"Total: {memory.total_memory:~.2f}")
print(f"Weights: {memory.weights:~.2f}")
print(f"Optimizer: {memory.optimizer_state:~.2f}")
print(f"Activations: {memory.activations:~.2f}")
print(f"Fits: {memory.feasible}")
Serving capacity planning
import mlsysim
from mlsysim import ServingCapacityModel
# Replica count needed to serve 20 QPS within a 2-second P99 target.
capacity = ServingCapacityModel().solve(
    model=mlsysim.Models.Language.Llama3_8B,
    hardware=mlsysim.Hardware.Cloud.H100,
    qps=20,
    target_p99_latency_ms=2000,
    seq_len=1024,
    output_tokens=64
)
print(f"Replicas: {capacity.required_replicas}")
print(f"P99: {capacity.estimated_p99_latency:~.1f}")
print(f"Util: {capacity.utilization:.1%}")
MoE routing imbalance
from mlsysim import MoERoutingModel, SparseTransformerWorkload, Systems, ureg
# A toy sparse model: 64B total parameters, 8B active per token.
moe = SparseTransformerWorkload(
    name="Toy-MoE-64B",
    architecture="Sparse Transformer",
    parameters=64e9 * ureg.count,
    active_parameters=8e9 * ureg.count,
    experts=8,
    active_experts_per_token=2,
    layers=32,
    hidden_dim=4096,
    heads=32,
)
# A routing imbalance factor above 1.0 inflates expert-parallel all-to-all traffic.
routing = MoERoutingModel().solve(
    model=moe,
    batch_size=4,
    seq_len=2048,
    ep_size=8,
    routing_imbalance_factor=1.25,
    fleet=Systems.Clusters.Research_256,
)
print(f"Active experts: {routing.effective_active_experts:.2f}")
print(f"Routed bytes: {routing.token_dispatch_bytes:~.2f}")
print(f"All-to-all: {routing.all_to_all_latency:~.2f}")
Distributed training at scale
import mlsysim
from mlsysim import DistributedModel, Systems
# Llama-3-70B with 8-way tensor parallelism and 4-way pipeline parallelism.
dist = DistributedModel()
result = dist.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    fleet=Systems.Clusters.Frontier_8K,
    batch_size=2048,
    tp_size=8,
    pp_size=4,
    microbatch_count=16
)
print(f"Scaling efficiency: {result.scaling_efficiency:.1%}")
print(f"Bubble fraction: {result.bubble_fraction:.1%}")
print(f"DP comm latency: {result.dp_communication_latency.to('ms'):~.2f}")
Parameter sweep (manual loop)
MLSYSIM does not provide a built-in sweep function. Instead, use a simple Python loop — this keeps the analysis transparent and gives you full control over what you collect:
import mlsysim
from mlsysim import SingleNodeModel
solver = SingleNodeModel()
targets = [
    mlsysim.Hardware.Cloud.T4,
    mlsysim.Hardware.Cloud.A100,
    mlsysim.Hardware.Cloud.H100,
    mlsysim.Hardware.Cloud.B200,
]
# Solve the same workload on each device and print one row per device.
for hw in targets:
    p = solver.solve(model=mlsysim.Models.ResNet50, hardware=hw, batch_size=32)
    print(f"{hw.name:20s} {p.latency.to('ms'):>8.2f~} {p.bottleneck}")
Composing Solvers
Real-world questions often require chaining multiple solvers. The output of one solver feeds naturally into the next because all solvers share typed inputs and pint.Quantity-valued outputs.
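The chaining idea can be sketched without the library: a feasibility step produces a yes/no answer and a fleet size, and a cost step consumes them. The sketch below is an illustration under stated assumptions, not the ServingModel or EconomicsModel implementation. It assumes a 70B-parameter model in FP16 with 80 layers, 8 KV heads, and a head dimension of 128; the rental price and budget are hypothetical, and a flat hourly rate stands in for a full TCO calculation.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Keys and values (the factor of 2) stored for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Step 1: feasibility. Do weights plus KV cache fit in aggregate device memory?
num_gpus = 4
weight_bytes = 70e9 * 2                       # 70B parameters at 2 bytes each
kv_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                          seq_len=4096, batch_size=8)
capacity_bytes = num_gpus * 80e9              # four 80 GB accelerators
fits = weight_bytes + kv_bytes <= capacity_bytes

# Step 2: cost. Only meaningful if step 1 says the configuration fits.
gpu_hour_usd = 3.00                           # hypothetical rental price
monthly_cost_usd = num_gpus * gpu_hour_usd * 24 * 30
within_budget = fits and monthly_cost_usd <= 10_000   # hypothetical budget

print(f"KV cache: {kv_bytes / 2**30:.1f} GiB, fits: {fits}")
print(f"Monthly cost: ${monthly_cost_usd:,.0f}, within budget: {within_budget}")
```

The byte counts ignore activation memory and framework overhead, so a real deployment would need some headroom beyond this check.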
How to Validate a Result
Use MLSys·im results in three passes:
- Check feasibility first. Memory, KV cache, and checkpoint sizes are direct byte counts. If these say the configuration cannot fit, a benchmark will not make it fit.
- Check the binding constraint. If the result is memory-bound, compare bandwidth-oriented alternatives; if it is compute-bound, compare FLOP/s and efficiency.
- Calibrate before production commitments. For latency and throughput, measure one representative configuration, back-calculate efficiency, then reuse that calibrated value for sweeps. The default efficiency=0.5 is an informed starting point, not a universal constant.
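The calibration pass can be sketched as follows. This is a simplified illustration, not the library's internals: it assumes efficiency scales only the compute term of a roofline bound, that the measured configuration is compute-bound, and that all workload, hardware, and measurement numbers are hypothetical.

```python
def roofline_latency_s(flops: float, bytes_moved: float, peak_flops: float,
                       mem_bandwidth: float, efficiency: float = 0.5) -> float:
    """Latency bound: the slower of the compute term and the memory term."""
    compute_s = flops / (peak_flops * efficiency)
    memory_s = bytes_moved / mem_bandwidth
    return max(compute_s, memory_s)

def calibrate_efficiency(measured_latency_s: float, flops: float,
                         peak_flops: float) -> float:
    """Back-calculate efficiency from one compute-bound measurement."""
    return flops / (peak_flops * measured_latency_s)

# Hypothetical workload and device.
flops, bytes_moved = 2.0e12, 1.0e9
peak_flops, mem_bandwidth = 312e12, 2.0e12

predicted = roofline_latency_s(flops, bytes_moved, peak_flops, mem_bandwidth)
measured = 0.016                               # hypothetical benchmark: 16 ms
calibrated = calibrate_efficiency(measured, flops, peak_flops)
recalculated = roofline_latency_s(flops, bytes_moved, peak_flops,
                                  mem_bandwidth, efficiency=calibrated)

print(f"Predicted with default efficiency: {predicted * 1e3:.2f} ms")
print(f"Calibrated efficiency: {calibrated:.3f}")
print(f"Recalculated latency:  {recalculated * 1e3:.2f} ms")
```

Once the calibrated value reproduces the measurement, it can be passed to other configurations in a sweep, on the assumption that they stay in the same bound regime.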
“Can I serve Llama-70B on 4 H100s within budget?”
- ServingModel — check if the model fits in memory and estimate TTFT/ITL.
- EconomicsModel — calculate the cost of running that fleet.
“What is the most sustainable way to train GPT-3?”
- DistributedModel — find the optimal parallelism configuration.
- SustainabilityModel — compare carbon footprint across regions.
“Should I use A100s or H100s for inference?”
- SingleNodeModel on A100 — get latency and bottleneck.
- SingleNodeModel on H100 — get latency and bottleneck.
- EconomicsModel for each — compare cost per query.
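For the sustainability question above, the second step reduces to multiplying energy by a regional grid intensity. The sketch below assumes the job duration and GPU count have already come from the parallelism step. The region names, PUE values, grid intensities, and per-GPU power are hypothetical placeholders, not values from the MLSYSIM registry.

```python
def job_footprint(it_power_kw: float, hours: float, pue: float,
                  grid_kg_per_kwh: float) -> tuple:
    """Return (facility energy in kWh, emissions in kg CO2e) for one job."""
    energy_kwh = it_power_kw * hours * pue     # PUE scales IT energy to facility energy
    return energy_kwh, energy_kwh * grid_kg_per_kwh

# Hypothetical job: 1,024 GPUs at 0.7 kW each for 30 days.
it_power_kw = 1024 * 0.7
hours = 24 * 30

# Hypothetical regions: (PUE, grid intensity in kg CO2e per kWh).
regions = {
    "low-carbon-grid": (1.1, 0.02),
    "high-carbon-grid": (1.5, 0.70),
}

carbon_kg = {}
for name, (pue, intensity) in regions.items():
    energy, carbon = job_footprint(it_power_kw, hours, pue, intensity)
    carbon_kg[name] = carbon
    print(f"{name:18s} {energy:12,.0f} kWh {carbon:12,.0f} kg CO2e")

best_region = min(carbon_kg, key=carbon_kg.get)
print(f"Lowest-carbon region: {best_region}")
```

Because the same job runs in every region, the ranking is driven entirely by PUE and grid intensity, which is the comparison the second step is meant to expose.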
Textbook Chapter Mapping
Each solver connects to specific chapters in the Machine Learning Systems textbook and corresponding lecture slide decks.
| Solver | Vol I Chapters (Slides) | Vol II Chapters (Slides) |
|---|---|---|
| SingleNodeModel | Training · HW Acceleration · Benchmarking | Performance Engineering |
| ServingModel | Model Serving | Inference at Scale |
| DistributedModel | — | Distributed Training · Collective Communication · Network Fabrics |
| EconomicsModel | — | Compute Infrastructure |
| SustainabilityModel | — | Sustainable AI |
| ReliabilityModel | — | Fault Tolerance |
| TrainingMemoryModel | Training | Distributed Training |
| ServingCapacityModel | Model Serving | Inference at Scale |
| MoERoutingModel | — | Distributed Training · Inference at Scale |
Engine.solve() is a convenience shortcut that produces identical results to SingleNodeModel().solve(). Use Engine.solve() for quick single-node analysis. Use the individual solver classes (ServingModel, DistributedModel, etc.) when you need specialized analyses beyond the roofline.
Why Analytical Solvers?
MLSYSIM is not an empirical profiler (like PyTorch Profiler) or a cycle-accurate simulator (like gem5). It is an analytical modeling platform that computes performance bounds from specifications and first-order equations. This is a deliberate design choice:
- Speed. Closed-form equations evaluate in microseconds, so you can sweep thousands of hardware × model × parallelism configurations in seconds, which is impractical with empirical profiling.
- Intuition. By working from equations rather than opaque traces, students see exactly which physical quantity (bandwidth, compute, memory capacity) creates the bottleneck.
- Accessibility. No hardware required. On a laptop with the package installed (pip install mlsysim), you can analyze a $50,000 GPU cluster without access to the hardware.
- Composability. Solvers can be chained because they share typed inputs/outputs. The output of one solver feeds naturally into the next.
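To illustrate the speed point, here is a small closed-form sweep in plain Python. It uses the standard GPipe-style pipeline bubble expression, (p - 1) / (m + p - 1), as a stand-in equation; whether DistributedModel uses exactly this form is an assumption, and the 10% threshold is an arbitrary example.

```python
import itertools

def bubble_fraction(pp_size: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline schedule: (p - 1) / (m + p - 1)."""
    return (pp_size - 1) / (microbatches + pp_size - 1)

pp_sizes = [1, 2, 4, 8, 16]
microbatch_counts = [1, 2, 4, 8, 16, 32, 64]

# Evaluate every (pipeline depth, microbatch count) pair in the grid.
grid = [
    (p, m, bubble_fraction(p, m))
    for p, m in itertools.product(pp_sizes, microbatch_counts)
]

# Keep the configurations whose bubble stays at or below 10%.
acceptable = [(p, m) for p, m, bubble in grid if bubble <= 0.10]

print(f"Evaluated {len(grid)} configurations")
for p, m in acceptable:
    print(f"pp_size={p:2d} microbatches={m:2d} bubble={bubble_fraction(p, m):.1%}")
```

Each evaluation is a single arithmetic expression, so widening the grid by several orders of magnitude still finishes quickly on a laptop.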
Solver Architecture
Every solver follows the same three-step pattern:
- Takes typed registry objects (HardwareNode, TransformerWorkload, Fleet, GridProfile) as input. These carry physical units (pint.Quantity), so dimensional errors are caught at runtime.
- Applies first-order equations from the Math Foundations page.
- Returns typed results: either a PerformanceProfile (for SingleNodeModel) or a dict with Quantity-valued fields (for specialized solvers).
The key principle: every .solve() method is a pure function of its inputs. No hidden state, no side effects, no network calls.
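The three-step pattern and the pure-function property can be sketched with standard-library dataclasses. The class names below are hypothetical stand-ins for the registry types, plain floats replace pint.Quantity, and the device and workload numbers are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float       # FLOP/s
    mem_bandwidth: float    # bytes/s

@dataclass(frozen=True)
class Workload:
    name: str
    flops: float            # FLOPs per inference
    bytes_moved: float      # bytes read and written per inference

@dataclass(frozen=True)
class RooflineResult:
    latency_s: float
    bottleneck: str

class RooflineSketch:
    """Typed inputs, one first-order equation, typed result."""

    def solve(self, workload: Workload, device: Device) -> RooflineResult:
        compute_s = workload.flops / device.peak_flops
        memory_s = workload.bytes_moved / device.mem_bandwidth
        if compute_s >= memory_s:
            return RooflineResult(compute_s, "Compute")
        return RooflineResult(memory_s, "Memory")

device = Device("example-accelerator", peak_flops=300e12, mem_bandwidth=2e12)
workload = Workload("example-cnn", flops=8e9, bytes_moved=100e6)

first = RooflineSketch().solve(workload, device)
second = RooflineSketch().solve(workload, device)
print(first)
print(f"Same inputs, same result: {first == second}")
```

Frozen dataclasses make the inputs immutable, so the result depends only on the arguments passed to solve().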
Writing a Custom Solver
You can create your own solver by following the same pattern. Here is a “power efficiency” solver that computes TFLOP/s per watt across the hardware registry:
import mlsysim
from mlsysim.hardware.types import HardwareNode
class PowerEfficiencyModel:
    """Compare hardware on performance-per-watt."""
    def solve(self, hardware: HardwareNode) -> dict:
        # Devices without a TDP entry cannot be ranked on efficiency.
        if hardware.tdp is None:
            raise ValueError(f"{hardware.name}: no TDP specified")
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "peak_flops": hardware.compute.peak_flops,
            "tdp": hardware.tdp,
            "flops_per_watt": flops_per_watt.to("TFLOPs/s/kW"),
        }
# Use it
solver = PowerEfficiencyModel()
for hw in [mlsysim.Hardware.Cloud.H100, mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4, mlsysim.Hardware.Edge.JetsonOrinNX]:
    r = solver.solve(hw)
    print(f"{r['device']:25s} {r['flops_per_watt']:>10.1f~}")
Use pint.Quantity for all physical calculations so that dimensional errors are caught at runtime instead of propagating silently. For more complex solvers, see the source code for the built-in solver classes.
For the equations behind each solver, see Math Foundations. For full API details, see the Solver API Reference.