For Engineers & Researchers
Back-of-envelope estimates before you provision hardware.
MLSys·im gives you quick, type-safe analytical estimates for capacity planning, hardware selection, cost modeling, and sustainability analysis — in seconds, from specifications alone.
Why Use Analytical Models?
Before running expensive benchmarks or provisioning cloud instances, you need directional answers:
- Will this model fit in GPU memory? — Check before renting the GPU
- What’s the expected TTFT for my LLM? — Estimate before building the serving stack
- How many H100s do I actually need? — Model scaling efficiency before buying the cluster
- What will this cost per year? — TCO analysis before signing the contract
MLSys·im answers these in microseconds using first-order equations. It won’t replace profiling, but it tells you where to start profiling.
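The "first-order equations" are essentially the roofline model: execution time is bounded below by the larger of compute time and data-movement time. A minimal sketch of that bound, using illustrative device numbers (H100-like ~989 TFLOP/s dense fp16 and ~3.35 TB/s HBM3, taken as assumptions rather than Zoo values):

```python
def roofline_latency(flops, bytes_moved, peak_flops, mem_bw):
    """First-order latency bound: the max of compute time and memory time."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bw
    bottleneck = "compute-bound" if compute_time >= memory_time else "memory-bound"
    return max(compute_time, memory_time), bottleneck

# Llama-70B-style decode step at batch 1, fp16: every token must stream all
# ~140 GB of weights (~2 FLOPs per parameter). Specs assumed, not measured.
t, regime = roofline_latency(2 * 70e9, 140e9, 989e12, 3.35e12)
print(f"{regime}, lower bound {t * 1e3:.1f} ms/token")
# → memory-bound, ~41.8 ms/token
```

This is exactly the kind of directional answer MLSys·im automates: the decode step is nowhere near the compute roof, so faster HBM helps and more FLOPS barely matters.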
Quick API Usage
```python
import mlsysim
from mlsysim import Engine, ServingModel, DistributedModel

# Single-node: Is ResNet-50 memory-bound on A100?
profile = Engine.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1, precision="fp16"
)
print(f"{profile.bottleneck}, {profile.latency.to('ms'):~.2f}")

# LLM serving: What's the TTFT for Llama-3.1-70B on H100?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"TTFT: {result.ttft.to('ms'):~.1f}")
print(f"ITL: {result.itl.to('ms'):~.2f}")
print(f"KV-cache: {result.kv_cache_size.to('GB'):~.1f}")
```

Hardware Sweep Pattern
Compare devices programmatically instead of reading datasheets:
```python
import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50
for hw in [mlsysim.Hardware.Cloud.H100,
           mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4,
           mlsysim.Hardware.Edge.JetsonAGX]:
    p = Engine.solve(model=model, hardware=hw, batch_size=32, precision="fp16")
    print(f"{hw.name:20s} {p.bottleneck:16s} {p.latency.to('ms'):>8.2f~} {p.throughput:>8.0f} img/s")
```

Batch Size Sweep: Finding the Regime Transition
The most important tuning knob for inference cost optimization is batch size. Sweep it to find where the workload transitions from memory-bound to compute-bound:
```python
import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50
hw = mlsysim.Hardware.Cloud.A100
for bs in [1, 4, 16, 32, 64, 128, 256]:
    p = Engine.solve(model=model, hardware=hw, batch_size=bs, precision="fp16")
    print(f"BS={bs:>4d} {p.bottleneck:16s} {p.latency.to('ms'):>8.2f~} {p.throughput:>8.0f} img/s")
```

At small batch sizes the workload is memory-bound (loading weights dominates), but as batch size grows past the ridge point, it becomes compute-bound. The optimal batch size for cost efficiency is typically just above this transition.
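The ridge point itself is just peak compute divided by memory bandwidth. A quick sketch with assumed A100 figures (~312 TFLOP/s dense fp16, ~2.0 TB/s HBM2e; check the datasheet for your SKU):

```python
# Assumed A100-class specs -- illustrative, not authoritative.
peak_flops = 312e12  # dense fp16 FLOP/s
mem_bw = 2.0e12      # bytes/s

# Arithmetic intensity (FLOP/byte) at which compute and memory time are equal.
ridge = peak_flops / mem_bw
print(f"Ridge point: {ridge:.0f} FLOP/byte")

# Kernels below the ridge are memory-bound, at or above it compute-bound.
for intensity in [20, 100, 156, 400]:
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"AI={intensity:>4d} FLOP/byte -> {regime}")
```

Raising the batch size raises arithmetic intensity (the same weights are reused across more samples), which is why the sweep above eventually crosses into the compute-bound regime.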
Composing Solvers for Real Questions
MLSys·im solvers are designed to chain — the output of one feeds into the next:
“Can I serve Llama-70B on 4 H100s within budget?”
```python
import mlsysim
from mlsysim import ServingModel, EconomicsModel

# Step 1: Does it fit and what's the latency?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)

# Step 2: What does that fleet cost?
econ = EconomicsModel()
cost = econ.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    duration_days=365,
    kwh_price=0.08
)
print(f"Annual TCO: ${cost.tco_usd:,.0f}")
```

“Where should I train to minimize carbon?”
```python
import mlsysim
from mlsysim import SustainabilityModel

sustain = SustainabilityModel()
for grid in [mlsysim.Infra.Grids.Quebec, mlsysim.Infra.Grids.US_Avg,
             mlsysim.Infra.Grids.Poland]:
    r = sustain.solve(
        fleet=mlsysim.Systems.Clusters.Research_256,
        duration_days=30,
        datacenter=grid
    )
    print(f"{grid.name:12s} {r.carbon_footprint_kg / 1000:>8.1f} t CO2e")
```

“Should I invest in FLOPS or bandwidth for next-gen inference?”
```python
import mlsysim
from mlsysim import SensitivitySolver

solver = SensitivitySolver()
res = solver.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.A100
)
print(res.sensitivities)
# {'peak_flops': -0.06, 'memory_bandwidth': -0.88, 'memory_capacity': 0.00}
print(f"Binding constraint: {res.binding_constraint}")
# → memory_bandwidth (10% more BW = 8.8% faster; 10% more FLOPS = 0.6% faster)
```

Production Workflow: Capacity Planning
MLSys·im fits into a four-stage capacity planning workflow:
| Stage | What You Do | Which Solvers | Output |
|---|---|---|---|
| 1. Requirements | Define SLA targets (latency, throughput) | SynthesisSolver | Minimum hardware specs |
| 2. Shortlisting | Evaluate 5–10 hardware configs | ServingModel + EconomicsModel | Ranked candidates with TCO |
| 3. Validation | Profile top 2–3 candidates | Empirical (vLLM, PyTorch Profiler) | Ground-truth measurements |
| 4. Procurement | Make the business case | SustainabilityModel + SensitivitySolver | TCO, carbon, binding constraint report |
MLSys·im handles stages 1, 2, and 4. Stage 3 requires empirical profiling on real hardware. The value is in narrowing the design space from thousands of configurations to a handful worth profiling — saving weeks of benchmarking time.
MLSys·im models workloads in isolation. Production systems run multiple workloads on shared infrastructure with context switching, memory fragmentation, and co-location interference. Apply a 1.3–2× safety margin on MLSys·im latency estimates for production SLA planning.
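Applying that margin is a one-liner; the numbers below are hypothetical placeholders, not library output:

```python
# Hypothetical analytical estimate from a solver, in milliseconds.
analytical_latency_ms = 21.4

# Apply the 1.3-2x production safety margin described above.
conservative = analytical_latency_ms * 1.3
pessimistic = analytical_latency_ms * 2.0

print(f"Plan the SLA between {conservative:.1f} ms and {pessimistic:.1f} ms")
```

Use the low end for well-isolated dedicated fleets and the high end for heavily co-located, multi-tenant clusters.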
“Will my 30-day training run complete without failure?”
```python
import mlsysim
from mlsysim import ReliabilityModel, CheckpointModel

# Step 1: How often will the cluster fail?
reliability = ReliabilityModel()
rel = reliability.solve(
    fleet=mlsysim.Systems.Clusters.H100_256,
    job_duration_days=30
)
print(f"Fleet MTBF: {rel.fleet_mtbf.to('hours'):~.1f}")
print(f"Expected failures in 30 days: {rel.expected_failures:.1f}")
print(f"Optimal checkpoint interval: {rel.optimal_interval.to('minutes'):~.1f}")

# Step 2: What's the MFU penalty of checkpointing at that interval?
ckpt = CheckpointModel()
penalty = ckpt.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    optimizer="adam",
    checkpoint_interval=rel.optimal_interval
)
print(f"Checkpoint size: {penalty.checkpoint_size.to('GB'):~.1f}")
print(f"MFU penalty: {penalty.mfu_penalty*100:.1f}%")
```

“What hardware do I need to meet this latency SLA?”
```python
import mlsysim
from mlsysim import SynthesisSolver

# Invert the Roofline: given a target, derive minimum specs
synth = SynthesisSolver()
specs = synth.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    target_latency=mlsysim.Q_("30 ms/token"),  # ITL SLA
    precision="fp16"
)
print(f"Minimum memory BW: {specs.required_bandwidth.to('TB/s'):~.2f}")
print(f"Minimum memory: {specs.required_memory.to('GB'):~.1f}")
print(f"Minimum FLOPS: {specs.required_flops.to('TFLOP/s'):~.1f}")

# Which hardware in the Zoo meets these specs?
for hw in mlsysim.Hardware.all():
    if (hw.memory.bandwidth >= specs.required_bandwidth and
            hw.memory.capacity >= specs.required_memory):
        print(f"  ✓ {hw.name}")
```

Dimensional Strictness: Why Units Matter
All quantities in MLSys·im are pint.Quantity objects with physical units enforced at runtime. This prevents the most common class of back-of-envelope errors — the kind that cause real production incidents:
```python
import mlsysim

hw = mlsysim.Hardware.Cloud.A100

# These work — units are compatible:
ridge_point = hw.compute.peak_flops / hw.memory.bandwidth  # → FLOP/byte ✓
time = model_bytes / hw.memory.bandwidth                   # → seconds ✓ (model_bytes: any quantity in bytes)

# These raise DimensionalityError — caught before producing a bad number:
hw.memory.bandwidth + hw.compute.peak_flops  # GB/s + FLOP/s → ERROR
hw.memory.bandwidth.to("FLOP/s")             # GB/s ≠ FLOP/s → ERROR
```

In spreadsheet modeling, confusing gigabytes with gigabits or mixing per-device and per-node bandwidth silently produces an 8× or 4× error. In MLSys·im, these errors are structurally impossible. Think of it as type safety for systems analysis.
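To see why this matters, here is the classic failure mode in plain Python with no unit checking (the numbers are illustrative):

```python
# A 140 GB model pulled over a link specified as "400G" -- is that
# gigaBYTES or gigaBITS per second? Bare floats can't tell you.
model_gb = 140.0
link_spec = 400.0

wrong = model_gb / link_spec          # treats 400 Gb/s as if it were 400 GB/s
right = model_gb / (link_spec / 8.0)  # 400 Gb/s is really 50 GB/s

print(f"naive transfer time:  {wrong:.2f} s")
print(f"actual transfer time: {right:.2f} s")
print(f"silent error factor:  {right / wrong:.0f}x")  # the 8x from bits vs bytes
```

Both estimates look equally plausible on a slide; only the units reveal that one is off by 8×, which is exactly the class of error a dimension-checked quantity refuses to compute.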
Writing Custom Solvers
Follow the built-in solver pattern to create your own analysis:
```python
from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    def solve(self, hardware: HardwareNode) -> dict:
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "flops_per_watt": flops_per_watt.to("TFLOP/s/kW"),
        }
```

See Extending MLSys·im for the full guide.
Next Steps
- Getting Started — Install and run your first analysis
- Solver Guide — Which solver for which question
- MLSys Zoo — Browse all available hardware, model, and infrastructure specs
- API Reference — Full programmatic API documentation
- Sensitivity Analysis — Identify binding constraints and derive hardware specs from SLA targets
- GPU vs. Wafer-Scale — Compare conventional and wafer-scale architectures
- Accuracy & Validation — How analytical bounds compare to empirical measurements