For Engineers & Researchers

Back-of-envelope estimates before you provision hardware.

MLSys·im gives you quick, type-safe analytical estimates for capacity planning, hardware selection, cost modeling, and sustainability analysis — in seconds, from specifications alone.


Why Use Analytical Models?

Before running expensive benchmarks or provisioning cloud instances, you need directional answers:

  • Will this model fit in GPU memory? — Check before renting the GPU
  • What’s the expected TTFT (time to first token) for my LLM? — Estimate before building the serving stack
  • How many H100s do I actually need? — Model scaling efficiency before buying the cluster
  • What will this cost per year? — TCO analysis before signing the contract

MLSys·im answers these in microseconds using first-order equations. It won’t replace profiling, but it tells you where to start profiling.
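The first-order model underneath this kind of estimate is essentially the roofline: a kernel takes as long as the slower of doing the math or moving the data. A minimal sketch of that idea (the hardware and workload numbers are illustrative, not values from MLSys·im's database):

```python
def roofline_latency_s(flops, bytes_moved, peak_flops, bandwidth):
    """First-order roofline: a kernel is limited by whichever is slower,
    executing the FLOPs or streaming the bytes."""
    compute_time = flops / peak_flops       # seconds if compute-bound
    memory_time = bytes_moved / bandwidth   # seconds if memory-bound
    return max(compute_time, memory_time)

# Illustrative A100-class numbers: ~312 TFLOP/s fp16, ~2 TB/s HBM bandwidth
latency = roofline_latency_s(
    flops=8.2e9,        # ~8.2 GFLOPs: ResNet-50 forward pass, batch 1
    bytes_moved=51e6,   # ~25.6M fp16 weights streamed once
    peak_flops=312e12,
    bandwidth=2.0e12,
)
```

Two numbers per workload and two per device are enough to get a directional answer, which is why these estimates cost microseconds rather than GPU-hours.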


Quick API Usage

import mlsysim
from mlsysim import Engine, ServingModel

# Single-node: Is ResNet-50 memory-bound on A100?
profile = Engine.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1, precision="fp16"
)
print(f"{profile.bottleneck}, {profile.latency.to('ms'):~.2f}")

# LLM serving: What's the TTFT for Llama-3.1-70B on H100?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)
print(f"TTFT: {result.ttft.to('ms'):~.1f}")
print(f"ITL:  {result.itl.to('ms'):~.2f}")
print(f"KV-cache: {result.kv_cache_size.to('GB'):~.1f}")
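The KV-cache figure can be cross-checked by hand. Using Llama-3-70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dim 128), the cache per sequence at fp16 is 2 tensors (K and V) × layers × kv_heads × head_dim × seq_len × 2 bytes:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # One K and one V tensor per layer, per token, per KV head
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"{size / 1e9:.2f} GB")  # ≈ 1.34 GB for one 4096-token sequence
```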

Hardware Sweep Pattern

Compare devices programmatically instead of reading datasheets:

import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50

for hw in [mlsysim.Hardware.Cloud.H100,
           mlsysim.Hardware.Cloud.A100,
           mlsysim.Hardware.Cloud.T4,
           mlsysim.Hardware.Edge.JetsonAGX]:
    p = Engine.solve(model=model, hardware=hw, batch_size=32, precision="fp16")
    print(f"{hw.name:20s}  {p.bottleneck:16s}  {p.latency.to('ms'):>8.2f~}  {p.throughput:>8.0f} img/s")

Batch Size Sweep: Finding the Regime Transition

The most important tuning knob for inference cost optimization is batch size. Sweep it to find where the workload transitions from memory-bound to compute-bound:

import mlsysim
from mlsysim import Engine

model = mlsysim.Models.ResNet50
hw = mlsysim.Hardware.Cloud.A100

for bs in [1, 4, 16, 32, 64, 128, 256]:
    p = Engine.solve(model=model, hardware=hw, batch_size=bs, precision="fp16")
    print(f"BS={bs:>4d}  {p.bottleneck:16s}  {p.latency.to('ms'):>8.2f~}  {p.throughput:>8.0f} img/s")

At small batch sizes the workload is memory-bound (loading weights dominates), but as batch size grows past the ridge point, it becomes compute-bound. The optimal batch size for cost efficiency is typically just above this transition.
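The ridge point itself falls straight out of the device specs: it is peak FLOPS divided by memory bandwidth, in FLOP/byte. A rough sketch with illustrative A100 fp16 numbers (not values read from MLSys·im):

```python
# Illustrative A100 fp16 specs
peak_flops = 312e12   # FLOP/s
bandwidth = 2.0e12    # byte/s

# Arithmetic intensity at the memory-bound / compute-bound boundary
ridge = peak_flops / bandwidth   # FLOP/byte
print(f"ridge point: {ridge:.0f} FLOP/byte")

def bottleneck(flops, bytes_moved):
    """A workload below the ridge in FLOPs-per-byte is memory-bound;
    batching raises intensity by reusing the same weights across samples."""
    return "memory-bound" if flops / bytes_moved < ridge else "compute-bound"
```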


Composing Solvers for Real Questions

MLSys·im solvers are designed to chain — the output of one feeds into the next:

“Can I serve Llama-70B on 4 H100s within budget?”

import mlsysim
from mlsysim import ServingModel, EconomicsModel

# Step 1: Does it fit and what's the latency?
serving = ServingModel()
result = serving.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    seq_len=4096, batch_size=1
)

# Step 2: What does that fleet cost?
econ = EconomicsModel()
cost = econ.solve(
    fleet=mlsysim.Systems.Clusters.Research_256,
    duration_days=365,
    kwh_price=0.08
)
print(f"Annual TCO: ${cost.tco_usd:,.0f}")
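EconomicsModel's internals are not shown here, but a first-order annual TCO is just amortized capex plus energy at the meter. A hypothetical sketch, with every price and power figure illustrative:

```python
def annual_tco_usd(capex_usd, amortization_years, power_kw, pue, kwh_price):
    """First-order TCO: amortized hardware cost plus facility energy.
    PUE scales IT power up to power drawn at the meter."""
    energy_kwh = power_kw * pue * 24 * 365
    return capex_usd / amortization_years + energy_kwh * kwh_price

# Hypothetical fleet: 256 GPUs at $30k each, 4-year amortization,
# 32 nodes at ~10 kW each (GPUs + host), PUE 1.2, $0.08/kWh
tco = annual_tco_usd(capex_usd=256 * 30_000, amortization_years=4,
                     power_kw=32 * 10, pue=1.2, kwh_price=0.08)
```

Even this crude version makes the key point visible: for large fleets, amortized capex usually dominates the electricity bill.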

“Where should I train to minimize carbon?”

import mlsysim
from mlsysim import SustainabilityModel

sustain = SustainabilityModel()
for grid in [mlsysim.Infra.Grids.Quebec, mlsysim.Infra.Grids.US_Avg,
             mlsysim.Infra.Grids.Poland]:
    r = sustain.solve(
        fleet=mlsysim.Systems.Clusters.Research_256,
        duration_days=30,
        datacenter=grid
    )
    print(f"{grid.name:12s}  {r.carbon_footprint_kg / 1000:>8.1f} t CO2e")
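The arithmetic behind such a comparison is energy times the grid's carbon intensity. A sketch with illustrative emission factors (real grid factors vary by year and data source):

```python
def carbon_kg(power_kw, pue, hours, grid_kgco2_per_kwh):
    # Facility energy (IT power x PUE) times the grid's emission factor
    return power_kw * pue * hours * grid_kgco2_per_kwh

# Illustrative factors: hydro-heavy grid ~0.03, coal-heavy grid ~0.7 kgCO2e/kWh
hours = 30 * 24
low = carbon_kg(power_kw=320, pue=1.2, hours=hours, grid_kgco2_per_kwh=0.03)
high = carbon_kg(power_kw=320, pue=1.2, hours=hours, grid_kgco2_per_kwh=0.7)
```

Because the workload term is identical, the footprint ratio between sites is just the ratio of grid intensities, which is why siting can matter more than hardware efficiency.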

“Should I invest in FLOPS or bandwidth for next-gen inference?”

import mlsysim
from mlsysim import SensitivitySolver

solver = SensitivitySolver()
res = solver.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.A100
)

print(res.sensitivities)
# {'peak_flops': -0.06, 'memory_bandwidth': -0.88, 'memory_capacity': 0.00}
print(f"Binding constraint: {res.binding_constraint}")
# → memory_bandwidth (10% more BW = 8.8% faster; 10% more FLOPS = 0.6% faster)
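These sensitivities are elasticities: log-log derivatives of latency with respect to each spec. They can be reproduced by finite differences on any latency model; here is a sketch against a plain roofline (workload numbers are illustrative). A purely memory-bound point gives exactly −1 for bandwidth and 0 for FLOPS; intermediate values like the −0.88 above presumably reflect a workload that is only partly bandwidth-limited.

```python
import math

def toy_latency(peak_flops, bandwidth, flops=1.4e11, bytes_moved=1.4e11):
    # Toy roofline latency; default workload numbers are illustrative
    return max(flops / peak_flops, bytes_moved / bandwidth)

def elasticity(f, base, param, eps=0.01):
    """d log(latency) / d log(param), via a small multiplicative bump."""
    bumped = dict(base)
    bumped[param] = base[param] * (1 + eps)
    return (math.log(f(**bumped)) - math.log(f(**base))) / math.log(1 + eps)

base = {"peak_flops": 312e12, "bandwidth": 2.0e12}
print(elasticity(toy_latency, base, "bandwidth"))   # -1 when memory-bound
print(elasticity(toy_latency, base, "peak_flops"))  # 0 when memory-bound
```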

Production Workflow: Capacity Planning

MLSys·im fits into a four-stage capacity planning workflow:

| Stage | What You Do | Which Solvers | Output |
|---|---|---|---|
| 1. Requirements | Define SLA targets (latency, throughput) | SynthesisSolver | Minimum hardware specs |
| 2. Shortlisting | Evaluate 5–10 hardware configs | ServingModel + EconomicsModel | Ranked candidates with TCO |
| 3. Validation | Profile top 2–3 candidates | Empirical (vLLM, PyTorch Profiler) | Ground-truth measurements |
| 4. Procurement | Make the business case | SustainabilityModel + SensitivitySolver | TCO, carbon, binding constraint report |

MLSys·im handles stages 1, 2, and 4. Stage 3 requires empirical profiling on real hardware. The value is in narrowing the design space from thousands of configurations to a handful worth profiling — saving weeks of benchmarking time.

Tip: Safety Margins

MLSys·im models workloads in isolation. Production systems run multiple workloads on shared infrastructure with context switching, memory fragmentation, and co-location interference. Apply a 1.3–2× safety margin on MLSys·im latency estimates for production SLA planning.
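In practice the margin is a one-line de-rating applied before the SLA check. A hypothetical sketch (both the estimate and the SLA target are illustrative):

```python
# De-rate the analytical estimate before comparing against the SLA target,
# to absorb co-location interference, fragmentation, and context switching
analytical_itl_ms = 22.0   # illustrative MLSys·im estimate
margin = 1.5               # pick 1.3-2x depending on co-location risk
planning_itl_ms = analytical_itl_ms * margin

meets_sla = planning_itl_ms <= 40.0   # illustrative 40 ms/token SLA
print(planning_itl_ms, meets_sla)
```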

“Will my 30-day training run complete without failure?”

import mlsysim
from mlsysim import ReliabilityModel, CheckpointModel

# Step 1: How often will the cluster fail?
reliability = ReliabilityModel()
rel = reliability.solve(
    fleet=mlsysim.Systems.Clusters.H100_256,
    job_duration_days=30
)
print(f"Fleet MTBF: {rel.fleet_mtbf.to('hours'):~.1f}")
print(f"Expected failures in 30 days: {rel.expected_failures:.1f}")
print(f"Optimal checkpoint interval: {rel.optimal_interval.to('minutes'):~.1f}")

# Step 2: What's the MFU penalty of checkpointing at that interval?
ckpt = CheckpointModel()
penalty = ckpt.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    hardware=mlsysim.Hardware.Cloud.H100,
    optimizer="adam",
    checkpoint_interval=rel.optimal_interval
)
print(f"Checkpoint size: {penalty.checkpoint_size.to('GB'):~.1f}")
print(f"MFU penalty: {penalty.mfu_penalty*100:.1f}%")
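MLSys·im's interval formula is not shown here, but the standard first-order answer is the Young/Daly approximation: the optimal interval is sqrt(2 × checkpoint_write_time × MTBF), balancing expected lost work against checkpoint overhead. A sketch with illustrative inputs:

```python
import math

def youngdaly_interval_s(checkpoint_time_s, mtbf_s):
    # Young/Daly first-order optimum: balances work lost to failures
    # against time spent writing checkpoints
    return math.sqrt(2 * checkpoint_time_s * mtbf_s)

# Illustrative: 2-minute checkpoint write, 12-hour fleet MTBF
interval = youngdaly_interval_s(120, 12 * 3600)
print(f"{interval / 60:.1f} min")  # ≈ 53.7 min
```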

“What hardware do I need to meet this latency SLA?”

import mlsysim
from mlsysim import SynthesisSolver

# Invert the Roofline: given a target, derive minimum specs
synth = SynthesisSolver()
specs = synth.solve(
    model=mlsysim.Models.Language.Llama3_70B,
    target_latency=mlsysim.Q_("30 ms/token"),  # ITL SLA
    precision="fp16"
)
print(f"Minimum memory BW: {specs.required_bandwidth.to('TB/s'):~.2f}")
print(f"Minimum memory:    {specs.required_memory.to('GB'):~.1f}")
print(f"Minimum FLOPS:     {specs.required_flops.to('TFLOP/s'):~.1f}")

# Which hardware in the Zoo meets these specs?
for hw in mlsysim.Hardware.all():
    if (hw.memory.bandwidth >= specs.required_bandwidth and
        hw.memory.capacity >= specs.required_memory):
        print(f"  ✓ {hw.name}")
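The bandwidth requirement is easy to sanity-check by hand: memory-bound decode must stream every weight once per generated token, so required bandwidth ≈ weight bytes / target ITL. Hand arithmetic for a 70B-parameter model at fp16, ignoring KV-cache traffic:

```python
weight_bytes = 70e9 * 2   # 70B parameters at fp16 (2 bytes each)
target_itl_s = 0.030      # 30 ms/token SLA
required_bw = weight_bytes / target_itl_s
print(f"{required_bw / 1e12:.2f} TB/s")  # ≈ 4.67 TB/s
```

At roughly 4.67 TB/s, no single current GPU suffices, which is exactly the kind of conclusion the synthesis step is meant to surface before shortlisting multi-device configurations.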

Dimensional Strictness: Why Units Matter

All quantities in MLSys·im are pint.Quantity objects with physical units enforced at runtime. This prevents the most common class of back-of-envelope errors — the kind that cause real production incidents:

import mlsysim

hw = mlsysim.Hardware.Cloud.A100

# These work — units are compatible:
ridge_point = hw.compute.peak_flops / hw.memory.bandwidth  # → FLOP/byte ✓
time = model_bytes / hw.memory.bandwidth                    # → seconds ✓

# These raise DimensionalityError — caught before producing a bad number:
hw.memory.bandwidth + hw.compute.peak_flops   # GB/s + FLOP/s → ERROR
hw.memory.bandwidth.to("FLOP/s")              # GB/s ≠ FLOP/s → ERROR

In spreadsheet modeling, confusing gigabytes with gigabits or mixing per-device and per-node bandwidth silently produces an 8× or 4× error. In MLSys·im, these errors are structurally impossible. Think of it as type safety for systems analysis.
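Here is a minimal sketch of the mechanism without pint: a toy unit-tagged number that refuses to add incompatible quantities. (pint's real implementation does full dimensional reduction and unit conversion; this only shows the shape of the guarantee.)

```python
class Quantity:
    """Toy unit-tagged number: enough to show how unit mismatches
    become exceptions instead of silently wrong results."""
    def __init__(self, value, unit):
        self.value, self.unit = value, unit

    def __add__(self, other):
        # Addition only makes sense between identical units
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

    def __truediv__(self, other):
        # Division composes units instead of discarding them
        return Quantity(self.value / other.value,
                        f"({self.unit})/({other.unit})")

bandwidth = Quantity(2.0e12, "byte/s")
peak = Quantity(312e12, "FLOP/s")

ridge = peak / bandwidth   # fine: value 156.0, unit "(FLOP/s)/(byte/s)"
# peak + bandwidth         # raises TypeError instead of emitting a number
```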


Writing Custom Solvers

Follow the built-in solver pattern to create your own analysis:

from mlsysim.hardware.types import HardwareNode

class PowerEfficiencyModel:
    def solve(self, hardware: HardwareNode) -> dict:
        flops_per_watt = hardware.compute.peak_flops / hardware.tdp
        return {
            "device": hardware.name,
            "flops_per_watt": flops_per_watt.to("TFLOP/s/kW"),
        }

See Extending MLSys·im for the full guide.


Next Steps
