MLSys·im
Build intuition for ML system performance, cost, and carbon — from first principles.
pip install mlsysim
import mlsysim
from mlsysim import Engine

profile = Engine.solve(
    model=mlsysim.Models.ResNet50,
    hardware=mlsysim.Hardware.Cloud.A100,
    batch_size=1,
    precision="fp16",
)
Bottleneck: Memory Bound
Latency: 0.42 ms
Throughput: 2,381 img/s
MFU: 12.4%
Memory: 0.10 GB / 80 GB
AI (FLOP/B): 4.2 ← below ridge point
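The "below ridge point" verdict can be sanity-checked by hand. The ridge point of a roofline is peak compute divided by peak memory bandwidth; the A100 figures below are NVIDIA's published specs (~312 TFLOP/s FP16 tensor peak, ~2.0 TB/s HBM bandwidth), and the helper functions are an illustrative sketch, not the mlsysim API:

```python
# Back-of-envelope roofline check (illustrative sketch, not the mlsysim API).
# Published NVIDIA A100 80GB specs: ~312 TFLOP/s FP16 tensor core peak,
# ~2.0 TB/s HBM bandwidth.
PEAK_FLOPS = 312e12   # FLOP/s
PEAK_BW = 2.0e12      # bytes/s

def ridge_point(peak_flops, peak_bw):
    """Arithmetic intensity (FLOP/B) where the memory roof meets the compute roof."""
    return peak_flops / peak_bw

def bottleneck(ai, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Below the ridge point a workload is memory-bound; above it, compute-bound."""
    return "memory-bound" if ai < ridge_point(peak_flops, peak_bw) else "compute-bound"

print(ridge_point(PEAK_FLOPS, PEAK_BW))  # 156.0 FLOP/B for the A100
print(bottleneck(4.2))                   # memory-bound
```

An arithmetic intensity of 4.2 FLOP/B sits far below the A100's ridge point of 156, which is why ResNet-50 at batch size 1 shows only 12.4% MFU.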
Identify whether your workload is memory-bound or compute-bound on any hardware.
Model pre-fill and decode phases, KV-cache pressure, and time-to-first-token.
Same workload, different region — up to 41× difference in carbon footprint.
3D parallelism decomposition with scaling efficiency and pipeline bubble analysis.
Is my workload memory-bound or compute-bound?
How many GPUs do I need to train a 70B model in 24 hours?
What is the carbon footprint of training in Iowa vs. Quebec?
What is the optimal checkpoint interval for a 1000-GPU job?
How much does quantization to INT8 actually save in latency?
What is the 3-year TCO for a 64×H100 cluster?
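The GPU-count question above yields to a back-of-envelope estimate using the common FLOPs ≈ 6 · params · tokens approximation for dense transformer training. The token budget, H100 peak, and sustained MFU below are illustrative assumptions, not mlsysim outputs:

```python
# Rough GPU count for "train a 70B model in 24 hours", via the standard
# training-cost approximation FLOPs ≈ 6 * parameters * tokens.
# Assumptions (illustrative): 1.4T tokens (Chinchilla-style ~20 tokens/param),
# H100 BF16 dense tensor peak ~989 TFLOP/s, 40% sustained MFU.
import math

params = 70e9
tokens = 1.4e12                       # assumed training token budget
total_flops = 6 * params * tokens     # ~5.9e23 FLOPs

peak = 989e12                         # H100 BF16 peak (published spec)
mfu = 0.40                            # assumed sustained utilization
per_gpu_day = peak * mfu * 24 * 3600  # FLOPs one GPU delivers in 24 h

gpus = math.ceil(total_flops / per_gpu_day)
print(gpus)                           # on the order of 17,000 GPUs
```

Under these assumptions the answer lands in the tens of thousands of GPUs, which is why the token budget and achievable MFU dominate any such plan.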
Roofline Analysis Compute vs. memory bottleneck identification using the Iron Law. Single-node latency and throughput.
LLM Serving Time-to-first-token (TTFT), inter-token latency (ITL), and KV-cache memory pressure.
3D Parallelism Data, tensor, and pipeline parallel scaling with communication overhead and bubble analysis.
Sustainability Energy, carbon footprint (kg CO₂e), and water usage across datacenter regions.
Total Cost of Ownership CapEx, OpEx, electricity, maintenance, and per-query economics over any time horizon.
Reliability & Queueing Fleet MTBF, checkpoint intervals, tail latency (P99), and SLA compliance.
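The "optimal checkpoint interval" in the reliability card is classically estimated with Young's approximation, τ ≈ √(2 · C · MTBF), where C is the checkpoint write time and MTBF is the mean time between failures of the whole fleet. The per-GPU MTBF and checkpoint cost below are illustrative assumptions:

```python
# Young's approximation for the optimal checkpoint interval:
#   tau ≈ sqrt(2 * C * MTBF_fleet)
# Numbers below are illustrative assumptions, not mlsysim defaults.
import math

per_gpu_mtbf_h = 50_000      # assumed single-GPU MTBF (hours)
n_gpus = 1000
fleet_mtbf_s = per_gpu_mtbf_h * 3600 / n_gpus  # any-component failure: 50 h

checkpoint_s = 300           # assumed 5-minute checkpoint write (C)

tau_s = math.sqrt(2 * checkpoint_s * fleet_mtbf_s)
print(round(tau_s / 3600, 2))  # optimal interval in hours (~2.9 h here)
```

Note how the fleet MTBF shrinks linearly with GPU count: the same per-GPU reliability that allows daily checkpoints on 8 GPUs forces near-hourly checkpoints at cluster scale.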
Memory-bound vs. compute-bound in 5 lines of Python. Sweep batch sizes and see the roofline crossover.
Quantify the memory footprint of model weights, activations, and optimizer state. Find out why your 7B model won't fit on one GPU.
Model the two phases of autoregressive generation (pre-fill and decode) and diagnose KV-cache pressure.
INT8 vs. FP16 vs. FP4 — measure the memory savings, throughput gains, and accuracy costs of compression.
Ring all-reduce communication, pipeline bubbles, and scaling efficiency on 256 GPUs.
Same model, same GPU, yet up to 41× difference in carbon footprint depending on where you train.
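The region effect reduces to a simple product: energy drawn times grid carbon intensity. The grid intensities, GPU power, and PUE below are illustrative round numbers (hydro-heavy grids can sit near 0.03 kg CO₂e/kWh, coal-heavy grids above 0.7), not mlsysim data:

```python
# Carbon footprint ≈ energy (kWh) * grid carbon intensity (kg CO2e/kWh).
# All inputs below are illustrative assumptions, not mlsysim data.
def training_co2e_kg(gpu_hours, gpu_power_kw, pue, intensity_kg_per_kwh):
    """kg CO2e for a run, including datacenter overhead via PUE."""
    energy_kwh = gpu_hours * gpu_power_kw * pue
    return energy_kwh * intensity_kg_per_kwh

run = dict(gpu_hours=10_000, gpu_power_kw=0.7, pue=1.2)  # assumed workload
low = training_co2e_kg(**run, intensity_kg_per_kwh=0.03)   # hydro-heavy grid
high = training_co2e_kg(**run, intensity_kg_per_kwh=0.75)  # coal-heavy grid
print(low, high, high / low)  # 252.0 kg vs 6300.0 kg: a 25x gap
```

Even this rough two-region comparison shows a 25× spread from grid mix alone; pairing the cleanest and dirtiest real regions widens the gap further.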