Three first-order models for questions students ask after roofline analysis.
analysis
advanced
Use MLSys·im to explain training memory pressure, size a serving deployment, and model MoE routing imbalance without leaving the analytical framework.
The Question
You know how to run a roofline analysis. Now you want to answer three follow-up questions that appear in real design reviews:
Why does training need much more memory than inference?
How many replicas do I need for a QPS and P99 target?
How much does MoE hot-expert imbalance change the communication cost?
These are not three separate problems. They are different views of the same demand–supply framework: tensors consume memory, requests consume serving capacity, and routed tokens consume network bandwidth.
Note: What You Will Learn
Break training memory into weights, gradients, optimizer state, activations, and communication buffers.
Turn a serving target into a first-pass replica count.
Sweep a simple MoE routing imbalance factor and read the all-to-all cost.
Decide which numbers are direct byte counts and which require calibration.
1. Training Memory Is Not Inference Memory
Inference mainly stores model weights and KV cache. Training also stores gradients, optimizer state, activations, and communication buffers. That is why a model that fits for inference can fail during training.
The first three terms (weights, gradients, and optimizer state) are direct byte counts. Activation memory is the term most sensitive to framework behavior, because checkpointing changes what the backward pass must store versus recompute.
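The breakdown above can be sketched in plain Python. This is a minimal illustration, not the MLSys·im API: the function name, the `dp_degree` parameter, and the default byte counts (2-byte weights and gradients, 12 bytes per parameter of Adam state under mixed precision) are assumptions chosen for the example. Activations and communication buffers are passed in as given numbers because they need a measurement or a separate model.

```python
from dataclasses import dataclass


@dataclass
class TrainingMemory:
    """Per-device training memory in bytes, split by component."""
    weights: float
    gradients: float
    optimizer: float
    activations: float
    comm_buffers: float

    @property
    def total(self) -> float:
        return (self.weights + self.gradients + self.optimizer
                + self.activations + self.comm_buffers)


def training_memory_bytes(params, dp_degree=1, zero_stage=0,
                          weight_bytes=2, grad_bytes=2, optimizer_bytes=12,
                          activation_bytes=0.0, comm_buffer_bytes=0.0):
    """Hypothetical helper: direct byte counts plus ZeRO-style sharding.

    Assumed defaults: fp16/bf16 weights and gradients, Adam with an fp32
    master copy plus two fp32 moments (12 bytes per parameter).
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    weights = params * weight_bytes
    gradients = params * grad_bytes
    optimizer = params * optimizer_bytes
    # ZeRO shards state across data-parallel ranks, one more term per stage.
    if zero_stage >= 1:
        optimizer /= dp_degree
    if zero_stage >= 2:
        gradients /= dp_degree
    if zero_stage >= 3:
        weights /= dp_degree
    # Activations and buffers are not sharded by ZeRO in this sketch.
    return TrainingMemory(weights, gradients, optimizer,
                          activation_bytes, comm_buffer_bytes)


if __name__ == "__main__":
    for stage in (0, 1, 2, 3):
        m = training_memory_bytes(7_000_000_000, dp_degree=8, zero_stage=stage)
        print(stage, f"{m.total / 1e9:.1f} GB of model state per device")
```

The sharding only touches the three direct byte-count terms, which is why the activation argument passes through unchanged at every stage.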
Exercise
Change zero_stage from 0 to 3. Which components shrink? Why do activations not shrink from ZeRO alone?
2. Serving Capacity Combines Three Walls
A serving deployment is not sized by time to first token (TTFT) alone. You need base request latency, per-replica token capacity, and queueing pressure under load.
The efficiency parameter is exposed because the compute-bound part of serving depends on implementation quality. Use the default for quick comparisons; use a measured efficiency value before making a production SLA commitment.
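One way to see how the three walls interact is a small stand-alone sketch. It is not the MLSys·im implementation. Every name here is hypothetical, and it makes three loud assumptions: prefill time scales as `ideal_prefill_s / efficiency`, per-replica throughput follows Little's law (`slots / base latency`), and queueing is approximated by treating each replica as an M/M/1 server.

```python
import math


def base_latency_s(ideal_prefill_s, tpot_s, output_tokens, efficiency=0.5):
    """Unloaded request latency: prefill (scaled by efficiency) plus decode."""
    return ideal_prefill_s / efficiency + output_tokens * tpot_s


def p99_latency_s(qps, replicas, base_s, slots):
    """Crude P99 estimate: base latency plus the M/M/1 P99 waiting time."""
    mu = slots / base_s            # per-replica service rate, requests/s
    lam = qps / replicas           # per-replica arrival rate, requests/s
    if lam >= mu:
        return math.inf            # overloaded: the queue grows without bound
    rho = lam / mu
    # M/M/1 waiting-time tail: P(W > t) = rho * exp(-(mu - lam) * t).
    wait = math.log(rho / 0.01) / (mu - lam) if rho > 0.01 else 0.0
    return base_s + wait


def replicas_needed(qps, p99_target_s, base_s, slots, max_replicas=100_000):
    """Smallest replica count whose estimated P99 meets the target."""
    if base_s > p99_target_s:
        raise ValueError("base latency alone exceeds the P99 target")
    for n in range(1, max_replicas + 1):
        if p99_latency_s(qps, n, base_s, slots) <= p99_target_s:
            return n
    raise ValueError("no replica count up to max_replicas meets the target")


if __name__ == "__main__":
    for tokens in (64, 128):
        base = base_latency_s(0.05, 0.02, tokens, efficiency=0.5)
        n = replicas_needed(qps=100, p99_target_s=5.0, base_s=base, slots=8)
        print(f"output_tokens={tokens}: base={base:.2f} s, replicas={n}")
```

Because the base latency eats into the P99 budget while also lowering per-replica throughput, the replica count in this sketch does not scale linearly with output length. Replace the assumed efficiency and queueing model with measured values before relying on the numbers.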
Exercise
Double output_tokens from 64 to 128. Does the replica count change by exactly 2x? Explain the difference between base latency and queueing latency.
3. MoE Routing Imbalance
Mixture-of-Experts models reduce compute by activating only a subset of experts, but routed tokens create expert-parallel all-to-all traffic. Perfectly balanced routing is an idealization; hot experts increase the effective active work and the routed payload.
Imbalance Effective Experts Routed Bytes All-to-All
──────────────────────────────────────────────────────
1.00 2.00 117.4 MB 9.75 ms
1.25 2.50 146.8 MB 12.09 ms
1.50 3.00 176.2 MB 14.44 ms
2.00 4.00 234.9 MB 19.14 ms
This model does not simulate a router. It gives you a clean sensitivity knob: if measured routing logs show a 25% hot-expert effect, set routing_imbalance_factor=1.25 and see how the communication wall moves.
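The sweep table can be approximately reproduced with a few lines of arithmetic. The sketch below is an inference from the printed numbers, not the library's internal formula: it assumes effective experts and routed bytes scale linearly with the imbalance factor, and it fits a fixed-latency-plus-bandwidth line through the first and last table rows. All names are hypothetical.

```python
# Endpoints copied from the sweep table: (imbalance, routed MB, all-to-all ms).
ROW_BALANCED = (1.00, 117.4, 9.75)
ROW_HOT = (2.00, 234.9, 19.14)
TOP_K = 2.0  # "Effective Experts" at imbalance 1.00 in the table


def fit_alpha_beta(row_a, row_b):
    """Fit time_ms = fixed_ms + ms_per_mb * routed_mb through two rows."""
    _, mb_a, ms_a = row_a
    _, mb_b, ms_b = row_b
    ms_per_mb = (ms_b - ms_a) / (mb_b - mb_a)
    fixed_ms = ms_a - ms_per_mb * mb_a
    return fixed_ms, ms_per_mb


FIXED_MS, MS_PER_MB = fit_alpha_beta(ROW_BALANCED, ROW_HOT)


def sweep_point(imbalance):
    """Return (effective experts, routed MB, all-to-all ms) for one factor."""
    effective_experts = TOP_K * imbalance          # assumed linear scaling
    routed_mb = ROW_BALANCED[1] * imbalance        # assumed linear scaling
    all_to_all_ms = FIXED_MS + MS_PER_MB * routed_mb
    return effective_experts, routed_mb, all_to_all_ms


if __name__ == "__main__":
    for factor in (1.00, 1.25, 1.50, 2.00):
        experts, mb, ms = sweep_point(factor)
        print(f"{factor:.2f}  {experts:.2f}  {mb:6.1f} MB  {ms:5.2f} ms")
```

The fitted line has a small fixed term plus a term proportional to routed bytes, so in this range the all-to-all time grows almost in proportion to the imbalance factor.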
4. What Counts as Validation?
For these models, validation means matching the level of the approximation:
Output                                        How to validate
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Weight, gradient, optimizer, KV-cache bytes   Compare against direct tensor counts.
Activation memory                             Compare against framework memory traces for one model/config.
Serving replica count                         Benchmark one deployment point, calibrate efficiency, then sweep.
MoE routing imbalance                         Measure tokens per expert and feed the observed imbalance into the model.
MLSys·im should be used to identify the binding constraint and compare design options before benchmarking. It should not replace empirical measurement for a final production SLA.
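For the MoE validation step, measured tokens per expert have to be reduced to a single factor. The source does not say how the factor is defined, so the sketch below assumes one common convention: the busiest expert's token count divided by the mean. The function name is hypothetical.

```python
def observed_imbalance_factor(tokens_per_expert):
    """Assumed definition: max tokens on any expert / mean tokens per expert."""
    if not tokens_per_expert:
        raise ValueError("need at least one expert count")
    mean = sum(tokens_per_expert) / len(tokens_per_expert)
    if mean == 0:
        raise ValueError("no tokens were routed")
    return max(tokens_per_expert) / mean


if __name__ == "__main__":
    # Hypothetical counts from a routing log for four experts.
    counts = [1250, 1000, 900, 850]
    print(observed_imbalance_factor(counts))
```

With perfectly balanced routing this ratio is 1.0, which matches the first row of the sweep. If your logs suggest a different definition fits better, use that one consistently for both measurement and modeling.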
Next Step
Use these models inside a larger analysis:
TrainingMemoryModel before DistributedModel to rule out impossible training configurations.
ServingCapacityModel before EconomicsModel to convert traffic targets into fleet size and cost.
MoERoutingModel with DistributedModel(..., moe_routing_imbalance_factor=...) to see whether expert parallelism is limited by bandwidth.