Mathematical Foundations
The First-Principles Equations Behind Every MLSys·im Solver
MLSys·im avoids “black box” heuristics. Every output traces back to one of the equations below, each grounded in a published systems paper. Before diving into code, read this page to understand what the 22 solvers (20 analytical models and 2 standalone solvers) that address the 22 Systems Walls actually compute, and why.
Each solver in MLSys·im implements one or more of the models below. Click any solver name to go directly to its API documentation.
Notation Reference
All quantities carry SI units at runtime via pint. The table below defines every symbol used in the equations that follow.
| Symbol | Unit | Description |
|---|---|---|
| **Model & Workload** | | |
| \(P\) | params | Model parameter count |
| \(P_{\ell}\) | params | Per-layer parameter count |
| \(\|W\|\) | bytes | Total bytes read per step (weights + KV cache) |
| \(\|W_{\ell}\|\) | bytes | Per-layer weight size (\(P_{\ell} \times b_{\text{prec}}\)) |
| \(b_{\text{prec}}\) | bytes/param | Numerical precision (e.g., 2 for FP16) |
| \(L\) | — | Number of layers |
| \(H\) | — | Number of attention heads (or KV heads) |
| \(D\) | — | Head dimension |
| \(S\) | tokens | Sequence length |
| \(B\) | samples | Batch size |
| \(K\) | — | Number of reasoning steps |
| \(C\) | FLOPs | Total training compute (\(6PD\)) |
| \(I\) | FLOP/byte | Arithmetic intensity |
| **Hardware & Infrastructure** | | |
| \(\text{Peak}_{\text{FLOPS}}\) | FLOP/s | Peak accelerator throughput |
| \(BW_{\text{HBM}}\) | bytes/s | HBM bandwidth |
| \(BW_{\text{inject}}\) | bytes/s | Weight injection bandwidth (wafer-scale) |
| \(BW_{\text{link}}\) | bytes/s | Per-link network bandwidth |
| \(N\) | — | Number of nodes in fleet |
| \(G\) | — | GPUs per node |
| \(\alpha\) | seconds | Per-hop network latency |
| **Efficiency & Utilization** | | |
| \(\eta\) | — | Hardware utilization (\(\approx\) MFU) |
| \(\eta_{\text{overlap}}\) | — | Compute–communication overlap efficiency |
| \(\rho\) | — | Utilization ratio (queueing or data) |
| \(\beta\) | — | Bisection bandwidth fraction |
| \(\beta_{\text{opt}}\) | — | Optimizer state multiplier (total ckpt bytes / weight bytes) |
| **Parallelism** | | |
| TP, PP, DP, EP | — | Tensor, pipeline, data, expert parallelism |
| \(V\) | — | Virtual pipeline stages |
| \(M_{\text{micro}}\) | — | Number of microbatches |
| **Sustainability** | | |
| PUE | — | Power Usage Effectiveness |
| CI | gCO₂/kWh | Regional carbon intensity |
| WUE | L/kWh | Water Usage Effectiveness |
| **Key Derived Quantities** | | |
| \(I^{*}\) | FLOP/byte | Ridge point (\(\text{Peak}/BW_{\text{HBM}}\)) |
| \(B^{*}\) | samples | Optimal batch size (streaming wall) |
| \(P^{*}\) | params | Compute-optimal model size |
| \(D^{*}\) | tokens | Compute-optimal dataset size |
| \(\tau_{\text{opt}}\) | seconds | Optimal checkpoint interval |
1. The Iron Law of ML Systems (Single Node)
Implemented in mlsysim.core.solver.SingleNodeModel.
\[ T = \max \left( \frac{\text{FLOPs}}{\text{Peak\_FLOPs} \times \eta},\ \frac{\text{Bytes}}{\text{Memory\_BW}} \right) + \text{Dispatch\_Tax} \]
Where:
- \(\eta\) is the hardware utilization efficiency (typically 0.25–0.55 in practice; MLSys·im defaults to 0.5, with 0.35 recommended for inference — see Accuracy & Validation for guidance).
- \(\text{Dispatch\_Tax}\) is the constant kernel-launch overhead (e.g., CUDA overhead, ~0.01–0.1 ms).
- If \(\frac{\text{FLOPs}}{\text{Peak\_FLOPs} \times \eta} > \frac{\text{Bytes}}{\text{Memory\_BW}}\): Compute-bound — buy faster GPUs or increase arithmetic intensity.
- If \(\frac{\text{Bytes}}{\text{Memory\_BW}}\) wins: Memory-bound — increase batch size or use operator fusion.
Arithmetic Intensity is the key ratio: \(I = \text{FLOPs} / \text{Bytes}\). The roofline ridge point is \(I^* = \text{Peak\_FLOPs} / \text{Memory\_BW}\). If \(I > I^*\), you are compute-bound. If \(I < I^*\), you are memory-bound.
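As a concrete sketch, the Iron Law and ridge point can be computed directly. The function names and the H100-class numbers below are illustrative assumptions, not the `mlsysim` API, and plain floats stand in for `pint` quantities:

```python
# Minimal sketch of the Iron Law (Section 1). Illustrative names, not the
# mlsysim SingleNodeModel API; units are plain floats rather than pint.

def iron_law_time(flops, bytes_moved, peak_flops, mem_bw,
                  eta=0.5, dispatch_tax=1e-4):
    """Step time = max(compute time, memory time) + kernel-launch overhead."""
    t_compute = flops / (peak_flops * eta)
    t_memory = bytes_moved / mem_bw
    return max(t_compute, t_memory) + dispatch_tax

def ridge_point(peak_flops, mem_bw):
    """Arithmetic intensity I* above which a kernel is compute-bound."""
    return peak_flops / mem_bw

# H100-class numbers: ~1e15 FLOP/s peak (FP16), ~3.35e12 B/s HBM bandwidth.
i_star = ridge_point(1e15, 3.35e12)   # ~298 FLOP/byte
```

A kernel with \(I < 298\) FLOP/byte on this hardware spends its time waiting on HBM, regardless of how fast the ALUs are.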
2. Distributed Training (3D Parallelism)
Implemented in mlsysim.core.solver.DistributedModel.
Why an analytical model? Real distributed training involves complex interactions between computation, communication, and scheduling. Empirical profiling requires access to expensive multi-GPU clusters and takes hours per configuration. MLSys·im instead decomposes the problem into three independent overheads — data parallelism (gradient synchronization), tensor parallelism (intra-layer communication), and pipeline parallelism (bubble idle time) — each governed by a closed-form equation. This lets you evaluate thousands of parallelism configurations in seconds to identify the best strategy before reserving cluster time.
The key insight is that each parallelism dimension introduces a communication tax that can be modeled from first principles: message size, network bandwidth, and topology. The single-GPU compute time comes from the roofline model (Section 1), and the distributed overhead is additive on top.
2.1 Scaling Efficiency
The solver computes an overall scaling efficiency — the fraction of ideal linear speedup actually achieved:
\[ \eta_{\text{scale}} = \frac{T_{\text{single}}}{T_{\text{single}} + T_{\text{dp}} + T_{\text{tp}} + T_{\text{bubble}}} \]
Where \(T_{\text{single}}\) is the per-GPU compute time (from the roofline model), and the remaining terms are the communication and scheduling overheads derived below. An efficiency of 80% on 256 GPUs means you get the equivalent throughput of ~205 GPUs — the rest is spent on communication.
2.2 Ring All-Reduce (Data Parallelism)
After each training step, every GPU must synchronize its gradients with every other GPU. The standard algorithm is ring all-reduce, which arranges GPUs in a logical ring and passes gradient chunks around it in two phases.
For a model of size \(M\) bytes distributed across \(N\) accelerators connected in a ring topology with inter-node bandwidth \(BW\) and per-hop latency \(\alpha\):
\[ T_{\text{dp}} = 2(N-1) \cdot \left( \frac{M / N}{BW} + \alpha \right) \]
The factor of 2 arises because ring all-reduce has two phases: scatter-reduce and all-gather, each requiring \(N-1\) communication steps. Each step transfers \(M/N\) bytes (one chunk of the gradient), so the total data transferred per GPU approaches \(2M\) as \(N\) grows — meaning the bandwidth cost is nearly independent of cluster size, which is why ring all-reduce scales well.
Implication: All-reduce cost grows linearly with model size \(M\) but is asymptotically constant in \(N\) — the factor \(2(N-1)/N\) approaches 2 as \(N\) grows, meaning adding more GPUs barely increases per-GPU communication time. For very large models (70B+ parameters = ~140 GB gradients in fp16), communication dominates at low batch sizes. Upgrading from 100 Gb Ethernet to InfiniBand NDR (400 Gb/s) can recover 10–30% scaling efficiency.
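The cost model above is a one-liner. This sketch uses illustrative names and link numbers (not the `DistributedModel` API); 400 Gb/s corresponds to 50 GB/s per link:

```python
# Hedged sketch of the ring all-reduce cost model above. model_bytes in
# bytes, bw in bytes/s, latency in seconds per hop.

def ring_allreduce_time(model_bytes, n_gpus, bw, latency):
    """2(N-1) steps (scatter-reduce + all-gather), each moving M/N bytes."""
    return 2 * (n_gpus - 1) * (model_bytes / n_gpus / bw + latency)

# 140 GB of fp16 gradients (70B params) over 50 GB/s links, 5 us/hop:
t = ring_allreduce_time(140e9, n_gpus=256, bw=50e9, latency=5e-6)  # ~5.6 s
```

Note that doubling \(N\) barely changes `t`: the bandwidth term per GPU is \(2M(N-1)/N\), which saturates near \(2M\).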
2.3 Pipeline Parallelism Bubble
Pipeline parallelism splits a model’s layers across multiple stages (nodes). Stage 1 processes layers 1–20, stage 2 processes layers 21–40, and so on. This allows models too large for a single GPU to be trained across multiple nodes.
The cost of pipelining is a pipeline bubble: at the start of each batch, downstream stages sit idle while waiting for upstream stages to produce output. When a pipeline of depth \(P\) processes \(M\) microbatches with \(V\) virtual stages per GPU, the fraction of time spent idle is:
\[ \text{Bubble Fraction} = \frac{P - 1}{V \times M + P - 1} \]
The intuition: with \(P\) stages and \(M\) microbatches, the pipeline takes time to fill and drain. The solution is to either increase \(M\) (more microbatches) or increase \(V\) (interleaved schedules). Both make the startup and drain phases a smaller fraction of total time.
Implication: To keep the bubble below 5% using standard 1F1B (\(V=1\)), you need \(M \geq 19 \cdot (P-1)\) microbatches. With a 4-stage pipeline (\(P=4\)), you need at least 57 microbatches. By using \(V=2\) virtual stages, you cut the required microbatches in half.
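The microbatch requirement follows directly from inverting the bubble formula: \(VM \geq (P-1)(1-t)/t\) for a target bubble \(t\). A minimal sketch (illustrative names, not the `mlsysim` API):

```python
import math

# Sketch of the pipeline-bubble formula and the microbatch count needed
# to stay under a target bubble fraction.

def bubble_fraction(stages, microbatches, virtual_stages=1):
    """Idle fraction: (P-1) / (V*M + P-1)."""
    return (stages - 1) / (virtual_stages * microbatches + stages - 1)

def microbatches_for_bubble(stages, target, virtual_stages=1):
    """Smallest M with bubble <= target: V*M >= (P-1)(1-t)/t."""
    return math.ceil((stages - 1) * (1 - target) / (target * virtual_stages))

m_v1 = microbatches_for_bubble(4, 0.05)     # 57 microbatches at V=1
m_v2 = microbatches_for_bubble(4, 0.05, 2)  # 29 at V=2, roughly half
```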
2.4 Expert Parallelism (Mixture of Experts)
To model MoE, we move from 3D to 4D Parallelism:
\[ \text{Data Parallelism} = \frac{\text{Total GPUs}}{TP \times PP \times EP} \]
Where \(EP\) is Expert Parallelism. If \(EP > 1\), the solver adds an All-to-All communication penalty for token routing:
\[ T_{\text{all-to-all}} = \frac{N-1}{N} \times \frac{\text{Message Size}}{\text{Bandwidth}} + (N-1) \times \text{Latency} \]
3. LLM Serving Lifecycle
Implemented in mlsysim.core.solver.ServingModel.
LLM autoregressive inference has two physically distinct phases. Understanding which phase dominates is critical for capacity planning.
3.1 Pre-fill Phase (Compute-Bound)
The initial forward pass over the full prompt is compute-bound because all tokens are processed in parallel:
\[ \text{TTFT} = \frac{2 \times \text{Parameters} \times \text{Seq\_Len} \times \text{Batch}}{\text{Peak\_FLOPs} \times \eta} + \text{Dispatch\_Tax} \]
The factor of 2 counts both the multiply and the add in each multiply-accumulate (MAC) operation.
3.2 Decoding Phase (Memory-Bound)
Each token decode step requires loading the entire model weight matrix plus the accumulated KV-cache:
\[ \text{ITL} = \frac{\text{Model\_Bytes} + \text{KV\_Cache\_Bytes}}{\text{Memory\_BW}} \]
This phase is almost always memory-bound on current hardware: generating one token still reads every weight byte once, but a matrix-vector product performs only about 2 FLOPs per parameter (roughly 1 FLOP per byte at fp16), far below the ridge point.
3.3 KV-Cache Size
Simplified form (contiguous allocation):
\[ \text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times S \times B \times b \]
where \(L\) is the number of layers, \(H_{\text{kv}}\) is the number of KV heads (equals \(H\) for MHA, less for GQA), \(d\) is the head dimension, \(S\) is the sequence length, \(B\) is the batch size, and \(b\) is the bytes per parameter.
PagedAttention form (paged allocation, as in vLLM):
\[ \text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b \]
where \(p\) is the page size in tokens. When \(p = S\) (no paging), this reduces to the simplified form.
The factor of 2 counts both the K and V matrices. At fp16 (2 bytes/param), a 70B-class model (80 layers, 64 heads of dimension 128, full MHA) with a 4096-token context at batch=32 requires roughly 344 GB of KV-cache — several times more than a single H100's 80 GB of HBM.
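Both formulas are simple products. The sketch below uses illustrative names (not the `ServingModel` API) and assumes an 80-layer, head-dim-128 model to compare MHA against GQA:

```python
import math

# Minimal sketch of the KV-cache formulas above. All sizes in bytes.

def kv_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_param=2):
    """Contiguous allocation; the leading 2 covers the K and V matrices."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_param

def kv_bytes_paged(layers, kv_heads, head_dim, seq_len, batch,
                   page_tokens, bytes_per_param=2):
    """Paged allocation rounds the sequence up to whole pages."""
    padded = math.ceil(seq_len / page_tokens) * page_tokens
    return 2 * layers * kv_heads * head_dim * padded * batch * bytes_per_param

# MHA (64 KV heads) vs. GQA (8 KV heads): L=80, d=128, S=4096, B=32, fp16.
mha = kv_bytes(80, 64, 128, 4096, 32)   # ~344 GB
gqa = kv_bytes(80, 8, 128, 4096, 32)    # ~43 GB
```

GQA shrinks the cache by a factor of \(H / H_{\text{kv}}\), which is why recent 70B-class models ship with 8 KV heads rather than 64.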
3.4 Prompt Caching (Prefix Caching)
Literature: [4] (RadixAttention for automatic prefix caching).
When a prefix of the context (e.g., a system prompt) has been seen before, its KV-cache entries are already computed and resident in memory. Only the new tokens require prefill computation:
\[ \text{TTFT}_{\text{cached}} = \frac{2 \times \text{Parameters} \times (\text{Seq\_Len} - P_{\text{cached}}) \times \text{Batch}}{\text{Peak\_FLOPs} \times \eta} + \text{Dispatch\_Tax} \]
where \(P_{\text{cached}}\) is the number of tokens with pre-computed KV entries. The key asymmetry: TTFT improves proportionally (only new tokens need compute), but ITL is unchanged (the decode phase still loads the full model weights regardless) and memory is unchanged (the full KV-cache, including cached prefix, remains resident).
The TTFT speedup ratio is:
\[ \text{Speedup} = \frac{\text{Seq\_Len}}{\text{Seq\_Len} - P_{\text{cached}}} \]
For a chatbot with a 1024-token system prompt in a 4096-token context, caching yields a \(\frac{4096}{3072} = 1.33\times\) TTFT improvement. For RAG applications where 75% of the context is cached retrieval, the speedup reaches \(4\times\).
3.5 Disaggregated Serving
Literature: [5] (Splitwise: phase-aware LLM inference).
In disaggregated (or “splitwise”) serving, the compute-bound prefill and memory-bound decode phases run on different hardware optimized for each:
\[ \text{TTFT}_{\text{disagg}} = \text{TTFT}_{\text{prefill\_node}} + \frac{\text{KV\_Cache\_Bytes}}{\text{Network\_BW}} \]
The second term is the cost of transferring the KV-cache from the prefill node to the decode node over the network. This is worthwhile when:
- The prefill node has higher compute throughput (e.g., more FLOPS)
- The decode node has higher memory bandwidth (e.g., more HBM bandwidth per GPU)
- The KV-cache transfer time is small relative to the TTFT savings
3.6 Speculative Decoding
Literature: [6] (Fast Inference from Transformers via Speculative Decoding).
Speculative decoding uses a small draft model to generate \(K\) candidate tokens cheaply, then verifies all \(K\) in a single forward pass of the target model. The expected number of accepted tokens per step follows a geometric series [6]:
\[ \mathbb{E}[\text{accepted}] = 1 + \frac{\alpha(1 - \alpha^K)}{1 - \alpha} \]
where \(\alpha\) is the draft model’s acceptance rate (probability that its token matches the target model’s distribution). The effective ITL becomes:
\[ \text{ITL}_{\text{spec}} = \frac{K \cdot t_{\text{draft}} + t_{\text{verify}}}{\mathbb{E}[\text{accepted}]} \]
With a good draft model (\(\alpha \approx 0.8\), \(K = 4\)), speculative decoding can achieve approximately \(2.5\times\) ITL improvement. The trade-off is additional memory for the draft model weights.
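The expected-acceptance series and the amortized ITL can be sketched in a few lines. Names and timings here are illustrative assumptions (not the `mlsysim` API); the baseline ITL is taken to be one target-model forward pass, `t_verify`:

```python
# Sketch of the speculative-decoding model above.

def expected_accepted(alpha, k):
    """Geometric series: 1 + alpha(1 - alpha^K)/(1 - alpha)
    = (1 - alpha^(K+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def itl_speculative(k, t_draft, t_verify, alpha):
    """Amortized inter-token latency with a draft model."""
    return (k * t_draft + t_verify) / expected_accepted(alpha, k)

# alpha=0.8, K=4: ~3.36 tokens accepted per target-model pass.
# Assumed timings: 2 ms draft step, 30 ms target step.
speedup = 0.030 / itl_speculative(4, 0.002, 0.030, 0.8)  # ~2.65x
```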
4. The Data Wall
Implemented in mlsysim.core.solver.DataModel.
The data supply bandwidth is limited by the minimum bandwidth across the storage hierarchy (SSD/HDD) and the IO interconnect (PCIe):
\[ \text{Supply\_BW} = \min(\text{Storage\_BW}, \text{IO\_Interconnect\_BW}) \]
The system is stalled if the required data rate exceeds the supply bandwidth:
\[ \text{Utilization} = \frac{\text{Required\_Data\_Rate}}{\text{Supply\_BW}} \]
If \(\text{Utilization} > 1.0\), the compute units will stall waiting for data, reducing effective throughput regardless of the model’s arithmetic intensity.
5. Scaling Physics (Chinchilla Laws)
Implemented in mlsysim.core.solver.ScalingModel.
The total floating-point operations required for training a Transformer is approximately:
\[ C \approx 6 \times P \times D \]
The compute-optimal point (Chinchilla point) occurs when the number of training tokens is roughly 20 times the number of parameters:
\[ D \approx 20 \times P \]
Given a compute budget \(C\) (in FLOPs), the optimal parameter count is:
\[ P_{\text{opt}} = \sqrt{\frac{C}{120}} \]
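Substituting \(D = 20P\) into \(C = 6PD\) gives \(C = 120P^2\), hence the square root. A minimal sketch (illustrative names, not the `ScalingModel` API):

```python
import math

# Sketch of the Chinchilla-style sizing rule above.

def compute_optimal(c_flops):
    """Given budget C = 6*P*D with D = 20*P, return (P*, D*)."""
    p_opt = math.sqrt(c_flops / 120)
    return p_opt, 20 * p_opt

# A 1e24 FLOP budget:
p, d = compute_optimal(1e24)   # ~91B params, ~1.8T tokens
```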
6. Cluster Orchestration (Little’s Law)
Implemented in mlsysim.core.solver.OrchestrationModel.
Cluster utilization \(\rho\) is the ratio of the arrival rate to the service rate:
\[ \rho = \frac{\lambda}{\mu} \]
Using an M/D/1 queue model (Poisson arrivals, fixed job durations), the average wait time in the queue is:
\[ T_{\text{wait}} = \frac{\rho}{2\mu(1 - \rho)} \]
As \(\rho \to 1.0\), \(T_{\text{wait}} \to \infty\). This illustrates why maintaining some “headroom” in cluster capacity is essential for researcher productivity.
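The blow-up near \(\rho = 1\) is easy to see numerically. A sketch with illustrative names (not the `OrchestrationModel` API), with rates in jobs per hour:

```python
# Sketch of the M/D/1 wait-time formula above.

def md1_wait(arrival_rate, service_rate):
    """Average queue wait: rho / (2*mu*(1 - rho)); infinite if rho >= 1."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        return float("inf")   # queue grows without bound
    return rho / (2 * service_rate * (1 - rho))

# Doubling utilization from 0.45 to 0.90 grows the wait 11x:
w_low = md1_wait(4.5, 10.0)    # rho = 0.45
w_high = md1_wait(9.0, 10.0)   # rho = 0.90
```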
7. Model Compression (The Compression Tax)
Implemented in mlsysim.core.solver.CompressionModel.
The Compression Ratio (\(R\)) is the ratio of the original model size to the compressed size:
\[ R = \frac{\text{Size}_{\text{original}}}{\text{Size}_{\text{compressed}}} \]
For Quantization, \(R\) is determined by the baseline and target bit-widths:
\[ R_{\text{quant}} = \frac{b_{\text{base}}}{b_{\text{target}}} \]
where \(b_{\text{base}}\) is typically 16 (FP16/BF16), not 32. For example, FP16→INT4 gives \(R = 16/4 = 4\times\).
For Pruning, \(R\) is determined by the sparsity ratio (\(s\)):
\[ R_{\text{pruning}} = \frac{1}{1 - s} \]
The estimated accuracy impact (\(\Delta A\)) is non-linear and typically follows empirical heuristics derived from benchmark studies (e.g., INT8 quantization often yields <1% drop, while 4-bit quantization can yield 2–5% drop).
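The two ratio formulas can be sketched directly (illustrative names, not the `CompressionModel` API; the accuracy heuristics are empirical and not modeled here):

```python
# Sketch of the compression-ratio formulas above.

def quant_ratio(bits_base, bits_target):
    """Quantization: ratio of baseline to target bit-width."""
    return bits_base / bits_target

def pruning_ratio(sparsity):
    """Pruning: 1 / (1 - s) for sparsity fraction s."""
    return 1 / (1 - sparsity)

r_quant = quant_ratio(16, 4)     # FP16 -> INT4: 4x
r_prune = pruning_ratio(0.75)    # 75% sparsity: 4x
```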
8. Datacenter Sustainability
Implemented in mlsysim.core.solver.SustainabilityModel.
8.1 Total Energy
\[ E = \text{IT\_Power} \times \text{Hours} \times \text{PUE} \]
Power Usage Effectiveness (PUE) accounts for cooling and facility overhead. A PUE of 1.0 is theoretical perfect efficiency; hyperscale datacenters typically achieve 1.1–1.4.
8.2 Carbon Footprint
\[ C = E \times \text{Carbon\_Intensity} \]
Where \(C\) is in \(\text{kg CO}_2\text{e}\) and \(\text{Carbon\_Intensity}\) is in \(\text{g CO}_2\text{e/kWh}\), sourced from IEA regional grid data. This value varies from ~20 g/kWh (Quebec hydro) to ~820 g/kWh (Poland coal)—a ~41× difference for identical ML workloads.
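A worked sketch of both formulas, with illustrative names and an assumed ~0.7 kW per H100 (not the `SustainabilityModel` API):

```python
# Sketch of the energy and carbon formulas above.
# it_power_kw in kW, hours in h, carbon intensity in gCO2e/kWh.

def energy_kwh(it_power_kw, hours, pue):
    """Total facility energy including cooling overhead (PUE)."""
    return it_power_kw * hours * pue

def carbon_kg(e_kwh, carbon_intensity_g_per_kwh):
    """Carbon footprint in kg CO2e (g -> kg conversion)."""
    return e_kwh * carbon_intensity_g_per_kwh / 1000

# 512 H100s at ~0.7 kW each, running one week at PUE 1.2:
e = energy_kwh(512 * 0.7, 24 * 7, 1.2)   # ~72 MWh
quebec = carbon_kg(e, 20)                # hydro grid: ~1.4 t CO2e
poland = carbon_kg(e, 820)               # coal-heavy grid: ~59 t CO2e
```

Same workload, same energy, 41x the emissions: region selection is one of the cheapest carbon levers available.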
9. Total Cost of Ownership (TCO)
Implemented in mlsysim.core.solver.EconomicsModel.
\[ \text{TCO} = \text{CapEx}_{\text{amortized}} + \text{OpEx}_{\text{power}} + \text{OpEx}_{\text{networking}} + \text{OpEx}_{\text{labor}} \]
Where:
- \(\text{CapEx}_{\text{amortized}} = \text{Hardware\_Cost} / \text{Depreciation\_Years}\)
- \(\text{OpEx}_{\text{power}} = E \times \text{Electricity\_Rate}\)
10. Cluster Reliability (The Young-Daly Model)
Implemented in mlsysim.core.solver.ReliabilityModel.
The optimal checkpoint interval \(\tau_{\text{opt}}\) is defined by the Mean Time Between Failures (\(M\)) and the time it takes to write a single checkpoint (\(\delta\)):
\[ \tau_{\text{opt}} = \sqrt{2 \times \delta \times M} \]
For a cluster, the collective \(M\) drops linearly with the number of components. If a single node has an MTBF of 10,000 hours, a cluster of 1,000 nodes will have an MTBF of just 10 hours (\(10,000 / 1000\)).
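Combining the fleet MTBF with the Young-Daly formula gives a concrete interval. A sketch with illustrative names (not the `ReliabilityModel` API), assuming a 5-minute checkpoint write:

```python
import math

# Sketch of the Young-Daly checkpoint interval above. Times in hours.

def fleet_mtbf(node_mtbf_hours, n_nodes):
    """Collective MTBF drops linearly with component count."""
    return node_mtbf_hours / n_nodes

def optimal_checkpoint_interval(write_time_hours, mtbf_hours):
    """tau_opt = sqrt(2 * delta * M)."""
    return math.sqrt(2 * write_time_hours * mtbf_hours)

m = fleet_mtbf(10_000, 1_000)                 # 10 hours for the cluster
tau = optimal_checkpoint_interval(5 / 60, m)  # ~1.29 hours between ckpts
```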
11. Tail Latency (Erlang-C Queueing)
Implemented in mlsysim.core.solver.TailLatencyModel.
For an inference server farm modeled as an M/M/\(c\) queue with \(c\) replicas and per-server utilization \(\rho = \lambda / (c\mu)\):
\[ \mathbb{P}[\text{wait}] = \frac{(c\rho)^c / c! \cdot (1-\rho)^{-1}}{\sum_{k=0}^{c-1}(c\rho)^k/k! + (c\rho)^c/c! \cdot (1-\rho)^{-1}} \]
P99 latency grows non-linearly as \(\rho \to 1\). The TailLatencyModel computes P50 and P99 wait times and SLO violation probability as a function of arrival rate, service latency, and replica count.
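The Erlang-C formula is awkward to read but short to implement. A sketch with illustrative names (not the `TailLatencyModel` API), where `lam` is the arrival rate and `mu` the per-server service rate:

```python
import math

# Sketch of the Erlang-C waiting probability above.

def erlang_c_wait_prob(lam, mu, c):
    """Probability an arriving request must queue in an M/M/c system."""
    a = lam / mu          # offered load, equals c * rho
    rho = a / c
    if rho >= 1.0:
        return 1.0        # unstable: every request waits
    top = a ** c / math.factorial(c) / (1 - rho)
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

# 10 replicas at 80% utilization: ~41% of requests queue.
p_wait = erlang_c_wait_prob(8.0, 1.0, 10)
```

With a single replica (\(c=1\)) the formula collapses to \(\mathbb{P}[\text{wait}] = \rho\), the familiar M/M/1 result.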
12. Weight Streaming (Wafer-Scale Inference)
Implemented in mlsysim.core.solver.WeightStreamingModel.
Per-layer execution time is:
\[ T_{\text{layer}} = \max\!\left(\frac{|W_{\ell}|}{BW_{\text{inject}}},\; \frac{2P_{\ell} \times B}{\text{Peak} \times \eta}\right) \]
where \(|W_{\ell}|\) is the layer weight size in bytes, \(P_{\ell} = |W_{\ell}| / b_{\text{prec}}\) is the parameter count, and the factor of 2 accounts for the multiply-accumulate FLOPs per parameter.
The optimal batch size where injection and compute perfectly overlap:
\[ B^{*} = \frac{b_{\text{prec}} \times \text{Peak} \times \eta}{2 \times BW_{\text{inject}}} \]
Note that \(B^*\) depends only on numerical precision, peak compute, and injection bandwidth — it is independent of layer size.
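Both equations are direct to evaluate. A sketch with illustrative names and an assumed 1.2 TB/s injection bandwidth (not the `WeightStreamingModel` API):

```python
# Sketch of the weight-streaming model above. Sizes in bytes, rates in /s.

def layer_time(layer_bytes, batch, inject_bw, peak_flops,
               eta=0.5, bytes_per_param=2):
    """Per-layer time: max of weight injection and compute."""
    params = layer_bytes / bytes_per_param
    t_inject = layer_bytes / inject_bw
    t_compute = 2 * params * batch / (peak_flops * eta)
    return max(t_inject, t_compute)

def optimal_batch(inject_bw, peak_flops, eta=0.5, bytes_per_param=2):
    """B* where injection and compute exactly overlap (layer-size free)."""
    return bytes_per_param * peak_flops * eta / (2 * inject_bw)

b_star = optimal_batch(inject_bw=1.2e12, peak_flops=1e15)   # ~417 samples
```

At `b_star`, the `max` in `layer_time` is a tie: streaming the weights and computing on them take exactly the same time for every layer.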
13. Responsible Engineering (DP-SGD Overhead)
Implemented in mlsysim.core.solver.ResponsibleEngineeringModel.
Privacy and fairness guarantees impose measurable computational overhead. Differential privacy via DP-SGD requires per-sample gradient clipping and calibrated noise addition:
\[ \sigma \propto \frac{1}{\epsilon} \]
The per-step clipping and noise addition incur a training slowdown of approximately 2–10×. Fairness constraints require sufficient representation of minority subgroups, demanding additional data proportional to \(O(1/p_{\min})\) where \(p_{\min}\) is the smallest subgroup prevalence.
14. Sensitivity Analysis (Binding Constraints)
Implemented in mlsysim.core.solver.SensitivitySolver.
The SensitivitySolver identifies the binding constraint by computing partial derivatives of end-to-end latency with respect to each hardware parameter:
\[ \frac{\partial T}{\partial BW_{\text{mem}}}, \quad \frac{\partial T}{\partial \text{Peak}_{\text{FLOPS}}}, \quad \frac{\partial T}{\partial BW_{\text{net}}}, \quad \ldots \]
The parameter with the largest (most negative) sensitivity is the binding constraint — the single upgrade that would yield the greatest performance improvement. This transforms “where should I invest?” from intuition into calculation.
15. Inverse Roofline Synthesis
Implemented in mlsysim.core.solver.SynthesisSolver.
The SynthesisSolver inverts the analysis: given a workload and a service-level agreement, it derives the minimum hardware specifications required:
\[ BW_{\text{required}} = \frac{|W|}{T_{\text{target}}}, \qquad \text{FLOPS}_{\text{required}} = \frac{\text{OPs}}{T_{\text{target}} \times \eta} \]
This enables hardware–software co-design: rather than asking “how fast is my system?” the practitioner asks “what system do I need?”
16. Software Efficiency (Wall 3: MFU Decomposition)
Implemented in mlsysim.core.solver.EfficiencyModel.
The gap between peak and achieved FLOP/s is captured by the utilization parameter \(\eta\):
\[ \eta = \frac{\text{Achieved\_FLOP/s}}{\text{Peak\_FLOP/s}} \]
In practice, \(\eta\) decomposes into multiplicative factors:
\[ \eta = \eta_{\text{occupancy}} \times \eta_{\text{fusion}} \times \eta_{\text{precision}} \times \eta_{\text{memory}} \]
where occupancy captures warp-level parallelism, fusion captures kernel launch elimination, precision captures tensor core utilization, and memory captures data reuse. Well-optimized training (Megatron-LM, DeepSpeed) achieves \(\eta \approx 0.35\text{–}0.55\); unoptimized inference may drop below 0.10.
17. Continuous Batching (Wall 5: PagedAttention)
Implemented in mlsysim.core.solver.ContinuousBatchingModel.
Static batching wastes memory by pre-allocating contiguous KV-cache for the maximum sequence length. PagedAttention allocates KV-cache in fixed-size pages:
\[ \text{KV\_paged} = 2 \times L \times H_{\text{kv}} \times d \times \lceil S/p \rceil \times p \times B \times b \]
The memory savings come from eliminating internal fragmentation. The effective batch size \(B_{\text{eff}}\) under a memory budget \(M_{\text{budget}}\) is:
\[ B_{\text{eff}} = \left\lfloor \frac{M_{\text{budget}} - |W|}{2 \times L \times H_{\text{kv}} \times d \times S \times b} \right\rfloor \]
where \(|W|\) is the model weight footprint. Continuous batching dynamically inserts and removes requests as they complete, keeping \(B_{\text{eff}}\) near maximum at all times.
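The effective-batch bound is a single floor division. A sketch with illustrative names (not the `ContinuousBatchingModel` API), assuming a 70B fp16 model with GQA (8 KV heads):

```python
# Sketch of the effective-batch-size bound above. Sizes in bytes.

def effective_batch(mem_budget, weight_bytes, layers, kv_heads, head_dim,
                    seq_len, bytes_per_param=2):
    """Max concurrent sequences that fit after the weights are resident."""
    per_seq_kv = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param
    return (mem_budget - weight_bytes) // per_seq_kv

# 8x80 GB node, 140 GB of fp16 weights, L=80, 8 KV heads, d=128, S=4096:
b_eff = effective_batch(640e9, 140e9, 80, 8, 128, 4096)   # 372 sequences
```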
18. Data Transformation (Wall 9: CPU Preprocessing)
Implemented in mlsysim.core.solver.TransformationModel.
CPU preprocessing (decode, resize, tokenize, augment) must keep pace with GPU consumption:
\[ T_{\text{transform}} = \frac{B}{R_{\text{cpu}}} \]
where \(B\) is the batch size (in samples) and \(R_{\text{cpu}} = N_{\text{workers}} \times r_{\text{worker}}\) is the aggregate CPU preprocessing rate (in samples/s), with \(r_{\text{worker}}\) being the per-worker throughput after all transformations (decode, augment, normalize). The pipeline is stalled when:
\[ T_{\text{transform}} > T_{\text{compute}} \]
Common mitigations: increase \(N_{\text{workers}}\) (more workers), cache preprocessed data, or use GPU-accelerated decoding (NVIDIA DALI).
19. Network Topology (Wall 10: Bisection Bandwidth)
Implemented in mlsysim.core.solver.TopologyModel.
The effective bandwidth available for collective communication depends on the network topology and its oversubscription ratio:
\[ \text{BW}_{\text{eff}} = \frac{\text{BW}_{\text{link}} \times \beta}{\text{oversubscription}} \]
where \(\beta\) is the bisection bandwidth ratio (1.0 for full bisection in fat-tree topologies, <1.0 for oversubscribed networks). For a \(k\)-ary fat-tree with oversubscription ratio \(r\):
\[ \text{BW}_{\text{bisection}} = \frac{k^2}{4r} \times \text{BW}_{\text{link}} \]
Fat-tree topologies (Leiserson, 1985) provide full bisection bandwidth (\(r=1\)), making them ideal for AllReduce-heavy workloads. Dragonfly and torus topologies trade bisection bandwidth for lower cost at scale.
20. Inference-Time Scaling (Wall 12: Reasoning Compute)
Implemented in mlsysim.core.solver.InferenceScalingModel.
Chain-of-thought, tree search, and other inference-time scaling strategies multiply the compute cost by the number of reasoning steps \(K\):
\[ T_{\text{reasoning}} = K \times T_{\text{step}} \]
where \(T_{\text{step}}\) is the latency of a single forward pass (from the roofline model). For best-of-\(N\) sampling:
\[ T_{\text{best-of-N}} = N \times T_{\text{generate}} + T_{\text{verify}} \]
The key insight from Snell et al. (2024) is that inference-time compute can substitute for training-time compute: a smaller model with \(K\) reasoning steps may match a larger model with a single pass, at the cost of \(K\times\) inference latency.
21. Checkpoint I/O (Wall 19: MFU Penalty)
Implemented in mlsysim.core.solver.CheckpointModel.
Periodic checkpointing imposes an I/O burst that pauses training:
\[ \text{MFU\_penalty} = \frac{T_{\text{write}}}{T_{\text{interval}}} \]
where \(T_{\text{write}} = |W| \times \beta_{\text{opt}} / BW_{\text{storage}}\) is the time to write one checkpoint and \(T_{\text{interval}}\) is the checkpoint interval. For mixed-precision Adam with FP16 weights, the checkpoint includes:
- FP16 model weights: 2 bytes/param
- FP32 master weights: 4 bytes/param
- FP32 momentum: 4 bytes/param
- FP32 variance: 4 bytes/param
Gradients are ephemeral and not checkpointed. This totals 14 bytes per parameter, giving \(\beta_{\text{opt}} \approx 7\) (ratio of 14 checkpoint bytes to 2 weight bytes):
\[ \text{Checkpoint\_Size} = P \times 14 \text{ bytes/param} \]
For a 70B model, the checkpoint is \(70\text{B} \times 14 \approx 0.98\) TB. The optimal checkpoint interval from the Young-Daly formula (Section 10) balances checkpoint overhead against expected rework from failures.
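The byte accounting and MFU penalty above can be sketched directly (illustrative names, not the `CheckpointModel` API; 50 GB/s storage bandwidth and an hourly interval are assumptions):

```python
# Sketch of the checkpoint-overhead formulas above.

def checkpoint_bytes(params, bytes_per_param=14):
    """Mixed-precision Adam: 2 (fp16 weights) + 4 + 4 + 4 (fp32 state)."""
    return params * bytes_per_param

def mfu_penalty(ckpt_bytes, storage_bw, interval_s):
    """Fraction of training time lost to checkpoint writes."""
    return (ckpt_bytes / storage_bw) / interval_s

size = checkpoint_bytes(70e9)              # ~0.98 TB per checkpoint
penalty = mfu_penalty(size, 50e9, 3600)    # ~0.5%: 19.6 s write each hour
```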
These equations are first-order analytical models. They assume: (1) uniform memory access patterns, (2) no cache effects, (3) no network contention under heavy load, and (4) linear scaling of throughput with batch size. Real systems deviate from these assumptions. MLSys·im predictions are typically accurate within ±20% of measured hardware performance — sufficient for systems intuition and capacity planning, but not a substitute for empirical profiling.
For a detailed discussion of accuracy, limitations, and when to use MLSys·im vs. empirical profiling, see Accuracy & Validation.