Mathematical Foundations
The First-Principles Equations Behind Every MLSYSIM Solver
MLSYSIM avoids “black box” heuristics. Every output traces back to one of the equations below. Before diving into code, read this page to understand what the solvers are computing and why.
Each section corresponds to one MLSYSIM solver. Click any solver name to jump to its API docs, or follow the 📚 Slide Deck links to the full lecture treatment with worked examples and exercises.
1. The Roofline Model (Single-Node Performance)
Implemented in mlsysim.core.solver.SingleNodeModel. 📚 Slide Deck: Hardware Acceleration (Vol I, Ch 11)
1.1 Latency Equation
\[ T = \max \left( \underbrace{\frac{\text{FLOPs}}{\text{Peak\_FLOP/s} \times \eta}}_{\text{compute time}},\; \underbrace{\frac{\text{Bytes}}{\text{Memory\_BW}}}_{\text{memory time}} \right) + \text{Dispatch\_Tax} \]
| Symbol | Meaning | Typical Range |
|---|---|---|
| \(\eta\) | Hardware utilization efficiency | 0.25–0.55 (training); ~0.35 (inference). See Accuracy & Validation. |
| Dispatch_Tax | Kernel-launch overhead (CUDA, driver) | 0.01–0.1 ms |
1.2 Arithmetic Intensity and the Ridge Point
The key diagnostic ratio is arithmetic intensity:
\[ I = \frac{\text{FLOPs}}{\text{Bytes Transferred}} \]
The ridge point is the hardware’s crossover intensity:
\[ I^* = \frac{\text{Peak\_FLOP/s}}{\text{Memory\_BW}} \]
| If… | Regime | Action |
|---|---|---|
| \(I > I^*\) | Compute-bound | Faster math units, lower precision, or fewer FLOPs |
| \(I < I^*\) | Memory-bound | Larger batch size, operator fusion, or higher memory bandwidth |
A dense layer with 2048→512, FP16, batch=1:
- FLOPs: \(2 \times 2048 \times 512 = 2.1\text{M}\)
- Bytes: \(2048 \times 512 \times 2 = 2.1\text{MB}\) (FP16 weights; input/output activations are negligible at batch=1)
- Arithmetic Intensity: \(2.1\text{M} / 2.1\text{MB} = 1\) FLOP/byte
- H100 Ridge Point: \(989\text{ TF} / 3.35\text{ TB/s} \approx 296\) FLOP/byte
At \(I = 1 \ll 296 = I^*\), this workload is deeply memory-bound — the H100 achieves <1% peak utilization. Increasing to batch=64 raises \(I\) to ~64, recovering significant throughput.
Source: Slide deck exercise, Vol I Ch 11.
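The roofline arithmetic above is easy to sketch in a few lines. This is an illustrative helper, not the actual `mlsysim.core.solver.SingleNodeModel` API; the default \(\eta\) and dispatch tax are assumptions within the ranges quoted in §1.1.

```python
# Roofline sketch (illustrative helper names, not the actual
# mlsysim.core.solver.SingleNodeModel API).
def roofline_latency_s(flops, bytes_moved, peak_flops, mem_bw,
                       eta=0.35, dispatch_tax_s=50e-6):
    """Latency = max(compute time, memory time) + dispatch tax."""
    compute_s = flops / (peak_flops * eta)
    memory_s = bytes_moved / mem_bw
    return max(compute_s, memory_s) + dispatch_tax_s

# Dense 2048 -> 512 layer, FP16, batch=1, on an H100 (989 TF, 3.35 TB/s)
flops = 2 * 2048 * 512            # ~2.1 MFLOPs
weight_bytes = 2048 * 512 * 2     # ~2.1 MB of FP16 weights
intensity = flops / weight_bytes  # 1.0 FLOP/byte
ridge = 989e12 / 3.35e12          # ~295 FLOP/byte: deeply memory-bound
```

Since `intensity` sits far below `ridge`, the memory term of `roofline_latency_s` dominates, reproducing the diagnosis in the worked example.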
2. Distributed Training (3D/4D Parallelism)
Implemented in mlsysim.core.solver.DistributedModel. 📚 Slide Decks: Distributed Training (Vol II, Ch 5) | Collective Communication (Vol II, Ch 6)
Real distributed training involves complex interactions between computation, communication, and scheduling. Empirical profiling requires access to expensive multi-GPU clusters and takes hours per configuration. MLSYSIM decomposes the problem into independent overheads — each governed by a closed-form equation — letting you evaluate thousands of parallelism configurations in seconds.
2.1 Parallelism Decomposition
MLSYSIM supports 4D parallelism. Given a cluster of \(N\) total GPUs:
\[ \text{DP} = \frac{N}{\text{TP} \times \text{PP} \times \text{EP}} \]
| Dimension | What It Splits | Communication Pattern |
|---|---|---|
| Data Parallelism (DP) | Batch across replicas | AllReduce (gradients) |
| Tensor Parallelism (TP) | Individual layers across GPUs | Point-to-point (intra-node NVLink) |
| Pipeline Parallelism (PP) | Layer groups across stages | Forward/backward activations |
| Expert Parallelism (EP) | MoE experts across GPUs | All-to-All (token routing) |
2.2 Scaling Efficiency
The solver computes an overall scaling efficiency — the fraction of ideal linear speedup actually achieved:
\[ \eta_{\text{scale}} = \frac{T_{\text{compute}}}{T_{\text{compute}} + T_{\text{dp}} + T_{\text{tp}} + T_{\text{ep}} + T_{\text{bubble}}} \]
An efficiency of 80% on 256 GPUs means you get the throughput of ~205 GPUs — the rest is communication overhead. Published scaling efficiencies at scale: ~95% at 8 GPUs, ~85% at 64 GPUs, ~60% at 1,024 GPUs, ~40% at 8,192 GPUs.
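The efficiency identity can be sketched directly; the helper name and the sample overhead split are illustrative, not mlsysim's API.

```python
# Scaling-efficiency sketch (assumed helper name; mirrors the equation above).
def scaling_efficiency(t_compute, t_dp=0.0, t_tp=0.0, t_ep=0.0, t_bubble=0.0):
    """Fraction of ideal linear speedup that survives all overheads."""
    return t_compute / (t_compute + t_dp + t_tp + t_ep + t_bubble)

# 80% efficiency on 256 GPUs delivers the throughput of ~205 ideal GPUs.
eta = scaling_efficiency(t_compute=80.0, t_dp=12.0, t_tp=5.0, t_bubble=3.0)
effective_gpus = 256 * eta        # 204.8
```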
2.3 Ring All-Reduce (Data Parallelism)
📚 Slide Deck: Collective Communication (Vol II, Ch 6)
After each training step, every GPU must synchronize gradients. The standard algorithm is ring all-reduce, which arranges \(N\) GPUs in a logical ring and passes gradient chunks in two phases (scatter-reduce, then all-gather).
Using the \(\alpha\)-\(\beta\) communication model (\(\alpha\) = per-message latency, \(\beta\) = bandwidth):
\[ T_{\text{ring}} = 2(N-1) \cdot \alpha \;+\; \frac{2(N-1)}{N} \cdot \frac{M}{\beta} \]
| Symbol | Meaning |
|---|---|
| \(M\) | Total gradient size in bytes |
| \(N\) | Number of GPUs in the ring |
| \(\alpha\) | Per-message startup latency |
| \(\beta\) | Per-link bandwidth (bytes/s) |
Why ring all-reduce scales well: The bandwidth term \(\frac{2(N-1)}{N} \cdot \frac{M}{\beta}\) approaches \(\frac{2M}{\beta}\) as \(N\) grows — asymptotically constant in the number of GPUs. Total data transferred per GPU is always ~\(2M\), regardless of cluster size. The latency term \(2(N-1) \cdot \alpha\) grows linearly, but is negligible for large messages.
MLSYSIM also implements tree all-reduce (latency-optimal for small messages):
\[ T_{\text{tree}} = 2 \log_2(N) \cdot \alpha \;+\; 2 \log_2(N) \cdot \frac{M}{\beta} \]
The crossover point between ring and tree is approximately \(M_{\text{cross}} \approx N \cdot \alpha \cdot \beta\). Below this size, tree wins (latency dominates); above it, ring wins (bandwidth dominates).
A 70B-parameter model in FP16 produces a 140 GB gradient. On a 1,024-GPU cluster with InfiniBand NDR (\(\beta = 50\) GB/s, \(\alpha = 2\;\mu\)s):
- Bandwidth term: \(\frac{2 \times 1023}{1024} \times \frac{140}{50} \approx 5.6\) s
- Latency term: \(2 \times 1023 \times 2\;\mu\text{s} \approx 4\) ms (negligible)
- Total: ~5.6 s — bandwidth-dominated, as expected for large messages.
Upgrading from 100 Gb Ethernet (~12 GB/s) to NDR (~50 GB/s) recovers 10–30% scaling efficiency.
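The ring and tree costs above can be compared numerically. The function names are illustrative sketches of the two equations, not mlsysim's API:

```python
from math import log2

# alpha-beta cost sketches for ring vs. tree all-reduce (illustrative helper
# names; m in bytes, alpha in seconds, beta in bytes/s).
def t_ring(n, m, alpha, beta):
    return 2 * (n - 1) * alpha + (2 * (n - 1) / n) * (m / beta)

def t_tree(n, m, alpha, beta):
    return 2 * log2(n) * (alpha + m / beta)

# 140 GB gradient on 1,024 GPUs over NDR (beta = 50 GB/s, alpha = 2 us)
n, m, alpha, beta = 1024, 140e9, 2e-6, 50e9
ring_s = t_ring(n, m, alpha, beta)   # ~5.6 s, bandwidth-dominated
tree_s = t_tree(n, m, alpha, beta)   # ~56 s: log2(1024)=10 full-size transfers each way
```

Swapping in a 4 KB message flips the comparison: tree's \(O(\log N)\) latency term beats ring's \(O(N)\) term, matching the crossover rule above.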
2.4 Pipeline Parallelism Bubble
📚 Slide Deck: Distributed Training (Vol II, Ch 5)
Pipeline parallelism splits a model’s layers across \(P\) stages. At the start of each batch, downstream stages sit idle while upstream stages produce output — the pipeline bubble.
With \(P\) pipeline stages, \(M\) microbatches, and \(V\) virtual stages per GPU:
\[ \text{Bubble Fraction} = \frac{P - 1}{V \times M + P - 1} \]
| Configuration | Bubble Fraction |
|---|---|
| \(P=4, M=16, V=1\) | \(3/19 = 15.8\%\) |
| \(P=4, M=16, V=2\) | \(3/35 = 8.6\%\) |
| \(P=8, M=32, V=1\) | \(7/39 = 17.9\%\) |
| \(P=16, M=16, V=1\) | \(15/31 = 48.4\%\) — unacceptable |
To keep the bubble below 5% with standard 1F1B (\(V=1\)), you need \(M \geq 19 \times (P - 1)\) microbatches. Interleaved schedules (\(V \geq 2\)) cut this requirement proportionally.
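Both the bubble fraction and the microbatch requirement follow directly from the formula; the helper names below are illustrative sketches:

```python
from math import ceil

# Bubble-fraction sketch (assumed helper names, matching the formula above).
def bubble_fraction(p, m, v=1):
    return (p - 1) / (v * m + p - 1)

def min_microbatches(p, target=0.05, v=1):
    """Smallest M with bubble_fraction(p, M, v) <= target."""
    # (P-1)/(V*M + P-1) <= t  =>  M >= (P-1)(1-t)/(t*V)
    return ceil((p - 1) * (1 - target) / (target * v))

# With 1F1B (V=1) and P=8 stages, a <5% bubble needs 19*(P-1) = 133
# microbatches; an interleaved schedule with V=2 roughly halves that.
need_v1 = min_microbatches(8)        # 133
need_v2 = min_microbatches(8, v=2)   # 67
```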
2.5 Expert Parallelism (Mixture of Experts)
When \(\text{EP} > 1\), the solver adds an All-to-All communication penalty for token routing:
\[ T_{\text{all-to-all}} = (N-1) \cdot \alpha \;+\; \frac{N-1}{N} \cdot \frac{M}{\beta} \]
where \(N\) is the EP degree. Compared to AllReduce, All-to-All creates \(O(N^2)\) point-to-point connections, making it latency-sensitive — a 4 KB token routing message is latency-bound, while a 140 GB gradient sync is bandwidth-bound.
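The latency-vs-bandwidth split is easy to check numerically. This is a sketch of the equation above with assumed link parameters (2 µs, 50 GB/s), not mlsysim's API:

```python
# All-to-All cost sketch from the equation above (illustrative helper name).
def t_all_to_all(n, m, alpha, beta):
    return (n - 1) * alpha + ((n - 1) / n) * (m / beta)

# EP=64 with assumed alpha = 2 us, beta = 50 GB/s, 4 KB routing message:
n, alpha, beta = 64, 2e-6, 50e9
latency_term = (n - 1) * alpha                                      # 126 us
bandwidth_term = t_all_to_all(n, 4096, alpha, beta) - latency_term  # ~0.08 us
# Tiny routing messages are latency-bound by roughly three orders of magnitude.
```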
3. Training Scaling Laws
Used by mlsysim.core.formulas.calc_transformer_training_flops. 📚 Slide Deck: Training (Vol I, Ch 8)
The 6PD rule estimates total training FLOPs for a Transformer:
\[ C \approx 6 \times P \times D \]
| Symbol | Meaning |
|---|---|
| \(C\) | Total training FLOPs |
| \(P\) | Number of model parameters |
| \(D\) | Number of training tokens |
The factor of 6 comes from: 2 FLOPs per parameter per token (forward pass multiply-accumulate) × 3 passes (forward + backward for weights + backward for activations). Combined with the roofline model (Section 1), this gives training time:
\[ T_{\text{train}} = \frac{6PD}{\text{Peak\_FLOP/s} \times \eta} \]
GPT-3 175B trained on 300B tokens:
- Total FLOPs: \(6 \times 175 \times 10^9 \times 300 \times 10^9 = 3.15 \times 10^{23}\)
- On 1,024 A100s (\(312\text{ TF each} \times 0.5\) MFU): \(3.15 \times 10^{23} / (1024 \times 312 \times 10^{12} \times 0.5) \approx 1.97 \times 10^6\text{ s} \approx 23\text{ days}\)
The actual training took ~34 days, consistent with scaling efficiency losses at 1,024 GPUs (~60%).
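The GPT-3 estimate can be reproduced in a few lines. These are illustrative helpers; the real computation lives in `mlsysim.core.formulas.calc_transformer_training_flops`:

```python
# 6PD sketch (illustrative names; mlsysim exposes
# calc_transformer_training_flops for the real computation).
def training_flops(params, tokens):
    return 6 * params * tokens

def training_days(params, tokens, n_gpus, peak_flops, mfu):
    seconds = training_flops(params, tokens) / (n_gpus * peak_flops * mfu)
    return seconds / 86400

# GPT-3 175B on 300B tokens, 1,024 A100s at 50% MFU -> ~23 days (ideal scaling)
days = training_days(175e9, 300e9, 1024, 312e12, 0.5)
```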
4. LLM Serving Lifecycle
Implemented in mlsysim.core.solver.ServingModel. 📚 Slide Decks: Model Serving (Vol I, Ch 13) | Inference at Scale (Vol II, Ch 9)
LLM autoregressive inference has two physically distinct phases. Understanding which phase dominates is critical for capacity planning.
4.1 Pre-fill Phase (Compute-Bound)
The initial forward pass processes all prompt tokens in parallel:
\[ \text{TTFT} = \frac{2 \times P \times S \times B}{\text{Peak\_FLOP/s} \times \eta} + \text{Dispatch\_Tax} \]
| Symbol | Meaning |
|---|---|
| \(P\) | Number of model parameters |
| \(S\) | Input sequence length |
| \(B\) | Batch size |
The factor of 2 counts both the multiply and the add in each multiply-accumulate (MAC) operation.
4.2 Decoding Phase (Memory-Bound)
Each decode step loads the entire model weight matrix plus the accumulated KV-cache:
\[ \text{ITL} = \frac{\text{Model\_Bytes} + \text{KV\_Cache\_Bytes}}{\text{Memory\_BW}} \]
This phase is almost always memory-bound: generating one token performs only a single matrix-vector product per layer, yet requires loading every weight. For a 70B model on an H100 at batch=1, compute accounts for well under 1% of the step time; memory loading takes essentially all of it.
Single-token decode of a 70B FP16 model, batch=1:
- Compute: \(2 \times 70\text{B} / 989\text{ TF} \approx 0.14\text{ ms}\) (<1%)
- Memory: \(140\text{ GB} / 3.35\text{ TB/s} \approx 41.8\text{ ms}\) (>99%)
- Total: ~42 ms/token → ~24 tokens/s
The decode phase is so memory-bound that a hypothetical 2× compute upgrade yields zero speedup. Only higher memory bandwidth (or batching) helps.
Source: Compute Infrastructure slides (Vol II, Ch 2).
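The decode example above can be sketched as follows; the helper name is illustrative, not the `ServingModel` API:

```python
# Decode-latency sketch (illustrative names). Per-token latency is the larger
# of compute time and weight+KV load time; at batch=1 the load dominates.
def decode_itl_s(model_bytes, kv_bytes, mem_bw, params, peak_flops):
    compute_s = 2 * params / peak_flops      # one MAC per parameter per token
    memory_s = (model_bytes + kv_bytes) / mem_bw
    return max(compute_s, memory_s)

# 70B FP16 model on an H100, empty KV-cache: ~42 ms/token -> ~24 tokens/s
itl = decode_itl_s(140e9, 0, 3.35e12, 70e9, 989e12)
tokens_per_s = 1 / itl
```

Doubling `peak_flops` leaves `itl` unchanged, which is the "2× compute upgrade yields zero speedup" claim in code.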
4.3 KV-Cache Size
\[ \text{KV\_Bytes} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times S \times B \times b_{\text{elem}} \]
| Symbol | Meaning |
|---|---|
| \(L\) | Number of Transformer layers |
| \(H_{\text{kv}}\) | Number of KV attention heads (equals \(H\) for MHA; smaller for GQA/MQA) |
| \(d_{\text{head}}\) | Dimension per head |
| \(S\) | Sequence length |
| \(B\) | Batch size |
| \(b_{\text{elem}}\) | Bytes per element (2 for FP16/BF16) |
The factor of 2 accounts for both the K and V tensors. With Grouped-Query Attention (GQA), \(H_{\text{kv}} < H\), significantly reducing cache size — Llama-3 70B uses 8 KV heads vs. 64 query heads, an 8× reduction.
Llama-3 70B: 80 layers, 8 KV heads (GQA), \(d_{\text{head}} = 128\), FP16:
- Per-token: \(2 \times 80 \times 8 \times 128 \times 2 = 327\text{ KB}\)
- Batch=16, seq=4096: \(327\text{K} \times 16 \times 4096 = 21.5\text{ GB}\)
- Batch=32, seq=4096: \(327\text{K} \times 32 \times 4096 = 43\text{ GB}\)
Weights in FP16 are 140 GB, so a 70B model cannot fit on a single 80 GB H100 even before the KV-cache. With INT4-quantized weights (~35 GB), weights + KV-cache (43 GB) = 78 GB — barely fits on one H100. At batch=64 or longer contexts, you must shard across multiple GPUs.
Source: Inference at Scale slides (Vol II, Ch 9).
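The sizing formula translates directly to code; the helper name is an illustrative sketch:

```python
# KV-cache sizing sketch (illustrative helper, matching the formula above).
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers both the K and V tensors.
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_elem

# Llama-3 70B with GQA: 80 layers, 8 KV heads, d_head = 128, FP16
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)   # 327,680 bytes
batch32_gb = kv_cache_bytes(80, 8, 128, 4096, 32) / 1e9      # ~43 GB
```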
4.4 Serving Cost Dominance
At production scale, serving costs dominate training:
\[ C_{\text{total}} = C_{\text{training}} + C_{\text{serving}} \times T_{\text{deploy}} \times Q_{\text{rate}} \]
A 70B model costing $2M to train, serving 1M daily users at 50 requests/day, costs ~$18M/year in inference — 9× the training cost in year one. This is why inference optimization (batching, quantization, speculative decoding) has outsized economic impact.
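The 9× figure can be reproduced with an assumed per-request serving cost of about $0.001 — a number back-derived from the quoted totals, not a measured cost:

```python
# Back-of-envelope for the example above. The ~$0.001/request figure is an
# assumption implied by the quoted totals, not a measured cost.
training_cost = 2e6                                  # $2M one-time
cost_per_request = 0.001                             # assumed $/request
annual_serving = 1e6 * 50 * 365 * cost_per_request   # ~$18.25M/year
ratio = annual_serving / training_cost               # ~9x in year one
```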
5. Datacenter Sustainability
Implemented in mlsysim.core.solver.SustainabilityModel. 📚 Slide Deck: Sustainable AI (Vol II, Ch 15)
5.1 Total Energy
\[ E = \text{IT\_Power} \times \text{Hours} \times \text{PUE} \]
Power Usage Effectiveness (PUE) accounts for cooling and facility overhead. A PUE of 1.0 is theoretically perfect; liquid-cooled hyperscale facilities achieve 1.05–1.10, while typical air-cooled datacenters run 1.4–1.6. The industry average is 1.58 — meaning roughly 37% of total energy goes to cooling and power distribution rather than computation.
5.2 Carbon Footprint
\[ C = E \times \text{Carbon\_Intensity} \]
Where \(C\) is in kg CO₂e and Carbon_Intensity is in g CO₂e/kWh, sourced from regional grid data.
Example: 64 GPUs × 400 W × 336 hours = 8,602 kWh of IT energy; at PUE 1.2, total facility energy is 10,322 kWh.
| Grid Region | Carbon Intensity | CO₂ Emissions |
|---|---|---|
| Quebec (hydro) | 20 g/kWh | 206 kg |
| US Average | 429 g/kWh | 4,428 kg |
| Poland (coal) | 820 g/kWh | 8,464 kg |
Geographic choice alone creates a ~41× difference for identical workloads. Combined with hardware choice (5×) and quantization (4×), the total optimization lever is up to 800×.
Source: Sustainable AI slides (Vol II, Ch 15).
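The two equations compose into a short sketch (illustrative names, not the `SustainabilityModel` API):

```python
# Energy/carbon sketch (illustrative names, matching the two equations above).
def total_energy_kwh(it_power_kw, hours, pue):
    return it_power_kw * hours * pue

def carbon_kg(energy_kwh, grid_g_per_kwh):
    return energy_kwh * grid_g_per_kwh / 1000.0

# 64 GPUs x 400 W for 336 h at PUE 1.2, on two grids from the table
kwh = total_energy_kwh(64 * 0.400, 336, 1.2)   # ~10,322 kWh
quebec_kg = carbon_kg(kwh, 20)                 # ~206 kg
poland_kg = carbon_kg(kwh, 820)                # ~8,464 kg: a ~41x gap
```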
6. Total Cost of Ownership (TCO)
Implemented in mlsysim.core.solver.EconomicsModel. 📚 Slide Deck: Compute Infrastructure (Vol II, Ch 2)
\[ \text{TCO} = \underbrace{\frac{\text{Hardware\_Cost}}{\text{Depreciation\_Years}}}_{\text{CapEx (amortized)}} + \underbrace{E \times \text{Electricity\_Rate}}_{\text{OpEx (power)}} + \text{OpEx}_{\text{maintenance}} \]
| Metric | On-Prem (H100 Node) | Cloud On-Demand |
|---|---|---|
| Unit cost | $350K (amortize over 3 yr) | — |
| Annual cost (80% util) | ~$122K/node/yr | ~$224K/node/yr |
| Effective GPU-hour rate | ~$2.40 | ~$4.00 |
Break-even utilization: ~55%. Below 55% sustained utilization, cloud is cheaper. At 20% utilization, on-prem costs ~$6/GPU-hour — more expensive than cloud on-demand.
The right metric is cost per useful TFLOP-hour, not cost per GPU-hour: an H100 delivers roughly 3× the dense FP16 TFLOPS of an A100 (up to ~6× with structured sparsity), so a higher sticker price can still yield a lower effective cost.
Source: Compute Infrastructure slides (Vol II, Ch 2).
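The utilization sensitivity behind the break-even claim can be sketched under assumed accounting: amortized CapEx and maintenance are fixed, while power OpEx scales with utilization. The dollar figures below are assumptions in the spirit of the table, and real TCO models add networking, staffing, and facility CapEx:

```python
# Amortized-rate sketch under assumed accounting (8 GPUs/node; power OpEx
# scales with utilization, CapEx and maintenance do not).
HOURS_PER_YEAR = 8760

def gpu_hour_rate(node_capex, dep_years, power_at_full_util, maintenance,
                  utilization, gpus_per_node=8):
    annual = (node_capex / dep_years              # amortized CapEx
              + power_at_full_util * utilization  # power scales with use
              + maintenance)
    return annual / (gpus_per_node * HOURS_PER_YEAR * utilization)

# Fixed costs dominate at low utilization, so the effective rate climbs steeply.
busy = gpu_hour_rate(350e3, 3, 25e3, 10e3, utilization=0.80)
idle = gpu_hour_rate(350e3, 3, 25e3, 10e3, utilization=0.20)
```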
7. Cluster Reliability (The Young-Daly Model)
Implemented in mlsysim.core.solver.ReliabilityModel. 📚 Slide Deck: Fault Tolerance (Vol II, Ch 7)
7.1 Cluster MTBF
Individual component reliability degrades linearly with scale:
\[ \text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{component}}}{N} \]
| Cluster Size (\(N\)) | System MTBF (50,000 hr/GPU) |
|---|---|
| 8 GPUs | ~8 months |
| 1,000 GPUs | 50 hours |
| 10,000 GPUs | 5 hours |
| 16,000 GPUs | ~3 hours (Llama-3 scale: 419 failures in 54 days) |
7.2 Optimal Checkpoint Interval
\[ \tau_{\text{opt}} = \sqrt{2 \times \delta \times M} \]
| Symbol | Meaning |
|---|---|
| \(\delta\) | Time to write one checkpoint |
| \(M\) | Mean Time Between Failures (cluster MTBF) |
7.3 Wasted Compute Fraction
The total overhead from checkpointing has two components:
\[ \text{Waste} = \underbrace{\frac{\delta}{\tau}}_{\text{checkpoint writes}} + \underbrace{\frac{\tau}{2M}}_{\text{lost work on failure}} \]
At the optimal \(\tau_{\text{opt}} = \sqrt{2\delta M}\), both terms are equal and total waste is minimized at \(\sqrt{2\delta/M}\).
16,384 GPUs, \(\delta = 2\) min, cluster MTBF \(M = 180\) min:
\[\tau_{\text{opt}} = \sqrt{2 \times 2 \times 180} = \sqrt{720} \approx 27\text{ min}\]
| Checkpoint Interval | Overhead |
|---|---|
| 10 min (too frequent) | ~23% |
| 27 min (optimal) | ~15% |
| 60 min (too rare) | ~20% |
On a cluster costing $10M/month, 15% overhead = $1.5M/month. Asynchronous checkpointing (overlapping writes with computation) hides most of the write term and substantially reduces this overhead.
Source: Fault Tolerance slides (Vol II, Ch 7).
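The Young-Daly arithmetic can be sketched directly, using Young's first-order waste estimate \(\delta/\tau + \tau/(2M)\), which is minimized at \(\tau_{\text{opt}} = \sqrt{2\delta M}\). Helper names are illustrative, not the `ReliabilityModel` API:

```python
from math import sqrt

# Young-Daly sketch (illustrative names): waste = delta/tau + tau/(2*MTBF),
# minimized at tau_opt = sqrt(2 * delta * MTBF).
def tau_opt(delta, mtbf):
    return sqrt(2 * delta * mtbf)

def waste_fraction(tau, delta, mtbf):
    return delta / tau + tau / (2 * mtbf)

delta, mtbf = 2.0, 180.0                 # minutes: 2-min writes, 3-h MTBF
tau = tau_opt(delta, mtbf)               # ~27 min
w = waste_fraction(tau, delta, mtbf)     # ~15%; both terms equal at tau_opt
```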
8. Failure Probability
Implemented in mlsysim.core.formulas.calc_failure_probability.
The probability of at least one failure during a job of duration \(T\) follows an exponential model:
\[ P(\geq 1\text{ failure}) = 1 - e^{-T / \text{MTBF}} \]
For a 30-day training run on a cluster with MTBF of 5 hours: \(P = 1 - e^{-720/5} \approx 1.0\) — failure is virtually certain. This is why checkpointing (Section 7) is not optional at scale.
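A minimal sketch of the exponential model (the real computation lives in `mlsysim.core.formulas.calc_failure_probability`; the helper name here is ours):

```python
from math import exp

# Exponential failure-model sketch: P(>=1 failure) = 1 - exp(-T/MTBF).
def p_at_least_one_failure(duration_h, mtbf_h):
    return 1.0 - exp(-duration_h / mtbf_h)

month_at_scale = p_at_least_one_failure(720, 5)   # ~1.0: failure is certain
hour_small = p_at_least_one_failure(1, 50)        # ~2% for a short job on a
                                                  # 1,000-GPU cluster (50 h MTBF)
```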
9. Effective FLOPS (Fleet Goodput)
Implemented in mlsysim.core.formulas.calc_effective_flops. 📚 Slide Deck: Performance Engineering (Vol II, Ch 10)
The actual useful computation delivered by a fleet after all overheads:
\[ \text{Effective\_FLOP/s} = \text{Peak\_FLOP/s} \times \underbrace{\text{MFU}}_{\text{per-GPU}} \times \underbrace{\eta_{\text{scale}}}_{\text{comm overhead}} \times \underbrace{\frac{\text{Goodput}}{\text{Rawput}}}_{\text{failure waste}} \]
Each factor in the cascade represents a distinct loss mechanism. Typical values for large-scale training: MFU ≈ 0.3–0.5, \(\eta_{\text{scale}}\) ≈ 0.4–0.85, Goodput/Rawput ≈ 0.85–0.97. The compound effect means a fleet may deliver only 10–35% of its theoretical peak.
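Multiplying mid-range values from the cascade shows how quickly the losses compound (illustrative sketch, not `calc_effective_flops` itself):

```python
# Goodput-cascade sketch (illustrative name, matching the equation above).
def effective_fraction(mfu, scale_eff, goodput_ratio):
    """Fraction of peak FLOP/s the fleet actually delivers."""
    return mfu * scale_eff * goodput_ratio

# Mid-range values from the text: the fleet keeps only ~26% of peak FLOP/s.
frac = effective_fraction(mfu=0.45, scale_eff=0.65, goodput_ratio=0.90)
```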
These equations are first-order analytical models. They assume: (1) uniform memory access patterns, (2) no cache hierarchy effects, (3) no network contention under heavy load, and (4) linear scaling of throughput with batch size.
Real systems deviate from these assumptions. MLSYSIM predictions are typically accurate within ±15–30% of measured hardware performance — sufficient for systems intuition and capacity planning, but not a substitute for empirical profiling. See Accuracy & Validation for detailed comparisons against MLPerf benchmarks.
References
The equations above are grounded in the following peer-reviewed sources:
- Williams, Waterman & Patterson (2009). “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM. → Section 1
- Kaplan et al. (2020). “Scaling Laws for Neural Language Models.” arXiv:2001.08361. → Section 3
- Narayanan et al. (2021). “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” SC ’21. → Section 2.4
- Shoeybi et al. (2019). “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv:1909.08053. → Section 2
- Patarasuk & Yuan (2009). “Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.” Journal of Parallel and Distributed Computing. → Section 2.3
- Shazeer et al. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR. → Section 2.5
- Young (1974). “A First-Order Approximation to the Optimum Checkpoint Interval.” Information Processing Letters. → Section 7
- Daly (2006). “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps.” Future Generation Computer Systems. → Section 7
- Patterson et al. (2021). “Carbon Emissions and Large Neural Network Training.” arXiv:2104.10350. → Section 5
- Barroso, Clidaras & Hölzle (2018). “The Datacenter as a Computer.” Synthesis Lectures on Computer Architecture. → Section 6