AI Engineering at Scale
Semester 2: Distributed Systems & Fleets — Week-by-Week
Course Overview
| Textbook | Volume II: Machine Learning Systems at Scale |
| Duration | 16 weeks (32 lectures at 75 min each) |
| Prerequisites | Foundations semester (Volume I) or equivalent systems background |
| Scope | Multi-machine distributed systems: clusters to global fleets |
| Key Concepts | 3D parallelism, collective communication, fault tolerance, fleet orchestration |
Course Goal: Master the engineering of massive-scale distributed AI systems. Students will design infrastructure that coordinates thousands of accelerators, survives hardware failures, and serves billions of requests — while accounting for cost, energy, and societal impact.
Labs trail readings by one week — students complete the lab that reinforces the previous week’s material. Volume II focuses on systems design rather than framework implementation, so there is no TinyTorch column. Modules 09–20 are available as an optional advanced track for students continuing from Semester 1.
Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.
Part I: The Fleet (Weeks 1–4)
Goal: Build the physical computer the size of a campus.
Week 1: The Scale Imperative
| Component | Assignment |
|---|---|
| Read | Introduction to Scale |
| Lab | Lab 01: The Scale Wall |
| Due | Lab 01 Decision Log |
Learning Objectives: Explain why single-machine optimization hits a ceiling. Quantify the compute requirements of frontier model training. Articulate the transition from “fast machine” to “coordinated fleet.”
Start with a provocation: “GPT-4 training used ~25,000 GPUs for ~90 days. What happens when one fails?” This frames the entire semester.
Week 2: Compute Infrastructure
| Component | Assignment |
|---|---|
| Read | Compute Infrastructure |
| Lab | Lab 02: The Interconnect Wall |
| Due | Lab 02 Decision Log |
Learning Objectives: Describe the hierarchy from chip → node → rack → pod → cluster. Calculate bisection bandwidth for a given network topology. Explain why NVLink, PCIe, and InfiniBand serve different roles.
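The bisection-bandwidth objective lends itself to a quick in-class calculation. A minimal sketch, assuming a folded-Clos/fat-tree fabric and illustrative link rates (not tied to any specific lab hardware):

```python
def bisection_bandwidth_fat_tree(num_nodes, link_gbps, oversubscription=1.0):
    """Bisection bandwidth of a fat-tree fabric: in a full-bisection
    (non-blocking) fat-tree, half the nodes can talk to the other half
    at full line rate, so bisection BW = (N/2) * link rate, reduced by
    any leaf-level oversubscription ratio."""
    return (num_nodes / 2) * link_gbps / oversubscription

# 1,024 nodes with 400 Gb/s links, non-blocking:
full = bisection_bandwidth_fat_tree(1024, 400)          # 204,800 Gb/s
# Same fabric built 4:1 oversubscribed at the leaf tier:
oversub = bisection_bandwidth_fat_tree(1024, 400, 4.0)  # 51,200 Gb/s
```

The 4x drop under oversubscription is a useful discussion hook: it is invisible for intra-rack traffic but dominates all-to-all patterns.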
Week 3: Network Fabrics
| Component | Assignment |
|---|---|
| Read | Network Fabrics |
| Lab | Lab 03: Communication Topologies |
| Due | Lab 03 Decision Log |
Learning Objectives: Compare fat-tree, torus, and dragonfly topologies. Calculate the communication overhead of a given all-reduce pattern. Explain how network fabric design constrains parallelism strategies.
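The topology comparison can be anchored with worst-case hop counts. A sketch using standard diameter formulas (the 4,096-node sizing is illustrative):

```python
def torus_diameter(k, dims=3):
    """Max hop count in a d-dimensional k-ary torus: wraparound links
    cap each dimension at floor(k/2) hops."""
    return dims * (k // 2)

def fat_tree_diameter(tiers=3):
    """Worst-case switch-to-switch hops in a folded-Clos/fat-tree:
    up to the top tier and back down."""
    return 2 * (tiers - 1)

# The same 4,096-node cluster as a 16x16x16 torus vs. a 3-tier fat-tree:
torus_diameter(16)    # 24 hops worst case
fat_tree_diameter(3)  # 4 hops worst case
```

The torus wins on cost and nearest-neighbor traffic; the fat-tree wins on worst-case distance, which is why fabric choice constrains which parallelism strategies stay cheap.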
Week 4: Distributed Storage
| Component | Assignment |
|---|---|
| Read | Data Storage |
| Lab | Lab 04: The Storage Hierarchy |
| Due | Lab 04 Decision Log |
Learning Objectives: Design a distributed storage system that feeds thousands of accelerators without starvation. Calculate throughput requirements for a given training workload. Explain the tradeoffs between local SSD, networked storage, and object stores.
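The throughput-requirement objective reduces to one multiplication. A sketch with illustrative per-GPU consumption rates (an assumption, not a lab-specified workload):

```python
def required_read_throughput_GBps(num_accelerators, samples_per_sec_per_acc,
                                  mb_per_sample):
    """Aggregate storage read bandwidth (GB/s) the input pipeline must
    sustain so the accelerators never starve."""
    return num_accelerators * samples_per_sec_per_acc * mb_per_sample / 1000

# 4,096 GPUs, each consuming 20 samples/s at 0.5 MB per sample:
required_read_throughput_GBps(4096, 20, 0.5)  # ~41 GB/s sustained
```

Sustained, not peak: a single shared filesystem that bursts to 41 GB/s but averages less will stall every step, which motivates the local-SSD caching tier.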
Part II: Distributed Algorithms (Weeks 5–8)
Goal: Coordinate computation across thousands of nodes.
Week 5: Distributed Training
| Component | Assignment |
|---|---|
| Read | Distributed Training |
| Lab | Lab 05: The Parallelism Puzzle (3D Parallelism) |
| Due | Lab 05 Decision Log |
Learning Objectives: Implement data parallelism, tensor parallelism, and pipeline parallelism conceptually. Calculate the communication-to-computation ratio for each parallelism strategy. Choose the optimal parallelism strategy for a given model size and cluster configuration.
Lab 05 is the centerpiece of the semester. Students discover that no single parallelism strategy wins — the optimal choice depends on model size, cluster topology, and communication bandwidth. Have them argue for their strategy in a class debate.
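For the communication-to-computation objective, a back-of-envelope model for pure data parallelism works well as a debate prop. All figures below (7B model, effective FLOP rate, link bandwidth) are illustrative assumptions:

```python
def data_parallel_comm_to_compute(params, tokens_per_step_per_gpu,
                                  net_GBps, sustained_flops):
    """Rough per-step ratio for pure data parallelism.
    - compute: ~6 FLOPs per parameter per token (fwd+bwd transformer
      rule of thumb)
    - comm: a ring all-reduce moves ~2x the gradient bytes per GPU
      (fp16 gradients assumed)."""
    compute_s = 6 * params * tokens_per_step_per_gpu / sustained_flops
    comm_s = 2 * (params * 2) / (net_GBps * 1e9)
    return comm_s / compute_s

# 7B-parameter model, 4,096 tokens/GPU/step, 50 GB/s links,
# 300 TFLOP/s sustained per GPU:
data_parallel_comm_to_compute(7e9, 4096, 50, 3e14)  # ~0.98
```

A ratio near 1.0 means the network is as busy as the compute, which is exactly the regime where tensor or pipeline parallelism starts to win the debate.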
Week 6: Collective Communication
| Component | Assignment |
|---|---|
| Read | Collective Communication |
| Lab | Lab 06: AllReduce Physics |
| Due | Lab 06 Decision Log |
Learning Objectives: Implement ring all-reduce and understand its bandwidth optimality. Calculate the time for an all-reduce operation given message size and network bandwidth. Explain how gradient compression reduces communication cost.
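The all-reduce timing objective maps directly onto the standard alpha-beta cost model. A sketch with illustrative latency and bandwidth figures:

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_GBps,
                           latency_us=5.0):
    """Alpha-beta cost model for ring all-reduce: 2(p-1) latency hops,
    and each GPU sends/receives 2(p-1)/p of the message, which is what
    makes the ring bandwidth-optimal."""
    p = num_gpus
    alpha = 2 * (p - 1) * latency_us * 1e-6
    beta = 2 * (p - 1) / p * message_bytes / (link_GBps * 1e9)
    return alpha + beta

# 1 GB of fp16 gradients across 64 GPUs at 25 GB/s per link:
ring_allreduce_seconds(1e9, 64, 25)  # ~0.079 s per step
```

Note that the bandwidth term is nearly independent of p, while the latency term grows linearly with p: for small messages on large rings, latency dominates, which motivates tree algorithms and gradient bucketing.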
Week 7: Fault Tolerance
| Component | Assignment |
|---|---|
| Read | Fault Tolerance |
| Lab | Lab 07: The Scheduling Trap |
| Due | Lab 07 Decision Log |
Learning Objectives: Calculate the expected time-to-failure for a 10,000-GPU cluster. Design a checkpointing strategy that balances overhead and recovery time. Explain why fault tolerance is an engineering requirement, not an optional feature, at scale.
At 10,000 nodes with 0.1% daily failure rate, you lose ~10 nodes per day. This is not exceptional — it is the steady state. System design must assume failure, not prevent it.
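The steady-state framing above, plus the checkpointing objective, can be quantified with the classic Young/Daly approximation. The checkpoint cost below (5 minutes) is an illustrative assumption:

```python
import math

def cluster_mtbf_hours(node_mtbf_hours, num_nodes):
    """Fleet-level mean time between failures shrinks linearly
    with fleet size."""
    return node_mtbf_hours / num_nodes

def optimal_checkpoint_interval_hours(checkpoint_hours, mtbf_hours):
    """Young/Daly approximation: interval = sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2 * checkpoint_hours * mtbf_hours)

# 0.1% daily failure rate means a node MTBF of ~1,000 days; at 10,000
# nodes the fleet sees a failure roughly every 2.4 hours:
mtbf = cluster_mtbf_hours(1000 * 24, 10_000)      # 2.4 h
optimal_checkpoint_interval_hours(5 / 60, mtbf)   # ~0.63 h
```

Checkpointing every ~38 minutes to survive a failure every ~2.4 hours makes the point viscerally: at this scale, checkpoint I/O is part of the training loop, not an afterthought.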
Week 8: Fleet Orchestration
| Component | Assignment |
|---|---|
| Read | Fleet Orchestration |
| Lab | Lab 08: The Inference Economy |
| Due | Lab 08 Decision Log |
Learning Objectives: Explain how Slurm and Kubernetes manage heterogeneous cluster resources. Design a scheduling policy that maximizes cluster utilization while meeting SLA deadlines. Calculate the cost of fragmentation in a multi-tenant cluster.
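The fragmentation objective is easy to demonstrate with a toy packer. This is a deliberately naive first-fit sketch (real schedulers like Slurm and Kubernetes do far more), with made-up job sizes:

```python
def stranded_gpus(node_size, job_sizes):
    """First-fit-decreasing packing of jobs onto fixed-size nodes;
    returns GPUs left stranded in partially filled nodes, i.e. capacity
    no full-node job can use."""
    nodes = []  # free GPUs remaining on each opened node
    for job in sorted(job_sizes, reverse=True):
        for i, free in enumerate(nodes):
            if free >= job:
                nodes[i] -= job
                break
        else:
            nodes.append(node_size - job)
    return sum(nodes)

# 8-GPU nodes, a mix of 6-, 5-, and 3-GPU tenant jobs:
stranded_gpus(8, [5, 5, 3, 6, 6, 3])  # 4 GPUs stranded across 4 nodes
```

28 GPUs of demand consume 32 GPUs of capacity: a 12.5% fragmentation tax that students can convert straight into dollars using the Week 12 TCO figures.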
Part III: Deployment at Scale (Weeks 9–12)
Goal: Serve the world reliably.
Week 9: Performance Engineering
| Component | Assignment |
|---|---|
| Read | Performance Engineering |
| Lab | Lab 09: The Optimization Trap |
| Due | Lab 09 Decision Log |
Learning Objectives: Profile a distributed training workload and identify the dominant bottleneck (compute, communication, or I/O). Apply the Roofline model to a multi-node system. Explain the difference between strong scaling and weak scaling.
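The strong-vs-weak-scaling objective pairs naturally with Amdahl's and Gustafson's laws. A sketch with an illustrative 5% serial fraction:

```python
def strong_scaling_speedup(serial_fraction, num_workers):
    """Amdahl's law: fixed problem size, so speedup is capped by the
    non-parallelizable fraction."""
    return 1 / (serial_fraction + (1 - serial_fraction) / num_workers)

def weak_scaling_speedup(serial_fraction, num_workers):
    """Gustafson's law: problem size grows with the fleet, so the
    serial fraction stays a fixed sliver of each worker's time."""
    return serial_fraction + (1 - serial_fraction) * num_workers

# 5% non-parallelizable work, 1,024 workers:
strong_scaling_speedup(0.05, 1024)  # ~19.6x, the strong-scaling wall
weak_scaling_speedup(0.05, 1024)    # ~973x
```

The gap between 19.6x and 973x explains why frontier labs grow the model with the cluster rather than racing a fixed workload across more chips.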
Week 10: Inference at Scale
| Component | Assignment |
|---|---|
| Read | Inference at Scale |
| Lab | Lab 10: The KV-Cache Memory Wall |
| Due | Lab 10 Decision Log |
Learning Objectives: Explain how the KV-cache dominates memory during autoregressive generation. Calculate the memory required for a given model serving a given number of concurrent users. Describe how PagedAttention and continuous batching improve serving throughput.
Have students calculate: “How much GPU memory does serving GPT-3 (175B parameters) to 100 concurrent users at sequence length 2048 require?” The answer is sobering and motivates every optimization in this chapter.
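The calculation above can be sketched directly from published GPT-3 architecture figures (96 layers, model dimension 12288), assuming fp16 KV entries and no cache optimizations:

```python
def kv_cache_bytes(num_layers, d_model, seq_len, num_users,
                   bytes_per_elem=2):
    """Naive KV-cache footprint: 2 tensors (K and V) per layer, each
    d_model wide per token, per concurrent user, at fp16."""
    return 2 * num_layers * d_model * bytes_per_elem * seq_len * num_users

# GPT-3 (96 layers, d_model = 12288), 100 users at seq len 2048:
cache = kv_cache_bytes(96, 12288, 2048, 100)
cache / 1e9  # ~966 GB of KV-cache alone, before the ~350 GB of fp16 weights
```

Roughly a terabyte of cache for 100 users is the sobering answer: it is why PagedAttention, quantized caches, and continuous batching are serving necessities rather than refinements.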
Week 11: Edge Intelligence
| Component | Assignment |
|---|---|
| Read | Edge Intelligence |
| Lab | Lab 11: Edge Thermodynamics |
| Due | Lab 11 Decision Log |
Learning Objectives: Explain the latency, privacy, and energy advantages of edge deployment. Calculate the thermal envelope for a given edge device running inference. Design a split inference system that partitions computation between edge and cloud.
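The thermal-envelope objective comes down to a power budget divided by energy per inference. A sketch with illustrative device figures (the 10 W limit, idle draw, and joules-per-inference are assumptions, not measurements):

```python
def sustained_inference_rate(thermal_limit_w, idle_w, joules_per_inference):
    """Max sustained inference rate under a fixed thermal envelope:
    rate = usable power budget / energy per inference."""
    return (thermal_limit_w - idle_w) / joules_per_inference

# 10 W passively cooled edge box, 2 W idle draw, 0.5 J per detection frame:
sustained_inference_rate(10, 2, 0.5)  # 16 inferences/s sustained
```

Peak silicon throughput may be far higher; the envelope, not the FLOPs, sets the sustained rate, which is the core lesson of Lab 11.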
Week 12: Operations at Scale
| Component | Assignment |
|---|---|
| Read | Ops at Scale |
| Lab | Lab 12: The Silent Fleet |
| Due | Lab 12 Decision Log |
Learning Objectives: Design a monitoring and alerting system for a production ML fleet. Calculate the 3-year Total Cost of Ownership (TCO) for a cluster configuration. Explain blue-green deployments and canary rollouts for ML models.
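The TCO objective can start from a deliberately simple model. Every rate below (capex, power, PUE, energy price, opex fraction) is an illustrative placeholder students should replace with researched figures:

```python
def three_year_tco(capex_usd, it_power_kw, pue=1.2, usd_per_kwh=0.08,
                   annual_opex_frac=0.05):
    """Simple 3-year TCO: hardware capex + facility-adjusted energy
    + staffing/maintenance modeled as a fraction of capex per year."""
    energy_usd = it_power_kw * pue * usd_per_kwh * 24 * 365 * 3
    opex_usd = annual_opex_frac * capex_usd * 3
    return capex_usd + energy_usd + opex_usd

# 1,024-GPU cluster: $40M capex, ~700 kW IT load:
three_year_tco(40e6, 700)  # ~$47.8M over 3 years
```

Even in this toy model, depreciating hardware dominates energy, which reframes the utilization discussion from Week 8: every stranded GPU-hour is mostly wasted capex.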
Part IV: The Responsible Fleet (Weeks 13–16)
Goal: Manage the societal and environmental impact of scale.
Week 13: Security and Privacy
| Component | Assignment |
|---|---|
| Read | Security & Privacy |
| Lab | Lab 13: The Price of Privacy |
| Due | Lab 13 Decision Log |
Learning Objectives: Explain model inversion, membership inference, and adversarial attacks as systems problems. Calculate the throughput overhead of differential privacy. Design a secure serving pipeline with appropriate threat modeling.
Week 14: Robust AI
| Component | Assignment |
|---|---|
| Read | Robust AI |
| Lab | Lab 14: The Robustness Budget |
| Due | Lab 14 Decision Log |
Learning Objectives: Define robustness as a systems property, not just a model property. Quantify the cost of adversarial defenses in throughput and latency. Design monitoring systems that detect distribution shift before accuracy degrades.
Week 15: Sustainable and Responsible AI
| Component | Assignment |
|---|---|
| Read | Sustainable AI and Responsible AI |
| Lab | Lab 15: The Fairness Budget |
| Due | Lab 15 Decision Log + Capstone draft |
Learning Objectives: Calculate the carbon footprint of a training run given datacenter PUE and grid carbon intensity. Justify datacenter location based on carbon footprint and cost. Explain how infrastructure decisions (data pipeline, serving latency, geographic deployment) create or mitigate bias.
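The carbon-footprint objective is a three-factor product: IT energy, PUE, and grid carbon intensity. A sketch with illustrative fleet and grid figures (the two intensities roughly bracket a hydro-heavy vs. a coal-heavy grid):

```python
def training_carbon_tonnes(num_gpus, gpu_kw, hours, pue, grid_kg_per_kwh):
    """tCO2e for a training run: IT energy x PUE x grid carbon intensity."""
    kwh = num_gpus * gpu_kw * hours * pue
    return kwh * grid_kg_per_kwh / 1000

# 10,000 GPUs at 0.7 kW each for 90 days, PUE 1.1:
low  = training_carbon_tonnes(10_000, 0.7, 90 * 24, 1.1, 0.05)  # ~832 t
high = training_carbon_tonnes(10_000, 0.7, 90 * 24, 1.1, 0.7)   # ~11,642 t
```

A 14x swing from siting alone is the quantitative backbone of the datacenter-location justification this week asks for.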
Week 16: Capstone — Fleet Synthesis
| Component | Assignment |
|---|---|
| Read | Conclusion |
| Lab | Lab 16: The Fleet Synthesis (Capstone) |
| Capstone | Distributed System Design Competition |
| Due | Final submission + design report |
Capstone Specification: Design a complete distributed training and serving infrastructure for a frontier model. Students must specify: cluster topology, parallelism strategy, fault tolerance mechanisms, serving architecture, carbon budget, and cost estimate. Final deliverable is a design document with quantitative justification for every architectural decision.
Assessment Summary
| Component | Weight | Description |
|---|---|---|
| Lab Decision Logs (15 labs) | 40% | Weekly written analysis using Iron Law and distributed systems terminology |
| Design Challenges (4 major) | 30% | One per Part: cluster design, parallelism selection, serving architecture, fleet governance |
| Capstone | 30% | Fleet Synthesis design document + presentation |
See Assessment & Grading for detailed rubrics.
Suggested Case Studies
These industry papers pair well with specific weeks:
| Week | Topic | Suggested Paper |
|---|---|---|
| 1 | Scale Imperative | Kaplan et al., “Scaling Laws for Neural Language Models” (2020) |
| 5 | Distributed Training | Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models” (2019) |
| 6 | Collective Comms | Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters” (2021) |
| 7 | Fault Tolerance | Jeon et al., “Analysis of Large-Scale Multi-Tenant GPU Clusters” (ATC 2019) |
| 10 | Inference at Scale | Kwon et al., “Efficient Memory Management for LLM Serving with PagedAttention” (SOSP 2023) |
| 12 | Ops at Scale | Zhao et al., “Characterizing GPU Cluster Failures at Scale” (ATC 2024) |
| 15 | Sustainability | Patterson et al., “Carbon Emissions and Large Neural Network Training” (2021) |