AI Engineering at Scale
Semester 2: Distributed Systems & Fleets — Week-by-Week
Course Overview
| Textbook | Volume II: Machine Learning Systems at Scale |
| Duration | 16 weeks (32 lectures at 75 min each) |
| Prerequisites | Foundations semester (Volume I) or equivalent systems background |
| Scope | Multi-machine distributed systems: clusters to global fleets |
| Key Concepts | 3D parallelism, collective communication, fault tolerance, fleet orchestration |
Course Goal: Master the engineering of massive-scale distributed AI systems. Students will design infrastructure that coordinates thousands of accelerators, survives hardware failures, and serves billions of requests — while accounting for cost, energy, and societal impact.
Labs trail readings by one week — students complete the lab that reinforces the previous week’s material. Volume II focuses on systems design rather than framework implementation, so there is no TinyTorch column. Modules 09–20 are available as an optional advanced track for students continuing from Semester 1.
Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.
Part I: The Fleet (Weeks 1–4)
Goal: Build the physical computer the size of a campus.
Week 1: The Scale Imperative
| Component | Assignment |
|---|---|
| Read | Introduction to Scale |
| Lab | Lab 01: The Scale Wall |
| Due | Lab 01 Decision Log |
Learning Objectives: Explain why single-machine optimization hits a ceiling. Quantify the compute requirements of frontier model training. Articulate the transition from “fast machine” to “coordinated fleet.”
Start with a provocation: “GPT-4 training used ~25,000 GPUs for ~90 days. What happens when one fails?” This frames the entire semester.
Week 2: Compute Infrastructure
| Component | Assignment |
|---|---|
| Read | Compute Infrastructure |
| Lab | Lab 02: The Interconnect Wall |
| Due | Lab 02 Decision Log |
Learning Objectives: Describe the hierarchy from chip → node → rack → pod → cluster. Calculate bisection bandwidth for a given network topology. Explain why NVLink, PCIe, and InfiniBand serve different roles.
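The bisection-bandwidth objective lends itself to a quick in-class calculation. A minimal sketch, assuming a folded-Clos/fat-tree fabric and illustrative link rates (not tied to any specific lab hardware):

```python
def bisection_bandwidth_fat_tree(num_nodes, link_gbps, oversubscription=1.0):
    """Bisection bandwidth of a fat-tree fabric: in a full-bisection
    (non-blocking) fat-tree, half the nodes can talk to the other half
    at full line rate, so bisection BW = (N/2) * link rate, reduced by
    any leaf-level oversubscription ratio."""
    return (num_nodes / 2) * link_gbps / oversubscription

# 1,024 nodes with 400 Gb/s links, non-blocking:
full = bisection_bandwidth_fat_tree(1024, 400)          # 204,800 Gb/s
# Same fabric built 4:1 oversubscribed at the leaf tier:
oversub = bisection_bandwidth_fat_tree(1024, 400, 4.0)  # 51,200 Gb/s
```

The 4x drop under oversubscription is a useful discussion hook: it is invisible for intra-rack traffic but dominates all-to-all patterns.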
Week 3: Network Fabrics
| Component | Assignment |
|---|---|
| Read | Network Fabrics |
| Lab | Lab 03: Communication Topologies |
| Due | Lab 03 Decision Log |
Learning Objectives: Compare fat-tree, torus, and dragonfly topologies. Calculate the communication overhead of a given all-reduce pattern. Explain how network fabric design constrains parallelism strategies.
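The topology comparison can be anchored with worst-case hop counts. A sketch using standard diameter formulas (the 4,096-node sizing is illustrative):

```python
def torus_diameter(k, dims=3):
    """Max hop count in a d-dimensional k-ary torus: wraparound links
    cap each dimension at floor(k/2) hops."""
    return dims * (k // 2)

def fat_tree_diameter(tiers=3):
    """Worst-case switch-to-switch hops in a folded-Clos/fat-tree:
    up to the top tier and back down."""
    return 2 * (tiers - 1)

# The same 4,096-node cluster as a 16x16x16 torus vs. a 3-tier fat-tree:
torus_diameter(16)    # 24 hops worst case
fat_tree_diameter(3)  # 4 hops worst case
```

The torus wins on cost and nearest-neighbor traffic; the fat-tree wins on worst-case distance, which is why fabric choice constrains which parallelism strategies stay cheap.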
Week 4: Distributed Storage
| Component | Assignment |
|---|---|
| Read | Data Storage |
| Lab | Lab 04: The Storage Hierarchy |
| Due | Lab 04 Decision Log |
Learning Objectives: Design a distributed storage system that feeds thousands of accelerators without starvation. Calculate throughput requirements for a given training workload. Explain the tradeoffs between local SSD, networked storage, and object stores.
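The throughput-requirement objective reduces to one multiplication. A sketch with illustrative per-GPU consumption rates (an assumption, not a lab-specified workload):

```python
def required_read_throughput_GBps(num_accelerators, samples_per_sec_per_acc,
                                  mb_per_sample):
    """Aggregate storage read bandwidth (GB/s) the input pipeline must
    sustain so the accelerators never starve."""
    return num_accelerators * samples_per_sec_per_acc * mb_per_sample / 1000

# 4,096 GPUs, each consuming 20 samples/s at 0.5 MB per sample:
required_read_throughput_GBps(4096, 20, 0.5)  # ~41 GB/s sustained
```

Sustained, not peak: a single shared filesystem that bursts to 41 GB/s but averages less will stall every step, which motivates the local-SSD caching tier.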
Part II: Distributed Algorithms (Weeks 5–8)
Goal: Coordinate computation across thousands of nodes.
Week 5: Distributed Training
| Component | Assignment |
|---|---|
| Read | Distributed Training |
| Lab | Lab 05: The Parallelism Puzzle (3D Parallelism) |
| Due | Lab 05 Decision Log |
Learning Objectives: Implement data parallelism, tensor parallelism, and pipeline parallelism conceptually. Calculate the communication-to-computation ratio for each parallelism strategy. Choose the optimal parallelism strategy for a given model size and cluster configuration.
Lab 05 is the centerpiece of the semester. Students discover that no single parallelism strategy wins — the optimal choice depends on model size, cluster topology, and communication bandwidth. Have them argue for their strategy in a class debate.
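For the communication-to-computation objective, a back-of-envelope model for pure data parallelism works well as a debate prop. All figures below (7B model, effective FLOP rate, link bandwidth) are illustrative assumptions:

```python
def data_parallel_comm_to_compute(params, tokens_per_step_per_gpu,
                                  net_GBps, sustained_flops):
    """Rough per-step ratio for pure data parallelism.
    - compute: ~6 FLOPs per parameter per token (fwd+bwd transformer
      rule of thumb)
    - comm: a ring all-reduce moves ~2x the gradient bytes per GPU
      (fp16 gradients assumed)."""
    compute_s = 6 * params * tokens_per_step_per_gpu / sustained_flops
    comm_s = 2 * (params * 2) / (net_GBps * 1e9)
    return comm_s / compute_s

# 7B-parameter model, 4,096 tokens/GPU/step, 50 GB/s links,
# 300 TFLOP/s sustained per GPU:
data_parallel_comm_to_compute(7e9, 4096, 50, 3e14)  # ~0.98
```

A ratio near 1.0 means the network is as busy as the compute, which is exactly the regime where tensor or pipeline parallelism starts to win the debate.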
Week 6: Collective Communication
| Component | Assignment |
|---|---|
| Read | Collective Communication |
| Lab | Lab 06: AllReduce Physics |
| Due | Lab 06 Decision Log |
Learning Objectives: Implement ring all-reduce and understand its bandwidth optimality. Calculate the time for an all-reduce operation given message size and network bandwidth. Explain how gradient compression reduces communication cost.
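The all-reduce timing objective maps directly onto the standard alpha-beta cost model. A sketch with illustrative latency and bandwidth figures:

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_GBps,
                           latency_us=5.0):
    """Alpha-beta cost model for ring all-reduce: 2(p-1) latency hops,
    and each GPU sends/receives 2(p-1)/p of the message, which is what
    makes the ring bandwidth-optimal."""
    p = num_gpus
    alpha = 2 * (p - 1) * latency_us * 1e-6
    beta = 2 * (p - 1) / p * message_bytes / (link_GBps * 1e9)
    return alpha + beta

# 1 GB of fp16 gradients across 64 GPUs at 25 GB/s per link:
ring_allreduce_seconds(1e9, 64, 25)  # ~0.079 s per step
```

Note that the bandwidth term is nearly independent of p, while the latency term grows linearly with p: for small messages on large rings, latency dominates, which motivates tree algorithms and gradient bucketing.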
Week 7: Fault Tolerance
| Component | Assignment |
|---|---|
| Read | Fault Tolerance |
| Lab | Lab 07: The Scheduling Trap |
| Due | Lab 07 Decision Log |
Learning Objectives: Calculate the expected time-to-failure for a 10,000-GPU cluster. Design a checkpointing strategy that balances overhead and recovery time. Explain why fault tolerance is an engineering requirement, not an optional feature, at scale.
At 10,000 nodes with 0.1% daily failure rate, you lose ~10 nodes per day. This is not exceptional — it is the steady state. System design must assume failure, not prevent it.
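The steady-state framing above, plus the checkpointing objective, can be quantified with the classic Young/Daly approximation. The checkpoint cost below (5 minutes) is an illustrative assumption:

```python
import math

def cluster_mtbf_hours(node_mtbf_hours, num_nodes):
    """Fleet-level mean time between failures shrinks linearly
    with fleet size."""
    return node_mtbf_hours / num_nodes

def optimal_checkpoint_interval_hours(checkpoint_hours, mtbf_hours):
    """Young/Daly approximation: interval = sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2 * checkpoint_hours * mtbf_hours)

# 0.1% daily failure rate means a node MTBF of ~1,000 days; at 10,000
# nodes the fleet sees a failure roughly every 2.4 hours:
mtbf = cluster_mtbf_hours(1000 * 24, 10_000)      # 2.4 h
optimal_checkpoint_interval_hours(5 / 60, mtbf)   # ~0.63 h
```

Checkpointing every ~38 minutes to survive a failure every ~2.4 hours makes the point viscerally: at this scale, checkpoint I/O is part of the training loop, not an afterthought.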
Week 8: Fleet Orchestration
| Component | Assignment |
|---|---|
| Read | Fleet Orchestration |
| Lab | Lab 08: The Inference Economy |
| Due | Lab 08 Decision Log |
Learning Objectives: Explain how Slurm and Kubernetes manage heterogeneous cluster resources. Design a scheduling policy that maximizes cluster utilization while meeting SLA deadlines. Calculate the cost of fragmentation in a multi-tenant cluster.
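The fragmentation objective is easy to demonstrate with a toy packer. This is a deliberately naive first-fit sketch (real schedulers like Slurm and Kubernetes do far more), with made-up job sizes:

```python
def stranded_gpus(node_size, job_sizes):
    """First-fit-decreasing packing of jobs onto fixed-size nodes;
    returns GPUs left stranded in partially filled nodes, i.e. capacity
    no full-node job can use."""
    nodes = []  # free GPUs remaining on each opened node
    for job in sorted(job_sizes, reverse=True):
        for i, free in enumerate(nodes):
            if free >= job:
                nodes[i] -= job
                break
        else:
            nodes.append(node_size - job)
    return sum(nodes)

# 8-GPU nodes, a mix of 6-, 5-, and 3-GPU tenant jobs:
stranded_gpus(8, [5, 5, 3, 6, 6, 3])  # 4 GPUs stranded across 4 nodes
```

28 GPUs of demand consume 32 GPUs of capacity: a 12.5% fragmentation tax that students can convert straight into dollars using the Week 12 TCO figures.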
Part III: Deployment at Scale (Weeks 9–12)
Goal: Serve the world reliably.
Week 9: Performance Engineering
| Component | Assignment |
|---|---|
| Read | Performance Engineering |
| Lab | Lab 09: The Optimization Trap |
| Due | Lab 09 Decision Log |
Learning Objectives: Profile a distributed training workload and identify the dominant bottleneck (compute, communication, or I/O). Apply the Roofline model to a multi-node system. Explain the difference between strong scaling and weak scaling.
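The strong-vs-weak-scaling objective pairs naturally with Amdahl's and Gustafson's laws. A sketch with an illustrative 5% serial fraction:

```python
def strong_scaling_speedup(serial_fraction, num_workers):
    """Amdahl's law: fixed problem size, so speedup is capped by the
    non-parallelizable fraction."""
    return 1 / (serial_fraction + (1 - serial_fraction) / num_workers)

def weak_scaling_speedup(serial_fraction, num_workers):
    """Gustafson's law: problem size grows with the fleet, so the
    serial fraction stays a fixed sliver of each worker's time."""
    return serial_fraction + (1 - serial_fraction) * num_workers

# 5% non-parallelizable work, 1,024 workers:
strong_scaling_speedup(0.05, 1024)  # ~19.6x, the strong-scaling wall
weak_scaling_speedup(0.05, 1024)    # ~973x
```

The gap between 19.6x and 973x explains why frontier labs grow the model with the cluster rather than racing a fixed workload across more chips.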
Week 10: Inference at Scale
| Component | Assignment |
|---|---|
| Read | Inference at Scale |
| Lab | Lab 10: The KV-Cache Memory Wall |
| Due | Lab 10 Decision Log |
Learning Objectives: Explain how the KV-cache dominates memory during autoregressive generation. Calculate the memory required for a given model serving a given number of concurrent users. Describe how PagedAttention and continuous batching improve serving throughput.
Have students calculate: “How much GPU memory does serving GPT-3 (175B parameters) to 100 concurrent users at sequence length 2048 require?” The answer is sobering and motivates every optimization in this chapter.
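The calculation above can be sketched directly from published GPT-3 architecture figures (96 layers, model dimension 12288), assuming fp16 KV entries and no cache optimizations:

```python
def kv_cache_bytes(num_layers, d_model, seq_len, num_users,
                   bytes_per_elem=2):
    """Naive KV-cache footprint: 2 tensors (K and V) per layer, each
    d_model wide per token, per concurrent user, at fp16."""
    return 2 * num_layers * d_model * bytes_per_elem * seq_len * num_users

# GPT-3 (96 layers, d_model = 12288), 100 users at seq len 2048:
cache = kv_cache_bytes(96, 12288, 2048, 100)
cache / 1e9  # ~966 GB of KV-cache alone, before the ~350 GB of fp16 weights
```

Roughly a terabyte of cache for 100 users is the sobering answer: it is why PagedAttention, quantized caches, and continuous batching are serving necessities rather than refinements.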
Week 11: Edge Intelligence
| Component | Assignment |
|---|---|
| Read | Edge Intelligence |
| Lab | Lab 11: Edge Thermodynamics |
| Due | Lab 11 Decision Log |
Learning Objectives: Explain the latency, privacy, and energy advantages of edge deployment. Calculate the thermal envelope for a given edge device running inference. Design a split inference system that partitions computation between edge and cloud.
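The thermal-envelope objective comes down to a power budget divided by energy per inference. A sketch with illustrative device figures (the 10 W limit, idle draw, and joules-per-inference are assumptions, not measurements):

```python
def sustained_inference_rate(thermal_limit_w, idle_w, joules_per_inference):
    """Max sustained inference rate under a fixed thermal envelope:
    rate = usable power budget / energy per inference."""
    return (thermal_limit_w - idle_w) / joules_per_inference

# 10 W passively cooled edge box, 2 W idle draw, 0.5 J per detection frame:
sustained_inference_rate(10, 2, 0.5)  # 16 inferences/s sustained
```

Peak silicon throughput may be far higher; the envelope, not the FLOPs, sets the sustained rate, which is the core lesson of Lab 11.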
Week 12: Operations at Scale
| Component | Assignment |
|---|---|
| Read | Ops at Scale |
| Lab | Lab 12: The Silent Fleet |
| Due | Lab 12 Decision Log |
Learning Objectives: Design a monitoring and alerting system for a production ML fleet. Calculate the 3-year Total Cost of Ownership (TCO) for a cluster configuration. Explain blue-green deployments and canary rollouts for ML models.
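The TCO objective can start from a deliberately simple model. Every rate below (capex, power, PUE, energy price, opex fraction) is an illustrative placeholder students should replace with researched figures:

```python
def three_year_tco(capex_usd, it_power_kw, pue=1.2, usd_per_kwh=0.08,
                   annual_opex_frac=0.05):
    """Simple 3-year TCO: hardware capex + facility-adjusted energy
    + staffing/maintenance modeled as a fraction of capex per year."""
    energy_usd = it_power_kw * pue * usd_per_kwh * 24 * 365 * 3
    opex_usd = annual_opex_frac * capex_usd * 3
    return capex_usd + energy_usd + opex_usd

# 1,024-GPU cluster: $40M capex, ~700 kW IT load:
three_year_tco(40e6, 700)  # ~$47.8M over 3 years
```

Even in this toy model, depreciating hardware dominates energy, which reframes the utilization discussion from Week 8: every stranded GPU-hour is mostly wasted capex.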
Part IV: The Responsible Fleet (Weeks 13–16)
Goal: Manage the societal and environmental impact of scale.
Week 13: Security and Privacy
| Component | Assignment |
|---|---|
| Read | Security & Privacy |
| Lab | Lab 13: The Price of Privacy |
| Due | Lab 13 Decision Log |
Learning Objectives: Explain model inversion, membership inference, and adversarial attacks as systems problems. Calculate the throughput overhead of differential privacy. Design a secure serving pipeline with appropriate threat modeling.
Week 14: Robust AI
| Component | Assignment |
|---|---|
| Read | Robust AI |
| Lab | Lab 14: The Robustness Budget |
| Due | Lab 14 Decision Log |
Learning Objectives: Define robustness as a systems property, not just a model property. Quantify the cost of adversarial defenses in throughput and latency. Design monitoring systems that detect distribution shift before accuracy degrades.
Week 15: Sustainable and Responsible AI
| Component | Assignment |
|---|---|
| Read | Sustainable AI and Responsible AI |
| Lab | Lab 15: The Fairness Budget |
| Due | Lab 15 Decision Log + Capstone draft |
Learning Objectives: Calculate the carbon footprint of a training run given datacenter PUE and grid carbon intensity. Justify datacenter location based on carbon footprint and cost. Explain how infrastructure decisions (data pipeline, serving latency, geographic deployment) create or mitigate bias.
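The carbon-footprint objective is a three-factor product: IT energy, PUE, and grid carbon intensity. A sketch with illustrative fleet and grid figures (the two intensities roughly bracket a hydro-heavy vs. a coal-heavy grid):

```python
def training_carbon_tonnes(num_gpus, gpu_kw, hours, pue, grid_kg_per_kwh):
    """tCO2e for a training run: IT energy x PUE x grid carbon intensity."""
    kwh = num_gpus * gpu_kw * hours * pue
    return kwh * grid_kg_per_kwh / 1000

# 10,000 GPUs at 0.7 kW each for 90 days, PUE 1.1:
low  = training_carbon_tonnes(10_000, 0.7, 90 * 24, 1.1, 0.05)  # ~832 t
high = training_carbon_tonnes(10_000, 0.7, 90 * 24, 1.1, 0.7)   # ~11,642 t
```

A 14x swing from siting alone is the quantitative backbone of the datacenter-location justification this week asks for.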
Week 16: Capstone — Fleet Synthesis
| Component | Assignment |
|---|---|
| Read | Conclusion |
| Lab | Lab 16: The Fleet Synthesis (Capstone) |
| Capstone | Distributed System Design Competition |
| Due | Final submission + design report |
Capstone Specification: Design a complete distributed training and serving infrastructure for a frontier model. Students must specify: cluster topology, parallelism strategy, fault tolerance mechanisms, serving architecture, carbon budget, and cost estimate. Final deliverable is a design document with quantitative justification for every architectural decision.
Assessment Summary
| Component | Weight | Description |
|---|---|---|
| Lab Decision Logs (15 labs) | 40% | Weekly written analysis using Iron Law and distributed systems terminology |
| Design Challenges (4 major) | 30% | One per Part: cluster design, parallelism selection, serving architecture, fleet governance |
| Capstone | 30% | Fleet Synthesis design document + presentation |
See Assessment & Grading for detailed rubrics.
Suggested Case Studies
These industry papers pair well with specific weeks:
| Week | Topic | Suggested Paper |
|---|---|---|
| 1 | Scale Imperative | Kaplan et al., “Scaling Laws for Neural Language Models” (2020) |
| 5 | Distributed Training | Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models” (2019) |
| 6 | Collective Comms | Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters” (2021) |
| 7 | Fault Tolerance | Jeon et al., “Analysis of Large-Scale Multi-Tenant GPU Clusters” (ATC 2019) |
| 10 | Inference at Scale | Kwon et al., “Efficient Memory Management for LLM Serving with PagedAttention” (SOSP 2023) |
| 12 | Ops at Scale | Zhao et al., “Characterizing GPU Cluster Failures at Scale” (ATC 2024) |
| 15 | Sustainability | Patterson et al., “Carbon Emissions and Large Neural Network Training” (2021) |