AI Engineering at Scale

Semester 2: Distributed Systems & Fleets — Week-by-Week

Course Overview

Textbook: Volume II, Machine Learning Systems at Scale
Duration: 16 weeks (32 lectures at 75 min each)
Prerequisites: Foundations semester (Volume I) or equivalent systems background
Scope: Multi-machine distributed systems, from clusters to global fleets
Key Concepts: 3D parallelism, collective communication, fault tolerance, fleet orchestration

Course Goal: Master the engineering of massive-scale distributed AI systems. Students will design infrastructure that coordinates thousands of accelerators, survives hardware failures, and serves billions of requests — while accounting for cost, energy, and societal impact.

Labs and TinyTorch in Semester 2

Labs trail readings by one week — students complete the lab that reinforces the previous week’s material. Volume II focuses on systems design rather than framework implementation, so there is no TinyTorch column. Modules 09–20 are available as an optional advanced track for students continuing from Semester 1.

Lecture Slides

Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.


Part I: The Fleet (Weeks 1–4)

Goal: Build the physical computer the size of a campus.

Week 1: The Scale Imperative

Component Assignment
Read Introduction to Scale
Lab Lab 01: The Scale Wall
Due Lab 01 Decision Log

Learning Objectives: Explain why single-machine optimization hits a ceiling. Quantify the compute requirements of frontier model training. Articulate the transition from “fast machine” to “coordinated fleet.”
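A back-of-the-envelope version of that quantification, using the common 6 x parameters x tokens rule of thumb for training FLOPs; the parameter count, token count, and utilization below are illustrative assumptions, not figures from the chapter.

```python
# Rough training-compute estimate: FLOPs ~= 6 * parameters * tokens (rule of thumb).
# Every number below is an illustrative assumption, not a figure from the text.
params = 175e9          # model parameters
tokens = 300e9          # training tokens
peak_flops = 312e12     # per-GPU peak (e.g., A100-class BF16 tensor cores)
mfu = 0.40              # assumed model FLOPs utilization

total_flops = 6 * params * tokens
gpu_seconds = total_flops / (peak_flops * mfu)
gpu_days = gpu_seconds / 86_400
print(f"~{total_flops:.2e} FLOPs -> ~{gpu_days:,.0f} GPU-days")
# Spread across 1,000 GPUs that is roughly a month of wall-clock time;
# coordinating those GPUs is the subject of this semester.
```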

Instructor Tip

Start with a provocation: “GPT-4 training used ~25,000 GPUs for ~90 days. What happens when one fails?” This frames the entire semester.

Week 2: Compute Infrastructure

Component Assignment
Read Compute Infrastructure
Lab Lab 02: The Interconnect Wall
Due Lab 02 Decision Log

Learning Objectives: Describe the hierarchy from chip → node → rack → pod → cluster. Calculate bisection bandwidth for a given network topology. Explain why NVLink, PCIe, and InfiniBand serve different roles.
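A minimal sketch of a bisection-bandwidth estimate for a two-level leaf-spine fabric; the switch counts and link speeds are illustrative assumptions.

```python
# Bisection bandwidth for a two-level leaf-spine fabric (illustrative numbers).
leaf_switches = 16
uplinks_per_leaf = 8      # links from each leaf switch into the spine layer
link_gbps = 400           # per-link speed

# Split the leaves into two equal halves; in a non-blocking design the traffic
# crossing the cut is limited by the uplinks of one half of the leaves.
bisection_gbps = (leaf_switches // 2) * uplinks_per_leaf * link_gbps
print(f"Bisection bandwidth ~= {bisection_gbps / 1000:.1f} Tbps")
# Comparing this against the sum of all host NIC bandwidth gives the
# oversubscription ratio.
```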

Week 3: Network Fabrics

Component Assignment
Read Network Fabrics
Lab Lab 03: Communication Topologies
Due Lab 03 Decision Log

Learning Objectives: Compare fat-tree, torus, and dragonfly topologies. Calculate the communication overhead of a given all-reduce pattern. Explain how network fabric design constrains parallelism strategies.

Week 4: Distributed Storage

Component Assignment
Read Data Storage
Lab Lab 04: The Storage Hierarchy
Due Lab 04 Decision Log

Learning Objectives: Design a distributed storage system that feeds thousands of accelerators without starvation. Calculate throughput requirements for a given training workload. Explain the tradeoffs between local SSD, networked storage, and object stores.
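A rough sizing sketch for the throughput objective; the GPU count, per-GPU sample rate, and sample size are illustrative assumptions.

```python
# Estimate the aggregate read throughput needed to keep accelerators fed.
# Every figure here is an illustrative assumption.
num_gpus = 4096
samples_per_gpu_per_sec = 50        # training throughput per GPU
bytes_per_sample = 600 * 1024       # e.g., a decoded image plus label, ~600 KiB

required_gbps = num_gpus * samples_per_gpu_per_sec * bytes_per_sample * 8 / 1e9
print(f"Sustained read requirement ~= {required_gbps / 1000:.1f} Tbps "
      f"({required_gbps / 8:.0f} GB/s)")
# Compare against local NVMe, parallel file systems, and object stores;
# the gap is what motivates caching and sharded data loading.
```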


Part II: Distributed Algorithms (Weeks 5–8)

Goal: Coordinate computation across thousands of nodes.

Week 5: Distributed Training

Component Assignment
Read Distributed Training
Lab Lab 05: The Parallelism Puzzle (3D Parallelism)
Due Lab 05 Decision Log

Learning Objectives: Implement data parallelism, tensor parallelism, and pipeline parallelism conceptually. Calculate the communication-to-computation ratio for each parallelism strategy. Choose the optimal parallelism strategy for a given model size and cluster configuration.
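As a warm-up for the lab, a sketch of the communication-to-computation ratio for plain data parallelism under a ring all-reduce cost model; all model and hardware numbers are illustrative assumptions.

```python
# Communication-to-computation ratio for plain data parallelism (per step).
# Model size, batch, and hardware numbers are illustrative assumptions.
params = 7e9                    # parameters
bytes_per_grad = 2              # bf16 gradients
tokens_per_gpu = 8192           # local batch in tokens
peak_flops = 312e12             # per-GPU peak
mfu = 0.4                       # assumed utilization
bus_bytes_per_sec = 50e9        # effective all-reduce bandwidth per GPU

compute_time = 6 * params * tokens_per_gpu / (peak_flops * mfu)
# Ring all-reduce moves roughly 2x the gradient volume per GPU
# (reduce-scatter followed by all-gather).
comm_time = 2 * params * bytes_per_grad / bus_bytes_per_sec
print(f"compute {compute_time:.2f}s, comm {comm_time:.2f}s, "
      f"ratio {comm_time / compute_time:.2f}")
# Ratios near or above 1 mean the step is communication-bound unless
# gradient communication overlaps with the backward pass.
```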

Instructor Tip

Lab 05 is the centerpiece of the semester. Students discover that no single parallelism strategy wins — the optimal choice depends on model size, cluster topology, and communication bandwidth. Have them argue for their strategy in a class debate.

Week 6: Collective Communication

Component Assignment
Read Collective Communication
Lab Lab 06: AllReduce Physics
Due Lab 06 Decision Log

Learning Objectives: Implement ring all-reduce and understand its bandwidth optimality. Calculate the time for an all-reduce operation given message size and network bandwidth. Explain how gradient compression reduces communication cost.
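A small calculator for the standard ring all-reduce cost model, T ~= 2(p-1)/p * N/B with latency terms ignored; the gradient size and bandwidth are illustrative assumptions.

```python
# Ring all-reduce time from the standard cost model:
#   T ~= 2 * (p - 1) / p * (message_bytes / bandwidth)   (latency ignored)
# Message size and bandwidth are illustrative assumptions.
def ring_allreduce_time(num_gpus: int, message_bytes: float, bw_bytes_per_s: float) -> float:
    p = num_gpus
    return 2 * (p - 1) / p * message_bytes / bw_bytes_per_s

grad_bytes = 7e9 * 2           # 7B parameters in bf16
bw = 50e9                      # effective per-GPU bus bandwidth, bytes/s
for p in (8, 64, 512):
    print(f"p={p:4d}: {ring_allreduce_time(p, grad_bytes, bw) * 1000:.0f} ms")
# The per-GPU time is nearly flat in p (bandwidth-optimal); what grows with p
# is the latency term this model omits.
```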

Week 7: Fault Tolerance

Component Assignment
Read Fault Tolerance
Lab Lab 07: The Scheduling Trap
Due Lab 07 Decision Log

Learning Objectives: Calculate the expected time-to-failure for a 10,000-GPU cluster. Design a checkpointing strategy that balances overhead and recovery time. Explain why fault tolerance is an engineering requirement, not an optional feature, at scale.

Key Insight

At 10,000 nodes with a 0.1% daily failure rate, you lose ~10 nodes per day. This is not exceptional; it is the steady state. System design must assume failure, not prevent it.
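A sketch pairing the expected-failure arithmetic above with the Young/Daly approximation for checkpoint interval; the per-node failure rate matches the note, while the checkpoint cost is an illustrative assumption.

```python
import math

# Expected daily failures and a checkpoint interval from the Young/Daly
# approximation. The checkpoint cost is an illustrative assumption.
nodes = 10_000
daily_failure_rate = 0.001                 # 0.1% per node per day
failures_per_day = nodes * daily_failure_rate
mtbf_hours = 24 / failures_per_day         # cluster-wide mean time between failures

checkpoint_cost_min = 5                    # time to write one checkpoint
# Young/Daly: optimal interval ~= sqrt(2 * checkpoint_cost * MTBF)
interval_min = math.sqrt(2 * checkpoint_cost_min * mtbf_hours * 60)
print(f"~{failures_per_day:.0f} failures/day, cluster MTBF ~{mtbf_hours:.1f} h, "
      f"checkpoint every ~{interval_min:.0f} min")
```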

Week 8: Fleet Orchestration

Component Assignment
Read Fleet Orchestration
Lab Lab 08: The Inference Economy
Due Lab 08 Decision Log

Learning Objectives: Explain how Slurm and Kubernetes manage heterogeneous cluster resources. Design a scheduling policy that maximizes cluster utilization while meeting SLA deadlines. Calculate the cost of fragmentation in a multi-tenant cluster.
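One way to make the fragmentation objective concrete: count free GPUs that no pending job can use because they are scattered across nodes. The free-GPU distribution, job size, and GPU-hour price are illustrative assumptions.

```python
# Fragmentation cost sketch: GPUs that sit idle because no pending job fits
# in the free slots of any single node. All numbers are illustrative assumptions.
free_gpus_per_node = [3, 1, 2, 0, 5] * 25 + [4, 2, 6]   # 128 nodes, 8 GPUs each
smallest_pending_job = 8                                 # needs one full node

total_free = sum(free_gpus_per_node)
usable = sum(f for f in free_gpus_per_node if f >= smallest_pending_job)
stranded = total_free - usable
hourly_gpu_cost = 2.50                                   # assumed $/GPU-hour
print(f"{stranded} of {total_free} free GPUs are stranded, "
      f"~${stranded * hourly_gpu_cost * 24:,.0f}/day of paid-for idle capacity")
```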


Part III: Deployment at Scale (Weeks 9–12)

Goal: Serve the world reliably.

Week 9: Performance Engineering

Component Assignment
Read Performance Engineering
Lab Lab 09: The Optimization Trap
Due Lab 09 Decision Log

Learning Objectives: Profile a distributed training workload and identify the dominant bottleneck (compute, communication, or I/O). Apply the Roofline model to a multi-node system. Explain the difference between strong scaling and weak scaling.
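A sketch contrasting strong and weak scaling using the Amdahl and Gustafson models; the serial fraction is an illustrative assumption.

```python
# Strong vs. weak scaling under a simple Amdahl/Gustafson-style model.
# The serial (non-parallelizable) fraction is an illustrative assumption.
serial_fraction = 0.05

def strong_scaling_speedup(p: int) -> float:
    # Fixed total problem size (Amdahl's law).
    return 1 / (serial_fraction + (1 - serial_fraction) / p)

def weak_scaling_speedup(p: int) -> float:
    # Problem size grows with p (Gustafson's law).
    return serial_fraction + (1 - serial_fraction) * p

for p in (8, 64, 512):
    print(f"p={p:4d}: strong {strong_scaling_speedup(p):6.1f}x, "
          f"weak {weak_scaling_speedup(p):7.1f}x")
```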

Week 10: Inference at Scale

Component Assignment
Read Inference at Scale
Lab Lab 10: The KV-Cache Memory Wall
Due Lab 10 Decision Log

Learning Objectives: Explain how the KV-cache dominates memory during autoregressive generation. Calculate the memory required for a given model serving a given number of concurrent users. Describe how PagedAttention and continuous batching improve serving throughput.

Instructor Tip

Have students calculate: “How much GPU memory does serving GPT-3 (175B parameters) to 100 concurrent users at sequence length 2048 require?” The answer is sobering and motivates every optimization in this chapter.
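One possible worked answer, assuming GPT-3's published shape (96 layers, model dimension 12288), fp16 weights and KV entries, and no paging, quantization, or cache sharing:

```python
# KV-cache sizing for the instructor-tip question, assuming GPT-3's published
# shape (96 layers, d_model 12288), fp16 storage, and no paging or quantization.
n_layers, d_model = 96, 12288
bytes_per_value = 2                    # fp16
seq_len, users = 2048, 100

kv_per_token = 2 * n_layers * d_model * bytes_per_value          # K and V
kv_per_user = kv_per_token * seq_len
weights = 175e9 * bytes_per_value                                # fp16 weights

total = weights + users * kv_per_user
print(f"KV cache per user ~{kv_per_user / 2**30:.1f} GiB, "
      f"weights ~{weights / 2**30:.0f} GiB, total ~{total / 2**30:,.0f} GiB")
# Over a terabyte of GPU memory just to hold state; this is the motivation
# for PagedAttention and continuous batching.
```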

Week 11: Edge Intelligence

Component Assignment
Read Edge Intelligence
Lab Lab 11: Edge Thermodynamics
Due Lab 11 Decision Log

Learning Objectives: Explain the latency, privacy, and energy advantages of edge deployment. Calculate the thermal envelope for a given edge device running inference. Design a split inference system that partitions computation between edge and cloud.
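A minimal thermal-envelope check for the edge objective; the energy per inference, request rate, and dissipation limit are illustrative assumptions.

```python
# Thermal-envelope check for an edge device running inference.
# Power, energy, and limits are illustrative assumptions.
joules_per_inference = 0.8        # measured or estimated energy per inference
target_inferences_per_sec = 15
idle_power_w = 1.5
thermal_limit_w = 10.0            # sustained dissipation the enclosure can handle

sustained_power = idle_power_w + joules_per_inference * target_inferences_per_sec
status = "OK" if sustained_power <= thermal_limit_w else "will throttle"
print(f"Sustained draw ~{sustained_power:.1f} W (limit {thermal_limit_w} W): {status}")
```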

Week 12: Operations at Scale

Component Assignment
Read Ops at Scale
Lab Lab 12: The Silent Fleet
Due Lab 12 Decision Log

Learning Objectives: Design a monitoring and alerting system for a production ML fleet. Calculate the 3-year Total Cost of Ownership (TCO) for a cluster configuration. Explain blue-green deployments and canary rollouts for ML models.
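A sketch of the TCO objective as a simple capex-plus-opex roll-up; every price, power figure, and staffing number is an illustrative assumption.

```python
# 3-year TCO sketch for a GPU cluster. Every price and rate below is an
# illustrative assumption, not a quoted figure.
num_gpus = 1024
capex_per_gpu = 30_000            # purchase price, amortized over 3 years
power_kw_per_gpu = 1.0            # GPU plus its share of the host
pue = 1.2                         # datacenter power usage effectiveness
electricity_per_kwh = 0.08
ops_staff_per_year = 1_500_000    # SRE/infra staffing share
network_storage_capex = 8_000_000

hours = 3 * 365 * 24
energy_cost = num_gpus * power_kw_per_gpu * pue * hours * electricity_per_kwh
tco = num_gpus * capex_per_gpu + network_storage_capex + energy_cost + 3 * ops_staff_per_year
print(f"3-year TCO ~= ${tco / 1e6:.1f}M "
      f"(~${tco / (num_gpus * hours):.2f} per GPU-hour)")
```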


Part IV: The Responsible Fleet (Weeks 13–16)

Goal: Manage the societal and environmental impact of scale.

Week 13: Security and Privacy

Component Assignment
Read Security & Privacy
Lab Lab 13: The Price of Privacy
Due Lab 13 Decision Log

Learning Objectives: Explain model inversion, membership inference, and adversarial attacks as systems problems. Calculate the throughput overhead of differential privacy. Design a secure serving pipeline with appropriate threat modeling.

Week 14: Robust AI

Component Assignment
Read Robust AI
Lab Lab 14: The Robustness Budget
Due Lab 14 Decision Log

Learning Objectives: Define robustness as a systems property, not just a model property. Quantify the cost of adversarial defenses in throughput and latency. Design monitoring systems that detect distribution shift before accuracy degrades.
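One way to ground the monitoring objective: a drift statistic computed on an input feature rather than on accuracy. The sketch uses the population stability index, which is one common choice rather than anything the chapter prescribes, and synthetic data standing in for reference and production windows.

```python
import numpy as np

# Simple drift check: population stability index (PSI) between a reference
# window and a live window of one model input feature. Thresholds are assumptions.
def psi(reference, live, bins=10):
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, edges)[0] / len(reference) + 1e-6
    live_frac = np.histogram(live, edges)[0] / len(live) + 1e-6
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 50_000)        # training-time feature distribution
live = rng.normal(0.5, 1.2, 50_000)       # shifted production distribution
print(f"PSI = {psi(ref, live):.3f}  (common rule of thumb: > 0.2 means investigate)")
```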

Week 15: Sustainable and Responsible AI

Component Assignment
Read Sustainable AI and Responsible AI
Lab Lab 15: The Fairness Budget
Due Lab 15 Decision Log + Capstone draft

Learning Objectives: Calculate the carbon footprint of a training run given datacenter PUE and grid carbon intensity. Justify datacenter location based on carbon footprint and cost. Explain how infrastructure decisions (data pipeline, serving latency, geographic deployment) create or mitigate bias.
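A sketch of the carbon-footprint calculation, energy times PUE times grid carbon intensity; the fleet size echoes the Week 1 provocation, and all other inputs are illustrative assumptions.

```python
# Carbon footprint of a training run: energy * PUE * grid carbon intensity.
# All inputs are illustrative assumptions.
gpu_hours = 25_000 * 24 * 90          # the scale of the Week 1 provocation
avg_power_kw = 0.4                    # average draw per GPU during training
pue = 1.1
grid_kgco2_per_kwh = 0.35             # varies roughly 10x by region and time of day

energy_kwh = gpu_hours * avg_power_kw * pue
emissions_tonnes = energy_kwh * grid_kgco2_per_kwh / 1000
print(f"~{energy_kwh / 1e6:.1f} GWh, ~{emissions_tonnes:,.0f} tCO2e")
# Re-running with a low-carbon grid (e.g., 0.05 kg/kWh) shows why siting matters.
```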

Week 16: Capstone — Fleet Synthesis

Component Assignment
Read Conclusion
Lab Lab 16: The Fleet Synthesis (Capstone)
Capstone Distributed System Design Competition
Due Final submission + design report

Capstone Specification: Design a complete distributed training and serving infrastructure for a frontier model. Students must specify: cluster topology, parallelism strategy, fault tolerance mechanisms, serving architecture, carbon budget, and cost estimate. Final deliverable is a design document with quantitative justification for every architectural decision.


Assessment Summary

Component Weight Description
Lab Decision Logs (15 labs) 40% Weekly written analysis using the Iron Law and distributed systems terminology
Design Challenges (4 major) 30% One per Part: cluster design, parallelism selection, serving architecture, fleet governance
Capstone 30% Fleet Synthesis design document + presentation

See Assessment & Grading for detailed rubrics.


Suggested Case Studies

These industry papers pair well with specific weeks:

Week Topic Suggested Paper
1 Scale Imperative Kaplan et al., “Scaling Laws for Neural Language Models” (2020)
5 Distributed Training Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models” (2020)
6 Collective Comms Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters” (2021)
7 Fault Tolerance Jeon et al., “Analysis of Large-Scale Multi-Tenant GPU Clusters” (ATC 2019)
10 Inference at Scale Kwon et al., “Efficient Memory Management for LLM Serving with PagedAttention” (SOSP 2023)
12 Ops at Scale Zhao et al., “Characterizing GPU Cluster Failures at Scale” (ATC 2024)
15 Sustainability Patterson et al., “Carbon Emissions and Large Neural Network Training” (2021)