Volume II: At Scale

Machine Learning Systems at Scale

18 decks covering distributed infrastructure, training, deployment, and governance across GPU fleets. 529 slides, 125 SVG figures, approximately 19 hours of teaching material.

Download All PDFs (ZIP) | Download All PPTX (ZIP) | View Source

| Ch | Title | Slides | SVGs | ~Time | Active Learning | PDF | PPTX | Source |
|---|---|---|---|---|---|---|---|---|
| 0 | ML Systems at Scale | 24 | 5 | 49 min | 9 | PDF | PPTX | Source |
| 1 | Introduction | 34 | 10 | 76 min | 9 | PDF | PPTX | Source |
| 2 | Compute Infrastructure | 33 | 7 | 72 min | 10 | PDF | PPTX | Source |
| 3 | Network Fabrics | 32 | 9 | 68 min | 8 | PDF | PPTX | Source |
| 4 | Data Storage | 33 | 7 | 69 min | 9 | PDF | PPTX | Source |
| 5 | Distributed Training Systems | 32 | 7 | 70 min | 9 | PDF | PPTX | Source |
| 6 | Collective Communication | 29 | 7 | 64 min | 8 | PDF | PPTX | Source |
| 7 | Fault Tolerance and Reliability | 32 | 9 | 68 min | 8 | PDF | PPTX | Source |
| 8 | Fleet Orchestration | 33 | 8 | 74 min | 9 | PDF | PPTX | Source |
| 9 | Inference at Scale | 32 | 8 | 71 min | 10 | PDF | PPTX | Source |
| 10 | Performance Engineering | 32 | 8 | 72 min | 8 | PDF | PPTX | Source |
| 11 | Edge Intelligence | 31 | 6 | 66 min | 10 | PDF | PPTX | Source |
| 12 | ML Operations at Scale | 31 | 8 | 66 min | 9 | PDF | PPTX | Source |
| 13 | Security and Privacy | 33 | 10 | 71 min | 10 | PDF | PPTX | Source |
| 14 | Robust AI | 33 | 8 | 71 min | 8 | PDF | PPTX | Source |
| 15 | Sustainable AI | 31 | 7 | 65 min | 9 | PDF | PPTX | Source |
| 16 | Responsible AI | 31 | 7 | 66 min | 11 | PDF | PPTX | Source |
| 17 | Conclusion | 23 | 7 | 47 min | 10 | PDF | PPTX | Source |
| Total | | 529 | 125 | ~19 hrs | 163 | | | |

Tip

PPTX files are image-based (rendered at 300 DPI) and visually identical to the PDFs, so their slide content cannot be edited. Use them for PowerPoint presenter mode and slide annotations. For editable slides, download the LaTeX source.
