---
output-file: index.html
sidebar: vol2-content
format:
  html:
    title-block-style: default
title: "Machine Learning Systems at Scale"
date: today
date-format: long
doi: "v0.2.0"
doi-title: "Version"
author:
  name: Vijay Janapa Reddi
  email: vj@eecs.harvard.edu
  url: https://vijay.seas.harvard.edu
  affiliation: Harvard University
---

::: {.content-visible when-format="html:js"}

# Welcome {.unnumbered}

```{=html}

Modern machine learning operates at scales that fundamentally change engineering requirements—models too large for single GPUs, services spanning continents, deployments carrying societal responsibilities. This book addresses AI engineering at scale. The treatment follows the lifecycle of a massive-scale system: defining the distributed architecture, building the physical infrastructure fleet, ensuring operational reliability, deploying to global users, and hardening the system for safety and responsibility.

Machine Learning Systems at Scale

Publisher: The MIT Press (2027)


```

## What You Will Learn {.unnumbered}

The four parts extend ML systems foundations into production-scale systems by following the **Fleet Stack** from bottom to top:

- **Part I: The Fleet.** Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- **Part II: Distributed ML.** Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- **Part III: Deployment at Scale.** Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- **Part IV: The Responsible Fleet.** Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.

## Prerequisites {.unnumbered}

This book assumes:

- **Foundational or equivalent** background in single-machine ML systems
- **Programming proficiency** in Python with familiarity in NumPy
- **Mathematics foundations** in linear algebra, calculus, and probability
- Familiarity with distributed systems concepts (networking, parallelism) is helpful for advanced topics

## Learn by Doing {.unnumbered}

Within the broader AI engineering curriculum, this volume is the scale and governance spine. Pair the chapters with [Co-Labs](https://mlsysbook.ai/labs/) for fleet-scale trade-off exercises, [MLSys·im](https://mlsysbook.ai/mlsysim/) for first-principles infrastructure modeling, and [StaffML](https://mlsysbook.ai/staffml/) for physics-grounded systems design practice. Instructors can adopt the full scale sequence through [The AI Engineering Blueprint](https://mlsysbook.ai/instructors/).

## Support Our Mission {.unnumbered}

```{=html}

2026 Goal: Help 100,000 students learn ML Systems. Sponsors like the EDGE AI Foundation match every star with funding that supports learning.

⭐ Star on GitHub

Support us on Open Collective →

```

```{=html}
```

## Listen to the AI Podcast {.unnumbered}

```{=html}

This short podcast, created with Google's NotebookLM and inspired by insights from our IEEE education viewpoint paper, offers an accessible overview of the book's key ideas and themes.

```

## Want to Help Out? {.unnumbered}

This is a collaborative project, and your input matters. If you would like to contribute, check out our [contribution guidelines](https://github.com/harvard-edge/cs249r_book/blob/main/book/docs/CONTRIBUTING.md). Feedback, corrections, and new ideas are welcome. Simply file a GitHub [issue](https://github.com/harvard-edge/cs249r_book/issues).

:::