Machine Learning Systems at Scale
Welcome
Modern machine learning operates at scales that fundamentally change engineering requirements—models too large for single GPUs, services spanning continents, deployments carrying societal responsibilities. This book addresses AI engineering at scale. The treatment follows the lifecycle of a massive-scale system: defining the distributed architecture, building the physical infrastructure fleet, ensuring operational reliability, deploying to global users, and hardening the system for safety and responsibility.
Publisher: The MIT Press (2026)
What You Will Learn
This book extends the foundations into production-scale systems through four parts that follow the Fleet Stack from bottom to top:
- Part I: The Fleet — Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- Part II: Distributed ML — Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- Part III: Deployment at Scale — Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- Part IV: The Responsible Fleet — Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.
Prerequisites
This book assumes:
- A foundational background in single-machine ML systems, or equivalent experience
- Programming proficiency in Python and familiarity with NumPy
- Mathematical foundations in linear algebra, calculus, and probability
- Familiarity with distributed systems concepts (networking, parallelism) is helpful for advanced topics
Support Our Mission
2026 Goal: Help 100,000 students learn ML Systems. Sponsors like the EDGE AI Foundation match every GitHub star with funding that supports learning.
Listen to the AI Podcast
This short podcast, created with Google's NotebookLM and inspired by insights from our IEEE education viewpoint paper, offers an accessible overview of the book's key ideas and themes.
Want to Help Out?
This is a collaborative project, and your input matters. If you’d like to contribute, check out our contribution guidelines. Feedback, corrections, and new ideas are welcome. Simply file a GitHub issue.