Preface
Who This Book Is For
This book is for anyone who understands machine learning on a single machine and now faces the challenge of making it work at scale. The problems of scale are not the problems of a single node, only bigger. They are qualitatively different: networks partition, hardware fails as a statistical certainty, and the societal impact of every design decision is amplified by millions of users.
The scope ranges from designing the physical substrate of an AI data center, to scaling training beyond the limits of a single accelerator, to reasoning about what happens when a production fleet serves a global user base. Throughout, the focus is on the physics of distribution—the invariants that govern every fleet-scale ML system regardless of the framework, the model, or the era.
Why a Fleet-Level Textbook
In 2012, training AlexNet took five to six days on two GPUs (Krizhevsky et al. 2012). By 2023, public GPT-4-class estimates placed frontier training at roughly 25,000 GPUs running for about three months (SemiAnalysis 2023); OpenAI did not disclose the actual hardware configuration (OpenAI et al. 2023). This is not the same engineering problem at a larger scale. It is a categorically different engineering problem.
On a single machine, performance is governed by the memory wall—the gap between processor speed and memory bandwidth. In a distributed cluster, a new wall emerges: the bisection bandwidth wall, where the minimum network bandwidth that bisects the cluster caps total system throughput. Data no longer moves through local hierarchies alone; it traverses optical fabrics governed by the speed of light between racks and across continents. Hardware failures, rare enough on a single machine to be treated as exceptions, become routine statistical events at fleet scale. A training job that does not plan for failure will fail.
These are not operational annoyances. They are physical constraints—as fundamental and as permanent as the memory wall, Amdahl’s Law, or the energy cost of data movement. They require their own engineering discipline, with its own invariants:
- The Bisection Bandwidth Theorem: The performance of a distributed system is limited by the minimum bandwidth that bisects the network topology.
- The Fleet Law: Every training step pays a communication tax that grows with cluster size while per-node computation shrinks.
- The Young-Daly Checkpoint Law: The optimal checkpoint interval balances the cost of writing state against the expected cost of lost work due to failure.
- The Serving Cost Dominance Law: For high-traffic production models, cumulative inference operational expenditure can exceed the one-time training cost, making serving efficiency a lifecycle optimization target.
- The Fairness Impossibility Law: No system can simultaneously satisfy calibration, equalized odds, and demographic parity when base rates differ between groups.
If single-machine ML systems engineering asks “what can one machine build?”, then fleet-scale ML systems engineering asks “what can a thousand machines build together, and at what cost to reliability, efficiency, and society?”
That final clause is what sets fleet-scale engineering apart. A single machine affects its user; a fleet affects everyone its outputs touch, across millions of decisions per second. At that scale, reliability, efficiency, and fairness stop being features and become engineering constraints as binding as bandwidth or power. A system that scales but cannot be trusted has not been engineered. It has only been enlarged.
This book provides that foundation. It follows the Hennessy and Patterson pedagogical model, extending the quantitative, principles-first approach to the distributed scale (Hennessy and Patterson 2011; Hennessy and Patterson 2019; Patterson and Hennessy 2017). Just as their work taught a generation to reason about processor performance from first principles, this book teaches the reader to reason about fleet performance—replacing intuition about distributed systems with measurement, and ad hoc scaling with engineering.
Introduction to Machine Learning Systems taught the bricks: the tensors, kernels, memory hierarchies, and optimization loops that compose any single-node system. A fleet is not built from more bricks. It is an orchestra. A single musician can produce music of remarkable beauty, but an orchestra is not simply many musicians playing at once. It requires a conductor, a score, precise synchronization, and the ability to recover gracefully when a player drops out. The acoustics of the concert hall shape the sound in ways that no individual instrument can control. The principles that govern orchestral performance—coordination, communication latency, fault tolerance, and the physics of the performance space—are categorically different from those that govern a solo performer. ML fleets work the same way. A training cluster, a serving fleet, and an edge deployment are vastly different systems, but beneath them sit the same building blocks: parallelism strategies, collective communication primitives, checkpoint protocols, scheduling algorithms, and the trade-offs between throughput, latency, fault tolerance, and cost.
What This Book Covers
This book focuses on the Machine Learning Fleet: the warehouse-scale computer where the network is the bus, power density is the speed limit, and failure is not an exception but a statistical certainty. This is the unit of modern ML computation, where models too large for any single device are trained across thousands of accelerators, served to millions of users, and governed by constraints that span engineering, economics, and society.
The content is organized into four parts:
- Part I: The Fleet builds the physical substrate of the AI data center from the ground up: the landscape of distributed ML systems, the silicon and cooling of compute infrastructure, the network fabrics that connect thousands of accelerators, and the scalable data storage that feeds training pipelines.
- Part II: Distributed ML establishes the logic of distribution beyond a single machine: the parallelism strategies (data, tensor, and pipeline) for training models too large for single devices, the collective communication primitives that synchronize gradients, the fault tolerance mechanisms that ensure reliability when failure is routine, and the orchestration systems that manage the fleet.
- Part III: Deployment at Scale takes the trained model from the cluster to the world: performance engineering for efficiency, inference serving at massive scale, edge intelligence for resource-constrained devices, and the operational lifecycle required to manage production fleets.
- Part IV: The Responsible Fleet confronts the forces that determine whether the fleet serves its users or harms them: security and privacy, robust system design, environmental sustainability, and responsible engineering as constraints as fundamental as power density or bisection bandwidth.
These four parts trace a path through the Fleet Stack, a layered abstraction that extends the classical ML systems stack to the distributed scale. At the bottom, Facility & Accelerators and Fabric & Storage form the physical substrate. Above them, Distributed Execution and Reliability & Control manage parallel work, communication, scheduling, and recovery. Higher still, Runtime & Serving and Operations take models to users and manage the production lifecycle. At the top, Assurance and Governance & Sustainability turn trust, risk, policy, and environmental cost into engineering constraints. Each layer provides an abstraction to the layer above and consumes the layer below. Engineering decisions at the bottom constrain possibilities at the top.
Throughout the book, a margin figure highlights which layer each chapter addresses, providing a navigation cue at every chapter opening:
The full stack above shows each row with its scope, grouped into the four Parts of the book. The Roman numerals (I–IV) mark those Parts; the row intensity marks emphasis. The stack uses one Volume II accent color throughout so that shade, not hue, carries the chapter-focus signal.
In the margin figures throughout the book, the intensity of each layer reflects how strongly the chapter engages that tier—brighter (higher intensity) means the chapter’s primary focus; dimmer (lower intensity) means contextual relevance as constraints propagate upward. Consider three chapters that illustrate how the emphasis shifts:
In Compute Infrastructure, the bottom row dominates because the chapter is about silicon, power, cooling, and physical deployment. In Distributed Training, Distributed Execution is brightest while Reliability & Control remains visibly relevant because large training runs depend on scheduling and recovery. In Responsible AI, Governance & Sustainability is primary while Assurance stays strong because accountability depends on measurable trust and risk controls. The same stack, three different stories: decisions made at one layer ripple through the layers above and below it.
Small visual notes sit in the margins wherever a relationship is fastest to grasp by eye. They are reading aids, not figures to cite later. At fleet scale, this visual grammar makes bandwidth gaps, failure blast radius, checkpoint ratios, and edge-energy trade-offs visible without interrupting the argument. When a miniature uses a log scale, it says so inside the image.
Each part builds on the previous one, so sequential reading is recommended. The reading paths below offer alternatives for readers with specific goals.
Suggested Reading Paths
Readers with different backgrounds and goals can choose among several paths through the material:
The Complete Path involves reading all chapters in sequence, following the causal chain of constraints that governs the ML fleet. The physical silicon and cooling limits dictate the wiring of the data center, which determines how data is supplied. With the hardware built, the algorithms partition the math across it, requiring precise network routing mechanics and the ability to survive inevitable hardware loss. This surviving fleet must be scheduled, tuned, and scaled out to serve millions of users. Finally, deploying at this scale demands rigorous operational frameworks that are inextricably linked to defending the system and ensuring it serves society responsibly. This path develops every concept from first principles of distribution.
The Infrastructure Path focuses on the physical fleet and orchestration. Readers should focus on Part I (Compute Infrastructure, Network Fabrics, Scalable Data Storage) and Fleet Orchestration. This path builds a deep understanding of the physical and operational substrate.
The Scaling Path provides depth in distributed training. Readers should skim Part I for infrastructure vocabulary, then focus on Distributed Training, Collective Communication, Fault Tolerance, and Performance Engineering.
The Production Path covers deployment and operations. After the introduction, readers should move to Inference at Scale, Operations at Scale, Edge Intelligence, and the Responsible Fleet chapters. This path covers everything from serving architecture to governance.
Conventions
This book uses a small set of typographic conventions to mark what each piece of prose is doing.
Bold marks first definitions
A term in bold within body prose is its formal first introduction in the volume. Each term receives this treatment exactly once. Subsequent appearances are unbolded; by then the term is part of the working vocabulary the chapter carries forward.
Every machine learning system has a dual mandate: it must manage statistical uncertainty and physical constraints simultaneously.
The bold is a contract: register this term, it will recur. A matching index entry in the back of the book points to the defining occurrence, providing a way to return to it.
Bold also appears in a second, structural role: as a functional label at the start of an item inside a callout box or a summary bullet—for example, Recommendation:, Trade-off:, or Result:. In that context, the bold is scaffolding rather than a first definition.
Italics carry specific rhetorical jobs
Italics are not used for general emphasis. Every italic span in this book performs one of a small number of specific jobs:
- Direct opposition: training vs. inference, logical vs. physical.
- Impossibility, stated as physical, mathematical, or logical law: “sub-10 ms systems cannot use distant cloud infrastructure—physics forbids it.”
- Word used as word: “the term gradient refers to a vector of partial derivatives.”
- Thesis punchline: a single italicized sentence that captures the controlling insight of a section, for example Data is Source Code.
An italicized word is performing one of these jobs. A word that is not italicized carries no implicit emphasis.
Callouts have visual identities that match their job
Callout boxes are not decoration. Each type signals a specific kind of content, so the visual identity announces what the box contains before the reader engages with it.
| Callout | Contains |
|---|---|
| Definition | Formal definition of a key term, with its significance, distinction from the nearest neighbor concept, and a common pitfall |
| Example | Historical case or concrete scenario, structured as context, insight, and systems lesson |
| Perspective | Analytical observation or mathematical nuance that enriches the surrounding argument |
| Notebook | Worked calculation walked through step by step, with values computed inline from the book’s constants; intermediate tables and equations are the math itself |
| Checkpoint | Self-check questions to verify understanding of the preceding section |
| War Story | Real production failure, structured as context, failure, consequence, and lesson |
| Lighthouse | Recurring workloads characterized by a profile (parameters, FLOPs, dominant constraint), used as anchored case studies across multiple chapters |
| Principle | Canonical book principle or invariant referenced from later chapters |
| Theorem | Formal, equation-backed result or bound, stated with its conditions and applied in later quantitative analysis |
| Takeaways | Four to seven most important results from a chapter, in summary form |
| Chapter Connection | Bridge from the chapter just ended to the chapter beginning, naming the open question the next chapter addresses |
When a callout contains a table, figure, or equation, that artifact is part of the callout’s pedagogical unit, not decoration. A Lighthouse’s profile is its characterization. A Notebook’s intermediate tables are the worked math. Reading these as separated artifacts loses the teaching arc. Standalone reference tables, by contrast, sit between paragraphs as numbered artifacts the reader returns to via @tbl- references; the list of tables at the back of each volume collects them. Both forms are valid; the visual marker (callout vs. standalone) tells the reader which one they are looking at.
A Learning Platform
Whichever path readers choose, this book is one component of a broader learning ecosystem. The complete text, interactive Colab notebooks, lecture slides, exercises, hands-on labs, and supplementary materials for every chapter are freely available at mlsysbook.ai.
An Open Source Textbook
This book is built in the open. Its full text, figures, and build system live at mlsysbook.ai/git, and contributions are welcome from every reader.
This is a deliberate choice. If AI engineering is to become a shared discipline rather than a collection of isolated practices, its foundational texts must be accessible to everyone. ML systems is a field shaped by practitioners across industry, academia, and the open source community worldwide. A textbook covering this field should reflect that breadth and be available to all, regardless of geography or institutional affiliation.
Contributions from readers, whether fixing an error, suggesting a clearer explanation, adding a worked example, or proposing new content, have materially improved every chapter. Readers who find something that could be better are encouraged to open an issue or submit a pull request. This book improves because readers participate in building it.
Prerequisites
Readers should bring the following background:
- Single-machine ML systems: Understanding of ML workflows, neural network training, model optimization, and deployment on a single machine. Readers should be comfortable reasoning about memory hierarchies, computational graphs, and hardware-software interactions.
- Programming: Fluency in Python, including functions, classes, and data manipulation with NumPy. Familiarity with at least one ML framework (PyTorch, TensorFlow, or JAX).
- Mathematics: Comfort with linear algebra, basic calculus, and probability at the undergraduate level.
The following background is helpful but not required:
- Distributed systems: Familiarity with networking concepts (TCP/IP, bandwidth, latency), parallelism, and basic distributed computing deepens understanding of the fleet infrastructure and training chapters.
- Cloud infrastructure: Experience with container orchestration (Kubernetes, Docker) or cluster schedulers (Slurm) provides useful operational context.
- Production engineering: Understanding of monitoring, observability, and site reliability practices will enrich the deployment and operations chapters.
Beyond This Book
This book covers the Machine Learning Fleet as it exists today: warehouse-scale training clusters, global serving infrastructure, and the governance challenges of operating at scale. Topics at the research frontier—including fully autonomous fleet management, next-generation interconnects, neuromorphic computing at scale, and formal verification of fleet-level safety properties—extend beyond this scope but represent active areas of investigation.
Using This Book in a Course
This book grew out of CS249r at Harvard University and the TinyML edX professional certificate program. That teaching reinforced an insight that carries to fleet scale: the same physics governs every level of the hierarchy. Bandwidth, energy, and failure set the limits on a microcontroller and on a data center alike. What changes with scale is not the physics but which limit binds, and a new binding limit is a new engineering problem. On one machine, the memory wall dominates. Across a fleet, bisection bandwidth and routine hardware failure take over, and what was once a rare exception becomes the constraint the entire system is designed around. The physics is continuous; the engineering is not.
The book is designed to support a one-semester advanced course covering the full distributed ML systems stack. Instructors may also select individual parts for shorter modules:
- Parts I–II (The Fleet and Distributed ML) suit an advanced half-semester on distributed training infrastructure.
- Parts III–IV (Deployment at Scale and The Responsible Fleet) suit an applied half-semester on production ML at scale.
For survey courses or executive programs, the four Part introductions and their governing principles alone provide a condensed overview of the quantitative laws that govern fleet-scale ML systems.
Slides, assignments, and instructor materials for the advanced sequence are available at mlsysbook.ai.
Copyright and Licensing
© Vijay Janapa Reddi and MIT Press. This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). The source is available at mlsysbook.ai/git.