Preface

Who This Book Is For

This book is for anyone who wants to understand how machine learning systems actually work, from the mathematics that define a neural network to the hardware that executes it and the engineering decisions that make it run efficiently.

Most readers will not become ML hardware architects or compiler engineers. Yet every practitioner who builds, trains, optimizes, or deploys a model makes decisions that depend on understanding the system beneath the code. This book provides that systems foundation for undergraduates encountering ML systems for the first time, practicing engineers strengthening their foundations, and researchers who need to reason about performance and deployment.

Why a Systems Textbook

In 1968, a NATO conference in Garmisch, Germany, coined the term software engineering to address a crisis: software systems had grown too complex for ad hoc programming to manage (Naur and Randell 1969). Building reliable software required its own discipline, with its own principles, methodologies, and rigor. Decades later, Hennessy and Patterson transformed computer architecture from an art into a quantitative science, giving engineers the tools to measure, predict, and optimize processor performance from first principles. In both cases, practitioners and a body of knowledge already existed. What was missing was the recognition that these activities constituted a discipline.

Naur, Peter, and Brian Randell, eds. 1969. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, 7–11 October 1968. Scientific Affairs Division, NATO.

Machine learning faces its own Garmisch moment.

For the past decade, the field has been dominated by model-centric thinking: a focus on discovering new algorithms and architectures. That focus matters, but a model is not a system. A neural network that performs well in a research notebook is useless if it cannot be served within latency constraints, if its data pipeline cannot scale to petabytes, or if its energy cost exceeds its economic value. The gap between a working model and a production system is not merely an implementation gap; it is a gap in fundamental engineering principles.

This book treats ML systems engineering not as a collection of tools, such as Kubernetes, PyTorch, and CUDA, but as a discipline governed by physical invariants. Just as civil engineers cannot ignore gravity, AI engineers cannot ignore the constraints that govern every system they build:

The Iron Law of ML Systems: The immutable relationship between data movement, arithmetic intensity, and system performance.
Data Gravity: The physical cost of moving information at scale.
The Energy-Movement Invariant: Moving data costs orders of magnitude more energy than computing on it.
Amdahl’s Law: The serial fraction of a pipeline caps total system speedup.
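
As a small illustration of the last item, Amdahl's Law is commonly written as speedup = 1 / (s + (1 - s) / k), where s is the serial fraction of the work and k is the speedup applied to the remainder. The sketch below is a minimal Python rendering of that formula. The function names and the 10 percent serial fraction are assumptions chosen for illustration, not values taken from this book.

```python
def amdahl_speedup(serial_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only the non-serial part of the work is accelerated."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as parallel_speedup grows without bound."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical pipeline in which 10% of the time is serial
    # (an assumed figure, e.g. data loading that cannot be parallelized).
    s = 0.10
    for k in (2, 8, 64, 1024):
        print(f"accelerate the rest {k:>4}x -> overall {amdahl_speedup(s, k):.2f}x")
    print(f"ceiling with unlimited acceleration: {amdahl_limit(s):.1f}x")
```

Under these assumed numbers, the overall speedup approaches but never exceeds 1 / s, which is the sense in which the serial fraction caps the system.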

Computer science studies what is computable. ML research studies what is learnable. AI engineering studies what is buildable, given the physics of real hardware and real data.

This book is the foundation for that discipline. We aim to do for AI systems what Hennessy and Patterson (2017) did for computer architecture: replace intuition with measurement, and art with engineering. The goal is to teach system-level reasoning before implementation, treating data, algorithms, and hardware not as separate silos but as a single, coupled optimization problem.

Hennessy, John L., and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann.

Consider a box of LEGO bricks. The same interlocking pieces build a spaceship, a castle, or an entire city. The sets change; the bricks do not. ML systems work the same way. A convolutional network for image classification, a transformer for language generation, and a reinforcement learning agent for robotics are vastly different applications, but beneath them sit the same building blocks: computational graphs, memory hierarchies, data pipelines, optimization loops, and the trade-offs between latency, throughput, and energy. Mastery of those building blocks makes unfamiliar systems legible.

That is why this book teaches enduring principles rather than current tools. Dominant frameworks have shifted from Theano to TensorFlow to PyTorch in under a decade. Hardware has evolved from repurposed graphics processors to purpose-built tensor accelerators, with new architectures emerging every year. Any textbook that teaches today’s tools will be obsolete before its next edition. The building blocks endure.

What This Book Covers

This book focuses on the single compute node: one machine, one to eight accelerators, a shared memory space. This is the fundamental unit of ML computation, where mathematical models meet physical hardware. Mastering it is the prerequisite for everything else.

The content is organized into four parts, each with a distinct role in the systems argument:

Part I: Foundations develops the vocabulary, mental models, and end-to-end workflow that guide every ML project, from the characteristics that distinguish ML systems from traditional software through the data engineering pipelines that feed them.
Part II: Build connects mathematical foundations to working systems through deep learning computation, neural architectures, frameworks, and training procedures.
Part III: Optimize addresses data efficiency, model compression, hardware acceleration, and the benchmarking methodology needed to measure progress rigorously.
Part IV: Deploy covers serving infrastructure, operational practices, and responsible engineering for systems that must be fair, transparent, and trustworthy in production.

These four parts trace a path through the ML Systems Stack, a layered abstraction modeled on the classical computer systems stack. Each layer provides an abstraction to the layer above and consumes the layer below: from hardware at the bottom through frameworks, models, training, and serving, up to operations and applications at the top. Data flows through all layers as a cross-cutting interconnect, much like a data bus in computer architecture.

Throughout the book, a margin figure highlights which layers each chapter addresses, providing a recurring map of the stack.

The data bar alongside the stack is not merely decorative. Its uniform shading reflects how central data is to each chapter’s concerns, a separate dimension from the layer intensities. The connecting wires between each layer and the data bar show which layers interact with data in that chapter’s context. Three chapters illustrate how the lens shifts:

In Hardware Acceleration, the layer intensity concentrates at the bottom of the stack while the data bar is faint: the chapter is about silicon, not data. In Model Training, the middle layers glow and the data bar is moderately lit, reflecting that training consumes data but is fundamentally about optimization. In Responsible Engineering, the top layers dominate and the data bar carries a moderate tint, because biased training data surfaces as biased applications even though several layers sit between them. The same stack tells three different stories, with data as an independent dimension.

The margins also carry small visual notes at the point where a relationship is easiest to understand by sight. These are not numbered figures to cite later; they are reading aids. A ladder makes a scale gap visible, a trend sketch shows direction, a roofline thumbnail locates a bottleneck regime, and a dominance bar shows which term controls a total. When a miniature uses a log scale, it says so inside the image.

Each part builds on the previous one. We recommend reading sequentially, though readers with specific goals may follow alternative reading paths.

Conventions

This book uses a small set of typographic conventions to mark what each piece of prose is doing.

Bold marks first definitions

A term in bold within body prose is its formal first introduction in the volume. Each term receives this treatment exactly once. Subsequent appearances are unbolded; by then the term is part of the working vocabulary the chapter carries forward.

Every machine learning system has a dual mandate: it must manage statistical uncertainty and physical constraints simultaneously.

The bold is a contract: register this term, it will recur. A matching index entry in the back of the book points to the defining occurrence, providing a way to return to it.

Bold also appears in a second, structural role: as a functional label at the start of an item inside a callout box or a summary bullet—for example, Recommendation:, Trade-off:, or Result:. In that context, the bold is scaffolding rather than a first definition.

Italics carry specific rhetorical jobs

Italics are not used for general emphasis. Every italic span in this book performs one of a small number of specific jobs:

Direct opposition: training vs. inference, logical vs. physical.
Impossibility, stated as physical, mathematical, or logical law: “sub-10 ms systems cannot use distant cloud infrastructure—physics forbids it.”
Word used as word: “the term gradient refers to a vector of partial derivatives.”
Thesis punchline: a single italicized sentence that captures the controlling insight of a section, for example Data is Source Code.

An italicized word is performing one of these jobs. A word that is not italicized carries no implicit emphasis.

Callouts have visual identities that match their job

Callout boxes are not decoration. Each type signals a specific kind of content, so the visual identity announces what the box contains before the reader engages with it.

Callout	Contains
Definition	Formal definition of a key term, with its significance, distinction from the nearest neighbor concept, and a common pitfall
Example	Historical case or concrete scenario, structured as context, insight, and systems lesson
Perspective	Analytical observation or mathematical nuance that enriches the surrounding argument
Notebook	Worked calculation walked through step by step, with values computed inline from the book’s constants; intermediate tables and equations are the math itself
Checkpoint	Self-check questions to verify understanding of the preceding section
War Story	Real production failure, structured as context, failure, consequence, and lesson
Lighthouse	Recurring workloads characterized by a profile (parameters, FLOPs, dominant constraint), used as anchored case studies across multiple chapters
Principle	Canonical book principle or invariant referenced from later chapters
Theorem	Formal, equation-backed result or bound, stated with its conditions and applied in later quantitative analysis
Takeaways	Four to seven most important results from a chapter, in summary form
Chapter Connection	Bridge from the chapter just ended to the chapter beginning, naming the open question the next chapter addresses

When a callout contains a table, figure, or equation, that artifact is part of the callout’s pedagogical unit, not decoration. A Lighthouse’s profile is its characterization. A Notebook’s intermediate tables are the worked math. Reading these as separated artifacts loses the teaching arc. Standalone reference tables, by contrast, sit between paragraphs as numbered artifacts the reader returns to via cross-references; the list of tables at the back of each volume collects them. Both forms are valid; the visual marker (callout vs. standalone) tells the reader which one they are looking at.

Whichever path readers follow, the text is meant to connect with supporting resources beyond the chapters themselves.

A Learning Platform

This book is one component of a broader learning ecosystem. The complete text, interactive Colab notebooks, lecture slides, exercises, hands-on labs, and supplementary materials for every chapter are freely available at mlsysbook.ai.

An Open-Source Textbook

This book is open source. The full text, figures, and build system are available at mlsysbook.ai/git, and every reader is invited to contribute.

This is a deliberate choice. If AI engineering is to become a shared discipline rather than a collection of isolated practices, its foundational texts must be accessible to everyone. ML systems is a field shaped by practitioners across industry, academia, and the open-source community worldwide. A textbook covering this field should reflect that breadth and be available to all, regardless of geography or institutional affiliation.

Contributions from readers, whether fixing an error, suggesting a clearer explanation, adding a worked example, or proposing new content, have materially improved every chapter. Issues and pull requests provide the mechanism for that collaboration. This book improves because its readers participate in building it.

Prerequisites

Required:

Programming: Fluency in Python, including functions, classes, and basic data manipulation with NumPy.
Mathematics: Comfort with linear algebra, basic calculus, and probability at the undergraduate level.

Helpful but not required:

Computer systems: Familiarity with memory hierarchies and basic computer architecture deepens the optimization and hardware chapters.
Machine learning basics: Prior exposure to supervised learning concepts provides useful context, but Part II develops the necessary foundations from first principles.

Beyond This Book

This book covers the single compute node described earlier. Topics such as distributed training strategies, fleet-scale infrastructure, production deployment at global scale, and governance challenges that arise when ML systems operate in the real world extend beyond the scope of this book.

Completing this book therefore gives readers a foundation for the broader practice of ML systems engineering at scale.

Using This Book in a Course

This book grew out of CS249r at Harvard University and the TinyML edX professional certificate program. The TinyML course taught ML systems at the most constrained end of the spectrum: microcontrollers running on milliwatts with kilobytes of memory. It revealed a surprising lesson: the fundamental constraints that govern tiny devices (memory bandwidth, compute density, energy per operation) are the same constraints that govern data center GPUs. The physics scales; only the numbers change. That insight shaped this book. What began as a course on resource-constrained ML matured into a comprehensive treatment of ML systems engineering, grounded in the principle that mastery of the fundamentals at any scale prepares engineers for every scale.

In course settings, the book is designed to support a one-semester sequence covering the full ML systems stack. Instructors may also select individual parts for shorter modules:

Parts I–II (Foundations and Build) suit an introductory half-semester on ML development fundamentals.
Parts III–IV (Optimize and Deploy) suit an applied half-semester on production ML engineering.

For survey courses or executive programs, the four Part introductions alone provide a condensed overview of the quantitative principles that govern ML systems.

Lecture slides, assignments, and instructor resources are available at mlsysbook.ai.

Copyright and Licensing

© Vijay Janapa Reddi and MIT Press. This work is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). The source is available at mlsysbook.ai/git.