Acknowledgments

Origins

The companion volume grew out of TinyML and a realization that the constraints governing milliwatt devices are the same constraints governing data-center accelerators. This volume grew out of a different realization, one that came from working on ML Fleet Efficiency as a visiting researcher at Google. Peter Mattson, who had been a collaborator on MLPerf, made that opportunity possible. What I encountered there changed how I think about ML systems: the engineering required to coordinate thousands of accelerators into a single coherent system is a discipline unto itself, and that discipline had no textbook.

That work introduced me to the ML Productivity Goodput (MPG) metric, fleet-level heterogeneity at a scale I had not seen before, and the daily reality of failure as a statistical certainty rather than an exception. The gap between what practitioners knew and what was written down was enormous. The researchers who built these systems published individual results, but the connective tissue between those results, the principles that made them cohere into a discipline, existed only in the heads of the people doing the work. This book is an attempt to document that connective tissue.

Collaborators and Colleagues

I learned most of what I know about fleet-scale systems from the ML Fleet Efficiency team at Google: Arissa Wongpanich, Tayo Oguntebi, Jose Baiocchi Paredes, Yu Emma Wang, Phitchaya Mangpo Phothilimthana, Ritwika Mitra, and Zongwei Zhou. Naveen Kumar led that effort, and the way he thought about fleet-level optimization still shapes how I think about it. Robert Hundt pushed my understanding of performance engineering at scale. Together, Naveen and Robert showed me what fleet-level efficiency looks like from the inside. David Kanter continued the MLPerf collaboration from the companion volume, and our discussions on benchmarking distributed systems sharpened the quantitative methodology that runs through every chapter. Greg Diamos provided invaluable perspective on scaling laws and their origins at Baidu, where some of the earliest work on large-scale distributed training took shape.

The researchers whose papers fill the bibliography of this book deserve particular acknowledgment. The principles taught here were not invented for this textbook; they were discovered by the practitioners and researchers who built the systems, measured the failures, and published what they learned. This book synthesizes their work into a coherent narrative. The debt is to them.

Support

The Harvard Data Science Initiative, Harvard Extension School, and the National Science Foundation supported this work; the Edge AI Foundation and ICTP supported educational outreach and scholarships; and our industry partners provided infrastructure and tooling that enabled practical experimentation.

At MIT Press, Susan Hartman believed in this project from the beginning and guided both volumes from an open-source experiment to published textbooks. Her editorial judgment and support made this book real. I am equally grateful to MIT Press for agreeing to keep this book available as open access. A textbook about a discipline shaped by the open-source community should be accessible to everyone in it.

An open-source community shaped this book in ways that no single author could. Dozens of contributors caught errors, improved explanations, filed issues, and submitted pull requests. The book improved because people who owed it nothing chose to make it better. I am grateful to every one of them.

For the complete and most up-to-date list of all contributors, please visit the GitHub contributors page. For those interested in contributing, please consult our contribution guidelines.

Every GitHub star is a signal that someone cares about this material, is learning from it, and believes the work matters. The online community of learners, with their questions, their curiosity, and their willingness to engage with an unfinished textbook, kept this project moving when the scope felt overwhelming.

Finally, I thank my wife and my children. I wrote this book in the hours that belonged to them: weekends that disappeared into revisions, evenings that stretched past midnight, mornings when I was present but not quite there. They never complained. They asked how the writing was going, brought me coffee, and gave me the space to finish. Whatever merit this book has, their patience and their support made it possible.