---
output-file: index.html
sidebar: vol2-content
format:
  html:
    title-block-style: default
    title: "Machine Learning Systems at Scale"
    date: today
    date-format: long
    doi: "v0.2.0"
    doi-title: "Version"
    author:
      name: Vijay Janapa Reddi
      email: vj@eecs.harvard.edu
      url: https://vijay.seas.harvard.edu
      affiliation: Harvard University
---

::: {.content-visible when-format="html:js"}

# Welcome {.unnumbered}

```{=html}
<div class="abstract-section">
  <div class="abstract-content">
    <p>Modern machine learning operates at scales that fundamentally change engineering requirements: models too large for a single GPU, services that span continents, and deployments that carry societal responsibilities. This book addresses AI engineering at that scale. The treatment follows the lifecycle of a massive-scale system: defining the distributed architecture, building the physical infrastructure fleet, ensuring operational reliability, deploying to global users, and hardening the system for safety and responsibility.</p>
  </div>

  <a href="assets/downloads/Machine-Learning-Systems-Vol2.pdf" target="_blank" class="book-card-link" title="Download PDF">
    <div class="book-card">
      <img src="assets/images/covers/cover-hardcover-book-vol2.webp" alt="Machine Learning Systems Book Cover" class="book-image" />
      <p class="book-title">Machine Learning Systems at Scale</p>
      <p class="book-subtitle">Publisher: The MIT Press (2027)</p>
      <p style="font-size: 0.8em; color: #6c757d; margin-top: 6px; margin-bottom: 0;">📖 Click here to download PDF</p>
    </div>
  </a>
</div>
```

## What You Will Learn {.unnumbered}

The four parts extend ML systems foundations into production-scale systems by following the **Fleet Stack** from bottom to top:

- **Part I: The Fleet**—Build the physical computer. Architect the datacenter infrastructure, high-bandwidth network fabrics, and scalable data storage that form the foundation of every distributed ML deployment.
- **Part II: Distributed ML**—Master the algorithms of scale. Learn how to coordinate computation across thousands of devices using parallelism strategies, collective communication primitives, fault tolerance mechanisms, and fleet orchestration.
- **Part III: Deployment at Scale**—Serve the world. Navigate the shift from training to inference, optimize performance across the serving stack, push intelligence to the edge, and manage the operational lifecycle of production fleets.
- **Part IV: The Responsible Fleet**—Harden and govern the system. Address security, robustness, environmental sustainability, and responsible engineering in large-scale operations.

## Prerequisites {.unnumbered}

This book assumes:

- **Foundational background** (or equivalent experience) in single-machine ML systems
- **Programming proficiency** in Python, including familiarity with NumPy
- **Mathematics foundations** in linear algebra, calculus, and probability
- Familiarity with distributed systems concepts (networking, parallelism) is helpful for advanced topics

## Learn by Doing {.unnumbered}

Within the broader AI engineering curriculum, this volume is the scale and governance spine. Pair the chapters with [Co-Labs](https://mlsysbook.ai/labs/) for fleet-scale trade-off exercises, [MLSys·im](https://mlsysbook.ai/mlsysim/) for first-principles infrastructure modeling, and [StaffML](https://mlsysbook.ai/staffml/) for physics-grounded systems design practice. Instructors can adopt the full scale sequence through [The AI Engineering Blueprint](https://mlsysbook.ai/instructors/).

## Support Our Mission {.unnumbered}

```{=html}
<div class="support-mission">
  <p><strong>2026 Goal:</strong> Help 100,000 students learn ML Systems. Sponsors like the <a href="https://edgeaifoundation.org/" target="_blank" rel="noopener noreferrer">EDGE AI Foundation</a> match every star with funding that supports learning.</p>

  <div class="support-actions">
    <span class="star-count" id="star-count">Loading...</span>
    <a href="https://github.com/harvard-edge/cs249r_book" target="_blank" rel="noopener" class="github-star-btn">⭐ Star on GitHub</a>
  </div>

  <p class="support-note">
    <a href="https://opencollective.com/mlsysbook" target="_blank" rel="noopener">Support us on Open Collective →</a>
  </p>
</div>
```

```{=html}
<script>
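// Fetch the repository's star count from the GitHub REST API and display it.
async function fetchGitHubStars() {
  const starElement = document.getElementById('star-count');
  if (!starElement) return; // Nothing to update if the placeholder is missing.

  try {
    const response = await fetch('https://api.github.com/repos/harvard-edge/cs249r_book');
    // fetch() resolves even on HTTP errors (e.g., rate limiting), so check the status.
    if (!response.ok) {
      throw new Error('GitHub API responded with status ' + response.status);
    }
    const data = await response.json();
    starElement.textContent = data.stargazers_count.toLocaleString();
  } catch (error) {
    console.error('Failed to fetch GitHub stars:', error);
    // Replace the placeholder so the page does not appear to load forever.
    starElement.textContent = 'Unavailable';
  } finally {
    starElement.style.opacity = '1';
  }
}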

document.addEventListener('DOMContentLoaded', fetchGitHubStars);
</script>
```

## Listen to the AI Podcast {.unnumbered}

```{=html}
<div class="podcast-section">
  <p>
    This short podcast, created with Google's NotebookLM and inspired by insights from our <a href="https://web.eng.fiu.edu/gaquan/Papers/ESWEEK24Papers/CPS-Proceedings/pdfs/CODES-ISSS/563900a043/563900a043.pdf" target="_blank" rel="noopener">IEEE education viewpoint paper</a>, offers an accessible overview of the book's key ideas and themes.
  </p>
  <audio controls="controls">
    <source src="assets/media/notebooklm_podcast_mlsysbookai.mp3" type="audio/mpeg" />
    Your browser does not support the audio element.
  </audio>
</div>
```

## Want to Help Out? {.unnumbered}

This is a collaborative project, and your input matters. If you would like to contribute, see our [contribution guidelines](https://github.com/harvard-edge/cs249r_book/blob/main/book/docs/CONTRIBUTING.md). Feedback, corrections, and new ideas are welcome: to share them, file a GitHub [issue](https://github.com/harvard-edge/cs249r_book/issues).

:::
