AI Systems Foundations
Semester 1: Single-Machine Engineering — Week-by-Week
Course Overview
| Textbook | Volume I: Introduction to Machine Learning Systems |
| Duration | 16 weeks (32 lectures at 75 min each) |
| Prerequisites | Programming (Python), linear algebra, intro probability |
| Scope | Single-machine systems: 1 to 8 accelerators |
| Key Framework | The Iron Law: \(T \approx D_{vol}/BW + O/(R_{peak} \cdot \eta) + L_{lat}\) |
Course Goal: Transition students from “using models” to “engineering systems.” By the end, students will have built a complete deep learning framework from scratch and optimized it for real-world deployment constraints.
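The Iron Law above lends itself to back-of-envelope estimates students can do on day one. A minimal sketch, with purely illustrative hardware numbers (not from any specific device):

```python
# Back-of-envelope Iron Law estimate: T ≈ D_vol/BW + O/(R_peak * eta) + L_lat.
# All hardware numbers below are illustrative, not from any real device.

def iron_law_time(d_vol_bytes, bw_bytes_s, ops, r_peak_ops_s, eta, l_lat_s):
    """Estimated wall-clock time for one kernel launch."""
    transfer = d_vol_bytes / bw_bytes_s      # data-movement term
    compute = ops / (r_peak_ops_s * eta)     # compute term at efficiency eta
    return transfer + compute + l_lat_s      # plus fixed launch latency

# Example: 1024x1024 FP32 matmul (three matrices moved, 2*N^3 FLOPs).
n = 1024
d_vol = 3 * n * n * 4                        # bytes moved (A, B, C in FP32)
ops = 2 * n ** 3                             # multiply-add FLOPs
t = iron_law_time(d_vol, bw_bytes_s=900e9, ops=ops,
                  r_peak_ops_s=100e12, eta=0.5, l_lat_s=10e-6)
print(f"{t * 1e6:.1f} us")
```

Having students vary one term at a time (bandwidth, peak rate, latency) previews why no single hardware upgrade dominates total time.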
Labs trail readings by one week — students complete the lab that reinforces the previous week’s material, giving them time to absorb the theory before exploring it hands-on.
Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.
Part I: The Physics of AI (Weeks 1–4)
Goal: Understand that data movement and compute have a physical cost.
Week 1: Why ML Systems?
| Component | Assignment |
|---|---|
| Read | Introduction |
| Lab | Lab 00: The Architect’s Portal (orientation) |
| Build | TinyTorch Module 01: Tensor |
| Due | Lab 00 Decision Log |
Learning Objectives: Define what an ML system is beyond the model. Identify the three pillars of the Iron Law. Explain why a 10x GPU upgrade does not yield a 10x speedup.
Have students predict: “A GPU is how many times faster than a CPU for a 1024x1024 matrix multiply?” Record predictions on the board. Revisit after Lab 01.
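The "10x GPU, not 10x speedup" objective can be made concrete with an Amdahl-style calculation against the Iron Law: a faster accelerator shrinks only the compute term. The timings below are illustrative, chosen only to make the arithmetic vivid:

```python
# Why a 10x-faster GPU does not give a 10x speedup: only the compute term
# of the Iron Law shrinks; data movement and launch latency are untouched.
# Timings are illustrative.

def total_time(transfer_s, compute_s, latency_s):
    return transfer_s + compute_s + latency_s

before = total_time(transfer_s=4e-3, compute_s=5e-3, latency_s=1e-3)   # 10 ms
after = total_time(transfer_s=4e-3, compute_s=5e-4, latency_s=1e-3)    # 5.5 ms
print(f"speedup: {before / after:.2f}x")   # ~1.8x, not 10x
```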
Week 2: The ML Systems Landscape
| Component | Assignment |
|---|---|
| Read | ML Systems |
| Lab | Lab 01: The Magnitude Gap |
| Build | TinyTorch Module 01: Tensor (continued) |
| Due | Module 01 notebook + Lab 01 Decision Log |
Learning Objectives: Map the full ML systems stack (application → framework → runtime → hardware). Quantify the memory wall using real hardware specs. Distinguish compute-bound from memory-bound workloads.
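The compute-bound vs. memory-bound distinction reduces to comparing a workload's operational intensity against the machine balance. A sketch with illustrative specs (assumed numbers, not a real chip):

```python
# Classify a kernel as compute- or memory-bound by comparing its operational
# intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s
# divided by memory bandwidth). Hardware specs below are illustrative.

def classify(flops, bytes_moved, peak_flops_s, bw_bytes_s):
    intensity = flops / bytes_moved       # FLOPs/byte: a workload property
    balance = peak_flops_s / bw_bytes_s   # FLOPs/byte: a hardware property
    return "compute-bound" if intensity > balance else "memory-bound"

# Vector add: 1 FLOP per 12 bytes (two FP32 reads, one write).
print(classify(flops=1, bytes_moved=12,
               peak_flops_s=100e12, bw_bytes_s=1e12))   # memory-bound

# Large matmul: 2*N^3 FLOPs over ~3*N^2*4 bytes; intensity grows with N.
n = 4096
print(classify(flops=2 * n**3, bytes_moved=3 * n * n * 4,
               peak_flops_s=100e12, bw_bytes_s=1e12))   # compute-bound
```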
Week 3: The ML Workflow
| Component | Assignment |
|---|---|
| Read | ML Workflow |
| Lab | Lab 02: The Workflow Pipeline |
| Build | TinyTorch Module 02: Activations |
| Due | Lab 02 Decision Log |
Learning Objectives: Trace the end-to-end ML pipeline from data to deployment. Identify bottlenecks at each pipeline stage. Explain why training and inference have different system requirements.
Week 4: Data Engineering
| Component | Assignment |
|---|---|
| Read | Data Engineering |
| Lab | Lab 03: The Data Pipeline |
| Build | TinyTorch Module 02: Activations (continued) |
| Due | Module 02 notebook + Lab 03 Decision Log |
Learning Objectives: Calculate data pipeline throughput and identify I/O bottlenecks. Explain how data format, storage, and preprocessing affect training speed. Design a data pipeline that keeps the accelerator fed.
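The "keep the accelerator fed" objective comes down to one inequality: pipeline throughput is the rate of the slowest stage, and the GPU starves when that falls below its consumption rate. A sketch with made-up stage rates:

```python
# A pipeline's steady-state throughput is set by its slowest stage; the
# accelerator starves whenever that rate falls below its consumption rate.
# Stage rates below are illustrative (samples per second).

def pipeline_throughput(stage_rates):
    """Steady-state samples/s of a linear pipeline with overlapped stages."""
    return min(stage_rates)

stages = {"read": 12_000, "decode": 4_000, "augment": 9_000}
feed = pipeline_throughput(stages.values())
gpu_demand = 6_000   # samples/s the accelerator can consume

print(f"pipeline delivers {feed} samples/s; "
      f"GPU {'starved' if feed < gpu_demand else 'fed'} "
      f"({feed / gpu_demand:.0%} utilization)")
```

Here the decode stage is the bottleneck; students should see that speeding up any other stage changes nothing.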
Part II: Building the Stack (Weeks 5–8)
Goal: Demystify the framework layer by implementing it from scratch.
Week 5: Neural Network Computation
| Component | Assignment |
|---|---|
| Read | Neural Computation |
| Lab | Lab 04: The Computation Graph |
| Build | TinyTorch Module 03: Layers |
| Due | Lab 04 Decision Log |
Learning Objectives: Implement forward and backward passes for dense layers. Trace memory allocation during a forward pass. Calculate FLOPs for a given network architecture.
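All three objectives fit in a few lines of NumPy. This is a sketch for intuition, not the actual TinyTorch Module 03 API:

```python
import numpy as np

# Minimal dense layer: forward pass, backward pass, and a FLOP count.

def dense_forward(x, w, b):
    return x @ w + b                  # (batch, d_out)

def dense_backward(x, w, grad_out):
    grad_w = x.T @ grad_out           # accumulate over the batch
    grad_b = grad_out.sum(axis=0)
    grad_x = grad_out @ w.T           # gradient passed to earlier layers
    return grad_x, grad_w, grad_b

def dense_flops(batch, d_in, d_out):
    # One multiply-add per (batch, d_in, d_out) triple, counted as 2 FLOPs.
    return 2 * batch * d_in * d_out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w = rng.standard_normal((4, 3))
b = np.zeros(3)
y = dense_forward(x, w, b)
print(y.shape, dense_flops(8, 4, 3))   # (8, 3) 192
```

The memory-tracing objective falls out of the same code: the forward pass must keep `x` alive for `dense_backward`, which is exactly why activations dominate training memory.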
Week 6: Neural Network Architectures
| Component | Assignment |
|---|---|
| Read | NN Architectures |
| Lab | Lab 05: Architecture Tradeoffs |
| Build | TinyTorch Module 04: Losses |
| Due | Module 03 notebook + Lab 05 Decision Log |
Learning Objectives: Compare CNNs, RNNs, and Transformers from a systems perspective (memory, compute, parallelism). Explain why Transformers parallelize better than RNNs. Calculate the memory footprint of attention for a given sequence length.
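The attention-memory objective rewards a worked calculation: the score matrix is seq_len x seq_len per head per layer, so activation memory grows quadratically with sequence length. A sketch with an assumed (hypothetical) model shape:

```python
# Memory footprint of attention score matrices: one (seq_len x seq_len)
# matrix per head per layer. Model shape below is illustrative.

def attention_score_bytes(seq_len, n_heads, n_layers, bytes_per_elem=2):
    return seq_len * seq_len * n_heads * n_layers * bytes_per_elem

for seq in (1024, 4096, 16384):
    gib = attention_score_bytes(seq, n_heads=16, n_layers=24) / 2**30
    print(f"seq={seq:6d}: {gib:8.2f} GiB of FP16 attention scores")
```

Quadrupling the sequence length multiplies this term by 16, which is the systems motivation for FlashAttention-style recomputation covered in Semester 2.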
Week 7: ML Frameworks
| Component | Assignment |
|---|---|
| Read | ML Frameworks |
| Lab | Lab 06: The Dispatch Tax |
| Build | TinyTorch Module 05: DataLoader |
| Due | Module 04 notebook + Lab 06 Decision Log |
Learning Objectives: Explain eager vs. graph execution and their tradeoffs. Identify GPU starvation from a profiling trace. Describe how operator fusion reduces memory traffic.
This is the “aha” week. Students have been building TinyTorch piece by piece — now they see how frameworks like PyTorch solve the same problems at scale. Ask: “What would you do differently in your TinyTorch implementation now?”
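The operator-fusion objective is easiest to teach with a traffic count rather than a kernel. A sketch of the bookkeeping, assuming FP32 element-wise ops:

```python
# Operator fusion: y = relu(x * a + b) as three separate kernels writes and
# re-reads intermediates through memory; one fused kernel touches only x
# and y, keeping intermediates in registers. Counts assume FP32 elements.

def unfused_traffic(n, bytes_per_elem=4):
    # mul: read x, write t1; add: read t1, write t2; relu: read t2, write y.
    return 6 * n * bytes_per_elem

def fused_traffic(n, bytes_per_elem=4):
    # Single kernel: read x, write y.
    return 2 * n * bytes_per_elem

n = 1 << 20   # one million elements
print(f"unfused: {unfused_traffic(n) / 2**20:.0f} MiB, "
      f"fused: {fused_traffic(n) / 2**20:.0f} MiB "
      f"({unfused_traffic(n) // fused_traffic(n)}x less traffic)")
```

Since these ops are memory-bound, the 3x traffic reduction translates almost directly into a 3x speedup, which ties fusion back to the Iron Law's data-movement term.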
Week 8: Training
| Component | Assignment |
|---|---|
| Read | Training |
| Lab | Lab 07: The Training Loop |
| Build | TinyTorch Module 06: Autograd |
| Due | Module 05 notebook + Lab 07 Decision Log |
Learning Objectives: Implement automatic differentiation (reverse mode). Explain how batch size affects memory, throughput, and convergence. Profile a training loop and identify the dominant cost.
By Week 8, students should have a working TinyTorch that can: create tensors, apply activations, build layers, compute losses, load data, and auto-differentiate. This is the foundation for everything that follows.
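For instructors previewing Module 06, reverse-mode autodiff fits in a micrograd-style scalar class. This is a sketch of the idea, not TinyTorch's actual API:

```python
# Minimal reverse-mode autograd for scalars: each operation records, for
# every input, the local derivative d(out)/d(input).

class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad, self.parents = data, 0.0, parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order nodes so every consumer is processed before its inputs.
        topo, seen = [], set()

        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.parents:
                    build(parent)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad

x, y = Value(3.0), Value(4.0)
z = x * y + x                  # z = x*y + x
z.backward()
print(x.grad, y.grad)          # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

The topological sort is the step students most often skip; without it, a node reused by multiple consumers (like `x` above) can receive an incomplete gradient.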
Part III: The Optimization Frontier (Weeks 9–12)
Goal: Make it fast, make it small, measure everything.
Week 9: Data Selection and Curation
| Component | Assignment |
|---|---|
| Read | Data Selection |
| Lab | Lab 08: Data Quality |
| Build | TinyTorch Module 07: Optimizers |
| Due | Module 06 notebook + Lab 08 Decision Log |
Learning Objectives: Quantify the impact of data quality on model performance. Explain curriculum learning from a systems perspective. Implement SGD and Adam optimizers from scratch.
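The optimizer objective can be framed as two pure update rules on a parameter vector; this is the shape of what Module 07 builds, sketched with the commonly used default hyperparameters:

```python
import numpy as np

# SGD and Adam as pure update rules. Hyperparameters are common defaults.

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = ||w||^2 (gradient 2w) for a few hundred steps.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(np.linalg.norm(w) < np.linalg.norm([1.0, -2.0]))   # True: moved toward 0
```

Note the systems angle: Adam carries two extra state tensors per parameter, tripling optimizer memory relative to SGD.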
Week 10: Model Compression
| Component | Assignment |
|---|---|
| Read | Model Compression |
| Lab | Lab 09: Quantization (INT8/INT4) |
| Build | TinyTorch Module 08: Training |
| Due | Module 07 notebook + Lab 09 Decision Log |
Learning Objectives: Implement post-training quantization (FP32 → INT8). Calculate the memory savings and accuracy tradeoff for a given model. Explain pruning, distillation, and quantization as manipulations of Iron Law terms.
Lab 09 is where students viscerally experience the accuracy-efficiency tradeoff. Have them find the exact quantization level where accuracy drops below their threshold — the “cliff” is more memorable than any lecture.
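The core of Lab 09 fits in a few lines. A sketch of symmetric INT8 post-training quantization, deliberately simplified (single per-tensor scale, no per-channel scaling or activation calibration):

```python
import numpy as np

# Symmetric post-training quantization of a weight tensor to INT8, plus the
# round-trip error students measure in Lab 09.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0   # map [-max, max] onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()

print(f"memory: {w.nbytes} -> {q.nbytes} bytes (4x), "
      f"max round-trip error {err:.4f}")
```

The 4x memory saving is exact; the accuracy cost is workload-dependent, which is what the lab's "cliff" exercise makes students measure rather than assume.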
Week 11: Hardware Acceleration
| Component | Assignment |
|---|---|
| Read | Hardware Acceleration |
| Lab | Lab 10: The Roofline Model |
| Build | TinyTorch Module 08: Training (continued) |
| Due | Module 08 notebook + Lab 10 Decision Log |
Learning Objectives: Plot a workload on the Roofline model and determine if it is compute-bound or memory-bound. Explain how Tensor Cores, systolic arrays, and spatial architectures accelerate matrix operations. Calculate operational intensity for a given kernel.
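The Roofline bound itself is one line: attainable FLOP/s is the minimum of peak compute and bandwidth times operational intensity. A sketch with illustrative hardware numbers:

```python
# Roofline bound: attainable FLOP/s = min(peak, bandwidth * intensity).
# Hardware numbers are illustrative.

def roofline(intensity_flops_per_byte, peak_flops_s, bw_bytes_s):
    return min(peak_flops_s, bw_bytes_s * intensity_flops_per_byte)

PEAK, BW = 100e12, 1e12   # 100 TFLOP/s, 1 TB/s (illustrative)
workloads = [("vector add", 1 / 12), ("small matmul", 30), ("large matmul", 600)]
for name, oi in workloads:
    print(f"{name:12s} OI={oi:7.2f} -> {roofline(oi, PEAK, BW) / 1e12:6.2f} TFLOP/s")
```

The ridge point sits at OI = PEAK/BW = 100 FLOPs/byte here: below it the memory roof binds, above it the compute roof does.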
Week 12: Benchmarking
| Component | Assignment |
|---|---|
| Read | Benchmarking |
| Lab | Lab 11: Benchmarking Methodology |
| Build | No new module — catch-up week |
| Due | Lab 11 Decision Log |
Learning Objectives: Design a fair benchmark for an ML system. Distinguish throughput, latency, and tail latency (P50/P99). Explain why “faster” is meaningless without specifying the metric, workload, and baseline.
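A small demonstration makes the P50-vs-P99 distinction stick: a service with a fast median can still blow its tail SLA. The latency distribution below is synthetic, and the percentile uses a simple nearest-rank rule:

```python
import random

# Mean latency hides the tail. Latencies below are synthetic: 99% fast
# requests near 10 ms, 1% stragglers near 200 ms.

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]   # nearest-rank

random.seed(0)
lat = ([random.gauss(10, 1) for _ in range(990)]
       + [random.gauss(200, 20) for _ in range(10)])

print(f"mean={sum(lat) / len(lat):.1f} ms  "
      f"P50={percentile(lat, 50):.1f} ms  P99={percentile(lat, 99):.1f} ms")
```

Students should articulate why the mean and P50 look healthy while P99 is an order of magnitude worse, and which one a user-facing SLA should track.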
Part IV: Deployment & Production (Weeks 13–16)
Goal: Deploy systems that don’t fail silently.
Week 13: Model Serving
| Component | Assignment |
|---|---|
| Read | Model Serving |
| Lab | Lab 12: Tail Latency (P99) |
| Build | Capstone prep |
| Due | Lab 12 Decision Log |
Learning Objectives: Explain batching strategies for inference (static, dynamic, continuous). Calculate the throughput-latency tradeoff for a given SLA. Design a serving system that meets a P99 latency target.
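The throughput-latency tradeoff of static batching follows from a two-term cost model: a fixed per-launch cost that batching amortizes, and a per-sample cost that every request pays. The numbers below are illustrative:

```python
# Static batching: bigger batches amortize the fixed per-launch cost
# (higher throughput), but each request waits for the whole batch
# (higher latency). Cost model and numbers are illustrative.

def batch_latency_s(batch, fixed_s=5e-3, per_sample_s=0.5e-3):
    return fixed_s + batch * per_sample_s   # time to run one batch

def throughput_rps(batch):
    return batch / batch_latency_s(batch)

for b in (1, 8, 64):
    print(f"batch={b:3d}: latency={batch_latency_s(b) * 1e3:5.1f} ms, "
          f"throughput={throughput_rps(b):6.0f} req/s")
```

Given a P99 SLA, the design exercise is to find the largest batch whose latency still fits under the target; dynamic and continuous batching refine when the batch is closed, not this underlying tradeoff.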
Week 14: ML Operations
| Component | Assignment |
|---|---|
| Read | ML Operations |
| Lab | Lab 13: Drift Detection |
| Build | Capstone prep |
| Due | Lab 13 Decision Log |
Learning Objectives: Define model drift (data drift, concept drift) and explain why it matters for production systems. Design a monitoring pipeline that detects drift before accuracy degrades. Explain CI/CD for ML models.
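The simplest drift signal students can build in Lab 13 is a z-test on a feature's mean over a live window against the training distribution. This is a sketch only; production monitors use richer per-feature tests (KS statistics, PSI) and the threshold here is an assumption:

```python
import math
import random

# Flag input drift when a window's mean is implausibly far from the
# training mean. Threshold of 4 standard errors is an assumed setting.

def mean_drifted(train_mean, train_std, window, z_threshold=4.0):
    n = len(window)
    window_mean = sum(window) / n
    z = abs(window_mean - train_mean) / (train_std / math.sqrt(n))
    return z > z_threshold

random.seed(0)
steady = [random.gauss(0.0, 1.0) for _ in range(500)]    # matches training
shifted = [random.gauss(0.5, 1.0) for _ in range(500)]   # input drift

print(mean_drifted(0.0, 1.0, steady), mean_drifted(0.0, 1.0, shifted))
```

The systems point: this catches *input* drift before labels arrive, which is why it can fire before accuracy visibly degrades.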
Week 15: Responsible Engineering
| Component | Assignment |
|---|---|
| Read | Responsible Engineering |
| Lab | Lab 14: Fairness and Efficiency |
| Build | Capstone work |
| Due | Lab 14 Decision Log + Capstone draft |
Learning Objectives: Quantify the energy cost of training and inference. Explain how system design choices (precision, batch size, hardware) affect fairness and accessibility. Articulate the engineer’s responsibility beyond accuracy metrics.
Week 16: Capstone — The AI Olympics
| Component | Assignment |
|---|---|
| Read | Conclusion |
| Lab | Lab 15: Capstone Integration |
| Capstone | AI Olympics Competition |
| Due | Final submission + 1,000-word design report |
Capstone Specification: Deploy the “Smart Doorbell” application across multiple tracks (Cloud, Edge, Mobile, Tiny). Maximize accuracy while staying under fixed latency (\(<50\) ms) and memory (\(<256\) KB) budgets. The final deliverable includes a design report whose decisions are traceably mapped to the Iron Law terms.
See Assessment & Grading for the complete AI Olympics rubric.
TinyTorch Module Summary
| Week | Module | Topic | Hours | Milestone Unlocked |
|---|---|---|---|---|
| 1–2 | 01 | Tensor | 4–6 | — |
| 3–4 | 02 | Activations | 5–7 | — |
| 5 | 03 | Layers | 5–7 | Perceptron (1958) |
| 6 | 04 | Losses | 4–6 | XOR Crisis (1969) |
| 7 | 05 | DataLoader | 5–7 | — |
| 8 | 06 | Autograd | 6–8 | — |
| 9 | 07 | Optimizers | 5–7 | — |
| 10–11 | 08 | Training | 6–8 | MLP Revival (1986) |
| 12–16 | — | Capstone focus | — | — |
If teaching both semesters, TinyTorch Modules 09–20 (Convolutions through Capstone) continue in Semester 2 or can be offered as an advanced track alongside Volume II.
Suggested Case Studies
These industry papers pair well with specific weeks. Assign as optional reading or use as discussion starters:
| Week | Topic | Suggested Paper |
|---|---|---|
| 4 | Data Engineering | Sambasivan et al., “Everyone Wants to Do the Model Work, Not the Data Work” (CHI 2021) |
| 7 | ML Frameworks | Chen et al., “TVM: An Automated End-to-End Optimizing Compiler” (OSDI 2018) |
| 10 | Model Compression | Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” (NeurIPS 2022) |
| 11 | HW Acceleration | Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA 2017) |
| 13 | Model Serving | Yu et al., “Orca: A Distributed Serving System for Transformer-Based Models” (OSDI 2022) |
| 14 | ML Operations | Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015) |