AI Systems Foundations

Semester 1: Single-Machine Engineering — Week-by-Week

Course Overview

Textbook: Volume I, Introduction to Machine Learning Systems
Duration: 16 weeks (32 lectures at 75 minutes each)
Prerequisites: Programming (Python), linear algebra, intro probability
Scope: Single-machine systems, 1 to 8 accelerators
Key Framework: The Iron Law: \(T \approx D_{vol}/BW + O/(R_{peak} \cdot \eta) + L_{lat}\)

Course Goal: Transition students from “using models” to “engineering systems.” By the end, students will have built a complete deep learning framework from scratch and optimized it for real-world deployment constraints.

Note: How Labs and Readings Relate

Labs trail readings by one week — students complete the lab that reinforces the previous week’s material, giving them time to absorb the theory before exploring it hands-on.

Tip: Lecture Slides

Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.


Part I: The Physics of AI (Weeks 1–4)

Goal: Understand that data movement and compute have a physical cost.

Week 1: Why ML Systems?

| Component | Assignment |
|-----------|------------|
| Read | Introduction |
| Lab | Lab 00: The Architect’s Portal (orientation) |
| Build | TinyTorch Module 01: Tensor |
| Due | Lab 00 Decision Log |

Learning Objectives: Define what an ML system is beyond the model. Identify the three pillars of the Iron Law. Explain why a 10x GPU upgrade does not yield a 10x speedup.
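The third objective can be made concrete with Iron Law arithmetic: a faster GPU shrinks only the compute term, while data movement and latency are unchanged. A sketch using made-up numbers (the 1 TB transfer, bandwidth, efficiency, and latency below are illustrative, not real hardware specs):

```python
# Why a 10x faster GPU rarely gives a 10x speedup: the upgrade only shrinks
# the compute term of the Iron Law; data movement and launch latency stay put.
# All numbers are illustrative, not real hardware specs.

def iron_law_time(data_bytes, bw, ops, peak_flops, eta, latency_s):
    """T ~= D_vol/BW + O/(R_peak * eta) + L_lat."""
    return data_bytes / bw + ops / (peak_flops * eta) + latency_s

# Hypothetical workload: 1 TB moved, 10 TFLOP of math, 1 ms of launch latency.
old = iron_law_time(1e12, 1e12, 10e12, 10e12, 0.5, 1e-3)    # 10 TFLOP/s GPU
new = iron_law_time(1e12, 1e12, 10e12, 100e12, 0.5, 1e-3)   # "10x faster" GPU
print(f"speedup: {old / new:.2f}x")  # well short of 10x
```

Here the memory term (1 s) dominates both cases, so the tenfold compute upgrade buys only about a 2.5x end-to-end speedup.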

Tip: Instructor Tip

Have students predict: “A GPU is how many times faster than a CPU for a 1024x1024 matrix multiply?” Record predictions on the board. Revisit after Lab 01.

Week 2: The ML Systems Landscape

| Component | Assignment |
|-----------|------------|
| Read | ML Systems |
| Lab | Lab 01: The Magnitude Gap |
| Build | TinyTorch Module 01: Tensor (continued) |
| Due | Module 01 notebook + Lab 01 Decision Log |

Learning Objectives: Map the full ML systems stack (application → framework → runtime → hardware). Quantify the memory wall using real hardware specs. Distinguish compute-bound from memory-bound workloads.

Week 3: The ML Workflow

| Component | Assignment |
|-----------|------------|
| Read | ML Workflow |
| Lab | Lab 02: The Workflow Pipeline |
| Build | TinyTorch Module 02: Activations |
| Due | Lab 02 Decision Log |

Learning Objectives: Trace the end-to-end ML pipeline from data to deployment. Identify bottlenecks at each pipeline stage. Explain why training and inference have different system requirements.

Week 4: Data Engineering

| Component | Assignment |
|-----------|------------|
| Read | Data Engineering |
| Lab | Lab 03: The Data Pipeline |
| Build | TinyTorch Module 02: Activations (continued) |
| Due | Module 02 notebook + Lab 03 Decision Log |

Learning Objectives: Calculate data pipeline throughput and identify I/O bottlenecks. Explain how data format, storage, and preprocessing affect training speed. Design a data pipeline that keeps the accelerator fed.
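A back-of-envelope throughput check makes the "keep the accelerator fed" objective concrete: the slowest pipeline stage caps GPU utilization. All rates below are hypothetical:

```python
# Can the input pipeline keep the accelerator fed? The slowest stage wins.
# All numbers are made up for illustration.

disk_bw = 500e6           # bytes/s sequential read (SSD-class device)
bytes_per_sample = 150e3  # ~150 KB per JPEG image
decode_rate = 2000        # images/s the CPU workers can decode + augment
gpu_rate = 3000           # images/s the accelerator could train on

read_rate = disk_bw / bytes_per_sample       # images/s off storage
pipeline_rate = min(read_rate, decode_rate)  # bottleneck stage

print(f"storage supplies {read_rate:.0f} img/s")
print(f"pipeline delivers {pipeline_rate:.0f} img/s; GPU wants {gpu_rate} img/s")
print(f"GPU utilization bound: {min(1.0, pipeline_rate / gpu_rate):.0%}")
```

In this sketch decoding, not storage, is the bottleneck, and the GPU idles a third of the time; adding CPU workers (or moving augmentation to the GPU) would change the answer.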


Part II: Building the Stack (Weeks 5–8)

Goal: Demystify the framework layer by implementing it from scratch.

Week 5: Neural Network Computation

| Component | Assignment |
|-----------|------------|
| Read | Neural Computation |
| Lab | Lab 04: The Computation Graph |
| Build | TinyTorch Module 03: Layers |
| Due | Lab 04 Decision Log |

Learning Objectives: Implement forward and backward passes for dense layers. Trace memory allocation during a forward pass. Calculate FLOPs for a given network architecture.
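The FLOP-counting objective reduces to a one-line formula for dense layers: a matmul of shape (B, in) x (in, out) performs B·in·out multiply-adds, i.e. 2·B·in·out FLOPs. A sketch, with a hypothetical 784→128→10 MLP as the example architecture:

```python
# FLOP count for a dense layer y = xW + b: the (B, in) x (in, out) matmul
# does 2*B*in*out FLOPs (multiply + add), plus B*out adds for the bias.

def dense_flops(batch, fan_in, fan_out):
    return 2 * batch * fan_in * fan_out + batch * fan_out

# Hypothetical architecture: a 784 -> 128 -> 10 MLP at batch size 32.
layers = [(784, 128), (128, 10)]
total = sum(dense_flops(32, i, o) for i, o in layers)
print(f"{total:,} FLOPs per forward pass")
```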

Week 6: Neural Network Architectures

| Component | Assignment |
|-----------|------------|
| Read | NN Architectures |
| Lab | Lab 05: Architecture Tradeoffs |
| Build | TinyTorch Module 04: Losses |
| Due | Module 03 notebook + Lab 05 Decision Log |

Learning Objectives: Compare CNNs, RNNs, and Transformers from a systems perspective (memory, compute, parallelism). Explain why Transformers parallelize better than RNNs. Calculate the memory footprint of attention for a given sequence length.
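The attention-memory objective can be computed directly: each head materializes an L x L score matrix, so memory grows quadratically in sequence length. A sketch assuming a hypothetical 32-head fp16 model:

```python
# Memory footprint of attention scores: each head holds an (L x L) matrix,
# so memory is quadratic in sequence length L. Config below is hypothetical.

def attention_score_bytes(seq_len, n_heads, bytes_per_elem=2):  # fp16
    return n_heads * seq_len * seq_len * bytes_per_elem

for L in (1024, 8192):
    gib = attention_score_bytes(L, 32) / 2**30
    print(f"L={L}: {gib:.2f} GiB per layer just for scores")
```

An 8x longer sequence costs 64x the score memory, which is why long-context serving leans on tricks (e.g., computing attention blockwise) rather than materializing the full matrix.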

Week 7: ML Frameworks

| Component | Assignment |
|-----------|------------|
| Read | ML Frameworks |
| Lab | Lab 06: The Dispatch Tax |
| Build | TinyTorch Module 05: DataLoader |
| Due | Module 04 notebook + Lab 06 Decision Log |

Learning Objectives: Explain eager vs. graph execution and their tradeoffs. Identify GPU starvation from a profiling trace. Describe how operator fusion reduces memory traffic.
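The memory-traffic saving from fusion follows from counting DRAM round trips. A sketch under a simplified cost model (every unfused elementwise op reads its input from DRAM and writes its output back; a fused kernel keeps intermediates in registers):

```python
# Why fusion cuts memory traffic: unfused, each elementwise op round-trips
# through DRAM; fused, intermediates never leave registers. Simplified model.

def traffic_bytes(n_elems, n_ops, fused, bytes_per_elem=4):  # fp32
    if fused:
        return 2 * n_elems * bytes_per_elem        # one read + one write total
    return 2 * n_elems * n_ops * bytes_per_elem    # read + write per op

n = 10_000_000  # elements in the tensor
unfused = traffic_bytes(n, 3, fused=False)  # e.g., mul -> add -> relu
fused = traffic_bytes(n, 3, fused=True)
print(f"unfused: {unfused/1e6:.0f} MB, fused: {fused/1e6:.0f} MB "
      f"({unfused // fused}x less traffic)")
```

For memory-bound elementwise chains, that traffic ratio translates almost directly into runtime.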

Tip: Instructor Tip

This is the “aha” week. Students have been building TinyTorch piece by piece — now they see how frameworks like PyTorch solve the same problems at scale. Ask: “What would you do differently in your TinyTorch implementation now?”

Week 8: Training

| Component | Assignment |
|-----------|------------|
| Read | Training |
| Lab | Lab 07: The Training Loop |
| Build | TinyTorch Module 06: Autograd |
| Due | Module 05 notebook + Lab 07 Decision Log |

Learning Objectives: Implement automatic differentiation (reverse mode). Explain how batch size affects memory, throughput, and convergence. Profile a training loop and identify the dominant cost.
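Reverse-mode autodiff, the heart of Module 06, fits on a page for scalars: record the graph during the forward pass, then apply the chain rule in reverse topological order. A minimal sketch in the micrograd style; this is an independent illustration, not the course's reference implementation:

```python
# Minimal reverse-mode autodiff for scalars: build a graph on the forward
# pass, then propagate gradients in reverse topological order.

class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        order, seen = [], set()
        def visit(v):                 # topological sort of the graph
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0               # d(out)/d(out) = 1
        for v in reversed(order):     # chain rule, outputs first
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y + x          # z = xy + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```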

Note: Milestone Check

By Week 8, students should have a working TinyTorch that can: create tensors, apply activations, build layers, compute losses, load data, and auto-differentiate. This is the foundation for everything that follows.


Part III: The Optimization Frontier (Weeks 9–12)

Goal: Make it fast, make it small, measure everything.

Week 9: Data Selection and Curation

| Component | Assignment |
|-----------|------------|
| Read | Data Selection |
| Lab | Lab 08: Data Quality |
| Build | TinyTorch Module 07: Optimizers |
| Due | Module 06 notebook + Lab 08 Decision Log |

Learning Objectives: Quantify the impact of data quality on model performance. Explain curriculum learning from a systems perspective. Implement SGD and Adam optimizers from scratch.
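The optimizer objective can be sketched on plain Python floats (TinyTorch's versions operate on tensors; the hyperparameter defaults below follow the standard Adam settings):

```python
# SGD and Adam update rules on scalars. Adam keeps exponential moving
# averages of the gradient (m) and squared gradient (v), with bias correction.

def sgd_step(param, grad, lr=0.1):
    return param - lr * grad

def adam_step(param, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad        # 1st moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2   # 2nd moment
    m_hat = state["m"] / (1 - b1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (v_hat ** 0.5 + eps)

# Minimize f(w) = w^2 (gradient 2w) with each optimizer.
w_sgd, w_adam, state = 1.0, 1.0, {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(100):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_adam = adam_step(w_adam, 2 * w_adam, state)
print(w_sgd, w_adam)
```

Note the systems-relevant difference: Adam carries two extra state scalars per parameter, which for large models doubles optimizer memory on top of the weights.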

Week 10: Model Compression

| Component | Assignment |
|-----------|------------|
| Read | Model Compression |
| Lab | Lab 09: Quantization (INT8/INT4) |
| Build | TinyTorch Module 08: Training |
| Due | Module 07 notebook + Lab 09 Decision Log |

Learning Objectives: Implement post-training quantization (FP32 → INT8). Calculate the memory savings and accuracy tradeoff for a given model. Explain pruning, distillation, and quantization as manipulations of Iron Law terms.
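Symmetric per-tensor INT8 quantization, the simplest form of what Lab 09 explores, fits in a few lines (a stand-alone illustration, not the lab's reference code):

```python
# Symmetric per-tensor post-training quantization: FP32 -> INT8.
# One scale maps [-max|w|, max|w|] onto [-127, 127]; 4x memory saving.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"memory: {w.nbytes} -> {q.nbytes} bytes (4x); max abs error {err:.4f}")
```

The worst-case rounding error is half a quantization step (scale/2), which is exactly the quantity students trade against memory when they push to INT4.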

Tip: Instructor Tip

Lab 09 is where students viscerally experience the accuracy-efficiency tradeoff. Have them find the exact quantization level where accuracy drops below their threshold — the “cliff” is more memorable than any lecture.

Week 11: Hardware Acceleration

| Component | Assignment |
|-----------|------------|
| Read | Hardware Acceleration |
| Lab | Lab 10: The Roofline Model |
| Build | TinyTorch Module 08: Training (continued) |
| Due | Module 08 notebook + Lab 10 Decision Log |

Learning Objectives: Plot a workload on the Roofline model and determine if it is compute-bound or memory-bound. Explain how Tensor Cores, systolic arrays, and spatial architectures accelerate matrix operations. Calculate operational intensity for a given kernel.
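Operational intensity and the roofline classification are simple arithmetic. A sketch using placeholder hardware numbers (100 TFLOP/s and 1 TB/s are illustrative, not a real device):

```python
# Roofline classification: a kernel is memory-bound if its operational
# intensity (FLOPs per byte moved) falls below the machine's ridge point.
# Hardware numbers are illustrative placeholders.

PEAK_FLOPS = 100e12            # 100 TFLOP/s
PEAK_BW = 1e12                 # 1 TB/s
ridge = PEAK_FLOPS / PEAK_BW   # FLOPs/byte

def classify(flops, bytes_moved):
    oi = flops / bytes_moved
    attainable = min(PEAK_FLOPS, oi * PEAK_BW)   # the roofline itself
    kind = "compute-bound" if oi >= ridge else "memory-bound"
    return oi, attainable, kind

# Elementwise add of two fp32 N-vectors: 1 FLOP per 12 bytes (2 reads, 1 write).
print(classify(1e6, 12e6))
# Square fp32 matmul of size n: 2n^3 FLOPs over roughly 3n^2 * 4 bytes.
n = 4096
print(classify(2 * n**3, 3 * n**2 * 4))
```

The contrast is the point of Lab 10: elementwise ops sit far left of the ridge (memory-bound), while large matmuls sit far right (compute-bound), which is why accelerators are built around them.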

Week 12: Benchmarking

| Component | Assignment |
|-----------|------------|
| Read | Benchmarking |
| Lab | Lab 11: Benchmarking Methodology |
| Build | No new module — catch-up week |
| Due | Lab 11 Decision Log |

Learning Objectives: Design a fair benchmark for an ML system. Distinguish throughput, latency, and tail latency (P50/P99). Explain why “faster” is meaningless without specifying the metric, workload, and baseline.
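Why the mean misleads and P99 matters can be shown with simulated latencies (the 10 ms / 100 ms mixture below is synthetic):

```python
# Tail latency: the mean hides stragglers. P99 is the latency that 99% of
# requests beat; a 2% stall rate barely moves the mean but dominates P99.
import random
import statistics

random.seed(0)
# Synthetic latencies: mostly ~10 ms, with occasional 100 ms stalls.
latencies = [random.gauss(10, 1) if random.random() < 0.98 else 100.0
             for _ in range(10_000)]

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

print(f"mean={statistics.mean(latencies):.1f} ms  "
      f"P50={percentile(latencies, 50):.1f} ms  "
      f"P99={percentile(latencies, 99):.1f} ms")
```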


Part IV: Deployment & Production (Weeks 13–16)

Goal: Deploy systems that don’t fail silently.

Week 13: Model Serving

| Component | Assignment |
|-----------|------------|
| Read | Model Serving |
| Lab | Lab 12: Tail Latency (P99) |
| Build | Capstone prep |
| Due | Lab 12 Decision Log |

Learning Objectives: Explain batching strategies for inference (static, dynamic, continuous). Calculate the throughput-latency tradeoff for a given SLA. Design a serving system that meets a P99 latency target.
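The batching tradeoff follows from a simple cost model: a fixed per-launch overhead plus per-item work. Both numbers below are illustrative:

```python
# Throughput-latency tradeoff in serving: larger batches amortize the fixed
# per-launch overhead (more req/s) but make every request in the batch wait
# longer. Cost model and numbers are illustrative.

def batch_latency_ms(batch, overhead_ms=5.0, per_item_ms=1.0):
    return overhead_ms + per_item_ms * batch

for batch in (1, 8, 32, 128):
    lat = batch_latency_ms(batch)
    tput = batch / (lat / 1000)  # requests/s
    print(f"batch={batch:>3}: latency={lat:>6.1f} ms, throughput={tput:>6.0f} req/s")
```

Under a 50 ms SLA this model caps the batch size well below 128, even though the larger batch has the best throughput — the kind of constraint Lab 12 asks students to design around.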

Week 14: ML Operations

| Component | Assignment |
|-----------|------------|
| Read | ML Operations |
| Lab | Lab 13: Drift Detection |
| Build | Capstone prep |
| Due | Lab 13 Decision Log |

Learning Objectives: Define model drift (data drift, concept drift) and explain why it matters for production systems. Design a monitoring pipeline that detects drift before accuracy degrades. Explain CI/CD for ML models.
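A data-drift detector can be as simple as a statistical test on incoming feature statistics. A sketch using a z-test on the mean (illustrative only; production monitors typically use distribution-level tests such as Kolmogorov–Smirnov or the population stability index):

```python
# Toy data-drift check: compare the mean of live feature values against a
# training-time reference window. A large z-score flags a shift before
# labels (and hence accuracy) are available.
import math
import random

random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # training window
shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]    # drifted production data

def mean_drift_z(ref, live):
    def stats(x):
        m = sum(x) / len(x)
        var = sum((v - m) ** 2 for v in x) / (len(x) - 1)
        return m, var
    m1, v1 = stats(ref)
    m2, v2 = stats(live)
    return abs(m1 - m2) / math.sqrt(v1 / len(ref) + v2 / len(live))

print(f"z = {mean_drift_z(reference, shifted):.1f}")  # z >> 3 flags drift
```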

Week 15: Responsible Engineering

| Component | Assignment |
|-----------|------------|
| Read | Responsible Engineering |
| Lab | Lab 14: Fairness and Efficiency |
| Build | Capstone work |
| Due | Lab 14 Decision Log + Capstone draft |

Learning Objectives: Quantify the energy cost of training and inference. Explain how system design choices (precision, batch size, hardware) affect fairness and accessibility. Articulate the engineer’s responsibility beyond accuracy metrics.
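The energy objective is back-of-envelope arithmetic. A sketch with hypothetical inputs (the GPU count, power draw, overhead multiplier, and electricity price are all assumptions, not measurements):

```python
# Back-of-envelope energy cost of a training run. All inputs hypothetical.

gpus = 8
gpu_power_kw = 0.4    # 400 W per accelerator under load
overhead = 1.5        # PUE-style multiplier for cooling, CPUs, networking
hours = 72
price_per_kwh = 0.15  # USD

kwh = gpus * gpu_power_kw * overhead * hours
print(f"{kwh:.0f} kWh, about ${kwh * price_per_kwh:.0f} at ${price_per_kwh}/kWh")
```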

Week 16: Capstone — The AI Olympics

| Component | Assignment |
|-----------|------------|
| Read | Conclusion |
| Lab | Lab 15: Capstone Integration |
| Capstone | AI Olympics Competition |
| Due | Final submission + 1,000-word design report |

Capstone Specification: Deploy the “Smart Doorbell” application across multiple tracks (Cloud, Edge, Mobile, Tiny). Maximize accuracy while staying within fixed latency (\(<50\) ms) and memory (\(<256\) KB) budgets. The final deliverable includes a design report that traceably maps each design decision to the Iron Law.

See Assessment & Grading for the complete AI Olympics rubric.


TinyTorch Module Summary

| Week | Module | Topic | Hours | Milestone Unlocked |
|------|--------|-------|-------|--------------------|
| 1–2 | 01 | Tensor | 4–6 | |
| 3–4 | 02 | Activations | 5–7 | |
| 5 | 03 | Layers | 5–7 | Perceptron (1958) |
| 6 | 04 | Losses | 4–6 | XOR Crisis (1969) |
| 7 | 05 | DataLoader | 5–7 | |
| 8 | 06 | Autograd | 6–8 | |
| 9 | 07 | Optimizers | 5–7 | |
| 10–11 | 08 | Training | 6–8 | MLP Revival (1986) |
| 12–16 | — | Capstone focus | — | |
Note: For the Full Sequence

If teaching both semesters, TinyTorch Modules 09–20 (Convolutions through Capstone) continue in Semester 2 or can be offered as an advanced track alongside Volume II.


Suggested Case Studies

These industry papers pair well with specific weeks. Assign as optional reading or use as discussion starters:

| Week | Topic | Suggested Paper |
|------|-------|-----------------|
| 4 | Data Engineering | Sambasivan et al., “Everyone Wants to Do the Model Work, Not the Data Work” (CHI 2021) |
| 7 | ML Frameworks | Chen et al., “TVM: An Automated End-to-End Optimizing Compiler” (OSDI 2018) |
| 10 | Model Compression | Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” (NeurIPS 2022) |
| 11 | HW Acceleration | Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA 2017) |
| 13 | Model Serving | Yu et al., “Orca: A Distributed Serving System for Transformer-Based Models” (OSDI 2022) |
| 14 | ML Operations | Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015) |