AI Systems Foundations
Semester 1: Single-Machine Engineering — Week-by-Week
Course Overview
| Textbook | Volume I: Introduction to Machine Learning Systems |
| Duration | 16 weeks (32 lectures at 75 min each) |
| Prerequisites | Programming (Python), linear algebra, intro probability |
| Scope | Single-machine systems: 1 to 8 accelerators |
| Key Framework | The Iron Law: \(T \approx D_{vol}/BW + O/(R_{peak} \cdot \eta) + L_{lat}\) |
Course Goal: Transition students from “using models” to “engineering systems.” By the end, students will have built a complete deep learning framework from scratch and optimized it for real-world deployment constraints.
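The Iron Law above lends itself to back-of-envelope estimates students can do on day one. A minimal sketch, with purely illustrative hardware numbers (not from any specific device):

```python
# Back-of-envelope Iron Law estimate: T ≈ D_vol/BW + O/(R_peak * eta) + L_lat.
# All hardware numbers below are illustrative, not from any real device.

def iron_law_time(d_vol_bytes, bw_bytes_s, ops, r_peak_ops_s, eta, l_lat_s):
    """Estimated wall-clock time for one kernel launch."""
    transfer = d_vol_bytes / bw_bytes_s      # data-movement term
    compute = ops / (r_peak_ops_s * eta)     # compute term at efficiency eta
    return transfer + compute + l_lat_s      # plus fixed launch latency

# Example: 1024x1024 FP32 matmul (three matrices moved, 2*N^3 FLOPs).
n = 1024
d_vol = 3 * n * n * 4                        # bytes moved (A, B, C in FP32)
ops = 2 * n ** 3                             # multiply-add FLOPs
t = iron_law_time(d_vol, bw_bytes_s=900e9, ops=ops,
                  r_peak_ops_s=100e12, eta=0.5, l_lat_s=10e-6)
print(f"{t * 1e6:.1f} us")
```

Having students vary one term at a time (bandwidth, peak rate, latency) previews why no single hardware upgrade dominates total time.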
Labs trail readings by one week — students complete the lab that reinforces the previous week’s material, giving them time to absorb the theory before exploring it hands-on.
Each chapter has a companion Beamer slide deck with speaker notes, timing guidance, and active learning exercises. Available as PDF, PowerPoint, and LaTeX source at mlsysbook.ai/slides.
Part I: The Physics of AI (Weeks 1–4)
Goal: Understand that data movement and compute have a physical cost.
Week 1: Why ML Systems?
| Component | Assignment |
|---|---|
| Read | Introduction |
| Lab | Lab 00: The Architect’s Portal (orientation) |
| Build | TinyTorch Module 01: Tensor |
| Due | Lab 00 Decision Log |
Learning Objectives: Define what an ML system is beyond the model. Identify the three pillars of the Iron Law. Explain why a 10x GPU upgrade does not yield a 10x speedup.
Have students predict: “A GPU is how many times faster than a CPU for a 1024x1024 matrix multiply?” Record predictions on the board. Revisit after Lab 01.
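The "10x GPU, not 10x speedup" objective can be made concrete with an Amdahl-style calculation against the Iron Law: a faster accelerator shrinks only the compute term. The timings below are illustrative, chosen only to make the arithmetic vivid:

```python
# Why a 10x-faster GPU does not give a 10x speedup: only the compute term
# of the Iron Law shrinks; data movement and launch latency are untouched.
# Timings are illustrative.

def total_time(transfer_s, compute_s, latency_s):
    return transfer_s + compute_s + latency_s

before = total_time(transfer_s=4e-3, compute_s=5e-3, latency_s=1e-3)   # 10 ms
after = total_time(transfer_s=4e-3, compute_s=5e-4, latency_s=1e-3)    # 5.5 ms
print(f"speedup: {before / after:.2f}x")   # ~1.8x, not 10x
```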
Week 2: The ML Systems Landscape
| Component | Assignment |
|---|---|
| Read | ML Systems |
| Lab | Lab 01: The Magnitude Gap |
| Build | TinyTorch Module 01: Tensor (continued) |
| Due | Module 01 notebook + Lab 01 Decision Log |
Learning Objectives: Map the full ML systems stack (application → framework → runtime → hardware). Quantify the memory wall using real hardware specs. Distinguish compute-bound from memory-bound workloads.
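The compute-bound vs. memory-bound distinction reduces to comparing a workload's operational intensity against the machine balance. A sketch with illustrative specs (assumed numbers, not a real chip):

```python
# Classify a kernel as compute- or memory-bound by comparing its operational
# intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s
# divided by memory bandwidth). Hardware specs below are illustrative.

def classify(flops, bytes_moved, peak_flops_s, bw_bytes_s):
    intensity = flops / bytes_moved       # FLOPs/byte: a workload property
    balance = peak_flops_s / bw_bytes_s   # FLOPs/byte: a hardware property
    return "compute-bound" if intensity > balance else "memory-bound"

# Vector add: 1 FLOP per 12 bytes (two FP32 reads, one write).
print(classify(flops=1, bytes_moved=12,
               peak_flops_s=100e12, bw_bytes_s=1e12))   # memory-bound

# Large matmul: 2*N^3 FLOPs over ~3*N^2*4 bytes; intensity grows with N.
n = 4096
print(classify(flops=2 * n**3, bytes_moved=3 * n * n * 4,
               peak_flops_s=100e12, bw_bytes_s=1e12))   # compute-bound
```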
Week 3: The ML Workflow
| Component | Assignment |
|---|---|
| Read | ML Workflow |
| Lab | Lab 02: The Workflow Pipeline |
| Build | TinyTorch Module 02: Activations |
| Due | Lab 02 Decision Log |
Learning Objectives: Trace the end-to-end ML pipeline from data to deployment. Identify bottlenecks at each pipeline stage. Explain why training and inference have different system requirements.
Week 4: Data Engineering
| Component | Assignment |
|---|---|
| Read | Data Engineering |
| Lab | Lab 03: The Data Pipeline |
| Build | TinyTorch Module 02: Activations (continued) |
| Due | Module 02 notebook + Lab 03 Decision Log |
Learning Objectives: Calculate data pipeline throughput and identify I/O bottlenecks. Explain how data format, storage, and preprocessing affect training speed. Design a data pipeline that keeps the accelerator fed.
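The "keep the accelerator fed" objective comes down to one inequality: pipeline throughput is the rate of the slowest stage, and the GPU starves when that falls below its consumption rate. A sketch with made-up stage rates:

```python
# A pipeline's steady-state throughput is set by its slowest stage; the
# accelerator starves whenever that rate falls below its consumption rate.
# Stage rates below are illustrative (samples per second).

def pipeline_throughput(stage_rates):
    """Steady-state samples/s of a linear pipeline with overlapped stages."""
    return min(stage_rates)

stages = {"read": 12_000, "decode": 4_000, "augment": 9_000}
feed = pipeline_throughput(stages.values())
gpu_demand = 6_000   # samples/s the accelerator can consume

print(f"pipeline delivers {feed} samples/s; "
      f"GPU {'starved' if feed < gpu_demand else 'fed'} "
      f"({feed / gpu_demand:.0%} utilization)")
```

Here the decode stage is the bottleneck; students should see that speeding up any other stage changes nothing.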
Part II: Building the Stack (Weeks 5–8)
Goal: Demystify the framework layer by implementing it from scratch.
Week 5: Neural Network Computation
| Component | Assignment |
|---|---|
| Read | Neural Computation |
| Lab | Lab 04: The Computation Graph |
| Build | TinyTorch Module 03: Layers |
| Due | Lab 04 Decision Log |
Learning Objectives: Implement forward and backward passes for dense layers. Trace memory allocation during a forward pass. Calculate FLOPs for a given network architecture.
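All three objectives fit in a few lines of NumPy. This is a sketch for intuition, not the actual TinyTorch Module 03 API:

```python
import numpy as np

# Minimal dense layer: forward pass, backward pass, and a FLOP count.

def dense_forward(x, w, b):
    return x @ w + b                  # (batch, d_out)

def dense_backward(x, w, grad_out):
    grad_w = x.T @ grad_out           # accumulate over the batch
    grad_b = grad_out.sum(axis=0)
    grad_x = grad_out @ w.T           # gradient passed to earlier layers
    return grad_x, grad_w, grad_b

def dense_flops(batch, d_in, d_out):
    # One multiply-add per (batch, d_in, d_out) triple, counted as 2 FLOPs.
    return 2 * batch * d_in * d_out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w = rng.standard_normal((4, 3))
b = np.zeros(3)
y = dense_forward(x, w, b)
print(y.shape, dense_flops(8, 4, 3))   # (8, 3) 192
```

The memory-tracing objective falls out of the same code: the forward pass must keep `x` alive for `dense_backward`, which is exactly why activations dominate training memory.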
Week 6: Neural Network Architectures
| Component | Assignment |
|---|---|
| Read | NN Architectures |
| Lab | Lab 05: Architecture Tradeoffs |
| Build | TinyTorch Module 04: Losses |
| Due | Module 03 notebook + Lab 05 Decision Log |
Learning Objectives: Compare CNNs, RNNs, and Transformers from a systems perspective (memory, compute, parallelism). Explain why Transformers parallelize better than RNNs. Calculate the memory footprint of attention for a given sequence length.
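The attention-memory objective rewards a worked calculation: the score matrix is seq_len x seq_len per head per layer, so activation memory grows quadratically with sequence length. A sketch with an assumed (hypothetical) model shape:

```python
# Memory footprint of attention score matrices: one (seq_len x seq_len)
# matrix per head per layer. Model shape below is illustrative.

def attention_score_bytes(seq_len, n_heads, n_layers, bytes_per_elem=2):
    return seq_len * seq_len * n_heads * n_layers * bytes_per_elem

for seq in (1024, 4096, 16384):
    gib = attention_score_bytes(seq, n_heads=16, n_layers=24) / 2**30
    print(f"seq={seq:6d}: {gib:8.2f} GiB of FP16 attention scores")
```

Quadrupling the sequence length multiplies this term by 16, which is the systems motivation for FlashAttention-style recomputation covered in Semester 2.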
Week 7: ML Frameworks
| Component | Assignment |
|---|---|
| Read | ML Frameworks |
| Lab | Lab 06: The Dispatch Tax |
| Build | TinyTorch Module 05: DataLoader |
| Due | Module 04 notebook + Lab 06 Decision Log |
Learning Objectives: Explain eager vs. graph execution and their tradeoffs. Identify GPU starvation from a profiling trace. Describe how operator fusion reduces memory traffic.
This is the “aha” week. Students have been building TinyTorch piece by piece — now they see how frameworks like PyTorch solve the same problems at scale. Ask: “What would you do differently in your TinyTorch implementation now?”
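The operator-fusion objective is easiest to teach with a traffic count rather than a kernel. A sketch of the bookkeeping, assuming FP32 element-wise ops:

```python
# Operator fusion: y = relu(x * a + b) as three separate kernels writes and
# re-reads intermediates through memory; one fused kernel touches only x
# and y, keeping intermediates in registers. Counts assume FP32 elements.

def unfused_traffic(n, bytes_per_elem=4):
    # mul: read x, write t1; add: read t1, write t2; relu: read t2, write y.
    return 6 * n * bytes_per_elem

def fused_traffic(n, bytes_per_elem=4):
    # Single kernel: read x, write y.
    return 2 * n * bytes_per_elem

n = 1 << 20   # one million elements
print(f"unfused: {unfused_traffic(n) / 2**20:.0f} MiB, "
      f"fused: {fused_traffic(n) / 2**20:.0f} MiB "
      f"({unfused_traffic(n) // fused_traffic(n)}x less traffic)")
```

Since these ops are memory-bound, the 3x traffic reduction translates almost directly into a 3x speedup, which ties fusion back to the Iron Law's data-movement term.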
Week 8: Training
| Component | Assignment |
|---|---|
| Read | Training |
| Lab | Lab 07: The Training Loop |
| Build | TinyTorch Module 06: Autograd |
| Due | Module 05 notebook + Lab 07 Decision Log |
Learning Objectives: Implement automatic differentiation (reverse mode). Explain how batch size affects memory, throughput, and convergence. Profile a training loop and identify the dominant cost.
By Week 8, students should have a working TinyTorch that can: create tensors, apply activations, build layers, compute losses, load data, and auto-differentiate. This is the foundation for everything that follows.
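For instructors previewing Module 06, reverse-mode autodiff fits in a micrograd-style scalar class. This is a sketch of the idea, not TinyTorch's actual API:

```python
# Minimal reverse-mode autograd for scalars: each operation records, for
# every input, the local derivative d(out)/d(input).

class Value:
    def __init__(self, data, parents=()):
        self.data, self.grad, self.parents = data, 0.0, parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order nodes so every consumer is processed before its inputs.
        topo, seen = [], set()

        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for parent, _ in v.parents:
                    build(parent)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad

x, y = Value(3.0), Value(4.0)
z = x * y + x                  # z = x*y + x
z.backward()
print(x.grad, y.grad)          # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

The topological sort is the step students most often skip; without it, a node reused by multiple consumers (like `x` above) can receive an incomplete gradient.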
Part III: The Optimization Frontier (Weeks 9–12)
Goal: Make it fast, make it small, measure everything.
Week 9: Data Selection and Curation
| Component | Assignment |
|---|---|
| Read | Data Selection |
| Lab | Lab 08: Data Quality |
| Build | TinyTorch Module 07: Optimizers |
| Due | Module 06 notebook + Lab 08 Decision Log |
Learning Objectives: Quantify the impact of data quality on model performance. Explain curriculum learning from a systems perspective. Implement SGD and Adam optimizers from scratch.
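The optimizer objective can be framed as two pure update rules on a parameter vector; this is the shape of what Module 07 builds, sketched with the commonly used default hyperparameters:

```python
import numpy as np

# SGD and Adam as pure update rules. Hyperparameters are common defaults.

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = ||w||^2 (gradient 2w) for a few hundred steps.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(np.linalg.norm(w) < np.linalg.norm([1.0, -2.0]))   # True: moved toward 0
```

Note the systems angle: Adam carries two extra state tensors per parameter, tripling optimizer memory relative to SGD.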
Week 10: Model Compression
| Component | Assignment |
|---|---|
| Read | Model Compression |
| Lab | Lab 09: Quantization (INT8/INT4) |
| Build | TinyTorch Module 08: Training |
| Due | Module 07 notebook + Lab 09 Decision Log |
Learning Objectives: Implement post-training quantization (FP32 → INT8). Calculate the memory savings and accuracy tradeoff for a given model. Explain pruning, distillation, and quantization as manipulations of Iron Law terms.
Lab 09 is where students viscerally experience the accuracy-efficiency tradeoff. Have them find the exact quantization level where accuracy drops below their threshold — the “cliff” is more memorable than any lecture.
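The core of Lab 09 fits in a few lines. A sketch of symmetric INT8 post-training quantization, deliberately simplified (single per-tensor scale, no per-channel scaling or activation calibration):

```python
import numpy as np

# Symmetric post-training quantization of a weight tensor to INT8, plus the
# round-trip error students measure in Lab 09.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0   # map [-max, max] onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()

print(f"memory: {w.nbytes} -> {q.nbytes} bytes (4x), "
      f"max round-trip error {err:.4f}")
```

The 4x memory saving is exact; the accuracy cost is workload-dependent, which is what the lab's "cliff" exercise makes students measure rather than assume.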
Week 11: Hardware Acceleration
| Component | Assignment |
|---|---|
| Read | Hardware Acceleration |
| Lab | Lab 10: The Roofline Model |
| Build | TinyTorch Module 08: Training (continued) |
| Due | Module 08 notebook + Lab 10 Decision Log |
Learning Objectives: Plot a workload on the Roofline model and determine if it is compute-bound or memory-bound. Explain how Tensor Cores, systolic arrays, and spatial architectures accelerate matrix operations. Calculate operational intensity for a given kernel.
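The Roofline bound itself is one line: attainable FLOP/s is the minimum of peak compute and bandwidth times operational intensity. A sketch with illustrative hardware numbers:

```python
# Roofline bound: attainable FLOP/s = min(peak, bandwidth * intensity).
# Hardware numbers are illustrative.

def roofline(intensity_flops_per_byte, peak_flops_s, bw_bytes_s):
    return min(peak_flops_s, bw_bytes_s * intensity_flops_per_byte)

PEAK, BW = 100e12, 1e12   # 100 TFLOP/s, 1 TB/s (illustrative)
workloads = [("vector add", 1 / 12), ("small matmul", 30), ("large matmul", 600)]
for name, oi in workloads:
    print(f"{name:12s} OI={oi:7.2f} -> {roofline(oi, PEAK, BW) / 1e12:6.2f} TFLOP/s")
```

The ridge point sits at OI = PEAK/BW = 100 FLOPs/byte here: below it the memory roof binds, above it the compute roof does.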
Week 12: Benchmarking
| Component | Assignment |
|---|---|
| Read | Benchmarking |
| Lab | Lab 11: Benchmarking Methodology |
| Build | No new module — catch-up week |
| Due | Lab 11 Decision Log |
Learning Objectives: Design a fair benchmark for an ML system. Distinguish throughput, latency, and tail latency (P50/P99). Explain why “faster” is meaningless without specifying the metric, workload, and baseline.
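A small demonstration makes the P50-vs-P99 distinction stick: a service with a fast median can still blow its tail SLA. The latency distribution below is synthetic, and the percentile uses a simple nearest-rank rule:

```python
import random

# Mean latency hides the tail. Latencies below are synthetic: 99% fast
# requests near 10 ms, 1% stragglers near 200 ms.

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]   # nearest-rank

random.seed(0)
lat = ([random.gauss(10, 1) for _ in range(990)]
       + [random.gauss(200, 20) for _ in range(10)])

print(f"mean={sum(lat) / len(lat):.1f} ms  "
      f"P50={percentile(lat, 50):.1f} ms  P99={percentile(lat, 99):.1f} ms")
```

Students should articulate why the mean and P50 look healthy while P99 is an order of magnitude worse, and which one a user-facing SLA should track.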
Part IV: Deployment & Production (Weeks 13–16)
Goal: Deploy systems that don’t fail silently.
Week 13: Model Serving
| Component | Assignment |
|---|---|
| Read | Model Serving |
| Lab | Lab 12: Tail Latency (P99) |
| Build | Capstone prep |
| Due | Lab 12 Decision Log |
Learning Objectives: Explain batching strategies for inference (static, dynamic, continuous). Calculate the throughput-latency tradeoff for a given SLA. Design a serving system that meets a P99 latency target.
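The throughput-latency tradeoff of static batching follows from a two-term cost model: a fixed per-launch cost that batching amortizes, and a per-sample cost that every request pays. The numbers below are illustrative:

```python
# Static batching: bigger batches amortize the fixed per-launch cost
# (higher throughput), but each request waits for the whole batch
# (higher latency). Cost model and numbers are illustrative.

def batch_latency_s(batch, fixed_s=5e-3, per_sample_s=0.5e-3):
    return fixed_s + batch * per_sample_s   # time to run one batch

def throughput_rps(batch):
    return batch / batch_latency_s(batch)

for b in (1, 8, 64):
    print(f"batch={b:3d}: latency={batch_latency_s(b) * 1e3:5.1f} ms, "
          f"throughput={throughput_rps(b):6.0f} req/s")
```

Given a P99 SLA, the design exercise is to find the largest batch whose latency still fits under the target; dynamic and continuous batching refine when the batch is closed, not this underlying tradeoff.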
Week 14: ML Operations
| Component | Assignment |
|---|---|
| Read | ML Operations |
| Lab | Lab 13: Drift Detection |
| Build | Capstone prep |
| Due | Lab 13 Decision Log |
Learning Objectives: Define model drift (data drift, concept drift) and explain why it matters for production systems. Design a monitoring pipeline that detects drift before accuracy degrades. Explain CI/CD for ML models.
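The simplest drift signal students can build in Lab 13 is a z-test on a feature's mean over a live window against the training distribution. This is a sketch only; production monitors use richer per-feature tests (KS statistics, PSI) and the threshold here is an assumption:

```python
import math
import random

# Flag input drift when a window's mean is implausibly far from the
# training mean. Threshold of 4 standard errors is an assumed setting.

def mean_drifted(train_mean, train_std, window, z_threshold=4.0):
    n = len(window)
    window_mean = sum(window) / n
    z = abs(window_mean - train_mean) / (train_std / math.sqrt(n))
    return z > z_threshold

random.seed(0)
steady = [random.gauss(0.0, 1.0) for _ in range(500)]    # matches training
shifted = [random.gauss(0.5, 1.0) for _ in range(500)]   # input drift

print(mean_drifted(0.0, 1.0, steady), mean_drifted(0.0, 1.0, shifted))
```

The systems point: this catches *input* drift before labels arrive, which is why it can fire before accuracy visibly degrades.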
Week 15: Responsible Engineering
| Component | Assignment |
|---|---|
| Read | Responsible Engineering |
| Lab | Lab 14: Fairness and Efficiency |
| Build | Capstone work |
| Due | Lab 14 Decision Log + Capstone draft |
Learning Objectives: Quantify the energy cost of training and inference. Explain how system design choices (precision, batch size, hardware) affect fairness and accessibility. Articulate the engineer’s responsibility beyond accuracy metrics.
Week 16: Capstone — The AI Olympics
| Component | Assignment |
|---|---|
| Read | Conclusion |
| Lab | Lab 15: Capstone Integration |
| Capstone | AI Olympics Competition |
| Due | Final submission + 1,000-word design report |
Capstone Specification: Deploy the “Smart Doorbell” application across multiple tracks (Cloud, Edge, Mobile, Tiny). Maximize accuracy while staying under fixed latency (\(<50\) ms) and memory (\(<256\) KB) budgets. The final deliverable includes a design report whose decisions are traceably mapped to the Iron Law terms.
See Assessment & Grading for the complete AI Olympics rubric.
TinyTorch Module Summary
| Week | Module | Topic | Hours | Milestone Unlocked |
|---|---|---|---|---|
| 1–2 | 01 | Tensor | 4–6 | — |
| 3–4 | 02 | Activations | 5–7 | — |
| 5 | 03 | Layers | 5–7 | Perceptron (1958) |
| 6 | 04 | Losses | 4–6 | XOR Crisis (1969) |
| 7 | 05 | DataLoader | 5–7 | — |
| 8 | 06 | Autograd | 6–8 | — |
| 9 | 07 | Optimizers | 5–7 | — |
| 10–11 | 08 | Training | 6–8 | MLP Revival (1986) |
| 12–16 | — | Capstone focus | — | — |
If teaching both semesters, TinyTorch Modules 09–20 (Convolutions through Capstone) continue in Semester 2 or can be offered as an advanced track alongside Volume II.
Suggested Case Studies
These industry papers pair well with specific weeks. Assign as optional reading or use as discussion starters:
| Week | Topic | Suggested Paper |
|---|---|---|
| 4 | Data Engineering | Sambasivan et al., “Everyone Wants to Do the Model Work, Not the Data Work” (CHI 2021) |
| 7 | ML Frameworks | Chen et al., “TVM: An Automated End-to-End Optimizing Compiler” (OSDI 2018) |
| 10 | Model Compression | Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” (NeurIPS 2022) |
| 11 | HW Acceleration | Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit” (ISCA 2017) |
| 13 | Model Serving | Yu et al., “Orca: A Distributed Serving System for Transformer-Based Models” (OSDI 2022) |
| 14 | ML Operations | Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015) |