TA Guide

Everything you need to run labs, grade assignments, and support students

Welcome to the teaching team. This guide covers what you need to know to be an effective TA for the ML Systems course.


Before the Semester

TA Preparation Checklist

Complete these before Week 1:


Grading Decision Logs

Decision Logs are the most important written artifact in the course. Every student submits one per week, so plan your grading time accordingly. Here is how to grade them efficiently.

The 3-Question Speed Rubric

For each Decision Log, ask three questions:

  1. Numbers? Did the student cite specific values from the instruments? (latency, memory, throughput, accuracy)
  2. Why? Did the student use Iron Law terminology to explain the cause?
  3. Tradeoff? Did the student acknowledge what they sacrificed for what they gained?

Yes to all three → 27-30 points. Missing one → 18-22. Missing two+ → 6-12.
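The mapping above can be sketched as a small helper. This is only an illustration: the function name and signature are hypothetical, and choosing the exact score inside each band is still left to the grader's judgment.

```python
def speed_rubric_band(numbers: bool, why: bool, tradeoff: bool) -> tuple[int, int]:
    """Return the (low, high) point band for one Decision Log.

    Each argument is the grader's yes/no answer to one rubric question.
    """
    missing = [numbers, why, tradeoff].count(False)
    if missing == 0:
        return (27, 30)
    if missing == 1:
        return (18, 22)
    return (6, 12)  # two or three questions missed


# A log with numbers and a cause, but no tradeoff discussion:
print(speed_rubric_band(numbers=True, why=True, tradeoff=False))  # (18, 22)
```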

See Assessment & Grading for the full rubric and sample student work at each quality level.

Time Budget

Expect 3–5 minutes per Decision Log using the speed rubric. For a 30-student section, that’s ~2 hours per week. See Assessment & Grading for full grading load estimates across all assignment types.
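To adapt the estimate to a different section size, the arithmetic can be sketched as follows. The function name is hypothetical; the 3 and 5 minute bounds are the ones quoted above.

```python
def grading_minutes(num_students, minutes_per_log=(3, 5)):
    """Return (low, high) total minutes to grade one week of Decision Logs."""
    low, high = minutes_per_log
    return num_students * low, num_students * high


low, high = grading_minutes(30)
print(f"{low}-{high} minutes, i.e. {low / 60:.1f}-{high / 60:.1f} hours")
# 90-150 minutes, i.e. 1.5-2.5 hours
```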

Batch Grading Tips

  • Grade all Decision Logs for one week in a single sitting (consistency matters)
  • Read the excellent sample first to calibrate your expectations
  • Mark the first 5, then check with another TA — align before continuing
  • Flag borderline cases for the instructor rather than agonizing

Grading TinyTorch

Auto-Graded (70 points)

  • Run pytest on student submissions — pass/fail per test
  • Students who pass all tests get 70/70; each test is all-or-nothing, with no partial credit within a test
  • If a student passes 90%+ of tests, check whether the failures are edge cases or fundamental errors
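One way to apply the 90% check is to have pytest write a JUnit-style report (`pytest --junitxml=report.xml`, where the file name is arbitrary) and summarize it. The sketch below is an assumption about workflow, not the course's actual grading harness; `summarize_junit` and `REVIEW_THRESHOLD` are hypothetical names, and skipped tests are treated as not passed.

```python
import xml.etree.ElementTree as ET

REVIEW_THRESHOLD = 0.90  # the "90%+" cutoff from the guideline above


def summarize_junit(xml_text: str) -> dict:
    """Summarize a JUnit XML report and flag near-complete submissions."""
    root = ET.fromstring(xml_text)
    total = not_passed = 0
    # The root may be <testsuites> or a single <testsuite>; iter() covers both.
    for suite in root.iter("testsuite"):
        total += int(suite.get("tests", 0))
        not_passed += (
            int(suite.get("failures", 0))
            + int(suite.get("errors", 0))
            + int(suite.get("skipped", 0))
        )
    passed = total - not_passed
    pass_rate = passed / total if total else 0.0
    all_passed = total > 0 and passed == total
    return {
        "passed": passed,
        "total": total,
        "pass_rate": pass_rate,
        "all_passed": all_passed,
        # Not perfect, but close enough that a TA should read the failures.
        "needs_manual_review": (not all_passed) and pass_rate >= REVIEW_THRESHOLD,
    }


sample = (
    '<testsuites><testsuite name="pytest" tests="20" failures="1" '
    'errors="0" skipped="0"></testsuite></testsuites>'
)
print(summarize_junit(sample)["needs_manual_review"])  # True
```

The helper only reports counts and a review flag. How many of the 70 points a partially passing submission receives is a grading decision it does not make.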

Systems Thinking Questions (30 points)

Each module has 3 manually-graded questions (10 points each). Use this scale:

Score | What It Looks Like
10 | Correct reasoning + quantitative estimate + hardware awareness
7 | Right direction, missing numbers or hardware specifics
4 | Partially correct with significant conceptual gaps
1 | Attempted but fundamentally wrong

Running Lab Sections (50 minutes)

Probing Questions to Use While Circulating

Situation | What to Ask
Student says “it’s faster” | “How much faster? Which Iron Law term changed?”
Student hits an OOM error | “Find the exact value where it breaks. What constraint did you hit?”
Student doesn’t know what to try | “Change one variable. What happened? Now try a different one.”
Student finishes Part B early | “Can you find a configuration 2x better than your best? What’s the limit?”
Student’s prediction was wrong | “What did you assume that turned out to be false?”

Common Student Struggles by Week

Semester 1 (Foundations)

Week | Common Issue | How to Help
1-2 | “What is a system? I thought this was an ML class.” | Redirect: “The model is just one layer. What carries the data to the model? What executes the math?”
5-6 | TinyTorch Module 03 (Layers) — broadcasting bugs | Check tensor shapes at each step; remind students that NumPy broadcasting rules apply
6-8 | TinyTorch Module 06 (Autograd) — wrong gradients | Most common cause: incorrect topological sort order. Have them draw the computation graph on paper first
8 | “My training loop is slow” | Ask: “Is the GPU actually busy? Check utilization. The bottleneck is usually data loading, not compute.”
10 | Lab 09 (Quantization) — “INT8 destroyed my model” | Check if they are quantizing batch norm layers. Remind them to use calibration data
13-16 | Capstone overwhelm | Break it down: “First, meet the accuracy target. Then optimize for latency. Then for memory. One constraint at a time.”
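For the Module 06 row, it can help to have a reference picture of what a correct backward ordering looks like. The sketch below is a generic reverse topological sort over a toy graph. The `Node` class and `backward_order` function are hypothetical and do not reflect TinyTorch's actual API.

```python
class Node:
    """Hypothetical stand-in for a tensor in a computation graph."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)  # nodes this one was computed from


def backward_order(output):
    """Return nodes in the order a backward pass should visit them."""
    topo, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent in node.parents:
            visit(parent)
        topo.append(node)  # appended only after all of its inputs

    visit(output)
    return topo[::-1]  # reversed: output first, leaves last


# Diamond graph: x feeds both a and b, which together produce y.
x = Node("x")
a = Node("a", [x])
b = Node("b", [x])
y = Node("y", [a, b])
print([n.name for n in backward_order(y)])  # ['y', 'b', 'a', 'x']
```

The property to check in a student's implementation is that `x` is processed only after both `a` and `b`, so its gradient has accumulated contributions from every path before it is used.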

Semester 2 (Scale)

Week | Common Issue | How to Help
5 | “Which parallelism should I use?” | “Calculate communication-to-computation ratio for each strategy. The math tells you.”
6 | “AllReduce is confusing” | Draw the ring on the whiteboard with 4 nodes. Walk through one full cycle
7 | “Why checkpoint so often?” | “Calculate expected time-to-failure for 1000 GPUs. Now multiply by cost per GPU-hour.”
10 | KV-cache memory confusion | “How many bytes per token per layer? Multiply by sequence length times batch size times number of layers.”
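The Week 7 and Week 10 prompts reduce to short calculations, sketched below so you can check a student's arithmetic. The function names are hypothetical. The model dimensions and the per-GPU failure figure in the usage lines are illustrative placeholders, not course data. The failure estimate assumes independent failures at a constant rate.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Total KV-cache size in bytes (bytes_per_value=2 corresponds to FP16)."""
    # Keys and values are both cached, hence the factor of 2.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * seq_len * batch_size * num_layers


def cluster_mtbf_hours(per_gpu_mtbf_hours, num_gpus):
    """Expected hours between failures anywhere in the cluster.

    Assumes independent, constant-rate failures, so failure rates add.
    """
    return per_gpu_mtbf_hours / num_gpus


# Illustrative model: 32 layers, 32 KV heads of dimension 128, FP16 values.
print(kv_cache_bytes(1, 4096, 32, 32, 128) / 2**30)  # 2.0 (GiB)

# Placeholder figure: one failure per 50,000 hours per GPU, 1000 GPUs.
print(cluster_mtbf_hours(50_000, 1000))  # 50.0
```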

Office Hours Protocol

How Much Help is Too Much?

  • Do: Ask clarifying questions. Help students debug their approach, not their code.
  • Do: Point students to the right textbook section or lab instrument.
  • Don’t: Write code for students. Don’t give away Part C answers.
  • Don’t: Debug TinyTorch implementations line by line — have them add print statements and explain what they see.

The 10-Minute Rule

If a student has been stuck for 10+ minutes during office hours:

  1. Ask them to explain what they’ve tried (this often unsticks them)
  2. If still stuck, narrow the problem: “Is it a shape error, a value error, or a logic error?”
  3. If still stuck after 15 minutes, give a directed hint: “Look at how the gradient flows through this specific node”

Escalation

  • Grading disputes: Flag for the instructor. Do not change a grade you assigned without discussing it first.
  • Academic integrity concerns: Flag for the instructor immediately. Do not confront the student.
  • Accessibility needs: Refer to the instructor and campus disability services.

Quick Reference: What’s Due Each Week

See the full syllabi for detailed weekly breakdowns:

Each week, students typically submit:

  1. A Decision Log (200 words) for the lab they completed
  2. A TinyTorch module (Foundations only) auto-graded via pytest
  3. A Design Challenge (bi-weekly) for the open-ended Part C problems