Week 14: Course Synthesis — The Architecture 2.0 Roadmap
We set out thirteen weeks ago with an ambitious question: Can AI agents become co-designers of computer systems? Not just tools that optimize within fixed constraints, but true collaborators that reason across the full computing stack—from code to silicon, from algorithms to physical layout, from specifications to verified implementations.
The answer is yes—and also no. It depends on what you mean by “co-designer.”
The simple part: Yes, AI can meaningfully contribute to system design in ways that were impossible even two years ago. We’ve seen concrete examples at every layer of the stack.
The complex part: The challenges that remain aren’t just technical hurdles to overcome with bigger models or more compute. They’re fundamental questions about the nature of system design itself—questions about circular dependencies that resist decomposition, tacit knowledge that exists only in experienced architects’ intuitions, and trust in systems whose complexity exceeds human comprehension.
This isn’t a conclusion. It’s a roadmap.
🔑 Key Takeaways
- AI amplifies, not replaces: The most successful approaches combine AI exploration with human judgment, not autonomous AI design
- Three types of tacit reasoning: Co-design (circular dependencies), predictive (uncertain futures), and adaptive (continuous adjustment) reasoning define architectural expertise
- Trust requires diversity: Validation through multiple independent approaches, not blind faith in any single AI system
- Feedback loops are the bottleneck: Closing the gap between design decisions and their consequences enables faster iteration
- Hybrid approaches win: Combine analytical models (what we know) with learning (what we can't articulate) rather than pure ML or pure analysis
Where We Started: The Architecture 2.0 Vision
Week 1 introduced the TAO to TAOS framework. Traditional computer systems innovation followed TAO: Technology (Moore’s Law), Architecture (exploiting parallelism), and Optimization (compiler advances). This worked when design spaces were tractable and human intuition could guide exploration.
But three converging forces made this insufficient:
- Demand explosion: Every domain now needs specialized hardware (LLMs, autonomous vehicles, cryptocurrency, video transcoding). One-size-fits-all computing is over.
- Talent crisis: Training competent architects takes years. Demand for specialized hardware grows exponentially. We can’t train architects fast enough.
- AI inflection point: For the first time, AI systems can understand code, reason about performance, and generate functional designs.
The S in TAOS stands for Specialization—not as another optimization technique, but as a fundamental shift requiring AI assistance. Modern design spaces contain 10^14 to 10^2300 possible configurations. If you evaluated one design every microsecond, exploring 10^14 possibilities would take over three years. For 10^2300 configurations, you’d need more time than has elapsed since the Big Bang—by a factor larger than the number of atoms in the universe.
This isn’t hyperbole. It’s the mathematical reality of the design spaces we now face.
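For concreteness, here is that back-of-the-envelope arithmetic in a few lines of Python. The rate of one design evaluation per microsecond is an optimistic assumption; real simulations take far longer, which only strengthens the point.

```python
# Back-of-the-envelope check on design-space sizes (numbers from the text).
import math

EVALS_PER_SECOND = 1_000_000          # assume one design evaluated per microsecond
SECONDS_PER_YEAR = 365 * 24 * 3600

small_space = 10**14
years = small_space / EVALS_PER_SECOND / SECONDS_PER_YEAR
print(f"10^14 designs at 1 us each: {years:.1f} years")   # a bit over 3 years

# For 10^2300 designs, work in log space to avoid overflow.
log10_space = 2300
log10_seconds = log10_space - math.log10(EVALS_PER_SECOND)
log10_years = log10_seconds - math.log10(SECONDS_PER_YEAR)
print(f"10^2300 designs: ~10^{log10_years:.0f} years "
      f"(the universe is ~1.4e10 years old)")
```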
Week 2 then confronted us with five fundamental challenges that make Architecture 2.0 uniquely difficult: the dataset crisis (we must deliberately create every training example through expensive simulation), the algorithm evolution treadmill (ML techniques evolve faster than we can build systems around them), the tools infrastructure gap (our simulators weren’t designed for AI agents), the reproducibility crisis, and real-world robustness requirements. These challenges compound each other, creating feedback loops that have kept AI-assisted architecture perpetually “five years away.” Understanding these obstacles was essential before diving into solutions.
The Three-Phase Journey: From Code to Silicon
The course structure wasn’t arbitrary. We deliberately moved from software (where AI already shows promise) through architecture (where tacit knowledge dominates) to chip design (where irrevocability creates existential constraints). Each phase revealed different challenges and different forms of reasoning required for system design.
[Figure: The three-phase course arc. Phase 1, AI for Software: Week 3 Code Generation → Week 4 CPU Optimization → Week 5 GPU Kernels → Week 6 Distributed Systems. Phase 2, AI for Architecture: Week 7 Tacit Knowledge → Week 8 Co-Design → Week 9 Prediction → Week 10 Adaptation. Phase 3, AI for Chip Design: Week 11 RTL Generation → Week 12 Physical Design → Week 13 Verification.]
Phase 1: AI for Software (Weeks 3-6)
Software provided our entry point because the artifacts are explicit—literally written as text—and feedback loops are relatively tight. You can compile, test, iterate. Mistakes are fixable.
Week 3 confronted us with the reality gap: AI that wins programming contests struggles with production code. The insight wasn’t that contests are easy—they’re not. It’s that contest problems have clear specifications, deterministic verification, and isolated scope. Real software engineering demands understanding system context, navigating technical debt, and making architectural tradeoffs across codebases evolved over years.
SWE-bench exposed this gap systematically. Even frontier models struggle with real GitHub issues because production software exists in complex ecosystems where the “correct” solution depends on constraints never written down.
Week 4 showed us production reality at Google scale with ECO. CPU performance optimization isn’t just about finding faster algorithms—it’s about validation pipelines, continuous monitoring, rollback mechanisms, and human review. ECO made 6,400 commits in production, saving 500,000 CPU cores of compute. But every single change required human expert review.
The lesson: AI doesn’t eliminate human expertise. It amplifies it. And the infrastructure to deploy AI-generated optimizations safely is often more complex than the optimization algorithms themselves.
Week 5 dove into GPU kernel optimization, where complexity actually favors AI. Kevin’s multi-turn RL approach showed that mimicking expert iterative refinement beats single-shot generation. KernelBench revealed that even frontier models struggle to match PyTorch baselines, but the trajectory is clear: when human expertise is bottlenecked (GPU optimization experts are scarce and expensive), AI assistance becomes economically compelling.
Week 6 marked a fundamental shift—from deterministic single-machine optimization to probabilistic distributed system adaptation. COSMIC’s co-design of workload mapping and network topology demonstrated something crucial: optimal design requires reasoning across traditional abstraction boundaries. You cannot optimize workload and infrastructure separately. The artificial boundaries we created for human comprehension limit what’s achievable.
The transition from Week 3 to Week 6 traced an arc: from explicit artifacts (code) to implicit behaviors (distributed system dynamics), from static optimization (fix the code once) to continuous adaptation (adjust to changing conditions), from testing (does it work?) to formal reasoning (can we prove properties hold?).
Phase 2: AI for Architecture (Weeks 7-10)
Architecture forced us to confront knowledge that exists nowhere in written form—the tacit expertise accumulated by senior architects through decades of building systems.
Week 7 asked the fundamental epistemological question: How do AI agents learn what was never written down? We examined two approaches. Concorde encodes domain knowledge explicitly through analytical models (roofline, queueing theory) combined with ML for second-order effects. ArchGym creates experiential learning environments where agents develop intuition through exploration.
Both work, but they capture fundamentally different types of knowledge. Concorde represents what we can articulate. ArchGym attempts to learn what we cannot. The gap between them defines the frontier of current capabilities.
Week 8 introduced co-design reasoning—the first type of tacit architectural knowledge. Through the lens of hardware/software mapping, we saw how circular dependencies emerge: tile sizes depend on memory hierarchy, which depends on access patterns, which depend on dataflow, which depends on bandwidth, which depends on tile sizes. You’re back where you started.
This isn’t just hard. It’s a different kind of problem. Traditional optimization assumes you can decompose into independent subproblems. Co-design problems resist decomposition. Everything depends on everything else.
DOSA and AutoTVM showed contrasting approaches: encode relationships explicitly and use gradients (analytical), or learn from experience and generalize (experiential). Neither fully solves the problem. Both reveal what solving it would require.
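To make the circularity concrete, here is a deliberately tiny sketch with an invented cost model: a tile size and an on-chip buffer size that each look easy to pick in isolation, but whose best values depend on each other. All numbers are made up purely for illustration.

```python
# Toy co-design loop: tile size and buffer size are coupled. The best tile
# depends on the buffer, and the buffer you need depends on the tile.
import itertools

TILES   = [8, 16, 32, 64, 128]
BUFFERS = [32, 64, 128, 256, 512]   # KB

def latency(tile, buffer_kb):
    """Invented cost model: bigger tiles improve reuse, but only while the
    working set (~3 tiles of fp32 data) fits in the buffer; otherwise spills
    dominate. Bigger buffers cost area/energy."""
    working_set = 3 * tile * tile * 4 / 1024               # KB
    compute     = 1000 / tile                               # better reuse, fewer trips
    spill       = 0 if working_set <= buffer_kb else 50 * (working_set - buffer_kb)
    buffer_cost = 0.2 * buffer_kb
    return compute + spill + buffer_cost

# "Decomposed" flow: pick the tile assuming a generous buffer, then size the buffer.
tile_seq   = min(TILES, key=lambda t: latency(t, max(BUFFERS)))
buffer_seq = min(BUFFERS, key=lambda b: latency(tile_seq, b))
print("sequential:", tile_seq, buffer_seq, round(latency(tile_seq, buffer_seq), 1))

# Joint search over the coupled space.
tile_j, buffer_j = min(itertools.product(TILES, BUFFERS), key=lambda tb: latency(*tb))
print("joint     :", tile_j, buffer_j, round(latency(tile_j, buffer_j), 1))
```

The sequential flow commits to a large tile and then pays for the buffer it implies; the joint search lands on a smaller tile with a much cheaper buffer and roughly half the cost. Real mappers face the same structure across many more coupled variables.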
Week 9 examined predictive reasoning—the second type of tacit knowledge. Memory system architects must design for access patterns they cannot fully observe or characterize. The patterns are sparse (1% signal, 99% noise), heterogeneous (different workloads behave completely differently), and time-sensitive (predict too early or too late and prefetching fails).
The contrast between learned prefetchers and learned indexes was illuminating. Prefetching struggles because patterns are unstable and timing is critical. Learned indexes succeed because data distributions are relatively stable and timing is flexible. The lesson: ML works when patterns exist and remain relatively constant. When patterns shift continuously or timing constraints are nanosecond-scale, traditional approaches often win.
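Here is a minimal learned-index sketch, using uniformly distributed keys and a plain linear fit as a stand-in for the small models in the literature: the model guesses a position, and an error bound computed at build time limits how far the correction search must look. Everything about the example (distribution, model, sizes) is chosen for simplicity, but it shows why the pattern works when data is stable and timing is flexible, conditions a prefetcher never gets.

```python
# Minimal learned-index sketch: predict a key's position in a sorted array,
# then do a bounded local search to correct the prediction.
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0, 1e9, size=1_000_000))
positions = np.arange(len(keys))

# "Model": a linear fit from key value to position. Real learned indexes use
# small neural nets or piecewise models; the idea is the same.
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = np.clip(slope * keys + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.max(np.abs(pred - positions)))) + 1   # build-time error bound

def lookup(key):
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err)
    return lo + int(np.searchsorted(keys[lo:hi], key))          # search a small window

print("error bound:", max_err, "positions out of", len(keys))
print("lookup ok  :", lookup(keys[123_456]) == 123_456)
```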
Week 10 explored adaptive reasoning—the third type of tacit knowledge. LLM serving systems must continuously adjust as conditions change: workload patterns shift, resource needs emerge during execution (KV-cache grows with every token), and optimal strategies evolve. Learning to rank treats scheduling as a continuously learning system. Text-to-text regression shows that semantic understanding of configurations outperforms numerical feature engineering.
Here’s what made Week 10 unique: nobody has decades of experience. LLM serving at scale only emerged 2-3 years ago. The tacit knowledge doesn’t exist yet—it’s being created right now through current experience. This creates both challenge (no established playbook) and opportunity (humans and AI learning together, systems that explain their own behavior, optimization through semantic understanding rather than black-box tuning).
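As a cartoon of why ranking helps even when predictions are rough, the sketch below simulates a batch of requests with invented length distributions: scheduling by a noisy predicted rank still beats first-come-first-served on average completion time. This is only an illustration of the learning-to-rank idea, not the actual serving system.

```python
# Toy ranking-based scheduling: even an imperfect output-length predictor lets
# you approximate shortest-job-first. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
true_len = rng.lognormal(mean=5, sigma=1, size=200)        # decode steps per request
noisy_rank = true_len * rng.lognormal(0, 0.5, size=200)    # imperfect length predictor

def avg_completion(order):
    lens = true_len[order]
    return np.mean(np.cumsum(lens))      # completion time of each job, averaged

fifo   = np.arange(200)                  # serve in arrival order
ranked = np.argsort(noisy_rank)          # shortest-predicted-first
print(f"FIFO   : {avg_completion(fifo):,.0f} steps")
print(f"Ranked : {avg_completion(ranked):,.0f} steps")
```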
The progression from Week 7 to Week 10 revealed that architectural thinking isn’t monolithic. It’s a collection of distinct reasoning types:
- Co-design: Handle circular dependencies
- Predictive: Design for uncertain futures with incomplete information
- Adaptive: Continuously adjust as conditions change
Current AI systems possess fragments of these capabilities but don’t integrate them. Analytical models (like DOSA) encode structural understanding but lack adaptation. Learning-based approaches (like AutoTVM, Learning to Rank) recognize patterns but need structural guidance. The path forward requires knowing when to apply which type of reasoning.
Phase 3: AI for Chip Design (Weeks 11-13)
Chip design confronted us with the irrevocability constraint: once you tape out, you can’t patch hardware. Software engineers fix bugs with updates. Hardware engineers recall products and lose hundreds of millions of dollars.
Week 11 established the stakes. RTL generation isn’t just about writing syntactically correct Verilog. It’s about creating designs that synthesize efficiently, meet timing, physically realize, and work correctly. Multi-stage feedback loops mean problems discovered in physical design (weeks later) require architectural changes (restart from scratch). Benchmarking is fundamentally harder than software because correctness is necessary but not sufficient—performance, power, and area all matter, and you only know final quality after months of implementation.
Week 12 revealed the constraint inversion: at modern process nodes (5nm, 3nm, beyond), physical design limitations fundamentally constrain what architectures are viable. Wire delay exceeds gate delay. Power delivery limits density. Routing congestion constrains connectivity. Physics dominates.
You cannot design architecture independently from physical implementation anymore. But physical design takes months. This creates a feedback loop crisis: by the time you discover your architecture won’t physically realize, you’ve invested months. With a 2-3 year design cycle, you can explore maybe 10-20 architectural alternatives—sampling 10^-13 of the design space.
The philosophical divide between RL-based placement and GPU-accelerated optimization (DREAMPlace) crystallizes a broader tension: learn strategies through experience versus encode domain knowledge explicitly. Both aim to close feedback loops from months to hours. Neither fully succeeds yet. But both suggest directions for AI’s real opportunity—not autonomous chip design, but enabling rapid exploration of physically-viable architectures.
Week 13 closed the loop with verification: does the chip actually work correctly? The state space explosion means a simple 32-bit processor has more possible states than there are atoms in 10^3700 universes. You cannot test your way to correctness. Formal verification offers mathematical proof but faces scalability limits and the specification problem (who verifies the specification?).
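The arithmetic behind that kind of claim is worth seeing once. The state-bit counts below are illustrative, not a model of any particular processor.

```python
# State-count arithmetic: every extra flip-flop doubles the state space.
import math

ATOMS_IN_UNIVERSE_LOG10 = 80   # rough figure for the observable universe

for state_bits in (1_024, 4_096, 16_384):   # e.g. register file + pipeline + buffers
    log10_states = state_bits * math.log10(2)
    print(f"{state_bits:6d} bits of state -> ~10^{log10_states:.0f} states "
          f"(~10^{log10_states - ATOMS_IN_UNIVERSE_LOG10:.0f} x the atoms in the universe)")
```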
LLM-assisted verification promises to lower barriers—generating assertions from natural language properties, suggesting common properties for design patterns, accelerating debug. But it introduces the circular verification problem: if AI helps design and verify chips, how do we ensure we’re not systematically introducing correlated errors?
The trust problem pervades Phase 3. We need validation at every stage, but the irrevocability constraint means we can’t afford to be wrong. Trust must be earned through diversity (multiple independent approaches), transparency (make reasoning visible), bounded domains (use AI only for well-defined tasks), and empirical validation (silicon remains the final arbiter).
The Five Fundamental Challenges: A Systems Perspective
Stepping back from the week-by-week narrative, five meta-challenges emerged that cut across all domains:
[Figure: Five cross-cutting challenges. 1. Feedback Loops: information arrives too late. 2. Tacit Knowledge: expertise never written down. 3. Trust & Validation: how do we know it's correct? 4. Co-Design Boundaries: everything depends on everything. 5. Determinism to Dynamism: static versus continuous adaptation.]
Challenge 1: The Feedback Loop Crisis
At every layer, slow feedback loops limit iteration:
- Software: ECO needs weeks of validation before deployment
- GPU kernels: Profile, refine, repeat—hours per iteration
- Architecture: Physical design feedback takes months
- Verification: Bugs discovered late are exponentially expensive
The pattern is the same everywhere: critical information arrives too late. Architecture decisions get made without knowing if they’ll physically realize. Design choices get locked in without verification feedback. Optimizations ship without understanding production impact.
AI can help by predicting outcomes faster. You don’t need perfect accuracy—directional correctness is often enough. If a fast model can say “Design A will probably meet timing but Design B won’t,” that’s useful even if the frequency predictions are off.
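A small synthetic example of what directional correctness buys you: a crude surrogate whose absolute predictions are biased can still order candidate designs correctly most of the time. The cost function and features below are invented for illustration.

```python
# Directional correctness is cheaper than accuracy: a rough surrogate that
# mispredicts absolute values can still rank designs well, which is what
# exploration needs. Toy model and data.
import numpy as np

rng = np.random.default_rng(2)

def true_slack(x):                        # pretend: the slow place-and-route result
    return 1.0 - 0.8 * x[0] + 0.3 * x[1] ** 2 - 0.2 * x[0] * x[1]

X = rng.uniform(0, 1, size=(40, 2))                    # 40 "completed" designs
y = np.array([true_slack(x) for x in X])

# Cheap surrogate: linear least squares on two features (misses the interaction
# term, so its absolute predictions are biased).
A = np.c_[X, np.ones(len(X))]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(Z):
    return np.c_[Z, np.ones(len(Z))] @ coef

cand = rng.uniform(0, 1, size=(200, 2))                # unexplored candidates
pred, true = predict(cand), np.array([true_slack(c) for c in cand])
pairs = rng.integers(0, 200, size=(1000, 2))
agree = np.mean((pred[pairs[:, 0]] > pred[pairs[:, 1]]) ==
                (true[pairs[:, 0]] > true[pairs[:, 1]]))
print(f"mean abs error           : {np.mean(np.abs(pred - true)):.3f}")
print(f"pairwise ranking accuracy: {agree:.0%}")
```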
But here’s the thing: maybe we’re optimizing the wrong problem. Instead of making slow processes faster, should we redesign processes to provide continuous feedback? Verification-in-the-loop rather than verification-at-the-end. Physical-awareness during architectural design rather than discovering constraints months later. Systems designed for AI-assisted workflows rather than AI retrofitted into human workflows.
Challenge 2: The Tacit Knowledge Problem
The most valuable insights about system design exist only in experienced architects’ heads:
- Memory system experts “just know” which patterns are worth optimizing for
- Physical designers have intuition about what floorplans will route well
- Verification engineers recognize which properties are important to check
- Performance engineers understand when to trust models versus empirical data
This knowledge was developed through decades of seeing what works and what fails. It’s relational (how components interact), context-dependent (when strategies apply), and often subconscious (experts struggle to articulate their reasoning).
We identified three types of tacit architectural reasoning:
- Co-design reasoning: Handle circular dependencies where you cannot optimize components independently
- Predictive reasoning: Design for uncertain futures with fundamentally incomplete information
- Adaptive reasoning: Continuously adjust as conditions change, learning from live system behavior
[Figure: The three types of tacit reasoning. Co-design: circular dependencies among coupled decisions. Predictive: incomplete information → pattern recognition → informed bets. Adaptive: a continuous observe → learn → adjust loop.]
Current AI approaches struggle with all three. Supervised learning can’t easily learn reasoning that experts can’t articulate. Reinforcement learning requires expensive exploration and struggles with sparse rewards. Analytical models work when we can codify relationships but miss emergent behaviors.
What works? Hybrid approaches. Concorde composes analytical models with ML. Kevin learns iterative refinement rather than one-shot generation. Text-to-text regression shows semantic understanding beating numerical features. The pattern: combine what you can articulate with what you can only learn from data.
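Here is a minimal sketch of that composition, loosely in the spirit of Concorde-style modeling but not its actual implementation: an analytical roofline bound carries the structure, and a small learned model corrects only the residual. The accelerator numbers and the synthetic "measurements" are invented so that the residual is learnable.

```python
# Hybrid sketch: analytical roofline bound + learned residual correction.
import numpy as np

rng = np.random.default_rng(3)
PEAK_FLOPS, PEAK_BW = 100e12, 2e12            # invented accelerator: 100 TFLOP/s, 2 TB/s

def roofline_time(flops, bytes_moved):
    """First-principles lower bound: compute-bound or bandwidth-bound."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# Synthetic "measurements": the analytical bound times a messy second-order factor.
flops  = rng.uniform(1e12, 1e14, 200)
bytes_ = rng.uniform(1e10, 1e12, 200)
analytic = np.array([roofline_time(f, b) for f, b in zip(flops, bytes_)])
measured = analytic * (1.1 + 0.08 * np.log(bytes_ / 1e10) + rng.normal(0, 0.02, 200))

# Learn only the residual ratio (measured / analytic), not the whole mapping.
feats = np.c_[np.log(flops), np.log(bytes_), np.ones(200)]
coef, *_ = np.linalg.lstsq(feats, measured / analytic, rcond=None)

def hybrid_time(f, b):
    correction = np.array([np.log(f), np.log(b), 1.0]) @ coef
    return roofline_time(f, b) * correction

err_analytic = np.mean(np.abs(analytic - measured) / measured)
err_hybrid   = np.mean([abs(hybrid_time(f, b) - m) / m
                        for f, b, m in zip(flops, bytes_, measured)])
print(f"analytical-only error: {err_analytic:.1%}   hybrid error: {err_hybrid:.1%}")
```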
But there’s a humbling realization: some knowledge may remain tacit. The philosophical frameworks about where to locate complexity, the intuitions about which design patterns will age well, the judgment calls under uncertainty—these may always require human architects. Architecture 2.0 isn’t about eliminating human expertise. It’s about creating interfaces for human-AI collaboration where agents handle systematic exploration and humans provide philosophical guidance.
Challenge 3: Trust and Validation
At every layer, we confronted the question: How do we know AI-generated designs are correct?
The irrevocability constraint makes this existential. Software bugs are fixable. Hardware bugs cost millions. But the trust problem goes deeper than just avoiding mistakes:
Circular validation: If AI generates RTL and AI generates verification assertions, and both misinterpret the specification identically, verification will pass despite the design being wrong. Using AI to check AI’s work doesn’t eliminate errors if both make correlated mistakes.
Simulation-reality gaps: Models that predict physical design outcomes must be validated against actual implementations. But you only know ground truth after months of implementation. Fast evaluation is inaccurate. Accurate evaluation is too slow for exploration.
Black-box reasoning: When an RL agent places macros or a learned policy schedules requests, we see outputs but not reasoning. This makes debugging hard and trust harder. If something goes wrong, we don’t know why or how to fix it.
What works? Diversity—use multiple independent tools and cross-check for consistency. If three different approaches agree, you’re probably okay. Transparency matters too: analytical methods like DREAMPlace are auditable (you can inspect the objective function), while RL-based placement is a black box. For production, transparency often beats raw performance. And ultimately, there’s no substitute for building systems and testing them. Models help guide exploration, but silicon is the final arbiter.
But here’s the insight that matters: trust isn’t binary. You don’t need perfect confidence to make progress. You need enough confidence to make the next decision. If AI can provide “Design A is probably better than Design B” with 80% confidence, that’s sufficient to guide exploration even if you’ll validate thoroughly before committing.
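A toy illustration of both ideas, with invented numbers: three independent, individually mediocre predictors each pick the better of two designs, and gating decisions on their agreement trades coverage for a much lower error rate.

```python
# Trust through diversity: act only where independent predictors agree.
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
quality_a, quality_b = rng.normal(0, 1, N), rng.normal(0, 1, N)
truth = quality_a > quality_b                      # which design is really better

def predictor(noise):
    """A noisy, independent estimate of the same comparison."""
    return (quality_a + rng.normal(0, noise, N)) > (quality_b + rng.normal(0, noise, N))

votes = np.stack([predictor(1.0), predictor(1.0), predictor(1.0)])   # each ~75% accurate
unanimous = votes.all(axis=0) | (~votes).all(axis=0)
majority  = votes.sum(axis=0) >= 2

print(f"single predictor accuracy : {np.mean(votes[0] == truth):.1%}")
print(f"majority-vote accuracy    : {np.mean(majority == truth):.1%}")
print(f"unanimous cases           : {np.mean(unanimous):.1%} of pairs, "
      f"accuracy {np.mean(majority[unanimous] == truth[unanimous]):.1%}")
```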
The future of trustworthy AI for systems: not autonomous design, but decision support that amplifies human judgment.
Challenge 4: Co-Design Across Boundaries
The artificial boundaries we created for human comprehension—abstraction layers separating software from architecture from physical implementation—limit what’s achievable.
We saw this escalate across the course:
- Week 6: COSMIC co-designs workload mapping and network topology. Optimizing either independently leaves performance on the table.
- Week 8: Hardware/software mapping creates circular dependencies where tile sizes, memory hierarchy, dataflow, and bandwidth constraints form an interdependent web.
- Week 10: LLM serving demands co-optimizing model architecture, system software, and hardware together. Request scheduling depends on model behavior depends on hardware capabilities depends on scheduling decisions.
- Week 12: Architecture, physical layout, and workload characteristics form a three-way co-design problem. You cannot design architecture without knowing physical constraints. You cannot optimize physical layout without understanding workload priorities. You cannot characterize workloads without knowing what architecture will run them.
Traditional approaches solve these problems sequentially: design architecture, then implement it, discover problems, iterate. When iterations take months and cost millions, you can only afford a few iterations per generation.
AI’s opportunity: Reason across boundaries simultaneously. Not replacing human decision-making, but exploring the coupled space faster. “If I make the cache this size, what mappings become possible? If I optimize for these mappings, what does that imply about the hardware I should build? If I build that hardware, will my target workloads run efficiently?”
But there’s a deeper challenge: current ML frameworks assume you can define a clear objective function and search for solutions that maximize it. Co-design problems don’t work this way. The objective itself changes depending on design choices you haven’t made yet. The evaluation function depends on decisions you’re trying to make.
This suggests we need new optimization frameworks that handle coupled objectives, emergent constraints, and circular dependencies. Not just better search algorithms within existing paradigms, but new paradigms that match the structure of real design problems.
Challenge 5: From Determinism to Dynamism
The progression from Phase 1 to Phase 2 to Phase 3 traced an arc from deterministic (software, where code either works or doesn’t) to probabilistic (distributed systems, where behavior emerges from interactions) to irrevocable (hardware, where mistakes are permanent).
But there’s another dimension that matters: some problems are about finding a good solution once (compiler optimization, physical design, RTL generation), while others require continuous adaptation (distributed scheduling, LLM serving, power management). These need fundamentally different AI approaches.
For static optimization, AI explores design spaces and suggests things humans would miss. You run it once, get a better artifact, done. For dynamic adaptation, AI must keep learning as conditions change—workloads shift, resource needs emerge, optimal strategies evolve. The goal isn’t a solution but a policy that adjusts. Static optimization benefits from analytical methods, careful search, and extensive validation. Dynamic adaptation requires online learning, safe exploration, and graceful degradation.
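The difference is easy to see in a toy non-stationary setting (all numbers invented): a strategy chosen once from an initial benchmark goes stale when the workload shifts, while a simple epsilon-greedy policy that keeps updating its estimates tracks the change.

```python
# Static choice vs. continuous adaptation, as a toy non-stationary bandit.
import numpy as np

rng = np.random.default_rng(5)
T = 10_000

def reward(strategy, t):
    good = 0 if t < T // 2 else 1                  # workload shift at the midpoint
    return rng.normal(1.0 if strategy == good else 0.6, 0.1)

# Static: benchmark both strategies briefly at the start, then commit forever.
trial = [np.mean([reward(s, 0) for _ in range(50)]) for s in (0, 1)]
static_choice = int(np.argmax(trial))
static_total = sum(reward(static_choice, t) for t in range(T))

# Adaptive: epsilon-greedy with an exponential moving average per strategy.
est, eps, alpha = [0.8, 0.8], 0.05, 0.02
adaptive_total = 0.0
for t in range(T):
    s = rng.integers(2) if rng.random() < eps else int(np.argmax(est))
    r = reward(s, t)
    est[s] += alpha * (r - est[s])                 # keep tracking the moving target
    adaptive_total += r

print(f"static  : {static_total / T:.3f} avg reward")
print(f"adaptive: {adaptive_total / T:.3f} avg reward")
```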
Current AI research often conflates these. Papers show offline learning on fixed datasets, then claim the approach will work for online adaptation. But online adaptation faces exploration-exploitation tradeoffs, distribution shift, delayed rewards, and safety constraints that offline learning never encounters.
The lesson: know which type of problem you’re solving. Don’t apply static optimization techniques to dynamic adaptation domains, and don’t require continuous adaptation when static optimization suffices. Match the approach to the problem.
Rethinking AI for Systems: A Methodological Framework
Given these challenges, what would it mean to design AI approaches specifically for system design, rather than adapting techniques from other domains?
Principle 1: Embrace Hybrid Approaches
The dichotomy between “analytical models” and “learned policies” is false. Real progress requires composition:
For co-design reasoning: Use analytical models to encode first principles (physics, queuing theory, fundamental tradeoffs) and ML to capture second-order interactions that resist analytical modeling. Concorde demonstrates this beautifully—5 orders of magnitude speedup by combining what we know explicitly (analytical) with what we learn empirically (ML).
For predictive reasoning: Use learned prefetchers for pattern recognition but traditional heuristics for timing control. Semantic models (text-to-text regression) for high-level understanding but numerical models for precise predictions. Don’t replace 50 years of systems knowledge—augment it.
For adaptive reasoning: Use RL for exploration but provide structure through action spaces, reward shaping, and safety constraints informed by domain knowledge. Learning to rank combines learned preferences with explicit constraints about fairness and safety.
The principle is simple: use structure when you have it, learn when you don’t. The art is knowing which is which.
Principle 2: Design for Continuous Feedback
Instead of speeding up existing flows, redesign flows for continuous feedback:
Verification-in-the-loop: Instead of verifying after you’re done, verify continuously as you design. Generate assertions incrementally. Check properties as you write RTL. Catch bugs immediately instead of weeks later.
Physical-awareness during architecture: Instead of discovering physical problems months later, predict physical realizability during architectural design. Fast models that give directional guidance: “This communication pattern will struggle with wire delay. Consider co-locating these units.”
Co-design exploration: Instead of sequential design (architecture → implementation → discovery of problems), explore the coupled space. “If I make these architectural choices, what physical designs become viable? If I target these physical constraints, what architectures work?”
This requires new tools designed for AI-assisted workflows, not AI retrofitted into human workflows. The tooling determines what’s possible.
Principle 3: Separate Concerns by Role, Not by Tool
The question isn’t “should AI design chips autonomously?” It’s “what role does AI play in the design process?”
[Figure: Complementary roles. AI as Explorer (search vast spaces, find patterns) and Validator (verify properties, find counterexamples); humans as Guide (philosophical frameworks, tradeoff decisions) and Arbiter (risk assessment, final commitments). AI presents options and reports findings; humans guide the search and set the acceptance criteria.]
AI as explorer: Systematically search vast design spaces humans can’t comprehend. Find interesting regions. Identify patterns. Present options.
Humans as guides: Provide philosophical frameworks, make tradeoff decisions, supply context, validate results, handle edge cases.
AI as validator: Verify properties, find counterexamples, check consistency, identify coverage gaps. Amplify formal methods.
Humans as arbiters: Decide when validation is sufficient, determine acceptable risk, make final commitments.
The collaboration is multiplicative, not additive. AI explores spaces humans couldn’t search. Humans guide exploration toward meaningful regions. Together they achieve what neither could alone.
But this requires new interfaces. Current tools assume either “human does everything” or “AI does everything.” We need interfaces for fluid collaboration where control shifts dynamically based on the task.
Principle 4: Match Approach to Problem Structure
Different system design problems need different AI approaches:
For problems with circular dependencies (mapping, co-design): Use methods that reason about the coupled space. Gradient-based optimization if you can make the problem differentiable. Game-theoretic approaches for multi-agent design. Meta-learning to discover optimization strategies.
For problems with emergent behavior (performance, distributed systems): Use simulation + surrogate models. Fast approximate evaluation to guide search. Bayesian optimization with uncertainty quantification. Don’t trust predictions—validate on full simulation for promising candidates.
For problems with incomplete information (architecture for future workloads): Design for robustness over optimality. Multi-task learning across diverse scenarios. Sensitivity analysis to understand what breaks designs. Graceful degradation strategies.
For problems with temporal dynamics (scheduling, adaptation): Use online learning with safe exploration. Counterfactual reasoning to disentangle cause and effect. Meta-learning to adapt quickly to distribution shift.
For irrevocable decisions (chip design): Use conservative design margins. Formal verification where possible. Extensive validation before committing. Worst-case analysis, not just expected value.
Don’t apply the same ML hammer to every problem. Understand the problem structure and choose approaches that match it.
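As one concrete pattern from the "emergent behavior" bucket above, here is a sketch of surrogate-guided exploration on a toy objective: a bootstrap ensemble of cheap models supplies both a prediction and a disagreement-based uncertainty signal, and only the most promising candidates are sent to the stand-in "slow simulator." Everything here is synthetic; a production version would swap in a real simulator and a stronger surrogate.

```python
# Surrogate-guided exploration: cheap ensemble proposes, slow simulator validates.
import numpy as np

rng = np.random.default_rng(6)

def slow_simulator(x):            # stand-in for hours of simulation or place-and-route
    return -(x[0] - 0.3) ** 2 - (x[1] - 0.7) ** 2 + 0.05 * np.sin(10 * x[0])

X = rng.uniform(0, 1, (8, 2))     # a handful of completed "slow" runs
y = np.array([slow_simulator(x) for x in X])

def features(Z):
    return np.c_[Z, Z ** 2, np.ones(len(Z))]

def fit_ensemble(X, y, k=20):
    """Bootstrap an ensemble of cheap models; disagreement is the uncertainty signal."""
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))
        coef, *_ = np.linalg.lstsq(features(X[idx]), y[idx], rcond=None)
        models.append(coef)
    return lambda Z: np.stack([features(Z) @ c for c in models])

for rnd in range(5):
    cand = rng.uniform(0, 1, (500, 2))                 # cheap-to-generate candidates
    preds = fit_ensemble(X, y)(cand)
    score = preds.mean(axis=0) + preds.std(axis=0)     # optimistic: mean plus uncertainty
    best = cand[np.argsort(score)[-2:]]                # validate only the top two
    X = np.vstack([X, best])
    y = np.append(y, [slow_simulator(b) for b in best])
    print(f"round {rnd}: best validated result so far = {y.max():.4f}")
```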
Principle 5: Build on Decades of Systems Knowledge
The biggest mistake would be assuming AI makes 50 years of systems research irrelevant. The wisdom accumulated by the systems community—about abstraction, modularity, end-to-end principles, measurement methodology, evaluation—remains valid.
What changes: The bottlenecks shift. Design space exploration that was intractable becomes feasible. Optimizations too complex for humans become accessible. Cross-layer reasoning that violated abstraction boundaries becomes possible.
What doesn’t change: The fundamental principles about what makes systems robust, maintainable, debuggable, evolvable. The need for clear specifications. The importance of evaluation methodology. The reality that “it works on my machine” proves nothing.
AI should accelerate progress, not throw away decades of accumulated understanding. The LLM that reinvents TCP from first principles because it never read the original papers will make all the mistakes we made in the 1980s.
Principle 6: Acknowledge What We Don’t Know
Humility matters. There are questions we can’t answer yet:
Can AI learn tacit architectural knowledge from examples? Or does some knowledge fundamentally resist learning—requiring human intuition developed through decades of experience?
When does AI generalize versus memorize? If an LLM generates a correct FIFO because it saw thousands of FIFOs in training, is that generalization or memorization? How do we tell?
Where should human judgment remain essential? Not just because AI can’t do it yet, but because human oversight is inherently necessary? The philosophical tradeoffs, the risk decisions, the accountability?
How do we validate AI-assisted designs without circular reasoning? If AI helps design and verify, are we systematically introducing correlated errors?
These aren’t just open research questions. They’re foundational questions about the nature of design, the limits of learning, and the role of human expertise.
What We Heard from Industry
One of the privileges of teaching this course was bringing in practitioners who live these challenges daily. Nine industry experts joined us across the semester, and their perspectives consistently reinforced one theme: the gap between research papers and production reality is vast.
Amir Yazdanbakhsh from Google DeepMind told us that ECO’s infrastructure for deploying optimizations safely was more complex than the optimization algorithms themselves. Every AI-generated change required human expert review. The lesson: production isn’t about clever algorithms. It’s about trust, validation, and integration with existing workflows.
Martin Maas from Google DeepMind emphasized that distributed systems require continuous adaptation, not one-shot optimization. The environment changes. Workloads shift. What worked yesterday might fail tomorrow. AI for systems can’t just find good solutions—it needs to keep finding them as conditions evolve.
Suvinay Subramanian from Google’s TPU team articulated something we rarely see in papers: hardware design involves “subjective bets” about where to locate complexity. Google’s philosophy—simpler hardware, more complex software—isn’t mathematically derived. It’s a bet about organizational capabilities and future flexibility. How does an AI agent learn these philosophical frameworks?
Milad Hashemi explained why learned prefetchers that look great in papers struggle in production. The patterns are too sparse, too noisy, and too time-sensitive. When you need predictions in nanoseconds and accuracy in the 99th percentile, the gap between “works in simulation” and “works in silicon” becomes a chasm.
Esha Choukse from Microsoft Research grounded our discussion of LLM serving in the reality of operating at scale. The KV-cache grows with every token. Memory becomes the bottleneck. Request patterns are unpredictable. The neat abstractions in papers dissolve when you’re running inference for millions of users.
Mark Ren from NVIDIA painted a picture of an EDA industry at an inflection point. Traditional tools encode decades of expertise, but they’re struggling with modern design complexity. The opportunity for AI isn’t replacing that expertise—it’s augmenting it, using differentiable optimization where analytical models exist and RL where the search space is too large for hand-crafted heuristics.
Kartik Hegde from ChipStack brought the verification perspective: AI can lower barriers to formal methods by generating assertions from natural language. But generated assertions must still be validated, and validation requires expertise. AI augments verification engineers; it doesn’t replace them.
The consistent message across all these conversations: AI for systems is about collaboration, not automation. The practitioners who are actually deploying these techniques aren’t trying to remove humans from the loop. They’re trying to make humans more effective by handling the tedious parts, exploring spaces humans can’t search, and providing decision support rather than decisions.
A Note on Model Size: Frontier vs. Specialized
This came up repeatedly throughout the course, and I want to address it directly because I see students make this mistake often: not everything needs a frontier model.
There’s a natural tendency to assume bigger is better. GPT-5 is more capable than GPT-4, which was more capable than GPT-3. The scaling laws suggest more parameters and more data lead to better performance. So why wouldn’t you always use the biggest model available?
Because systems have constraints that benchmarks don’t capture.
Latency matters. A prefetcher needs predictions in nanoseconds. A scheduler needs decisions in microseconds. Frontier model inference takes milliseconds at best. That’s a gap of three to six orders of magnitude. You cannot close it by making the model faster—you need a fundamentally smaller model that runs on specialized hardware close to where decisions happen.
Cost matters at scale. Running GPT-4 for every optimization decision in a datacenter would cost more than the savings from the optimizations. A 70B model costs roughly an order of magnitude more per inference than a 7B model. When you’re making millions of decisions per day, that multiplier is the difference between viable and absurd.
You often don’t need general knowledge. Do you really need a model that understands Shakespeare, quantum physics, and medieval history to optimize memory access patterns? The knowledge required is narrow and deep, not broad. A small model trained specifically on memory system traces might outperform a frontier model that has to retrieve relevant knowledge from a vast parameter space.
Auditability matters for safety-critical applications. Frontier models are black boxes—you can’t fully inspect why they made a decision. For chip design, where bugs cost millions, you might prefer a simpler model whose behavior you can characterize and bound, even if its average performance is lower.
Control matters. Frontier models are controlled by a handful of companies. They can change, deprecate, or restrict access. For production systems with long lifetimes, depending on an external API is a risk. A specialized model you train and own gives you control over its lifecycle.
The practical path forward isn’t choosing one or the other—it’s knowing when to use which:
Use frontier models for tasks where breadth matters: understanding natural language specifications, debugging across abstraction layers, handling novel scenarios you haven’t trained for, interactive assistance where latency is human-scale (seconds, not microseconds).
Use specialized models for tasks where efficiency matters: real-time decisions, production deployment at scale, safety-critical applications requiring auditability, domains with abundant training data and narrow scope.
Use distillation to get the best of both: have frontier models generate reasoning traces, explanations, and training data, then distill that knowledge into efficient specialized models. You get frontier-level reasoning compressed into deployable form.
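A stripped-down sketch of the distillation pattern, with a stand-in teacher function where a frontier model (or a slow simulator) would sit: label a corpus of configurations offline, then fit a small student that is cheap enough to call in the serving path. The configuration space, features, and teacher below are all invented for illustration.

```python
# Distillation sketch: expensive teacher labels data offline, cheap student serves.
import numpy as np

rng = np.random.default_rng(7)

def teacher(config):              # pretend: seconds per call, high quality
    batch, seq = config
    return np.log(batch) * 0.7 + np.sqrt(seq) * 0.05 + 0.1 * np.sin(batch / 7)

# Offline: label a few thousand configurations with the teacher, once, in batch.
configs = np.c_[rng.integers(1, 257, 4000), rng.integers(128, 8193, 4000)]
labels = np.array([teacher(c) for c in configs])

# Student: tiny model on hand-picked features -- microseconds per call.
feats = np.c_[np.log(configs[:, 0]), np.sqrt(configs[:, 1]), np.ones(len(configs))]
coef, *_ = np.linalg.lstsq(feats, labels, rcond=None)

def student(config):
    batch, seq = config
    return np.array([np.log(batch), np.sqrt(seq), 1.0]) @ coef

test = np.c_[rng.integers(1, 257, 500), rng.integers(128, 8193, 500)]
err = np.mean([abs(student(c) - teacher(c)) for c in test])
print(f"student mean abs error vs teacher: {err:.3f}")
```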
The students who will have the most impact aren’t those who can prompt GPT-4 effectively. I believe that skill will commoditize quickly. It’s those who understand when a 7B model trained on the right data will outperform a 700B model trained on everything, and know how to build that specialized model.
Where Does the Field Go From Here?
If I had to bet on what will matter most over the next five years, here’s how I’d think about it.
The immediate bottleneck is evaluation. We don’t yet have good ways to measure whether AI is actually getting better at system design. Current benchmarks reward pattern matching and memorization. We need evaluations that test genuine understanding—can the system design something novel, not just recombine things it’s seen? Week 11 and Week 13 kept returning to this theme: what you measure shapes what you build. Get the benchmarks wrong and the whole field optimizes for the wrong thing.
The feedback loop crisis won’t solve itself. Physical design takes months. Verification takes weeks. Architecture decisions get made without knowing if they’ll physically realize. The obvious answer is faster surrogate models—predict timing without full place-and-route, estimate performance without cycle-accurate simulation. But the harder question is knowing when to trust those predictions. A model that’s 90% accurate sounds good until you realize the 10% it gets wrong might be exactly the designs you care about.
Human-AI collaboration needs new interfaces. Current tools assume binary control: either the human does everything or the AI does everything. Real design is more fluid than that. Sometimes you want AI to explore freely. Sometimes you want to take the wheel. Sometimes you want to jointly reason through a tradeoff. We don’t have good tools for that spectrum yet. Building them requires understanding not just the AI capabilities but the cognitive needs of designers.
Looking further out—and this is where PhD students should pay attention—the problems that will matter in 4-5 years aren’t the ones getting the most attention today.
Can AI learn to reason, not just match patterns? We identified three types of reasoning that define architectural expertise: co-design reasoning (handling circular dependencies), predictive reasoning (making bets about uncertain futures), and adaptive reasoning (continuous adjustment as conditions change). Current ML has fragments of these capabilities but doesn’t integrate them. Whether AI can develop genuine reasoning—or whether these capabilities require something fundamentally different—is an open question. It’s also, I think, the most important one.
Verification will become the ultimate constraint. As AI generates more of the design, who verifies it? Week 13 showed us the state space explosion—a simple processor has more states than atoms in the universe. You can’t test your way to correctness. Formal methods help but don’t scale. If AI helps both design and verify, you risk circular validation where correlated errors pass undetected. Someone needs to solve this. Maybe that someone is you.
The biggest opportunity might be closing the loop from deployment back to design. Right now, we design systems, deploy them, and that’s it. The silicon never learns from how it’s actually used. What if it could? Not just adapting runtime behavior—Week 10 explored that—but feeding field experience back into the next generation of designs. Systems that get better over their lifetime. We’re not there yet, but it feels like the natural endpoint of everything we’ve discussed.
Finally, there’s the question of access. Right now, only a handful of companies have the resources, data, and infrastructure to seriously pursue AI for systems. How do we democratize this? Open-source tools, educational platforms, frameworks that don’t require Google-scale compute. If Architecture 2.0 remains the province of a few large players, we’ll have failed at something important.
Closing Thoughts: What Have We Actually Learned?
Thirteen weeks ago, we asked: Can AI agents become co-designers of computer systems?
The answer is yes—but with critical qualifications that reshape the question itself.
AI can:
- Explore design spaces too vast for human comprehension
- Find optimizations humans would never consider
- Accelerate iteration by closing feedback loops
- Lower barriers to formal verification and sophisticated optimization
- Amplify human expertise by handling routine tasks
AI cannot (yet):
- Replace the philosophical judgment about where to locate complexity
- Make tradeoff decisions when objectives conflict and priorities aren’t quantifiable
- Develop the intuitive understanding that comes from decades of experience
- Verify its own work without risking circular validation
- Design for radically uncertain futures where patterns from the past don’t apply
The frontier isn’t autonomous AI designing systems alone. It’s collaboration: AI explores while humans guide, AI optimizes locally while humans reason globally, AI handles systematic search while humans apply judgment. AI generates; humans validate and decide.
This isn’t a disappointing compromise. It’s recognition that system design is fundamentally about judgment under uncertainty, tradeoffs among conflicting objectives, and bets about unknowable futures—domains where human cognition remains essential.
But the partnership is transformative. An architect working with AI assistance can explore design spaces that were previously intractable, validate ideas in hours that previously took months, and reason across abstraction boundaries that previously required separate expert teams.
The course revealed five meta-insights:
- Circular dependencies are fundamental: Many real problems don’t decompose into independent subproblems. AI must reason about coupled spaces.
- Tacit knowledge resists codification: The most valuable expertise exists in experienced architects’ intuitions. Learning from experience might work where explicit encoding fails.
- Trust requires validation at every layer: The irrevocability constraint means we can’t afford to be wrong. Trust must be earned through diversity, transparency, and empirical validation.
- Co-design across boundaries unlocks capability: The artificial abstractions we created for human comprehension limit what’s achievable. AI can reason across boundaries.
- The right abstraction for the problem matters: Static optimization, dynamic adaptation, and co-design under uncertainty need fundamentally different AI approaches.
Where does this point? Toward design tools built for AI collaboration from the start, not retrofitted. Toward abstraction boundaries that become porous, letting us optimize across layers we used to treat as separate. Toward feedback loops that operate at the timescale of decisions. Toward validation that’s continuous rather than staged. And toward expertise that’s amplified and democratized rather than replaced or gatekept.
What We Didn’t Cover
Intellectual honesty requires acknowledging scope. This course focused on AI for performance-oriented system design—making things faster, more efficient, more capable. We deliberately didn’t cover:
- AI for security: Hardware security, side-channel analysis, secure enclave design
- AI for reliability: Fault tolerance, error correction, graceful degradation
- AI for sustainability: Carbon-aware design, lifecycle optimization, e-waste reduction
- AI for other system layers: Networking, storage systems, databases, operating systems
- The societal implications: Job displacement, concentration of capability, democratization vs. gatekeeping
These aren’t less important—they’re different courses. The frameworks we developed (co-design reasoning, trust through diversity, hybrid analytical-ML approaches) likely transfer, but the specific challenges differ.
The Path Ahead
Architecture 2.0 isn’t a destination. It’s a direction of travel.
The field will evolve iteratively:
- Phase 1: AI assists with isolated tasks (generate RTL, optimize code, place macros)
- Phase 2: AI reasons across traditional boundaries (co-design workload and hardware, optimize from code to silicon)
- Phase 3: AI develops the types of reasoning that define architectural expertise (co-design, predictive, adaptive)
We’re early in Phase 1, with glimpses of Phase 2. Phase 3 remains aspirational.
But the trajectory is clear. The techniques that work aren’t those that replace human expertise. They’re those that amplify it. The progress that matters isn’t autonomous AI design. It’s human-AI collaboration that achieves what neither could alone.
The questions that will determine success aren’t just technical. They’re about:
- What workflows make collaboration natural?
- How do we maintain trust as systems become more complex?
- Where should human judgment remain essential?
- How do we train the next generation to work with AI co-designers?
This course was preparation, not conclusion. You’ve seen the landscape. You understand the challenges. You know where current capabilities fall short.
What You Can Do This Week
Don’t wait for the field to evolve. Start now:
- Pick one paper from this course and implement something. Not just read—build. Clone a repo, run the code, modify it, break it, understand it deeply.
- Find your community. Join the MLSys, ISCA, or MICRO communities. Follow researchers on social media. Attend reading groups. The field moves fast; staying connected matters.
- Identify your unique angle. What domain expertise do you have that others don’t? AI for systems needs people who understand specific application domains, not just ML generalists.
- Start a research notebook. Document your ideas, experiments, and failures. The PhD students who succeed aren’t smarter—they’re more systematic about learning from what doesn’t work.
- Talk to practitioners. The gap between academic papers and production reality is vast. Find engineers at companies doing this work. Understand what problems they actually face.
Now the work begins.
The systems you design over the next decade will determine whether AI truly transforms computer architecture or remains a marginal tool for narrow applications. Whether the expertise accumulated over 50 years of systems research amplifies or atrophies. Whether the next generation of architects works with AI as partners or struggles against AI as competitors.
The choice is yours. The tools are emerging. The challenges are clear.
Architecture 2.0 isn’t about AI replacing architects. It’s about architects who understand AI becoming more capable than any individual could be alone.
Now let’s go build it. 🚀
A personal note to the class:
Thank you for being part of this journey. I know this class was a little different from what you might have expected. We weren’t just reviewing established papers and techniques that have stood the test of time. We were reading work that’s still evolving, sometimes just months old, trying to understand not just where the field is but where it needs to go.
That’s uncomfortable. There aren’t always clean answers. The papers sometimes contradict each other. The techniques that look promising today might be obsolete by the time you graduate. But that discomfort is the point. You’re not here to memorize solutions. You’re here to develop the judgment to navigate uncertainty.
What made this class special was your willingness to engage with that uncertainty. To ask hard questions. To reconsider things that didn’t make sense. To bring your own perspectives from different domains. The best insights often came from those discussions, so thank you for that 🙏
The future of computer systems will be shaped by people who can think across boundaries—who understand both algorithms and silicon, both ML techniques and systems constraints, both what’s possible today and what the field needs tomorrow. That’s what we were practicing here.
I hope you leave with more questions than answers. That’s how it should be.
—Vijay
This synthesis reflects on the thirteen-week journey through AI for software, architecture, and chip design. For detailed technical content, see the individual week posts linked throughout. For course materials and future updates, see the course schedule.
A note on timing: This was written in Fall 2024. The field of AI for systems is evolving rapidly—some techniques discussed here may already be outdated, while new capabilities may have emerged that weren’t possible when this was written. The conceptual frameworks (types of reasoning, trust through diversity, hybrid approaches) should age better than specific technical approaches. If you’re reading this in the future, use the frameworks to evaluate whatever new techniques have emerged.