Week 13: The Trust Problem - Can We Verify What We Can No Longer Fully Understand?
Last week, we saw how physical design constraints have become primary architectural limitations. At modern process nodes, you can’t design chips without understanding how they will be physically realized. But there’s a final question that must be answered before committing millions of dollars to fabrication:
Does the chip actually work correctly?
This isn’t about performance, power, or area. This is about functional correctness—does the hardware implement the specification? When you send a multiply instruction, does it actually multiply? When you request data from memory, do you get the right data back? Will the chip do what you designed it to do, every single time, across all possible inputs and states?
This is the domain of verification. And it might be the hardest problem in chip design.
The Irrevocability Constraint Returns
In Week 11, we introduced the irrevocability constraint: once you tape out a chip, you can’t patch the hardware. Software engineers fix bugs with updates. Hardware engineers recall products and lose hundreds of millions of dollars.
The Pentium FDIV bug in 1994 cost Intel $475 million—a floating-point divide instruction occasionally returned incorrect results. The bug existed in silicon. No software patch could fix it. Intel had to replace millions of processors.
The Meltdown and Spectre vulnerabilities affected billions of processors worldwide (see Intel’s security advisory). These weren’t implementation mistakes—they were architectural design flaws in speculative execution. The hardware worked exactly as designed. The design was the problem. Software mitigations helped, but the fundamental vulnerability remains in silicon.
AMD’s Bulldozer false advertising case (settled for $12.5 million) hinged on whether their chip met the architectural specification. The hardware worked, but did it work as advertised? Courts disagreed with AMD’s interpretation of what constitutes a “core.”
These examples share a common thread: verification failures. Either the implementation didn’t match the specification (Pentium), the specification didn’t account for security properties (Meltdown/Spectre), or the specification was ambiguous (Bulldozer).
Verification is the process of proving that your design is correct. Not testing that it seems to work. Not benchmarking that it performs well. Proving that it implements the specification correctly across all possible behaviors.
Why Verification Is Different from Testing
Software engineers are familiar with testing: write test cases, run them, check outputs. If tests pass, ship the code. If bugs appear in production, patch them.
Hardware verification cannot work this way.
The State Space Explosion
Consider a simple 32-bit processor with:
- 32 general-purpose registers (32 bits each)
- 1KB of cache (8,192 bits)
- A dozen pipeline stages with various state bits
- Branch predictor state
- Memory management unit state
How many possible states does this system have? Each bit can be 0 or 1, so roughly 2^(32×32 + 8192 + pipeline state + predictor state + MMU state). Let’s conservatively estimate 2^10,000 possible states.
To put this in perspective, the number of atoms in the observable universe is approximately 10^80 ≈ 2^266. Your simple processor has more possible states than there are atoms in roughly 10^2,930 universes.
You cannot test all states. You cannot even test a meaningful fraction of them. If you tested one billion states per second, continuously for the lifetime of the universe (13.8 billion years), you would test approximately 2^88 states. That’s 0.000…% (with thousands of zeros) of the state space.
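For the curious, here is the back-of-the-envelope arithmetic behind those figures, a rough sketch using the conservative 2^10,000 estimate above:

```latex
% Back-of-the-envelope arithmetic for the 2^{10,000}-state estimate.
\begin{align*}
2^{10{,}000} &= 10^{10{,}000\,\log_{10}2} \approx 10^{3{,}010}
  && \text{(total states)}\\
\frac{10^{3{,}010}}{10^{80}} &\approx 10^{2{,}930}
  && \text{(universes' worth of atoms)}\\
10^{9}\,\tfrac{\text{states}}{\text{s}} \times 13.8\times10^{9}\,\text{yr} \times 3.15\times10^{7}\,\tfrac{\text{s}}{\text{yr}}
  &\approx 4\times10^{26} \approx 2^{88}
  && \text{(states testable in the age of the universe)}\\
\frac{2^{88}}{2^{10{,}000}} &= 2^{-9{,}912} \approx 10^{-2{,}984}
  && \text{(fraction of the space covered)}
\end{align*}
```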
This is the state space explosion: the number of possible states grows exponentially with system complexity. Testing can only sample an infinitesimally small portion of the space.
The Coverage Problem
“Okay,” you might think, “but we don’t need to test every state. We can focus on important states and likely execution paths.”
This is the approach of coverage-directed verification: define metrics for what constitutes “good” test coverage (code coverage, branch coverage, state coverage, assertion coverage) and generate tests to maximize these metrics.
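To make those metrics slightly more concrete, here is a minimal sketch of a SystemVerilog functional-coverage model. The module, the signal names (opcode, branch_taken), and the bins are all illustrative assumptions, not something prescribed by this week’s papers:

```systemverilog
// A minimal functional-coverage sketch. Signal names are illustrative;
// a real coverage model has many more points and carefully chosen bins.
module coverage_example (
  input logic       clk,
  input logic [5:0] opcode,
  input logic       branch_taken
);
  covergroup instr_cg @(posedge clk);
    coverpoint opcode;                 // which opcode values have tests exercised?
    coverpoint branch_taken {
      bins taken     = {1};
      bins not_taken = {0};
    }
    cross opcode, branch_taken;        // opcode/outcome combinations, not just singles
  endgroup

  instr_cg cg = new();                 // instantiate so the simulator samples it
endmodule
```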
The problem: How do you know you covered the important cases?
The Pentium FDIV bug occurred for very specific input combinations—certain dividend and divisor values. These weren’t obviously “important” cases. They were edge cases in a specific lookup table optimization. Traditional coverage metrics didn’t catch them because no one knew to look for them.
The Meltdown and Spectre vulnerabilities exploited timing side-channels in speculative execution. Standard functional coverage doesn’t even consider timing-based information leakage across privilege boundaries. You can’t cover what you didn’t think to measure.
Coverage metrics can only find bugs you anticipated. The dangerous bugs are the ones you didn’t anticipate—the emergent behaviors from complex interactions between components, the subtle race conditions, the security vulnerabilities arising from architectural features working “as intended.”
The Oracle Problem
Even if you could test every state, you need an oracle—a way to determine if the output is correct.
For simple combinational logic (like an adder), the oracle is straightforward: mathematically compute the correct answer and compare. But for complex sequential systems with thousands of interacting components, what’s the “correct” behavior?
“The chip should match the specification,” you say. But specifications are written in natural language or abstract models. They’re inherently ambiguous. Different engineers interpret them differently. When hardware behavior differs from your expectation, is it a hardware bug or a specification ambiguity?
The oracle problem becomes acute when we introduce AI into the design process. If an LLM generates RTL from a specification, and that RTL behaves unexpectedly, which is wrong: the generated RTL, your interpretation of the specification, or the specification itself?
Testing Proves Presence of Bugs, Not Absence
As Edsger W. Dijkstra famously wrote in 1970: “Program testing can be used to show the presence of bugs, but never to show their absence.”
This is even more true for hardware. You can test for weeks and find no bugs. That doesn’t mean the chip is correct. It means you haven’t found bugs yet. You can’t prove correctness through testing.
Formal verification offers a different approach: mathematical proof that the design satisfies the specification, for all possible inputs and states. Not testing a sample—proving the entire space.
But formal verification has its own challenges.
The Formal Verification Challenge
Formal verification uses mathematical methods to prove properties about systems. Instead of testing specific cases, you write assertions (properties the system should satisfy) and use automated tools to prove or disprove them.
For example, instead of testing that 2 + 2 = 4 and 3 + 5 = 8 and hoping that covers addition, you prove mathematically that for all possible inputs a and b, your adder produces a + b.
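As a sketch of what that proof obligation looks like in practice (the module and signal names here are illustrative, not from the paper): the assertion states the property once, and a formal tool discharges it over the entire input space rather than over a handful of test vectors.

```systemverilog
// Illustrative 8-bit ripple-carry adder with a formal correctness property.
// A formal tool proves the assertion for every possible a and b; simulation
// would only ever check the specific values a testbench happens to drive.
module adder_checked (
  input  logic       clk,   // used only to sample the concurrent assertion
  input  logic [7:0] a, b,
  output logic [8:0] sum
);
  logic [8:0] carry;
  assign carry[0] = 1'b0;

  genvar i;
  for (i = 0; i < 8; i++) begin : g_fa   // ripple-carry implementation under check
    assign sum[i]     = a[i] ^ b[i] ^ carry[i];
    assign carry[i+1] = (a[i] & b[i]) | (carry[i] & (a[i] ^ b[i]));
  end
  assign sum[8] = carry[8];

  // "For all a and b, sum equals a + b" -- one property, the whole input space.
  assert property (@(posedge clk) sum == a + b);
endmodule
```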
This sounds perfect. Why doesn’t everyone just do formal verification?
The Scalability Problem
Formal verification tools must explore the complete state space—that’s the point. But remember the state space explosion? The same problem that makes testing insufficient makes formal verification computationally intractable.
Modern model checkers use sophisticated techniques (symbolic execution, abstraction, bounded model checking) to make verification tractable. But they still face fundamental complexity limits. As system size grows, verification time grows exponentially.
In practice, formal verification works well for:
- Small, critical modules (ALUs, caches, coherence protocols)
- Specific properties (this signal never violates this timing constraint)
- Bounded checks (the system is correct for the first N cycles)
But verifying an entire modern processor—billions of transistors, complex interaction between cores, memory hierarchy, I/O, coherence protocols—exhaustively? That remains computationally infeasible.
The Specification Problem
Formal verification proves that an implementation matches a specification. But who verifies the specification itself?
You’re replacing one problem (does the RTL work correctly?) with another (does the specification correctly capture what we want?). The specification is typically written in a formal language (like SystemVerilog Assertions or PSL), which means it can also have bugs.
Consider verifying a cache coherence protocol. Your specification says: “If processor A writes to address X, and processor B subsequently reads address X, processor B must see A’s write.” Sounds clear. But what if processors A and B are in different clock domains? What if there’s a DMA transfer happening simultaneously? What if address X is in a region marked as non-coherent? The specification must account for all these subtleties—and writing complete, unambiguous specifications is itself a massive intellectual challenge.
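For flavor, here is roughly what even the simple, single-clock reading of that sentence looks like as an SVA property. Every signal name below (wr_a, wr_addr_a, rd_b, and so on) is a hypothetical placeholder, and the sketch quietly assumes no intervening write to X, zero-latency read data, and a single clock domain. That is exactly the point: the English sentence hides all of that.

```systemverilog
// Toy, single-clock sketch of "B's read of X after A's write of X returns A's data".
// Hypothetical signal names; a real coherence proof reasons about protocol state,
// multiple clock domains, DMA masters, and non-coherent regions, none of which
// appear here. It also silently assumes no intervening write to X.
property read_after_write_sees_data(logic [31:0] X);
  logic [31:0] v;
  @(posedge clk) disable iff (rst)
  ((wr_a && wr_addr_a == X), v = wr_data_a)   // A writes X; capture the data
    ##1 (rd_b && rd_addr_b == X) [->1]        // ...then B's first subsequent read of X
    |-> (rd_data_b == v);                     // ...must return A's data
endproperty

// Check one concrete address; a full proof quantifies over all addresses.
assert property (read_after_write_sees_data(32'h0000_1000));
```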
In practice, verification doesn’t eliminate the need for human expertise. It shifts where you need expertise: from writing correct implementations to writing correct specifications.
The Abstraction Gap
Formal verification typically works at the RTL level—behavioral descriptions in SystemVerilog or VHDL. But recall Week 12’s constraint inversion: physical effects dominate modern chip behavior.
You can formally verify that your RTL is correct. But does the synthesized netlist behave the same way? Does the physical layout introduce timing violations or crosstalk that breaks verified properties? Do process variations in manufacturing cause some chips to violate assertions that held in simulation?
There’s a gap between the abstraction level where verification happens and the physical reality where chips run. Verification at RTL doesn’t guarantee correctness in silicon.
Equivalence checking bridges part of this gap: proving that synthesis transformations preserve functionality. But it can’t account for physical effects that emerge only in layout or manufacturing.
The Human Expertise Bottleneck
Even with automated tools, formal verification requires deep human expertise:
- Assertion engineering: Writing the properties to verify. What should you prove? How do you express complex temporal properties? Which corner cases must be covered?
- Debugging counterexamples: When verification fails, tools produce counterexamples (traces showing how assertions are violated). Understanding these traces requires intimate knowledge of both the design and the formal semantics.
- Abstraction design: Simplifying designs to make verification tractable without losing essential behaviors. This is as much art as science.
- Tool configuration: Choosing verification strategies, setting bounds, managing memory usage, tuning heuristics. The tools are powerful but complex.
These skills are rare and expensive. Not every company can afford teams of formal verification experts. Not every project has time for the iterative refinement verification requires.
The Emergence of LLM-Assisted Verification
This is where this week’s papers enter. If formal verification requires scarce human expertise, and if LLMs have shown promise at code-related tasks, can LLMs help with verification?
The answer is nuanced. LLMs won’t replace verification engineers any time soon. But they can potentially amplify human expertise and lower the barrier to formal verification.
Paper 1: Using LLMs to Facilitate Formal Verification of RTL
The first paper explores using LLMs to help with the assertion engineering problem: given RTL code and a natural language description of desired properties, can an LLM generate formal verification assertions?
Assertion engineering is the bottleneck. Hardware designers know what their designs should do. You can say in plain English: “The FIFO should never overflow” or “Requests must be processed in order” or “This signal can’t glitch during reset.”
But translating those intentions into SystemVerilog Assertions requires specialized knowledge. Here’s what that translation looks like:
Natural language: “The FIFO should never overflow”
SystemVerilog Assertion:
1 property fifo_no_overflow;
2 @(posedge clk) disable iff (rst)
3 full |-> !wr_en;
4 endproperty
5 assert property (fifo_no_overflow);
Let’s break this down line by line:
- Line 1: We declare a property with a name. Properties are reusable verification checks.
- Line 2: @(posedge clk) means “check this on every rising clock edge.” The disable iff (rst) part says “don’t check this property when reset is active”—you need to know to add this or your verification will fail during reset.
- Line 3: This is the actual check. full is the condition, and the |-> operator means “implies”—if the left side is true, the right side must also be true in the same cycle. So whenever the FIFO is full, wr_en must be low: attempting a write into a full FIFO is exactly what “overflow” means.
- Line 4: Close the property declaration.
- Line 5: Actually assert this property. If it ever fails, verification stops and reports an error.
Natural language: “Every request must eventually get a response”
SystemVerilog Assertion:
1 property req_resp;
2 @(posedge clk) disable iff (rst)
3 req |-> ##[1:16] resp;
4 endproperty
5 assert property (req_resp);
Breaking down the key parts:
- Line 3: req |-> ##[1:16] resp means “if there’s a request, then within 1 to 16 clock cycles, there must be a response.” The ##[1:16] is a temporal operator. ##1 means “exactly 1 cycle later.” ##[1:16] means “sometime between 1 and 16 cycles later.” This syntax is completely non-obvious.
You also need to know when to use |-> (overlapping implication, checked in the same cycle) versus |=> (non-overlapping implication, checked in the next cycle). You need to understand clocking blocks, sequence operators, and the difference between immediate and concurrent assertions.
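A tiny illustration of that distinction, using made-up gnt and busy signals inside an assumed module with clk and rst:

```systemverilog
// Overlapping implication: whenever gnt is high, busy must be high in the SAME cycle.
assert property (@(posedge clk) disable iff (rst) gnt |-> busy);

// Non-overlapping implication: whenever gnt is high, busy must be high in the NEXT cycle.
// "a |=> b" behaves like "a |-> ##1 b".
assert property (@(posedge clk) disable iff (rst) gnt |=> busy);
```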
What if an LLM could bridge this gap?
The promise is significant. Hardware designers without deep verification expertise could write properties in natural language and get formal assertions. If generating assertions becomes easier, engineers might write more of them, covering more corner cases. Instead of manually debugging assertion syntax, designers could iterate quickly through natural language refinements.
But there are real challenges:
Challenge 1: Correctness of Generated Assertions
If an LLM generates an assertion, how do you know it correctly captures your intent? This is the oracle problem reappearing at a higher level.
You wanted to verify property P. The LLM generated assertion A. You need to verify that A actually checks P. But if you could do that verification, you probably could have written A yourself. You’ve shifted the verification problem, not solved it.
The paper addresses this through human review, counterexample-guided refinement, and cross-validation. Experts review generated assertions for correctness. When verification fails, you examine the counterexamples, refine the natural language description, and regenerate. You can also generate multiple assertions from slightly different prompts and check for consistency.
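As a sketch of what assertion-level cross-checking can look like, reusing the FIFO’s full and wr_en signals from above plus an assumed occupancy counter fifo_count and capacity parameter DEPTH (illustrative names, not anything prescribed by the paper):

```systemverilog
// Two independent formulations of "the FIFO never overflows", checked together.
// Formulation 1 (say, LLM-generated): a write is never attempted while full.
assert property (@(posedge clk) disable iff (rst) !(full && wr_en));

// Formulation 2 (written separately): the occupancy never exceeds DEPTH.
assert property (@(posedge clk) disable iff (rst) fifo_count <= DEPTH);

// These can legitimately diverge: a FIFO that simply ignores writes while full
// satisfies the second but not the first. That divergence is exactly the kind
// of specification ambiguity cross-checking is meant to surface.
```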
None of these eliminate the problem entirely, but they help. And they still require human expertise, just potentially less of it, or applied more efficiently.
Challenge 2: Completeness
Even if generated assertions are correct, are they complete? Do they cover all the properties you care about? Or do they only verify the properties you thought to ask about?
This is the coverage problem reappearing. An LLM can help you write assertions for properties you specify. It can’t help you think of properties you didn’t specify.
Or can it? The paper explores whether LLMs, trained on large codebases including verification code, might suggest properties based on the RTL structure. “I notice this is a FIFO design. Common properties to verify for FIFOs include: overflow prevention, underflow prevention, ordering guarantees, full/empty signal correctness.”
This is fascinating but uncertain. Can LLMs learn the tacit knowledge of “what properties are important to verify for this type of design”? That’s a form of Week 7’s tacit architectural knowledge—expertise that hardware verification engineers have internalized but never explicitly written down.
Challenge 3: The Abstraction Gap Persists
LLM-generated assertions verify RTL. They don’t address the abstraction gap to physical implementation. You still need to trust that verified RTL properties hold in synthesized netlists and manufactured silicon.
But the same is true of human-written assertions. This isn’t a limitation of LLM-assisted verification specifically—it’s a fundamental challenge of the entire verification paradigm.
Paper 2: SLDB - Benchmarking LLM-Aided Chip Design
The second paper tackles a different aspect of the AI-for-verification problem: How do we evaluate whether LLMs are getting better at chip design tasks?
This connects directly to Week 11’s benchmarking challenges. We discussed why hardware benchmarks are harder than software benchmarks: multi-stage feedback loops, quality metrics beyond correctness, the irrevocability constraint, and the need to evaluate performance, power, and area, not just functionality.
SLDB (System-on-Chip LLM Design Benchmark) attempts to create a comprehensive benchmark for evaluating LLMs across the entire chip design flow: from specification to RTL generation to verification to synthesis to physical design.
The End-to-End Challenge
Most existing benchmarks focus on single stages:
- RTL generation: Given a specification, generate correct Verilog
- Verification: Given RTL, write assertions
- Synthesis optimization: Given RTL, optimize for PPA (performance, power, area)
But real chip design is end-to-end. Decisions at one stage affect others. RTL that’s easy to verify might be hard to synthesize efficiently. Designs that look good at RTL might fail timing closure at physical design.
SLDB argues that we need benchmarks that evaluate the entire flow. Can an LLM-assisted design process take a high-level specification all the way to a validated, physically-realizable design that meets PPA targets?
This is ambitious. And it highlights a fundamental question:
What are we actually trying to evaluate?
Are we evaluating:
- LLM capabilities in isolation: Can the model generate correct RTL? Can it write valid assertions? Can it optimize synthesis parameters?
- LLM-assisted workflows: Can a human designer working with an LLM produce better designs faster than working alone?
- Autonomous AI design agents: Can an AI system design chips end-to-end without human intervention?
These are very different evaluation targets requiring different benchmarks.
The Metrics Problem
Even if we agree on what to evaluate, what metrics matter?
For verification specifically:
- Assertion correctness: Do generated assertions match the intended properties?
- Assertion completeness: Do they cover all important properties?
- Bug detection rate: When the implementation has bugs, do the assertions catch them?
- False positive rate: How often do assertions incorrectly flag correct behavior as bugs?
For the full chip design flow:
- Functional correctness: Does the design work at all?
- PPA (Performance, Power, Area): How good is the design?
- Time to completion: How long did the design process take?
- Human effort: How much human expertise was required?
SLDB proposes a suite of metrics spanning these dimensions. But as with Week 12’s placement evaluation challenges, the question remains: Do these metrics predict real-world success?
A design might score well on academic benchmarks but fail in production because:
- Benchmarks use simplified specifications; real specs are ambiguous and evolving
- Benchmarks assume well-defined problems; real designs involve extensive exploration
- Benchmarks measure final quality; real design processes require iteration and debugging
- Benchmarks don’t capture the tacit knowledge transfer and team collaboration that characterize actual chip design
The Diversity Challenge
SLDB aims to be comprehensive, covering diverse design types: processors, accelerators, memory controllers, interconnects, mixed-signal interfaces.
But comprehensiveness creates its own challenge: How do you avoid overfitting?
If LLMs train on large codebases including many processor designs, and your benchmark includes processor designs, the model might have seen similar examples during training. Is it generalizing or memorizing?
This is the same challenge we saw in Week 3’s SWE-bench: distinguishing between “learned general principles” and “memorized patterns.”
For hardware, the challenge is compounded by the fact that design patterns are more constrained. There are only so many ways to design a cache. There are established idioms for FIFOs, arbiters, state machines. If an LLM has seen thousands of FIFO designs, and you ask it to generate a FIFO, should we be impressed if it produces a correct one?
The benchmark must include designs that are:
- Realistic: Representative of actual hardware design challenges
- Novel: Sufficiently different from training data to test generalization
- Verifiable: With ground truth for automated evaluation
Balancing these requirements is difficult. Make designs too novel, and you’re testing the model’s ability to handle under-specified problems (which is interesting but different from measuring competence). Make designs too similar to training data, and you’re measuring memorization.
What Role Can LLMs Actually Play?
Both papers, read together, point to a more nuanced picture than “AI will automate chip design.”
LLMs can lower barriers to entry for formal verification. They can quickly generate initial assertions that humans review and refine. They might suggest properties commonly verified for similar designs, generate boilerplate code, or adapt existing verification IP to new contexts. They can highlight inconsistencies in natural language specifications.
What they can’t do: replace human expertise entirely. Verification still requires deep understanding of the design and its properties. Generated assertions must be validated, and that validation requires expertise. LLMs don’t reason about physical effects. They don’t decide what properties matter. And they can’t write correct, complete specifications on their own.
The Trust Problem: Verification in an AI-Assisted World
This brings us to the deepest question of Week 13—and perhaps of the entire course:
If AI agents help design chips, and we use AI-assisted tools to verify those chips, how do we establish trust?
The Circular Verification Problem
Traditionally, verification provides confidence through separation of concerns:
- Designers create the implementation
- Verification engineers (different people) write the assertions and run verification
- Tools (deterministic algorithms) check that assertions hold
This separation reduces the risk of systematic errors. If both the design and verification are done by the same person with the same misunderstandings, bugs slip through. Independent verification teams catch these.
But if LLMs assist with both design (generating RTL) and verification (generating assertions), and if both use similar models trained on similar data, do we lose this independence?
Consider:
- LLM generates RTL from a specification
- LLM generates verification assertions from the same specification
- Formal verification tools check that the RTL satisfies the assertions
- Verification passes ✓
What have we actually verified?
We’ve verified that the generated RTL is consistent with the generated assertions. But if both the RTL and the assertions misinterpret the specification in the same way, verification will pass despite the design being wrong.
This is the circular verification problem: using AI to check AI’s work doesn’t eliminate errors if both make correlated mistakes.
Trust Through Diversity
One approach to this problem is diversity: use different tools, different models, different approaches, and cross-check for consistency.
- Generate RTL with one LLM, assertions with a different LLM
- Use multiple verification tools (commercial and open-source) with different algorithms
- Combine formal verification with simulation-based coverage analysis
- Employ adversarial approaches: train models to find bugs in designs produced by other models
Diversity doesn’t eliminate the trust problem, but it reduces the risk of correlated failures. If multiple independent approaches agree, confidence increases.
This echoes Week 4’s discussion of code optimization: you don’t trust optimized code because the optimizer is theoretically correct. You trust it because multiple optimizers produce similar results, and the optimized code is tested extensively on diverse workloads.
Trust Through Transparency
Another approach is transparency: make the AI’s reasoning visible so humans can audit it.
If an LLM generates an assertion, it could also generate:
- An explanation of what property the assertion checks
- Examples of behaviors that would violate the assertion
- Connections to the original specification showing which requirements are covered
This makes the generated artifact auditable. Verification engineers can review not just the assertion code, but the reasoning behind it.
This is similar to Week 12’s discussion of DREAMPlace versus RL-based placement. Analytical methods are transparent: you can inspect the objective function and understand what’s being optimized. Learned policies are black boxes: you see the output but not the reasoning.
For verification, transparency may be essential for trust. But it’s unclear whether LLMs can provide it. Generating explanations is easy; generating correct explanations is harder.
Trust Through Bounded Domains
A pragmatic approach is to bound the problem: use AI assistance only for well-defined, limited-scope tasks where validation is straightforward.
For example:
- Use LLMs to generate assertions for simple, well-understood design patterns (FIFOs, arbiters)
- Restrict LLM assistance to suggesting properties, not finalizing them
- Apply AI-assisted verification to non-critical modules while using traditional verification for critical paths
This limits the potential impact of AI errors. If an LLM-generated assertion for a peripheral controller is wrong, the consequences are less severe than if it’s wrong for the cache coherence protocol.
This is the approach most companies take in practice: gradual adoption in lower-risk areas, with human oversight for critical components.
Trust Through Empirical Validation
Ultimately, trust in hardware comes from silicon validation: building the chip and testing it extensively in real systems.
No amount of verification eliminates this final step. Formal verification increases confidence, but the ultimate test is: does the chip work when manufactured and deployed?
This is sobering. After all the AI-assisted design, the LLM-generated verification, the formal proofs—you still need to build the chip and test it. And if it fails, you’ve lost millions of dollars and many months.
The irrevocability constraint remains the final arbiter of trust.
Closing the Loop: Verification as Feedback
Week 12 ended by emphasizing feedback loops: how can we close the long feedback loops that currently span months in chip design?
Here’s the problem with verification today. You design something. You write the RTL. You spend weeks verifying it. Then you discover a bug. Now you have to go back, fix the design, and verify again. If the bug is fundamental, you might need to rethink the architecture itself. That cycle can take months.
The later you find a bug, the more expensive it is to fix. Find it during RTL verification? You edit code and reverify. Find it after physical design? You might need to redo weeks of layout work. Find it in silicon? You’re looking at a costly recall or living with the bug.
Verification is part of this loop. In traditional flows:
[Flow diagram: specification → RTL design → verification → synthesis → physical design → silicon, with arrows pointing backward wherever a bug discovered at a later stage forces a return to an earlier one, including a “can’t fix” path from silicon back to “Redesign RTL.”]
Notice the arrows going backward in the diagram. Those backward arrows represent rework. The more steps you go forward before discovering a problem, the more steps you have to redo.
Verification happens at discrete stages. You finish writing RTL, then you verify it. You finish physical design, then you verify timing. But what if you’re making fundamental mistakes early on that you won’t discover until much later?
What if verification could provide earlier feedback? What if, instead of verifying after you’re done, verification happened continuously while you design? You’d catch problems immediately instead of weeks later.
Early-Stage Verification
AI-assisted verification could potentially enable:
Specification-level verification: Before writing RTL, verify properties of the architectural specification itself. Check for inconsistencies, ambiguities, unrealizable requirements. “You specified that the cache must always hit in one cycle, but you also specified that the cache can be larger than 1MB. These requirements conflict for your target technology.”
Incremental RTL verification: As designers write RTL, continuously generate and check assertions for the completed portions. Catch bugs immediately rather than in dedicated verification phases. This is like continuous integration for hardware design.
Physical-awareness in functional verification: Generate assertions that account for physical constraints. “This signal path is too long to meet timing at your target frequency. The functional specification assumes single-cycle latency, but the physical reality will require pipelining.”
These are speculative—current tools don’t do this well. But they represent the potential of tighter integration between design and verification, enabled by AI assistance.
Verification-Guided Design
Another inversion: what if verification results guide design choices?
Traditional flow: design first, verify afterward, fix bugs if found.
Alternative flow: verify continuously during design, and let verification feedback influence design decisions.
For example:
- Designer starts writing RTL for a cache controller
- AI assistant generates assertions for common cache properties
- Designer runs quick formal verification checks as they write code
- When assertions fail, AI suggests: “This bug often occurs when… Consider using design pattern X which is easier to verify correct.”
- Designer refines the design based on verification feedback in real-time
This is verification in the loop: not as a gate at the end, but as a continuous feedback mechanism shaping design.
Think about how you write code today. Your IDE underlines syntax errors as you type. It warns you about type mismatches before you compile. You see problems immediately, not after you’ve written thousands of lines of code. The feedback loop is seconds, not hours.
Could hardware design tools work the same way? As you write RTL, could the tool continuously check assertions and warn you about potential bugs in real time? Instead of finding out a week later that your FIFO can overflow, you’d see the warning as soon as you write the problematic code.
The challenge: verification is computationally expensive. Real-time feedback requires fast verification, which requires either:
- Very good approximations (fast but potentially unsound)
- Clever incremental approaches (only reverify what changed)
- Focusing on simple properties first (complete verification still happens later)
AI assistance might help with all three: learning which approximations are safe, predicting which changes require reverification, suggesting which properties to check first.
Industry Perspective: ChipStack and AI-Assisted Workflows
Kartik Hegde is the co-founder of ChipStack, a company focused on AI-assisted chip design and verification workflows. ChipStack’s approach emphasizes human-AI collaboration: AI handles routine tasks and suggests improvements, while human experts make critical decisions and provide oversight. This pragmatic approach reflects the current reality: AI augmentation rather than automation.
His perspective provides a reality check on the possibilities and limitations of AI-assisted verification.
The industry challenge is clear: verification timelines are growing faster than design timelines. As designs become more complex, verification effort grows superlinearly. Companies increasingly spend more time verifying than designing. Sometimes 60 to 70 percent of the design cycle is verification.
AI assistance could help by automating routine verification tasks like generating standard assertions, writing testbenches, and analyzing coverage metrics. It could accelerate debug by analyzing counterexamples, suggesting root causes, and recommending fixes. It might improve coverage by identifying untested corner cases and generating focused tests. And it could capture knowledge from past bugs to prevent similar issues in future designs.
But the barriers to adoption are significant. Trust and liability matter. If AI-generated assertions miss a bug that costs hundreds of millions, who’s responsible? Companies are risk-averse, especially for safety-critical designs in automotive, medical, or aerospace. Integration with existing flows is hard. Chip companies have decades-old EDA toolchains and verification methodologies. Introducing AI assistance requires integration with commercial tools and proprietary internal flows. Verification engineers have deep expertise in formal methods, coverage analysis, and debugging. Integrating AI requires new skills like prompt engineering and understanding ML limitations. That transition takes time. And training or fine-tuning LLMs for hardware-specific tasks requires significant compute resources and domain expertise. For smaller companies, the investment may not be justified.
The pragmatic approach: start with low-risk, high-value applications. Use AI to generate initial assertions that humans review. Apply AI-assisted debug to reduce time spent analyzing counterexamples. Gradually expand to more complex tasks as confidence builds.
This mirrors the broader pattern throughout the course. AI agents as collaborators, not replacements. Augmenting human expertise rather than eliminating it.
Synthesizing Phase 3: From Code to Silicon
We’ve now completed Phase 3: AI for Chip Design. Let’s step back and see how the pieces fit together.
Week 11: RTL Design and Benchmarking
- The challenge: Translating architectural intent into precise hardware descriptions
- The AI opportunity: LLMs can generate RTL from specifications
- The evaluation problem: How do we benchmark AI for hardware tasks given multi-stage feedback and the irrevocability constraint?
Week 12: Physical Design
- The challenge: Physical effects dominate modern chips; architectures must be physically realizable
- The AI opportunity: Fast placement and routing to enable rapid architectural exploration
- The trust question: How do we validate that AI-assisted physical design will meet timing, power, and area targets?
Week 13: Verification
- The challenge: Proving correctness for systems too complex to exhaustively test
- The AI opportunity: LLMs can help generate assertions and lower barriers to formal verification
- The trust problem: If AI helps design and verify chips, how do we ensure we’re not systematically introducing correlated errors?
The Common Threads
Three themes connect these weeks:
1. The Feedback Loop Crisis
At every stage, slow feedback loops limit iteration:
- RTL generation without knowing if it will synthesize efficiently
- Architecture design without knowing if it will physically realize
- Design without continuous verification feedback
AI’s potential lies not in replacing human designers, but in closing feedback loops—providing faster, earlier insights about whether designs will work.
2. The Trust and Validation Challenge
At every stage, we face the question: how do we know the AI-assisted design is correct?
- Generated RTL might be functionally wrong or inefficient
- Fast placement approximations might not predict actual timing
- Generated assertions might miss critical properties
The irrevocability constraint means we can’t afford to be wrong. Trust must be earned through validation, transparency, and empirical evidence.
3. The Tacit Knowledge Problem
At every stage, expert human knowledge is difficult to capture:
- RTL design patterns that lead to efficient synthesis
- Physical design intuitions about what floorplans will route well
- Verification insights about which properties are important to check
AI might help by learning from examples what human experts know but have never written down. But we must be cautious: memorizing patterns is not the same as understanding principles.
Verification’s Role in AI-Assisted Chip Design
As AI becomes more integrated into chip design, verification takes on new importance. The question shifts from “does this chip work?” to “can we trust an AI-assisted design process to produce chips that work?”
When LLMs generate RTL or assertions, we need methods to verify not just the artifacts themselves, but the generation process. Cross-validation using multiple AI models helps. Hybrid approaches combining AI generation with traditional verification can catch errors. Human review remains essential for high-risk components.
Verification should happen earlier in the design process, not just at the end. AI-assisted tools should consider verifiability during design, even if that means slightly less optimal implementations in other dimensions. And we need meta-verification: ways to verify that our verification processes themselves are sound. How do we know the LLM-generated assertions actually capture the properties we care about?
The irrevocability constraint ensures that verification will remain the final gatekeeper in chip design, regardless of how sophisticated our AI tools become.
What This Means for Verification
Verification is fundamentally different from testing software. When you write code, you compile it. Either it compiles or it doesn’t. If it doesn’t compile, you fix the syntax error and try again. Minutes later, you’re testing. If there’s a bug, you fix it, recompile, and redeploy. Users might never know there was a problem.
With hardware, once you’ve fabricated the chip, you can’t fix it. You can’t patch silicon. If there’s a bug, your options are:
- Live with it (document it as a known issue, maybe provide software workarounds)
- Do a costly recall and replace millions of units (like Intel’s $475 million Pentium fix)
- Redesign the chip and wait months for new fabrication
That’s why verification is binary. The chip either works or it doesn’t. There’s no “ship it and patch it later” option. There’s no partial credit for “mostly correct.” A single verification failure that makes it to silicon can cost hundreds of millions of dollars and a year of lost time.
This makes AI assistance both promising and risky. A bug that slips through because of an incorrect LLM-generated assertion costs just as much as a bug that slips through traditional verification. The bar for adoption is high. You can’t trust an AI-generated assertion just because it looks plausible. You have to verify the verification itself.
But if AI can genuinely lower the barrier to formal verification, making it accessible to more engineers and applicable to more designs, that could change what gets verified and how thoroughly. More properties checked. More corner cases covered. Bugs caught in RTL instead of silicon.
Verification can’t be fully automated, but it can be amplified. LLMs won’t replace verification engineers. They might help engineers write more assertions faster, find patterns in counterexamples more quickly, identify coverage gaps more systematically. But the human still makes the final call on whether the verification is sufficient. Because when you tape out a chip, you’re making a one-way decision. Get it wrong, and there’s no undo button.
That’s the theme throughout this phase of the course. AI doesn’t replace expertise in chip design. It amplifies it. And in verification, where correctness is absolute and mistakes are irreversible, that distinction matters more than anywhere else.
Key Takeaways
The Verification Challenge: Modern chips have more possible states than atoms in the universe. Testing can only sample an infinitesimal fraction. Formal verification can prove correctness but faces scalability limits.
LLMs for Verification: LLMs can potentially lower barriers to formal verification by generating assertions from natural language properties. But generated assertions must be validated, and validation requires expertise.
The Trust Problem: If AI helps design and verify chips, we risk circular validation—AI checking AI’s work with correlated errors. Trust requires diversity, transparency, and empirical validation.
SLDB Benchmark: Comprehensive benchmarking for LLM-aided chip design must span the entire flow and distinguish between memorization and generalization. The metrics must predict real-world success.
Closing the Loop: Verification should provide earlier feedback, ideally continuously during design. AI assistance could enable verification-in-the-loop, where verification results guide design decisions in real-time.
The Irrevocability Constraint: Remains the final arbiter. No matter how sophisticated the AI assistance or verification, hardware errors are irreversible. This demands rigor and humility.
AI as Amplifier, Not Replacement: The realistic path forward is AI augmenting human expertise in verification—making formal methods more accessible, speeding up assertion writing, and improving debugging—while humans maintain responsibility for correctness.
Discussion Questions
For researchers: How do we create verification benchmarks that test genuine understanding rather than pattern matching? What would it take to formally verify that an LLM-generated assertion correctly captures a natural language property?
For practitioners: Where in your verification flow would AI assistance provide the most value? What would it take for you to trust AI-generated assertions enough to use them in production?
For everyone: If AI increasingly assists with both design and verification, how do we maintain the independence that makes verification trustworthy? Should we intentionally use different AI approaches for design versus verification to reduce correlated errors?
The deeper question: As AI systems become more capable at tasks requiring deep expertise (verification, architectural design, formal reasoning), does the remaining human role become more important (higher-level oversight and judgment) or less important (rubber-stamping AI outputs)? How do we ensure it’s the former?
For detailed readings, slides, and materials for this week, see Week 13 in the course schedule.