LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

The paper proposes LLM4Cov, an offline agentic learning framework that overcomes the high cost of execution feedback in hardware verification by introducing execution-validated data curation and worst-state-prioritized sampling, enabling a compact 4B-parameter model to achieve high testbench coverage that surpasses both its teacher and significantly larger models.

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

Published 2026-02-27

Imagine you are trying to teach a robot how to write a complex recipe (a testbench) to test a new, very expensive machine (a hardware chip) before it is built.

If the robot makes a mistake in the recipe, the machine might break or behave strangely. To find out, you have to run a simulation. But here's the catch: running this simulation is like baking a cake that takes three hours. You can't bake a thousand cakes a day to see which one tastes best. It's too slow and expensive.

This is the problem the paper LLM4Cov solves. It teaches a small, smart robot how to write perfect recipes by learning from these slow, expensive "baking sessions" without wasting time.

Here is how they did it, using simple analogies:

1. The Problem: The "Expensive Taste Test"

In the old way, if you wanted to teach an AI to write these recipes, you might try to let it guess, run the simulation, see if it failed, and try again immediately (like online learning). But because the simulation takes so long, the AI would spend 99% of its time waiting for the oven to finish, not learning.

Also, if you just gave the AI a pile of "perfect recipes" written by a human expert to study, it would fail in real life. Why? Because the AI would never learn how to fix a broken recipe. It would only know what a perfect one looks like, not how to recover when things go wrong.

2. The Solution: The "Three-Stage Cooking School"

The authors built a system called LLM4Cov that acts like a smart cooking school. Instead of just memorizing recipes, the student robot learns by practicing on the worst possible scenarios and fixing them.

They use three clever tricks:

Trick A: The "Worst-Case Scenario" Drill

Imagine a cooking class where the teacher usually gives you perfect ingredients. But in this class, the teacher says: "Okay, let's look at the 10 recipes you tried yesterday. Which one was the absolute worst? The one that burned the cake?"

Instead of ignoring that burnt cake, the teacher focuses entirely on it. They say, "Let's take this burnt cake and figure out exactly how to fix it so it becomes a perfect cake."

  • In the paper: This is called Worst-State-Prioritized Sampling. The AI is forced to look at the test cases that failed the most (lowest coverage) and learn how to fix them. This teaches the AI how to recover from disasters, which is the most valuable skill.
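The sampling idea above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the paper's implementation; the names `pick_worst` and `attempts` are hypothetical.

```python
# Minimal sketch of worst-state-prioritized sampling (illustrative only;
# `pick_worst` and the data shapes are invented, not from the paper).

def pick_worst(attempts, k=1):
    """Return the k attempts with the lowest coverage score.

    Each attempt is a (testbench_name, coverage) pair, where coverage is
    a fraction in [0, 1] reported by the simulator.
    """
    return sorted(attempts, key=lambda a: a[1])[:k]

attempts = [
    ("tb_v1", 0.82),
    ("tb_v2", 0.41),  # the "burnt cake": lowest coverage
    ("tb_v3", 0.67),
]

worst = pick_worst(attempts, k=1)
print(worst)  # → [('tb_v2', 0.41)]
```

The lowest-coverage attempt, not the best one, becomes the next repair target, so the training data concentrates on recovering from failures.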

Trick B: The "Staged Apprenticeship"

You can't teach a beginner to fix a burnt cake if they don't even know how to boil water yet. The system uses a Progressive Learning approach:

  • Stage 1 (The Beginner): The student robot tries to write a recipe. It fails. A super-smart "Master Chef" (a huge, powerful AI) looks at the failure and shows the student how to fix it. The student learns from the Master's corrections.
  • Stage 2 (The Intermediate): The student gets better. Now, the Master Chef stops helping as much. The student tries to fix its own mistakes. If it succeeds, great! If it fails, the Master Chef steps in again.
  • Stage 3 (The Master): The student is now so good that it can fix its own mistakes almost as well as the Master Chef. It learns to generate its own "perfect fixes" without needing the big teacher anymore.

This is like a video game where you start on "Easy Mode" with a guide, and slowly the guide disappears as you level up.
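The three stages can be summarized as control flow: who supplies the fix depends on the stage. This is a hedged sketch of the idea only; `student_fix`, `teacher_fix`, and `simulate` are hypothetical placeholders, not functions from the paper.

```python
# Illustrative sketch of the staged apprenticeship (all names hypothetical).

def curate_example(broken_tb, stage, student_fix, teacher_fix, simulate):
    """Produce one training correction, escalating to the teacher by stage.

    Stage 1: always use the big teacher model's correction.
    Stage 2: try the student first; fall back to the teacher if its fix
             fails in simulation.
    Stage 3: rely on the student's own correction.
    """
    if stage == 1:
        return teacher_fix(broken_tb)
    candidate = student_fix(broken_tb)
    if stage == 2 and not simulate(candidate):
        return teacher_fix(broken_tb)
    return candidate
```

The key design choice is that the expensive teacher is consulted less and less as the student improves, so teacher calls (and simulations) are spent where they matter most.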

Trick C: The "Memoryless" Shortcut

Usually, when a robot tries to fix a recipe, it remembers every single thing it did in the past 100 steps. This makes the instructions huge and confusing.

The authors realized the robot doesn't need the whole history. It just needs to know: "Here is the current broken recipe, and here is the error message."

  • In the paper: They call this a Memoryless State Transition. It's like telling the robot, "Forget the last hour of chaos. Just look at the mess on the counter right now and clean it up." This makes the learning much faster and more focused.
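In code, the memoryless idea means the next prompt is built only from the current state, never from the accumulated history. A minimal sketch, assuming an invented prompt format (the wording and `next_state` name are mine, not the paper's):

```python
# Minimal sketch of a memoryless state transition: the repair prompt depends
# only on the current testbench and its error report, not on prior turns.
# The prompt wording here is invented for illustration.

def next_state(current_tb: str, error_report: str) -> str:
    """Build the repair prompt from the current state alone."""
    return (
        "Fix this testbench so it compiles and maximizes coverage.\n"
        f"--- testbench ---\n{current_tb}\n"
        f"--- error report ---\n{error_report}\n"
    )
```

Because each prompt has a fixed, small shape, the context stays short no matter how many repair rounds have happened, which is what makes the learning faster and more focused.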

3. The Amazing Result

The most surprising part of the paper is the size of the robot they used.

  • They used a small AI model (only 4 billion parameters).
  • They compared it to giant AI models (30 billion to 500 billion parameters).

The Result: The small, specialized robot trained with this "Worst-Case" method actually beat the giant, general-purpose robots.

  • The small robot achieved a 69.2% success rate in creating perfect test recipes.
  • The giant robots (without this special training) only got around 60%.

The Big Takeaway

You don't need a massive, expensive brain to solve hard problems. You just need the right training method.

By teaching a small AI to focus on its biggest failures, fixing them step-by-step, and ignoring unnecessary history, it becomes a master of hardware verification. It's like taking a small, sharp knife and sharpening it perfectly, rather than trying to use a giant, dull chainsaw.

In short: LLM4Cov teaches AI to learn from its mistakes in the most efficient way possible, turning a slow, expensive process into a fast, high-quality learning experience.
