QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

The paper introduces QiMeng-CodeV-SVA, a specialized LLM trained via a novel data-synthesis framework. By leveraging large-scale RTL code and bidirectional translation, it overcomes data scarcity and semantic-verification challenges, achieving state-of-the-art performance in generating SystemVerilog Assertions from natural language.

Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu

Published 2026-03-17

Imagine you are building an incredibly complex, high-speed train system (a computer chip). Before you let any passengers on, you need to make sure the train never derails, never runs a red signal, and always stops at the right station.

In the world of chip design, engineers write these safety rules in a very strict, technical language called SystemVerilog Assertions (SVAs). Think of SVAs as the "laws of physics" for the chip. If the chip breaks these laws, it's a disaster.

The Problem: The Language Barrier

The trouble is, human engineers think in plain English (e.g., "Make sure the counter stops when the button is pressed"), but the computer only understands the strict SVA code.

For years, people tried to use General-Purpose AI (like the smart chatbots you know) to translate English into these safety laws. But it was like asking a brilliant literature professor to perform brain surgery. They knew the words, but they didn't know the rules of the chip. They often wrote laws that sounded right but were actually nonsense, or laws that were too simple to catch real errors.

Also, there was a huge problem: Data Scarcity. To teach an AI to be a surgeon, you need thousands of real surgeries to study. But in chip design, there are very few examples of perfect English-to-SVA translations available.

The Solution: QiMeng-CodeV-SVA

The researchers in this paper built a new, specialized AI called CodeV-SVA. They didn't just give it a textbook; they built a massive, custom training camp using a clever three-step process.

Here is how they did it, using some everyday analogies:

1. The "RTL-Grounded" Factory (The Raw Materials)

Instead of waiting for humans to write perfect examples, the team looked at RTL code.

  • Analogy: Imagine you have a million blueprints for different houses (RTL code). You don't have the safety manuals yet, but you have the blueprints.
  • The Trick: They used a smart AI to look at these blueprints and guess what the safety rules should be. It's like looking at a blueprint of a bridge and asking, "What are the rules to keep this bridge from falling?"
  • The Result: They generated hundreds of thousands of potential safety rules.
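The mining step above can be sketched in a few lines. This is a minimal, runnable illustration, not the paper's actual pipeline: `propose_assertions` stands in for whatever LLM call the authors use, and the RTL snippet and the returned assertion are made-up examples.

```python
# A minimal sketch of RTL-grounded assertion mining. The model call is
# stubbed out; in the real pipeline, propose_assertions would prompt an
# LLM with the RTL source and parse candidate SVAs from its reply.

def propose_assertions(rtl_module: str) -> list[str]:
    """Stub for an LLM that reads RTL and guesses candidate safety rules."""
    # Fixed toy behavior so the sketch runs without a model.
    if "counter" in rtl_module:
        return ["assert property (@(posedge clk) stop |-> ##1 $stable(count));"]
    return []

def mine_candidates(rtl_corpus: list[str]) -> list[tuple[str, str]]:
    """Pair each RTL module with every candidate assertion mined from it."""
    candidates = []
    for module in rtl_corpus:
        for sva in propose_assertions(module):
            candidates.append((module, sva))
    return candidates

corpus = ["module counter(input clk, stop, output reg [7:0] count); endmodule"]
pairs = mine_candidates(corpus)
print(len(pairs))  # one candidate assertion mined from the one module
```

At scale, the same loop runs over a corpus of many thousands of real RTL modules, which is how the "hundreds of thousands of potential safety rules" accumulate.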

2. The "Bidirectional" Mirror Test (The Quality Control)

This is the paper's most creative idea. How do you know if the AI's guess is actually correct?

  • The Problem: Sometimes an AI writes a rule that is technically "true" but useless. For example, if the rule is "The sky is blue OR the sky is not blue," it's always true, but it doesn't tell you anything about the bridge.
  • The Solution (Bidirectional Translation):
    1. Take the AI's generated rule (SVA).
    2. Ask the AI to translate it back into plain English.
    3. Ask the AI to translate that English back into a new rule.
    4. The Check: Does the new rule match the original rule?
  • The Metaphor: Imagine you tell a friend a secret. They whisper it to a second friend, who whispers it back to you. If the story comes back exactly the same, you know the message was clear. If the story changes (e.g., "The bridge is safe" becomes "The bridge is always safe"), you know the first friend misunderstood the nuance.
  • The Filter: They threw away any rules that got "mangled" in this translation loop. Only the perfect, clear rules survived.
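The mirror test can be sketched as a round-trip filter. Note the simplifying assumptions: the paper checks semantic consistency with LLM calls in both directions, while this toy version stubs both translators with lookup tables and compares strings after whitespace normalization. All assertion strings below are illustrative, not from the paper.

```python
# Sketch of the bidirectional "mirror test": keep an assertion only if
# SVA -> English -> SVA reproduces the original. Both translation
# directions are stubbed with toy tables so the sketch is runnable.

GOOD_SVA = "assert property (@(posedge clk) req |-> ##1 ack);"
VAGUE_SVA = "assert property (@(posedge clk) ack || !ack);"  # always true, useless

# Stub for "SVA -> English": the vague rule loses its detail in translation.
TO_ENGLISH = {
    GOOD_SVA: "one cycle after req is high, ack must be high",
    VAGUE_SVA: "ack is either high or low",
}

# Stub for "English -> SVA": the vague description comes back as a
# different rule, so it will fail the round trip.
TO_SVA = {
    "one cycle after req is high, ack must be high": GOOD_SVA,
    "ack is either high or low": "assert property (@(posedge clk) 1);",
}

def normalize(sva: str) -> str:
    """Collapse whitespace so cosmetic differences don't count as changes."""
    return " ".join(sva.split())

def survives_round_trip(sva: str) -> bool:
    """The 'mirror test': did the rule come back unchanged?"""
    round_tripped = TO_SVA[TO_ENGLISH[sva]]
    return normalize(round_tripped) == normalize(sva)

kept = [s for s in (GOOD_SVA, VAGUE_SVA) if survives_round_trip(s)]
print(len(kept))  # only the precise rule survives the mirror test
```

The vacuous rule ("ack is either high or low", the "sky is blue OR not blue" case) gets mangled in the loop and is thrown away, while the precise rule survives.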

3. The "Reasoning" Coach (The Final Polish)

Before training the final model, they added a "thinking step."

  • Analogy: Instead of just giving the answer, the AI was forced to write out its homework steps first. "First, I see a clock signal. Second, I see a reset button. Therefore, the rule must be..."
  • This helped the AI understand why a rule was correct, not just memorize the pattern.
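The "homework steps" idea amounts to formatting each training example so the reasoning precedes the answer. This is a hedged sketch; the field names and layout are assumptions, not the paper's actual prompt template.

```python
# Sketch of chain-of-thought formatting for training examples: each
# example is rendered as spec -> numbered reasoning steps -> final SVA,
# so the model learns to show its work before emitting the rule.

def format_with_reasoning(spec: str, steps: list[str], sva: str) -> str:
    """Render one training example with an explicit reasoning section."""
    reasoning = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return f"Spec: {spec}\nReasoning:\n{reasoning}\nSVA: {sva}"

example = format_with_reasoning(
    "ack must rise one cycle after req",
    ["The design is clocked on the rising edge of clk.",
     "req implies ack one cycle later, i.e. |-> ##1."],
    "assert property (@(posedge clk) req |-> ##1 ack);",
)
print(example.splitlines()[0])
```

Training on examples laid out this way is what pushes the model toward understanding why a rule follows from the design, rather than pattern-matching spec text to assertion text.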

The Result: A Specialist vs. A Generalist

They trained their new AI, CodeV-SVA, on this massive, high-quality dataset.

  • The Old Way: Using a general AI (like GPT-5 or DeepSeek) was like using a Swiss Army Knife to perform heart surgery. It worked okay, but it wasn't precise.
  • The New Way: CodeV-SVA is like a specialized heart surgeon. Even though it's smaller and cheaper to run than the giant general AIs, it performs better at writing these specific chip safety rules.

In the final tests:

  • CodeV-SVA beat the world's most expensive, powerful general AIs.
  • It caught more errors and wrote more accurate safety laws.
  • It proved that you don't need a giant brain if you have the right training data and a clever way to filter out the bad stuff.

Why This Matters

This paper shows that for highly specialized jobs (like designing computer chips), we don't need to wait for AI to become "super-intelligent" in everything. Instead, we can build specialized tools by teaching them with high-quality, self-generated data and rigorous "mirror tests." It's a smarter, cheaper, and more effective way to build the future of hardware.
