QiMeng-CRUX: Narrowing the Gap Between Natural Language and Verilog via Core Refined Understanding eXpression for Circuit Design

The Big Problem: The "Lost in Translation" Gap

Imagine you are a brilliant Architect (the AI) who speaks perfect Blueprint Language (Verilog code). However, your clients (the users) speak Casual English.

When a client says, "I need a machine that counts how many people are in the room," they might say it in a messy, confusing way:

"Hey, make a thing that looks at the door, counts the ones coming in, maybe subtracts the ones leaving, oh and by the way, the door is red, and if it's raining, don't count the umbrellas."

If you ask a standard AI to turn this messy sentence into a strict, mathematical blueprint (Verilog code), it often gets confused. It might get distracted by the irrelevant "red door" detail, forget the rule about the umbrellas, or build a machine that doesn't fit the room. The gap between loose human speech and rigid computer code is too wide.

The Solution: The "CRUX" Translator

The authors of this paper created a new tool called CRUX (Core Refined Understanding eXpression).

Think of CRUX as a Professional Project Manager who sits between the Client and the Architect.

Instead of the Architect trying to guess what the Client meant, the Client talks to the Project Manager first. The Project Manager takes the messy, rambling request and rewrites it into a Perfectly Structured Brief before handing it to the Architect.

This "Perfect Brief" (the CRUX space) has three specific sections:

The Interface (The Doors & Windows): Exactly what goes in and what comes out. (e.g., "3 wires go in, 2 wires go out").
The Core Function (The Engine): What the machine actually does. (e.g., "Count the '1's").
The Key Considerations (The Fine Print): The tricky details that usually get missed. (e.g., "If the input is zero, output zero," or "Watch out for the bubble on the second input").

By forcing the AI to write this "Perfect Brief" first, the final code becomes much more accurate.

How They Trained the AI (The Two-Stage Gym)

The team didn't just tell the AI to "do better." They built a two-stage training camp for it:

Stage 1: The Classroom (Joint Expression Modeling)

The Setup: They took thousands of examples of messy user requests and their correct code.
The Trick: They simulated "real-world" messiness. They intentionally hid parts of the instructions or scrambled the order (like a client forgetting to say how many doors the room has).
The Lesson: They taught the AI: "First, look at this messy request. Write the 'Perfect Brief' (CRUX). Then, write the code based on that brief."
Result: The AI learned to organize its thoughts before it started building.

Stage 2: The Coach's Whistle (Dual-Space Optimization)

The Problem: Sometimes the AI writes a "Perfect Brief" that looks good but is actually slightly wrong, leading to bad code.
The Fix: They introduced a Reward System (like a video game coach).
- If the code works? Good job!
- If the "Perfect Brief" helped the AI figure out the code easily? Double Good job!
The Goal: The AI learned that a clear, structured "Brief" is just as important as the final code. It learned to prioritize clarity over just guessing.

The Results: Why It Matters

The paper tested this new AI (called QiMeng-CRUX) against other top models on difficult hardware design tasks.

The Analogy: Imagine a race where everyone has to build a complex Lego castle based on a vague description.
- Old AI: Tries to guess the castle, often building a tower where a wall should be.
- QiMeng-CRUX: Stops, writes a checklist of exactly what bricks are needed, organizes them, then builds.
The Outcome: QiMeng-CRUX won the race. It built the castles (circuits) correctly much more often than the others, especially when the instructions were tricky or incomplete.

The "Magic" Bonus: It Works on Others Too

The coolest part? The "Perfect Brief" (CRUX) is so good that you can give it to other AI models that weren't even trained on it.

Analogy: It's like giving a messy recipe to a novice chef, who fails. But if you first give that same messy recipe to a Master Chef, who rewrites it as organized notes (the CRUX), and then hand those notes to the novice, the novice suddenly cooks far better.
This suggests that the CRUX method isn't just a trick for one specific AI; the structured brief also helps other models understand hardware design requests.

Summary

QiMeng-CRUX solves the problem of "messy instructions" by forcing the AI to act like a Project Manager first. It organizes the chaos into a clear, structured plan (CRUX) before writing any code. This simple step of "thinking before acting" bridges the gap between human language and computer logic, resulting in fewer mistakes and better hardware designs.

1. Problem Statement

The paper addresses the significant challenge of generating hardware description language (HDL) code, specifically Verilog, using Large Language Models (LLMs). While LLMs excel in general programming, hardware design faces unique hurdles:

Ambiguity and Unstructured Input: User descriptions of hardware designs are often free-form, ambiguous, redundant, and lack the strict structure required for formal circuit modeling.
The Transformation Gap: Converting open-ended natural language into the highly constrained, syntactically rigid, and semantically precise Verilog space is difficult. Critical details (e.g., sequential logic, finite state machines, specific port widths) are often implicit or scattered in the text.
Performance Limitations: Existing methods relying solely on direct natural language-to-code generation suffer from semantic drift, misaligned design intent, and incorrect implementations, particularly in complex tasks like "Spec-to-RTL" (converting specifications directly to register-transfer level code).

2. Methodology

The authors propose QiMeng-CRUX, a framework centered on a structured intermediate representation called CRUX (Core Refined Understanding eXpression). The approach consists of two main components: the CRUX space definition and a two-stage training framework.

A. The CRUX Space

CRUX acts as a semantic bridge between natural language and Verilog. It decomposes a user's intent into three structured components:

Module Interface: Explicitly defines input/output ports, signal directions, and bit-widths (structural foundation).
Core Functions: Captures the essential circuit behavior logic, control flow, and data flow (functional goals).
Key Considerations: Highlights subtle but critical implementation details, constraints, and edge cases (e.g., FSM state transitions, K-map logic, timing constraints) to ensure synthesizability and accuracy.
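
To make the three-part structure concrete, here is a minimal Python sketch of how a CRUX could be held in memory and serialized into a plain-text brief. The class names (`Port`, `Crux`), the `render` methods, and the exact text layout are assumptions made for illustration; the paper's actual CRUX format may differ. The example reuses the bit-counting circuit from the explainer above (3 wires in, 2 wires out).

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """One entry of the Module Interface (hypothetical schema)."""
    name: str
    direction: str  # "input" or "output"
    width: int = 1  # bit-width of the signal

    def render(self) -> str:
        # Verilog-style range for multi-bit signals, e.g. [2:0] for 3 bits.
        span = f"[{self.width - 1}:0] " if self.width > 1 else ""
        return f"{self.direction} {span}{self.name}"


@dataclass
class Crux:
    """Hypothetical container for the three CRUX components."""
    interface: list[Port]
    core_functions: list[str]
    key_considerations: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Section titles follow the component names used in this summary.
        sections = [
            ("Module Interface", [p.render() for p in self.interface]),
            ("Core Functions", self.core_functions),
            ("Key Considerations", self.key_considerations),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Illustrative instance: a 3-bit population counter.
popcount = Crux(
    interface=[Port("in", "input", 3), Port("out", "output", 2)],
    core_functions=["Count the number of 1 bits in `in`."],
    key_considerations=["An all-zero input must produce an output of zero."],
)
brief = popcount.render()
print(brief)
```

Because the brief is ordinary text, the same string could in principle be pasted into the prompt of a different model, which is the usage the Transferability result below describes.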

B. Two-Stage Training Framework

The model is trained using a pipeline designed to first learn the mapping to CRUX and then optimize the generation of both CRUX and code.

Stage I: Joint Expression Modeling (Supervised Fine-Tuning - SFT)
- Data Construction: The authors augment the existing CodeV dataset by creating RealSpec. This involves simulating real-world user inputs by introducing variations, ambiguities, and "interface degradation" (randomly omitting port details) to force the model to infer missing information.
- Corpus Categorization: Data is split into Easy Questions (clear text), Normal Data, and Special Non-Text (tasks requiring diagrams, FSMs, or K-maps).
- CRUX Derivation: For complex tasks, specialized models (DeepSeek-R1, Qwen2.5-Coder) are used to extract and refine the CRUX components from descriptions and reference code.
- Training: The model is fine-tuned on triplets $(R, X, V)$, where $R$ is the RealSpec, $X$ is the CRUX, and $V$ is the target Verilog code. The model learns to generate the structured CRUX first, followed by the Verilog code.
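
The Stage I data steps can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: the function names, the `drop_prob` parameter, the rule of always keeping at least one port line, and the prompt/completion dictionary layout are all assumptions.

```python
import random


def degrade_interface(port_lines: list[str], drop_prob: float,
                      rng: random.Random) -> list[str]:
    """Randomly omit port descriptions to mimic an incomplete user request.

    Assumption: at least one port line is kept so the prompt stays usable.
    """
    if not port_lines:
        return []
    kept = [line for line in port_lines if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(port_lines)]


def build_sft_example(realspec: str, crux: str, verilog: str) -> dict[str, str]:
    """Format one (R, X, V) triplet: the prompt is R, the target is X then V."""
    return {"prompt": realspec, "completion": f"{crux}\n\n{verilog}"}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ports = ["input clk", "input rst", "input [7:0] d", "output [7:0] q"]
degraded = degrade_interface(ports, drop_prob=0.5, rng=rng)

example = build_sft_example(
    realspec="Build an 8-bit register. Ports: " + "; ".join(degraded),
    crux="Module Interface:\n- input clk\n- input rst\n- input [7:0] d\n- output [7:0] q",
    verilog="module reg8(input clk, input rst, input [7:0] d, output reg [7:0] q);\nendmodule",
)
```

Note that the CRUX in the target still lists every port even when the prompt omits some, which is what pushes the model to infer the missing interface details.
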
Stage II: Dual-Space Optimization (Reinforcement Learning)
- Algorithm: Uses GRPO (Group Relative Policy Optimization), a variant of PPO.
- Dual-Objective Reward: Unlike standard RL that only rewards final code correctness, this stage optimizes two interconnected spaces:
  1. Code-Reward: Measures functional correctness by comparing generated code against reference code using automated testbenches.
  2. CRUX-Reward: Measures the quality of the intermediate CRUX. It calculates the conditional log-likelihood of the reference code given the generated CRUX. A high score indicates the CRUX effectively narrows the solution space toward valid implementations.
- Goal: This encourages the model to generate CRUX that not only describes the intent but actively guides the generation of correct code.
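
A rough sketch of how the two rewards might feed GRPO's group-relative advantage is shown below. Only the general shape comes from the text above: a pass/fail code reward, a CRUX reward based on the log-likelihood of the reference code, and GRPO's normalization of rewards within a group of samples. The length normalization, the `exp` squashing, the `crux_weight` value, and the use of the population standard deviation are assumptions for illustration, and the token log-probabilities are taken as given rather than computed by a model.

```python
import math
from statistics import mean, pstdev


def crux_reward(ref_token_logprobs: list[float]) -> float:
    """Mean log-likelihood of the reference code tokens given a generated CRUX.

    Each entry is log p(v_t | R, X, v_<t) as scored by the policy model;
    the scoring itself is outside this sketch. Averaging over tokens is an
    assumption that keeps long and short references comparable.
    """
    return sum(ref_token_logprobs) / len(ref_token_logprobs)


def combined_reward(passed_testbench: bool, ref_token_logprobs: list[float],
                    crux_weight: float = 0.5) -> float:
    """Code reward plus a weighted CRUX reward (the weighting is hypothetical)."""
    code_r = 1.0 if passed_testbench else 0.0
    # exp() maps the mean log-probability into (0, 1].
    return code_r + crux_weight * math.exp(crux_reward(ref_token_logprobs))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled (CRUX, code) pairs for the same prompt.
rewards = [
    combined_reward(True, [-0.1, -0.1]),   # passes, CRUX makes reference likely
    combined_reward(False, [-2.0, -2.0]),  # fails, CRUX is uninformative
    combined_reward(True, [-1.0, -1.0]),   # passes, CRUX is less helpful
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two samples that both pass the testbench still receive different advantages when one CRUX makes the reference code more likely, which is the behavior the dual-objective reward is meant to encourage.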

3. Key Contributions

CRUX Intermediate Space: Introduction of a structured, three-part intermediate representation (Interface, Core Function, Key Considerations) that explicitly captures design intent and filters noise, bridging the gap between unstructured text and formal code.
RealSpec Dataset Construction: A novel data augmentation strategy that simulates realistic, noisy, and incomplete user prompts (including interface degradation) to improve model robustness.
Dual-Space Optimization: A novel RL training paradigm that treats the intermediate representation (CRUX) as a primary optimization target alongside the final code, rather than just a byproduct.
State-of-the-Art Performance: The proposed model, QiMeng-CRUX, achieves SOTA results across multiple benchmarks, outperforming both general-purpose code models and specialized Verilog models.

4. Experimental Results

The model was evaluated on VerilogEval-v1/v2 and RTLLM-v1/v2 benchmarks.

Overall Performance: QiMeng-CRUX (a 7B-parameter model) outperformed larger foundation models (e.g., GPT-4o, DeepSeek-V3) and specialized reasoning models in several categories.
Key Metrics:
- RTLLM-v2: Achieved a pass@1 of 63.8% (T=0), surpassing the previous SOTA (OriGen-7B at 50.9%) by 12.9 percentage points and reaching performance comparable to the much larger DeepSeek-R1-671B.
- VerilogEval-v2 (Spec-to-RTL): Improved pass@1 from 49.3% (baseline) to 64.7% (T=0) and 64.4% (T=0.8).
- Code-Completion: Improved pass@1 from 58.3% to 68.0%.
Ablation Studies:
- RealSpec: Improved robustness to prompt variations.
- CRUX: Provided significant gains in complex tasks (Spec-to-RTL) where understanding core intent is critical.
- CRUX-Reward: The dual-space RL optimization further refined performance, particularly in pass@1, by encouraging more precise solution spaces.
Transferability: Using CRUX as a prompt for other models (without retraining) significantly improved their performance, indicating that CRUX is a semantically meaningful and transferable guidance mechanism.
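
For readers unfamiliar with the pass@1 figures above, the sketch below shows the unbiased pass@k estimator commonly used by code-generation benchmarks: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its numbers exactly this way, and with how many samples per problem, is an assumption here; the per-problem counts in the usage lines are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples that passed).
toy_results = [(20, 5), (20, 20), (20, 0), (20, 15)]
score = benchmark_pass_at_k(toy_results, k=1)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a benchmark-level pass@1 is the average per-problem pass rate.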

5. Significance

Paradigm Shift: Moves away from direct "text-to-code" generation toward "text-to-structured-intent-to-code," acknowledging that hardware design requires explicit structural reasoning.
Efficiency: Unlike chain-of-thought (CoT) reasoning models, which require large context windows (16k+ tokens) and long inference times, QiMeng-CRUX achieves superior results with a compact context (max 4k tokens) and single-shot generation.
Practical Impact: The ability to handle "Spec-to-RTL" tasks with high accuracy makes this approach highly relevant for automating real-world hardware design, reducing the manual effort required to translate vague requirements into synthesizable Verilog.
Generalizability: Although the evaluation covers only Verilog, the CRUX concept suggests that structured intermediate representations may also help bridge the gap between natural language and other highly constrained domain-specific languages.