HDLFORGE: A Two-Stage Multi-Agent Framework for Efficient Verilog Code Generation with Adaptive Model Escalation

Imagine you are trying to build a complex Lego castle based on a vague description from a friend. You have two types of builders available:

The Fast Apprentice: A quick, cheap, and energetic builder who can throw together ideas fast but sometimes makes small mistakes or misses details.
The Master Architect: A brilliant, highly experienced expert who builds perfect structures but is slow, expensive, and only available for a short time.

HDLFORGE is a smart management system that figures out how to use these two builders to get the best castle possible without wasting money or time.

Here is how the system works, broken down into simple steps:

1. The Two-Stage Strategy (The "Apprentice First" Rule)

Most systems either hire the Master Architect for every job (which is too slow and expensive) or stick with the Fast Apprentice forever (which leads to a wobbly castle).

HDLFORGE uses a Two-Stage Approach:

Stage A (The Apprentice): The system starts by asking the Fast Apprentice (a medium-sized AI) to build the Verilog code (the "blueprint" for a computer chip). It tries to fix its own mistakes using quick, cheap checks.
Stage B (The Master Architect): The system only calls in the Master Architect (a huge, powerful AI) if the Apprentice is clearly stuck or if its blueprint scores too poorly on the quick checks. This ensures you only pay the "Master's fee" when absolutely necessary.

2. The "Smell Test" (The Escalation Controller)

How does the system know when to call the Master? It doesn't just guess. It uses a Calibrated Score, like a dashboard in a car.

Before asking for help, the system runs a series of quick "smoke tests" (like checking if the engine starts, if the lights work, and if there are any weird noises).

If the score is good, the Apprentice keeps working.
If the score is bad (e.g., the code won't compile or has too many errors), the system says, "Okay, the Apprentice is stuck. Time to call the Master."

This is the "Adaptive Escalation." It's like a manager who lets a junior employee try to solve a problem first, but immediately steps in with a senior expert if the junior is spinning their wheels.

3. The "Bug Hunter" (The Formal Agent)

This is the paper's coolest trick. When the Apprentice makes a mistake, the system doesn't just say "Fix it." It uses a special tool called a Counterexample-Guided Formal Agent.

Think of this as a Time-Traveling Detective:

When the Apprentice builds a wall that falls down, the Detective looks at exactly why it fell.
Instead of just fixing that one wall, the Detective writes a tiny, reusable "trap" (a micro-test) that catches that specific type of falling wall forever.
If the Apprentice tries to build the same bad wall again later, the trap snaps shut immediately, saving time.

This turns a single failure into a permanent lesson, so the system learns faster and makes fewer mistakes over time.

4. The Results: Fast, Cheap, and Accurate

The researchers tested this on three major benchmarks (like standardized tests for chip design). Here is what they found:

Speed vs. Accuracy: They got the best of both worlds. They were almost as accurate as using the Master Architect for every single task, but with roughly half the typical (median) wait time.
Portability: The "Manager" (the escalation controller) is so smart it can be attached to any existing AI coding system. It's like a universal remote control that can upgrade any old TV to a smart TV without needing to rebuild the TV itself.
Bug Detection: Because of the "Time-Traveling Detective" (the micro-tests), the system found and fixed bugs much faster than other methods, especially tricky ones like reset errors or logic loops.

The Big Picture

HDLFORGE is like a smart factory manager. It knows that not every problem requires a PhD-level engineer. It lets the junior staff try first, uses quick checks to see if they are struggling, and only brings in the expensive experts when needed. Plus, it keeps a "lesson learned" book for every mistake so the team gets smarter with every single project.

The result? We get high-quality computer chip designs faster and cheaper than ever before.

Here is a detailed technical summary of the paper "HDLFORGE: A Two-Stage Multi-Agent Framework for Efficient Verilog Code Generation with Adaptive Model Escalation."

1. Problem Statement

The adoption of Large Language Models (LLMs) for Hardware Description Language (HDL) code generation has improved productivity but faces significant challenges:

Accuracy vs. Cost Trade-off: Current systems typically fix the backbone model size. Using small models is fast but often produces syntax errors, functional bugs, and hallucinations. Using ultra-large models improves accuracy but incurs high computational costs and latency.
Inefficient Resource Usage: Existing multi-agent frameworks often iterate with the same model scale or lack a mechanism to dynamically decide when to invoke more expensive resources.
Verification Bottlenecks: Detecting bugs often requires running full official testbenches, which is time-consuming. Iterative repair without targeted feedback leads to slow convergence.

The core problem is how to optimize the trade-off between generation speed (latency) and code correctness (accuracy) without sacrificing the capabilities of large models, while minimizing the computational cost of verification.

2. Methodology: HDLFORGE Architecture

HDLFORGE is a two-stage multi-agent cascade framework that adaptively escalates from a medium-sized model to an ultra-large model only when necessary. It also integrates a counterexample-guided formal agent to accelerate bug detection.

A. Two-Stage Cascade

Stage A (Primary Solver): Uses a medium-sized LLM (e.g., Qwen-7B) as the default coder. It operates in a generate–judge–repair loop using lightweight diagnostics.
Stage B (Final Attempt): Invoked only when Stage A fails to converge. It uses an ultra-large cloud-based model (e.g., Claude 3.5 Sonnet) to generate a high-quality candidate based on the context of Stage A's failures.
Escalation Logic: A calibrated controller decides whether to escalate based on a score $Z$ derived from inexpensive diagnostic signals (compilation, linting, smoke tests, trace stability, and budget usage).
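
The control flow of the cascade can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `run_cascade`, `Attempt`, `CascadeResult`, `stage_a`, `stage_b`, and `evaluate` are hypothetical, the threshold and attempt limit are placeholder values, and Stage B is assumed to make a single attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Attempt:
    code: str
    passed: bool   # did the candidate pass its checks?
    score: float   # hypothetical calibrated score Z in [0, 1]


@dataclass
class CascadeResult:
    code: Optional[str]   # passing Verilog source, or None if both stages failed
    stage: str            # "A", "B", or "none"
    history: List[Attempt] = field(default_factory=list)


def run_cascade(
    spec: str,
    stage_a: Callable[[str, List[Attempt]], str],
    stage_b: Callable[[str, List[Attempt]], str],
    evaluate: Callable[[str], Attempt],
    tau: float = 0.5,        # placeholder threshold, not a value from the paper
    max_attempts: int = 3,   # placeholder attempt limit
) -> CascadeResult:
    history: List[Attempt] = []
    # Stage A: generate-judge-repair with the cheaper model.
    for _ in range(max_attempts):
        attempt = evaluate(stage_a(spec, history))
        history.append(attempt)
        if attempt.passed:
            return CascadeResult(attempt.code, "A", history)
        if attempt.score < tau:
            break  # diagnostics look poor: stop retrying and escalate early
    # Stage B: one final attempt that sees Stage A's failure history.
    final = evaluate(stage_b(spec, history))
    history.append(final)
    if final.passed:
        return CascadeResult(final.code, "B", history)
    return CascadeResult(None, "none", history)
```

Passing the failure history to `stage_b` mirrors the idea that the large model works from the context of Stage A's failures rather than starting from scratch.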

B. The Seven-Agent Workflow

The system coordinates seven specialized agents:

Planner ($A_{plan}$): Generates multiple high-level implementation strategies ($P$) from the natural language spec.
Coder ($A_{code}$): Generates candidate Verilog implementations based on the plans.
Judge & Smoke ($A_{judge}$): Performs rapid "smoke" checks (compilation, linting, and short simulations against a micro-test set) to filter out obviously broken candidates.
Simulation ($A_{sim}$): Runs the official testbench on the top-ranked candidate.
Tracer ($A_{trace}$): Upon failure, constructs an Abstract Syntax Tree (AST) and performs backtracing to identify a "suspect cone" of signals likely causing the error.
Reflexion ($A_{refl}$): Analyzes failure context (waveforms, suspect cone, plan invariants) to propose targeted repairs rather than regenerating the whole module.
Formal Amplifier ($A_{form}$): The framework's key innovation, this agent runs Bounded Model Checking (BMC) to find counterexamples. The resulting traces are converted into deterministic micro-tests (short, reusable testbenches) that are added to the smoke-test pool ($U_k$), so a candidate that repeats a known bug is caught by the cheap smoke checks instead of being rediscovered later.
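
The Tracer's backtracing step can be illustrated with a small sketch. It assumes the AST has already been reduced to a fan-in map (each signal mapped to the signals that drive it); that map, the function name `suspect_cone`, and the optional depth limit are assumptions made for illustration, not details from the paper.

```python
from collections import deque
from typing import Dict, Optional, Set


def suspect_cone(
    fanin: Dict[str, Set[str]],
    failing: str,
    max_depth: Optional[int] = None,
) -> Set[str]:
    """Collect every signal that can influence `failing`, walking drivers backward."""
    cone = {failing}
    queue = deque([(failing, 0)])
    while queue:
        signal, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for driver in fanin.get(signal, ()):
            if driver not in cone:  # the visited check also handles feedback loops
                cone.add(driver)
                queue.append((driver, depth + 1))
    return cone


# Hypothetical design: `out` depends on `sum` and `sel`; `unused` is unrelated.
fanin = {
    "out": {"sum", "sel"},
    "sum": {"a", "b"},
    "sel": {"mode"},
    "unused": {"c"},
}
cone = suspect_cone(fanin, "out")
```

Here `cone` contains `out`, `sum`, `sel`, `a`, `b`, and `mode`, while `unused` and `c` are excluded, which is what lets a repair step focus on a small part of the module.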

C. Adaptive Escalation Controller

The controller computes a diagnostic vector $s = [s_{comp}, s_{lint}, s_{smoke}, s_{trace}, s_{budget}]$.

Signals:
- $s_{comp}$: Compilation success.
- $s_{lint}$: Code quality (normalized warning count).
- $s_{smoke}$: Fraction of cycles matching expected values in short simulations.
- $s_{trace}$: Stability of failure locations across attempts.
- $s_{budget}$: Remaining attempts allowed.
Decision: A logistic regression model maps $s$ to the calibrated score $Z$, the predicted probability that Stage A will succeed. If $Z$ falls below a threshold $\tau$, or the attempt limit is reached, the system escalates to Stage B.
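
As a rough sketch of how such a controller might be wired up, the snippet below scores a diagnostic vector with a logistic function and applies the two escalation conditions. The weights, bias, and the assumption that every signal is scaled to $[0, 1]$ are made up for illustration; in practice the coefficients would come from fitting the model, and nothing here reproduces the paper's actual values.

```python
import math
from typing import Sequence


def smoke_fraction(observed: Sequence[int], expected: Sequence[int]) -> float:
    """s_smoke: fraction of simulated cycles whose output matches the expected value."""
    if not expected:
        return 0.0
    matches = sum(1 for o, e in zip(observed, expected) if o == e)
    return matches / len(expected)


def escalation_score(s: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic score Z = sigmoid(w . s + b), read as P(Stage A succeeds)."""
    if len(s) != len(weights):
        raise ValueError("signal vector and weight vector must have the same length")
    logit = bias + sum(w * x for w, x in zip(weights, s))
    return 1.0 / (1.0 + math.exp(-logit))


def should_escalate(z: float, tau: float, attempts_used: int, max_attempts: int) -> bool:
    """Escalate when the score is below the threshold or the attempt budget is spent."""
    return z < tau or attempts_used >= max_attempts


# ASSUMPTION: illustrative coefficients only, ordered as
# [s_comp, s_lint, s_smoke, s_trace, s_budget].
WEIGHTS = (3.0, -1.0, 4.0, -0.5, 1.0)
BIAS = -3.0

healthy = (1.0, 0.0, 1.0, 0.0, 1.0)   # compiles, no warnings, smoke tests pass
stuck = (0.0, 1.0, 0.0, 1.0, 0.0)     # does not compile, budget exhausted

z_healthy = escalation_score(healthy, WEIGHTS, BIAS)
z_stuck = escalation_score(stuck, WEIGHTS, BIAS)
```

With these placeholder coefficients, `z_healthy` is close to 1 and `z_stuck` is close to 0, so only the second case would trigger escalation at a threshold such as 0.5.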

D. Portability

The escalation logic is designed as a portable controller that can wrap existing Verilog LLM pipelines (like AutoVCoder or VerilogCoder) without modifying their internal prompts or retrieval mechanisms.
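
One way to picture this portability is a wrapper that treats the existing pipeline as a black box. The sketch below is hypothetical: `with_escalation`, `base`, `fallback`, and `diagnose` are illustrative names, and the wrapped callables stand in for a real pipeline and a larger model rather than for any actual AutoVCoder or VerilogCoder API.

```python
from typing import Callable, Tuple

Generator = Callable[[str], str]                  # spec -> Verilog source
Diagnose = Callable[[str], Tuple[bool, float]]    # code -> (passed, score Z)


def with_escalation(
    base: Generator,
    fallback: Generator,
    diagnose: Diagnose,
    tau: float = 0.5,        # placeholder threshold
    max_attempts: int = 3,   # placeholder attempt limit
) -> Generator:
    """Wrap `base` so that `fallback` is only called when diagnostics look poor."""

    def wrapped(spec: str) -> str:
        for _ in range(max_attempts):
            code = base(spec)             # the wrapped pipeline is called unchanged
            passed, z = diagnose(code)
            if passed:
                return code
            if z < tau:
                break                     # poor score: stop retrying the base pipeline
        return fallback(spec)             # escalate to the larger model

    return wrapped
```

Because the wrapper only sees inputs and outputs, the base pipeline's prompts and retrieval logic stay untouched.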

3. Key Contributions

HDLFORGE Framework: A novel two-stage architecture that dynamically trades latency for accuracy by escalating from a medium-sized model (e.g., a 7B model) to an ultra-large model only when diagnostics indicate Stage A is struggling.
Portable Escalation Controller: A plug-and-play decision layer that improves the speed–accuracy trade-off of existing Verilog generators without retraining their backbone models.
CEGIS-Style Micro-Test Amplifier: A formal agent that converts bounded-model-checking counterexamples into reusable micro-tests. This significantly reduces the number of repair iterations and wall-clock time by catching specific bug patterns early.
Closed-Loop Multi-Agent System: A design where agents interact solely through tool-level signals (scores, traces, tests), creating a modular and efficient workflow distinct from prior monolithic or loosely coupled approaches.
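
To make the micro-test idea concrete, here is a minimal sketch of a pool that stores counterexample traces as deduplicated, replayable tests and renders one as a Verilog stimulus fragment. It assumes a counterexample arrives as a per-cycle list of input values and expected output values, and that the testbench has a clock named `clk`; the names `MicroTest`, `MicroTestPool`, and `render_stimulus` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Step = Tuple[Dict[str, int], Dict[str, int]]   # (inputs driven, outputs expected) per cycle
Assign = Tuple[Tuple[str, int], ...]
FrozenStep = Tuple[Assign, Assign]


@dataclass(frozen=True)
class MicroTest:
    name: str
    steps: Tuple[FrozenStep, ...]


def _freeze(trace: List[Step]) -> Tuple[FrozenStep, ...]:
    """Turn a trace into a hashable, order-normalized key."""
    return tuple(
        (tuple(sorted(inputs.items())), tuple(sorted(expected.items())))
        for inputs, expected in trace
    )


class MicroTestPool:
    def __init__(self) -> None:
        self._tests: Dict[Tuple[FrozenStep, ...], MicroTest] = {}

    def add_counterexample(self, name: str, trace: List[Step]) -> bool:
        """Store the trace as a micro-test; return False if it is already known."""
        key = _freeze(trace)
        if key in self._tests:
            return False
        self._tests[key] = MicroTest(name, key)
        return True

    def tests(self) -> List[MicroTest]:
        return list(self._tests.values())

    def __len__(self) -> int:
        return len(self._tests)


def render_stimulus(test: MicroTest) -> str:
    """Render a micro-test as a Verilog `initial` block (a fragment, not a full testbench)."""
    lines = [f"// micro-test: {test.name}", "initial begin"]
    for inputs, expected in test.steps:
        lines += [f"  {sig} = {val};" for sig, val in inputs]
        lines.append("  @(posedge clk); #1;")
        for sig, val in expected:
            lines.append(
                f'  if ({sig} !== {val}) begin $display("FAIL {test.name}: {sig}"); $finish; end'
            )
    lines += ['  $display("PASS");', "  $finish;", "end"]
    return "\n".join(lines)
```

Keying the pool on the normalized trace means the same counterexample found twice adds only one test, so the smoke stage stays short as the pool grows.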

4. Experimental Results

The framework was evaluated on VerilogEval Human, VerilogEval V2, and RTLLM benchmarks.

Accuracy vs. Latency:
- HDLFORGE-Qwen (Stage A: Qwen-7B, Stage B: Claude 3.5) achieved 91.2% Pass@1 on VerilogEval Human and 91.8% on V2. This is a dramatic improvement over other 7B-based systems (e.g., AutoVCoder at 48.5%) and comes with roughly 50% lower median latency compared to single-stage large models.
- HDLFORGE-GPT4o (Stage A: GPT-4o, Stage B: Claude 3.5) achieved 95.5% Pass@1 on VerilogEval Human and 99.8% Pass@5 on RTLLM, outperforming systems like MAGE and CoopetitiveV that rely on repeated calls to large models.
Portability: Wrapping existing systems (AutoVCoder, VerilogCoder) with the HDLFORGE controller improved their Pass@1 by 3–5 percentage points with less than a 10% increase in mean time-to-pass.
Ablation Studies: Removing any single component (Judge, Tracer, Reflexion, or micro-tests) resulted in a 4–5% drop in Pass@1 and increased median latency, indicating that each component contributes and none is redundant.
Bug-Injection Benchmark: The micro-test amplifier increased bug detection rates from ~64% (baselines) to 95%, reduced repair iterations from 7.0 to 3.0, and lowered wall-clock time by ~17%.

5. Significance

HDLFORGE moves AI-driven hardware design away from "one-size-fits-all" model scaling.

Efficiency: It demonstrates that high accuracy can be achieved without always invoking the most expensive models, making advanced HDL generation more accessible and cost-effective.
Verification Integration: By integrating formal methods (BMC) directly into the generation loop via micro-tests, it addresses the "hallucination" problem in hardware design more effectively than pure simulation.
Scalability: The portable controller design allows the framework to enhance legacy or third-party Verilog generation tools, offering a path to upgrade existing industrial pipelines without full re-engineering.

In summary, HDLFORGE provides a robust, adaptive, and efficient solution for generating correct Verilog code, balancing the need for speed in early iterations with the precision of large models for difficult tasks.