QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Imagine you are hiring a team of brilliant architects (the AI models) to build houses. But there's a catch: you don't just want any house; you want them to build the exact same house using three completely different sets of blueprints and construction tools: Qiskit (like building with Lego), Cirq (like building with wooden blocks), and PennyLane (like building with clay).

The big question the paper asks is: Can these AI architects build the same correct house using all three different toolkits, or do they just get confused when the tools change?

Here is the story of QuanBench+, broken down into simple parts:

1. The Problem: "The Toolbox Trap"

Previously, researchers tested AI on quantum code (programs written for quantum computers) using only one toolkit at a time.

The Flaw: If an AI failed, nobody knew if it was because the AI didn't understand the math of quantum physics, or if it just didn't know how to use that specific toolkit.
The Analogy: It's like testing a chef only on Italian recipes. If they fail at making pasta, is it because they can't cook, or because they've never seen a pasta machine?

QuanBench+ fixes this by giving the AI the same cooking challenge but asking them to solve it using three different sets of kitchen tools. This separates "cooking skill" (quantum reasoning) from "tool familiarity" (knowing the specific software).

2. The Test: 42 Quantum Challenges

The researchers created a test with 42 tasks in three categories:

Quantum Algorithms: Solving complex puzzles.
Gate Decomposition: Breaking big moves into tiny steps.
State Preparation: Setting up the ingredients before cooking.

They asked various top-tier AI models to write code for these tasks in all three toolkits (Qiskit, Cirq, PennyLane).

3. The Results: The "Easy," "Medium," and "Hard" Modes

The results were revealing. The AI models didn't perform equally across all three toolkits.

Qiskit (The Lego Set): This was the easiest for the AI. It's like the most popular toy; the AI has seen it a million times in its training data. The best AI got about 60% of the tasks right on the first try.
Cirq (The Wooden Blocks): This was medium difficulty. The best AI scored around 55% on the first try.
PennyLane (The Clay): This was the hardest. Even the best AI scored only about 43% on the first try.

The Big Takeaway: The AI isn't a master quantum physicist yet. It's more like a student who has memorized the answers for one specific textbook (Qiskit) but gets lost when the teacher switches to a different textbook (PennyLane). The "intelligence" is still tied to knowing the specific rules of the tool, not the underlying logic.

4. The "Do-Over" Button (Feedback Repair)

The researchers didn't just let the AI fail and move on. They gave it a second chance.

The Setup: If the AI wrote code that crashed or gave the wrong answer, the computer told the AI, "Hey, this broke. Here is the error message. Try again."
The Result: This "Do-Over" button worked wonders!
- In the easy mode (Qiskit), the best score jumped from about 60% to 83%.
- In the hard mode (PennyLane), the best score rose from about 43% to 67%.

The Metaphor: It's like a student taking a test. If they get a question wrong, and the teacher says, "You missed the sign here, try again," the student often fixes it. However, if the student still gets it wrong after the hint, it usually means they didn't understand the concept, not just the syntax.

5. The Final Verdict

The paper concludes with two main points:

Progress is Real: AI models are getting better at writing quantum code. With a little help (feedback), they can fix many of their own mistakes.
The Gap Remains: We are not there yet. The AI still relies heavily on memorizing specific software rules rather than truly "thinking" in quantum mechanics. If you change the tools, the AI often stumbles.

In a nutshell:
Imagine teaching a robot to drive. Right now, the robot is great at driving a Toyota (Qiskit) because it has seen thousands of them. It's okay at driving a Ford (Cirq). But if you put it in a Ferrari (PennyLane), it panics.

QuanBench+ is the driving test that proves the robot needs to learn the principles of driving (physics and logic), not just memorize the buttons on one specific car dashboard. We are getting there, but we still have a long road ahead.

1. Problem Statement

While Large Language Models (LLMs) have demonstrated strong capabilities on classical code-generation benchmarks (e.g., HumanEval), their performance in quantum code generation remains largely unexplored across diverse software ecosystems. Current benchmarks are predominantly single-framework (e.g., only Qiskit or only PennyLane), creating a critical evaluation gap:

Confounding Factors: It is difficult to distinguish whether a model's failure stems from a lack of quantum reasoning (conceptual errors in algorithms) or a lack of framework familiarity (incorrect API usage, missing imports, or simulator misuse).
Probabilistic Nature: Unlike classical code, quantum programs produce probabilistic measurement statistics. Standard deterministic evaluation metrics (like exact output matching) are insufficient; correctness must be defined via distributional agreement.
Need for Generalization: Developers need to know if models can generate correct quantum logic that is portable across different abstractions (Qiskit, Cirq, PennyLane), rather than just memorizing one specific API.

2. Methodology

The authors introduce QuanBench+, a unified benchmark designed to isolate framework familiarity from quantum reasoning by holding the task intent constant while varying the target framework.

A. Benchmark Construction

Task Set: Derived from the original QuanBench, the benchmark consists of 42 aligned tasks categorized into:
- Quantum Algorithms (31 tasks)
- Gate Decomposition (5 tasks)
- State Preparation (6 tasks)
Frameworks: Tasks are adapted for Qiskit, PennyLane, and Cirq.
Prompt Standardization: Prompts are modified to enforce framework-specific imports and conventions while preserving the functional goal. Models are instructed to output code only, without explanations.
Canonical Solutions: A unified set of reference solutions is created for all frameworks to ensure fair grading.

B. Evaluation Metrics

The paper moves beyond simple pass/fail checks to address the probabilistic nature of quantum outputs:

Pass@k: The primary metric, measuring the probability that at least one of $k$ generated solutions is correct. The paper reports Pass@1 and Pass@5.
KL-Divergence Acceptance: For tasks with probabilistic outputs, correctness is determined by comparing the model's output distribution ($Q$) against the canonical reference distribution ($P$). A solution is accepted if the Kullback-Leibler (KL) divergence $D_{KL}(P \| Q)$ is below a calibrated threshold of 0.05.
- Note: The authors explicitly exclude Process Fidelity (unitary overlap) as a primary metric. They argue that different circuit structures can be functionally equivalent (producing the same measurement statistics) despite having low unitary overlap, making fidelity a source of false negatives.
Feedback-Based Repair (Pass@1 FB): A multi-turn evaluation where the model is allowed to revise its code (up to 5 times) after receiving runtime exceptions or incorrect output distributions.
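The KL-divergence acceptance rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the natural logarithm, the `eps` smoothing for outcomes the candidate never produces, and the helper names (`kl_divergence`, `counts_to_distribution`, `is_accepted`) are choices made here and are not taken from the paper. Only the 0.05 threshold and the direction $D_{KL}(P \| Q)$ come from the summary.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats for two outcome -> probability dicts."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0.0:
            continue  # terms with P(x) = 0 contribute nothing
        # Assumption: clip missing outcomes to eps so the log stays finite.
        q_x = max(q.get(outcome, 0.0), eps)
        total += p_x * math.log(p_x / q_x)
    return total

def counts_to_distribution(counts):
    """Turn raw measurement counts into relative frequencies."""
    shots = sum(counts.values())
    return {outcome: n / shots for outcome, n in counts.items()}

def is_accepted(reference, candidate, threshold=0.05):
    """Accept the candidate if D_KL(reference || candidate) < threshold."""
    return kl_divergence(reference, candidate) < threshold

# Hypothetical example: ideal Bell-state statistics as the reference P.
bell_reference = {"00": 0.5, "11": 0.5}
good = counts_to_distribution({"00": 510, "11": 490})  # shot noise only
bad = counts_to_distribution({"00": 1000})             # never produces "11"

print(is_accepted(bell_reference, good))
print(is_accepted(bell_reference, bad))
```

The small sampling noise in `good` yields a divergence far below 0.05, while `bad` omits an outcome the reference requires and is rejected.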

C. Experimental Setup

Models: Evaluated a diverse set of frontier and open-weight LLMs (e.g., GPT-5.1, Gemini-3-Pro, Claude-3.7, DeepSeek-R1, Llama-4).
Environment: Controlled Python environment (Python 3.10) with specific framework versions (Qiskit v0.46.0, Cirq v1.6.1, PennyLane v0.43.1).
Conditions: Tested under One-Shot generation (greedy decoding for Pass@1, sampling for Pass@5), both with Prefill (imports and function signatures supplied) and without it (No-Prefill).
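The setup leaves implicit how a sampled Pass@5 number is computed. A minimal sketch, assuming the unbiased estimator commonly used with HumanEval-style benchmarks; the summary does not state which estimator or how many samples per task QuanBench+ uses, so the function name `pass_at_k` and the per-task counts below are purely illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate Pass@k for one task from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct solution.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results: (samples drawn, samples that passed).
per_task = [(10, 3), (10, 0), (10, 10)]

# The benchmark-level score is the mean of the per-task estimates.
score = sum(pass_at_k(n, c, 5) for n, c in per_task) / len(per_task)
print(f"Pass@5 = {score:.3f}")
```

With greedy decoding there is a single deterministic sample per task, so Pass@1 reduces to the plain fraction of tasks solved.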

3. Key Contributions

Unified Multi-Framework Benchmark: The first benchmark to evaluate quantum code generation across Qiskit, PennyLane, and Cirq with aligned tasks, enabling the separation of reasoning errors from framework errors.
Probabilistic Evaluation Standard: Established a rigorous grading pipeline using executable functional tests and KL-divergence thresholds, moving away from fidelity-based metrics that penalize syntactically different but functionally equivalent circuits.
Feedback Loop Analysis: Introduced a systematic evaluation of iterative repair, quantifying how much performance can be recovered through error feedback.
Comprehensive Empirical Study: Provided a granular analysis of model performance, error types (semantic vs. syntactic), and the impact of prompt scaffolding (prefill).
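The argument behind the second contribution, that unitary overlap can reject functionally equivalent circuits, is easy to check by hand on one qubit. The sketch below assumes the usual overlap formula $|\mathrm{Tr}(U^\dagger V)|^2 / d^2$ for process fidelity (the summary does not give the paper's exact definition) and compares a Hadamard gate with a Hadamard followed by Z. Both circuits are chosen here for illustration and are not benchmark tasks.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def process_fidelity(u, v):
    """Assumed definition: |Tr(U^dagger V)|^2 / d^2 for d x d unitaries."""
    d = len(u)
    product = matmul(dagger(u), v)
    trace = sum(product[i][i] for i in range(d))
    return abs(trace) ** 2 / d ** 2

def output_probabilities(u):
    """Computational-basis probabilities after applying u to |0...0>."""
    return [abs(row[0]) ** 2 for row in u]

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]   # Hadamard
Z = [[1, 0], [0, -1]]   # Pauli-Z
ZH = matmul(Z, H)       # apply H first, then Z

print(output_probabilities(H), output_probabilities(ZH))
print(process_fidelity(H, ZH))
```

Both circuits measure 0 and 1 with equal probability, yet their unitary overlap is zero, so a fidelity-based grader would mark one as wrong against the other while a distribution-based grader would accept both.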

4. Key Results

The experiments reveal three dominant patterns:

A. Framework Asymmetry (RQ1)

Difficulty Hierarchy: Qiskit is consistently the easiest framework (highest scores), followed by Cirq, with PennyLane being the most difficult.
Top Scores (One-Shot Pass@1):
- Qiskit: 59.5% (Gemini-3-Pro)
- Cirq: 54.8% (Gemini-3-Pro)
- PennyLane: 42.9% (GPT-5.1)
Implication: Performance is heavily dependent on framework-specific familiarity. No single model dominates across all three, suggesting current LLMs rely on API recall rather than portable quantum reasoning.
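The percentages above are easier to compare as task counts. A quick arithmetic check, assuming each greedy Pass@1 score is simply the fraction of the 42 tasks solved (an inference from the numbers, not something the summary states); `implied_solved` is a hypothetical helper.

```python
TOTAL_TASKS = 42

# Top one-shot Pass@1 scores reported above, in percent.
reported = {"Qiskit": 59.5, "Cirq": 54.8, "PennyLane": 42.9}

def implied_solved(score_percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks consistent with a percentage score."""
    return round(score_percent / 100 * total)

for framework, score in reported.items():
    solved = implied_solved(score)
    recomputed = round(100 * solved / TOTAL_TASKS, 1)
    print(f"{framework}: {solved}/{TOTAL_TASKS} tasks = {recomputed}%")
```

Under that assumption the gap between the best Qiskit and best PennyLane results corresponds to 7 of the 42 tasks.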

B. Impact of Prefill (RQ2)

Providing framework-specific boilerplate (imports, signatures) significantly helps smaller and mid-tier models by reducing interface friction.
However, prefill does not solve deep semantic reasoning errors. Stronger models show diminishing returns from prefill, indicating that the "hard" failures are conceptual, not just syntactic.
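To make the Prefill versus No-Prefill distinction concrete, here is a hypothetical prompt builder. The wording of the instruction, the `solve` function name, and the signatures are invented for illustration; only the idea of supplying framework imports and a function signature comes from the summary. The import lines are the conventional ones for each framework.

```python
# Hypothetical scaffolds: framework imports plus a function signature.
PREFILL = {
    "Qiskit": "from qiskit import QuantumCircuit\n\ndef solve() -> QuantumCircuit:\n",
    "Cirq": "import cirq\n\ndef solve() -> cirq.Circuit:\n",
    "PennyLane": "import pennylane as qml\n\ndef solve():\n",
}

def build_prompt(task, framework, prefill):
    """Assemble a prompt for one task, optionally with a code scaffold."""
    prompt = (
        f"Task: {task}\n"
        f"Target framework: {framework}\n"
        "Output code only, without explanations.\n"
    )
    if prefill:
        # The model continues from the scaffold instead of starting cold.
        prompt += "\n" + PREFILL[framework]
    return prompt

task = "Prepare a two-qubit Bell state and return the circuit."
print(build_prompt(task, "Cirq", prefill=True))
print(build_prompt(task, "Cirq", prefill=False))
```

The task intent is identical in both prompts. Only the interface scaffolding changes, which is why prefill can remove import and signature mistakes but cannot repair a wrong algorithm.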

C. Feedback-Based Repair (RQ3)

Iterative repair dramatically improves performance across all frameworks.
Top Scores with Repair (Pass@1 FB):
- Qiskit: 83.3% (GPT-5.1)
- Cirq: 76.2% (Gemini-3-Pro)
- PennyLane: 66.7% (GPT-5.1)
Error Analysis: Feedback effectively fixes surface-level errors (syntax, missing imports, runtime exceptions). However, the residual errors after repair are overwhelmingly semantic (wrong logic, incorrect algorithmic structure), accounting for ~75% of remaining failures.
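The repair protocol can be sketched as a simple loop. This is an outline under assumptions, not the authors' harness: `generate` and `evaluate` are hypothetical callables standing in for the model call and the execution-plus-grading step, and "up to 5 revisions" is read here as one initial attempt followed by at most five retries.

```python
from typing import Callable, Optional, Tuple

def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_revisions: int = 5,
) -> Tuple[bool, int]:
    """Return (solved, revisions_used) for one task."""
    feedback: Optional[str] = None
    for revision in range(max_revisions + 1):  # attempt 0 is the first try
        code = generate(task, feedback)
        passed, feedback = evaluate(code)  # feedback: error text or mismatch note
        if passed:
            return True, revision
    return False, max_revisions

# Stubs for demonstration only: the "model" fixes its bug once it sees feedback.
def stub_generate(task: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "buggy"

def stub_evaluate(code: str) -> Tuple[bool, str]:
    if code == "fixed":
        return True, ""
    return False, "NameError: name 'qc' is not defined"

print(repair_loop(stub_generate, stub_evaluate, "bell_state"))
```

A surface error such as the undefined name in the stub is recoverable from the error message alone. A semantic error gives the model far less to work with, which matches the observation that most residual failures after repair are semantic.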

5. Significance and Conclusion

Current State: Modern LLMs can generate plausible quantum code, but reliable, multi-framework quantum code generation remains unsolved. Success is currently driven more by exposure to specific framework APIs than by generalizable quantum reasoning.
Future Directions: The paper argues that future progress requires more than just scaling model size. It necessitates:
- Better exposure to diverse quantum software data.
- Enhanced compositional reasoning capabilities.
- Tighter alignment with framework-specific execution patterns.
Resource: QuanBench+ provides a reproducible, practical foundation for evaluating the next generation of quantum-aware LLMs, moving the field beyond single-framework silos.

In summary, QuanBench+ demonstrates that while LLMs are making progress in quantum programming, they are still heavily "framework-bound." The gap between generating code that looks right and code that works correctly across different ecosystems remains a significant challenge, particularly for probabilistic correctness and deep algorithmic reasoning.