A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG

This paper introduces PennyLang, a high-quality dataset of 3,347 PennyLane-specific quantum code samples, along with an automated construction framework and a RAG-based evaluation demonstrating that retrieval augmentation significantly enhances LLM performance in quantum code generation by increasing success rates and reducing hallucinations.

Original authors: Abdul Basit, Nouhaila Innan, Muhammad Haider Asif, Minghao Shao, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

Published 2026-04-20

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a brilliant but very new apprentice how to build a complex machine using a very specific, rare set of tools called PennyLane.

The problem? The apprentice (an Artificial Intelligence, or "Large Language Model") is incredibly smart and has read millions of books about general tools. But when it comes to these specific PennyLane tools, it's a bit lost. It doesn't have a good manual, and the few examples it found online are scattered, messy, or written in a language it doesn't quite understand.

This paper is about building that perfect manual and giving the apprentice the right tools to learn.

Here is the breakdown of what the researchers did, using some everyday analogies:

1. The Problem: The "Lost in Translation" Moment

Think of quantum computing as a new, high-tech language. There are two main "dialects" (frameworks) people use to speak it: Qiskit and PennyLane.

Qiskit is like a popular, well-established city. It has a huge library, a dedicated tutor (AI assistants), and thousands of students.
PennyLane is like a brilliant, specialized village. It's amazing for mixing quantum physics with machine learning, but it's a bit more isolated. It lacks a dedicated "tutor" AI, and the books (datasets) are scattered in attics, basements, and different languages.

The researchers realized that if they want AI to help write code for this "village," they first need to gather all the scattered notes, books, and code snippets into one organized library.

2. The Solution: Building "PennyLang" (The Master Library)

The team created a curated, high-quality dataset called PennyLang.

The Collection Process: Imagine a team of librarians going into three different places:
1. GitHub: The "public garage" where developers share their code. They sifted through thousands of files to find only the ones actually using PennyLane.
2. Textbooks: They scanned two major quantum computing books, pulling out the code examples and writing clear notes next to them.
3. Official Documentation: They went to the official PennyLane website and collected examples from its tutorials and API references, turning the instructions into clear, step-by-step guides.
The Cleanup: They didn't just dump everything in a pile. They acted like editors:
- They removed duplicates (like having two copies of the same recipe).
- They fixed the formatting (making sure all the code looks neat and follows the rules).
- They added "context." Instead of just giving the AI a code snippet, they added a note saying, "Here is the code, and here is what it does in plain English."

The result? A library of 3,347 perfectly organized, annotated examples.

3. The Method: The "RAG" Strategy (The Smart Search Engine)

Now, how do you teach the AI using this library? You don't just force the AI to memorize the whole library (which is too big and confusing). Instead, you use a strategy called RAG (Retrieval-Augmented Generation).

The Analogy:
Imagine the AI is a chef trying to cook a specific dish (write quantum code).

Without RAG: The chef tries to cook from memory. If they haven't seen this specific recipe before, they might guess the ingredients and end up with a burnt mess (this is called a "hallucination").
With RAG: Before the chef starts cooking, they are allowed to walk over to the library, search for the specific recipe they need, and bring that page back to the kitchen. They read the recipe while they cook.

In this paper, the researchers built a system where the AI can instantly search the PennyLang library, find the most relevant examples, and use them to help write the code.

4. The Results: The "Aha!" Moment

The researchers tested this system on different AI models, from open-source ones (like Qwen and LLaMa) to commercial ones (like GPT-4o Mini, GPT-5 Mini, Claude, and Gemini).

The Open-Source Models (The Eager Students): These models were like students who hadn't studied quantum physics much yet. When they were allowed to use the "PennyLang Library" (RAG), their performance skyrocketed.
- Example: Qwen 2.5 7B went from passing about 9% of the test cases without the library to about 42% with it (8.7% to 41.7%, counting a test as passed if any of five attempts worked). That's a huge jump!
The Commercial Models (The Experts): The big, expensive models were already very good because they had already "read" a lot of the internet during their training. They didn't need the library as much. In fact, giving them the full retrieved context sometimes lowered their scores slightly. Most models worked best when given a more focused amount of help (75% of the context), not all of it.

5. Why This Matters

This paper matters for two reasons:

It narrows the gap: It gives open-source AI models a substantial boost on this specific task, although the strongest commercial models still lead.
It opens the door for everyone: By creating a clean, organized dataset, they are making it much easier for developers to use AI to write quantum code. It's like giving everyone a clear map instead of a messy pile of notes.

In a nutshell: The researchers took a messy, scattered collection of quantum code, organized it into a perfect library, and showed that when AI is allowed to "look up" answers in this library while it works, it becomes much smarter, makes fewer mistakes, and writes better code. They are essentially building the "Google" for PennyLane quantum programming.

1. Problem Statement

While Large Language Models (LLMs) have revolutionized classical code generation, their application to quantum software development remains limited, particularly within the PennyLane ecosystem.

The Gap: Unlike IBM's Qiskit, which has dedicated AI assistants and structured datasets, PennyLane (a framework for hybrid quantum-classical computing and quantum machine learning) lacks high-quality, curated datasets for training LLMs.
Data Scarcity: Existing quantum code is scattered across GitHub, textbooks, and documentation without consistent formatting, contextual annotations, or instruction-response structures required for effective LLM training.
Consequence: This scarcity leads to poor performance in LLMs when generating PennyLane code, characterized by high rates of hallucination, syntax errors, and a lack of functional correctness in complex hybrid workflows.

2. Methodology

The authors propose a comprehensive pipeline involving dataset construction, refinement, and a Retrieval-Augmented Generation (RAG) evaluation framework.

A. Dataset Construction (PennyLang)

The team curated PennyLang, a dataset of 3,347 high-quality, PennyLane-specific code samples.

Sources:
- GitHub: 1,952 samples from unofficial repositories and 1,321 from the official PennyLaneAI repository (filtered for main branches, Python files, and permissive licenses).
- Textbooks: 21 samples extracted from two quantum computing books, manually verified for accuracy.
- Documentation: 53 samples from official PennyLane tutorials and API references.
Refinement & Annotation:
- Cleaning: Removal of duplicates, non-PennyLane snippets, and metadata (licenses, author info).
- Formatting: Standardized to PEP 8 guidelines.
- Enrichment: Used GPT-4o to convert raw code snippets into structured instruction-response pairs with contextual descriptions and explanatory comments.
- Tokenization: Applied left-padding and attention masking strategies compatible with causal decoder-only transformers (e.g., GPT-4o, LLaMa).
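
The cleaning and enrichment steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' pipeline. The helper names, the whitespace-insensitive hash used to detect duplicates, the record field names, and the sample file are all assumptions made for this example. In the paper the natural-language instruction is generated with GPT-4o; here it is hard-coded, and the PEP 8 formatting step is omitted.

```python
import hashlib
import re

# Hypothetical raw file, as it might be scraped from a repository.
RAW_BELL = '''# Copyright (c) Example Author
# Licensed under the Apache License 2.0

import pennylane as qml

dev = qml.device("default.qubit", wires=2)


@qml.qnode(dev)
def bell_state():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])
'''


def uses_pennylane(source):
    """Return True if the file imports PennyLane."""
    pattern = r"^\s*(import pennylane|from pennylane\b)"
    return re.search(pattern, source, re.MULTILINE) is not None


def strip_header_comments(source):
    """Drop the leading comment block (license text, author lines)."""
    lines = source.splitlines()
    i = 0
    while i < len(lines) and (lines[i].startswith("#") or not lines[i].strip()):
        i += 1
    return "\n".join(lines[i:]).strip() + "\n"


def fingerprint(source):
    """Hash the code with all whitespace removed (assumed heuristic),
    so copies that differ only in indentation or spacing collide."""
    return hashlib.sha256("".join(source.split()).encode("utf-8")).hexdigest()


def curate(raw_files):
    """Filter non-PennyLane files, strip headers, and drop duplicates."""
    seen, kept = set(), []
    for raw in raw_files:
        if not uses_pennylane(raw):
            continue
        code = strip_header_comments(raw)
        key = fingerprint(code)
        if key not in seen:
            seen.add(key)
            kept.append(code)
    return kept


def to_record(instruction, code):
    """Pair a plain-English instruction with the cleaned code (assumed schema)."""
    return {"instruction": instruction, "response": code}


raw_files = [
    RAW_BELL,
    RAW_BELL.replace("    ", "\t"),          # same code, tab-indented copy
    "import numpy as np\nprint(np.pi)\n",    # not PennyLane, filtered out
]
samples = curate(raw_files)
record = to_record(
    "Prepare a two-qubit Bell state and return the measurement probabilities.",
    samples[0],
)
print(len(samples), "sample kept")
```

Of the three input files, only one survives: the NumPy-only file is filtered out and the tab-indented copy is treated as a duplicate. A real pipeline would likely need a more careful notion of "duplicate" than whitespace-insensitive hashing.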

B. Evaluation Framework (RAG Pipeline)

To assess the dataset's utility, the authors implemented a Retrieval-Augmented Generation (RAG) pipeline using LangChain and Chroma vector database.

Process: User queries are embedded, and relevant code snippets are retrieved from PennyLang using Maximum Marginal Relevance (MMR) search. These snippets are concatenated into the prompt to augment the LLM's context.
Benchmarking:
- Test Suite: 264 diverse test cases covering circuit construction, VQE (Variational Quantum Eigensolver), optimization, and hybrid algorithms.
- Models Evaluated: 7 LLMs (Open-source: Qwen 2.5 7B, LLaMa 4 Maverick; Commercial: GPT-4o Mini, Claude 3.5 Sonnet, Gemini 2.5 Flash, Claude Haiku, GPT-5 Mini).
- Metrics: Pass@k (k=1 to 5), measuring the percentage of test cases where at least one of k generated solutions executed successfully and passed functional checks.
- Conditions: Evaluated under 0% RAG (vanilla), 50%, 75%, and 100% context retrieval coverage.
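
To make the retrieval step concrete, here is a minimal sketch of Maximum Marginal Relevance selection followed by prompt concatenation. It is written in plain Python and does not call LangChain or Chroma, which the paper's pipeline uses. The two-dimensional vectors stand in for real embeddings, and the function names, the lambda_mult parameter name, the snippets, and the prompt template are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, trading relevance to the query against
    similarity to documents that were already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def build_prompt(question, snippets):
    """Concatenate retrieved snippets ahead of the task (hypothetical template)."""
    context = "\n\n".join(snippets)
    return f"Reference examples:\n{context}\n\nTask: {question}\n"


# Toy corpus: snippets 0 and 1 are near-duplicates, snippet 2 is different.
snippets = [
    "qml.Hadamard(wires=0)  # put qubit 0 in superposition",
    "qml.Hadamard(wires=0)  # superposition on qubit 0",
    "return qml.expval(qml.PauliZ(0))  # measure <Z> on qubit 0",
]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]  # stand-in embeddings
query_vec = [1.0, 0.0]

diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
prompt = build_prompt(
    "Apply a Hadamard to qubit 0 and return the expectation value of Z.",
    [snippets[i] for i in diverse],
)
print(diverse, relevance_only)
```

With lambda_mult set to 1.0 the selection reduces to plain similarity ranking and returns both near-duplicates. At 0.5, the second pick is the less similar but non-redundant snippet, which is the behavior MMR is meant to provide.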

3. Key Contributions

PennyLang Dataset: The first large-scale, open-source, instruction-formatted dataset specifically for PennyLane quantum programming, containing 3,347 curated samples with rich context.
Systematic Curation Pipeline: A reproducible methodology for cleaning, annotating, and formatting quantum code from disparate sources into LLM-ready instruction-response pairs.
RAG Evaluation Framework: A robust benchmark demonstrating how retrieval strategies impact code generation, including an ablation study comparing Vanilla RAG vs. GraphRAG (finding Vanilla RAG superior for this specific task).

4. Key Results

The evaluation revealed distinct performance patterns based on model capacity and retrieval strategy:

Significant Gains for Open-Source Models:
- Qwen 2.5 7B: Success rate (Pass@5) jumped from 8.7% (no RAG) to 41.7% (full context), with 75% context yielding the best result (45.1%).
- LLaMa 4 Maverick: Improved from 78.8% to 84.8% with RAG.
- Insight: Smaller models heavily rely on external context to bridge knowledge gaps in quantum syntax and logic.
Diminishing Returns for Commercial Models:
- Models like Claude 3.5 Sonnet and GPT-5 Mini already achieved high baseline performance (e.g., ~95% Pass@5) without retrieval.
- Adding full-context retrieval sometimes caused slight performance degradation, likely due to context saturation or noise, suggesting their pre-training already covers significant PennyLane knowledge.
Optimal Retrieval Strategy:
- 75% Context Coverage outperformed 100% (Full) coverage for most models. This indicates that retrieving a focused, moderate number of relevant examples reduces noise and improves grounding better than exhaustive retrieval.
- Vanilla RAG outperformed GraphRAG, suggesting that for executable code generation, direct similarity-based retrieval of relevant snippets is more effective than modeling complex inter-document relations.
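
For readers who want to see how a Pass@k percentage like those above is obtained, the following is a minimal sketch that follows the definition given in the Methodology section: a test case counts as passed if at least one of the first k generated solutions succeeds. The function name and the outcome data are hypothetical and are not taken from the paper. Some benchmarks instead use a statistical estimator computed from more than k samples per task; this summary does not say which variant the authors used.

```python
def pass_at_k(outcomes, k):
    """Percentage of test cases with at least one success in the first k attempts.

    outcomes: one list of booleans per test case, in generation order.
    """
    if not outcomes:
        raise ValueError("need at least one test case")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return 100.0 * solved / len(outcomes)


# Hypothetical results for four test cases, five attempts each.
outcomes = [
    [True, False, False, False, False],   # solved on the first attempt
    [False, False, True, False, False],   # solved on the third attempt
    [False, False, False, False, False],  # never solved
    [False, False, False, False, True],   # solved on the fifth attempt
]

scores = {k: pass_at_k(outcomes, k) for k in range(1, 6)}
for k, value in scores.items():
    print(f"Pass@{k}: {value:.1f}%")
```

Under this definition Pass@k can never decrease as k grows, because a test case that passes within k attempts also passes within k+1.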

5. Significance

Bridging the Ecosystem Gap: This work democratizes AI-assisted quantum development for PennyLane, a framework critical for Quantum Machine Learning (QML), by providing the necessary data infrastructure previously available only for Qiskit.
Practical RAG Insights: The study provides empirical evidence that "more context" is not always better. For code generation, curated, partial context (75%) is superior to full-context retrieval, offering a guideline for optimizing RAG systems in technical domains.
Reproducibility: By releasing the dataset and evaluation framework, the authors enable the community to train specialized, lightweight models for quantum programming and further advance the field of AI-driven quantum software engineering.

Availability: The PennyLang dataset and evaluation code are publicly available on GitHub at https://github.com/eBrain4Everyone/Pennylang.