Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation
This paper introduces a Qiskit-based adaptation of Microsoft's QuantumKatas as a comprehensive benchmark for evaluating LLMs on quantum computing tasks, revealing that while models excel at implementing known algorithms, they struggle with problem encoding and that chain-of-thought prompting yields mixed results across different model architectures.
Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you have a giant library of 350 riddles designed to teach someone how to speak "Quantum," a strange new language used to program quantum computers. For years, these riddles were written in a language called Q# (Microsoft's dialect).
This paper is about two main things:
Translating the Library: The authors took those 350 riddles and translated them into Qiskit, which is the most popular "dialect" (framework) used by quantum programmers today.
Testing the Students: They used this translated library as a giant exam to test 16 different Artificial Intelligence (AI) models to see how good they are at solving these quantum riddles.
Here is a breakdown of what they found, using simple analogies:
1. The Exam: "QuantumKatas"
Think of the QuantumKatas as a video game with 26 different levels, ranging from "Tutorial" (very easy) to "Boss Battle" (very hard).
The Levels: Some levels ask the AI to perform simple tricks, like flipping a coin (a basic gate). Others ask the AI to solve complex puzzles, like finding a hidden needle in a haystack using a specific algorithm (Grover's search) or fixing a broken machine (error correction).
The Translation: The authors didn't invent new riddles; they translated the existing ones from Microsoft's Q# language to Qiskit, IBM's open-source Python framework. Reusing the original riddles keeps the concepts and the difficulty progression the same.
The Grading: They didn't just ask the AI to write code; they ran the code in a simulator (a virtual quantum computer) to see if it actually worked. If the math didn't match, the AI failed.
2. The Students: 16 AI Models
They tested 16 different AI "students."
The "Elite" Students (Frontier Models): These are the big, expensive, proprietary models (like GPT-5.5, Claude Opus, Gemini 3.1).
The "Open" Students (Open-Source Models): These are free models that anyone can download (like Llama, Mistral, Gemma).
The Results:
The Gap: The Elite students scored much higher than the Open students. On average, the Elite students got about 75% of the riddles right, while the Open students only got about 49% right. It's like the difference between an honors student and a passing student.
Size Doesn't Always Win: Interestingly, having a "bigger brain" (more parameters) didn't guarantee a better score. Some smaller, smarter-tuned models outperformed massive ones. It's not just about how big the brain is, but how it was trained.
3. The Study Hints (Prompting Strategies)
The researchers tried different ways to ask the questions to see if it helped the AI perform better.
The "Show Me" Method (Few-Shot): They gave the AI a few examples of solved riddles before asking it to solve a new one. This was the most reliable method for almost everyone. It's like showing a student a solved math problem before giving them a test.
The "Think Aloud" Method (Chain-of-Thought): They asked the AI to explain its reasoning step-by-step before writing the code.
The Twist: This worked great for the "Reasoning-Tuned" models (the ones specifically trained to think hard), boosting their scores.
The Downside: For most other models, thinking out loud actually made them worse. It's like asking a student to narrate every step of a puzzle and watching them get so caught up in the narration that they lose track of the solution.
The "Just Do It" Method (Zero-Shot): Just asking the question with no examples. This worked best for the absolute smartest models (like GPT-5.5), who didn't need help.
4. Where Did They Struggle?
The AI students were good at some things and terrible at others:
The Strong Suit: They were great at reciting known algorithms. If the riddle asked, "Write the code for Simon's Algorithm," they got it right 82% of the time. It's like memorizing a recipe and cooking it perfectly.
The Weak Spot: They struggled with problem encoding. If the riddle said, "Take this messy real-world problem (like a logic puzzle) and turn it into a quantum recipe," they failed often (only 34% success). It's like being great at following a recipe but terrible at inventing a new dish from scratch.
The "Measurement" Trap: They also had a hard time with tasks involving "measurement" (checking the result of a quantum state). This seems to be a specific blind spot for current AI.
5. The Verdict
AI is getting good, but not perfect: The best AI can solve about 83% of these quantum riddles. That's impressive for such a hard subject, but it still leaves roughly one riddle in six unsolved.
The "Translation" Problem: The AI is better at copying known patterns than translating a new, messy problem into quantum code.
One Size Does Not Fit All: You shouldn't use the same "study hint" (prompt) for every AI. Some need examples, some need to think aloud, and some just need to be left alone.
In short: The authors built a standardized "Quantum Driver's Test" in the most popular language. They found that while AI is getting very good at driving on known roads (standard algorithms), it still struggles to navigate when the map is missing (solving new problems). The "Elite" AI models are currently the best drivers, but the gap between them and the "Open" models is significant.
Technical Summary: Qiskit QuantumKatas for LLM Evaluation
Problem Statement
While Large Language Models (LLMs) have demonstrated strong code generation capabilities in general programming and data science, their proficiency in specialized scientific computing—specifically quantum computing—remains underexplored. Quantum computing presents a unique challenge due to its non-classical computational paradigm, requiring an understanding of superposition, entanglement, and measurement. Existing benchmarks for quantum tasks are often limited in scale, lack pedagogical structure, or focus on multiple-choice knowledge rather than code generation. There is a need for a large-scale, structured benchmark that enables fine-grained analysis of LLMs' ability to generate functional quantum code within the most widely adopted framework, Qiskit.
Methodology
The authors introduce Qiskit QuantumKatas, a benchmark adapting Microsoft's established QuantumKatas curriculum (originally in Q#) into Qiskit. The methodology involves:
Dataset Construction:
Translation: 350 distinct programming tasks were translated from Q# to Qiskit, preserving the original pedagogical progression from basic gates to advanced algorithms.
Verification: A deterministic evaluation pipeline was built using classical circuit simulation (Qiskit's AerSimulator and Statevector). Each task includes a natural language prompt, a canonical solution, and a test function that verifies correctness via statevector comparison or measurement outcome analysis.
Categorization: Tasks are organized into 26 categories (e.g., BasicGates, Grover's Algorithm, Quantum Error Correction) and three pedagogical tiers: Introductory (95 tasks), Intermediate (132 tasks), and Advanced (123 tasks).
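The two verification modes mentioned above (statevector comparison and measurement outcome analysis) can be illustrated with a minimal sketch. This is not the authors' pipeline, which uses Qiskit's simulators. It is plain standard-library Python, and the function names `equal_up_to_global_phase` and `measurement_probabilities` are hypothetical. It assumes normalized state vectors given as lists of complex amplitudes.

```python
import cmath
import math


def equal_up_to_global_phase(a, b, tol=1e-9):
    """Return True if normalized states a and b differ only by a global phase."""
    if len(a) != len(b):
        return False
    # For normalized states, |<a|b>| equals 1 exactly when b = e^{i*theta} * a.
    overlap = sum(x.conjugate() * y for x, y in zip(a, b))
    return abs(abs(overlap) - 1.0) < tol


def measurement_probabilities(state):
    """Born rule: probability of each computational-basis outcome."""
    return [abs(amp) ** 2 for amp in state]


s = 1 / math.sqrt(2)
plus = [complex(s), complex(s)]     # |+> = (|0> + |1>) / sqrt(2)
minus = [complex(s), complex(-s)]   # |-> = (|0> - |1>) / sqrt(2)
phased_plus = [cmath.exp(0.7j) * amp for amp in plus]  # |+> with a global phase

same_state = equal_up_to_global_phase(plus, phased_plus)
different_state = equal_up_to_global_phase(plus, minus)
probs_plus = measurement_probabilities(plus)
probs_minus = measurement_probabilities(minus)
```

In this sketch `plus` and `minus` give the same computational-basis probabilities even though they are different states. That is one reason a test based only on measurement statistics can be weaker than a direct statevector comparison.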
Evaluation Framework:
Models: 16 LLMs were evaluated, comprising 6 frontier (proprietary) models (e.g., GPT-5.5, Claude Opus 4.7) and 10 open-source models (ranging from 8B to 675B parameters).
Prompting Configurations: Each model was tested across 7 prompting strategies: three zero-shot variants (default, minimal, detailed), three few-shot variants (1-shot, 3-shot, 5-shot using examples from introductory categories), and one chain-of-thought (CoT) configuration.
Execution: The study involved 39,200 model runs. Solutions were parsed, syntax-checked, and executed in isolated subprocesses with a 30-second timeout. Pass@1 (single-attempt) results were reported at temperature 0 to ensure reproducibility.
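The execution step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' harness: `run_candidate`, `pass_at_1`, and the outcome labels are hypothetical names, and the "test" here is plain Python source appended to the candidate. Only the 30-second timeout, the syntax check, the subprocess isolation, and the run count (350 tasks, 16 models, 7 configurations) come from the summary.

```python
import subprocess
import sys

N_TASKS, N_MODELS, N_CONFIGS = 350, 16, 7
TOTAL_RUNS = N_TASKS * N_MODELS * N_CONFIGS  # 39,200 runs, matching the summary


def run_candidate(solution_src, test_src, timeout_s=30):
    """Run a candidate solution plus its test in a fresh interpreter.

    Returns one of "pass", "syntax_error", "timeout", or "fail".
    """
    try:
        # Syntax check only: compile() does not execute the code.
        compile(solution_src, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"
    program = solution_src + "\n" + test_src
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,  # the child process is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    # A failing assert in the test exits with a non-zero return code.
    return "pass" if proc.returncode == 0 else "fail"


def pass_at_1(outcomes):
    """Fraction of tasks solved on the single attempt."""
    return sum(o == "pass" for o in outcomes) / len(outcomes)
```

Running each candidate in its own interpreter means a crash or infinite loop in generated code cannot take down the evaluator, at the cost of one process start per run.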
Key Contributions
Benchmark Adaptation: A complete translation of the 350-task QuantumKatas curriculum from Q# to Qiskit, making a proven pedagogical resource accessible for evaluating the dominant quantum framework.
Evaluation Infrastructure: A robust, deterministic evaluation pipeline featuring classical simulation for verification, multi-provider support, and configurable prompting strategies.
Empirical Analysis: The largest systematic evaluation of LLMs on quantum code generation to date, providing baseline results, error taxonomies, and fine-grained performance profiling across 26 categories.
Open Release: The dataset, evaluation framework, and baseline results are released to support reproducible research.
Results
The evaluation yielded several critical findings regarding LLM capabilities in quantum computing:
Model Performance Gap:
Best-configuration pass rates ranged from 32.3% (Granite 4.1 8B) to 83.1% (GPT-5.5).
A persistent 26.1 percentage point gap exists between frontier models (avg 75.3%) and open-source models (avg 49.3%).
Model scale is not a perfect predictor of performance; for instance, the 675B-parameter Mistral Large 3 (48.6%) underperformed the 31B-parameter Gemma 4 (68.0%).
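The figures above can be sanity-checked with a little arithmetic. The rounded averages differ by 26.0 points (75.3 minus 49.3), while the reported gap is 26.1. The sketch below checks whether that is explainable by rounding alone, and what task counts the extreme pass rates would imply. It assumes every reported number was rounded to one decimal place and that best-configuration pass rates are computed over all 350 tasks. Neither assumption is stated in the summary, and the helper names are hypothetical.

```python
from fractions import Fraction


def rounding_interval(reported, decimals=1):
    """Approximate range (lo, hi) of true values that round to `reported`."""
    half = Fraction(1, 2 * 10 ** decimals)
    value = Fraction(reported)
    return value - half, value + half


def implied_task_count(rate_percent, n_tasks=350):
    """Nearest whole number of passed tasks for a reported pass rate."""
    return round(Fraction(rate_percent) / 100 * n_tasks)


frontier_lo, frontier_hi = rounding_interval("75.3")
open_lo, open_hi = rounding_interval("49.3")

# Smallest and largest gap compatible with the two rounded averages.
gap_lo = frontier_lo - open_hi
gap_hi = frontier_hi - open_lo

# The reported 26.1 is consistent if its own rounding interval overlaps.
reported_lo, reported_hi = rounding_interval("26.1")
gap_is_consistent = max(gap_lo, reported_lo) < min(gap_hi, reported_hi)

best_tasks = implied_task_count("83.1")   # GPT-5.5
worst_tasks = implied_task_count("32.3")  # Granite 4.1 8B
```

Under these assumptions the 26.1-point gap is compatible with unrounded averages, so the one-decimal mismatch need not indicate an error.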
Prompting Strategy Effects:
Few-shot prompting (specifically 5-shot) was the most reliable strategy on average (57.8% mean), outperforming zero-shot and chain-of-thought.
Chain-of-Thought (CoT) exhibited a bimodal effect: it was the best strategy for three models (two explicitly reasoning-tuned: GPT-5.3-Codex and Gemini 3.1 Pro), but degraded performance for the majority of other models (e.g., an 11.1 pp drop for Claude Sonnet 4.6). This suggests CoT is not universally beneficial for quantum code generation.
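To make the difference between the strategies concrete, here is a minimal sketch of how zero-shot, few-shot, and chain-of-thought prompts might be assembled. The templates, the `build_prompt` function, and its parameters are hypothetical. The paper's actual prompt wording is not given in this summary.

```python
def build_prompt(task, strategy, examples=(), shots=0):
    """Assemble a prompt string for one task under a given strategy.

    `examples` is a sequence of (task_text, solution_code) pairs used
    only by the few-shot strategy.
    """
    if strategy == "zero_shot":
        parts = []
    elif strategy == "few_shot":
        if shots > len(examples):
            raise ValueError("not enough worked examples for this many shots")
        # Worked examples come first, in the same layout as the final task.
        parts = [
            f"Task: {ex_task}\nSolution:\n{ex_solution}"
            for ex_task, ex_solution in examples[:shots]
        ]
    elif strategy == "cot":
        # Ask for reasoning before the code, with no worked examples.
        parts = ["Explain your reasoning step by step, then give the final code."]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append(f"Task: {task}\nSolution:")
    return "\n\n".join(parts)
```

For instance, `build_prompt("Apply a Hadamard gate to qubit 0.", "few_shot", examples=pairs, shots=3)` would place three solved tasks ahead of the new one, where `pairs` is a list of example pairs supplied by the caller.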
Task Difficulty and Capabilities:
Algorithm Implementation vs. Problem Encoding: Models perform well on implementing known algorithms (e.g., Simon's Algorithm: 82.1%, BasicGates: 81.6%) but struggle significantly with encoding classical problems into quantum primitives (e.g., SolveSATWithGrover: 34.4%, DistinguishUnitaries: 40.0%).
Error Analysis: The dominant failure mode is logic errors (43.0%, primarily AssertionError), where code runs but produces incorrect quantum states. This indicates that quantum reasoning, rather than syntax or API usage, is the primary bottleneck.
Measurement Reasoning: Categories involving measurement outcomes and basis selection (e.g., Measurements, Teleportation) consistently showed lower pass rates, highlighting a specific weakness in reasoning about classical-quantum interfaces.
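An error taxonomy like the one described above can be approximated by bucketing failures according to the exception they raise. In the sketch below, only the link between `AssertionError` and logic errors comes from the summary. The other bucket names, the exception groupings, and the functions `classify_failure` and `failure_shares` are assumptions made for illustration.

```python
from collections import Counter


def classify_failure(exc):
    """Map an exception (or None for success) to a coarse failure category."""
    if exc is None:
        return "pass"
    if isinstance(exc, AssertionError):
        # The code ran, but the test's check on the resulting state failed.
        return "logic_error"
    if isinstance(exc, SyntaxError):
        return "syntax_error"
    if isinstance(exc, (ImportError, AttributeError, NameError, TypeError)):
        # Assumed bucket: wrong or missing library names and signatures.
        return "api_or_usage_error"
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "other_runtime_error"


def failure_shares(exceptions):
    """Share of each failure category among failed runs only."""
    labels = [classify_failure(e) for e in exceptions if e is not None]
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```

A share such as the reported 43.0% for logic errors would correspond to the `"logic_error"` entry of `failure_shares` computed over all failed runs, if the paper's categories were defined this way.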
Significance
The paper claims that the Qiskit QuantumKatas benchmark provides a rigorous, pedagogically structured tool for assessing LLMs in a specialized scientific domain. Its significance lies in:
Differentiation: The benchmark effectively differentiates model capabilities across a wide performance spectrum, avoiding ceiling or floor effects.
Granularity: The 26-category structure allows for fine-grained analysis, revealing that LLMs can translate documented algorithmic structures into code more readily than they can formulate new quantum solutions for classical problems.
Educational and Developmental Utility: The results suggest that while frontier models are becoming viable for automated tutoring and code completion in introductory quantum topics, they are not yet trustworthy for advanced problem formulation or complex arithmetic.
Future Direction: The study highlights that scaling alone may not bridge the gap in specialized domains; targeted training and improved reasoning capabilities are likely necessary to address the specific challenges of problem encoding and measurement reasoning.
The authors emphasize that the benchmark serves as a foundation for future research, including noise-aware tasks, research-level algorithms, and the development of domain-specific training data to close the performance gap between frontier and open-source models.