Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

This paper introduces a Qiskit-based adaptation of Microsoft's QuantumKatas as a comprehensive benchmark for evaluating LLMs on quantum computing tasks. It finds that models excel at implementing known algorithms but struggle with problem encoding, and that chain-of-thought prompting yields mixed results across different model architectures.

Original authors: Juan Cruz-Benito, Ismael Faro

Published 2026-05-27

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a giant library of 350 riddles designed to teach someone how to speak "Quantum," a strange new language used to program quantum computers. For years, these riddles were written in a language called Q# (Microsoft's dialect).

This paper is about two main things:

  1. Translating the Library: The authors took those 350 riddles and translated them into Qiskit, which is the most popular "dialect" (framework) used by quantum programmers today.
  2. Testing the Students: They used this translated library as a giant exam to test 16 different Artificial Intelligence (AI) models to see how good they are at solving these quantum riddles.

Here is a breakdown of what they found, using simple analogies:

1. The Exam: "QuantumKatas"

Think of the QuantumKatas as a video game with 26 different levels, ranging from "Tutorial" (very easy) to "Boss Battle" (very hard).

  • The Levels: Some levels ask the AI to perform simple tricks, like flipping a coin (a basic gate). Others ask the AI to solve complex puzzles, like finding a hidden needle in a haystack using a specific algorithm (Grover's search) or fixing a broken machine (error correction).
  • The Translation: The authors didn't invent new riddles; they translated the existing ones from Microsoft's Q# language to IBM's Qiskit framework. This keeps the concepts, and the difficulty, the same as in the originals.
  • The Grading: They didn't just ask the AI to write code; they ran the code in a simulator (a virtual quantum computer) to see if it actually worked. If the math didn't match, the AI failed.
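
To make the grading idea concrete, here is a minimal sketch of "run it and compare the math." It is not the authors' test harness: it uses plain Python instead of a real Qiskit simulator, and the task, the helper names (`run`, `same_state`, `grade`), and the three candidate answers are all made up for illustration. The one design choice worth noticing is that two answers count as equal if they differ only by a global phase, since that difference cannot be observed.

```python
import math

# Tiny single-qubit gates as 2x2 matrices (plain Python, no quantum library needed).
S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix into a one-qubit statevector [amp0, amp1]."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]

def run(gates):
    """Simulate a one-qubit circuit: start in |0> and apply the gates in order."""
    state = [1.0, 0.0]
    for gate in gates:
        state = apply_gate(gate, state)
    return state

def same_state(u, v, tol=1e-9):
    """True if two normalized states match up to a global phase."""
    overlap = sum(complex(a).conjugate() * b for a, b in zip(u, v))
    return abs(abs(overlap) - 1.0) < tol

def grade(candidate_gates, reference_state):
    """Pass/fail verdict: simulate the submission and compare with the reference."""
    return same_state(run(candidate_gates), reference_state)

# Hypothetical task: prepare |-> = (|0> - |1>) / sqrt(2) starting from |0>.
REFERENCE = [S, -S]

correct = grade([X, H], REFERENCE)           # X then H gives |->
phase_shifted = grade([X, Z, H], REFERENCE)  # gives -|->, the same physical state
wrong = grade([H], REFERENCE)                # gives |+>, a different state
print(correct, phase_shifted, wrong)         # prints: True True False
```

The third submission looks plausible (it does create a superposition) but fails, which is the point of checking the simulated result rather than reading the code.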

2. The Students: 16 AI Models

They tested 16 different AI "students."

  • The "Elite" Students (Frontier Models): These are the big, expensive, proprietary models (like GPT-5.5, Claude Opus, Gemini 3.1).
  • The "Open" Students (Open-Source Models): These are free models that anyone can download (like Llama, Mistral, Gemma).

The Results:

  • The Gap: The Elite students scored much higher than the Open students. On average, the Elite students got about 75% of the riddles right, while the Open students only got about 49% right. It's like the difference between an honors student and a student who barely passes.
  • Size Doesn't Always Win: Interestingly, having a "bigger brain" (more parameters) didn't guarantee a better score. Some smaller, smarter-tuned models outperformed massive ones. It's not just about how big the brain is, but how it was trained.

3. The Study Hints (Prompting Strategies)

The researchers tried different ways to ask the questions to see if it helped the AI perform better.

  • The "Show Me" Method (Few-Shot): They gave the AI a few examples of solved riddles before asking it to solve a new one. This was the most reliable method for almost everyone. It's like showing a student a solved math problem before giving them a test.
  • The "Think Aloud" Method (Chain-of-Thought): They asked the AI to explain its reasoning step-by-step before writing the code.
    • The Twist: This worked great for the "Reasoning-Tuned" models (the ones specifically trained to think hard), boosting their scores.
    • The Downside: For most other models, thinking out loud actually lowered their scores. It's like asking a student to talk through every step of a puzzle: they get so distracted by talking that they lose track of the solution.
  • The "Just Do It" Method (Zero-Shot): Just asking the question with no examples. This worked best for the absolute smartest models (like GPT-5.5), who didn't need help.
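
The three "study hints" above differ only in what text surrounds the riddle. A rough sketch of that idea follows. The function name `build_prompt`, the strategy labels, and every template sentence are placeholders invented for this example; the paper's actual prompt wording is not reproduced here.

```python
def build_prompt(task, strategy, examples=()):
    """Assemble a prompt for one task under one of three strategies.

    The wording of every template here is a placeholder, not the paper's text.
    """
    instruction = "Write Qiskit code that solves the task. Return only code."
    if strategy == "zero_shot":
        # "Just Do It": the bare task, no help.
        parts = [instruction, f"Task: {task}"]
    elif strategy == "few_shot":
        # "Show Me": solved examples first, then the new task.
        shots = [f"Task: {q}\nSolution:\n{a}" for q, a in examples]
        parts = [instruction, *shots, f"Task: {task}\nSolution:"]
    elif strategy == "chain_of_thought":
        # "Think Aloud": ask for reasoning before the code.
        parts = [
            "Explain your reasoning step by step, then write Qiskit code that solves the task.",
            f"Task: {task}",
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return "\n\n".join(parts)

# Hypothetical solved example and new task.
solved = [("Flip qubit 0 from |0> to |1>.", "qc.x(0)")]
new_task = "Put qubit 0 into an equal superposition."

for name in ("zero_shot", "few_shot", "chain_of_thought"):
    print(f"--- {name} ---")
    print(build_prompt(new_task, name, solved))
```

Keeping the strategy as a parameter makes it easy to run the same riddle under all three hints and compare scores per model.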

4. Where Did They Struggle?

The AI students were good at some things and terrible at others:

  • The Strong Suit: They were great at reciting known algorithms. If the riddle asked, "Write the code for Simon's Algorithm," they got it right 82% of the time. It's like memorizing a recipe and cooking it perfectly.
  • The Weak Spot: They struggled with problem encoding. If the riddle said, "Take this messy real-world problem (like a logic puzzle) and turn it into a quantum recipe," they failed often (only 34% success). It's like being great at following a recipe but terrible at inventing a new dish from scratch.
  • The "Measurement" Trap: They also had a hard time with tasks involving "measurement" (checking the result of a quantum state). This seems to be a specific blind spot for current AI.
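
To give a feel for what "problem encoding" can involve, here is a toy sketch that turns a small yes/no rule into the kind of "marker" (a phase oracle) that Grover's search needs, and then runs one round of the search. It is only an illustration under stated assumptions: the puzzle, the function names, and the plain-Python simulation are invented here and are not taken from the benchmark's tasks.

```python
def index_to_bits(index, n_bits):
    """Bits of a basis-state index, with bits[0] as the least significant bit."""
    return tuple((index >> k) & 1 for k in range(n_bits))

def phase_oracle_diagonal(predicate, n_bits):
    """Encode a classical rule as phases: -1 on inputs that satisfy it, +1 otherwise."""
    return [
        -1.0 if predicate(index_to_bits(i, n_bits)) else 1.0
        for i in range(2 ** n_bits)
    ]

def grover_iteration(amplitudes, diagonal):
    """One Grover round: apply the oracle phases, then reflect about the mean."""
    marked = [a * d for a, d in zip(amplitudes, diagonal)]
    mean = sum(marked) / len(marked)
    return [2 * mean - a for a in marked]

# Hypothetical puzzle on two bits: "x0 is on AND x1 is off".
def puzzle(bits):
    return bits[0] == 1 and bits[1] == 0

n_bits = 2
oracle = phase_oracle_diagonal(puzzle, n_bits)   # [1.0, -1.0, 1.0, 1.0]

# Start in the uniform superposition over all 4 inputs.
state = [0.5] * (2 ** n_bits)
state = grover_iteration(state, oracle)
probabilities = [a * a for a in state]
print(probabilities)  # prints: [0.0, 1.0, 0.0, 0.0]
```

The search step (`grover_iteration`) is the memorized recipe. The part that changes with every new puzzle is `puzzle` and its translation into `oracle`, which is the step the explainer says the AI students tend to get wrong.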

5. The Verdict

  • AI is getting good, but not perfect: The best AI can solve about 83% of these quantum riddles. That's impressive for such a hard subject, but it still leaves roughly one riddle in six unsolved.
  • The "Translation" Problem: The AI is better at copying known patterns than translating a new, messy problem into quantum code.
  • One Size Does Not Fit All: You shouldn't use the same "study hint" (prompt) for every AI. Some need examples, some need to think aloud, and some just need to be left alone.

In short: The authors built a standardized "Quantum Driver's Test" in the most popular language. They found that while AI is getting very good at driving on known roads (standard algorithms), it still struggles to navigate when the map is missing (solving new problems). The "Elite" AI models are currently the best drivers, but the gap between them and the "Open" models is significant.
