Retrieval-Augmented Claude Opus 4.7 and GPT-5.5 Surpass Human Performance on the Nuclear Cardiology Board Preparation Exam (and Claude Drafts a Paper About it)

Next-generation large language models, specifically Claude Opus 4.7 and GPT-5.5, equipped with retrieval-augmented generation using domain-specific nuclear cardiology resources, achieved mean accuracy rates of approximately 86% on the ASNC Board Preparation Exam, surpassing both the estimated passing threshold and the average performance of human fellows-in-training.

Original authors: Killekar, A., Shanbhag, A., Miller, R. J., Dey, D., Bourque, J., Phillips, L., Chareonthaitawee, P., Slomka, P.

Published 2026-05-13

CC BY 4.0

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a high-stakes final exam for doctors who specialize in imaging the heart with radioactive tracers. This is the nuclear cardiology board exam, and the paper uses its practice version: the ASNC Board Preparation Exam. Until recently, Artificial Intelligence (AI) models that attempted this test fell short, scoring lower than the average physician trainee.

This paper tells the story of how two new AI models, given access to the right reference materials, finally passed the test with flying colors, beating the average human trainee.

The Setup: The Exam and the "Cheat Sheet"

The exam has 168 questions. Some are just text (like a trivia quiz), but about 27 of them require looking at complex medical images of hearts.

In the past, when AI tried to take this test "cold" (without any help), the best it could do was get about 63% right. That's a failing grade. The average human trainee (a "fellow-in-training") scored 78%.

For this new study, the researchers gave the AI a massive "cheat sheet." This wasn't just a quick Google search; it was a Retrieval-Augmented Generation (RAG) system. Think of it like giving the AI a searchable digital library containing the official textbooks, atlases, and medical guidelines for nuclear cardiology. When the AI sees a question, it searches this library, pulls out the most relevant passages, and uses them to formulate its response.
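To make the "look it up, then answer" idea concrete, here is a minimal sketch of the retrieval step. It is a toy, not the study's system: the three passages in `LIBRARY` are illustrative stand-ins rather than text from the study's library, the word-overlap scoring is a deliberately simple substitute for whatever search method the researchers used, and the function names `retrieve` and `build_prompt` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical mini "library". The study's real corpus (textbooks, atlases,
# guidelines) is not reproduced here; these lines are illustrative stand-ins.
LIBRARY = [
    "Technetium-99m has a physical half-life of about 6 hours.",
    "Rubidium-82 has a physical half-life of about 75 seconds.",
    "Attenuation artifacts can mimic perfusion defects on SPECT images.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, passages, k=1):
    """Return the k passages sharing the most words with the question."""
    question_words = Counter(tokenize(question))

    def overlap(passage):
        # Multiset intersection counts the words common to both texts.
        return sum((question_words & Counter(tokenize(passage))).values())

    return sorted(passages, key=overlap, reverse=True)[:k]


def build_prompt(question, passages, k=1):
    """Place the retrieved passages in front of the question for the model."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Reference material:\n{context}\n\nQuestion: {question}"


question = "What is the half-life of technetium-99m?"
top = retrieve(question, LIBRARY, k=1)
prompt = build_prompt(question, LIBRARY, k=1)
print(prompt)
```

In a real RAG setup, the prompt built this way would be sent to the language model, which answers with the retrieved passage in view instead of relying on memory alone.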

The Contenders

The researchers tested two new, next-generation AI models:

Claude Opus 4.7: A model that uses a local, transparent search system (like a librarian who shows you exactly which books it pulled off the shelf).
GPT-5.5: A model that uses a cloud-based search system (like a librarian who finds the books for you but doesn't show you the process).

The Results: AI Beats the Average Student

When these two AIs took the exam five times each, the results were surprising:

The Scores: Both models scored around 86% to 87%.
The Comparison: This is well above the average human fellow's score of 78%. In fact, if you lined up the 13 human fellows and the 2 AIs, the AIs would rank in the top 5, beating 8 or 9 of the humans.
The Speed of Progress: This is a massive jump. Just 18 months ago, the best AI scored 63%. Now, with the "cheat sheet" (RAG), the top score has risen by about 23 percentage points, to roughly 86%.
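For readers who like to see the arithmetic, the comparison above can be sketched in a few lines. The 86% figure is the approximate mean reported in this summary, and converting percentage points into a number of questions is a back-of-the-envelope illustration, not a result from the paper.

```python
QUESTIONS = 168          # total questions on the exam
previous_best = 63.0     # best earlier AI score without retrieval, in percent
fellows_mean = 78.0      # average fellow-in-training score, in percent
rag_score = 86.0         # approximate mean score with retrieval, in percent

# Differences in percentage points.
gain_over_previous = rag_score - previous_best
margin_over_fellows = rag_score - fellows_mean

# Rough equivalent in questions answered correctly (illustrative only).
extra_vs_previous = round(gain_over_previous / 100 * QUESTIONS)
extra_vs_fellows = round(margin_over_fellows / 100 * QUESTIONS)

print(f"Gain over earlier AI: {gain_over_previous:.0f} points "
      f"(about {extra_vs_previous} more questions)")
print(f"Margin over fellows: {margin_over_fellows:.0f} points "
      f"(about {extra_vs_fellows} more questions)")
```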

The Two Weak Spots

Even though the AIs won, they had two specific struggles:

The "Image" Problem: The AIs were great at text questions (scoring nearly 89%), but they stumbled on the image questions. They got about 73–77% right on images. Humans were still better at this, scoring 81.5%.
- Analogy: Imagine the AI is a brilliant professor who can recite the entire textbook from memory but still gets confused when looking at a blurry X-ray. It knows the theory perfectly but is still learning how to "see" the picture.
The "Safety" Glitch (GPT-5.5 only): GPT-5.5 refused to answer about 7% of the questions. It would say, "I'm sorry, I can't help with that," even though the questions were just standard medical exam questions about heart drugs or radiation safety.
- Analogy: It's like a very cautious librarian who refuses to hand over a book because it sounds like "how to build a bomb," even though you are a physics student asking a legitimate exam question about nuclear energy. The AI's safety filters were too sensitive, causing it to miss points. Claude Opus 4.7 didn't have this problem; it answered everything.
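As a quick sanity check, the text and image scores above can be combined into an overall score and compared with the reported 86% to 87%. This sketch assumes exactly 27 image questions (the summary says "about 27") and takes the rounded percentages at face value; the estimate of refused questions further assumes the 7% applies evenly to each run of the exam.

```python
TOTAL_QUESTIONS = 168
IMAGE_QUESTIONS = 27                      # assumed exact; the text says "about 27"
TEXT_QUESTIONS = TOTAL_QUESTIONS - IMAGE_QUESTIONS

text_accuracy = 0.89                      # "nearly 89%" on text questions
image_accuracy_range = (0.73, 0.77)       # "about 73-77%" on image questions


def overall_accuracy(text_acc, image_acc):
    """Weighted average of text and image accuracy by question count."""
    correct = TEXT_QUESTIONS * text_acc + IMAGE_QUESTIONS * image_acc
    return correct / TOTAL_QUESTIONS


low = overall_accuracy(text_accuracy, image_accuracy_range[0])
high = overall_accuracy(text_accuracy, image_accuracy_range[1])
print(f"Implied overall accuracy: {low:.1%} to {high:.1%}")

# Rough count of questions GPT-5.5 declined per run, assuming an even 7%.
refused_per_run = round(0.07 * TOTAL_QUESTIONS)
print(f"Refused questions per run: about {refused_per_run}")
```

The implied overall range lands in the mid-80s, in line with the reported scores. Image questions make up only about one sixth of the exam, which is why the weaker image performance pulls the total down by only a couple of points.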

What the Authors Actually Say (and Don't Say)

The paper is very careful about what this means:

What it IS: It shows that with the right reference materials, AI can answer exam questions on the facts and rules of nuclear cardiology more accurately than the average trainee. The authors suggest these tools could be used as educational aids to help trainees study or as reference tools to double-check facts in a reading room.
What it IS NOT: The authors explicitly state that passing a multiple-choice test does not mean the AI is ready to be a doctor. Real medicine involves talking to patients, handling uncertainty, and making complex judgment calls that a multiple-choice exam cannot measure. The AI is a powerful reference book, not a replacement for a human physician.

The Bottom Line

In the span of a year and a half, AI has gone from failing the nuclear cardiology board preparation exam to beating the average human trainee, provided it has access to the right textbooks. However, it still struggles with interpreting medical images, and one of the models is too "scared" to answer certain legitimate questions. While it's a huge leap forward for medical education tools, the paper concludes that these machines are assistants, not replacements, for human doctors.