Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

This landscape commentary evaluates the GPT-5 family against GPT-4o. It finds substantial improvements in expert-level textual reasoning and in multimodal synthesis of text and images, while highlighting that generalist models still lag behind specialized systems in perception-critical domains such as neuroradiology and mammography.

Alexandru Florea, Shansong Wang, Mingzhe Hu, Qiang Li, Zach Eidex, Luke del Balzo, Mojtaba Safari, Xiaofeng Yang

Published 2026-03-06

Imagine you are trying to teach a brilliant, well-read student how to be a doctor. This student has read every medical textbook, every journal, and every case study ever written. But there's a catch: they've never actually looked at a real X-ray, a microscope slide, or a mammogram. They only know what the words say about those images.

This paper is a report card for the newest version of that student: GPT-5. The researchers wanted to see whether this "super-student" has graduated from merely memorizing facts to actually thinking like a doctor when interpreting pictures and text together.

Here is the breakdown of their findings, using some everyday analogies:

1. The Test: A Medical "Obstacle Course"

The researchers didn't just ask GPT-5 simple questions. They put it through a grueling obstacle course with three different types of challenges:

  • The Textbook Exam: Standard medical questions (like the USMLE board exams) to see if it knows the facts.
  • The Detective Work: Complex cases where the student has to read a patient's story, look at a lab report, and then look at an image to solve a mystery.
  • The Visual Eye-Test: Looking at specific medical images (brain MRIs, microscope slides of cells, and breast X-rays) and answering questions about what they see.

2. The Results: The "Smart Generalist" vs. The "Specialist"

🧠 The Textbook Genius (Text-Only Tasks)

The Verdict: GPT-5 is a phenomenal scholar.
When the test was purely about reading and reasoning with words, GPT-5 excelled, scoring over 95% on medical board exams.

  • The Analogy: If the previous model (GPT-4o) was a smart college graduate, GPT-5 is a tenured professor who can explain complex medical concepts without stumbling. It didn't just memorize the answers; it figured out the logic behind them.

🔍 The Detective (Combining Text + Images)

The Verdict: GPT-5 is getting much better at connecting the dots.
When the test required looking at a picture and reading a story to make a diagnosis, GPT-5 showed a massive leap forward.

  • The Analogy: Imagine a detective who used to only read police reports. Now, GPT-5 can look at the crime scene photo and read the witness statement to figure out what happened. In one test, it correctly identified a rare, life-threatening condition (a torn esophagus) by combining a CT scan image with the patient's vomiting history. It acted like a seasoned clinician, not just a search engine.

👁️ The Visual Specialist (Looking at Images Alone)

The Verdict: GPT-5 is good, but not a specialist yet.
This is where the "super-student" hit a wall.

  • Brain Scans (Neuroradiology): GPT-5 got about 44% of the answers right. That's better than a layperson, but a trained neuroradiologist would score 90%+. It's like a general mechanic trying to fix a Formula 1 engine: they know how engines work, but they miss the tiny, specific details.
  • Microscopes (Pathology): It performed reasonably, but it was sometimes confused by the fine-grained details of blood cells and tissue samples.
  • Mammograms (Breast X-rays): This was the hardest test. GPT-5 got roughly 50-60% right, while specialized AI systems built solely for breast cancer detection score over 80-90%.
  • The Analogy: Think of GPT-5 as a general practitioner (a family doctor). They are great at handling most things and can spot when something is wrong. But if you need a micro-surgeon to remove a tiny tumor or a radiologist to spot a tiny speck of calcium in a breast, you still need a human expert or a machine built specifically for that one job. GPT-5 is too "general" to be the best at the most "specialized" visual tasks.

3. The "Mini" Surprise

Interestingly, the smaller versions of GPT-5 (called "Mini" and "Nano") sometimes did better than the big "GPT-5" on specific image tests.

  • The Analogy: It's like a giant, cautious elephant (GPT-5) vs. a nimble mouse (GPT-5 Mini). The elephant is incredibly smart and powerful, but sometimes it overthinks or is too careful when looking at small details. The mouse is faster and sometimes just "guesses" the right answer more often on small, specific tasks.

4. The Bottom Line: A Powerful Assistant, Not a Replacement

The paper concludes that GPT-5 is a major step forward: it is the first time a general-purpose AI has shown it can reason jointly across text and images in a way that resembles a clinician's workflow.

  • What it can do: It's an amazing assistant. It can read a patient's history, look at a scan, and say, "Hey, this looks suspicious, here is why, and here is what we should check next." It helps doctors think faster and more holistically.
  • What it can't do yet: It is not ready to replace a specialist. If you need a machine to look at a mammogram and decide if it's cancer with 99% certainty, you still need a tool built specifically for that, not a general-purpose AI.

In short: GPT-5 is like a brilliant medical student who is ready to start their residency. They are smart, they reason well, and they are improving fast. But they still need a supervising doctor (or a specialized tool) to handle the most critical, high-stakes visual decisions.