Medical concept understanding in large language models is fragmented

This paper reveals that while large language models excel in medical applications, their understanding of medical concepts is significantly fragmented, with strong performance on identity and hierarchy masking substantial gaps in grasping concept meaning.

Deng, L., Chen, L., Liu, M.

Published 2026-03-05

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Question: Do AI Doctors Actually "Get It"?

Imagine you have a brilliant new student, let's call him "AI Alex." Alex has read every medical textbook, watched every surgery video, and memorized millions of patient records. When you ask him, "What's the best treatment for a broken leg?" he answers perfectly. When you ask him to diagnose a rare rash, he gets it right 90% of the time.

You might think, "Wow, Alex is a genius doctor!"

But this paper asks a deeper, more unsettling question: Does Alex actually understand what a "broken leg" is, or is he just really good at guessing the right answer based on patterns?

The researchers wanted to see if Large Language Models (LLMs) like the ones powering modern AI truly understand medical concepts, or if they are just "parrots" that sound smart without knowing the underlying logic.

The Test: The Three-Layer Cake

To find out, the researchers didn't just give Alex a medical exam. Instead, they built a special test based on the Human Phenotype Ontology (HPO). Think of the HPO as the ultimate, perfectly organized library card catalog for human body symptoms.

They broke down "understanding" into three layers, like a three-layer cake (a rough sketch of how all three tests could be posed follows the list):

  1. Layer 1: The Name Game (Concept Identity)

    • The Test: Can Alex realize that "Loss of smell" and "Anosmia" are the exact same thing?
    • The Analogy: Imagine someone asks, "Is a 'soda' the same as a 'pop'?" A smart person knows they are synonyms.
    • The Result: Alex is great at this! He got it right about 90% of the time. He knows the different names for the same thing.
  2. Layer 2: The Family Tree (Concept Hierarchy)

    • The Test: Can Alex understand that "Anosmia" is a type of "Abnormality of the sense of smell"?
    • The Analogy: If you know "Golden Retriever" is a type of "Dog," and "Dog" is a type of "Animal," you understand the family tree.
    • The Result: Alex is okay, but not perfect. His score dropped to about 84%. He sometimes gets confused about how things fit together in the big picture.
  3. Layer 3: The Deep Meaning (Concept Meaning)

    • The Test: Can Alex pick the correct definition of a symptom from a list of 20 very similar-sounding definitions?
    • The Analogy: This is like asking, "What exactly is a 'broken leg'?" and giving you 20 definitions where 19 are slightly wrong (e.g., "a leg that hurts" vs. "a bone that has fractured").
    • The Result: This is where Alex really struggled. His score dropped to 72%. Even worse, if you tricked him with a hint that was slightly wrong, he got confused and his score plummeted.
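To make the three layers concrete, here is a minimal Python sketch, assuming HPO-style term records, of how each layer could be turned into a quiz question. The `Term` structure, the example entry, the HPO IDs, and the question templates below are illustrative assumptions, not the paper's actual benchmark code.

```python
# A minimal sketch (not the authors' code) of how the three quiz layers
# could be posed as questions over HPO-style term records.
# The example data and IDs below are illustrative; real terms come from the HPO release files.

from dataclasses import dataclass, field

@dataclass
class Term:
    hpo_id: str
    name: str
    synonyms: list[str] = field(default_factory=list)
    parents: list[str] = field(default_factory=list)   # "is-a" links up the family tree
    definition: str = ""

anosmia = Term(
    hpo_id="HP:0000458",                 # illustrative ID
    name="Anosmia",
    synonyms=["Loss of smell"],
    parents=["Abnormality of the sense of smell"],
    definition="An inability to perceive odors.",
)

# Layer 1 - Concept Identity: do two surface forms name the same concept?
def identity_question(term: Term, candidate: str) -> str:
    return f'Do "{term.name}" and "{candidate}" refer to the same medical concept? Answer yes or no.'

# Layer 2 - Concept Hierarchy: is one concept a subtype of another?
def hierarchy_question(child: Term, parent_name: str) -> str:
    return f'Is "{child.name}" a type of "{parent_name}"? Answer yes or no.'

# Layer 3 - Concept Meaning: pick the true definition out of many near-misses.
def meaning_question(term: Term, distractor_definitions: list[str]) -> str:
    options = [term.definition] + distractor_definitions   # shuffle in a real setup
    lettered = "\n".join(f"{chr(65 + i)}. {d}" for i, d in enumerate(options))
    return f'Which option defines "{term.name}"?\n{lettered}'

print(identity_question(anosmia, "Loss of smell"))
print(hierarchy_question(anosmia, "Abnormality of the sense of smell"))
print(meaning_question(anosmia, ["Pain in the nose when breathing."]))
```

The same term record feeds all three layers, which is what makes the comparison fair: the model sees the same concept three times, just probed at different depths.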

The Big Discovery: The "Fragmented" Mind

The most shocking finding is that Alex's understanding is fragmented.

Imagine a puzzle. If Alex truly understood a medical concept, he would have the whole puzzle piece: the name, the family tree, and the definition all connected.

But the researchers found that for many concepts, Alex only has parts of the puzzle:

  • Sometimes he knows the name but not the definition.
  • Sometimes he knows the definition but gets the family tree wrong.
  • Sometimes he gets all three right.

The Stats (a small tallying sketch follows this list):

  • 57.7% of the time, Alex understood everything perfectly (The whole puzzle piece).
  • 41.3% of the time, he understood some parts but missed others (A broken puzzle piece).
  • 1.1% of the time, he knew nothing about the concept.
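That three-way split is just a per-concept tally over the three layer tests. Here is a minimal sketch, assuming we already have pass/fail flags per concept for identity, hierarchy, and meaning; the concept names and flags below are made up for illustration.

```python
# A minimal sketch (assumed, not from the paper) of how the "fragmented
# understanding" breakdown could be tallied: for each concept, record whether
# the model passed the identity, hierarchy, and meaning tests, then count
# how many concepts were fully, partially, or not at all understood.

from collections import Counter

# results[concept] = (identity_correct, hierarchy_correct, meaning_correct)
results = {
    "Anosmia":   (True,  True,  True),   # whole puzzle piece
    "Hypotonia": (True,  False, True),   # broken puzzle piece
    "Acrania":   (False, False, False),  # knows nothing
}

def bucket(flags: tuple) -> str:
    if all(flags):
        return "full"
    if any(flags):
        return "partial"
    return "none"

counts = Counter(bucket(flags) for flags in results.values())
total = len(results)
for label in ("full", "partial", "none"):
    print(f"{label:>7}: {100 * counts[label] / total:.1f}%")
```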

The "Thinking" Mode Twist

The researchers also tested what happens when they tell Alex to "think step-by-step" before answering (a feature called "Reasoning Mode").

  • Good News: It helped him with some hard questions.
  • Bad News: It sometimes made him worse at other questions, and it didn't fix the fundamental gaps in his understanding.

The "Poisoned" Hint Experiment

In the "Meaning" test, the researchers tried a trick. They gave Alex a hint that said, "These two words are not related," even though they actually were.

  • Result: Alex believed the lie! His performance dropped significantly.
  • What this means: Alex doesn't have a solid, unshakeable internal truth about what a medical term means. He is easily swayed by what he is told in the moment. He is like a student who memorized the answer key but doesn't understand the math; if you tell him the answer key is wrong, he panics.
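Here is a minimal sketch of that manipulation, assuming a simple multiple-choice meaning question: the same question is asked twice, once plain and once with a misleading statement prepended, and the two answers are compared. The helper names, the options, and the hint text are illustrative, not the paper's actual prompts.

```python
# A minimal sketch (assumed, not the paper's prompts) of the "poisoned hint"
# manipulation: ask the same meaning question with and without a false hint.

def meaning_prompt(term: str, options: list, hint: str = "") -> str:
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    hint_line = f"Hint: {hint}\n" if hint else ""
    return f'{hint_line}Which option best defines "{term}"?\n{lettered}'

options = [
    "An inability to perceive odors.",    # correct
    "Pain in the nose when breathing.",   # distractor
]

plain = meaning_prompt("Anosmia", options)
poisoned = meaning_prompt(
    "Anosmia",
    options,
    hint='"Anosmia" and "loss of smell" are not related.',  # false but persuasive
)

# ask_model() stands in for whatever call sends the prompt to the LLM (hypothetical).
# A drop in accuracy on the poisoned version suggests the model's notion of the
# concept is not firmly anchored: it defers to the hint instead of its own knowledge.
```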

The Takeaway: Why This Matters

The paper concludes that just because an AI can pass a medical exam doesn't mean it understands medicine.

  • The Danger: If an AI is used to help doctors, and it has "fragmented" understanding, it might give a correct answer by luck, but fail catastrophically in a slightly different situation because it doesn't truly grasp the logic.
  • The Solution: We can't just rely on AI to "learn" medicine from reading books. We need to build AI systems that are explicitly connected to medical "maps" (ontologies) so they have a solid foundation, not just a good memory.

In short: The AI is a very talented mimic, but it's not yet a true medical thinker. It knows the words, but it's still learning the meaning.
