Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction

This study evaluates the reproducibility and robustness of three distinct large language models in extracting mobility functional status from clinical narratives, demonstrating that while prompt variations and higher temperatures can significantly degrade stability, self-consistency via majority voting offers an effective mitigation strategy to enhance reliability without sacrificing predictive performance.

Liu, X., Garg, M., Jeon, E., Jia, H., Sauver, J. S., Pagali, S. R., Sohn, S.

Published 2026-04-05

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have hired three different "super-intelligent assistants" to read thousands of messy doctor's notes and find specific information: Can the patient walk? Can they move objects? Can they change positions?

This paper is like a rigorous "stress test" to see how reliable these assistants are when you ask them the same question over and over, or when you ask the question in slightly different ways.

Here is the breakdown of the study using simple analogies:

1. The Three Assistants (The Models)

The researchers tested three different types of AI models, each with a different "personality" and brain structure:

  • The Generalist (Llama 3.3): A massive, dense brain that knows a little bit about everything. It's like a very well-read librarian who has read every book in the library.
  • The Specialist Team (Llama 4): This model uses a "Mixture of Experts" (MoE). Imagine a huge team of 16 specialists where, for every question, a manager only picks 2 of them to answer. It's fast and efficient, but the "manager" might pick different people for the same question if you ask twice.
  • The Medical Doctor (MedGemma): This is a general AI that went to medical school. It has been specifically trained on medical records, so it speaks the language of doctors better than the others.
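To make the "Specialist Team" idea concrete, here is a toy Python sketch of top-2 gating, the core of a Mixture-of-Experts router. This is not Llama 4's actual implementation; the gate weights, vector sizes, and function name are made up for illustration:

```python
def top2_route(token_vec, gate_weights):
    """Score every expert for this token, then keep only the top 2.

    token_vec:    the token's feature vector (a list of floats)
    gate_weights: one scoring row per expert (the "manager")
    """
    # Dot-product score for each expert
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in gate_weights]
    # Rank experts by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:2]  # only these 2 experts actually run for this token
```

Because the chosen pair depends on these scores, tiny shifts in the input (a paraphrase, or sampling noise) can change which experts fire, which is the mechanism behind the instability the paper observes.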

2. The Two Big Problems They Tested

The researchers wanted to know two things: Reproducibility and Robustness.

A. Reproducibility: "The Broken Record Test"

  • The Scenario: You ask the assistant the exact same question, using the exact same words, 100 times in a row.
  • The Question: Does it give you the exact same answer every time?
  • The Analogy: Imagine asking a friend, "Is the sky blue?" 100 times. If they say "Yes" 99 times and "No" once, they aren't very reliable.
  • The Finding: When the researchers turned up the "randomness dial" (called Temperature) to make the AI more creative, the assistants started giving different answers.
    • The Generalist was okay; it changed its mind a little.
    • The Specialist Team (MoE) was chaotic. Because its "manager" kept picking different experts, it gave wildly different answers even with the same question.
    • The Medical Doctor was very consistent, but only if you kept the randomness dial turned all the way down.

B. Robustness: "The Paraphrase Test"

  • The Scenario: You ask the same question, but you change the wording slightly. Instead of "Is the patient walking?", you say "Does the patient ambulate?" or "Can they get around on their feet?"
  • The Question: Does the assistant understand that these are the same question and give the same answer?
  • The Analogy: Imagine asking a waiter, "Is the soup hot?" and then "Is the soup warm?" If the waiter says "Yes" to the first and "No" to the second, the waiter is confused and unreliable.
  • The Finding: This was the biggest shock. Even though the questions meant the same thing, the assistants often gave different answers.
    • The Specialist Team (MoE) failed miserably here. Small changes in wording made it flip-flop completely.
    • The Generalist and Medical Doctor were much better at understanding the intent behind the words, not just the specific words used.

3. The "Temperature" Trap

In AI, Temperature is like a spice level.

  • Low Temperature (0.0): The AI is boring, strict, and deterministic. It gives the same answer every time.
  • High Temperature (1.0): The AI is spicy, creative, and random. It might find a clever new way to answer, but it might also hallucinate or change its mind.
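Under the hood, the "spice level" works by rescaling the model's scores before they are turned into probabilities. A minimal sketch of temperature sampling (illustrative logits; real models do this over tens of thousands of tokens):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an output index; low temperature -> near-deterministic."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring option
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by temperature sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

At temperature 0.0 the same logits always yield the same answer; at 1.0 the lower-scoring options get real probability mass, which is exactly where the run-to-run inconsistency comes from.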

The Big Discovery: The researchers found that Accuracy ≠ Reliability.
Sometimes, turning up the "spice" (Temperature) made the AI slightly more accurate at finding the right medical fact. But it also made the AI much less consistent.

  • Analogy: Imagine a chef who cooks a perfect steak 90% of the time, but the other 10% of the time, they burn it or serve it raw. If you need to feed 1,000 patients, you don't want that chef, even if their "average" steak is delicious. You want the chef who gives you a "good enough" steak 100% of the time.

4. The Solution: The "Committee Vote" (Self-Consistency)

The researchers tried a clever trick to fix the inconsistency. Instead of asking the AI once, they asked it 10 times and took a majority vote.

  • The Analogy: If you ask one person for directions, they might be wrong. If you ask 10 people and take the answer that 7 of them agree on, you are almost certainly right.
  • The Result: This "Committee Vote" made the AI much more stable and reliable, almost eliminating the chaos caused by the "Specialist Team" model. The only downside? It costs more time and computer power (like hiring 10 people instead of one).
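The "Committee Vote" is straightforward to sketch in code. This is a generic self-consistency wrapper, not the paper's exact pipeline; `ask_model` stands in for whatever function queries the LLM once:

```python
from collections import Counter

def self_consistent_answer(ask_model, n_samples=10):
    """Query the model n_samples times and return the majority answer."""
    answers = [ask_model() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The cost is linear in the number of samples (10 calls instead of 1), which is the trade-off the authors note: more compute in exchange for stability.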

The Bottom Line for Real Life

If you are building a medical system to read patient notes:

  1. Don't just look at accuracy. A model that is 95% accurate but changes its mind every time you ask is dangerous.
  2. Keep the "Temperature" low. For medical tasks, you want the AI to be boring and consistent, not creative.
  3. Watch out for "MoE" models. The model that uses a team of experts (Llama 4) was surprisingly sensitive to small changes in how you asked the question.
  4. Use the "Committee Vote" if you can. If you can afford the extra computer time, asking the AI multiple times and voting is a great safety net.

In short: In medicine, stability is just as important as smarts. You want an AI that is a reliable, boring robot, not a brilliant but unpredictable artist.
