Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction

This study evaluates the reproducibility and robustness of three distinct large language models in extracting mobility functional status from clinical narratives, demonstrating that while prompt variations and higher temperatures can significantly degrade stability, self-consistency via majority voting offers an effective mitigation strategy to enhance reliability without sacrificing predictive performance.

Liu, X., Garg, M., Jeon, E., Jia, H., Sauver, J. S., Pagali, S. R., Sohn, S.

Published 2026-04-05

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have hired three different "super-intelligent assistants" to read thousands of messy doctor's notes and find specific information: Can the patient walk? Can they move objects? Can they change positions?

This paper is like a rigorous "stress test" to see how reliable these assistants are when you ask them the same question over and over, or when you ask the question in slightly different ways.

Here is the breakdown of the study using simple analogies:

1. The Three Assistants (The Models)

The researchers tested three different types of AI models, each with a different "personality" and brain structure:

  • The Generalist (Llama 3.3): A massive, dense brain that knows a little bit about everything. It's like a very well-read librarian who has read every book in the library.
  • The Specialist Team (Llama 4): This model uses a "Mixture of Experts" (MoE). Imagine a huge team of 16 specialists where, for every question, a manager only picks 2 of them to answer. It's fast and efficient, but the "manager" might pick different people for the same question if you ask twice.
  • The Medical Doctor (MedGemma): This is a general AI that went to medical school. It has been specifically trained on medical records, so it speaks the language of doctors better than the others.
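To make the "Specialist Team" idea concrete, here is a toy Python sketch of top-2 gating, the core of a Mixture-of-Experts router. This is not Llama 4's actual implementation; the gate weights, vector sizes, and function name are made up for illustration:

```python
def top2_route(token_vec, gate_weights):
    """Score every expert for this token, then keep only the top 2.

    token_vec:    the token's feature vector (a list of floats)
    gate_weights: one scoring row per expert (the "manager")
    """
    # Dot-product score for each expert
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in gate_weights]
    # Rank experts by score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:2]  # only these 2 experts actually run for this token
```

Because the chosen pair depends on these scores, tiny shifts in the input (a paraphrase, or sampling noise) can change which experts fire, which is the mechanism behind the instability the paper observes.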

2. The Two Big Problems They Tested

The researchers wanted to know two things: Reproducibility and Robustness.

A. Reproducibility: "The Broken Record Test"

  • The Scenario: You ask the assistant the exact same question, using the exact same words, 100 times in a row.
  • The Question: Does it give you the exact same answer every time?
  • The Analogy: Imagine asking a friend, "Is the sky blue?" 100 times. If they say "Yes" 99 times and "No" once, they aren't very reliable.
  • The Finding: When the researchers turned up the "randomness dial" (called Temperature) to make the AI more creative, the assistants started giving different answers.
    • The Generalist was okay; it changed its mind a little.
    • The Specialist Team (MoE) was chaotic. Because its "manager" kept picking different experts, it gave wildly different answers even with the same question.
    • The Medical Doctor was very consistent, but only if you kept the randomness dial turned all the way down.

B. Robustness: "The Paraphrase Test"

  • The Scenario: You ask the same question, but you change the wording slightly. Instead of "Is the patient walking?", you say "Does the patient ambulate?" or "Can they get around on their feet?"
  • The Question: Does the assistant understand that these are the same question and give the same answer?
  • The Analogy: Imagine asking a waiter, "Is the soup hot?" and then "Is the soup warm?" If the waiter says "Yes" to the first and "No" to the second, the waiter is confused and unreliable.
  • The Finding: This was the biggest shock. Even though the questions meant the same thing, the assistants often gave different answers.
    • The Specialist Team (MoE) failed miserably here. Small changes in wording made it flip-flop completely.
    • The Generalist and Medical Doctor were much better at understanding the intent behind the words, not just the specific words used.

3. The "Temperature" Trap

In AI, Temperature is like a spice level.

  • Low Temperature (0.0): The AI is boring, strict, and deterministic. It gives the same answer every time.
  • High Temperature (1.0): The AI is spicy, creative, and random. It might find a clever new way to answer, but it might also hallucinate or change its mind.
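Under the hood, the "spice level" works by rescaling the model's scores before they are turned into probabilities. A minimal sketch of temperature sampling (illustrative logits; real models do this over tens of thousands of tokens):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an output index; low temperature -> near-deterministic."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring option
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by temperature sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

At temperature 0.0 the same logits always yield the same answer; at 1.0 the lower-scoring options get real probability mass, which is exactly where the run-to-run inconsistency comes from.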

The Big Discovery: The researchers found that Accuracy ≠ Reliability.
Sometimes, turning up the "spice" (Temperature) made the AI slightly more accurate at finding the right medical fact. But it also made the AI much less consistent.

  • Analogy: Imagine a chef who cooks a perfect steak 90% of the time, but the other 10% of the time, they burn it or serve it raw. If you need to feed 1,000 patients, you don't want that chef, even if their "average" steak is delicious. You want the chef who gives you a "good enough" steak 100% of the time.

4. The Solution: The "Committee Vote" (Self-Consistency)

The researchers tried a clever trick to fix the inconsistency. Instead of asking the AI once, they asked it 10 times and took a majority vote.

  • The Analogy: If you ask one person for directions, they might be wrong. If you ask 10 people and take the answer that 7 of them agree on, you are almost certainly right.
  • The Result: This "Committee Vote" made the AI much more stable and reliable, almost eliminating the chaos caused by the "Specialist Team" model. The only downside? It costs more time and computer power (like hiring 10 people instead of one).
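The "Committee Vote" is straightforward to sketch in code. This is a generic self-consistency wrapper, not the paper's exact pipeline; `ask_model` stands in for whatever function queries the LLM once:

```python
from collections import Counter

def self_consistent_answer(ask_model, n_samples=10):
    """Query the model n_samples times and return the majority answer."""
    answers = [ask_model() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The cost is linear in the number of samples (10 calls instead of 1), which is the trade-off the authors note: more compute in exchange for stability.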

The Bottom Line for Real Life

If you are building a medical system to read patient notes:

  1. Don't just look at accuracy. A model that is 95% accurate but changes its mind every time you ask is dangerous.
  2. Keep the "Temperature" low. For medical tasks, you want the AI to be boring and consistent, not creative.
  3. Watch out for "MoE" models. The model that uses a team of experts (Llama 4) was surprisingly sensitive to small changes in how you asked the question.
  4. Use the "Committee Vote" if you can. If you can afford the extra computer time, asking the AI multiple times and voting is a great safety net.

In short: In medicine, stability is just as important as smarts. You want an AI that is a reliable, boring robot, not a brilliant but unpredictable artist.
