Imagine you are a hospital administrator trying to hire a team of doctors to diagnose a tricky medical case. You don't just want one doctor; you want to know if a whole group of different doctors (some from big cities, some from small towns, some trained differently) would all agree on the same answer.
This paper is like a massive experiment where the researchers gathered 34 different "AI doctors" (Large Language Models) and asked them 169 difficult radiology questions (like "What is wrong with this X-ray?").
They tested these AI doctors in two different ways:
- The "Solo" Mode (Zero-Shot): The AI just looks at the question and guesses based on what it memorized during training.
- The "Research Assistant" Mode (Agentic Retrieval): Before answering, the AI is forced to open a trusted medical textbook, find the relevant facts, read a summary, and then answer.
Here is what they found, explained with some everyday analogies:
1. The "Group Think" Effect (Stability)
The Analogy: Imagine a room full of 34 people trying to guess the weight of a watermelon.
- Solo Mode: Everyone guesses wildly different numbers. Some say 5 lbs, others say 500 lbs. The group is chaotic.
- Research Assistant Mode: You give everyone the exact same ruler and the same textbook page about watermelons. Suddenly, everyone's guesses cluster much closer together.
The Finding: When the AI models were given the same "textbook" (retrieved evidence), they stopped guessing wildly and started agreeing with each other much more often. The "noise" in the room went down.
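The paper's exact metrics aren't reproduced here, but as a hypothetical sketch, "agreement" on a multiple-choice question can be measured as the share of models that pick the modal (most common) answer. The model answers below are made up for illustration; the real study used 34 models.

```python
from collections import Counter

def majority_share(answers):
    """Fraction of models that picked the most common answer.
    1.0 = perfect agreement; near 1/len(answers) = chaos."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

# Hypothetical answers from 6 models on one question.
solo = ["A", "B", "C", "A", "D", "B"]        # scattered guesses
retrieval = ["A", "A", "A", "B", "A", "A"]   # clustered after reading the same evidence

print(majority_share(solo))       # low: answers spread across options
print(majority_share(retrieval))  # high: most models converge on "A"
```

Averaging this share across all 169 questions would give one simple "how loud is the noise in the room" number for each mode.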
2. The "Echo Chamber" Trap (Consensus vs. Correctness)
The Analogy: Imagine a group of tourists trying to find the best restaurant.
- Solo Mode: They split up and find 10 different places.
- Research Assistant Mode: They all read the same travel guide. Now, 30 of them agree on one specific restaurant.
The Catch: Does agreeing mean they are right? Not always.
- Sometimes the travel guide was right, and the group found the best restaurant.
- But sometimes, the travel guide had a typo or a bad review, and because everyone read the same bad guide, all 30 tourists agreed on the worst restaurant.
The Finding: The AI models agreed much more often when they used the "Research Assistant." Usually, this agreement meant they were right. But occasionally, they all agreed on a wrong answer because they were all looking at the same misleading evidence. This is called a "coordinated failure."
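To make "coordinated failure" concrete, here is a hypothetical classifier (the threshold and labels are illustrative, not from the paper): a question where most models agree *and* the agreed answer is wrong is exactly the "everyone read the same bad travel guide" case.

```python
from collections import Counter

def consensus_outcome(answers, correct, threshold=0.8):
    """Classify one question: did most models agree, and was the agreed answer right?"""
    top, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < threshold:
        return "no consensus"
    return "consensus correct" if top == correct else "coordinated failure"

# Hypothetical: 5 models all retrieve the same misleading passage and converge on "C",
# but the reference answer is "B".
print(consensus_outcome(["C", "C", "C", "C", "B"], correct="B"))  # coordinated failure
print(consensus_outcome(["B", "B", "B", "B", "B"], correct="B"))  # consensus correct
```

The key point is that agreement and correctness are measured separately: a high consensus rate tells you nothing by itself until you also check which bucket each consensus falls into.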
3. The "Confident Fool" (Verbosity)
The Analogy: Think of a student taking a test.
- Student A: Writes a 5-page essay explaining their answer.
- Student B: Just writes "B."
The Finding: You might think the student who wrote 5 pages is smarter and more confident. But the researchers found that length doesn't equal correctness.
- The AI models wrote long, detailed answers whether they were right or wrong.
- Just because an AI gives you a long, fancy explanation doesn't mean it's telling the truth. It's just "talking a lot."
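One simple way to check the "length doesn't equal correctness" claim (a sketch with made-up word counts, not the paper's data) is to compare the average answer length of correct versus incorrect responses. If verbosity tracked truth, the two averages would differ sharply.

```python
def mean_length(lengths, labels, correct=True):
    """Average answer length among correct (or incorrect) responses."""
    vals = [n for n, ok in zip(lengths, labels) if ok == correct]
    return sum(vals) / len(vals)

# Hypothetical word counts and correctness flags for 8 answers.
lengths = [410, 395, 420, 405, 415, 400, 390, 425]
labels  = [True, False, True, False, True, True, False, False]

print(mean_length(lengths, labels, correct=True))   # average length of right answers
print(mean_length(lengths, labels, correct=False))  # average length of wrong answers
```

In this toy example the two means land within a few words of each other, which is the pattern the paper reports: long, confident-sounding answers on both sides.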
4. The "Safety Net" (Robustness)
The Analogy: Imagine a bridge.
- Solo Mode: If you remove one specific type of bolt, the bridge might collapse.
- Research Assistant Mode: The bridge is built so that even if you swap out different types of bolts (different AI models), the bridge still holds up.
The Finding: With the "Research Assistant" method, switching to a different model was much less likely to flip the answer from right to wrong. "Correctness" became more stable across the whole team. However, there were still rare cases where the whole team collapsed at once (the "coordinated failure" mentioned earlier).
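Robustness to model choice can be sketched as a "flip rate" (an illustrative metric, not the paper's own): for each question, does the outcome depend on which model you happened to pick? The correctness grids below are invented for illustration.

```python
def flip_rate(correct_by_model):
    """Fraction of questions where switching models can flip the outcome,
    i.e., some models are right and others wrong on the same question.
    correct_by_model: rows = questions, columns = models, entries 1/0."""
    flips = sum(1 for row in correct_by_model if 0 < sum(row) < len(row))
    return flips / len(correct_by_model)

# Hypothetical correctness grids for 3 questions answered by 4 models.
solo      = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
retrieval = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]]  # middle row: coordinated failure

print(flip_rate(solo))       # every question depends on which model you pick
print(flip_rate(retrieval))  # no flips at all, but question 2 fails for everyone
```

Note the trade-off the sketch exposes: the retrieval grid has a flip rate of zero, yet its middle question is wrong for every model at once, so low flip rate alone is not a safety guarantee.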
5. The "Real World Stakes" (Severity)
The Analogy: If a doctor makes a mistake, is it just a typo, or does it put a patient in danger?
- The researchers asked real human radiologists to grade the mistakes the AI made.
- They found that 72% of the AI's mistakes were serious. They weren't just "low severity" errors; they were the kind of mistakes that could lead to delayed treatment or wrong surgeries.
The Finding: Even though the AI models became more stable and agreed more often, the mistakes they did make were still dangerous. Fixing the "agreement" didn't automatically fix the "safety."
The Big Takeaway
This paper teaches us a very important lesson about AI in medicine:
Just because a bunch of AI models agree with each other doesn't mean they are right.
Using a "Research Assistant" (retrieving facts) helps AI models stop guessing and start agreeing, which is good. But it also creates a risk where they all agree on the wrong thing if the source material is flawed.
The Bottom Line: We can't just look at "accuracy" or "agreement" to trust AI. We need to check if they are stable, if they are robust against changes, and most importantly, what happens if they are wrong. In medicine, a confident, agreed-upon wrong answer is still a disaster.