Imagine you are a detective trying to solve a mystery, but the clues you have are incredibly tricky. Two suspects look almost identical, wear the same clothes, and stand in the same spot. However, one is a harmless tourist, and the other is a dangerous criminal. If you pick the wrong one, the consequences are huge.
This is exactly the problem doctors face with certain diseases. Sometimes, a skin mole that looks like a harmless birthmark is actually melanoma (skin cancer). Sometimes, what looks like fluid in the lungs on a chest X-ray is actually pneumonia (an infection), or vice versa. The images look nearly the same, but the treatments are completely different.
This paper is a pilot study asking a big question: Can AI "agents" (smart computer programs) figure out these tricky cases without any special training, just by looking at the picture?
Here is the breakdown of their experiment and findings, using some everyday analogies:
1. The Problem: The "Twin" Confusion
The researchers focused on two pairs of "medical twins":
- Melanoma vs. Atypical Nevus: A dangerous skin cancer vs. a weird-looking but harmless mole.
- Edema vs. Pneumonia: Fluid overload in the lungs vs. a lung infection.
In a real hospital, doctors use patient history, blood tests, and experience to tell them apart. But the researchers wanted to see if an AI could do it just by looking at the image, with no extra help and no prior training on these specific cases. This is called a "Zero-Shot" setting.
2. The Old Way: The Overconfident Single Detective
Usually, when you ask a standard AI to diagnose a picture, it acts like a single, overconfident detective.
- It looks at the image.
- It picks a suspect (e.g., "It's definitely pneumonia!").
- It immediately starts making up reasons to support its choice, even if the evidence is shaky.
- The Flaw: Because the images are so similar, the AI often guesses wrong and then confidently lies to itself to justify the wrong guess. This is called "hallucination." The sketch below shows what this one-prompt setup looks like.
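To make the baseline concrete, here is a minimal Python sketch of the single-detective setup. The `call_vision_model` helper is a hypothetical stand-in for whatever vision-language model API you have access to; neither it nor the prompt wording comes from the paper.

```python
def call_vision_model(prompt: str, image_path: str) -> str:
    """Hypothetical helper: send an image plus a text prompt to a
    vision-language model and return its free-text answer."""
    raise NotImplementedError("plug in your model API here")


def single_agent_diagnosis(image_path: str) -> str:
    # One prompt, one answer: the model commits to a label and then
    # justifies it, with nothing checking that justification against
    # the image. This is where hallucination creeps in.
    prompt = (
        "Look at this skin lesion image. Is it melanoma or an atypical "
        "nevus? Pick one and explain your reasoning."
    )
    return call_vision_model(prompt, image_path)
```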
3. The New Solution: The "Courtroom" System (CARE)
The researchers built a new system called CARE (Contrastive Agent Reasoning). Instead of one detective, they set up a mini-courtroom with three roles:
- The Prosecutor (Agent A): Their only job is to argue why the image is Disease A (e.g., Melanoma). They must find evidence to support this, ignoring everything else.
- The Defense Attorney (Agent B): Their only job is to argue why the image is Disease B (e.g., Atypical Nevus). They must find evidence to support this.
- The Judge (Agent C): This agent doesn't argue. It looks at the original photo and listens to both sides. Its job is to fact-check. It asks: "Prosecutor, you said the mole is chaotic, but looking at the photo, it's actually very symmetrical. That's a lie." or "Defense, you said the lung opacity is only on the right, but the photo shows it on both sides."
The Judge then weighs the arguments, throws out the fake evidence, and makes the final call. The sketch below shows how this three-role pipeline might look in code.
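Here is a minimal Python sketch of the courtroom pipeline, reusing the hypothetical `call_vision_model` helper from the baseline sketch above. The prompts are illustrative guesses at the idea, not the paper's actual implementation.

```python
def contrastive_diagnosis(image_path: str, disease_a: str, disease_b: str) -> str:
    # Agent A (the prosecutor): argue only for disease A.
    case_a = call_vision_model(
        f"Argue that this image shows {disease_a}. List every visual "
        f"feature that supports {disease_a}, and nothing else.",
        image_path,
    )
    # Agent B (the defense): argue only for disease B.
    case_b = call_vision_model(
        f"Argue that this image shows {disease_b}. List every visual "
        f"feature that supports {disease_b}, and nothing else.",
        image_path,
    )
    # Agent C (the judge): re-check both arguments against the image,
    # discard any claim the image contradicts, and make the final call.
    verdict = call_vision_model(
        "Two arguments about this image follow. Verify each claimed "
        "visual feature against the image itself, ignore any claim the "
        f"image contradicts, then decide: {disease_a} or {disease_b}?\n\n"
        f"Argument for {disease_a}:\n{case_a}\n\n"
        f"Argument for {disease_b}:\n{case_b}",
        image_path,
    )
    return verdict
```

For the skin pair from the study, the call would be `contrastive_diagnosis("lesion.jpg", "melanoma", "atypical nevus")`; the chest X-ray pair works the same way with "edema" and "pneumonia".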
4. The Results: Better, But Not Perfect
The researchers tested this on thousands of images. Here is what happened:
- The Single Detective (Standard AI): Got about 66% of the skin cancer cases right. It was often confused and made up fake reasons.
- The Courtroom System (CARE): Got about 77% of the cases right.
- Why it worked: Forcing the AI to argue both sides, then fact-checking each argument against the actual image, let the system catch its own mistakes. Fabricated evidence got thrown out by the "Judge" before it could sway the verdict.
However, there is a catch: Even with the courtroom system, the AI was still only right about 77% of the time. For a doctor to trust an AI with a patient's life, they usually need it to be right 95%+ of the time. So, while the new method is a huge improvement, it is not ready to replace doctors yet.
5. The Takeaway
Think of this study as a proof-of-concept. It shows that if you give AI a structure to disagree with itself and check its own work against the picture, it becomes much smarter.
- The Good News: We found a way to make AI less overconfident and less likely to lie about what it sees.
- The Bad News: Medical images are incredibly complex. Even with a "courtroom" of AIs, they still make too many mistakes to be used in a real hospital today.
In short: The researchers built a team of AI lawyers and a judge to solve medical puzzles. They did a better job than a single AI, but they still aren't good enough to be hired as doctors just yet. We need to keep training them and giving them better tools before we let them make life-or-death decisions.