CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

The paper introduces CARE, an evidence-grounded agentic framework that enhances clinical accountability and reasoning accuracy in multi-modal medical AI by decomposing tasks into specialized modules for entity proposal, pixel-level localization, and evidence-based reasoning, thereby outperforming state-of-the-art models on medical VQA benchmarks.

Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu

Published 2026-03-12

Imagine you are a patient sitting in a doctor's office. You have an X-ray or a CT scan, and you ask, "What's wrong with me?"

In the world of Artificial Intelligence, most current "medical AI" models act like a brilliant but overconfident student who glances at your scan for a split second and immediately blurts out an answer. They might get lucky, but often they are just guessing based on patterns they've seen before, without actually looking at the specific spot on your image. If they get it wrong, they can't explain why, and they might even invent symptoms that aren't there (a problem called "hallucination"). This is risky because, in medicine, you need to know why a doctor made a diagnosis, not just what the diagnosis is.

The paper introduces CARE (Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework). Think of CARE not as a single student, but as a highly organized medical team working together to solve your case.

Here is how CARE works, using a simple analogy:

The Problem: The "Black Box" Doctor

Current AI models are like a Black Box Doctor. You hand them a photo, and they give you an answer. You have no idea if they looked at the right spot, or if they just guessed based on the color of the photo. If they say, "You have pneumonia," you don't know if they actually saw the pneumonia or just saw a dark spot and assumed.

The Solution: The CARE Team

CARE breaks the job down into three specialized roles, mimicking how a real human doctor thinks:

1. The Triage Nurse (Medical Entity Proposal)

Instead of guessing the whole disease immediately, the first AI (the "Triage Nurse") looks at your question and the image and says: "Okay, the patient is asking about their lungs. I should focus on the left and right lungs, not the heart or the bones."

  • What it does: It identifies the specific body parts or features relevant to the question.
  • Why it helps: It stops the AI from wasting time looking at the wrong things.

2. The Specialist Technician (Entity Referring Segmentation)

Once the Nurse says, "Look at the left lung," the second AI (the "Technician") steps in. This is an expert at drawing precise outlines. It doesn't just guess; it draws a pixel-perfect mask around the suspicious area in the lung.

  • What it does: It creates a "highlighter" effect, isolating the exact spot of interest.
  • Why it helps: It provides hard evidence. It's like the doctor putting a magnifying glass over the specific spot and saying, "Here is the problem area."

3. The Senior Diagnostician (Evidence-Grounded VQA)

Now, the third AI (the "Senior Doctor") gets the full picture. It sees the original image, but it also sees the "highlighted" area created by the Technician. It reasons through the problem: "I see the highlighted area in the left lung. It looks dense and white. Based on my training, this looks like pneumonia."

  • What it does: It makes the final diagnosis, but it must base it on the evidence provided by the previous steps.
  • Why it helps: It prevents the AI from making up facts. If the evidence doesn't support the answer, the system is designed to catch it.
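
The three roles above can be sketched as a simple data pipeline. This is a toy illustration only: the function names, the keyword-matching "proposal" step, and the dict-of-pixel-sets "image" are all hypothetical stand-ins, not the paper's actual models or API.

```python
# Hypothetical sketch of CARE's three-stage flow. All names and logic
# are illustrative toys, not the paper's actual implementation.

def propose_entities(question: str) -> list[str]:
    """Stage 1 (Medical Entity Proposal): pick anatomy relevant to the question."""
    vocab = {"lung": ["left lung", "right lung"], "heart": ["heart"]}
    return [e for key, ents in vocab.items() if key in question.lower() for e in ents]

def segment(image: dict, entities: list[str]) -> dict[str, set]:
    """Stage 2 (Entity Referring Segmentation): one pixel mask per entity.
    Here 'image' is just a dict mapping entity name -> set of (row, col) pixels."""
    return {e: image.get(e, set()) for e in entities}

def answer(question: str, masks: dict[str, set]) -> str:
    """Stage 3 (Evidence-Grounded VQA): answer only from the masked evidence."""
    evidence = {e: m for e, m in masks.items() if m}
    if not evidence:
        return "No supporting evidence found in the highlighted regions."
    return f"Finding localized in: {', '.join(sorted(evidence))}."

# Toy "image": only the left lung contains annotated pixels.
image = {"left lung": {(3, 4), (3, 5)}, "right lung": set()}
question = "Is there an opacity in the lungs?"
entities = propose_entities(question)          # stage 1
masks = segment(image, entities)               # stage 2
print(answer(question, masks))                 # stage 3
```

The key design point survives even in this toy: stage 3 never sees the raw question alone; it can only speak about regions that stages 1 and 2 actually highlighted, which is what "evidence-grounded" means here.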

The Manager: The Coordinator

To make sure this team works perfectly, CARE has a Manager (called the "Coordinator").

  • The Job: The Manager decides which tools to use. Do we need to zoom in? Do we need to draw a mask? Or is the whole image enough?
  • The Safety Net: The Manager also acts as a Quality Control Inspector. After the Senior Doctor gives an answer, the Manager reviews the logic: "Wait, you said it's pneumonia, but your reasoning said the area is clear. That doesn't make sense. Let's re-check."
  • The Result: If the Manager catches a mistake, they fix it before giving the final answer to you.
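
The Coordinator's quality-control loop can be sketched in a few lines. Again, everything here is a hedged assumption for illustration: the toy "density" rule, the string-matching consistency check, and the retry count are invented stand-ins for the real system's tool calls and review logic.

```python
# Hypothetical sketch of the Coordinator's review loop. The diagnostician
# rule and the consistency check are illustrative toys, not the paper's code.

def diagnostician(evidence: dict) -> tuple[str, str]:
    """Returns (answer, reasoning). Toy rule: a dense region -> 'pneumonia'."""
    if evidence.get("density", 0) > 0.5:
        return "pneumonia", "the highlighted region is dense"
    return "no finding", "the highlighted region is clear"

def consistent(ans: str, reasoning: str) -> bool:
    """Coordinator's check: a positive answer must not contradict its reasoning."""
    return not (ans == "pneumonia" and "clear" in reasoning)

def coordinate(evidence: dict, max_retries: int = 2) -> str:
    """Run the diagnostician, verify answer vs. reasoning, re-check on mismatch."""
    for _ in range(max_retries + 1):
        ans, why = diagnostician(evidence)
        if consistent(ans, why):
            return f"{ans} (because {why})"
        # In the real system, the Coordinator would re-invoke tools here
        # (zoom in, re-segment) before asking the diagnostician again.
    return "inconclusive: answer and reasoning disagree"

print(coordinate({"density": 0.8}))
```

The point of the loop is the `consistent` gate: an answer only leaves the system if its stated reasoning supports it, which is the "Quality Control Inspector" role described above.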

Why is this a big deal?

  1. No More Guessing: Forcing the AI to "point" to the evidence before answering keeps it from hallucinating (making things up).
  2. Transparency: You can see the "thought process." You can see exactly which part of the image the AI looked at to make its decision. This is what doctors call "accountability."
  3. Better Performance: The paper reports that this team approach outperforms much larger, more expensive single AI models on medical VQA benchmarks, even though CARE uses smaller models and less computing power.

The Bottom Line

Imagine if your AI doctor didn't just give you a verdict, but instead walked you through the exam room, pointed to the X-ray, and said, "See this white spot here? That's what I'm looking at. Based on that, here is my diagnosis."

CARE is that kind of AI. It turns medical diagnosis from a "magic trick" into a transparent, evidence-based process, making it safer and more trustworthy for real-world healthcare.