Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging

Imagine you have a brilliant, world-class medical student who is incredibly smart but also a bit of a "one-trick pony." They can look at an X-ray and write a report, but they do it in one go, without checking their work, without asking for help, and without realizing when they've mixed up "left" and "right" or missed a tiny crack in a bone. If they make a mistake, they just hand you the report, and you have to hope it's right.

This paper introduces a new system called R4 (Router, Retriever, Reflector, Repairer) that turns that single, brilliant student into a highly efficient medical team. Instead of one person doing everything alone, R4 breaks the job down into four specialized roles that work together to make the final report as accurate and safe as possible.

Here is how the team works, using a simple analogy of a Newsroom trying to publish a breaking story about a patient's health:

1. The Router (The Editor-in-Chief)

What it does: Before the work even starts, this agent looks at the patient's file. Is this a heart patient? A cancer patient? Is it a chest X-ray or a CT scan?
The Analogy: Imagine a news editor receiving a story. If it's about a hurricane, they don't send a sports reporter; they send a weather expert. The Router looks at the patient's history and the image type, then says, "Okay, this is a heart case. Let's wake up the Cardiology Specialist version of our AI, not the general one." It sets the stage so the right expert is working on the right problem.

2. The Retriever (The Researcher & Draft Writer)

What it does: This agent doesn't just guess. It looks at a "memory bank" of past successful cases that are similar to the current one. It then tries to write the report multiple times (like drafting three different versions of a story). At the same time, it draws boxes around the problem areas on the X-ray (like circling the storm on a map).
The Analogy: The Retriever is like a journalist who says, "I remember a similar story from last year; let's use that as a guide." They write three different drafts of the article and draw three different maps. They don't settle for the first thing that comes to mind; they generate options.

3. The Reflector (The Fact-Checker)

What it does: This is the most critical safety step. It takes every draft and every map and ruthlessly critiques them. It looks for specific, dangerous errors:

Did we say "no pneumonia" when there actually is pneumonia? (Negation error)
Did we say "left lung" when it's actually the "right lung"? (Laterality error)
Did we claim something is broken without any evidence? (Unsupported claim)
The Analogy: The Reflector is the strict editor with a red pen. They read the drafts and say, "Wait, you said the patient is fine, but the X-ray shows a shadow! You also mixed up left and right! This draft is dangerous." They create a "To-Do List" of errors that must be fixed.

4. The Repairer (The Fixer)

What it does: This agent takes the "To-Do List" from the Reflector and fixes the report and the maps. It rewrites the text and redraws the boxes to be more accurate. It does this in a loop: Fix -> Check -> Fix again until the Fact-Checker is happy.
The Analogy: The Repairer is the writer who goes back to their desk, reads the editor's notes, and rewrites the story. They don't just make a small tweak; they overhaul the draft until the Fact-Checker gives it a "Green Light."

The Secret Sauce: The "Memory Bank"

The paper also mentions that this system gets smarter over time. Every time the team successfully fixes a difficult case, they save that "perfect final version" into their memory bank. Next time a similar patient comes in, the Retriever can pull that perfect example out and say, "Hey, we solved this exact problem before! Let's use that as a guide."
The Analogy: It's like a newsroom that keeps a "Hall of Fame" of their best articles. When a new storm hits, they don't start from scratch; they look at how they handled the last big storm and use that experience to do it even better.

Why is this a big deal?

No Re-training: Usually, to make an AI smarter, you have to feed it millions of new pictures and re-teach it (which is expensive and slow). R4 doesn't do that. It just uses better processes (the team workflow) to get better results.
Safety: Medical AI often "hallucinates" (makes things up). By having a dedicated "Fact-Checker" (Reflector) and a "Fixer" (Repairer), the system is designed to catch these errors before they reach the doctor.
Precision: It doesn't just write words; it also draws boxes around the problems. Its ability to point exactly at the disease on the X-ray (measured as mAP50) improved by roughly 2.5 to 3.5 absolute points over the single-model baselines.

In short: Instead of relying on one smart AI to get it right the first time, R4 builds a team of specialists that plan, draft, critique, and fix the work, learning from past successes to ensure the final medical report is safe, accurate, and trustworthy.

1. Problem Statement

Medical image analysis, particularly in radiology, increasingly relies on Large Vision-Language Models (VLMs). However, current state-of-the-art systems suffer from several critical limitations:

Black-Box Nature: Most systems operate as monolithic, single-pass models that lack control over reasoning processes, error detection, and safety checks.
Lack of Spatial Grounding: While these systems generate free-text reports, they often fail to provide accurate spatial localization (bounding boxes) for abnormalities, so "hallucinated" or unsupported claims are hard to detect.
Brittleness and Safety: Single-pass generation is prone to subtle clinical errors (e.g., negation errors, laterality swaps, missing findings) and cannot easily adapt to heterogeneous clinical contexts (e.g., oncology vs. cardiology) without retraining.
Static Prompting: A "one-size-fits-all" prompt is insufficient for diverse patient histories and imaging modalities.

The authors propose that agentic decomposition—breaking the workflow into coordinated, specialized steps—is necessary to transform brittle VLMs into reliable, self-improving clinical tools without requiring gradient-based fine-tuning.

2. Methodology: The R4 Framework

The authors introduce R4, an agentic framework consisting of four coordinated agents that process medical images ( $x$ ), queries ( $q$ ), patient history ( $h_{pat}$ ), and metadata ( $z$ ). The system produces a structured output comprising a free-text report ( $r$ ) and a set of bounding boxes ( $B$ ).

Core Components:

Router Agent:
- Function: Acts as a high-level controller. It analyzes the input context (image, history, metadata) to select the optimal LLM specialization (e.g., chest-radiology vs. oncology focus) and prompting mode (zero-shot, few-shot, or Chain-of-Thought).
- Mechanism: Uses heuristic rules or a lightweight VLM call to output a routing decision ( $s, m, F$ ) that tailors the downstream processing to the specific clinical case.
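
The heuristic variant of the Router can be sketched as a few keyword rules. This is an illustration only: the keyword lists, the word-count threshold, and the reading of $F$ as a set of retrieval filters are all assumptions, since the summary does not define them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoutingDecision:
    specialization: str                          # s in the text
    mode: str                                    # m: "zero-shot", "few-shot", or "cot"
    filters: dict = field(default_factory=dict)  # F (assumed to be retrieval filters)


def route(history: str, metadata: dict) -> RoutingDecision:
    """Pick a specialization and prompting mode from keywords (hypothetical rules)."""
    text = history.lower()
    if any(w in text for w in ("tumor", "metastasis", "chemotherapy", "oncology")):
        spec = "oncology"
    elif any(w in text for w in ("heart failure", "cardiomegaly", "cardiac")):
        spec = "cardiology"
    else:
        spec = "chest-radiology"
    # Arbitrary threshold: long histories get chain-of-thought prompting.
    mode = "cot" if len(text.split()) > 30 else "few-shot"
    filters = {"modality": metadata.get("modality", "unknown"), "specialization": spec}
    return RoutingDecision(spec, mode, filters)
```

A lightweight VLM call could replace the keyword rules while returning the same `RoutingDecision` shape, so downstream agents would not need to change.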
Retriever Agent:
- Function: Generates candidate outputs using a pass@ $k$ strategy.
- Mechanism:
  - Queries a persistent Exemplar Memory ( $M$ ) to retrieve top- $k$ few-shot examples that match the current task, specialization, and patient context.
  - Generates $k$ candidate drafts ( $d_j$ ) and corresponding bounding boxes ( $B_j$ ) in parallel.
  - The bounding box generator is a sub-agent that localizes abnormalities based on the draft text and image.
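
One simple way to picture the memory query is a ranking by tag overlap. The sketch below is not the paper's retrieval method: the `Exemplar` fields, the Jaccard similarity, and the bonus for a matching specialization are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    specialization: str
    tags: frozenset
    report: str


def retrieve_top_k(memory, specialization, query_tags, k=3):
    """Return the k exemplars most similar to the current case."""
    query = frozenset(query_tags)

    def score(ex):
        overlap = len(ex.tags & query)
        union = len(ex.tags | query) or 1  # avoid division by zero
        bonus = 1.0 if ex.specialization == specialization else 0.0
        return bonus + overlap / union     # Jaccard similarity plus bonus

    # sorted() is stable, so ties keep their order of insertion in memory.
    return sorted(memory, key=score, reverse=True)[:k]
```

The retrieved exemplars would then be placed in the few-shot prompt for each of the $k$ candidate drafts.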
Reflector Agent:
- Function: Critiques the draft-report and bounding-box pairs to identify clinical failure modes.
- Mechanism: Analyzes the triple $(x, d_j, B_j)$ and outputs a structured list of issues ( $I_j$ ) categorized into specific error types:
  - Negation: Incorrect handling of "no" or "absence."
  - Laterality: Left/Right confusion.
  - Unsupported: Claims not backed by image evidence.
  - Contradiction: Internal logical inconsistencies.
  - Missing Findings: Omission of key abnormalities.
  - Localization Errors: Misaligned bounding boxes.
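
The six categories above map naturally onto a small schema. The sketch below assumes the Reflector returns its critique as a JSON list of objects with `type` and `detail` keys; that output format, and the identifier names, are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass
from enum import Enum


class IssueType(Enum):
    NEGATION = "negation"
    LATERALITY = "laterality"
    UNSUPPORTED = "unsupported"
    CONTRADICTION = "contradiction"
    MISSING_FINDING = "missing_finding"
    LOCALIZATION = "localization"


@dataclass(frozen=True)
class Issue:
    type: IssueType
    detail: str


def parse_issues(raw: str) -> list[Issue]:
    """Turn a JSON critique into a validated issue list."""
    issues = []
    for item in json.loads(raw):
        try:
            kind = IssueType(item["type"])
        except (KeyError, TypeError, ValueError):
            continue  # drop entries outside the six known categories
        issues.append(Issue(kind, item.get("detail", "")))
    return issues
```

Restricting the critique to a closed set of categories keeps the Repairer's instructions specific, rather than leaving it to interpret free-form feedback.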
Repairer Agent:
- Function: Iteratively refines the output based on the Reflector's critique.
- Mechanism:
  - Selects the best initial draft based on a penalty score derived from the Reflector's issues.
  - Enters a Reflect-Repair loop: The Reflector re-evaluates the updated draft, and the Repairer generates a new version ( $d^{(t+1)}, B^{(t+1)}$ ) constrained by the identified issues.
  - This loop continues until no material issues remain or a maximum iteration limit ( $T$ ) is reached.
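
The selection step and the loop can be sketched as follows. The severity weights are invented for illustration (the summary does not state how the penalty is computed), and `reflect` and `repair` stand in for the VLM-backed agents. A candidate can be any object, for example a (draft, boxes) pair.

```python
# Hypothetical weights: the paper's actual penalty score is not given here.
SEVERITY = {
    "negation": 3, "laterality": 3,
    "missing_finding": 2, "unsupported": 2, "contradiction": 2,
    "localization": 1,
}


def penalty(issues):
    """Sum severity weights over a list of issue-type strings."""
    return sum(SEVERITY.get(issue, 1) for issue in issues)


def reflect_repair(candidates, reflect, repair, max_iters=3):
    """Pick the lowest-penalty candidate, then repair until clean or T is hit."""
    best = min(candidates, key=lambda c: penalty(reflect(c)))
    issues = reflect(best)
    for _ in range(max_iters):           # max_iters plays the role of T
        if not issues:
            break                        # no material issues remain
        best = repair(best, issues)      # new version constrained by the issues
        issues = reflect(best)           # Reflector re-evaluates the update
    return best, issues
```

Returning the remaining issues alongside the final candidate lets a caller tell a clean result apart from one that simply ran out of iterations.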

Self-Improvement via Exemplar Curation:

The system maintains a persistent memory ( $M$ ) of high-quality cases.
After a successful repair cycle, the final output is curated into a new exemplar (with tags for findings, negation, laterality, etc.) and added to the memory.
This allows the system to self-improve over time by retrieving better context for future cases, without updating the underlying VLM parameters.
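
A minimal sketch of the curation gate is shown below, under the assumption that a case is stored only when the Reflector reports no remaining issues and the report is not already in memory. The class and field names are hypothetical, and persistence to disk is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ExemplarMemory:
    items: list = field(default_factory=list)

    def curate(self, report, boxes, remaining_issues, tags):
        """Store a repaired case as an exemplar; return True if it was added."""
        if remaining_issues:
            return False  # only cases with a clean final critique are kept
        if any(e["report"] == report for e in self.items):
            return False  # skip verbatim duplicates
        self.items.append({
            "report": report,
            "boxes": list(boxes),
            "tags": sorted(set(tags)),  # e.g. findings, negation, laterality
        })
        return True
```

Because only the memory changes, the underlying VLM stays frozen; any improvement comes from the quality of the exemplars that later prompts retrieve.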

3. Key Contributions

Agentic Architecture for Medical Imaging: A novel framework that explicitly integrates patient history and metadata into a routing mechanism, moving beyond static prompting to dynamic, specialization-aware configuration.
Joint Text-Box Optimization: Unlike prior works that focus solely on text or coarse alignment, R4 couples global report generation with quantitative localization. The Reflector and Repairer agents simultaneously critique and refine both the narrative and the spatial annotations.
Parameter-Efficient Self-Improvement: The system introduces a lightweight mechanism for continuous learning via exemplar memory curation. It improves performance over time by retrieving better few-shot examples, avoiding the computational cost of gradient-based fine-tuning.
Structured Reflection Loop: The implementation of a dedicated "Reflect" agent that targets specific, clinically critical error modes (negation, laterality, etc.) rather than generic text quality.

4. Experimental Results

The framework was evaluated on Chest X-ray tasks using two datasets: VinBigData (for bounding box detection) and IU Chest X-rays (for report generation).

Baselines: Compared against strong single-pass VLMs (MedGemma, LLaVA-Med, LLaMA-3, Gemini-2.5-Flash, etc.) using Zero-Shot, Few-Shot, and Chain-of-Thought prompting.
Metrics:
- Report Generation: BLEU, ROUGE-L, BERTScore, and an LLM-as-a-Judge score (evaluating coverage, consistency, diagnostic accuracy, style, and clarity).
- Localization: Mean Average Precision at IoU 0.5 (mAP50).
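
For readers unfamiliar with the localization metric: mAP50 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box of the same class reaches 0.5, then averages the resulting average precision over classes. The sketch below shows only the IoU test, assuming boxes are given as `(x_min, y_min, x_max, y_max)`; whether the boundary case uses `>=` or `>` varies between benchmarks.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True if the predicted box counts as a hit at the given IoU threshold."""
    return iou(pred, gt) >= threshold
```

For example, two 10-by-10 boxes offset horizontally by 5 share half their area but have an IoU of only 1/3, so they would not count as a match at the 0.5 threshold.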

Key Findings:

Performance Gains: R4 consistently outperformed single-VLM baselines.
- LLM-as-a-Judge Scores: Improved by +1.7 to +2.5 points (e.g., R4Agent-Gemini achieved 8.02 vs. 5.58 for the baseline).
- Localization (mAP50): Improved by +2.5 to +3.5 absolute points (e.g., R4Agent-Gemini reached 10.97 vs. 7.49 for the baseline).
Clinical Quality: While single-pass models often achieved high stylistic scores, they struggled with diagnostic accuracy and safety. R4 significantly reduced clinical errors (negation, laterality) and improved diagnostic consistency.
Pass@ $k$ Analysis: Increasing the number of candidate passes ( $k$ ) consistently improved both report correctness and spatial grounding, although the gains showed diminishing returns beyond $k=2$ .
No Fine-Tuning: All gains were achieved using frozen VLM backbones, so the improvement comes from agentic control rather than from raw model scaling or retraining.

5. Significance

This work argues for a shift in how medical AI is deployed:

Reliability over Scale: It demonstrates that agentic control (routing, reflection, repair) can yield more reliable and clinically safe outputs than simply scaling up model parameters or fine-tuning.
Clinical Integration: By addressing specific failure modes like laterality and negation, and providing explicit bounding boxes, the system bridges the gap between AI generation and the rigorous safety standards required in radiology.
Adaptability: The self-improving memory mechanism allows the system to adapt to new institutions, modalities, or reporting conventions without retraining, making it a practical solution for heterogeneous real-world clinical environments.

In conclusion, R4 transforms general-purpose VLMs into specialized, self-correcting medical agents, offering a scalable path toward trustworthy AI in clinical imaging.