Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery

Imagine you are a junior doctor learning to diagnose complex diseases from medical images, such as scans of the eye, heart, or bones.

The Old Way (Current AI):
Right now, most medical AI agents are like a junior doctor who has been given a rigid, unchangeable checklist.

"Step 1: Look at the image."
"Step 2: Measure the size."
"Step 3: Compare to the average."

If the hospital switches to a different scanner, or if a patient has a rare condition that doesn't fit the checklist, the AI gets stuck. It can't think outside the box. It's like a cook who has memorized a single sandwich recipe: hand them the ingredients for a pizza and they panic.

The New Way (MACRO):
The paper introduces MACRO, a medical AI that learns more like a human expert than a robot. Instead of sticking to a static checklist, MACRO learns by doing, failing, and then teaching itself new tricks.

Here is how MACRO works, using a simple analogy:

1. The "Mental Notebook" (Experience-Grounded Memory)

Imagine MACRO has a magical notebook. Every time it successfully diagnoses a patient, it doesn't just throw the case away. It writes down: "Hey, for this type of blurry eye image, I found that doing A, then B, then C worked perfectly."

If a new patient comes in with a similar blurry image, MACRO opens its notebook, finds that past success, and says, "I remember how to handle this!" This helps it adapt to new situations immediately, rather than starting from scratch.

2. The "Shortcut Discovery" (Self-Skill Discovery)

This is the coolest part. Imagine MACRO is solving a puzzle. It notices that every time it solves a specific type of problem, it has to do the same three steps over and over again:

Clean the image.
Highlight the edges.
Measure the shape.

In the old way, the AI would have to remember to do all three steps every single time. But MACRO is smart. It realizes, "Wait, I do these three steps together so often, they should be one single step!"

So, it creates a new tool called "Clean-Highlight-Measure" and adds it to its toolbox. Now, instead of taking three steps, it just clicks one button. It's like a carpenter who, after sawing, sanding, and painting a specific type of chair 100 times, invents a single "Chair-Maker" machine to do it all at once.

3. The "Coach" (Reinforcement Learning)

MACRO has a virtual coach (the training loop). When MACRO tries a new shortcut and it works, the coach gives it a high-five (a reward). When it tries a shortcut and fails, the coach says, "Try something else." Over time, MACRO builds a growing library of these "super-tools" (composite tools) that have proven to work.

Why Does This Matter?

Adaptability: Hospitals change. New machines are bought. Diagnostic protocols get updated. The old AI breaks when things change. MACRO just learns the new pattern, creates a new shortcut, and keeps going.
Efficiency: By turning long, complicated sequences into single "super-tools," MACRO gets faster and more accurate.
Real-World Ready: It doesn't need a team of engineers to rewrite its code every time a new medical protocol is introduced. It learns on the job, just like a human doctor does.

In a nutshell:
Current medical AI is like a robot following a script.
MACRO is like a curious apprentice who watches the master, figures out the best ways to do things, writes those ways down as new rules, and gets better every single day without needing a human to rewrite its manual.

Here is a detailed technical summary of the paper "Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery".

1. Problem Statement

Current medical AI agents face a critical limitation: brittleness due to static tool composition.

The Gap: Clinical image interpretation is inherently multi-step, tool-centric, and iterative. However, existing LLM-based medical agents rely on predefined, static tool sets and invocation strategies established at deployment.
The Consequence: When faced with domain shifts (e.g., different scanners, hospitals, or evolving diagnostic protocols), these agents degrade significantly. They lack the mechanism to autonomously discover, validate, and internalize new multi-step routines from experience, requiring costly manual re-engineering to adapt.
The Goal: To shift from static orchestration to experience-driven tool discovery, enabling agents to autonomously grow their behavioral repertoire by synthesizing successful multi-step procedures into reusable "composite tools."

2. Methodology: The MACRO Framework

The authors propose MACRO (Medical Agent for Composite Reasoning and Orchestration), a self-evolving agent framework. The agent's decision process is formulated as a Partially Observable Markov Decision Process (POMDP), and the framework rests on three core pillars:

A. Experience-Grounded Memory

Mechanism: The agent maintains a memory buffer ($M$) storing fragments of successful interactions. Each entry includes the prompt history, tool invocation sequence, results, and an image feature vector.
Retrieval: During inference, the agent retrieves the top-$k$ most similar successful cases, ranked by cosine similarity between image feature vectors, to provide in-context guidance for tool selection.
Update: Successful trajectories are decomposed into step-level entries and added to the memory, so that future reasoning can bootstrap from past successes.
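The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `MemoryEntry` class, the `retrieve_top_k` function, the tool names, and the three-dimensional toy feature vectors are all assumptions made for the example, and a real system would take its features from an image encoder.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    """One step-level record of a successful interaction (hypothetical layout)."""
    tools: List[str]      # tool invocation sequence that worked
    feature: List[float]  # image feature vector of the case


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(memory: List[MemoryEntry], query: List[float], k: int = 2) -> List[MemoryEntry]:
    """Return the k entries whose image features are most similar to the query."""
    ranked = sorted(memory, key=lambda e: cosine(e.feature, query), reverse=True)
    return ranked[:k]


# Toy memory with made-up features and tool names.
memory = [
    MemoryEntry(["resize", "grayscale", "segment"], [1.0, 0.0, 0.0]),
    MemoryEntry(["resize", "detect_edges"], [0.0, 1.0, 0.0]),
    MemoryEntry(["crop", "segment"], [0.7, 0.7, 0.0]),
]

hits = retrieve_top_k(memory, query=[1.0, 0.05, 0.0], k=2)
for entry in hits:
    print(entry.tools)
```

The retrieved tool sequences would then be placed in the agent's prompt as in-context examples. A linear scan is enough for a sketch; a large memory would likely need an approximate nearest-neighbor index instead.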

B. Composite Tool Discovery

Mining: The system analyzes successful execution trajectories to identify recurring contiguous subsequences of atomic tools (e.g., resize $\to$ grayscale $\to$ segment).
Synthesis: If the frequency of a subsequence exceeds a threshold ($\tau$), it is registered as a new Composite Tool ($C$).
Abstraction: These composites act as high-level primitives. The agent can invoke a complex multi-step workflow as a single action, effectively expanding its action space dynamically based on accumulated experience.
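As a rough sketch of the mining and synthesis steps, the code below counts contiguous tool subsequences across successful trajectories and keeps those whose count exceeds $\tau$. Several details are assumptions for illustration only: the function names `mine_composites` and `make_composite`, the length bounds, counting each subsequence at most once per trajectory, and the toy atomic tools that simply append their name to a log.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def mine_composites(
    trajectories: List[List[str]],
    min_len: int = 2,
    max_len: int = 3,
    tau: int = 2,
) -> Dict[Tuple[str, ...], int]:
    """Count contiguous tool subsequences and keep those seen more than tau times."""
    counts: Counter = Counter()
    for traj in trajectories:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(traj) - n + 1):
                seen.add(tuple(traj[i:i + n]))
        counts.update(seen)  # assumption: one vote per trajectory
    return {seq: c for seq, c in counts.items() if c > tau}


def make_composite(steps: Tuple[str, ...], atomic_tools: Dict[str, Callable]) -> Callable:
    """Wrap a sequence of atomic tools as one callable, i.e. a single action."""
    def composite(x):
        for name in steps:
            x = atomic_tools[name](x)
        return x
    return composite


# Toy successful trajectories (tool names are illustrative).
trajectories = [
    ["resize", "grayscale", "segment", "measure"],
    ["resize", "grayscale", "segment"],
    ["crop", "resize", "grayscale", "segment"],
    ["resize", "detect_edges"],
]
composites = mine_composites(trajectories, tau=2)

# Stand-in atomic tools: each one just records that it ran.
atomic_tools = {
    name: (lambda log, n=name: log + [n])
    for name in ("resize", "grayscale", "segment")
}
run = make_composite(("resize", "grayscale", "segment"), atomic_tools)
print(sorted(composites))
print(run([]))
```

Note that short subsequences nested inside a longer frequent one are also registered here. Whether and how the paper prunes such overlaps is not stated in this summary, so the sketch leaves them in.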

C. Two-Stage Policy Optimization

To train the agent to utilize these evolving tools, a two-stage training loop is employed:

Stage 1: Supervised Cold Start (SFT):
- Initializes the policy using a strong teacher VLM (DeepSeek) to generate demonstration trajectories.
- Uses Behavior Cloning with a specific modification: the student executes tools in the environment to generate its own feedback context, mitigating exposure bias.
- During this phase, the memory ($M$) and composite registry ($C$) are populated with successful patterns.
Stage 2: GRPO-based Reinforcement:
- Uses Group Relative Policy Optimization (GRPO) to reinforce the usage of discovered composite tools.
- Reward Function: A sparse reward of +1 is assigned if the agent's trajectory contains any registered composite tool, encouraging the agent to treat these sequences as single conceptual units.
- Goal: Aligns the policy to prefer structured, efficient tool orchestration over random atomic calls.
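To make the reward signal concrete, here is a minimal sketch of the sparse composite-usage reward together with the group-relative normalization that gives GRPO its name. It covers only the reward and advantage computation; the policy update itself is omitted. The function names, the registry contents, the use of population standard deviation, and the `eps` constant are assumptions for illustration and are not taken from the paper.

```python
import statistics
from typing import List, Set


def composite_reward(trajectory: List[str], registry: Set[str]) -> float:
    """Sparse reward: +1 if any action in the trajectory is a registered composite tool."""
    return 1.0 if any(action in registry for action in trajectory) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical registry and a group of four rollouts sampled for the same input.
registry = {"resize_grayscale_segment"}
group = [
    ["resize_grayscale_segment", "measure"],  # uses a composite tool
    ["resize", "grayscale", "segment"],       # atomic calls only
    ["crop", "resize_grayscale_segment"],     # uses a composite tool
    ["detect_edges"],                         # atomic calls only
]

rewards = [composite_reward(t, registry) for t in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that use a composite tool end up with a positive advantage and the others with a negative one, so the update pushes probability toward composite usage. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.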

3. Key Contributions

Paradigm Shift: Identifies and addresses the limitation of static tool sets in medical AI, proposing a shift to experience-driven self-evolution where agents learn to compose their own tools.
Novel Architecture: Introduces MACRO, featuring:
- An image-feature memory for context-grounded retrieval.
- A composite tool synthesis module that autonomously discovers and registers multi-step routines.
- A closed-loop training system combining SFT and GRPO to refine tool usage.
Empirical Validation: Demonstrates that autonomous composite tool discovery significantly improves multi-step orchestration accuracy and cross-domain generalization compared to strong baselines.

4. Experimental Results

The framework was evaluated on three diverse medical imaging datasets: REFUGE2 (Glaucoma), MITEA (Heart Disease), and RAM-W600 (Bone Erosion).

Performance vs. General VLMs: MACRO significantly outperformed state-of-the-art Vision-Language Models (e.g., GPT-4o, LLaVA-Med, Qwen2.5-VL).
- Example (Glaucoma): Achieved 92.7% Balanced Accuracy (BACC) and 80.3% F1, compared to ~54% BACC for the best baseline.
Performance vs. Medical Agentic Systems: Surpassed specialized medical agents (MedAgents, MMedAgent, MedAgent-Pro).
- Example: Outperformed MedAgent-Pro by 2.3% BACC and 3.9% F1 on glaucoma tasks.
Performance vs. Task-Specific Models: Even against models specifically trained on the target dataset (e.g., ResNet, MobileViT for bone erosion), MACRO achieved superior results (61.75% BACC vs. ~52% for baselines), demonstrating the power of adaptive tool integration.
Ablation Studies: Confirmed that each component (Memory, Composite Discovery, GRPO) contributes positively. The full system achieved the highest performance, with composite tool discovery providing the most significant leap in reasoning capability.
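For readers less familiar with the reported metrics, the sketch below computes balanced accuracy (the mean of sensitivity and specificity) and F1 for a binary task using their standard definitions. The labels are made up for illustration and are unrelated to the paper's data; the function name is an assumption.

```python
from typing import List, Tuple


def balanced_accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Binary balanced accuracy and F1, with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    bacc = (sensitivity + specificity) / 2
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return bacc, f1


# Imbalanced toy example: 2 positive cases, 6 negative cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
bacc, f1 = balanced_accuracy_and_f1(y_true, y_pred)
print(bacc, f1)
```

In this toy case plain accuracy is 6/8 = 0.75, while balanced accuracy is lower because half of the positive cases are missed. This is why balanced accuracy is a common choice for imbalanced medical datasets.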

5. Significance and Impact

Adaptability: MACRO bridges the gap between brittle static tools and the dynamic nature of clinical practice. It allows agents to adapt to new imaging protocols or modalities without manual re-design.
Efficiency: By converting recurring multi-step workflows into single "composite" actions, the agent reduces reasoning complexity and improves execution efficiency.
Clinical Relevance: The approach mimics how human clinicians refine their expertise through practice, accumulating reusable diagnostic routines. This offers a viable path toward maintainable clinical deployment, where AI systems can continuously improve and audit their own capabilities over time.
Future Direction: The paper highlights the potential for "self-skill discovery" as a critical component for the next generation of trustworthy, autonomous medical AI.
