Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning

Imagine you are a doctor trying to diagnose a patient's chest X-ray. Instead of relying on just your own eyes, you have a team of specialist AI robots standing around you, each offering their own opinion.

Robot A says: "I see a mild heart enlargement."
Robot B says: "No, I see a severe heart enlargement."

They disagree. Who do you trust?

In the past, doctors (or AI agents) had two bad options:

The "Blind Trust" approach: Just pick the robot that sounds the most confident or gives the longest, most detailed explanation. (But sometimes, the robot that talks the most is just the most confused!)
The "Average" approach: Take the middle ground of all their answers. (But if one robot is right and the other is wrong, the middle ground is still wrong.)

The Problem: The "Resume" vs. The "Track Record"

Most current AI systems look at a robot's resume (its description: "I am an expert in heart diseases") to decide who to trust. But in the real world, a robot with a great resume can still make mistakes on specific types of X-rays. These systems have no way of knowing which robot is actually reliable for the specific picture in front of them.

The Solution: TEA-CXA (The "Smart Intern")

The paper introduces a new AI agent called TEA-CXA. Think of TEA-CXA not as a doctor, but as a super-smart medical intern who learns by doing.

Here is how TEA-CXA learns, using a simple analogy:

1. The "Taste Test" Training

Imagine you are training a food critic. You give them a dish and ask them to guess the ingredients.

Old Way: You tell the critic, "Chef A is a French expert, so trust Chef A."
TEA-CXA Way: You let the critic try different chefs. Sometimes Chef A is right, sometimes Chef B is right.
- If the critic guesses Chef A's answer and it's correct, they get a gold star (reward).
- If they guess Chef B's answer and it's wrong, they get a thumbs down.

Over time, the critic stops looking at the chefs' resumes. Instead, they learn a track record: "Oh, for spicy dishes, Chef A is usually right. But for desserts, Chef B is the one to trust."

2. The "Conflict Resolution" Superpower

In the paper, when the two AI robots give different answers about an X-ray, TEA-CXA doesn't panic. It remembers its training:

"Hmm, this looks like a 'heart size' question. In my past training, Robot A was right 80% of the time on heart questions, even though Robot B wrote a longer explanation."
Decision: TEA-CXA ignores the long explanation and picks Robot A's answer.

3. The "Team Huddle" (Technical Magic)

The researchers also built a special "playground" (a code framework) to make this training possible.

Parallel Play: Asking robots one after another is slow. TEA-CXA asks multiple robots at the same time (like calling three friends at once instead of one by one).
Multi-Image: If the patient has two X-rays (front and side view), TEA-CXA knows exactly which robot to show which picture to, without getting confused by file names.

Why This Matters

The paper reports that this "learning by experience" approach works.

The Result: On the paper's benchmark, TEA-CXA answered chest X-ray questions more accurately than any single robot, than just averaging their answers, and than the existing state-of-the-art systems it was compared against.
The Lesson: It's not about who says they are the expert; it's about who has proven to be the expert on this specific type of problem.

In a nutshell: TEA-CXA is an AI that stops guessing based on who talks the loudest and starts trusting who has the best track record for the specific job at hand. It turns a group of conflicting robots into a better coordinated medical team.

1. Problem Statement

The paper addresses a critical limitation in current Medical AI agents: the inability to resolve conflicts between multiple AI tools.

Context: Medical AI agents often rely on Large Language Models (LLMs) or Multi-Modal Large Language Models (MLLMs) to invoke specialized "tools" (other AI models) for tasks like Chest X-ray (CXR) analysis.
The Gap: Existing approaches either use tools in a "zero-shot" manner (relying only on functional descriptions) or fine-tune agents using pre-constructed traces. Neither approach allows the agent to understand the real-world reliability of specific tools across different query types.
The Challenge: Medical tools are inherently error-prone and often produce contradictory outputs. Without empirical knowledge of which tool is trustworthy for a specific type of image or question, agents cannot effectively decide which output to trust, leading to suboptimal decision-making.

2. Methodology: TEA-CXA

The authors propose TEA-CXA, a framework that enables an agent to empirically learn the practical trustworthiness of tools through Multimodal Agentic Learning using Reinforcement Learning (RL).

A. Core Learning Paradigm

Instead of relying on static descriptions, the agent learns by interacting with tools and observing the outcomes of its choices.

Reinforcement Learning Algorithm: The system uses Group Relative Policy Optimization (GRPO).
Training Process:
1. Rollout Generation: For a given input (CXR image + query), the policy MLLM generates a group of trajectories (rollouts).
2. Tool Invocation: The agent calls multiple tools (e.g., MedGemma and Lingshu) in parallel.
3. Conflict Resolution: When tool outputs disagree, the agent is instructed to experimentally trust one tool's output over the others to form a final answer.
4. Reward Signal: The agent receives a reward based on the correctness of the final answer (Exact Match).
5. Optimization: The policy is updated to maximize the probability of selecting the correct tool response for specific query types, effectively learning a "trust map" for each tool.
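The group-relative part of GRPO can be illustrated with a minimal sketch. GRPO scores each rollout against the other rollouts sampled for the same input, typically by subtracting the group mean reward and dividing by the group standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values below are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    rewards: total rewards of all rollouts sampled for ONE (image, query) input.
    eps: small constant (assumed) that avoids division by zero when every
         rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is another common choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one question:
# two trusted the tool that turned out to be right, two did not.
rewards = [1.2, 0.2, 1.2, 0.2]
advantages = group_relative_advantages(rewards)
```

Rollouts that trusted the correct tool end up with positive advantages and the others with negative ones, so the policy update raises the probability of the better choice. When every rollout in a group gets the same reward, all advantages are zero and that input contributes no learning signal.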

B. Reward Function Design

The total reward $R(\tau)$ is a sum of three components: an outcome reward and two format rewards.

Outcome Reward ( $R_o$ ): 1 if the final answer is correct, 0 otherwise.
Format Rewards ( $R_t, R_a$ ): Small rewards (0.1) for adhering to the prescribed JSON tool-calling format ( $R_t$ ) and for wrapping the final answer in <answer></answer> tags ( $R_a$ ).

Loss masking is applied separately from the reward: gradients are computed only over MLLM-generated tokens and ignore the tool response tokens, so the agent learns to select tool content rather than generate it.
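A rough sketch of how such a reward and loss mask could be computed is given below. Only the <answer></answer> tags, the JSON tool-call requirement, and the values 1 and 0.1 come from the description above. The `<tool_call>` wrapper tag, the case-insensitive exact match, the choice to score only answers that appear inside answer tags, and all function names are assumptions for illustration.

```python
import json
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Hypothetical wrapper tag for tool calls; the real format may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_answer(text):
    """Return the content of the last <answer> block, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


def tool_format_reward(text, bonus=0.1):
    """R_t: bonus if at least one tool call exists and every call is valid JSON."""
    calls = TOOL_CALL_RE.findall(text)
    if not calls:
        return 0.0
    try:
        for call in calls:
            json.loads(call)
    except json.JSONDecodeError:
        return 0.0
    return bonus


def total_reward(text, gold, bonus=0.1):
    """R = R_o + R_t + R_a for one rollout."""
    answer = extract_answer(text)
    # R_o: exact match (case-insensitive here, by assumption).
    r_o = 1.0 if answer is not None and answer.lower() == gold.strip().lower() else 0.0
    r_a = bonus if answer is not None else 0.0  # R_a: answer tags present
    r_t = tool_format_reward(text, bonus)       # R_t: tool-call format
    return r_o + r_t + r_a


def masked_mean(per_token_loss, is_model_token):
    """Average the loss over model-generated tokens only (loss masking)."""
    kept = [loss for loss, keep in zip(per_token_loss, is_model_token) if keep]
    return sum(kept) / len(kept) if kept else 0.0


rollout = (
    '<tool_call>{"name": "medgemma", "arguments": {"image": "Figure 1"}}</tool_call>\n'
    "<answer>B</answer>"
)
reward_if_gold_is_b = total_reward(rollout, "B")  # outcome plus both format rewards
reward_if_gold_is_a = total_reward(rollout, "A")  # format rewards only
```

In `masked_mean`, positions flagged `False` stand for tool response tokens, so a large loss on text the agent did not write has no effect on the update.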

C. Enhanced Code Framework

The authors extended existing text-based RL frameworks (like RL-Factory) to support complex medical multimodal scenarios:

Parallel Tool Inference: Supports multiple tool calls per turn with round-robin API deployment to accelerate inference.
Multi-Image Handling: Allows agents to handle queries with multiple images (e.g., AP, PA, Lateral views) by using image indices (e.g., "Figure 1") rather than file paths, reducing token overhead and generation errors.
Dynamic Trusting: The system prompt explicitly instructs the agent not to bias toward longer or more detailed analyses, forcing it to learn reliability based on empirical success.
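The first two features can be sketched together: several tool calls from one turn run concurrently, each tool draws its endpoint from a round-robin pool, and "Figure N" references are resolved to stored images on the framework side. This is a simplified illustration, not the authors' code. The endpoint URLs, file paths, call dictionary layout, and the `fake_tool` stand-in (which replaces a real network request) are all hypothetical.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


class RoundRobin:
    """Hand out endpoints in rotation; the lock makes it safe across threads."""

    def __init__(self, endpoints):
        self._cycle = cycle(endpoints)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)


def resolve_figures(refs, image_paths):
    """Map 'Figure N' references (1-based) to the stored image paths."""
    resolved = []
    for ref in refs:
        match = re.fullmatch(r"Figure (\d+)", ref.strip())
        if match is None:
            raise ValueError(f"not an image index: {ref!r}")
        index = int(match.group(1))
        if not 1 <= index <= len(image_paths):
            raise IndexError(f"{ref} is out of range")
        resolved.append(image_paths[index - 1])
    return resolved


def fake_tool(name, endpoint, images, question):
    """Stand-in for a real request to a deployed tool (hypothetical)."""
    return {"tool": name, "endpoint": endpoint, "n_images": len(images)}


def call_tools_parallel(calls, image_paths, pools):
    """Run all tool calls of one turn concurrently, keeping their order."""

    def run(call):
        endpoint = pools[call["name"]].next()
        images = resolve_figures(call["figures"], image_paths)
        return fake_tool(call["name"], endpoint, images, call["question"])

    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as executor:
        return list(executor.map(run, calls))


pools = {
    "medgemma": RoundRobin(["http://host-a", "http://host-b"]),
    "lingshu": RoundRobin(["http://host-c"]),
}
calls = [
    {"name": "medgemma", "figures": ["Figure 1", "Figure 2"], "question": "Is the heart enlarged?"},
    {"name": "lingshu", "figures": ["Figure 2"], "question": "Is the heart enlarged?"},
]
results = call_tools_parallel(calls, ["/data/pa.png", "/data/lateral.png"], pools)
```

Because the agent only ever writes short indices such as "Figure 2", it cannot mistype a long file path, and the mapping back to real files stays outside the generated text.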

3. Key Contributions

Tool-Expertise Awareness: Pioneered the concept of training agents to learn the real-world reliability of tools dynamically, moving beyond static functional descriptions.
Multimodal Agentic Learning: Proposed a novel RL framework where agents learn to trust specific tools for specific query types through active interaction and reward feedback.
Robust Medical Codebase: Developed an extensible framework supporting multi-turn, multi-tool, and multi-image interactions tailored for medical scenarios.
State-of-the-Art Performance: Demonstrated significant improvements in Chest X-ray Visual Question Answering (VQA) compared to existing baselines and SOTA methods.

4. Experimental Results

The method was evaluated on CheXbench, a dataset comprising three subsets (Rad-Restruct, SLAKE, OpenI) with 618 multiple-choice questions.

Overall Accuracy: TEA-CXA achieved 73.8% overall accuracy, outperforming all baselines.
- Comparison: It surpassed the best baseline (MedRAX*) by ~4.2% and the Agent-ensemble by ~4.5%.
- Subsets: It achieved the highest accuracy on all three subsets (Rad-Restruct: 69.6%, SLAKE: 95.9%, OpenI: 67.9%).
Tool Response Selection: In cases where tools conflicted and at least one was correct, TEA-CXA selected the correct tool 63.8% of the time. This exceeded the Agent-ensemble (54.0%) and MedRAX* (57.5%), suggesting that it learns to discern tool reliability rather than relying on superficial features (like response length).
Qualitative Analysis: In case studies, TEA-CXA correctly ignored a tool with a detailed but incorrect justification (Lingshu) and trusted a tool with a concise but correct answer (MedGemma), demonstrating its learned "expertise awareness."
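The tool response selection metric can be made concrete with a small sketch. It keeps only cases where the tools disagree and at least one tool is correct, then measures how often the agent's final answer matches the ground truth. Treating a matching final answer as "selecting the correct tool", as well as the record layout and the toy cases, are assumptions for illustration. The numbers below are made up and are not results from the paper.

```python
def selection_accuracy(cases):
    """Share of recoverable conflicts in which the agent ended up correct.

    Each case is a dict with:
      tool_answers: mapping from tool name to that tool's answer
      gold:         the ground-truth answer
      final:        the agent's final answer
    """
    eligible = 0
    correct = 0
    for case in cases:
        answers = set(case["tool_answers"].values())
        conflict = len(answers) > 1          # the tools disagree
        recoverable = case["gold"] in answers  # at least one tool is right
        if conflict and recoverable:
            eligible += 1
            if case["final"] == case["gold"]:
                correct += 1
    return correct / eligible if eligible else float("nan")


# Toy data (hypothetical).
cases = [
    # Tools agree: excluded from the metric.
    {"tool_answers": {"medgemma": "A", "lingshu": "A"}, "gold": "A", "final": "A"},
    # Conflict, but no tool is right: excluded.
    {"tool_answers": {"medgemma": "B", "lingshu": "C"}, "gold": "A", "final": "B"},
    # Conflict, MedGemma right, agent trusted it: counted as correct.
    {"tool_answers": {"medgemma": "A", "lingshu": "D"}, "gold": "A", "final": "A"},
    # Conflict, Lingshu right, agent trusted MedGemma: counted as wrong.
    {"tool_answers": {"medgemma": "B", "lingshu": "C"}, "gold": "C", "final": "B"},
]
accuracy = selection_accuracy(cases)
```

Restricting the metric to recoverable conflicts isolates the agent's choice between tools from the quality of the tools themselves, since cases where no tool is right cannot be fixed by better selection.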

5. Significance

Clinical Applicability: By enabling agents to dynamically select the most reliable tool for a specific medical query, this approach reduces the risk of diagnostic errors caused by blindly trusting a single model or averaging conflicting outputs.
Generalizability: While demonstrated on Chest X-rays, the framework is designed for general medical research and multimodal settings, offering a blueprint for integrating diverse, error-prone AI tools into a cohesive, self-correcting agent system.
Efficiency: The framework's support for parallel inference and multi-image handling addresses practical bottlenecks in deploying medical AI agents in real-world workflows.

In summary, TEA-CXA represents a shift from static tool integration to dynamic, learned trust, allowing medical AI agents to navigate the uncertainty of multiple AI tools effectively.