Imagine you are a doctor trying to diagnose a patient's chest X-ray. In the old days, you might have looked at the X-ray yourself and made a guess. Today, instead of just one doctor, you have a super-smart AI team working together.
This team consists of a Manager (a large language model) and several Specialists (tools that can read X-rays, find specific spots, write reports, or answer questions). The Manager looks at the X-ray, decides which Specialist to call, listens to their advice, and then writes the final diagnosis.
The paper you're asking about, DUCX, is like a fairness inspector sent to audit this AI team. The researchers wanted to know: Is this AI team treating all patients equally, regardless of their age or gender?
Here is the breakdown of their findings using simple analogies:
1. The Problem: The "Hidden" Bias
Most people check if an AI is fair by looking only at the final answer. Did it get the diagnosis right?
- The Old Way: If the AI gets 90% of the answers right for men and 85% for women, we say, "Oh, it's a little unfair," but we have no idea why.
- The DUCX Way: The researchers realized that in a team of specialists, the unfairness might happen during the process, not just at the end. It's like a relay race. If the team loses, it could be because the first runner was slow, the second runner dropped the baton, or the last runner tripped. You need to check every leg of the race to find the problem.
2. The Three Places Where Unfairness Hides
The researchers broke the AI's "thought process" into three stages to see where the bias creeps in:
A. Tool Exposure Bias (The "Who Gets to Speak?" Problem)
Imagine the AI Manager has a list of specialists: a "Nodule Finder," a "Report Writer," and a "Visualizer."
- The Issue: The Manager might decide, "For male patients, I'll call the Nodule Finder. But for female patients, I'll skip that step and just guess."
- The Finding: Even if the tools themselves are perfect, if the Manager doesn't use the best tool for a specific group of people, that group gets a worse diagnosis. In their tests, they found that for some groups, the AI was missing out on crucial tools up to 50% of the time compared to others.
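The paper's exact metric isn't reproduced here, but one simple way to quantify this kind of exposure gap is to compare, per demographic group, how often a given tool was invoked at all. The log format, group labels, and tool names below are illustrative assumptions, not the paper's data:

```python
# Hypothetical audit logs: for each patient case, the demographic
# group and the set of tools the Manager actually invoked.
logs = [
    {"group": "male",   "tools": {"nodule_finder", "report_writer"}},
    {"group": "male",   "tools": {"nodule_finder"}},
    {"group": "female", "tools": {"report_writer"}},
    {"group": "female", "tools": {"nodule_finder", "report_writer"}},
]

def exposure_rate(logs, group, tool):
    """Fraction of cases in `group` where `tool` was invoked at all."""
    cases = [rec for rec in logs if rec["group"] == group]
    hits = sum(tool in rec["tools"] for rec in cases)
    return hits / len(cases)

male = exposure_rate(logs, "male", "nodule_finder")      # 1.0
female = exposure_rate(logs, "female", "nodule_finder")  # 0.5
gap = abs(male - female)  # 0.5 -- a "50% of the time" style disparity
```

A gap near zero means every group gets the specialist equally often; a large gap means one group is routinely denied a tool.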
B. Tool Transition Bias (The "Wrong Path" Problem)
Imagine the AI is navigating a maze to find the answer.
- The Issue: The Manager might take a "shortcut" for one group of people (e.g., men) but force another group (e.g., women) to take a long, winding, confusing path with more steps.
- The Finding: They found that the AI often routed different genders and ages through completely different "paths." For example, it might call a "Visualizer" tool for men but send women straight to a "Classifier" tool. These different paths lead to different levels of confidence and accuracy.
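One way to see "different paths" concretely is to count which tool-to-tool transitions each group's cases go through and check how much the groups overlap. This is a minimal sketch with made-up traces and tool names, not the paper's actual measurement:

```python
from collections import defaultdict

# Hypothetical tool-call sequences: group -> list of per-case traces.
traces = {
    "male":   [["visualizer", "nodule_finder", "report_writer"]],
    "female": [["classifier", "report_writer"]],
}

def transition_counts(seqs):
    """Count every consecutive tool->tool transition across a group's traces."""
    counts = defaultdict(int)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return dict(counts)

per_group = {g: transition_counts(s) for g, s in traces.items()}
# If two groups share no transitions, they took entirely separate routes.
shared = set(per_group["male"]) & set(per_group["female"])
```

In this toy example `shared` is empty: the male and female traces have no tool transition in common, which is exactly the "wrong path" pattern the researchers describe.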
C. LLM Reasoning Bias (The "Confidence" Problem)
Finally, the Manager writes the final report.
- The Issue: Even if the tools gave the same information, the Manager might write the report differently.
- For Group A, it might say: "There is definitely a tumor here."
- For Group B, it might say: "There might be a tumor here, or it could be something else."
- The Finding: The AI often used "hedge words" (like maybe, possibly, likely) much more frequently for certain groups. This makes the diagnosis sound less certain for them, even if the medical facts were the same.
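Hedge-word frequency is straightforward to measure once you pick a word list. The list and reports below are illustrative (the paper's actual lexicon isn't shown here), but the idea is just a per-report rate you can average per group:

```python
import re

# Illustrative hedge lexicon -- an assumption, not the paper's list.
HEDGES = {"maybe", "possibly", "likely", "might", "could", "appears"}

def hedge_rate(report):
    """Fraction of a report's words that come from the hedge list."""
    words = re.findall(r"[a-z']+", report.lower())
    return sum(w in HEDGES for w in words) / max(len(words), 1)

report_a = "There is definitely a tumor here."
report_b = "There might be a tumor here, or it could possibly be something else."
# A consistently higher rate for one group's reports signals
# less-confident language for that group, facts being equal.
```

Averaging `hedge_rate` over each group's reports and comparing the means gives a single number for this "confidence" gap.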
3. The Big Takeaway
The researchers tested this on five different "Manager" AI brains (like LLaMA, Qwen, and Gemini). They found that:
- Fairness isn't just about the final score. You can have a high overall accuracy, but still be deeply unfair to specific groups because of how the AI got there.
- The "Middle" matters. The bias often happens in the middle of the process (choosing tools or changing paths), not just at the very end.
- One size does not fit all. Different AI "Managers" had different types of bias. Some were bad at choosing tools; others were bad at writing the final report.
Why This Matters
If we only look at the final grade (the diagnosis), we might think the AI is "good enough." But in medicine, how you get the answer is just as important as the answer itself. If an AI is less confident or takes a shortcut for elderly women, it could lead to missed diagnoses or delayed treatment.
DUCX is a new toolkit that helps developers "look under the hood" of these AI teams. It ensures that the AI doesn't just give the right answer, but that it treats every patient with the same level of care, attention, and thoroughness on its way to finding that answer.