3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis

Imagine you are a detective trying to solve a complex crime, but instead of a single photograph, you are handed a massive 3D block of data that captures every layer of a city at once. Your job is to find a specific suspect (a tumor), measure their height, and figure out if they are dangerous, all without getting lost in the sheer volume of data.

This is exactly the challenge doctors face when reading 3D CT scans. They have to look at hundreds of thin "slices" of a patient's body, one by one, to build a complete picture. It's exhausting, time-consuming, and easy to miss a tiny clue.

Enter 3DMedAgent, a new AI system designed to be the ultimate detective's assistant. Here is how it works, explained simply:

The Problem: The "Flat" vs. "Deep" Mismatch

Most current AI models are like 2D photographers. They are amazing at looking at a single flat photo (like an X-ray) and answering questions about it. But when you give them a 3D CT scan (a whole block of data), they get confused.

Some older AI systems try to squish the whole 3D block into a tiny summary, like trying to describe a whole movie by looking at one blurry frame. They miss the details.
Other systems try to learn everything from scratch, but that takes millions of expensive, labeled 3D examples, and not enough of them exist.

The Solution: The "Smart Detective" (3DMedAgent)

Instead of forcing the AI to "see" in 3D all at once, 3DMedAgent acts like a smart project manager who hires a team of specialized tools to do the heavy lifting. It uses a standard 2D AI (the "brain") and gives it a toolkit for handling the 3D data.

Here is the detective's workflow, broken down into three steps:

1. The "Map Check" (Organ-Aware Memory)

Before looking for the criminal, the detective needs a map.

What it does: The agent first uses a tool to quickly identify all the major organs (liver, lungs, kidneys) and their general sizes.
The Analogy: Imagine walking into a house and immediately noting, "The kitchen is on the left, the bedroom is upstairs, and the living room is big." You don't need to look inside every drawer yet; you just need a mental map of where everything is. This gives the AI a "big picture" context.

2. The "Searchlight" (Coarse-to-Fine Targeting)

Now, the doctor asks, "Is there a tumor in the liver?"

What it does: Instead of scanning every single slice of the liver (which could be hundreds of images), the agent uses a "searchlight" tool. It sweeps the whole liver quickly to find the most suspicious areas, then narrows its focus to just a few specific slices where the tumor is likely hiding.
The Analogy: Instead of reading every page of a 500-page book to find a typo, you use the "Find" function to jump straight to the pages where the word appears. The agent ignores the boring parts and zooms in on the interesting spots.

3. The "Deep Dive" Loop (Think-with-1-Slice)

Sometimes, the clues are tricky. The agent might see something suspicious but isn't 100% sure.

What it does: This is the "Think" loop. The agent picks one single slice at a time, zooms in, looks at it closely, and asks itself, "Does this look like a tumor? Does it match the size I expect?" It writes down its findings in a shared notebook (Memory). If it's still unsure, it picks another slice, checks again, and updates the notebook.
The Analogy: Imagine a detective looking at a fingerprint. If it's smudged, they don't guess; they pull out a magnifying glass, look at one tiny ridge, write it down, then look at the next ridge. They build the answer piece by piece, keeping a running log of all the evidence they've gathered.

Why This is a Game-Changer

No Re-training Needed: The "brain" of the agent is a standard 2D AI that already knows how to talk and reason. We didn't have to teach it how to see in 3D from scratch. We just gave it the right tools and a good workflow.
The Shared Notebook: The most important part is the Memory. As the agent checks different slices, it doesn't forget what it saw earlier. It aggregates all the small clues (e.g., "The liver is big," "There's a dark spot here," "The spot is 2cm wide") into a structured report. This allows it to make complex medical decisions based on evidence, not just a lucky guess.
Better than Other AI Models: The paper tested this on over 40 different medical tasks (like measuring organ size, counting tumors, or diagnosing diseases). 3DMedAgent outperformed the other AI models it was compared against, including those specifically designed for 3D. It was especially good at the hard stuff, like figuring out if a tumor is dangerous, because it actually checked the evidence rather than guessing.

The Bottom Line

3DMedAgent is like giving a brilliant 2D detective a 3D microscope, a searchlight, and a notebook. It doesn't try to be a 3D superhero; instead, it breaks the massive, scary 3D problem into small, manageable 2D steps, gathers the evidence carefully, and writes a reliable report.

This means doctors might soon have an AI assistant that can read through a patient's entire CT scan, find the trouble spots, measure them, and explain why it thinks something is wrong, all while reducing the doctor's workload and the risk of human error.

1. Problem Statement

3D medical imaging, particularly Computed Tomography (CT), requires a continuum of analysis ranging from low-level perception (e.g., organ segmentation, measurement) to high-level clinical understanding (e.g., tumor staging, diagnosis). Current approaches face two primary limitations:

Isolated Task Modeling: Existing methods often treat tasks (segmentation, VQA, report generation) in isolation, preventing the systematic accumulation of perceptual evidence needed for complex downstream reasoning.
Limitations of MLLMs: While Multimodal Large Language Models (MLLMs) excel at integrating text and 2D images, they struggle with 3D volumetric data. Feeding a 3D volume directly into a 2D MLLM as a sequence of slices discards spatial context, while adapting MLLMs with 3D encoders often leads to tokenization that blurs fine-grained anatomy and encourages "shortcut" pattern matching rather than genuine 3D understanding. Furthermore, 3D-specific models are often brittle because of data scarcity and domain shift.

The core challenge is enabling general-purpose 3D clinical assistants that can perform systematic, multi-step reasoning without requiring task-specific 3D fine-tuning of the underlying MLLM.

2. Methodology: 3DMedAgent

The authors propose 3DMedAgent, a unified agent framework that empowers existing 2D MLLMs to perform general 3D CT analysis. Instead of training a new 3D model, 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent. It operates on a query-adaptive evidence-seeking loop, progressively decomposing complex 3D tasks into tractable sub-tasks.

The framework consists of three core stages, all supported by a long-term structured memory that aggregates intermediate tool outputs:

A. Organ-Aware Memory Initialization (OAMI)

Goal: Provide the agent with a global overview of the 3D volume.
Process: The agent uses a pre-trained segmentation model (VISTA3D) to identify major organs. It computes compact statistics for each organ (size, mean Hounsfield Unit (HU) value, and Z-axis range).
Output: A structured memory entry ($M_0$) containing organ-level priors. This allows the agent to handle basic measurements and establish a baseline for reasoning without injecting noisy lesion data initially.
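The organ statistics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the segmentation model has already produced an integer label map, and the function name `organ_statistics`, the label ids, the voxel spacing, and the toy volume are all hypothetical.

```python
import numpy as np

def organ_statistics(ct_hu, label_map, organ_names, spacing_mm):
    """Summarize each labeled organ as volume, mean HU, and Z-axis slice range.

    ct_hu:       3D array of Hounsfield Units, indexed (z, y, x)
    label_map:   3D integer array of the same shape; 0 is background
    organ_names: dict mapping label id -> organ name (hypothetical ids)
    spacing_mm:  voxel spacing (z, y, x) in millimetres
    """
    voxel_volume_ml = float(np.prod(spacing_mm)) / 1000.0  # mm^3 -> mL
    memory = {}
    for label_id, name in organ_names.items():
        mask = label_map == label_id
        if not mask.any():
            continue  # organ not present in this scan
        # Indices of axial slices that contain at least one organ voxel
        z_indices = np.where(mask.any(axis=(1, 2)))[0]
        memory[name] = {
            "volume_ml": round(float(mask.sum()) * voxel_volume_ml, 2),
            "mean_hu": round(float(ct_hu[mask].mean()), 1),
            "z_range": (int(z_indices[0]), int(z_indices[-1])),
        }
    return memory

# Toy volume: air everywhere, with a small block labeled as "liver"
ct = np.full((8, 16, 16), -1000.0)
labels = np.zeros((8, 16, 16), dtype=int)
labels[2:5, 4:8, 4:8] = 1
ct[labels == 1] = 60.0

m0 = organ_statistics(ct, labels, {1: "liver", 2: "spleen"},
                      spacing_mm=(5.0, 2.0, 2.0))
print(m0)
```

Keeping only a handful of numbers per organ is what makes the entry cheap to pass to a language model as text, while the Z-axis range tells later stages which slices are worth opening.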

B. Coarse-to-Fine Lesion Targeting (CFLT)

Goal: Narrow the search space from the entire volume to specific Regions of Interest (ROIs) and informative slices.
Process:
1. Alignment: Uses a pre-trained 3D vision encoder (CT-CLIP) to align the 3D volume with the text query, generating a dense 3D similarity heatmap.
2. Filtering: The agent filters the heatmap based on the organ priors from OAMI to exclude irrelevant anatomical regions.
3. Scoring: It calculates a lesion-targeting score by aggregating patch-level responses and organ overlap ratios.
Output: A shortlist of high-confidence candidate ROIs and specific 2D slices, stored in memory ($M_\ell$) as potential evidence.
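The filtering and scoring steps can be illustrated with a small sketch. The summary does not give the exact scoring formula, so the score used here (mean in-organ response multiplied by the fraction of strong responses that fall inside the organ) is an assumption chosen only to show the idea. The heatmap is taken as given rather than computed with CT-CLIP, and `shortlist_slices`, `threshold`, and `top_k` are hypothetical names.

```python
import numpy as np

def shortlist_slices(heatmap, organ_mask, top_k=3, threshold=0.5):
    """Rank axial slices by an illustrative lesion-targeting score.

    heatmap:    3D float array (z, y, x) of query-to-volume similarity
    organ_mask: 3D bool array of the same shape for the queried organ
    """
    scored = []
    for z in range(heatmap.shape[0]):
        hot = heatmap[z] >= threshold        # strongly responding pixels
        n_hot = int(hot.sum())
        if n_hot == 0:
            continue
        inside = hot & organ_mask[z]         # responses that fall in the organ
        if not inside.any():
            continue                         # filtered out by the organ prior
        response = float(heatmap[z][inside].mean())
        overlap = float(inside.sum()) / n_hot
        scored.append((z, response * overlap))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy data: one hot spot outside the organ, one straddling its edge, one inside
heatmap = np.zeros((6, 8, 8))
liver = np.zeros((6, 8, 8), dtype=bool)
liver[1:5, 2:6, 2:6] = True
heatmap[0, 0:2, 0:2] = 0.95   # strong response, but outside the liver
heatmap[2, 1:3, 1:3] = 0.7    # only partly inside the liver
heatmap[3, 3:5, 3:5] = 0.9    # fully inside the liver

candidates = shortlist_slices(heatmap, liver)
print(candidates)
```

In this toy case the strongest raw response (slice 0) is discarded because it lies outside the queried organ, which is the point of combining the heatmap with the organ priors.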

C. Think-with-1-Slice Loop (T1S-Loop)

Goal: Perform fine-grained visual inspection to verify ambiguities and refine the answer.
Process: If the initial evidence is insufficient, the agent enters an iterative loop:
1. Reasoning: The MLLM analyzes current memory and generates a rationale, current answer, and list of assumptions.
2. Action Selection: A router determines if new visual evidence is needed. If yes, it selects a specific slice or ROI and applies a visual tool (e.g., mask overlay, crop-and-zoom).
3. Update: The agent processes the new slice, updates the structured memory with the new evidence, and refines the answer.
Termination: The loop stops when the agent is confident (no new evidence needed) or a maximum iteration limit is reached.
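The control flow of the loop can be sketched as follows. This is a schematic under stated assumptions: the MLLM reasoning step, the router, and the visual tool are replaced by plain Python callables (`reason`, `route`, `inspect`) that are hypothetical stand-ins, and the memory is simplified to a list of text entries.

```python
from typing import Callable, List, Optional, Tuple

def think_with_one_slice(
    memory: List[str],
    candidates: List[int],
    reason: Callable[[List[str]], str],
    route: Callable[[List[str], List[int]], Optional[int]],
    inspect: Callable[[int], str],
    max_iters: int = 5,
) -> Tuple[str, int]:
    """Inspect one slice per iteration until the router asks for nothing more."""
    remaining = list(candidates)
    answer = reason(memory)                  # initial answer from existing memory
    iterations = 0
    while iterations < max_iters:
        choice = route(memory, remaining)    # None means "no new evidence needed"
        if choice is None:
            break
        remaining.remove(choice)
        memory.append(inspect(choice))       # write the new finding to memory
        answer = reason(memory)              # refine the answer
        iterations += 1
    return answer, iterations

# Hypothetical stand-ins for the MLLM, the router, and the visual tool
findings = {
    12: "slice 12: no focal abnormality",
    14: "slice 14: hypodense lesion, about 2 cm",
    15: "slice 15: no focal abnormality",
}

def reason(memory):
    return "lesion present" if any("lesion" in m for m in memory) else "uncertain"

def route(memory, remaining):
    if reason(memory) != "uncertain" or not remaining:
        return None
    return remaining[0]

memory = ["liver: slices 10-20"]
answer, steps = think_with_one_slice(memory, [12, 14, 15], reason, route, findings.get)
print(answer, steps, memory)
```

The loop stops after two inspections here because the second slice settles the question; slice 15 is never opened, which mirrors the idea of acquiring only as much visual evidence as the query needs.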

3. Key Contributions

Unified Agent Framework: The authors present 3DMedAgent as the first framework to enable 2D MLLMs to perform general 3D CT analysis, from perception to understanding, without 3D-specific fine-tuning. It bridges the gap between 2D MLLM capabilities and 3D data requirements.
Evidence-Centric Long-Term Memory: Introduces a mechanism to distill heterogeneous tool outputs (segmentation masks, heatmaps, slice images) into compact, structured textual evidence. This supports query-conditioned cue acquisition and multi-step reasoning.
DeepChestVQA Benchmark: The authors introduce a new benchmark specifically for 3D thoracic imaging, covering 17 capability dimensions (Recognition, Visual Reasoning, Medical Reasoning) across 1,020 VQA pairs. This addresses the lack of comprehensive chest CT evaluation in existing literature.
Scalable Paradigm: Demonstrates a path toward general-purpose clinical assistants by shifting from training specialized 3D models to building agents that actively acquire and validate evidence.
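To make the evidence-centric memory contribution more concrete, here is one way such a store might look: each tool output is reduced to a short text entry tagged with its source, and the agent can pull only the entries relevant to the current query. The class names, source tags, and example values are illustrative assumptions, not the paper's data structures or results.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class EvidenceEntry:
    source: str  # which tool produced the evidence (hypothetical tags)
    text: str    # compact textual summary of the tool output

@dataclass
class EvidenceMemory:
    entries: List[EvidenceEntry] = field(default_factory=list)

    def add(self, source: str, text: str) -> None:
        self.entries.append(EvidenceEntry(source, text))

    def render(self, sources: Optional[Set[str]] = None) -> str:
        """Return entries as prompt-ready text, optionally filtered by source."""
        selected = [e for e in self.entries
                    if sources is None or e.source in sources]
        return "\n".join(f"[{e.source}] {e.text}" for e in selected)

# Made-up values, for illustration only
mem = EvidenceMemory()
mem.add("segmentation", "liver: volume 1480 mL, mean HU 58, slices 40-92")
mem.add("heatmap", "top candidate slices for 'liver tumor': 71, 70, 73")
mem.add("slice", "slice 71: hypodense region, about 2 cm")

print(mem.render())
print(mem.render(sources={"slice"}))
```

Storing text rather than masks or images keeps the memory small enough to fit in a prompt, at the cost of discarding detail that the agent must re-acquire through a tool call if it turns out to matter.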

4. Experimental Results

The authors evaluated 3DMedAgent across 40+ 3D medical tasks on two benchmarks: DeepTumorVQA (abdominal) and DeepChestVQA (thoracic).

Performance: 3DMedAgent consistently outperformed general MLLMs (GPT-5, Qwen3-VL), medical-specific MLLMs (MedGemma, HuatuoGPT), and 3D-specialized models (RadFM, M3D).
Accuracy Gains: It achieved an average 20% accuracy gain over baselines. Notably, on challenging medical reasoning tasks, improvements exceeded 27%.
Generalization: The agent showed robust cross-dataset and cross-organ generalization, performing well on both abdominal and thoracic data without retraining, whereas fine-tuned 3D models often suffered from performance drops on unseen data sources.
Ablation Studies: Removing any of the three core components (OAMI, CFLT, T1S-Loop) resulted in significant performance degradation, confirming the necessity of the full pipeline.
Expert Alignment: The slice selection in the CFLT module showed high agreement with radiologists (approaching inter-radiologist agreement), validating the agent's ability to identify clinically relevant views.

5. Significance and Impact

Clinical Utility: 3DMedAgent offers a scalable solution to reduce the radiologist's burden of exhaustive slice-by-slice review by automating evidence gathering and structured reasoning.
Methodological Shift: It challenges the prevailing trend of training massive 3D-specific models, proposing instead a tool-augmented agent approach that leverages existing 2D MLLMs and specialized 3D tools. This is more data-efficient and adaptable to domain shifts.
Interpretability: By maintaining a structured memory of evidence (e.g., "Liver volume is X, HU is Y"), the system provides transparent, evidence-based reasoning rather than "black box" predictions.
Future Direction: The work highlights the potential for adaptive learning and richer tool suites, paving the way for reliable, general-purpose 3D medical decision support systems.

Code and Data Availability: The paper notes that code and data are available at https://github.com/jinlab-imvr/3DMedAgent.