AnatomiX: An Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

AnatomiX is a two-stage multimodal large language model that improves chest X-ray interpretation by making explicit anatomical structure identification the first stage of every downstream task, achieving over 25% gains in anatomical reasoning and grounding compared to existing approaches.

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

Published 2026-03-16
📖 4 min read · ☕ Coffee break read

Imagine you are trying to teach a very smart robot assistant how to read a chest X-ray. You want this robot to not only tell you what's wrong (like "there's pneumonia") but also to point exactly where it is on the picture and understand the difference between the left lung and the right lung.

The problem is, most current AI models are like students who memorized the answers but didn't understand the map. If you show them a normal X-ray, they get the diagnosis right. But if you flip the image upside down or swap the left and right sides, they get confused. They might say, "Oh, the pneumonia is on the left," when it's actually on the right, simply because they are guessing based on patterns rather than truly "seeing" the anatomy.

This paper introduces AnatomiX, a new AI model designed to fix this by acting more like a real radiologist.

Here is how AnatomiX works, broken down into simple analogies:

1. The Old Way: The "Guessing Game"

Most AI models look at an X-ray and try to guess the answer in one giant leap. They see a dark spot and say, "That's pneumonia." But they don't really know which lung that spot is in. It's like trying to find a specific house in a city by just looking at a blurry photo of the whole neighborhood. If the photo is flipped, you might point to the wrong house.

2. The AnatomiX Way: The "Two-Step Detective"

AnatomiX changes the game by breaking the job into two distinct steps, just like a human doctor does.

Step 1: The "Map Maker" (Anatomy Perception Module)
Before trying to diagnose anything, AnatomiX has a special internal tool called the Anatomy Perception Module (APM). Think of this as a GPS system that scans the X-ray first.

  • It doesn't just look at the whole picture; it specifically hunts for 36 different body parts (like the heart, the left lung, the right lung, the collarbones, etc.).
  • It draws invisible boxes around them and says, "Okay, I found the Left Lung here, and the Right Lung is over there."
  • It creates a detailed "map" of the body parts before it even tries to answer a question (a rough code sketch of this step follows below).
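
Here is a minimal sketch of that first step in Python. The paper's actual APM is a trained detector; the stub below and the structure names are illustrative stand-ins (a small subset of the 36), not its real interface or labels.

```python
# Minimal sketch of the "Map Maker" idea: detect anatomical structures
# and return labeled boxes. A learned detector would run here; the
# values below are made up for illustration.
from dataclasses import dataclass

@dataclass
class AnatomyBox:
    name: str      # e.g. "left lung"
    box: tuple     # (x1, y1, x2, y2) in pixel coordinates
    score: float   # detector confidence

def perceive_anatomy(image) -> list[AnatomyBox]:
    """Stand-in for the Anatomy Perception Module."""
    return [
        AnatomyBox("right lung", (40, 60, 220, 400), 0.97),
        AnatomyBox("left lung",  (260, 60, 440, 400), 0.96),
        AnatomyBox("heart",      (180, 220, 330, 420), 0.93),
    ]

# The "map" that the next stage reads from.
anatomy_map = {a.name: a.box for a in perceive_anatomy(image=None)}
print(anatomy_map["left lung"])
```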

Step 2: The "Doctor" (The Large Language Model)
Once the "Map Maker" has identified and labeled all the body parts, it hands this organized information to the "Doctor" (the main AI brain).

  • The Doctor now doesn't have to guess where things are. It looks at the map and says, "Ah, I see the user asked about the Left Lung. The map tells me the Left Lung is here. Let me check what's happening in that specific box."
  • Because the Doctor has a clear map, it is far harder to fool with a flipped image: it reads "Left" off the anatomy itself rather than off pixel positions, regardless of how the picture is oriented (see the sketch after this list).
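
This summary doesn't pin down exactly how the map reaches the Doctor, so the hand-off below is a hedged sketch: one plausible way to serialize the detected regions into a prompt, with illustrative box values. The real interface between the APM and the language model may differ.

```python
# Sketch of the hand-off from "Map Maker" to "Doctor": the anatomy map
# is serialized into the prompt so the language model reasons over
# named regions instead of guessing locations. The format here is an
# assumption, not the paper's exact interface.
anatomy_map = {  # output of the perception step (illustrative values)
    "right lung": (40, 60, 220, 400),
    "left lung":  (260, 60, 440, 400),
    "heart":      (180, 220, 330, 420),
}

def build_grounded_prompt(question: str, anatomy_map: dict) -> str:
    regions = "\n".join(f"- {name}: box={box}" for name, box in anatomy_map.items())
    return (
        "You are reading a chest X-ray.\n"
        "Detected anatomical regions (pixel boxes, x1/y1/x2/y2):\n"
        f"{regions}\n\n"
        f"Question: {question}\n"
        "Answer using only the named regions above."
    )

print(build_grounded_prompt("Is there any opacity in the left lung?", anatomy_map))
```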

3. The "Flashcard" Trick (Contrastive Retrieval)

One of the coolest parts of AnatomiX is how it learns what to say about each body part.

  • Imagine the AI has a massive library of flashcards. Each card has a picture of a specific body part (like the "Right Lower Lung") on one side and a medical description on the other (like "shows signs of fluid").
  • When AnatomiX sees a new X-ray, it finds the "Right Lower Lung" on its map, grabs the matching flashcard from its library, and uses that description to help the Doctor write the report. This helps the AI attach the correct medical terms to the correct body part (a rough sketch of the retrieval follows below).
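
In code, this retrieval boils down to a nearest-neighbor search in a shared embedding space. The sketch below assumes contrastively trained encoders, where a region crop and its matching description land close together; embed_image, embed_text, and the library entries are hypothetical stand-ins for the paper's learned components.

```python
# Sketch of the "flashcard" retrieval: embed the cropped region and
# pick the closest description by cosine similarity. The random
# encoders below are stubs; real ones would be trained contrastively.
import numpy as np

rng = np.random.default_rng(0)
def embed_image(region_crop): return rng.normal(size=128)  # stub encoder
def embed_text(description):  return rng.normal(size=128)  # stub encoder

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The "flashcards": a description paired with its text embedding.
library = [
    ("right lower lung: shows signs of fluid", embed_text("fluid")),
    ("right lower lung: clear, no effusion",   embed_text("clear")),
]

def retrieve(region_crop):
    """Return the flashcard whose embedding best matches the crop."""
    q = embed_image(region_crop)
    return max(library, key=lambda card: cosine(q, card[1]))

description, _ = retrieve(region_crop=None)  # crop would come from the map
print(description)  # handed to the "Doctor" when writing the report
```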

Why Does This Matter?

The researchers tested this new model against the best existing AI models. Here is what happened:

  • The "Flipped Image" Test: When they flipped the X-rays left-to-right, the old models failed miserably, mixing up left and right. AnatomiX, however, got it right almost every time because it actually understood the anatomy, not just the visual patterns.
  • The "Pointing" Test: When asked to draw a box around a specific disease, AnatomiX was 25% more accurate than the others.
  • The "Report" Test: It wrote better medical reports that were more accurate and easier for doctors to trust.

The Bottom Line

AnatomiX is like upgrading a robot from a parrot (which repeats what it hears) to a surgeon (which understands the body's structure). By forcing the AI to first identify where the body parts are before trying to diagnose them, it tackles one of the biggest problems in medical AI: spatial confusion.

This means that in the future, AI assistants won't just be "smart" at reading text; they will be smart at understanding the human body, making them much safer and more reliable tools for doctors.
