Gaze2Report: Radiology Report Generation via Visual-Gaze Prompt Tuning of LLMs

Gaze2Report is a framework for radiology report generation that uses a scanpath prediction module and a Graph Neural Network to build joint visual-gaze tokens for fine-tuning Large Language Models. This incorporates physician-informed visual attention while allowing inference without any actual eye-gaze data.

Aishik Konwer, Moinak Bhattacharya, Prateek Prasanna

Published 2026-04-13

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice; do not make health decisions based on this content.

Imagine you are a radiologist looking at an X-ray of a patient's chest. You don't just stare blankly at the whole picture; your eyes dance around. You zoom in on a shadow in the left lung, then glance at the heart border, then check the ribs. Your eye movements (gaze) tell a story about what you think is important.

For a long time, computers trying to write medical reports from X-rays have been like students forced to write an essay about a painting with no guidance from the teacher. They can see the image, but they don't know where the doctor is looking or what caught their attention. They guess at what's important, often missing the subtle details a human expert would spot.

This paper introduces Gaze2Report, a new AI system that tries to fix this by teaching the computer to "see" the X-ray the way a doctor does.

Here is how it works, broken down with some everyday analogies:

1. The Problem: The "Blind" AI

Most AI report generators are like a tourist taking a photo of a city and trying to write a travel guide based only on the photo. They might describe the sky and the buildings, but they miss the specific alleyway where the crime happened or the hidden shop the locals love.

  • In the paper's terms: These models look at the whole image but lack "medical priors" (the doctor's prior knowledge and focus). They often write reports that sound grammatically correct but miss the critical medical facts.

2. The Solution: The "Eye-Tracking" Tutor

The researchers realized that if they could show the AI where a real doctor looked, the AI would write a much better report.

  • The Analogy: Imagine a student learning to paint. Instead of just showing them a picture of a landscape, a master painter stands next to them and points exactly where to look: "Look at the light hitting this leaf," or "Notice the crack in this rock." The student learns faster and paints a better picture.
  • In the paper: They used data from eye-trackers (cameras that record where a doctor's eyes go) to teach the AI which parts of the X-ray are the "stars" of the show.

3. The Magic Ingredient: The "Social Network" for Image Parts (GNN)

The AI doesn't just look at the "hot spots" where the doctor looked; it understands how those spots connect.

  • The Analogy: Think of the X-ray as a city. The AI uses a Graph Neural Network (GNN) like a social network map. It doesn't just look at "Park A" and "Park B" separately. It understands that "Park A" is connected to "Park B" by a river, and if there's a problem in the river, it affects both parks.
  • In the paper: The AI connects the visual parts of the X-ray with the doctor's gaze points, creating a "joint visual-gaze token." It learns that a shadow near the heart (visual) + the doctor staring at it for 2 seconds (gaze) = "This is likely a serious heart issue."
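The paper's exact GNN architecture isn't described here, but the core idea of "connecting the spots" can be sketched as one round of message passing over a graph of image regions. Everything below is illustrative (the function name, the simple concatenate-then-average rule, and the toy features are my own assumptions, not the paper's design):

```python
import numpy as np

def joint_visual_gaze_tokens(patch_feats, gaze_feats, adj):
    """Toy one-step mean-aggregation GNN (illustrative, not the paper's model).

    patch_feats: (N, D) visual features, one row per image region
    gaze_feats:  (N, G) gaze features per region (e.g. dwell time in seconds)
    adj:         (N, N) 0/1 adjacency between regions
    Returns (N, D+G) joint tokens where each region mixes in its neighbors.
    """
    x = np.concatenate([patch_feats, gaze_feats], axis=1)  # fuse vision + gaze
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)       # avoid divide-by-zero
    neighbor_mean = adj @ x / deg                          # average over neighbors
    return 0.5 * (x + neighbor_mean)                       # blend self + neighborhood

# Toy example: 3 regions connected in a line, 0 - 1 - 2
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gaze = np.array([[2.0], [0.5], [0.0]])  # doctor dwelled longest on region 0
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tokens = joint_visual_gaze_tokens(patches, gaze, adj)
print(tokens.shape)  # (3, 3)
```

The point of the toy: after one round of aggregation, each region's token carries both what it looks like and how much attention it (and its neighbors) received, which is the "shadow near the heart + 2 seconds of staring" combination described above.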

4. The Big Challenge: The "Ghost" Gaze

Here is the tricky part: In a real hospital, when the AI generates a report for a new patient, nobody is wearing an eye-tracking camera. The AI needs to know where to look, but it can't see the doctor's eyes in real-time.

  • The Analogy: Imagine you are a detective trying to solve a crime. You usually have a witness (the doctor) pointing at the evidence. But what if the witness isn't there? You need a system that can predict where the witness would have looked based on the clues.
  • In the paper: They added a Scanpath Prediction Module. This is like a "crystal ball" inside the AI. It predicts where a doctor would look at the X-ray, even without the actual eye-tracking data. This allows the system to work in the real world without expensive cameras.
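The paper's scanpath module is presumably a learned neural predictor; as a much simpler stand-in that captures the idea of "predicting where a doctor would look," here is a first-order Markov sketch. Fit transition probabilities between regions from recorded scanpaths, then sample a synthetic scanpath at inference when no eye-tracker is present (all names and the Markov simplification are my own assumptions):

```python
import numpy as np

def fit_transition_matrix(scanpaths, n_regions):
    """Estimate P(next region | current region) from recorded gaze scanpaths."""
    counts = np.ones((n_regions, n_regions))  # Laplace smoothing
    for path in scanpaths:
        for a, b in zip(path, path[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict_scanpath(trans, start, length, rng):
    """Sample a synthetic scanpath; stands in for real gaze at inference time."""
    path = [start]
    for _ in range(length - 1):
        path.append(int(rng.choice(len(trans), p=trans[path[-1]])))
    return path

# Toy training data: doctors tend to scan region 0 -> 1 -> 2
recorded = [[0, 1, 2, 1], [0, 1, 2, 2], [0, 1, 1, 2]]
trans = fit_transition_matrix(recorded, n_regions=3)
rng = np.random.default_rng(0)
synthetic = predict_scanpath(trans, start=0, length=4, rng=rng)
print(len(synthetic))  # 4
```

The synthetic scanpath plays the role of the missing "witness": downstream, it would be fed into the same joint visual-gaze pipeline in place of real eye-tracking data.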

5. The Result: A Smarter, More Human Report

The team tested this new system (Gaze2Report) against other top AI models.

  • The Outcome: The new AI wrote reports that were not only more fluent but, more importantly, more medically accurate.
  • Real-world example: In the paper's examples, the old AI might say "fluid congestion" (a vague term), while Gaze2Report said "mild signs of pulmonary edema" (a precise medical term). It found details like "small bilateral pleural effusions" that the others missed.

Summary

Gaze2Report is like giving an AI a pair of "doctor's eyes."

  1. It learns from real doctors' eye movements to know what matters.
  2. It uses a smart network (GNN) to connect those important spots together.
  3. It has a built-in "prediction engine" to guess where to look when the real doctor isn't watching.

The result is an AI assistant that doesn't just describe the picture; it understands the story the picture is telling, just like a human expert would.
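Finally, the title's "prompt tuning" step can be sketched too. One common way to condition a frozen LLM (and a plausible reading of the paper, though the details here are my assumption, not its stated method) is to project the joint visual-gaze tokens into the LLM's embedding space and prepend them as a soft prompt:

```python
import numpy as np

def prompt_tuned_input(joint_tokens, text_embeds, proj):
    """Prepend projected visual-gaze tokens as a soft prompt (illustrative).

    joint_tokens: (T, F) visual-gaze tokens, e.g. from the GNN
    text_embeds:  (L, E) embeddings of the textual instruction/report prefix
    proj:         (F, E) learned projection into the LLM embedding space
    In prompt tuning, only `proj` (and the upstream modules) would be trained;
    the LLM's own weights stay frozen.
    """
    soft_prompt = joint_tokens @ proj             # map tokens into LLM space
    return np.vstack([soft_prompt, text_embeds])  # prompt precedes the text

rng = np.random.default_rng(1)
joint = rng.normal(size=(5, 8))    # 5 visual-gaze tokens
text = rng.normal(size=(12, 16))   # 12 text-token embeddings
proj = rng.normal(size=(8, 16))
seq = prompt_tuned_input(joint, text, proj)
print(seq.shape)  # (17, 16)
```

The combined sequence is what the LLM would decode the report from: the first rows tell it where the (real or predicted) doctor looked, and the rest is ordinary text input.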
