Gaze2Report: Radiology Report Generation via Visual-Gaze Prompt Tuning of LLMs

Gaze2Report is a framework for radiology report generation that uses a scanpath prediction module and a Graph Neural Network to build joint visual-gaze tokens for fine-tuning Large Language Models. This incorporates physician-informed visual attention while allowing inference without any actual eye-gaze data.

Aishik Konwer, Moinak Bhattacharya, Prateek Prasanna

Published 2026-04-13

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice; do not make health decisions based on this content.

Imagine you are a radiologist looking at an X-ray of a patient's chest. You don't just stare blankly at the whole picture; your eyes dance around. You zoom in on a shadow in the left lung, then glance at the heart border, then check the ribs. Your eye movements (gaze) tell a story about what you think is important.

For a long time, computers trying to write medical reports from X-rays have been like students forced to write an essay about a painting with no guidance from the teacher. They can see the image, but they don't know where the doctor is looking or what caught their attention. They guess at what's important, often missing the subtle details a human expert would spot.

This paper introduces Gaze2Report, a new AI system that tries to fix this by teaching the computer to "see" the X-ray the way a doctor does.

Here is how it works, broken down with some everyday analogies:

1. The Problem: The "Blind" AI

Most AI report generators are like a tourist taking a photo of a city and trying to write a travel guide based only on the photo. They might describe the sky and the buildings, but they miss the specific alleyway where the crime happened or the hidden shop the locals love.

  • In the paper's terms: These models look at the whole image but lack "medical priors" (the doctor's prior knowledge and focus). They often write reports that sound grammatically correct but miss the critical medical facts.

2. The Solution: The "Eye-Tracking" Tutor

The researchers realized that if they could show the AI where a real doctor looked, the AI would write a much better report.

  • The Analogy: Imagine a student learning to paint. Instead of just showing them a picture of a landscape, a master painter stands next to them and points exactly where to look: "Look at the light hitting this leaf," or "Notice the crack in this rock." The student learns faster and paints a better picture.
  • In the paper: They used data from eye-trackers (cameras that record where a doctor's eyes go) to teach the AI which parts of the X-ray are the "stars" of the show.

3. The Magic Ingredient: The "Social Network" for Image Parts (GNN)

The AI doesn't just look at the "hot spots" where the doctor looked; it understands how those spots connect.

  • The Analogy: Think of the X-ray as a city. The AI uses a Graph Neural Network (GNN) like a social network map. It doesn't just look at "Park A" and "Park B" separately. It understands that "Park A" is connected to "Park B" by a river, and if there's a problem in the river, it affects both parks.
  • In the paper: The AI connects the visual parts of the X-ray with the doctor's gaze points, creating a "joint visual-gaze token." It learns that a shadow near the heart (visual) + the doctor staring at it for 2 seconds (gaze) = "This is likely a serious heart issue."
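The paper's exact GNN architecture isn't described here, but the core idea of "connecting the spots" can be sketched as one round of message passing over a graph of image regions. Everything below is illustrative (the function name, the simple concatenate-then-average rule, and the toy features are my own assumptions, not the paper's design):

```python
import numpy as np

def joint_visual_gaze_tokens(patch_feats, gaze_feats, adj):
    """Toy one-step mean-aggregation GNN (illustrative, not the paper's model).

    patch_feats: (N, D) visual features, one row per image region
    gaze_feats:  (N, G) gaze features per region (e.g. dwell time in seconds)
    adj:         (N, N) 0/1 adjacency between regions
    Returns (N, D+G) joint tokens where each region mixes in its neighbors.
    """
    x = np.concatenate([patch_feats, gaze_feats], axis=1)  # fuse vision + gaze
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)       # avoid divide-by-zero
    neighbor_mean = adj @ x / deg                          # average over neighbors
    return 0.5 * (x + neighbor_mean)                       # blend self + neighborhood

# Toy example: 3 regions connected in a line, 0 - 1 - 2
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gaze = np.array([[2.0], [0.5], [0.0]])  # doctor dwelled longest on region 0
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tokens = joint_visual_gaze_tokens(patches, gaze, adj)
print(tokens.shape)  # (3, 3)
```

The point of the toy: after one round of aggregation, each region's token carries both what it looks like and how much attention it (and its neighbors) received, which is the "shadow near the heart + 2 seconds of staring" combination described above.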

4. The Big Challenge: The "Ghost" Gaze

Here is the tricky part: In a real hospital, when the AI generates a report for a new patient, nobody is wearing an eye-tracking camera. The AI needs to know where to look, but it can't see the doctor's eyes in real-time.

  • The Analogy: Imagine you are a detective trying to solve a crime. You usually have a witness (the doctor) pointing at the evidence. But what if the witness isn't there? You need a system that can predict where the witness would have looked based on the clues.
  • In the paper: They added a Scanpath Prediction Module. This is like a "crystal ball" inside the AI. It predicts where a doctor would look at the X-ray, even without the actual eye-tracking data. This allows the system to work in the real world without expensive cameras.
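The paper's scanpath module is presumably a learned neural predictor; as a much simpler stand-in that captures the idea of "predicting where a doctor would look," here is a first-order Markov sketch. Fit transition probabilities between regions from recorded scanpaths, then sample a synthetic scanpath at inference when no eye-tracker is present (all names and the Markov simplification are my own assumptions):

```python
import numpy as np

def fit_transition_matrix(scanpaths, n_regions):
    """Estimate P(next region | current region) from recorded gaze scanpaths."""
    counts = np.ones((n_regions, n_regions))  # Laplace smoothing
    for path in scanpaths:
        for a, b in zip(path, path[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict_scanpath(trans, start, length, rng):
    """Sample a synthetic scanpath; stands in for real gaze at inference time."""
    path = [start]
    for _ in range(length - 1):
        path.append(int(rng.choice(len(trans), p=trans[path[-1]])))
    return path

# Toy training data: doctors tend to scan region 0 -> 1 -> 2
recorded = [[0, 1, 2, 1], [0, 1, 2, 2], [0, 1, 1, 2]]
trans = fit_transition_matrix(recorded, n_regions=3)
rng = np.random.default_rng(0)
synthetic = predict_scanpath(trans, start=0, length=4, rng=rng)
print(len(synthetic))  # 4
```

The synthetic scanpath plays the role of the missing "witness": downstream, it would be fed into the same joint visual-gaze pipeline in place of real eye-tracking data.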

5. The Result: A Smarter, More Human Report

The team tested this new system (Gaze2Report) against other top AI models.

  • The Outcome: The new AI wrote reports that were not only more fluent but, more importantly, more medically accurate.
  • Real-world example: In the paper's examples, the old AI might say "fluid congestion" (a vague term), while Gaze2Report said "mild signs of pulmonary edema" (a precise medical term). It found details like "small bilateral pleural effusions" that the others missed.

Summary

Gaze2Report is like giving an AI a pair of "doctor's eyes."

  1. It learns from real doctors' eye movements to know what matters.
  2. It uses a smart network (GNN) to connect those important spots together.
  3. It has a built-in "prediction engine" to guess where to look when the real doctor isn't watching.

The result is an AI assistant that doesn't just describe the picture; it understands the story the picture is telling, just like a human expert would.
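Finally, the title's "prompt tuning" step can be sketched too. One common way to condition a frozen LLM (and a plausible reading of the paper, though the details here are my assumption, not its stated method) is to project the joint visual-gaze tokens into the LLM's embedding space and prepend them as a soft prompt:

```python
import numpy as np

def prompt_tuned_input(joint_tokens, text_embeds, proj):
    """Prepend projected visual-gaze tokens as a soft prompt (illustrative).

    joint_tokens: (T, F) visual-gaze tokens, e.g. from the GNN
    text_embeds:  (L, E) embeddings of the textual instruction/report prefix
    proj:         (F, E) learned projection into the LLM embedding space
    In prompt tuning, only `proj` (and the upstream modules) would be trained;
    the LLM's own weights stay frozen.
    """
    soft_prompt = joint_tokens @ proj             # map tokens into LLM space
    return np.vstack([soft_prompt, text_embeds])  # prompt precedes the text

rng = np.random.default_rng(1)
joint = rng.normal(size=(5, 8))    # 5 visual-gaze tokens
text = rng.normal(size=(12, 16))   # 12 text-token embeddings
proj = rng.normal(size=(8, 16))
seq = prompt_tuned_input(joint, text, proj)
print(seq.shape)  # (17, 16)
```

The combined sequence is what the LLM would decode the report from: the first rows tell it where the (real or predicted) doctor looked, and the rest is ordinary text input.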
