FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification

The paper introduces FixationFormer, a transformer-based architecture that directly integrates expert gaze trajectories as sequential tokens to achieve state-of-the-art chest X-ray classification by effectively modeling the temporal and spatial structure of diagnostic eye movements alongside image features.

Daniel Beckmann, Benjamin Risse

Published 2026-03-25

Imagine you are trying to teach a computer to read a chest X-ray the way a seasoned radiologist does.

The Problem: The "Blurry" Map vs. The "Live" Tour

Traditionally, computers look at X-rays using a specific type of brain (a convolutional neural network, or CNN) that is great at spotting patterns in pictures. To help them, researchers have tried to show them where a human expert looked.

Usually, they did this by turning the expert's eye movements into a static heatmap.

  • The Analogy: Imagine trying to explain a hiking trail to someone by showing them a single, blurry photo of the whole mountain with a red dot where you stopped. You've lost the story of the hike. You don't know if the expert looked at the top of the mountain first, then the bottom, or if they stared at a specific crack in the rock for ten seconds. The "heatmap" is just a blurry summary; it misses the timing and the order of the expert's thoughts.

The Solution: FixationFormer (The "Live Tour" Guide)

The authors of this paper, Daniel Beckmann and Benjamin Risse, realized that modern AI (specifically Transformers, the same tech behind chatbots) is actually perfect for understanding sequences.

They built a new system called FixationFormer. Instead of turning the eye movements into a blurry map, they treated the expert's gaze like a story or a playlist.

  • The Analogy: Instead of a static map, imagine the computer gets a live GPS tour from the expert.
    • "First, look here for 2 seconds."
    • "Then, jump to the left and stare at that spot for 3 seconds."
    • "Finally, zoom in on the bottom right."

The computer doesn't just see where the expert looked; it sees the sequence of their thinking process.

How It Works (The "Conversation")

The system has two main characters having a conversation:

  1. The Image: The X-ray itself, broken down into tiny puzzle pieces.
  2. The Gaze: The expert's eye movements, turned into a sequence of "tokens" (like words in a sentence).
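The paper itself does not include code here, but the idea of turning a fixation sequence into transformer tokens can be sketched in a few lines of NumPy. Everything below is illustrative: the `(x, y, duration)` triples, the embedding size, and the linear projection are assumptions for the sake of the analogy, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical fixation sequence: (x, y, duration_seconds), in viewing order.
# Coordinates are normalized to [0, 1]; the values are made up for illustration.
fixations = np.array([
    [0.30, 0.25, 2.0],   # "first, look here for 2 seconds"
    [0.15, 0.40, 3.0],   # "then jump left and stare for 3 seconds"
    [0.80, 0.85, 1.5],   # "finally, the bottom right"
])

rng = np.random.default_rng(0)
d_model = 8  # tiny embedding size, just for the sketch

# A linear projection turns each (x, y, duration) triple into a token vector,
# analogous to how word embeddings turn words into vectors.
W = rng.normal(size=(3, d_model))
tokens = fixations @ W               # shape: (num_fixations, d_model)

# Sinusoidal positional encodings preserve the *order* of the tour,
# which is exactly the information a static heatmap throws away.
positions = np.arange(len(fixations))[:, None]
freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
pos_enc = np.zeros_like(tokens)
pos_enc[:, 0::2] = np.sin(positions * freqs)
pos_enc[:, 1::2] = np.cos(positions * freqs)

gaze_tokens = tokens + pos_enc       # ready to enter a transformer
print(gaze_tokens.shape)             # (3, 8)
```

Because each token carries its place in the sequence, swapping two fixations produces a different input, whereas the same swap would leave a heatmap unchanged.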

They use a special "attention" mechanism (like a spotlight) to let these two characters talk to each other:

  • One-Way Chat (Cross-Attention): The X-ray asks the Gaze, "Hey, where should I focus my attention based on what the expert did?" The X-ray updates its understanding, but the Gaze stays the same.
  • Two-Way Chat (Two-Way Attention): They talk back and forth. The X-ray asks the Gaze for help, and the Gaze also asks the X-ray, "Wait, does this spot on the image make sense with where I'm looking?"
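The difference between the two chats can be made concrete with a minimal scaled dot-product attention sketch. This is an assumption-laden simplification: real cross-attention uses learned query/key/value projections, multiple heads, and layer normalization, all omitted here, and the tensor sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
image_patches = rng.normal(size=(16, d))  # 16 X-ray puzzle pieces
gaze_tokens = rng.normal(size=(3, d))     # 3 fixation tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Queries are updated by attending to context; context is untouched."""
    scores = queries @ context.T / np.sqrt(d)   # similarity of each query to each context token
    weights = softmax(scores, axis=-1)          # where each query "focuses"
    return queries + weights @ context          # residual update of the queries only

# One-way chat: the image listens to the gaze; the gaze tokens never change.
updated_patches = cross_attention(image_patches, gaze_tokens)

# Two-way chat: each side also listens to the other.
updated_gaze = cross_attention(gaze_tokens, image_patches)
print(updated_patches.shape, updated_gaze.shape)  # (16, 8) (3, 8)
```

In the one-way setup only `updated_patches` is computed, which matches the paper's finding that letting the image listen without the gaze "arguing back" worked best.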

The Results: Why It Matters

The team tested this on three different sets of chest X-rays. Here is what happened:

  1. It Works Better: In most cases, the computer that listened to the "live tour" (the sequence) got better at diagnosing diseases than the ones that just looked at the "blurry map" (heatmaps).
  2. It Helps Weaker Brains Most: When they used a "weaker" computer brain (one that hadn't been pre-trained on millions of medical images), the "live tour" helped it outperform the "blurry map" by a wide margin. It's like giving a novice hiker a detailed GPS guide: they can navigate far better than with a blurry photo of the trail.
  3. The "One-Way" Chat Won: Interestingly, the system worked best when the X-ray listened to the Gaze, but didn't try to argue back. Sometimes, just letting the expert's path guide the computer is enough; trying to make them "debate" each other actually confused the system a bit.

The Big Picture

This paper is a game-changer because it stops treating human eye movements as just a static picture. Instead, it treats them as dynamic, sequential data—exactly the kind of data modern AI is best at understanding.

In short: They taught the computer to watch the expert's eyes move in real-time, rather than just looking at a snapshot of where the eyes stopped. This helps the computer "think" more like a human doctor, leading to more accurate diagnoses.
