FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification

The paper introduces FixationFormer, a transformer-based architecture that directly integrates expert gaze trajectories as sequential tokens to achieve state-of-the-art chest X-ray classification by effectively modeling the temporal and spatial structure of diagnostic eye movements alongside image features.

Daniel Beckmann, Benjamin Risse

Published 2026-03-25

Imagine you are trying to teach a computer to read a chest X-ray the way a seasoned radiologist does.

The Problem: The "Blurry" Map vs. The "Live" Tour

Traditionally, computers look at X-rays using a specific type of brain (a convolutional neural network, or CNN) that is great at spotting patterns in pictures. To help them, researchers have tried to show them where a human expert looked.

Usually, they did this by turning the expert's eye movements into a static heatmap.

  • The Analogy: Imagine trying to explain a hiking trail to someone by showing them a single, blurry photo of the whole mountain with a red dot where you stopped. You've lost the story of the hike. You don't know if the expert looked at the top of the mountain first, then the bottom, or if they stared at a specific crack in the rock for ten seconds. The "heatmap" is just a blurry summary; it misses the timing and the order of the expert's thoughts.

The Solution: FixationFormer (The "Live Tour" Guide)

The authors of this paper, Daniel Beckmann and Benjamin Risse, realized that modern AI (specifically Transformers, the same tech behind chatbots) is actually perfect for understanding sequences.

They built a new system called FixationFormer. Instead of turning the eye movements into a blurry map, they treated the expert's gaze like a story or a playlist.

  • The Analogy: Instead of a static map, imagine the computer gets a live GPS tour from the expert.
    • "First, look here for 2 seconds."
    • "Then, jump to the left and stare at that spot for 3 seconds."
    • "Finally, zoom in on the bottom right."

The computer doesn't just see where the expert looked; it sees the sequence of their thinking process.

How It Works (The "Conversation")

The system has two main characters having a conversation:

  1. The Image: The X-ray itself, broken down into tiny puzzle pieces.
  2. The Gaze: The expert's eye movements, turned into a sequence of "tokens" (like words in a sentence).
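The paper itself does not include code here, but the idea of turning a fixation sequence into transformer tokens can be sketched in a few lines of NumPy. Everything below is illustrative: the `(x, y, duration)` triples, the embedding size, and the linear projection are assumptions for the sake of the analogy, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical fixation sequence: (x, y, duration_seconds), in viewing order.
# Coordinates are normalized to [0, 1]; the values are made up for illustration.
fixations = np.array([
    [0.30, 0.25, 2.0],   # "first, look here for 2 seconds"
    [0.15, 0.40, 3.0],   # "then jump left and stare for 3 seconds"
    [0.80, 0.85, 1.5],   # "finally, the bottom right"
])

rng = np.random.default_rng(0)
d_model = 8  # tiny embedding size, just for the sketch

# A linear projection turns each (x, y, duration) triple into a token vector,
# analogous to how word embeddings turn words into vectors.
W = rng.normal(size=(3, d_model))
tokens = fixations @ W               # shape: (num_fixations, d_model)

# Sinusoidal positional encodings preserve the *order* of the tour,
# which is exactly the information a static heatmap throws away.
positions = np.arange(len(fixations))[:, None]
freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
pos_enc = np.zeros_like(tokens)
pos_enc[:, 0::2] = np.sin(positions * freqs)
pos_enc[:, 1::2] = np.cos(positions * freqs)

gaze_tokens = tokens + pos_enc       # ready to enter a transformer
print(gaze_tokens.shape)             # (3, 8)
```

Because each token carries its place in the sequence, swapping two fixations produces a different input, whereas the same swap would leave a heatmap unchanged.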

They use a special "attention" mechanism (like a spotlight) to let these two characters talk to each other:

  • One-Way Chat (Cross-Attention): The X-ray asks the Gaze, "Hey, where should I focus my attention based on what the expert did?" The X-ray updates its understanding, but the Gaze stays the same.
  • Two-Way Chat (Two-Way Attention): They talk back and forth. The X-ray asks the Gaze for help, and the Gaze also asks the X-ray, "Wait, does this spot on the image make sense with where I'm looking?"
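The difference between the two chats can be made concrete with a minimal scaled dot-product attention sketch. This is an assumption-laden simplification: real cross-attention uses learned query/key/value projections, multiple heads, and layer normalization, all omitted here, and the tensor sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
image_patches = rng.normal(size=(16, d))  # 16 X-ray puzzle pieces
gaze_tokens = rng.normal(size=(3, d))     # 3 fixation tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Queries are updated by attending to context; context is untouched."""
    scores = queries @ context.T / np.sqrt(d)   # similarity of each query to each context token
    weights = softmax(scores, axis=-1)          # where each query "focuses"
    return queries + weights @ context          # residual update of the queries only

# One-way chat: the image listens to the gaze; the gaze tokens never change.
updated_patches = cross_attention(image_patches, gaze_tokens)

# Two-way chat: each side also listens to the other.
updated_gaze = cross_attention(gaze_tokens, image_patches)
print(updated_patches.shape, updated_gaze.shape)  # (16, 8) (3, 8)
```

In the one-way setup only `updated_patches` is computed, which matches the paper's finding that letting the image listen without the gaze "arguing back" worked best.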

The Results: Why It Matters

The team tested this on three different sets of chest X-rays. Here is what happened:

  1. It Works Better: In most cases, the computer that listened to the "live tour" (the sequence) got better at diagnosing diseases than the ones that just looked at the "blurry map" (heatmaps).
  2. It Helps Weaker Brains Most: When they used a "weaker" computer brain (one that hadn't been pre-trained on millions of medical images), the "live tour" helped it outperform the "blurry map" by a wide margin. It's like giving a novice hiker a detailed GPS guide: they can navigate far better than with a blurry photo of the trail.
  3. The "One-Way" Chat Won: Interestingly, the system worked best when the X-ray listened to the Gaze, but didn't try to argue back. Sometimes, just letting the expert's path guide the computer is enough; trying to make them "debate" each other actually confused the system a bit.

The Big Picture

This paper is a game-changer because it stops treating human eye movements as just a static picture. Instead, it treats them as dynamic, sequential data—exactly the kind of data modern AI is best at understanding.

In short: They taught the computer to watch the expert's eyes move in real-time, rather than just looking at a snapshot of where the eyes stopped. This helps the computer "think" more like a human doctor, leading to more accurate diagnoses.
