SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation

The paper introduces SPGen, a novel deep learning model that utilizes unsupervised domain adaptation and stochastic sampling to accurately predict human eye movement scanpaths on paintings, thereby advancing the analysis and preservation of cultural heritage.

Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno

Published 2026-02-26

Imagine you are standing in front of a famous painting in a museum. You don't just stare at it like a statue; your eyes dance around. You might look at the face of the subject first, then jump to the bright red dress, then drift to the dark background, and finally land on a tiny detail in the corner. This journey your eyes take is called a scanpath.

The paper behind SPGen is about teaching a computer to predict how a human's eyes will dance across a painting. But here's the tricky part: computers are usually trained on photos of real life (cats, cars, trees), while paintings have different colors, styles, and rules.

Here is a simple breakdown of how the authors solved this puzzle, using some everyday analogies.

1. The Problem: The "Real World" vs. The "Art World"

Think of a computer model as a tour guide who has spent their whole life giving tours of a bustling city (natural photos). They know exactly where people look: at traffic lights, storefronts, and faces.

Now, you ask this same tour guide to lead a group through a fantasy art gallery (paintings). The guide gets confused! In the city, people look at the center of the street. In a painting, the "center" might be a quiet corner, or the most important part might be in the top left. The guide keeps trying to apply city rules to the art gallery, and the tour goes wrong.

The researchers needed a way to teach their "city guide" how to navigate the "art gallery" without having to hire a new guide who only knows art.

2. The Solution: SPGen (The Smart Eye-Tracker)

The authors built a new AI model called SPGen. Think of it as a super-smart robot eye that learns to mimic human curiosity. Here are its three main superpowers:

A. The "Bias Map" (The Internal Compass)

Humans have a natural habit of looking at the center of an image first (like looking at the middle of a menu before reading the sides). The model learns this habit using something called Gaussian Priors.

  • Analogy: Imagine the model has a magnetic compass that naturally pulls its attention toward the center of the room. But, unlike a real compass that always points North, this one is learnable. It can adjust its magnetism to fit the specific style of the painting it's looking at.
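To make the "learnable compass" idea concrete, here is a minimal sketch of a 2D Gaussian center-bias map. In a model like SPGen the mean and spread would be learnable parameters updated during training; the function and parameter names below are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_prior(height, width, mu_x=0.5, mu_y=0.5, sigma_x=0.3, sigma_y=0.3):
    """Build a 2D Gaussian 'center bias' map over an image grid.

    mu/sigma are in normalized [0, 1] coordinates. In a trained model
    these would be learnable, letting the 'compass' shift its pull
    away from the exact center to fit a painting's style.
    """
    ys = np.linspace(0.0, 1.0, height)[:, None]   # column of row coordinates
    xs = np.linspace(0.0, 1.0, width)[None, :]    # row of column coordinates
    g = np.exp(-(((xs - mu_x) ** 2) / (2 * sigma_x ** 2)
                 + ((ys - mu_y) ** 2) / (2 * sigma_y ** 2)))
    return g / g.sum()  # normalize so the map sums to 1, like a probability

prior = gaussian_prior(33, 33)
# With mu_x = mu_y = 0.5 the peak sits at the central cell of the grid.
```

Shifting `mu_x`/`mu_y` off 0.5, or widening the sigmas, is exactly the kind of adjustment a learnable prior can make for a painting whose "center of interest" is not the geometric center.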

B. The "Randomness Switch" (The Temperature Control)

This is the most unique part. If you ask two different people to look at the same painting, they will look at different things. If you ask the same person to look at it twice, they might still look at different spots. Human attention is stochastic (random).

  • Analogy: Most AI models are like a robot that always takes the exact same route every time. SPGen has a "Temperature Knob."
    • Low Temperature: The robot is very focused and predictable (like a strict tour guide).
    • High Temperature: The robot gets a little "tipsy" or playful. It adds a bit of random noise, allowing it to generate different eye paths for the same painting. This mimics how real humans have different moods and attention spans.
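The "Temperature Knob" is the standard temperature-scaled softmax sampling trick. Here is a generic sketch (not SPGen's exact sampler): each candidate image location gets a saliency score, and the temperature controls how sharply the model commits to the top score when picking the next fixation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fixation(saliency_logits, temperature=1.0):
    """Sample the next fixation index from a vector of saliency scores.

    temperature < 1 -> sharper, more deterministic scanpaths;
    temperature > 1 -> flatter, more varied ('tipsy') scanpaths.
    """
    z = saliency_logits / temperature
    z = z - z.max()                        # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()        # softmax over candidate locations
    return rng.choice(len(p), p=p)

logits = np.array([2.0, 1.0, 0.5, 0.1])
cold = [sample_fixation(logits, temperature=0.05) for _ in range(20)]
hot = [sample_fixation(logits, temperature=10.0) for _ in range(20)]
# At near-zero temperature almost every sample lands on the top-scoring
# location; at high temperature the samples spread across locations.
```

Running the sampler several times at high temperature yields a different scanpath each run, which is how a stochastic model can imitate two people (or two moods) looking at the same painting differently.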

C. The "Translator" (Unsupervised Domain Adaptation)

This is how they solved the "City vs. Art Gallery" problem. They didn't have enough data on how humans look at paintings to train the model from scratch. So, they used a trick called Unsupervised Domain Adaptation.

  • Analogy: Imagine the model is a student who studied hard for a Math test (Natural Photos) but is now taking an Art History test (Paintings).
    • The researchers added a "Domain Classifier" (a strict teacher) that tries to guess: "Is this a Math problem or an Art problem?"
    • They added a Gradient Reversal Layer. This is like a "reverse psychology" trick. When the teacher tries to tell the student "This is Art!", the student's brain flips the signal and says, "No, I will ignore the Art clues and focus only on the Math clues that are the same for both!"
    • Result: The model learns the universal rules of what catches the eye (like faces or bright colors) that apply to both cities and paintings, ignoring the specific "noise" that makes them different.
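The "reverse psychology" trick is the DANN-style gradient reversal layer: an identity function on the way forward, a sign flip on the way back. A minimal hand-rolled sketch (real implementations hook into an autograd framework; `lam`, the reversal strength, is an illustrative name):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in backward.

    The domain classifier above this layer learns to tell photos from
    paintings, but the reversed gradient pushes the feature extractor
    below it to produce features the classifier CANNOT tell apart.
    """
    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (often scheduled during training)

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_classifier):
        return -self.lam * grad_from_classifier  # flipped sign: 'ignore domain clues'

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0])
out = grl.forward(x)                          # identical to x
grad = grl.backward(np.array([0.3, -0.4]))    # sign-flipped, scaled by 0.5
```

Because the feature extractor receives the negated gradient, minimizing the classifier's loss from its side means *maximizing* domain confusion, which is what drives the features toward the domain-invariant "universal rules" described above.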

3. How They Tested It

They tested their robot guide on two types of maps:

  1. Natural Scenes (The City): They used the SALICON and MIT1003 datasets (photos of real life). The model did incredibly well, beating other top models.
  2. Paintings (The Art Gallery): They used datasets of famous paintings (like the Le Meur and AVAtt datasets).
    • Before the "Translator" trick: The model looked at paintings like it was looking at photos of cats. It got lost.
    • After the "Translator" trick: The model suddenly understood the art. It started looking at the important parts of the paintings, just like a human would.

4. Why Does This Matter?

Why do we care if a computer knows where our eyes go?

  • Preserving Culture: It helps us understand how people interact with art. We can analyze which parts of a masterpiece are most engaging to viewers.
  • Virtual Museums: Imagine a VR museum where the exhibit changes based on where you are looking. This technology could power those experiences.
  • Restoration: It can help restorers understand what details are most important to the human eye, ensuring they don't accidentally paint over the "soul" of the artwork.

The Bottom Line

SPGen is a clever AI that learns to "see" like a human. It uses a special trick to translate its knowledge from everyday photos to complex paintings, and it includes a "randomness" feature to mimic the unpredictable nature of human curiosity. It's a big step forward in helping computers understand not just what we see, but how we look.
