Neural network-based encoding in free-viewing fMRI with gaze-aware models

This paper introduces a gaze-aware encoding model that combines eye-tracking data with CNN features to predict brain activity during naturalistic, free-viewing fMRI. The model matches the performance of conventional approaches with far fewer parameters, enabling more ecologically valid neuroscience research.

Original authors: Gozukara, D., Ahmad, N., Seeliger, K., Oetringer, D., Geerligs, L.

Published 2026-03-11

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: Watching a Movie vs. Staring at a Dot

Imagine you are sitting in a movie theater. In real life, you are free to look around. You might look at the hero's face, then glance at the background scenery, then track a car speeding by. Your eyes are constantly dancing, picking up the most interesting parts of the story.

However, for decades, scientists studying the brain with MRI machines have forced people to do the opposite. They tell participants: "Don't move your eyes. Stare at a tiny dot in the center of the screen for two hours."

This is like watching a movie while wearing a blindfold that only has a tiny hole in the center. You can see the dot, but you miss everything else. While this makes the data easier to analyze, it doesn't reflect how our brains actually work in the real world. It's also mentally exhausting to stare at a dot while a chaotic movie plays around you.

The Problem: The "Blind" Computer Model

Scientists use powerful computer programs (called Convolutional Neural Networks, or CNNs) to predict how a person's brain will respond to what they are seeing, moment by moment.

  • The Old Way: The computer looks at the entire movie frame, pixel by pixel, from top to bottom, left to right. It tries to guess which part of the image made a specific brain cell light up.
  • The Flaw: This is like trying to find a needle in a haystack by scanning the entire hayfield at once. The computer has to learn millions of numbers (parameters) to make a guess. It's computationally expensive, requires huge amounts of data, and ignores the fact that you were only looking at a tiny corner of the screen (a minimal sketch of this setup follows below).
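To make the old way concrete, here is a minimal, hedged sketch of a whole-frame encoding model: extract CNN features from each full movie frame, then fit one linear readout per brain voxel. It assumes torchvision's VGG-19 and scikit-learn's Ridge regression; all sizes and names are illustrative, not taken from the paper.

```python
# A hedged sketch of the conventional whole-frame encoding model.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import Ridge

# Pre-trained VGG-19, keeping only the convolutional feature extractor.
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def whole_frame_features(frame):
    """Extract convolutional features for one full movie frame (a PIL image)."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)  # (1, 3, 224, 224)
        fmap = vgg(x)                       # (1, 512, 7, 7) after all conv blocks
    return fmap.flatten().numpy()           # 25,088 numbers per frame

# One linear readout per voxel: every spatial position of every channel gets
# its own weight, which is what makes this approach so parameter-heavy.
# X: (n_timepoints, n_features) stacked frame features
# y: (n_timepoints,) one voxel's measured BOLD signal
# voxel_model = Ridge(alpha=1.0).fit(X, y)
```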

The Solution: The "Gaze-Aware" Spotlight

The authors of this paper proposed a smarter way. They asked: "What if we only feed the computer the parts of the movie that the person actually looked at?"

They used eye-tracking technology to see exactly where the participants' eyes landed. Then, they built a new model called a "Gaze-Aware Encoding Model."

Here is the analogy:

  • The Old Model is like a security guard trying to describe a room by memorizing every single brick, dust mote, and shadow in the entire building, even the parts no one looked at.
  • The New Model is like a spotlight. It only shines on the specific spot where the person's eyes are looking and ignores the rest of the room (see the crop sketch just below).
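In code, the "spotlight" is simply a crop centered on the gaze position. Here is a minimal sketch; the patch size and edge-padding strategy are illustrative assumptions, not the paper's exact choices.

```python
# A minimal sketch of the "spotlight": crop a fixed-size patch around the
# current gaze position. Patch size and padding mode are assumptions.
import numpy as np

def gaze_crop(frame, gaze_x, gaze_y, patch=112):
    """Return a (patch, patch, 3) crop of `frame` centered on the gaze point."""
    half = patch // 2
    # Pad the frame so crops near the screen edge keep the same size.
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)), mode="edge")
    y, x = int(gaze_y) + half, int(gaze_x) + half
    return padded[y - half:y + half, x - half:x + half]
```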

How They Did It (The Recipe)

  1. The Dataset: They used a public dataset called "StudyForrest," where people watched the movie Forrest Gump (in German) without being told to stare at a dot. Their eye movements were recorded the whole time.
  2. The Computer Brain: They used a pre-trained AI (VGG-19) that is really good at recognizing images.
  3. The Trick: Instead of feeding the AI the whole image, they used the eye-tracking data to "crop" the image. If a person looked at a tree on the left, the AI only analyzed the tree. If they looked at a car on the right, the AI only analyzed the car.
  4. The Result: They created a "feature time series" that only contained the visual information relevant to where the eyes were looking at that exact moment (a sketch of this pipeline follows the list).
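Putting the recipe together, a hedged sketch of the pipeline might look like the following. It reuses the gaze_crop helper from above; frames, gaze, extractor, and preprocess are assumed, illustrative inputs (time-aligned movie frames, gaze coordinates, a VGG-19 feature extractor, and its preprocessing), not names from the paper.

```python
# A hedged sketch of the gaze-aware feature pipeline.
import numpy as np
import torch
from PIL import Image

def gaze_feature_timeseries(frames, gaze, extractor, preprocess):
    """Build one feature vector per frame from gaze-centered crops.

    frames : iterable of (H, W, 3) uint8 movie frames
    gaze   : iterable of (x, y) gaze coordinates, one pair per frame
    """
    feats = []
    for frame, (gx, gy) in zip(frames, gaze):
        patch = gaze_crop(frame, gx, gy)               # the "spotlight" crop
        x = preprocess(Image.fromarray(patch)).unsqueeze(0)
        with torch.no_grad():
            feats.append(extractor(x).flatten().numpy())
    # In practice, these per-frame features would still be downsampled to the
    # fMRI sampling rate and convolved with a hemodynamic response function
    # before fitting the voxel-wise regression.
    return np.stack(feats)                             # (n_frames, n_features)
```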

The Surprising Results

The researchers compared their "Spotlight" model against the "Whole Room" model. Here is what they found:

  1. Same Accuracy, Less Work: The new model predicted brain activity just as well as the old, heavy model.
  2. Massive Efficiency: The new model used 112 times fewer parameters (see the rough arithmetic after this list).
    • Analogy: Imagine the old model was a 10,000-page encyclopedia trying to describe a single sentence. The new model was a 90-page book that described the same sentence perfectly.
  3. Memory Savings: Because the model was so much smaller, it could run on a standard laptop. The old model needed a massive machine just to fit everything in memory.
  4. Dynamic Viewers Win: The new model worked especially well for people who moved their eyes a lot. The more active the viewer, the better the model performed, strong evidence that tracking eye movements is crucial for understanding how we process dynamic scenes.
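Why does shrinking the input shrink the model? A linear readout assigns each voxel one weight per spatial position of every feature channel, so the parameter count scales with the size of the feature map. The numbers below are toy assumptions chosen only to show the mechanism; the paper's reported ratio is roughly 112x.

```python
# Toy arithmetic: weights per voxel for a linear readout over CNN features.
# A whole-frame model typically runs at higher resolution (a bigger feature
# grid) than a small gaze-centered crop; the exact sizes here are assumed.
C = 512                    # channels in a late VGG-19 conv layer
full_h, full_w = 28, 28    # feature grid for the full frame (assumed)
crop_h, crop_w = 7, 7      # feature grid for a gaze crop (assumed)

full_params = C * full_h * full_w   # 401,408 weights per voxel
crop_params = C * crop_h * crop_w   # 25,088 weights per voxel
print(full_params // crop_params)   # 16x smaller under these toy numbers
```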

Why This Matters

This paper is a game-changer for two main reasons:

  • Realism: It allows scientists to study the brain in a way that feels like real life. We don't stare at dots in the real world; we explore. This model respects that natural behavior.
  • Accessibility: Because the model is so efficient, smaller labs with less money and less computing power can now run these complex brain studies. You don't need a supercomputer to understand how the brain sees the world anymore.

The Bottom Line

The authors showed that simply paying attention to where people look lets us build brain models that are smarter, faster, and cheaper, while remaining just as accurate. It's a shift from forcing the brain to behave unnaturally to letting people view the world naturally, and building models that respect that behavior.
