Imagine you are trying to guess what a friend is dreaming about just by looking at a scan of their brain activity. That's essentially what visual decoding is: trying to turn brain activity back into the images the person was seeing or imagining.
For a long time, scientists tried to do this with a complicated, two-step "middleman" approach. It was like trying to translate a secret message from Brain Language to English, and then translating that English into French (the final picture). The problem? Every time you translate, you lose some nuance, and you can't tell exactly which part of the brain was responsible for which part of the picture.
This paper introduces a new method called NeuroAdapter that cuts out the middleman. Here is how it works, explained with some everyday analogies:
1. The Old Way: The "Translator Chain"
Think of previous methods like a game of "Telephone" played with a translator.
- Step 1: You take the brain signal and train a model to translate it into the feature space of a super-smart AI (like CLIP or DINO), producing a generic intermediate code that stands for something like "a red ball."
- Step 2: You give that description to an image generator to draw the ball.
- The Flaw: If the translator makes a mistake, the picture is wrong. Also, if you want to know why the AI drew a red ball, you can't easily tell if it was because of the brain's "color center" or its "shape center." The connection is blurry.
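To make the contrast concrete, here is a minimal sketch of that two-stage pipeline in PyTorch. Everything in it (the toy sizes, the linear "translator," the `DummyGenerator`) is an illustrative stand-in, not the actual models used in prior work:

```python
import torch
import torch.nn as nn

# Hypothetical two-stage pipeline. All names and sizes are illustrative
# stand-ins, not the actual models used in prior work.
N_VOXELS, EMB_DIM = 2048, 768  # toy voxel count -> a CLIP-sized embedding

# Stage 1: a learned "translator" that maps brain activity into a pretrained
# feature space (CLIP/DINO embeddings); often just a linear/ridge regression.
brain_to_embedding = nn.Linear(N_VOXELS, EMB_DIM)

# Stage 2: a frozen generator built to consume those embeddings.
class DummyGenerator(nn.Module):
    def forward(self, embedding):
        # Stand-in for an image generator conditioned on the embedding.
        return torch.rand(embedding.shape[0], 3, 64, 64)

generator = DummyGenerator()

fmri = torch.randn(1, N_VOXELS)           # one scan
embedding = brain_to_embedding(fmri)      # the lossy "translation" step
image = generator(embedding)              # the generator never sees the brain

# The flaw: `embedding` is the only channel between brain and image, so the
# question "which brain region caused which pixel?" dead-ends at this bottleneck.
```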
2. The New Way: NeuroAdapter (The "Direct Line")
The authors built a system that connects the brain directly to the image generator, skipping the translation step entirely.
- The Analogy: Imagine the image generator (a Latent Diffusion Model) is a master chef. Previously, you had to give the chef a written recipe (the intermediate translation) to cook the dish.
- NeuroAdapter is like handing the chef a live video feed of the customer's brain. The chef looks at the brain activity and says, "Ah, I see the signal for 'face' and 'blue,' so I'll start cooking that."
- The Result: The chef (the AI) cooks the image directly from the brain signal. The picture is just as good as before, but now the connection between brain and image is direct and easy to inspect (see the sketch below).
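A minimal sketch of that direct line, assuming an adapter-style design in which the diffusion model's cross-attention layers read brain-derived tokens instead of a translated embedding. The toy sizes, the single attention layer, and the adapter itself are all illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Sketch of the "direct line": the generator's cross-attention reads
# brain-derived tokens instead of a translated CLIP/text embedding.
# Sizes and names are illustrative only; real models are far larger.
N_VOXELS, N_TOKENS, D_MODEL = 2048, 200, 64

# The adapter: a small trainable map from the raw scan to a sequence of
# conditioning tokens (one way to build these tokens is sketched in section 3).
adapter = nn.Linear(N_VOXELS, N_TOKENS * D_MODEL)

# One cross-attention layer standing in for the frozen U-Net's attention
# blocks: image latents query the brain tokens directly.
cross_attn = nn.MultiheadAttention(D_MODEL, num_heads=8, batch_first=True)

fmri = torch.randn(1, N_VOXELS)                        # one scan
brain_tokens = adapter(fmri).view(1, N_TOKENS, D_MODEL)
latents = torch.randn(1, 32 * 32, D_MODEL)             # flattened image latents
out, attn = cross_attn(latents, brain_tokens, brain_tokens)

# `attn` has shape (1, 1024, 200): for every latent patch, how strongly it
# listened to every brain token. No intermediate translation in the loop.
print(attn.shape)
```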
3. The "Brain Token" Puzzle
The brain is huge and messy. To make this work, the researchers broke the brain down into 200 distinct neighborhoods (called "parcels").
- The Analogy: Imagine the brain is a giant orchestra. Instead of listening to the whole symphony at once, they assigned a specific "token" (a musical note) to each section of the orchestra (the violin section, the drum section, etc.).
- They taught the image generator to listen to these specific notes. When the "face area" of the brain lights up, the generator knows to focus on drawing faces.
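A toy version of that orchestra idea: pool the voxels inside each parcel, then give every parcel its own learned "instrument" embedding, scaled by how loudly that parcel is playing. The parcel assignment, pooling, and embedding scheme below are all invented for illustration; a real parcellation comes from a brain atlas:

```python
import torch
import torch.nn as nn

# Toy sketch of parcel-based "brain tokens": one token per brain neighborhood.
# Voxel labels, sizes, and the pooling scheme are invented for illustration.
N_VOXELS, N_PARCELS, D_MODEL = 2048, 200, 64

# Toy assignment of every voxel to a parcel (a real atlas is anatomical).
parcel_of_voxel = torch.arange(N_VOXELS) % N_PARCELS

# Each parcel gets a learned identity embedding ("which instrument"), scaled
# by that parcel's pooled activity ("how loudly it is playing right now").
parcel_identity = nn.Embedding(N_PARCELS, D_MODEL)

def brain_tokens(fmri):                      # fmri: (batch, N_VOXELS)
    pooled = torch.stack(
        [fmri[:, parcel_of_voxel == p].mean(dim=1) for p in range(N_PARCELS)],
        dim=1,
    )                                        # (batch, N_PARCELS)
    return parcel_identity.weight * pooled.unsqueeze(-1)  # (batch, 200, 64)

fmri = torch.randn(2, N_VOXELS)
print(brain_tokens(fmri).shape)              # torch.Size([2, 200, 64])
```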
4. The "X-Ray Vision" (IBBI Framework)
The coolest part of this paper isn't just the picture; it's the ability to see how the picture is being made. They created a tool called IBBI (Image-Brain BI-directional framework).
- The Analogy: Imagine watching a painter create a masterpiece. With old methods, you could only see the final painting. With IBBI, you have X-ray vision that shows you exactly which brushstrokes were guided by which part of the brain.
- How it works: As the AI gradually turns a blurry cloud of noise into a clear image (like a time-lapse video), IBBI tracks which brain neighborhoods are "talking" to the AI at every step of that denoising process.
- Early in the process: The "Scene" area of the brain might be shouting, "Make it look like a forest!"
- Later in the process: The "Face" area might whisper, "Add eyes here."
- Why it matters: This lets scientists see the "generative trajectory." They can prove that specific parts of the brain are actually responsible for specific parts of the image, not just guessing.
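A toy version of that X-ray bookkeeping: run the denoising loop, and at every step record how strongly the image latents attend to each brain token. The attention layer and the update rule below are stand-ins; the actual framework instruments a real diffusion U-Net:

```python
import torch
import torch.nn as nn

# Toy IBBI-style bookkeeping: at every denoising step, record how strongly
# the image latents attend to each brain token. The attention layer and the
# update rule are stand-ins for instrumenting a real diffusion U-Net.
N_TOKENS, D_MODEL, N_LATENTS, N_STEPS = 200, 64, 1024, 50

cross_attn = nn.MultiheadAttention(D_MODEL, num_heads=8, batch_first=True)
brain_tokens = torch.randn(1, N_TOKENS, D_MODEL)     # from the adapter
latents = torch.randn(1, N_LATENTS, D_MODEL)         # starts as pure noise

influence = torch.zeros(N_STEPS, N_TOKENS)           # the generative trajectory
with torch.no_grad():
    for t in range(N_STEPS):
        out, attn = cross_attn(latents, brain_tokens, brain_tokens)
        # attn: (1, N_LATENTS, N_TOKENS). Summing over latent patches gives
        # how loudly each brain parcel is "talking" to the image at step t.
        influence[t] = attn[0].sum(dim=0)
        latents = latents + 0.1 * out                # toy denoising update

early = influence[:10].mean(dim=0)    # tokens driving coarse layout ("forest")
late = influence[-10:].mean(dim=0)    # tokens driving fine detail ("eyes")
print(early.argmax().item(), late.argmax().item())
```

Plotting `influence` over the steps gives exactly the kind of trajectory described above: some brain tokens dominate early (coarse scene layout), others take over late (fine details).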
5. The "Mental Imagery" Test
To prove it works, they didn't just test it on people looking at photos. They tested it on people imagining photos in their heads (mental imagery).
- The Result: The system could reconstruct what people were imagining, even though they weren't looking at anything. It's like reading someone's mind as they daydream about a beach or a cat.
Summary
In short, this paper says: "Stop translating brain signals into text or generic features. Just plug the brain directly into the image generator."
Not only does this make better pictures, but it also gives us a "control panel" that shows us exactly which parts of the brain are driving the creation of the image. It turns "mind reading" from a magic trick into a transparent, understandable process.