Image-to-Brain Signal Generation for Visual Prosthesis with CLIP Guided Multimodal Diffusion Models

This paper proposes a novel CLIP-guided multimodal diffusion transformer framework that leverages large language models for semantic enrichment and learnable spatio-temporal encodings to generate biologically plausible M/EEG brain signals from images, thereby addressing a critical gap in the brain encoding stage of visual prostheses.

Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Guoxu Zhou, Boyu Wang, Jian Zhu, Jinyi Long

Published 2026-02-17

Imagine you have a friend who is blind. You want to help them "see" the world again, not with their eyes, but by directly stimulating their brain. This is the dream of visual prosthetics (like a high-tech artificial eye).

For a long time, scientists have been great at the second half of this process: taking brain signals and turning them back into pictures (like reading a mind). But the first half—taking a picture from the real world and figuring out exactly what electrical signal to send to the brain to make it "see" that picture—has been a huge mystery. It's like knowing how to decode a secret message, but having no idea how to write one.

This paper introduces a new, clever way to solve that writing problem. Here is how it works, explained simply:

1. The Problem: The "Translator" Gap

Think of the brain as a very complex, foreign language.

  • The Old Way: Previous attempts to write this language were like guessing. Scientists would look at a picture of a dog and guess, "Maybe the brain thinks 'dog' like this?" But they didn't have a real dictionary to check if their guess was right. The results were often blurry and didn't look like real brain activity.
  • The New Goal: We need a perfect translator that can look at a photo and instantly write the exact "brain code" needed to make a blind person see that photo.

2. The Solution: A "Brain-Writer" AI

The authors built a new AI system that acts as this translator. They used a special type of AI called a Diffusion Model (think of it as an artist who starts with a blank, static-filled canvas and slowly paints a clear picture by removing the noise).
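
To make the "noisy canvas" idea concrete, here is a minimal sketch of how a diffusion model generates something: it starts from pure static and repeatedly subtracts its own noise estimate. Everything here (the denoiser argument, the noise schedule, the shapes) is illustrative, not the paper's actual code.

```python
# Minimal DDPM-style sampling loop: start from noise, denoise step by step.
# "denoiser" is a hypothetical network that predicts the noise in x, given the
# step t and a conditioning signal (here: the image the person should "see").
import torch

n_steps = 1000
betas = torch.linspace(1e-4, 0.02, n_steps)         # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)           # how much signal survives up to step t

@torch.no_grad()
def sample(denoiser, shape, condition):
    x = torch.randn(shape)                           # the static-filled "canvas"
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, t, condition)          # predicted noise, guided by the image
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # one step less noisy
    return x                                         # the finished "painting": a brain signal
```

In this paper's setting, `condition` carries the image (and, as described next, its text description), so each denoising step nudges the canvas toward the brain signal that matches that picture.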

Here are the three "superpowers" they gave this AI to make it work:

A. The "Double-Check" Dictionary (CLIP + LLM)

Usually, an AI looks at a picture and just sees pixels (colors and shapes). But the brain understands meaning.

  • The Trick: The AI doesn't just look at the picture. It also asks a smart language bot (a Large Language Model) to describe the picture in words.
  • The Analogy: Imagine you are trying to describe a sunset to someone who has never seen one. If you just say "orange and red," they might imagine a fire. But if you say, "It's a peaceful, warm sunset over the ocean with seagulls flying," they get the feeling of it.
  • How it helps: The AI combines the visual data (the photo) with the text description (the story). This helps the AI understand the core meaning of the image, not just the colors, so it can write a much more accurate brain signal. A rough code sketch of this fusion follows below.
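
Concretely, the "double-check" can be sketched as fusing two CLIP embeddings: one of the photo and one of an LLM-written caption. The sketch below assumes the public openai/clip-vit-base-patch32 checkpoint from Hugging Face; generate_caption is a hypothetical stand-in for whatever language model writes the description, and concatenation is just one simple way to fuse the two.

```python
# A minimal sketch of building the combined image+text condition.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_condition(image, generate_caption):
    caption = generate_caption(image)                 # e.g. "a golden retriever on grass"
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    # Fuse "what it looks like" with "what it means" into one conditioning vector.
    return torch.cat([img_emb, txt_emb], dim=-1)
```

That condition vector is what the diffusion model sees at every denoising step, so the generated signal is guided by both what the image looks like and what it means.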

B. The "Brain Map" GPS (Spatio-Temporal Encoding)

Brain signals are tricky because they happen in two ways at once:

  1. Where: Different parts of the brain light up for different things (the back of the brain sees shapes, the side sees motion).
  2. When: These signals happen in a specific order, like a drumbeat.

  • The Trick: The AI has a built-in GPS system. It knows exactly which "neighborhood" of the brain (spatial) and which "moment in time" (temporal) each signal belongs to.
  • The Analogy: Imagine a massive concert hall. If you just hear noise, it's chaos. But if you know who is playing (the violin section in the back) and when they are playing (the second beat of the song), you can reconstruct the music perfectly. The AI does this for the brain, as sketched in the code below.
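
Here is a rough sketch of what learnable spatio-temporal encodings can look like for M/EEG, assuming the recording arrives as a (batch, channels, time) tensor; the module name and dimensions are illustrative, not the paper's exact design.

```python
# Every signal sample gets a learned "where" tag (which electrode/brain region)
# and a learned "when" tag (which moment in the recording).
import torch
import torch.nn as nn

class SpatioTemporalEncoding(nn.Module):
    def __init__(self, n_channels=64, n_timesteps=256, d_model=128):
        super().__init__()
        self.proj = nn.Linear(1, d_model)                                        # lift each sample to d_model
        self.spatial = nn.Parameter(torch.randn(n_channels, d_model) * 0.02)     # "which electrode"
        self.temporal = nn.Parameter(torch.randn(n_timesteps, d_model) * 0.02)   # "which moment"

    def forward(self, x):                                    # x: (batch, channels, time)
        tokens = self.proj(x.unsqueeze(-1))                  # (batch, channels, time, d_model)
        tokens = tokens + self.spatial[None, :, None, :]     # tag the brain "neighborhood"
        tokens = tokens + self.temporal[None, None, :, :]    # tag the moment in time
        return tokens.flatten(1, 2)                          # (batch, channels*time, d_model)
```

Real implementations typically patch or downsample the time axis so the token sequence stays manageable; the point here is only that every token carries both a "where" and a "when" tag.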

C. The "Cross-Attention" Bridge

This is the mechanism that connects the picture to the brain signal.

  • The Analogy: Imagine the brain signal is a student trying to take a test. The "Key" and "Value" (the information) are the textbook. The "Query" is the student's question.
  • How it works: The AI asks the brain signal, "What do you need to see this picture?" and then looks at the combined picture+text description to find the answer. It aligns the two, ensuring the generated signal matches the visual input; a rough code sketch follows below.
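
Sketched below as standard transformer cross-attention (not necessarily the paper's exact module): the brain-signal tokens supply the Query, and the fused image+text condition supplies the Key and Value.

```python
# Cross-attention "bridge": brain tokens query the image+text condition tokens.
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, brain_tokens, condition_tokens):
        # Query = the brain signal ("what do I need to see this picture?")
        # Key/Value = the fused image+text description (the "textbook").
        attended, _ = self.attn(query=brain_tokens,
                                key=condition_tokens,
                                value=condition_tokens)
        return self.norm(brain_tokens + attended)    # residual + norm, transformer-style
```

Because the brain tokens ask the questions, every "where and when" token can pull in exactly the visual and semantic information it needs from the picture.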

3. The Results: Does it Work?

The team tested this on two huge datasets containing thousands of pictures and the actual brain recordings of people looking at them.

  • The Score: Their new AI was significantly better than all previous methods. It generated brain signals that were much closer to the real thing (biologically plausible).
  • The "Aha!" Moment: They found that if they removed the "text description" part or the "brain map" part, the AI got worse. This proved that understanding the story of the image and knowing the location in the brain are both essential.

Why Does This Matter?

This isn't just a cool science experiment. It's a missing piece of the puzzle for visual prosthetics.

If we can reliably turn a camera image into a perfect brain signal, we can build implants that don't just flash random lights in a blind person's mind. Instead, they could potentially restore a recognizable, meaningful view of the world. It's like going from a broken radio that only plays static to a high-fidelity stereo that plays the symphony of life.

In short: They taught an AI to speak "Brain" by teaching it to look at pictures, read the story behind them, and know exactly where and when to speak in the brain's complex language.
