Imagine you have a robot brain (a Large Vision Language Model, or LVLM) and a human brain. You show both of them a picture of a cat. The robot "sees" it as a grid of numbers and patterns. The human sees it as a fluffy animal, feels a sense of recognition, and their brain lights up with electrical sparks.
The big question this paper asks is: Do these two brains "think" about the picture in the same way?
Until now, scientists mostly checked this with fMRI brain scans, which track blood flow and only update every second or two, like watching a movie in slow motion. This new paper uses EEG, which is like putting a high-speed camera on the brain. It captures the brain's electrical activity millisecond by millisecond, giving us a fast, real-time look at how humans process images.
Here is the breakdown of what the researchers found, using some everyday analogies:
1. The "Middle Child" Discovery
The researchers looked at the "layers" inside the AI models. Think of an AI model like a multi-story office building:
- The Ground Floor (Early Layers): Just sees basic shapes, lines, and colors.
- The Penthouse (Deep Layers): Understands complex concepts and abstract ideas.
- The Middle Floors (Layers 8–16): This is where the magic happens.
The Finding: The AI's "middle floors" matched human brain activity most closely in the window from roughly 100 to 300 milliseconds after the image appeared. (A small sketch of how this kind of layer-by-time comparison can be run appears after the analogy below.)
- Analogy: It's like a relay race. The human brain starts with a quick glance (seeing the shape), then passes the baton to a deeper understanding (recognizing the object). The AI does the same thing, but the hand-off happens on its "middle floors." The ground floor and the penthouse didn't match the human brain as well as the middle floors did.
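For readers who want to see what "matching" means in practice, here is a minimal sketch of one standard way to line up model layers with EEG time points: representational similarity analysis (RSA). This illustrates the general technique rather than the paper's exact pipeline, and every array name and shape in it is made up for the example.

```python
# Minimal RSA-style sketch (illustration of the general technique, not the paper's
# exact pipeline). Assumed, made-up inputs:
#   layer_feats: dict mapping layer index -> array of shape (n_images, n_features)
#   eeg:         array of shape (n_images, n_channels, n_timepoints), one epoch per image
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    """Representational dissimilarity matrix: correlation distance between all image pairs."""
    return pdist(features, metric="correlation")  # condensed (upper-triangle) form

def layer_time_alignment(layer_feats, eeg):
    """Spearman correlation between each layer's RDM and the EEG RDM at each timepoint."""
    n_time = eeg.shape[2]
    eeg_rdms = [rdm(eeg[:, :, t]) for t in range(n_time)]
    scores = {}
    for layer, feats in layer_feats.items():
        model_rdm = rdm(feats)
        scores[layer] = [spearmanr(model_rdm, e)[0] for e in eeg_rdms]
    return scores  # if the paper's pattern holds, middle layers peak ~100-300 ms after onset
```

The idea is simple: if two systems treat the same pair of images as similar or different, their dissimilarity matrices will correlate, layer by layer and millisecond by millisecond.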
2. Design Matters More Than Size
A common belief in AI is: "If you make the model bigger (more parameters), it will be smarter and more human-like."
The Finding: The researchers tested 32 different models, from tiny ones to massive ones. They found that making the model bigger didn't help much. Instead, how the model was built mattered way more.
- Analogy: Imagine building a car. You can make a car with a massive engine (huge size), but if it's built like a boat, it won't drive well on the road. The models that were designed specifically to handle both images and language (multimodal) drove much closer to human thinking than the ones that only looked at images.
- The Stat: The model's design (its architecture) contributed about 3.4 times more to how well it matched the human brain than sheer size did. (A rough sketch of how you could test a claim like this follows this list.)
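One way to put a number on "design vs. size" is to ask how much of the model-to-model variation in brain-alignment scores each factor explains on its own. The sketch below does this with a plain regression; the data frame, its column names, and the use of ordinary least squares are assumptions for illustration, not the paper's actual statistical model.

```python
# Illustrative sketch of the "design vs. size" comparison: how much of the variation in
# brain-alignment scores is explained by architecture type alone vs. parameter count alone.
# The dataframe, its column names, and the use of plain OLS are assumptions for this example.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def variance_contributions(models: pd.DataFrame):
    """models has one row per model: 'alignment' (brain-similarity score),
    'arch' (e.g. 'multimodal' vs 'vision-only'), and 'params' (parameter count)."""
    models = models.assign(log_params=np.log10(models["params"]))
    r2_design = smf.ols("alignment ~ C(arch)", data=models).fit().rsquared
    r2_size = smf.ols("alignment ~ log_params", data=models).fit().rsquared
    return r2_design, r2_size  # the paper reports design mattering ~3.4x more than size
```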
3. The "Brain Map" Match
When humans look at a picture, the electrical signals travel along a specific path: first to the back of the brain (the visual cortex), then out along the sides (the temporal regions that work out what the object is).
The Finding: The AI's internal signals followed this exact same path and timing.
- Analogy: It's like a tour guide leading a group through a city. The human brain visits the "Visual District" first, then the "Meaning District." The AI's internal data traveled through its own "Visual District" and "Meaning District" in the same order and on a similar timescale. This suggests the AI isn't just guessing; its processing unfolds in the same sequence as the biological process of seeing. (The sketch below shows one way to check that ordering using EEG sensors over different parts of the head.)
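As an illustration of that "tour" claim, the sketch below asks when model-EEG similarity peaks over back-of-head (occipital) sensors versus side-of-head (temporal) sensors. The channel-name prefixes, array shapes, and the RSA-style comparison are assumptions carried over from the earlier sketch, not the paper's exact method.

```python
# Sketch of the "brain map" check: find when model-EEG similarity peaks over back-of-head
# (occipital, 'O...') sensors vs. side-of-head (temporal, 'T...') sensors. Channel-name
# prefixes, shapes, and the RSA-style comparison are assumptions, as in the earlier sketch.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features):
    return pdist(features, metric="correlation")

def peak_latency_per_region(model_feats, eeg, channel_names, times_ms):
    """Millisecond latency at which model-EEG similarity peaks, per sensor region."""
    regions = {
        "visual district (occipital)": [i for i, ch in enumerate(channel_names) if ch.startswith("O")],
        "meaning district (temporal)": [i for i, ch in enumerate(channel_names) if ch.startswith("T")],
    }
    model_rdm = rdm(model_feats)
    peaks = {}
    for name, idx in regions.items():
        sims = [spearmanr(model_rdm, rdm(eeg[:, idx, t]))[0] for t in range(eeg.shape[2])]
        peaks[name] = times_ms[int(np.argmax(sims))]
    return peaks  # expect the occipital peak to come before the temporal peak
```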
4. Better at Tasks = Closer to Humans
The researchers checked if the AI models that were better at real-world tasks (like describing an image or answering questions about it) were also better at matching the human brain.
The Finding: Yes! The models that scored higher on standard benchmarks were also the ones whose internal activity looked most like the human brain's EEG signals. (A tiny sketch of this correlation check follows the analogy below.)
- Analogy: Think of it like a student. The student who gets the best grades on the math test is also the one whose thought process most closely matches the teacher's method of solving the problem. If an AI is good at the job, it tends to be thinking more like a human.
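Checking this is essentially a rank correlation across models: line up each model's benchmark score with its brain-alignment score and see whether they rise together. The sketch below uses synthetic numbers purely to show the shape of the check.

```python
# Sketch of the performance-vs-alignment check: do models with higher benchmark scores
# also align better with human EEG? The 32 scores below are synthetic, for illustration only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
benchmark = rng.uniform(40, 80, size=32)                        # e.g. VQA-style accuracy per model
alignment = 0.3 + 0.005 * benchmark + rng.normal(0, 0.02, 32)   # made-up brain-alignment scores
rho, p = spearmanr(benchmark, alignment)
print(f"Rank correlation between task performance and brain alignment: rho={rho:.2f} (p={p:.3f})")
```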
Why Does This Matter?
This paper is a huge step forward because it gives us a new ruler to measure AI.
- Before: We measured AI by asking, "Can it pass a test?"
- Now: We can measure AI by asking, "Does it think like a human?"
This helps scientists build better, more "human-aligned" AI. It also suggests that by studying how our brains work, we can teach robots to see and understand the world more naturally, rather than just crunching numbers.
In short: The paper finds that modern AI models are starting to "see" the world in a way that is surprisingly similar to how our own brains do, especially when they are built with the right architecture and trained to understand both pictures and words.