When word order matters: human brains represent sentence meaning differently from large language models

This study using 7T fMRI data reveals that while large language models outperform order-agnostic models, they still fail to capture human brain representations of sentence meaning as effectively as models explicitly designed to encode structural relations, underscoring the critical role of sentence structure in human cognition.

Original authors: Fodor, J., Murawski, C., Suzuki, S.

Published 2026-03-18

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Question: Do AI Brains Think Like Human Brains?

Imagine you have two chefs.

  • Chef A is a human who has cooked for decades. They understand that "The dog chased the cat" is very different from "The cat chased the dog," even though the ingredients (the words) are exactly the same. They understand the story.
  • Chef B is a super-advanced robot (a Large Language Model, like the ones powering modern AI). It has read every recipe book in the world. It can write a perfect sentence about a dog chasing a cat.

The big question scientists have been asking is: When Chef B writes a sentence, does it "understand" the story the same way Chef A does? Or is the robot just a fancy parrot that mimics the sound of understanding without actually getting the meaning?

This paper says: The robot is good at mimicking, but it's failing the "story test."


The Experiment: The "Word Swap" Game

To find the answer, the researchers didn't just ask the AI to write essays. They set up a tricky game to see how the brain and the AI handle sentence structure.

They created 108 special sentences. Think of these sentences as LEGO sets.

  • The "Same" Set: You have a blue brick, a red brick, and a green brick.
  • The "Swapped" Set: You have the exact same blue, red, and green bricks, but you rearrange them.

The Trap:
If you just look at the list of bricks (the words), the two sets look identical.

  • Sentence 1: "The cameraman brought the equipment to the director."
  • Sentence 2: "The director brought the cameraman to the equipment."

To a computer that only counts words, these two sentences look identical: they contain exactly the same words, just in a different order. To a human, they are totally different stories! In the first, the cameraman does the bringing; in the second, the director does.
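To make that concrete, here is a minimal sketch (ours, not the paper's code) of what a pure word-counting model sees. It builds word-count vectors for the two sentences and measures their cosine similarity; because the swapped sentences use exactly the same words, the score is a perfect 1.0.

```python
from collections import Counter
from math import sqrt

def bag_of_words_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw word-count vectors (word order is ignored)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() | cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

s1 = "The cameraman brought the equipment to the director"
s2 = "The director brought the cameraman to the equipment"
print(bag_of_words_cosine(s1, s2))  # 1.0 -- same word counts, so the two stories look identical
```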

The Test:
The researchers put 30 people in an ultra-high-field (7T) fMRI scanner (a machine that takes detailed pictures of brain activity) and showed them these sentences. They also asked a group of people online to rate how similar the sentences were.

Then, they asked: "Which computer model's 'brain' matches the human brain's reaction to these swapped sentences?"

They tested four types of "computer brains":

  1. The "Word Bag" (Mean Model): A computer that just throws all the words in a bucket and averages them. It ignores order completely.
  2. The "Transformer" (The AI): The fancy AI models (like GPT-4) that we use today.
  3. The "Graph" Model: A computer that draws a map of how words connect to each other (like a family tree).
  4. The "Hybrid" Model: A mix of the two, designed specifically to understand roles (who did what to whom).

The Results: The Robot Missed the Mark

Here is what happened when they compared the computer models to the human brain:

1. The "Word Bag" was terrible.
As expected, the model that ignores word order got a negative match score, meaning its similarity judgments ran in the opposite direction from the brain's. It thought the swapped sentences were almost identical, while the human brain knew they were completely different stories.

2. The "Transformer" (AI) was better, but still wrong.
The AI models were much better than the "Word Bag." They did a decent job. However, when the researchers looked closely at the "Swapped" sentences, the AI still thought they were too similar.

  • The Analogy: Imagine the AI is like a student who memorized the vocabulary list but didn't read the instructions. It sees the words "cameraman," "director," and "equipment" and thinks, "Ah, these are the same ingredients, so the dish must be the same!" It missed the recipe.
  • The Brain: The human brain, however, lit up differently for the swapped sentences. It knew the roles had changed. The AI failed to match this pattern.

3. The "Hybrid" and "Graph" models won.
The models that were explicitly built to understand who did what to whom (the semantic roles) matched the human brain the best.

  • The Analogy: These models were like a detective who doesn't just list the suspects; they draw a map of who did what to whom. When the roles were swapped, the detective (and the human brain) said, "Wait, the plot has changed!"
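Here is a toy illustration (again ours, not the paper's actual graph or hybrid model) of why role-aware representations pull the swapped sentences apart. Each sentence is encoded as a set of (role, word) pairs, so swapping who did what changes the representation even though the words are identical.

```python
# Toy role-aware encoding: each sentence becomes a set of (role, word) pairs,
# so swapping who-did-what changes the representation even with identical words.

def role_pairs(agent: str, verb: str, theme: str, goal: str) -> frozenset:
    return frozenset([("agent", agent), ("verb", verb),
                      ("theme", theme), ("goal", goal)])

def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two sets: 1.0 = identical, 0.0 = nothing shared."""
    return len(a & b) / len(a | b)

s1 = role_pairs(agent="cameraman", verb="brought", theme="equipment", goal="director")
s2 = role_pairs(agent="director", verb="brought", theme="cameraman", goal="equipment")

# Only the shared verb overlaps: 1/7 ~= 0.14, versus 1.0 for the word-bag model.
print(jaccard(s1, s2))
```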

The "Long Sentence" Surprise

There was one other weird thing they found.
When people read very long sentences, their brains lit up in a very similar way, regardless of what the sentence actually meant.

  • The Analogy: It's like walking into a gym. Whether you are lifting a heavy weight or just stretching, your heart rate goes up and you sweat. The brain seems to have a "long sentence mode" where it just gets ready to work hard, ignoring the specific details for a moment. The researchers had to account for this "gym mode" to see the real differences in how the brain understood the stories.
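This summary doesn't say exactly how the researchers corrected for the length effect, but a standard way to remove a shared "gym mode" signal is a partial correlation: regress the length-driven similarity out of both the model's and the brain's similarity scores, then correlate what remains. A minimal sketch, assuming the confound is expressed as a vector of pairwise length differences:

```python
import numpy as np
from scipy.stats import rankdata

def partial_spearman(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> float:
    """Spearman correlation between x and y after regressing z out of both."""
    def residualize(a: np.ndarray, confound: np.ndarray) -> np.ndarray:
        a_r, c_r = rankdata(a), rankdata(confound)
        slope, intercept = np.polyfit(c_r, a_r, 1)
        return a_r - (slope * c_r + intercept)
    return np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

# Hypothetical flattened pairwise scores for 500 sentence pairs.
rng = np.random.default_rng(1)
length_diff = rng.random(500)                    # |length_i - length_j| per pair
brain_rdm = length_diff + 0.3 * rng.random(500)  # brain dissimilarity partly length-driven
model_rdm = length_diff + 0.3 * rng.random(500)  # model dissimilarity, same confound

# The raw correlation is inflated by the shared length effect; partialling
# out length reveals how much meaning-driven agreement remains.
print(partial_spearman(model_rdm, brain_rdm, length_diff))
```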

The Bottom Line

What does this mean for us?

  1. AI is impressive, but it's not human. Large Language Models are amazing at generating text and answering questions. But they don't "think" about sentence structure the way our brains do. They are more like super-advanced pattern matchers than true understanders.
  2. Structure matters. The human brain cares deeply about the order of words and the roles they play. If you swap the subject and the object, the meaning changes completely, and our brains know it instantly. Current AI models are still a bit "clumsy" at this specific task.
  3. We need better models. To build AI that truly understands us, we might need to move away from just "predicting the next word" and start building models that explicitly map out the relationships between words, just like our brains do.

In short: The AI can write a poem, but it doesn't quite understand the story behind the words the way a human does. It's a brilliant mimic, but not a true thinker yet.
