Human-like Working Memory Interference in Large Language Models

This paper reveals that pretrained large language models exhibit human-like working memory limitations caused not by a lack of context access, but by a shared computational constraint where entangled memory representations require active interference control to successfully recall task-relevant information.

Hua-Dong Xiong (School of Psychological and Brain Sciences, Georgia Tech), Li Ji-An (Department of Psychology, New York University), Jiaqi Huang (Department of Cognitive Science, Indiana University Bloomington, Honda Research Institute), Robert C. Wilson (School of Psychological and Brain Sciences, Georgia Tech, Center of Excellence for Computational Cognition, Georgia Tech), Kwonjoon Lee (Honda Research Institute), Xue-Xin Wei (Departments of Neuroscience and Psychology, The University of Texas at Austin)

Published 2026-04-14

The Big Question: Why Do Supercomputers Forget?

Imagine you have a library with 100 billion books (neurons/parameters). You can walk into any aisle and pull out any book instantly. You have a perfect map of the entire library.

Now, imagine someone asks you: "What was the third book I asked for 10 minutes ago?"

Even with your perfect library, you might struggle. If they asked for the 10th book, then the 20th, then the 30th, all while chatting about random topics in between, you might mix the requests up.

This is the mystery the paper solves. Large Language Models (LLMs) like the ones you chat with have massive "libraries" (they remember everything you typed earlier in the chat). They have "superpowers" to look back at any part of the conversation instantly. So, why do they still get simple memory games wrong?

The answer isn't that they can't see the information. It's that they get distracted by the noise.


The Game: The "N-Back" Challenge

To test this, the researchers played a game called N-Back with AI models.

  • The Rules: The AI is shown a stream of letters: A, B, C, D, E...
  • The Task: If the game is "2-Back," the AI must say the letter that appeared two steps ago.
    • Input: A, B, C, D
    • Output: - , - , A, B (Because A was two steps before C, and B was two steps before D).
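The scoring rule above can be sketched as a tiny helper (hypothetical code for illustration, not from the paper):

```python
def n_back_targets(stream, n):
    """The correct answer at each step is the item n steps back,
    or '-' for the first n steps, where no answer exists yet."""
    return [stream[i - n] if i >= n else "-" for i in range(len(stream))]

# 2-back over the example stream:
print(n_back_targets(["A", "B", "C", "D"], 2))  # ['-', '-', 'A', 'B']
```

The task itself is trivially easy for a program with indexed access to the whole stream, which is exactly why the models' failures are interesting.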

The Surprise:
When the researchers trained a tiny, simple AI to play only this game, it scored a perfect 100%.
But when they tested the giant, famous models (like Qwen, Gemma, and Llama), performance got worse and worse as the game got harder (e.g., recalling the letter from 4 steps back instead of 2).

Even though these giant models have access to the entire history of the chat, they act like humans who are easily distracted.


The Real Culprit: "Representational Interference"

The paper argues that the problem isn't storage (running out of space); it's interference (too much noise).

🧠 The Analogy: The Radio Station

Imagine your brain (or the AI) is a radio.

  • The Target: You want to listen to "Station 2" (the letter from 2 steps ago).
  • The Problem: "Station 1" (the letter you just saw) and "Station 3" (the letter from 3 steps ago) are broadcasting at the same time, on almost the same frequency.

Because the AI stores all these letters in a "mixed-up" way (entangled representations), when it tries to tune into "Station 2," the signal from "Station 1" is so loud and similar that it drowns out the target.

The AI doesn't forget the letter; it just can't find it in the mess.
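One way to make "entangled representations" concrete is a toy vector memory (our own illustration under simplified assumptions, not the paper's model): each letter is bound to a position key, all bindings are summed into one memory vector, and recall means unbinding with the target position's key. When the keys are distinct, recall is clean; when the keys overlap, the other letters bleed through.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4096
codebook = {c: rng.choice([-1, 1], dim) for c in "ABCDE"}

def position_keys(n, shared_frac):
    """Random +/-1 keys; `shared_frac` of each key's entries are copied
    from one shared template, so the keys overlap (are "entangled")."""
    template = rng.choice([-1, 1], dim)
    keys = rng.choice([-1, 1], (n, dim))
    mask = rng.random((n, dim)) < shared_frac
    keys[mask] = np.broadcast_to(template, (n, dim))[mask]
    return keys

def recall(stream, n_back, shared_frac):
    keys = position_keys(len(stream), shared_frac)
    # One superposed memory vector: every letter bound to its position key.
    memory = sum(keys[i] * codebook[c] for i, c in enumerate(stream))
    t = len(stream) - 1 - n_back          # position of the n-back target
    estimate = keys[t] * memory           # unbind (+/-1 keys are self-inverse)
    return max(codebook, key=lambda c: estimate @ codebook[c])

stream = list("ABCDE")
print(recall(stream, 2, shared_frac=0.0))  # distinct "stations": recovers 'C'
print(recall(stream, 2, shared_frac=0.9))  # entangled: other letters bleed through
```

With distinct keys the target letter dominates the decoded signal; as the keys overlap, every stored letter contributes to every readout, and the "radio stations" start drowning each other out.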

The Evidence: How We Know It's Interference

The researchers found three "smoking guns" showing that the AI is being confused by noise, not simply losing data:

  1. The "Recency" Bias:
    When the AI makes a mistake, it almost always guesses the most recent letter instead of the one from 2 steps ago. It's like a human saying, "I know I asked for the red one, but you just said blue, so maybe it's blue?" The most recent memory is too loud.

  2. The "Lure" Effect:
    If the AI sees a letter that looks like the one it's supposed to remember, it gets confused.

    • Example: If the target is X, and the AI just saw X again, it might accidentally grab the new X instead of the old X. The content of the letters interferes with the position.
  3. The "Smart" AI Is the "Focused" AI:
    Here is the coolest part: The models that are generally "smarter" (better at math, logic, and writing) are the ones that perform best at this memory game.

    • Why? Being "smart" in this context means being good at ignoring distractions. The AI that can successfully "mute" the irrelevant letters to find the right one is the same AI that is good at reasoning.
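"Lure" trials like the one in point 2 can be constructed programmatically (a hypothetical helper for building test streams, not the paper's code): a lure is a letter that repeats at a tempting wrong lag, matching the target's identity but not its position.

```python
def find_lures(stream, n, lags=(1, 3)):
    """Positions whose letter matches an item at a wrong lag
    (e.g. 1 or 3 steps back in a 2-back game): classic lure trials."""
    lures = []
    for i, c in enumerate(stream):
        if any(lag != n and i >= lag and stream[i - lag] == c for lag in lags):
            lures.append(i)
    return lures

# In a 2-back game, the X at index 2 repeats the X one step back (a lure),
# and the X at index 4 repeats the X three steps back (another lure):
print(find_lures(["A", "X", "X", "B", "X"], 2))  # [2, 4]
```

Streams seeded with lures like these are exactly where content interferes with position, so error rates on lure positions are a direct measure of the interference effect.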

How the AI "Thinks" (The Mechanism)

The researchers looked inside the AI's "brain" (its layers) to see how it solves the problem. They found a common pattern:

  1. Layers 1-5 (The Chaos): The AI holds all the letters in one big, jumbled pile. The target letter is mixed in with the noise.
  2. Layers 10-20 (The Filter): The AI starts to suppress the irrelevant letters. It's like a DJ slowly turning down the volume on the wrong stations.
  3. Layers 25+ (The Clarity): Finally, near the end, the target letter becomes loud and clear, and the AI outputs the answer.

The Catch: This filtering process is hard work. If there are too many letters (high memory load), the filter gets overwhelmed, and the noise leaks through.

The "Magic Fix" Experiment

To confirm that "noise" was the problem, the researchers performed surgery on the AI.

  • They took the "letter identity" information (the fact that a token is the letter 'A') and silenced it in the middle of the AI's processing.
  • Result: The AI actually got better at the game!
  • Why? By removing the "noise" of the specific letters, the AI had an easier time finding the target position. It proved that the AI was indeed struggling because the letters were fighting each other.
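The "surgery" can be sketched as projecting a hidden state onto the complement of a direction that encodes letter identity (a generic activation-ablation recipe; the paper's exact intervention may differ in detail):

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction`,
    silencing whatever information that direction carries."""
    d = direction / np.linalg.norm(direction)
    return hidden - (hidden @ d) * d

# Hypothetical 2-feature state: a "letter identity" axis and a "position" axis.
identity_axis = np.array([1.0, 0.0])
position_axis = np.array([0.0, 1.0])
hidden = 3.0 * identity_axis + 2.0 * position_axis

cleaned = ablate_direction(hidden, identity_axis)
print(cleaned @ identity_axis)  # 0.0 -> letter identity silenced
print(cleaned @ position_axis)  # 2.0 -> position signal untouched
```

The key property is that the projection only removes information along the chosen axis; if performance improves after the cut, the removed information must have been hurting, not helping.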

The Big Takeaway

Humans and AI share a common weakness.

Even though humans have biological brains and AI has silicon chips, we both face the same computational challenge: How do you pick the right thing out of a pile of similar things?

  • Old Idea: We forget because we run out of "RAM" (storage space).
  • New Idea: We forget because we can't filter out the interference.

The paper suggests that to make AI smarter, we shouldn't just give it bigger libraries (longer context windows). Instead, we need to give it better noise-canceling headphones: ways to actively suppress irrelevant information so the important stuff can shine through.

In short: The problem isn't that the AI can't see the past; it's that the past is too loud, and the AI needs to learn how to tune it out.
