Imagine you are talking to a very smart, well-read assistant named "LLM" (Large Language Model). You ask it a question, and it gives you an answer based on everything it knows.
Now, imagine you are in a long conversation with this assistant. Every few minutes, you tell it, "Actually, I was wrong about that. The new fact is X." Then, five minutes later, you say, "No, wait, I changed my mind again. The new fact is Y." Then, "Actually, it's Z now."
This paper investigates what happens when you do this many, many times in a single conversation. Does the assistant remember the very first thing you told it? Does it remember the very last thing? Or does it get confused and mix them up?
Here is the breakdown of their findings, using some creative analogies.
1. The Problem: The "Echo Chamber" Effect
The researchers discovered a strange glitch they call "Retrieval Bias."
Think of your conversation with the AI as a long hallway.
- The Beginning of the Hallway: You put a sign up that says "The President is Alice."
- The Middle: You replace it with "The President is Bob."
- The End: You replace it with "The President is Charlie."
When you ask the AI, "Who is the President?" at the very end of the conversation:
- If you ask about the beginning: The AI remembers "Alice" almost perfectly. The first sign leaves the deepest imprint; even after it has been replaced twice, its message stays bright and clear in the AI's memory.
- If you ask about the end: The AI often forgets "Charlie." It might say "Bob" or "Alice," or just guess, even though "Charlie" is the sign currently hanging in the hallway.
The Analogy: Imagine trying to listen to a song where the DJ keeps changing the track. The first song (Alice) is etched into your memory because you heard it first. But the last song (Charlie) gets drowned out by all the noise of the songs in between. The more songs the DJ plays, the harder it is for you to remember the current song.
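The hallway setup can be sketched as a tiny prompt builder. (A sketch under assumptions: the function name, update format, and question wording here are illustrative, not the paper's actual templates.)

```python
def build_update_prompt(cue: str, values: list[str]) -> str:
    """Build a conversation that overwrites the same fact again and again,
    then asks for the current value (hypothetical format, for illustration)."""
    lines = [f"Update {i + 1}: The {cue} is {value}."
             for i, value in enumerate(values)]
    lines.append(f"Question: Who is the {cue} right now?")
    return "\n".join(lines)

# Three "signs in the hallway": Alice, then Bob, then Charlie.
prompt = build_update_prompt("President", ["Alice", "Bob", "Charlie"])
print(prompt)
```

A reliable model should answer "Charlie"; the finding is that, with enough intervening updates, models often answer with an earlier value instead.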
2. The Psychology Connection: The "AB-AC" Interference
The authors borrowed an idea from human psychology called AB-AC Interference.
- Scenario: You learn that A (a word) is linked to B (a picture). Later, you learn that A is actually linked to C (a different picture).
- The Result: When you try to recall what goes with A, your brain gets stuck between B and C. The old memory fights the new one.
The paper shows that LLMs suffer from this exact same problem, but on steroids. When the same "cue" (like "President of Italy") is updated 32, 64, or even 512 times in one go, the AI gets overwhelmed. The "old" memories crowd out the "new" ones.
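To make the AB-AC idea concrete, here is a hedged sketch of how such an update stream might be generated: the same cue is re-paired with a fresh value over and over, and only the most recent pairing counts as correct. (The names and event format are assumptions for illustration, not the authors' code.)

```python
import random

def make_update_stream(cues, n_updates, value_pool, seed=0):
    """Return (events, latest): a chronological list of (cue, value)
    re-pairings, plus the final value each cue should map to."""
    rng = random.Random(seed)
    events, latest = [], {}
    for _ in range(n_updates):
        for cue in cues:
            value = rng.choice(value_pool)
            events.append((cue, value))
            latest[cue] = value  # each new pairing overwrites the old one
    return events, latest

events, latest = make_update_stream(
    ["President of Italy"], n_updates=64,
    value_pool=["Alice", "Bob", "Charlie", "Dana"])
# 64 overwrites of a single cue; only the last event's value is "correct".
assert latest["President of Italy"] == events[-1][1]
```

Scaling `n_updates` to 32, 64, or 512 reproduces the regimes the paper tests: the longer the stream, the more old pairings compete with the newest one.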
3. The Investigation: Looking Inside the Brain
The researchers didn't just ask the AI questions; they looked inside its "brain" (its internal activations) to see why it was failing. They checked three things:
- Attention (Where is it looking?): Imagine the AI has a spotlight. When it gets the answer right, the spotlight shines brightly on the latest fact. When it gets it wrong, the spotlight becomes a weak, flickering flashlight that wanders around the whole hallway, unable to focus on the newest fact.
- Hidden States (The internal notes): When the AI is confused, its internal "notes" become blurry. It's like trying to read a handwritten note that has been smudged by rain. The clear distinction between "Old Fact" and "New Fact" disappears.
- Confidence (The "I'm sure" meter): Even when the AI is wrong, it often acts very confident. It's like a student guessing on a test who is 100% sure they are right, even though they are wrong.
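A toy version of the "spotlight" diagnostic: measure how much of one query position's attention lands on the tokens of the most recent update. (Pure-NumPy sketch with made-up attention vectors; the paper's actual measurements over real transformer attention heads will differ.)

```python
import numpy as np

def attention_mass_on_span(attn_row: np.ndarray, span: tuple[int, int]) -> float:
    """Fraction of one query position's attention (a distribution over
    context positions) that falls on a given token span."""
    start, end = span
    return float(attn_row[start:end].sum())

# Toy 10-token context where the latest fact occupies positions 7-9.
focused = np.zeros(10)
focused[7:10] = 1 / 3        # a bright spotlight on the newest fact
diffuse = np.full(10, 0.1)   # a weak flashlight wandering the whole hallway

print(attention_mass_on_span(focused, (7, 10)))  # ~1.0: locked onto the new fact
print(attention_mass_on_span(diffuse, (7, 10)))  # ~0.3: spread everywhere
```

High mass on the newest fact corresponds to the "bright spotlight" in correct answers; low, diffuse mass corresponds to the flickering flashlight in wrong ones.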
4. The "Fixes": Can We Help the AI?
The researchers tried to help the AI using "psychological tricks" (prompts) to see if they could fix the memory issue.
- Rehearsal: "Hey AI, please repeat the new fact to yourself a few times."
  - Result: It helped a little, like studying for a test, but didn't solve the problem.
- Storytelling: "Please imagine these facts are a story chain."
  - Result: A bit better, but still not enough.
- Forgetting: "Please tell yourself that the old facts are trash and only remember the new one."
  - Result: This was the most promising, but even this couldn't completely fix the issue.
The Verdict: The "band-aids" (prompts) helped a tiny bit, but they didn't cure the disease. The AI is still fundamentally bad at tracking a fact that changes dozens of times in a single conversation.
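At bottom, all three interventions are just instructions prefixed to the task. A minimal sketch (the strings below paraphrase the strategies as described above; the authors' exact wording surely differs):

```python
# Hypothetical paraphrases of the three prompting strategies.
STRATEGIES = {
    "rehearsal": "After each update, repeat the newest fact to yourself a few times.",
    "storytelling": "Imagine the sequence of updates as a chain of events in one story.",
    "forgetting": "Treat every earlier value as discarded; only the latest update counts.",
}

def apply_strategy(strategy: str, base_prompt: str) -> str:
    """Prepend one intervention instruction to the task prompt."""
    return f"{STRATEGIES[strategy]}\n\n{base_prompt}"

print(apply_strategy("forgetting", "Update 1: The President is Alice."))
```

Because the fix lives entirely in the prompt, it can nudge where the model looks but cannot change how its memory actually works, which is why none of the three closes the gap.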
5. Why Does This Matter?
This is a big deal because we are starting to use AI for things that require long, evolving conversations—like legal cases, medical histories, or news analysis where facts change daily.
If you ask an AI, "What was the stock price of Company X yesterday?" and then "What is it today?" and then "What is it right now?" in a long chat, the AI might confidently tell you the price from yesterday instead of right now, simply because the "noise" of all the updates confused it.
Summary
- The Issue: AI is great at remembering the start of a long conversation but terrible at remembering the end if the facts keep changing.
- The Cause: Too many updates create "noise" that drowns out the latest information (Cue-Overload).
- The Diagnosis: When the AI fails, its internal "spotlight" gets blurry, and it loses the ability to distinguish the new fact from the old ones.
- The Future: We can't just "prompt" our way out of this. We need to build smarter AI brains that are actually designed to handle long, changing stories without getting confused.