Imagine you are teaching a very smart, but slightly naive, robot how to do a new math trick. You show it a few examples: "2 + 2 = 4," "3 + 3 = 6," and "5 + 5 = 10." The robot looks at these examples, figures out the pattern (it's addition), and successfully solves a new problem. This is called In-Context Learning.
But what happens if you slip a lie into the examples? You show it: "2 + 2 = 4," "3 + 3 = 6," "5 + 5 = 12" (a lie!), and "7 + 7 = 14."
Even though three out of four examples are correct, the robot often gets confused and fails. It might start thinking the answer is 12. This paper investigates why this happens and where inside the robot's "brain" the confusion takes root.
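The setup above can be sketched as a plain few-shot prompt. This is a generic illustration of a corrupted in-context-learning prompt, not the paper's exact format; the helper function and its name are hypothetical:

```python
def build_icl_prompt(demos, query):
    """Format (question, answer) demonstrations plus a final query,
    one per line, the way a few-shot prompt is typically laid out."""
    lines = [f"{q} = {a}" for q, a in demos]
    lines.append(f"{query} = ")
    return "\n".join(lines)

demos = [
    ("2 + 2", 4),
    ("3 + 3", 6),
    ("5 + 5", 12),   # the corrupted demonstration: the lie
    ("7 + 7", 14),
]
prompt = build_icl_prompt(demos, "6 + 6")
print(prompt)
```

Three of the four demonstrations still show clean addition, yet the one corrupted line is enough to derail the model's answer to the final query.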
Here is the story of their discovery, broken down into simple parts:
1. The Problem: One Bad Apple Spoils the Bunch
The researchers found that these AI models are incredibly fragile. If you have 100 correct examples and just one wrong one, the model's performance can crash. It's as if the robot ignores the 100 people shouting "It's addition!" and listens only to the one person whispering "It's multiplication!"
They noticed something strange: it didn't matter where the lie was. Whether the lie was the first example or the last, the robot still got tricked. This suggested the robot wasn't just "forgetting" the truth; it was actively processing the lie as if it were real.
2. The Brain Scan: Two Stages of Thinking
To understand what was happening, the researchers used special tools (like an X-ray for the robot's brain) to watch how the information flowed through the model's layers. They discovered that the robot's reasoning happens in two distinct phases, like a two-step dance:
Phase 1: The "Gossip" Phase (Early & Middle Layers)
In the beginning, the robot reads all the examples. It's like a group of people sitting in a circle, each sharing a story. The researchers found that in these early layers, the robot actually records both the truth and the lie. It knows "5 + 5 = 10" fits the real pattern, but it also notes that "5 + 5 = 12" is in the room. It hasn't decided which one to believe yet; it's just collecting all the gossip.
- The Culprit: They found specific parts of the brain called "Vulnerability Heads." These are like over-eager note-takers who pay too much attention to specific spots in the list. If a lie appears in a spot they are watching, they get very excited and start spreading that lie immediately.
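A vulnerability head's position bias can be pictured as a toy softmax attention over the four demonstration slots. The raw scores below are made up for illustration; the point is only that a strongly biased head dumps most of its attention mass on whichever slot it favors:

```python
import math

def softmax(scores):
    """Convert raw attention scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores one head assigns to the four demonstration
# slots. This head is biased toward slot index 2, which in our running
# example happens to hold the corrupted "5 + 5 = 12" demonstration.
scores = [1.0, 1.0, 4.0, 1.0]
weights = softmax(scores)

# Most of the attention mass lands on the lie's slot, so this head's
# output is dominated by the corrupted demonstration.
print(weights)
```

If the lie lands in a slot such a head is watching, it gets amplified regardless of what the other three slots say.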
Phase 2: The "Decision" Phase (Late Layers)
Later in the process, the robot has to make a final choice. It needs to pick one rule to follow. Here is where the magic (or the mistake) happens. The robot should look at the crowd and say, "Okay, three people said 'Addition,' one person said 'Multiplication,' so I'll go with Addition."
- The Failure: Instead, the robot often flips a coin. It builds up strong confidence in both the truth and the lie simultaneously. Then, in the final split second, it gets swayed by the lie.
- The Culprit: They found another group called "Susceptible Heads." These are like a weak-willed judge in the final round. Even though the evidence clearly points to the truth, these judges are easily intimidated by the single lie and switch their vote to the wrong side.
3. The Experiment: Removing the Bad Neurons
To prove they had found the real troublemakers, the researchers did a "surgery." They temporarily turned off (masked) the Vulnerability Heads and the Susceptible Heads.
The result was amazing. When they removed these specific parts of the brain:
- The robot stopped getting confused by the single lie.
- Its accuracy jumped by more than 10%.
- It became much better at ignoring the noise and sticking to the majority truth.
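The "surgery" described above is a standard head-ablation technique: zero out the chosen heads' contributions before they are combined with everything else. A minimal sketch, with hypothetical head outputs and head indices:

```python
import numpy as np

def mask_heads(head_outputs, heads_to_mask):
    """Zero the output of selected attention heads before they are
    summed into the model's residual stream (a simple ablation)."""
    out = head_outputs.copy()
    out[list(heads_to_mask)] = 0.0
    return out

# 8 hypothetical heads, each producing a 4-dimensional output vector.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(8, 4))

# Suppose heads 2 and 5 were identified as the troublemakers
# (a Vulnerability Head and a Susceptible Head, say).
ablated = mask_heads(head_outputs, {2, 5})
combined = ablated.sum(axis=0)  # the masked heads contribute nothing
```

Because the masked heads contribute exactly zero, the model's final decision is computed as if those heads did not exist, which is what lets the researchers attribute the failure to them.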
The Big Takeaway
This paper teaches us that AI models aren't just "dumb" when they fail; they are actually doing a complex, two-step process that goes wrong in specific ways.
- They collect conflicting info too easily (thanks to the Vulnerability Heads).
- They fail to filter out the noise when making a final decision (thanks to the Susceptible Heads).
The Analogy: Imagine a jury trying to decide a verdict.
- Vulnerability Heads are the jurors who get distracted by the loudest voice in the room, regardless of whether that voice is telling the truth.
- Susceptible Heads are the jurors who, after hearing all the evidence, suddenly change their mind because one person whispered a doubt, even if everyone else agreed.
By identifying and "silencing" these specific jurors, the researchers made the whole jury much smarter and more reliable. This helps us understand how to build AI that is less easily tricked by bad information in the future.