Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

This paper introduces RLSTA, a reinforcement learning framework that leverages single-turn capabilities as stable anchors to overcome "Contextual Inertia," thereby enabling large language models to effectively integrate new information and maintain robust reasoning in multi-turn interactions.

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

Published 2026-03-06

Here is an explanation of the paper in simple language, using a few creative analogies.

The Problem: The "Stubborn GPS"

Imagine you are driving with a GPS that is incredibly smart but also stubborn.

  1. Turn 1: You tell the GPS, "I need to get to the beach." It immediately says, "Okay, I'll route you through the highway. It will take 2 hours and cost $20 in gas."
  2. Turn 2: You realize you only have $5 in your pocket. You say, "Wait, I only have $5. Find a cheaper way."
  3. The Glitch: Instead of recalculating a cheap route, the GPS gets confused. It thinks, "But I already decided on the highway! I can't change my mind now." So, it suggests a crazy solution: "Okay, let's drive to the highway, but maybe you can find 3 friends to split the gas money with you so it fits your budget."

This is Contextual Inertia. The AI is so attached to its first idea (the "reasoning trace") that even when you give it new, critical information (the budget), it ignores the new facts and tries to force the old plan to work. It gets "Lost in Conversation."

The Diagnosis: Why does this happen?

The researchers found that this isn't just a random mistake. It's a specific behavior where the AI blindly follows its own previous steps, even if those steps were wrong.

  • The Evidence: They tested many models (like Llama, Qwen, and GPT-4) and found that in 70% to 90% of multi-turn failures, the AI wasn't failing because it was bad at math or logic in the final step. It was failing because it was carrying baggage from a wrong answer it gave in the first step.
  • The Indiscriminate Nature: The AI doesn't care if the previous conversation was helpful or harmful. It just keeps the momentum going, like a train that can't hit the brakes.

The Solution: RLSTA (The "Internal Compass")

The paper proposes a new training method called Reinforcement Learning with Single-Turn Anchors (RLSTA).

Here is how it works, using a Chef Analogy:

  1. The Scenario: A chef (the AI) is cooking a complex dish.

    • Single-Turn: If you give the chef the full recipe at once, they make a perfect dish.
    • Multi-Turn: If you give the ingredients one by one, the chef starts cooking immediately. Then you say, "Oh, I forgot to tell you, I'm allergic to nuts!" The chef, stuck in their "inertia," keeps adding nuts because they already started that step.
  2. The Fix (The Anchor):
    Instead of just telling the chef "Don't do that," the researchers teach the chef to check their own internal "Perfect Recipe" memory before finalizing the dish.

    • The AI is trained to remember: "If I had all the information at the start, I would have known the answer is X."
    • When the conversation gets messy, the AI uses that "Perfect Single-Turn Answer" as an Anchor (a stable reference point).
    • It asks itself: "Does my current answer match what I could produce if I had all the facts up front?" If not, it breaks the inertia and corrects itself.
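The anchor idea above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a trivial stand-in plays the role of the language model, and all function names are invented for this example.

```python
def toy_model(prompt: str) -> str:
    """Toy stand-in for an LLM: 'answers' by summing every integer in the prompt."""
    return str(sum(int(tok) for tok in prompt.split() if tok.lstrip("-").isdigit()))

def single_turn_anchor(model, turns: list) -> str:
    """The anchor: the answer the model gives when all the turns are
    merged into one single-turn prompt (all information up front)."""
    return model(" ".join(turns))

def matches_anchor(model, turns: list, multi_turn_answer: str) -> bool:
    """True if the model's multi-turn answer agrees with its own
    single-turn anchor; False signals inertia to be corrected."""
    return multi_turn_answer == single_turn_anchor(model, turns)
```

For example, `matches_anchor(toy_model, ["add 2", "and then 3"], "5")` agrees with the anchor, while an answer of `"23"` would flag a mismatch. The key design point is that the reference signal comes from the model itself, not from an external judge.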

How They Taught It (The Training Camp)

They didn't just tell the AI to "be better." They used a reinforcement learning game:

  • The Filter: They only trained the AI on problems it could solve perfectly when given all the info at once, but failed when the info was split across turns. This ensures the AI actually knows the answer; it just needs to stop being stubborn.
  • The Reward:
    • If the AI corrects its mistake and matches its "Single-Turn Anchor," it gets a high score.
    • If it stubbornly keeps the wrong path, it gets a low score.
    • Crucially, they did this without needing an external teacher to check every answer. The AI used its own "Single-Turn Self" as the teacher.
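The filter-and-reward recipe above can be sketched as follows. This is a hedged illustration, not the paper's code: precomputed answer tables stand in for querying a model, the 0/1 reward is a simplification, and the names (`Problem`, `filter_anchor_set`, `anchor_reward`) are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    prompt: str      # full single-turn prompt (all info at once)
    turns: tuple     # the same info split across multiple turns
    gold: str        # reference answer, used only for filtering

def filter_anchor_set(problems, single_answers, multi_answers):
    """Keep only problems the model solves single-turn but fails
    multi-turn: it 'knows' the answer, so its single-turn output
    is a trustworthy anchor for training."""
    return [p for p in problems
            if single_answers[p.prompt] == p.gold
            and multi_answers[p.turns] != p.gold]

def anchor_reward(candidate: str, anchor: str) -> float:
    """High reward when the multi-turn answer matches the model's own
    single-turn anchor; low when it sticks to the old (wrong) path.
    No external teacher is needed at this stage."""
    return 1.0 if candidate == anchor else 0.0
```

Note the division of labor: the gold answer is used once, to select training problems, while the per-step reward during training compares against the model's own single-turn answer, matching the "Single-Turn Self as the teacher" idea.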

The Results: Why It Matters

The paper shows that this method is a game-changer:

  1. It Works Everywhere: They trained the AI on math problems, but it got better at coding and summarizing text too. It learned a general skill: "Don't be stubborn when the story changes."
  2. It's Faster and Smarter: Unlike other methods that make the AI "give up" (abstain) when it's confused, RLSTA teaches the AI to fix its mistakes and keep going.
  3. It Keeps the Good Stuff: The AI didn't lose its ability to handle long conversations or solve single-turn problems. It just became better at updating its mind.

The Bottom Line

Current AI is like a brilliant student who is afraid to change their answer once they've written it down. RLSTA is the training that teaches the student: "It's okay to change your mind. In fact, if you have new information, you must update your answer to match what you know is true."

By anchoring the AI to its own best capabilities, they broke the "inertia" and made multi-turn conversations actually reliable.