Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

This paper introduces RLSTA, a reinforcement learning framework that leverages single-turn capabilities as stable anchors to overcome "Contextual Inertia," thereby enabling large language models to effectively integrate new information and maintain robust reasoning in multi-turn interactions.

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

Published 2026-03-06

Here is an explanation of the paper in simple language, using a few creative analogies.

The Problem: The "Stubborn GPS"

Imagine you are driving with a GPS that is incredibly smart but also stubborn.

  1. Turn 1: You tell the GPS, "I need to get to the beach." It immediately says, "Okay, I'll route you through the highway. It will take 2 hours and cost $20 in gas."
  2. Turn 2: You realize you only have $5 in your pocket. You say, "Wait, I only have $5. Find a cheaper way."
  3. The Glitch: Instead of recalculating a cheap route, the GPS gets confused. It thinks, "But I already decided on the highway! I can't change my mind now." So, it suggests a crazy solution: "Okay, let's drive to the highway, but maybe you can find 3 friends to split the gas money with you so it fits your budget."

This is Contextual Inertia. The AI is so attached to its first idea (the "reasoning trace") that even when you give it new, critical information (the budget), it ignores the new facts and tries to force the old plan to work. It gets "Lost in Conversation."

The Diagnosis: Why does this happen?

The researchers found that this isn't just a random mistake. It's a specific behavior where the AI blindly follows its own previous steps, even if those steps were wrong.

  • The Evidence: They tested many models (like Llama, Qwen, and GPT-4) and found that in 70% to 90% of multi-turn failures, the AI wasn't failing because it was bad at math or logic in the final step. It was failing because it was carrying baggage from a wrong answer it gave in the first step.
  • The Indiscriminate Nature: The AI doesn't care if the previous conversation was helpful or harmful. It just keeps the momentum going, like a train that can't hit the brakes.

The Solution: RLSTA (The "Internal Compass")

The paper proposes a new training method called Reinforcement Learning with Single-Turn Anchors (RLSTA).

Here is how it works, using a Chef Analogy:

  1. The Scenario: A chef (the AI) is cooking a complex dish.

    • Single-Turn: If you give the chef the full recipe at once, they make a perfect dish.
    • Multi-Turn: If you give the ingredients one by one, the chef starts cooking immediately. Then you say, "Oh, I forgot to tell you, I'm allergic to nuts!" The chef, stuck in their "inertia," keeps adding nuts because they already started that step.
  2. The Fix (The Anchor):
    Instead of just telling the chef "Don't do that," the researchers teach the chef to check their own internal "Perfect Recipe" memory before finalizing the dish.

    • The AI is trained to remember: "If I had all the information at the start, I would have known the answer is X."
    • When the conversation gets messy, the AI uses that "Perfect Single-Turn Answer" as an Anchor (a stable reference point).
    • It asks itself: "Does my current answer match what I could produce if I had all the facts up front?" If not, it breaks the inertia and corrects itself.
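The anchor idea above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a trivial stand-in plays the role of the language model, and all function names are invented for this example.

```python
def toy_model(prompt: str) -> str:
    """Toy stand-in for an LLM: 'answers' by summing every integer in the prompt."""
    return str(sum(int(tok) for tok in prompt.split() if tok.lstrip("-").isdigit()))

def single_turn_anchor(model, turns: list) -> str:
    """The anchor: the answer the model gives when all the turns are
    merged into one single-turn prompt (all information up front)."""
    return model(" ".join(turns))

def matches_anchor(model, turns: list, multi_turn_answer: str) -> bool:
    """True if the model's multi-turn answer agrees with its own
    single-turn anchor; False signals inertia to be corrected."""
    return multi_turn_answer == single_turn_anchor(model, turns)
```

For example, `matches_anchor(toy_model, ["add 2", "and then 3"], "5")` agrees with the anchor, while an answer of `"23"` would flag a mismatch. The key design point is that the reference signal comes from the model itself, not from an external judge.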

How They Taught It (The Training Camp)

They didn't just tell the AI to "be better." They used a reinforcement learning game:

  • The Filter: They only trained the AI on problems it could solve perfectly when given all the info at once, but failed when the info was split across turns. This ensures the AI actually knows the answer; it just needs to stop being stubborn.
  • The Reward:
    • If the AI corrects its mistake and matches its "Single-Turn Anchor," it gets a high score.
    • If it stubbornly keeps the wrong path, it gets a low score.
    • Crucially, they did this without needing an external teacher to check every answer. The AI used its own "Single-Turn Self" as the teacher.
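The filter-and-reward recipe above can be sketched as follows. This is a hedged illustration, not the paper's code: precomputed answer tables stand in for querying a model, the 0/1 reward is a simplification, and the names (`Problem`, `filter_anchor_set`, `anchor_reward`) are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    prompt: str      # full single-turn prompt (all info at once)
    turns: tuple     # the same info split across multiple turns
    gold: str        # reference answer, used only for filtering

def filter_anchor_set(problems, single_answers, multi_answers):
    """Keep only problems the model solves single-turn but fails
    multi-turn: it 'knows' the answer, so its single-turn output
    is a trustworthy anchor for training."""
    return [p for p in problems
            if single_answers[p.prompt] == p.gold
            and multi_answers[p.turns] != p.gold]

def anchor_reward(candidate: str, anchor: str) -> float:
    """High reward when the multi-turn answer matches the model's own
    single-turn anchor; low when it sticks to the old (wrong) path.
    No external teacher is needed at this stage."""
    return 1.0 if candidate == anchor else 0.0
```

Note the division of labor: the gold answer is used once, to select training problems, while the per-step reward during training compares against the model's own single-turn answer, matching the "Single-Turn Self as the teacher" idea.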

The Results: Why It Matters

The paper shows that this method is a game-changer:

  1. It Works Everywhere: They trained the AI on math problems, but it got better at coding and summarizing text too. It learned a general skill: "Don't be stubborn when the story changes."
  2. It's Faster and Smarter: Unlike other methods that make the AI "give up" (abstain) when it's confused, RLSTA teaches the AI to fix its mistakes and keep going.
  3. It Keeps the Good Stuff: The AI didn't lose its ability to handle long conversations or solve single-turn problems. It just became better at updating its mind.

The Bottom Line

Current AI is like a brilliant student who is afraid to change their answer once they've written it down. RLSTA is the training that teaches the student: "It's okay to change your mind. In fact, if you have new information, you must update your answer to match what you know is true."

By anchoring the AI to its own best capabilities, they broke the "inertia" and made multi-turn conversations actually reliable.