The Big Picture: The "Smart" Detective Who Stops Asking Questions
Imagine you hire a brilliant detective (an AI agent) to solve a mystery. The detective has a superpower: they can ask anyone in the city any question they want in order to find clues.
In the past, we trained these detectives using a simple rule: "If you solve the case, you get a gold star. If you fail, you get nothing." This is called Outcome-Based Reinforcement Learning.
The paper discovers a weird problem with this training method. Sometimes, the detective gets stuck in a loop called "Information Self-Locking."
Here is what happens:
- The detective asks a question.
- The answer is vague or useless.
- The detective fails to solve the case and gets no gold star.
- Because they didn't get a gold star, the detective thinks, "Maybe asking questions is a bad idea. I'll just guess based on what I already know."
- They stop asking questions. They stop gathering new info. They get "locked" in a low-information state where they can never learn to be better.
The paper argues that the detective isn't just "bad at guessing"; they have forgotten how to ask good questions and how to remember the answers they did get.
The Two Superpowers: "The Questioner" and "The Note-Taker"
To understand why this happens, the authors break the detective's brain into two parts:
- Action Selection (AS) - The Questioner: This is the part that decides what to ask. "Should I ask about the time of the crime, or the suspect's alibi?"
- Belief Tracking (BT) - The Note-Taker: This is the part that takes the answer and updates the detective's internal map of the truth. "Oh, the suspect was at the park? Okay, I need to cross the bank robbery off my list."
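The two-part loop can be sketched in a few lines of code. This is a toy I made up to illustrate the decomposition (the suspects, the greedy questioning rule, and the yes/no oracle are all invented, not from the paper): Action Selection picks what to ask, the world answers, and Belief Tracking revises the internal map.

```python
# Toy sketch of the Questioner (AS) + Note-Taker (BT) loop.
# Setup is illustrative only, not the paper's formulation.

def action_selection(belief):
    """Questioner: ask about the currently most likely suspect."""
    return max(belief, key=belief.get)

def belief_tracking(belief, suspect, answer):
    """Note-taker: rule the suspect in or out, then renormalize."""
    new = dict(belief)
    if answer == "no":
        new[suspect] = 0.0                      # cross them off the list
    else:
        new = {s: (1.0 if s == suspect else 0.0) for s in new}
    total = sum(new.values()) or 1.0
    return {s: p / total for s, p in new.items()}

belief = {"butler": 0.5, "gardener": 0.3, "chef": 0.2}
suspect = action_selection(belief)              # asks about the butler
belief = belief_tracking(belief, suspect, "no")
print(suspect, belief)                          # butler ruled out, others rescaled
```

If either function degrades, the other gets nothing useful to work with, which is exactly the trap described next.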
The Trap:
The paper found that these two parts get stuck in a toxic relationship:
- If the Note-Taker is bad at updating its notes, the Questioner thinks, "Why bother asking? The answers don't seem to change anything!" So, the Questioner stops asking.
- If the Questioner stops asking good questions, the Note-Taker has nothing new to learn from, so it gets rusty and stops getting better.
They lock each other in a cage. The AI stops exploring because it thinks exploration is useless, and it thinks exploration is useless because it stopped exploring.
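You can watch this lock form in a tiny simulation. This is my own toy (a single "ask probability" trained with a crude REINFORCE-style update, expected rewards instead of sampled outcomes, and invented numbers), not the paper's experiment. The point it illustrates: when the note-taker is broken, asking never improves the outcome, so outcome-only reward drives the ask probability toward zero.

```python
import random

# Toy sketch of information self-locking under outcome-only reward.
# All probabilities, costs, and the update rule are illustrative.

def episode_reward(asked, bt_quality):
    """Expected end-of-case reward: probability the case is solved,
    minus a small cost for spending a turn asking.
    bt_quality = how well the note-taker uses a clue (0..1)."""
    solve_prob = 0.2 + 0.6 * bt_quality if asked else 0.2
    return solve_prob - (0.1 if asked else 0.0)

def train(bt_quality, episodes=5000, lr=0.05, seed=0):
    """Crude policy-gradient update on a single 'ask' probability."""
    rng = random.Random(seed)
    p_ask, baseline = 0.5, 0.0
    for _ in range(episodes):
        asked = rng.random() < p_ask
        r = episode_reward(asked, bt_quality)
        baseline += 0.01 * (r - baseline)        # running mean reward
        advantage = r - baseline
        p_ask += lr * advantage * (1.0 if asked else -1.0)
        p_ask = min(0.99, max(0.01, p_ask))      # keep it a probability
    return p_ask

# Broken note-taker: asking never pays off, so the questioner
# collapses toward "never ask" -- the lock.
print(train(bt_quality=0.0))
# Good note-taker: asking is reinforced and survives.
print(train(bt_quality=1.0))
```

Note the circularity: once `p_ask` is pinned near zero, the agent never generates the experience that would reveal asking is worthwhile, so the lock is stable.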
The Solution: AREW (The "Directional Critique" Coach)
The authors propose a new training method called AREW. Instead of just waiting until the end of the case to give a gold star (or no star), they give the detective instant, tiny hints during the process.
Think of it like a coach standing next to the detective during the interrogation:
- For the Questioner (AS): If the detective asks a question that gets a juicy new clue, the coach whispers, "Good job! That was a great question!" If the detective asks a question that gets an "I don't know" or a repeat, the coach says, "That was a waste of time. Try something else."
- For the Note-Taker (BT): If the detective hears a clue and successfully updates their map, the coach says, "Great update!" If they ignore the clue, the coach says, "You missed that! Update your map!"
Why this works:
In the old method, the detective only learned they had failed at the very end. By the time the "fail" signal arrived, they had no way to tell which specific question or note-taking step caused the problem (the classic credit-assignment problem in reinforcement learning).
With AREW, the coach gives immediate feedback. Even if the detective doesn't solve the case, they learn: "Asking about the alibi was good, even though I didn't crack it this time." This breaks the "Self-Locking" cycle: the AI keeps asking questions and keeps updating its notes, even when the final result isn't perfect yet.
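One natural way to implement the coach's two whispers is with turn-level shaped rewards: pay the Questioner for information gained (how much the answer shrank the detective's uncertainty) and penalize the Note-Taker for ignoring a clue. This is a minimal sketch in that spirit; the reward definitions and the -0.5 penalty are my own assumptions, not the paper's exact formulation.

```python
import math

# Toy sketch of turn-level reward shaping in the spirit of AREW.
# Reward definitions and weights here are illustrative assumptions.

def entropy(belief):
    """Shannon entropy (bits) of a belief distribution over hypotheses."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

def question_reward(belief_before, belief_after):
    """Questioner (AS) signal: information gained this turn,
    i.e. how much the answer reduced uncertainty."""
    return entropy(belief_before) - entropy(belief_after)

def notetaking_reward(belief_before, belief_after, clue_received):
    """Note-taker (BT) signal: penalize ignoring a clue that arrived."""
    moved = any(abs(a - b) > 1e-9 for a, b in zip(belief_before, belief_after))
    if clue_received and not moved:
        return -0.5   # "You missed that! Update your map!"
    return 0.0

# Four suspects, initially equally likely (2 bits of uncertainty).
before = [0.25, 0.25, 0.25, 0.25]
# A useful answer rules out two suspects (1 bit remains).
after = [0.5, 0.5, 0.0, 0.0]
print(question_reward(before, after))                          # 1.0 bit gained
print(notetaking_reward(before, before, clue_received=True))   # -0.5
```

Because these rewards arrive every turn, a question can earn credit for shrinking the suspect list even in an episode that ultimately fails, which is what keeps exploration alive.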
The Results: Breaking the Lock
The researchers tested this on 7 different tasks, from figuring out what movies a user likes to diagnosing medical symptoms.
- Before (Old Way): The AI got stuck. It stopped asking questions, stopped learning, and its performance plateaued.
- After (AREW): The AI started asking better questions and remembering answers better.
- The Score: In some cases, the new method improved the AI's performance by 60%.
The Takeaway
The paper teaches us that for AI agents to be truly "active" (like a detective or a doctor), we can't just reward them for the final answer. We have to reward them for the process of learning.
If you want an AI to be smart, you have to teach it not just what the answer is, but how to ask the right questions and how to listen to the answers. Otherwise, it will get "self-locked" in a room where it refuses to open the door.