The Big Problem: The "Blind Retry" Trap
Imagine you are teaching a robot to solve a complex maze.
The Old Way (Outcome-Driven RL): You let the robot run the maze 100 times. It fails 99 times and succeeds only once. You tell the robot, "Great job on that one success! Do that again."
- The Flaw: The robot gets really good at repeating that one specific lucky path. But if the maze changes slightly, or if it makes a tiny mistake early on, it gets stuck. It doesn't know how to fix a mistake; it just knows how to hope for a lucky run. This is called "Distribution Sharpening." It becomes a master of one narrow path but fails to learn how to navigate the whole world.
The New Way (LEAFE): Instead of just waiting for a win, the robot is taught to stop, think, and rewind when it hits a wall.
- It realizes, "Oh, I turned left here, but the wall is there. That was a bad move."
- It rewinds the tape to the moment before the mistake.
- It tries a different path (turns right instead).
- It learns from that specific correction.
The Solution: LEAFE (Learning Feedback-Grounded Agency)
The authors propose a two-step training process called LEAFE. Think of it as turning a "lucky guesser" into a "strategic problem solver."
Step 1: The "Time-Travel" Practice (Exploration)
Imagine the robot is playing a video game.
- The Mistake: The robot walks into a pit.
- The Reflection: Instead of just restarting the whole game, the robot pauses. It says, "Wait, I shouldn't have jumped there. The ground looked shaky."
- The Rollback: The game rewinds to the exact moment before the jump.
- The Branch: The robot tries a different action (e.g., "I'll walk around the pit instead").
- The Result: It creates a "tree" of possibilities. It explores many different ways to fix mistakes, not just one lucky path.
This step generates a massive library of "What went wrong, and how I fixed it" stories.
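The loop above can be sketched in code. This is a toy stand-in, not the paper's implementation: the real LEAFE agent is a language model acting in a real environment, and "reflection" is the model reasoning about feedback. Here the maze, actions, and failure signal are all invented for illustration.

```python
import random

# Toy maze: a sequence of junctions, with exactly one correct turn at each.
# A wrong turn hits a wall -> reflect, rollback, branch to the other action,
# and record a "what went wrong and how I fixed it" story.
MAZE = ["left", "right", "right", "left"]  # the correct turn at each junction
ACTIONS = ["left", "right"]

def explore(rng):
    trajectory = []   # the (junction, action) steps that ultimately worked
    corrections = []  # the recorded mistake-and-fix stories
    junction = 0
    while junction < len(MAZE):
        action = rng.choice(ACTIONS)
        if action == MAZE[junction]:
            trajectory.append((junction, action))
            junction += 1  # move forward through the maze
        else:
            # Reflection: this move hit a wall. Rollback to the state just
            # before it, then branch: with two actions, try the other one.
            fix = MAZE[junction]
            corrections.append({"junction": junction, "bad": action, "good": fix})
            trajectory.append((junction, fix))
            junction += 1
    return trajectory, corrections

rng = random.Random(0)
traj, fixes = explore(rng)
```

Each run yields one successful path plus a batch of correction stories; repeating it with different seeds grows the "tree" of explored branches the text describes.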
Step 2: The "Internalization" (Distillation)
Now, the robot has all these "fix-it" stories, but it can't carry a giant notebook of instructions into the real game. It needs to learn the skill of fixing things.
- The researchers take those "fix-it" stories and fine-tune the robot's brain (the AI model) on them, so that correcting a mistake becomes second nature rather than something looked up in a notebook.
- They teach the robot: "When you see a situation like this, the right move is that," without needing the notebook open.
- The Goal: The robot learns to self-correct instantly. It doesn't need to stop and think "I should rewind" every time; the ability to recover is now built into its DNA.
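The distillation step can be sketched as a data transformation. The format below is an assumption for illustration; the paper fine-tunes an actual language model on its correction trajectories. The idea is simply that each "fix-it" story becomes a supervised training pair: the situation (including the failed attempt) is the input, and the corrected action is the target.

```python
# Hypothetical correction records, in the spirit of the stories above.
corrections = [
    {"context": "junction 1, wall ahead on the left", "bad": "left", "good": "right"},
    {"context": "shaky ground before the pit", "bad": "jump", "good": "walk around"},
]

def to_training_pairs(corrections):
    """Turn mistake-and-fix stories into (input, target) fine-tuning pairs."""
    pairs = []
    for c in corrections:
        prompt = (
            f"Situation: {c['context']}. "
            f"Previous attempt '{c['bad']}' failed. Next action:"
        )
        pairs.append({"input": prompt, "target": c["good"]})
    return pairs
```

After fine-tuning on many such pairs, the recovery behavior lives in the model's weights: no notebook needed at test time.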
Why This Matters: The "Pass@K" Analogy
The paper measures success using a metric called Pass@K.
- Pass@1: How often does the robot solve the problem on the first try?
- Pass@128: If you let the robot try 128 times (or use a lot of computing power), how often does it eventually solve it?
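For concreteness, Pass@k is usually computed with the standard unbiased estimator from code-generation benchmarks (this is the common convention, not something specific to this paper): sample n attempts, count the c successes, and estimate the chance that at least one of k randomly drawn attempts succeeds.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: n sampled attempts, c of them successful.
    Probability that at least one of k attempts (drawn without replacement)
    is a success: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved on only 2 of 128 sampled attempts:
print(pass_at_k(128, 2, 1))    # pass@1 = 2/128 = 0.015625 -- low
print(pass_at_k(128, 2, 128))  # pass@128 = 1.0 -- solved given enough tries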
The Discovery:
- Old Methods (like GRPO): They are great at improving Pass@1. They make the robot very confident and good at the "standard" way of doing things. But if the robot hits a snag, it gets stuck. Its Pass@128 doesn't improve much because it hasn't learned how to explore new solutions.
- LEAFE: It might not always be the absolute fastest on the first try, but it is much better at recovering. When you give it 128 tries, it solves way more problems because it knows how to pivot when things go wrong.
The Real-World Impact
Think of a Junior Programmer vs. a Senior Engineer:
- The Junior (Old Method): Writes code that works perfectly if everything goes right. But if a bug appears, they panic and restart the whole project.
- The Senior (LEAFE): Writes code, sees a bug, immediately thinks, "Ah, that function is wrong," fixes just that part, and keeps going. They have internalized the agency to handle failure.
Summary in One Sentence
LEAFE teaches AI agents not just to hope for a lucky win, but to learn how to spot their own mistakes, rewind, and try a better path, making them much smarter and more reliable in complex, real-world situations.