Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

The paper introduces EMPO2, a hybrid reinforcement learning framework that combines memory-augmented on-policy and off-policy optimization to overcome exploration bottlenecks in LLM agents, achieving significant performance gains on benchmark tasks and superior adaptability to out-of-distribution scenarios without parameter updates.

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang

Published 2026-03-09

Imagine you are teaching a very smart but inexperienced robot to solve complex puzzles, like fixing a broken circuit or buying the perfect shirt online. This robot is powered by a Large Language Model (LLM)—basically, a super-smart brain that has read almost everything on the internet.

The problem? This robot is too confident in what it already knows. If it gets stuck, it keeps trying the same wrong thing over and over because it's relying on its "pre-trained" knowledge rather than actually exploring the room. It's like a tourist who refuses to ask for directions because they are sure they know the way, even though they are lost.

The paper introduces a new method called EMPO2 (Exploratory Memory-Augmented On- and Off-Policy Optimization) to fix this. Think of EMPO2 as a super-coach that teaches the robot how to learn from its mistakes without needing a human to step in every time.

Here is how EMPO2 works, broken down with simple analogies:

1. The "Notebook" Strategy (Memory)

In the past, when the robot failed, it would just try again, forgetting why it failed. EMPO2 gives the robot a digital notebook (Memory).

  • How it works: Every time the robot tries something and fails, it writes a "tip" in its notebook. For example: "I tried to turn on the red light bulb in the hallway, but it wasn't there. I need to look in the workshop."
  • The Magic: Before the robot tries the next time, it opens its notebook, reads the tip, and avoids the same mistake. This helps it explore new areas instead of spinning its wheels.
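
The notebook idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the class name, the string tips, and the task keys are all invented for the example.

```python
class Notebook:
    """A tiny episodic memory: store tips from failed attempts,
    then replay them before the next try at the same task."""

    def __init__(self):
        self.tips = []

    def add_tip(self, task, tip):
        self.tips.append((task, tip))

    def recall(self, task):
        # Return every tip recorded for this task, oldest first.
        return [tip for t, tip in self.tips if t == task]


notebook = Notebook()
notebook.add_tip("find bulb", "Red bulb is not in the hallway; check the workshop.")

# Before the next attempt, the agent prepends its tips to the task prompt,
# steering exploration away from the failure it already made.
prompt = "Task: find bulb\n" + "\n".join(
    f"Tip: {tip}" for tip in notebook.recall("find bulb")
)
print(prompt)
```

In the real system the "tips" would be generated by the LLM reflecting on its own failed trajectory, and retrieval would be fuzzier than an exact task match, but the loop is the same: fail, write, recall, retry.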

2. The "Two-Mode" Training (Hybrid Optimization)

This is the secret sauce of EMPO2. The robot learns in two different ways simultaneously, like a student who both practices with a tutor and takes a solo test.

  • Mode A: The "Tutor" Session (On-Policy with Memory)
    The robot tries a task while reading its notebook. It learns how to use the tips to solve the problem. This is great for immediate learning, but the robot might get too dependent on the notebook.
  • Mode B: The "Solo" Session (Off-Policy without Memory)
    The robot tries the same task again, but this time, the notebook is closed. It has to solve the problem using only its own brain.
    • The Trick: The coach (the algorithm) compares the robot's "Tutor" performance with its "Solo" performance. If the robot did well with the help of the tip, the coach says, "Okay, you learned something! Now, I'm going to update your brain's wiring so you can do that without the notebook next time."

This is called Knowledge Distillation. It's like a student who studies with a cheat sheet, then takes a practice test without it. The goal is to eventually internalize the cheat sheet so the student doesn't need it anymore.
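The tutor-vs-solo comparison can be sketched as a simple gating rule: when the memory-guided rollout outperforms the memory-free one, its trajectory becomes an off-policy distillation target for the memory-free policy. This is a deliberately simplified sketch with made-up names; EMPO2's actual objective is a full policy-gradient formulation, not an if-statement.

```python
def hybrid_update(reward_with_memory, reward_without_memory, trajectory):
    """Decide how a memory-guided rollout should train the memory-free policy.

    If the notebook helped (higher reward with memory), treat that rollout
    as an off-policy distillation target, weighted by how much it helped.
    Otherwise, fall back to ordinary on-policy learning from the solo attempt.
    """
    advantage = reward_with_memory - reward_without_memory
    if advantage > 0:
        # Distillation: imitate the tip-assisted trajectory so the behavior
        # gets baked into the weights — no notebook needed next time.
        return {"mode": "distill", "target": trajectory, "weight": advantage}
    # The tips did not help; learn on-policy from the solo attempt instead.
    return {"mode": "on_policy", "target": None, "weight": 0.0}


update = hybrid_update(
    reward_with_memory=1.0,
    reward_without_memory=0.2,
    trajectory=["go to workshop", "replace red bulb"],
)
print(update["mode"])  # distill — the tip-assisted run beat the solo run
```

The key design choice mirrored here is that the memory-free policy is the one being trained: the notebook acts as scaffolding during data collection, and the weight update transfers what the scaffolding enabled into the model itself.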

3. The "Curiosity Bonus" (Intrinsic Rewards)

Sometimes, the environment doesn't give the robot a "Good job!" reward (like finding the light bulb). To stop the robot from giving up, EMPO2 gives it a Curiosity Bonus.

  • The Analogy: Imagine a video game that gives you points just for visiting a room you've never been to before. Even if there's no treasure in the room, the robot gets a little "curiosity reward" for being new to that spot. This forces the robot to wander into unknown areas instead of sticking to the safe, familiar path.
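
A standard way to implement such a bonus is count-based novelty: reward each state in inverse proportion to how often it has been visited. The paper's exact intrinsic-reward formula may differ; this sketch just shows the mechanism the analogy describes.

```python
import math
from collections import Counter

visit_counts = Counter()

def curiosity_bonus(state, scale=1.0):
    """Count-based intrinsic reward: pay more for rarely visited states.

    The bonus decays as 1/sqrt(n), so a brand-new state earns the full
    bonus and repeat visits earn progressively less.
    """
    visit_counts[state] += 1
    return scale / math.sqrt(visit_counts[state])

# First visit to a new room earns the full bonus; repeats pay less and less.
print(curiosity_bonus("workshop"))  # 1.0
print(curiosity_bonus("workshop"))  # ~0.707
```

Added to the environment's sparse task reward, this term keeps the gradient signal alive even when the agent has not yet found the "treasure," which is exactly what pushes it off the safe, familiar path.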

Why is this a big deal?

Previous methods were like a student who memorized the answers to a specific test but failed the moment the questions changed.

  • Old Way: If the robot learned to fix a red light bulb, it couldn't figure out how to fix a green one.
  • EMPO2 Way: Because it learned how to explore and how to learn from mistakes, it can adapt to new tasks (like fixing a green bulb or buying a shirt) very quickly, often with just a few tries and no need to re-train its entire brain.

The Result

In tests, EMPO2 was a massive success:

  • It solved ScienceWorld puzzles (like elementary science experiments) 128% better than the previous best method.
  • It improved WebShop (online shopping) performance by 11%.
  • Most importantly, when thrown into a completely new, unfamiliar environment, it adapted almost instantly by using its "notebook" to guide its exploration, proving it's becoming a truly general, adaptable agent.

In short: EMPO2 teaches AI not just what to do, but how to figure things out when it doesn't know the answer, turning a rigid, confident robot into a curious, adaptable learner.