Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Imagine you are a brilliant detective trying to solve a massive, 100-step mystery. You have a notebook (your brain's working memory), but it's only big enough to hold about 50 pages of notes.

As you investigate, you interview witnesses, find clues, and analyze evidence. Every time you do something, you write it down. But here's the problem: your notebook is filling up fast.

The Problem: The "Full Notebook" Dilemma

In the world of AI, Large Language Models (LLMs) are like these detectives. They are great at solving problems, but they have a strict limit on how much text they can "read" at once (their context window).

If a task takes 100 steps, the AI tries to keep everything in its notebook:

"I looked at the kitchen."
"I found a red key."
"I tried the door, but it was locked."
"I called the landlord..."

By step 50, the notebook is overflowing. The AI has to either:

Tear out pages: It deletes old notes to make room for new ones. But then, it forgets that the red key was in the kitchen!
Summarize everything: It writes, "I did a lot of stuff in the kitchen." But now it doesn't remember exactly what the key looked like or which door it was for.

This is the bottleneck. The AI gets lost in its own history because it can't hold the whole story in its head.

The Solution: Memex (The "Index Card" System)

The paper introduces Memex, a new way for AI to manage its memory. Instead of trying to cram everything into the notebook, Memex changes the game entirely.

Think of Memex as a Librarian with a magical filing cabinet.

Here is how it works:

The Compact Notebook (The Summary): The AI keeps a tiny, super-organized index in its notebook. Instead of writing out the whole story, it writes: "See File #42: The Red Key."
The Filing Cabinet (The External Store): The actual details (the exact words of the witness, the photo of the key, the code snippet) are saved in a large external database that sits outside the notebook.
The Magic Index: The AI doesn't need to remember the content of the file, just the label (the index).

How It Works in Real Life

Imagine you are cooking a complex recipe that takes 3 hours.

Old Way: You try to keep the entire recipe, the grocery list, and every step you've taken so far in your head. Eventually, you forget if you added salt or sugar because your brain is full.
Memex Way: You keep a small sticky note on the counter that says: "Step 1: Sauté onions (See Recipe Page 12)."
- When you need to know how to sauté onions, you don't try to remember it. You look at the note, go to Recipe Page 12 in your cookbook, read it, and then put the book back.
- Your "working memory" (the sticky note) stays small and clean.
- Your "long-term memory" (the cookbook) holds all the details perfectly.

The Secret Sauce: MemexRL (The Coach)

The paper doesn't just give the AI a filing cabinet; it teaches the AI how to use it using a method called MemexRL (Reinforcement Learning).

Think of MemexRL as a strict coach training the detective:

The Reward: "Good job! You solved the mystery!"
The Penalty: "You wasted time looking for the red key again because you didn't write down where you put it!" or "You filled your notebook with too much junk!"

Through trial and error, the AI learns:

What to summarize: "I don't need to write down the whole conversation with the landlord, just the phone number."
What to archive: "I need to save the exact code error message because I'll need to fix it later."
When to look it up: "I'm stuck on Step 50. I should check File #42 to see what the red key looked like."

Why This is a Big Deal

Before this, AI agents were like people trying to carry a library in their pockets. If the library got too big, they had to throw books away.

Memex is like giving them a library card. They can carry a tiny card with them, and whenever they need a book, they can instantly pull the exact page they need from the library, read it, and put it back.

The Result:

The AI can solve much longer, more complex problems (100+ steps) without getting confused.
It keeps a much smaller working context (because it doesn't carry the whole library around), so each step is cheaper to process.
It makes fewer mistakes because it can retrieve the exact evidence it needs, rather than guessing based on a fuzzy summary.

In short, Memex teaches AI to be organized, efficient, and able to remember the details of a long story without forgetting the plot.

Here is a detailed technical summary of the paper "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory."

1. Problem Statement

Large Language Model (LLM) agents face a fundamental bottleneck when executing long-horizon tasks (workflows spanning dozens to hundreds of steps).

Context Window Limitations: As agents accumulate observations, tool outputs, and reasoning traces, the working context grows without bound. Eventually it exceeds the model's finite context budget, forcing truncation or producing prompts too long to process.
Lossy Compression: Existing solutions typically rely on truncation or running summaries. These methods are "lossy," meaning they compress or discard raw evidence (e.g., specific tool outputs, code snippets, logs). Once discarded, this precise evidence cannot be recovered, causing the agent to fail when it needs to revisit specific details from the distant past.
Brittle Retrieval: Alternative approaches using external semantic retrieval often fail in long-horizon tool use because they rely on fuzzy similarity matching. This leads to ambiguity, redundant re-parsing of history, and a lack of precise, stable references to specific artifacts.

2. Methodology: Memex and MemexRL

The paper proposes Memex, a novel agent architecture that decouples the working context from the full-fidelity experience archive, managed via an Indexed Experience Memory mechanism.

A. Core Architecture: Indexed Experience Memory

Instead of keeping all history in the context window, Memex maintains two distinct components:

Compact In-Context State (Indexed Summary): A short, structured summary containing actionable progress (e.g., current plan, verified facts) and a set of stable indices (pointers).
External Experience Store ( $D$ ): A key-value database storing full-fidelity artifacts (raw tool outputs, logs, code, exact reasoning traces) mapped to the stable indices.
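
The two components above can be sketched as a small data structure plus a key-value store. This is a minimal illustration, not the paper's implementation: the names `IndexedSummary`, `ExperienceStore`, `compress_experience`, `read_experience`, and the `idx_A` label are all hypothetical, and a plain Python dict stands in for the external database.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedSummary:
    """Compact in-context state: progress notes plus an index map (hypothetical layout)."""
    progress: str
    index_map: dict[str, str] = field(default_factory=dict)  # index -> short description

    def render(self) -> str:
        lines = [f"Progress: {self.progress}"]
        lines += [f"{index}: {desc}" for index, desc in self.index_map.items()]
        return "\n".join(lines)


class ExperienceStore:
    """Stand-in for the external store D: stable index -> full-fidelity artifact."""

    def __init__(self) -> None:
        self._blocks: dict[str, str] = {}

    def write(self, index: str, content: str) -> None:
        self._blocks[index] = content

    def read(self, index: str) -> str:
        # Exact-key lookup, no similarity search, so the result is deterministic.
        return self._blocks[index]


def compress_experience(context, progress, to_archive, store):
    """Archive raw content under indices, then replace the context with the summary.

    to_archive maps index -> (description, raw content).
    """
    index_map = {}
    for index, (description, content) in to_archive.items():
        store.write(index, content)
        index_map[index] = description
    summary = IndexedSummary(progress, index_map)
    context.clear()
    context.append(summary.render())
    return summary


def read_experience(context, index, store):
    """Dereference an index and re-inject the exact archived content."""
    content = store.read(index)
    context.append(f"[{index}] {content}")
    return content


store = ExperienceStore()
context = ["obs: drawer 2 contains keychain 1, pencil 3", "obs: cabinet 1 is empty"]
summary = compress_experience(
    context,
    progress="Searched drawer 2 and cabinet 1; still looking for a mug.",
    to_archive={"idx_A": ("Raw observation from drawer 2", context[0])},
    store=store,
)
recalled = read_experience(context, "idx_A", store)
```

After compression the working context holds only the rendered summary; the raw observation comes back verbatim only when `idx_A` is explicitly dereferenced.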

Key Operations:

CompressExperience: The agent rewrites the growing working context into a compact IndexedSummary. It archives the discarded raw content into the external store under specific indices. The summary includes a map linking index IDs to descriptions (e.g., Index A: "Repo snapshot").
ReadExperience(index): When a specific piece of past evidence is needed, the agent explicitly dereferences an index to retrieve the exact archived content from $D$ and re-injects it into the working context.
Dual-Mode Archiving: The system supports both explicit authoring (model paraphrasing notes) and anchor-based extraction (model specifying start/mid/end anchors to archive exact spans of text verbatim), ensuring critical data like IDs or code is preserved without token bloat.
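
Anchor-based extraction can be illustrated with a short string-matching sketch. The function `extract_span` and its behavior are assumptions made for illustration: the summary above does not specify how the mid anchor is used, so here it is treated as an optional disambiguator that must appear between the start and end anchors.

```python
from typing import Optional


def extract_span(text: str, start: str, end: str,
                 mid: Optional[str] = None) -> Optional[str]:
    """Return the verbatim span from the start anchor through the end anchor.

    Returns None if any anchor is missing or the anchors appear out of order.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    cursor = begin + len(start)
    if mid is not None:
        mid_pos = text.find(mid, cursor)
        if mid_pos == -1:
            return None
        cursor = mid_pos + len(mid)
    end_pos = text.find(end, cursor)
    if end_pos == -1:
        return None
    return text[begin:end_pos + len(end)]


log = (
    "step 1 ok\n"
    "Traceback (most recent call last):\n"
    '  File "app.py", line 7\n'
    "KeyError: 'user_id'\n"
    "step 2 ok"
)
span = extract_span(log, start="Traceback", mid="app.py", end="KeyError: 'user_id'")
missing = extract_span(log, start="Traceback", end="ValueError")
```

The point of this design is that the model only emits a few short anchor strings, while the archived span is copied from the original text rather than regenerated, so IDs and code cannot be altered by paraphrasing.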

B. Learning Framework: MemexRL

Since determining what to compress, how to index, and when to retrieve is complex and depends on future needs, the authors introduce MemexRL, a Reinforcement Learning framework.

Action Space: Memory operations (CompressExperience, ReadExperience) are treated as first-class actions alongside environment tools.
Reward Shaping: The reward function ( $R$ ) combines task success with three penalties:
1. Context Overflow Penalty: Penalizes exceeding the token threshold, encouraging proactive compression.
2. Redundant Tool Call Penalty: Penalizes repeating identical tool calls, incentivizing the agent to use ReadExperience to recall past info rather than re-executing tools.
3. Format Error Penalty: Penalizes malformed tool calls.
Segmented Trajectory Processing: To handle long episodes with multiple compressions, trajectories are segmented at compression boundaries. Each segment is trained independently but shares the terminal reward of the full episode. This allows the model to receive credit (or blame) for early compression decisions that only impact the outcome many steps later.
Soft Triggering: Instead of hard system-enforced truncation, the agent receives a "Context Status" indicator (e.g., "working tokens approaching limit"). The agent learns to decide when to compress based on task semantics rather than arbitrary token counts.
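
The reward shaping and trajectory segmentation described above might look roughly like the following. This is a sketch under stated assumptions: the count-based penalty form, the weights of 0.1, the `Step` record, and the choice to exempt memory operations from the redundancy check are all hypothetical. Only the three penalty types, the 8,000-token threshold, and the shared terminal reward come from the summary.

```python
from dataclasses import dataclass

MEMORY_ACTIONS = {"CompressExperience", "ReadExperience"}


@dataclass
class Step:
    action: str            # tool name, including memory operations
    args: str
    context_tokens: int    # working-context size when the step was taken
    well_formed: bool = True


def episode_reward(steps, success, token_threshold=8000,
                   w_overflow=0.1, w_redundant=0.1, w_format=0.1):
    """Task success minus three penalties (weights are illustrative assumptions)."""
    overflow = sum(1 for s in steps if s.context_tokens > token_threshold)
    malformed = sum(1 for s in steps if not s.well_formed)
    seen, redundant = set(), 0
    for s in steps:
        # Assumption: only well-formed environment tool calls count as repeats.
        if not s.well_formed or s.action in MEMORY_ACTIONS:
            continue
        call = (s.action, s.args)
        if call in seen:
            redundant += 1
        seen.add(call)
    base = 1.0 if success else 0.0
    return base - w_overflow * overflow - w_redundant * redundant - w_format * malformed


def segment_trajectory(steps, reward):
    """Split at compression boundaries; every segment shares the terminal reward."""
    segments, current = [], []
    for s in steps:
        current.append(s)
        if s.action == "CompressExperience":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [(segment, reward) for segment in segments]


steps = [
    Step("open", "drawer 2", 3000),
    Step("open", "drawer 2", 5000),                    # identical repeat
    Step("CompressExperience", "", 7500),
    Step("ReadExperience", "idx_A", 2000),
    Step("take", "mug 1", 9000, well_formed=False),    # over threshold and malformed
]
reward = episode_reward(steps, success=True)
samples = segment_trajectory(steps, reward)
```

Because both segments carry the same terminal reward, the compression decision at the end of the first segment is credited or blamed for an outcome that is only observed at the end of the second.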

3. Theoretical Analysis

The authors provide a theoretical analysis showing that the Memex loop can achieve two desirable properties simultaneously:

Decision Quality Preservation: They define a "decision-sufficient indexed summary." If the summary points to a bounded number ( $B$ ) of relevant archived blocks, the agent can recover the optimal policy ( $\pi^*$ ) conditioned on the full history, without needing the full history in-context.
Bounded Computation: Even as the full message history grows without bound, the working context remains bounded ( $C_{work} \le \tau_\sigma + B \cdot L$ ), provided the summary size and the number of dereferenced blocks per step are bounded. This ensures the agent's per-step computational cost does not scale with the total task duration.
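
As a worked illustration of the bound, take hypothetical numbers that are not from the paper, and read $\tau_\sigma$ as the summary token budget, $B$ as the maximum number of blocks dereferenced per step, and $L$ as the maximum length of one archived block (this reading of the symbols is itself an assumption based on the conditions stated above). With $\tau_\sigma = 2000$, $B = 3$, and $L = 1500$:

$$C_{work} \le \tau_\sigma + B \cdot L = 2000 + 3 \cdot 1500 = 6500 \text{ tokens}$$

None of the three terms depends on the number of steps taken so far, so the same 6,500-token ceiling would hold at step 50 and at step 5,000.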

4. Experimental Results

The authors evaluated MemexRL on a modified, harder version of the ALFWorld benchmark (hiding valid actions and initial location IDs, forcing exploration and memory reliance).

Model: Qwen3-30B-A3B-Thinking (MoE).
Task Success Rate:
- Without RL: ~24.2%
- With MemexRL: 85.6% (a >3.5x improvement).
Context Efficiency:
- Peak Working Context: Reduced from 16,934 tokens to 9,634 tokens (approaching the 8,000 token penalty threshold).
- Compression Ratio: The agent successfully maintained a small active context while solving complex tasks.
Behavioral Shift:
- Compression Frequency: Decreased (from ~6.5 to ~3 calls per episode), indicating the agent learned to compress selectively rather than aggressively.
- Retrieval Frequency: Increased significantly (from ~1 to ~6-7 calls), showing the agent learned to rely on the external store for precise evidence rather than re-running tools or keeping everything in context.
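
The headline ratios can be checked directly from the figures reported above. The snippet below only does arithmetic on those numbers; the variable names are illustrative.

```python
# Figures as reported in the results above.
baseline_success, rl_success = 24.2, 85.6      # task success rate, percent
baseline_peak, rl_peak = 16_934, 9_634         # peak working context, tokens
penalty_threshold = 8_000                      # token penalty threshold

improvement = rl_success / baseline_success    # relative gain in success rate
reduction = 1 - rl_peak / baseline_peak        # fraction of peak context removed
overshoot = rl_peak - penalty_threshold        # tokens still above the threshold

print(f"success improvement: {improvement:.2f}x")
print(f"peak context reduction: {reduction:.1%}")
print(f"tokens above threshold: {overshoot}")
```

This gives roughly a 3.5x gain in success rate and a reduction in peak context of a little over 40%, with the trained agent's peak still 1,634 tokens above the penalty threshold.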

5. Key Contributions

Indexed Experience Memory Interface: A novel memory mechanism that pairs a compact in-context summary with a full-fidelity external archive, enabling precise, explicit dereferencing of past evidence.
MemexRL Framework: A reinforcement learning approach that jointly optimizes memory writing (summarization, indexing) and reading (retrieval) behaviors using reward shaping and segmented trajectory processing.
Theoretical Guarantee: A formal analysis demonstrating that bounded dereferencing from an indexed summary is sufficient to preserve optimal decision quality while keeping in-context computation bounded.
Empirical Validation: Demonstration that RL-trained indexed memory raises task success from ~24.2% to 85.6% on the modified ALFWorld benchmark while cutting peak working context from 16,934 to 9,634 tokens.

6. Significance

Memex addresses a critical scaling bottleneck for LLM agents. By moving away from lossy compression and fuzzy semantic search, it offers a deterministic, auditable, and precise way to manage long-term memory. The work suggests that learning how to summarize, index, and retrieve is a crucial, complementary scaling axis for building persistent, reliable, and efficient LLM agents capable of handling complex, multi-step workflows without being limited by context window sizes.