The Big Problem: Everyone is Talking About "Memory," But They Mean Different Things
Imagine a group of architects trying to build houses. They all claim their houses have "storage."
- Architect A says, "My house has a backpack." (You can carry a few things with you while walking).
- Architect B says, "My house has a filing cabinet." (You can store documents for years).
- Architect C says, "My house has a library." (You can remember facts from books you read last year).
If you ask, "Which house has the best memory?" it's impossible to answer because they are talking about completely different things.
This is exactly what is happening in Reinforcement Learning (RL), the field where AI learns by trial and error. Researchers build AI agents and claim they have "memory." Sometimes they mean the AI can remember the last few seconds of a game. Other times, they mean the AI can remember a lesson learned in a completely different game yesterday.
Because there is no standard definition, people often get tricked. An AI might look like it has a super-memory, but it's actually just cheating by using a "shortcut" in the game rules.
The Solution: A New Dictionary for AI Memory
The authors of this paper decided to fix this confusion. They took concepts from human neuroscience (how our brains work) and created a strict, mathematical dictionary for AI memory.
They split memory into two main categories, just like humans have:
1. Short-Term vs. Long-Term Memory (The "Backpack" vs. The "Filing Cabinet")
- Short-Term Memory (STM): This is like a backpack. You can only carry a limited number of items with you right now. If the game gets too long, you drop the oldest items to make room for new ones.
- In AI terms: The AI can only look back at a fixed number of recent steps (its "context"). If the important clue happened 100 steps ago, but the backpack only holds 10 steps, the AI forgets it.
- Long-Term Memory (LTM): This is like a filing cabinet or a diary. You can store information for a long time and pull it out whenever you need it, even if it was a long time ago.
- In AI terms: The AI has a mechanism (like a special neural network) that lets it remember things from way back in the past, even if they don't fit in its immediate "backpack."
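The backpack-versus-filing-cabinet contrast can be sketched in a few lines of code. This is a hypothetical illustration, not the paper's implementation: a fixed-size window simply drops old observations, while a recurrent-style running state keeps a trace of everything it has seen (a real agent would use a learned update such as an RNN cell; here a decayed running sum stands in for it).

```python
from collections import deque

class ContextWindow:
    """Short-term 'backpack': keeps only the last `size` observations."""
    def __init__(self, size):
        self.buffer = deque(maxlen=size)  # oldest items fall out automatically

    def observe(self, obs):
        self.buffer.append(obs)

    def recall(self):
        return list(self.buffer)

class RecurrentSummary:
    """Long-term 'filing cabinet': folds every observation into one running state."""
    def __init__(self):
        self.state = 0.0

    def observe(self, obs, decay=0.99):
        # Stand-in for a learned recurrent update: a decayed running sum.
        self.state = decay * self.state + obs

    def recall(self):
        return self.state

window = ContextWindow(size=3)
summary = RecurrentSummary()
for obs in [5.0, 0.0, 0.0, 0.0, 0.0]:  # the "clue" (5.0) arrives first
    window.observe(obs)
    summary.observe(obs)

print(window.recall())   # the clue has already fallen out of the backpack
print(summary.recall())  # a trace of the clue survives in the running state
```

After five steps, the window contains only zeros, while the running state still carries a faded copy of the clue. That difference is exactly what separates the two memory types below.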
2. Declarative vs. Procedural Memory (The "Fact" vs. The "Skill")
- Declarative Memory: This is remembering facts. "The key was under the red mat."
- In AI terms: The agent remembers specific events from this specific game to make a decision right now.
- Procedural Memory: This is remembering skills. "How to ride a bike."
- In AI terms: The agent learns a general skill in one game and uses it to solve a different game later. (This is often called "Meta-RL" in the paper).
The "Correlation Horizon": The Ruler for Memory
The paper introduces a clever tool called the Correlation Horizon. Think of this as a ruler that measures the distance between a "clue" and the "action."
- The Clue: You see a sign that says "Turn Left."
- The Action: You actually turn left.
- The Horizon: How many steps passed between seeing the sign and turning?
If the sign was 5 steps ago, and your AI's "backpack" (context) only holds 3 steps, you have a Long-Term Memory problem. If the backpack holds 10 steps, it's just a Short-Term Memory problem.
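The comparison above is simple enough to write down as a function. This is an illustrative sketch using made-up names, not the paper's formal notation: compare the clue-to-action distance (the "ruler") with the agent's context length (the "backpack").

```python
def classify_memory_task(correlation_horizon: int, context_length: int) -> str:
    """Classify a task by comparing the clue-to-action distance with the
    agent's context length. Names are illustrative, not the paper's notation."""
    if correlation_horizon <= context_length:
        return "short-term memory task"  # the clue still fits in the backpack
    return "long-term memory task"       # the clue has fallen out of context

# The sign example: the clue was 5 steps ago.
print(classify_memory_task(5, 3))   # backpack holds 3  -> long-term memory task
print(classify_memory_task(5, 10))  # backpack holds 10 -> short-term memory task
```

Note that the same task can be a short-term problem for one agent and a long-term problem for another: the classification depends on the agent's context, not just the environment.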
The authors realized that many researchers were testing AI memory with the wrong ruler. They would test an AI on a game where the clues were always close together. The AI would succeed, and the researchers would say, "Wow, this AI has great memory!" But in reality, the AI was just using its short-term backpack. It never actually needed to open its filing cabinet.
The Experiment: Catching the Cheaters
To prove their point, the authors ran experiments with different types of AI:
- Transformers (like the ones in chatbots): These are great at looking at a long list of recent observations (Short-Term Memory).
- RNNs (Recurrent Neural Networks): These are designed to keep a running summary of the past (Long-Term Memory).
The Setup:
They put these AIs in a maze (called the "Passive T-Maze").
- Scenario A: The clue is 10 steps away. The AI's backpack holds 20 steps.
- Result: Both AIs succeed. They just use their backpacks.
- Scenario B: The clue is 500 steps away. The AI's backpack only holds 20 steps.
- Result: The "Backpack" AI (Transformer) fails miserably. It forgot the clue. The "Filing Cabinet" AI (RNN) succeeds because it stored the clue in its long-term memory.
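The two scenarios can be reproduced with a toy simulation. This is a minimal, hypothetical sketch of a Passive T-Maze, not the authors' benchmark code: the agent sees a cue once at the start, walks a corridor of blank steps, and must turn the cued way at the end. The "backpack" agent below keeps only its last few observations, so its success depends entirely on whether the cue is still in context.

```python
import random
from collections import deque

class PassiveTMaze:
    """Cue at step 0, then `horizon` blank corridor steps, then a turn."""
    def __init__(self, horizon):
        self.horizon = horizon

    def run(self, agent):
        cue = random.choice(["left", "right"])
        agent.observe(cue)                 # the clue, seen once at the start
        for _ in range(self.horizon):
            agent.observe(None)            # blank corridor steps
        return 1.0 if agent.act() == cue else 0.0

class WindowAgent:
    """Keeps only the last `context` observations (the 'backpack')."""
    def __init__(self, context):
        self.memory = deque(maxlen=context)

    def observe(self, obs):
        self.memory.append(obs)

    def act(self):
        # Turn toward the cue if it is still in context, else guess.
        for obs in self.memory:
            if obs is not None:
                return obs
        return random.choice(["left", "right"])

# Scenario A: clue 10 steps back, backpack holds 20 -> cue always in context.
# Scenario B: clue 500 steps back, backpack holds 20 -> reduced to guessing.
short = sum(PassiveTMaze(10).run(WindowAgent(20)) for _ in range(200)) / 200
long_ = sum(PassiveTMaze(500).run(WindowAgent(20)) for _ in range(200)) / 200
print(short)  # 1.0 (perfect: the cue never leaves the window)
print(long_)  # roughly 0.5 (chance level: the cue is long gone)
```

A perfect score in Scenario A therefore proves nothing about long-term memory, which is precisely the point the authors make next.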
The Discovery:
Many previous studies claimed Transformers had "long-term memory" because they tested them on easy mazes where the clues were close by. The authors showed that if you use their new "ruler" (the Correlation Horizon) to force the clues to be far away, the Transformers fail. They don't actually have long-term memory; they just have a really good short-term memory.
Why Does This Matter?
This paper is like a quality control inspector for AI.
Before, if you bought a "Memory-Enhanced Robot," you might not know if it could actually remember things from last week or if it just had a really good short-term focus. This paper gives us a standardized test to say:
- "This robot is great at remembering the last 10 seconds (Short-Term)."
- "This robot is great at remembering lessons from last year (Long-Term)."
By defining these terms clearly, researchers can stop building robots that are "fake" experts and start building ones that truly understand how to remember, learn, and adapt to the world around them.
Summary in One Sentence
The paper says we need to stop calling everything "memory" and start measuring exactly how far back an AI can look, using a strict ruler, so we know if it's actually smart or just lucky.