AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

This paper introduces AMA-Bench, a benchmark for evaluating long-horizon memory in agentic applications, built from both real-world and synthetic machine-generated trajectories. It also proposes AMA-Agent, a causality-driven memory system that addresses the limitations of current similarity-based retrieval methods and significantly outperforms existing baselines.

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

Published 2026-03-05

🧠 The Big Idea: The "Forgetful Super-Intelligence"

Imagine you hire a brilliant, super-smart personal assistant (an AI Agent) to help you with a massive, multi-week project. This assistant is incredibly smart at solving problems right now, but they have a terrible memory.

If you ask them, "What did we decide three days ago?" or "What was the state of the database before we clicked that button?", they might guess, hallucinate, or just say, "I don't know."

This paper argues that current AI agents are failing not because they aren't smart enough, but because their "filing cabinets" (memory systems) are broken. They are trying to remember complex, machine-generated tasks using tools designed for chatting with humans.

🚫 The Problem: Chatting vs. Doing

The authors point out a major mismatch:

  • Old Benchmarks (The "Chat" Test): Most tests for AI memory are like asking a friend, "Do you remember what we talked about yesterday?" These tests focus on casual conversation, where people repeat themselves, use slang, and talk about feelings.
  • Real Agent Work (The "Job" Test): Real-world AI agents (like ones that browse the web, write code, or play games) don't chat. They generate machine logs. These are strict, technical records like {"status": "error", "code": 404} or {"action": "click", "id": "button_5"}.
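The mismatch is easy to see side by side. Here is a minimal sketch in Python, using invented example records (none of these come from the paper), of a "chat" memory versus a machine-generated agent log, and the kind of precise question only the latter can support:

```python
# Hypothetical illustration: the same incident as chat text vs. a machine log.
# All records below are invented examples, not data from the paper.

chat_memory = [
    "User: hey, that page crashed again yesterday, so annoying!",
    "Assistant: Oh no! Was it the same 404 thing as before?",
]

# Agent trajectories are strict, structured records of actions and states.
agent_log = [
    {"step": 1, "action": "click", "id": "button_5"},
    {"step": 2, "status": "error", "code": 404},
]

# A memory system for agents must answer precise questions, e.g.:
# "Which action immediately preceded the 404 error?"
error_step = next(e["step"] for e in agent_log if e.get("code") == 404)
preceding = next(e for e in agent_log if e["step"] == error_step - 1)
print(preceding["action"], preceding["id"])  # → click button_5
```

The chat version can only be matched fuzzily; the log version can be queried exactly.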

The Analogy:
Imagine trying to find a specific screw in a garage.

  • The Old Way (Chat Memory): You ask the AI to remember the story of how you got the screw. "Well, I was thinking about the garden, then I saw a red truck..." The AI gets lost in the story and forgets the screw.
  • The Real Way (Agent Memory): You need the AI to remember the exact coordinates of the screw in a digital blueprint. The story is irrelevant; the data is everything.

Current AI memory systems are like a librarian who only knows how to organize books by the plot summary, but the AI agent needs a librarian who can organize blueprints and code.

🛠️ The Solution: AMA-Bench (The New Test)

To fix this, the researchers built AMA-Bench (Agent Memory with Any length). Think of this as a new, much harder driving test for AI.

Instead of asking the AI to remember a conversation, AMA-Bench gives it a long, complex mission (like "Navigate a robot through a maze" or "Fix a bug in a software project") and asks specific questions about the technical details of the journey.

The Test has two parts:

  1. Real-World Data: They took actual logs from real AI agents doing real jobs (like browsing websites or playing games) and asked experts to write questions about them.
  2. Synthetic Data: They built a "simulator" where they can create infinite scenarios of any length to test how well the AI remembers things after 100,000 steps.
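To make the synthetic half concrete, here is a toy sketch of the idea (an invented example, not the paper's actual simulator): generate a trajectory of any length where the correct answer to a memory question is known by construction, so grading is automatic even at 100,000 steps.

```python
import random

# Invented toy simulator: the "counter" is hidden state that a memory system
# would have to track across the whole trajectory. Because we build the log
# ourselves, the ground-truth answer is known by construction.

def make_trajectory(n_steps, seed=0):
    rng = random.Random(seed)  # seeded so trajectories are reproducible
    counter = 0
    log = []
    for step in range(n_steps):
        action = rng.choice(["increment", "noop"])
        if action == "increment":
            counter += 1
        log.append({"step": step, "action": action, "counter": counter})
    question = f"What was the counter after step {n_steps - 1}?"
    return log, question, counter  # counter is the verifiable answer

log, question, answer = make_trajectory(100_000)
assert log[-1]["counter"] == answer  # answer is checkable by construction
```

This is what "any length" buys you: the test scales to arbitrarily long horizons without needing a human to write or check answers.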

The Result: When they ran this new test, even the smartest AI models (like GPT-5) scored poorly. They realized that simply making the AI "smarter" or giving it a bigger "brain" (more context) wasn't working. The memory system itself was the bottleneck.

🏆 The New Champion: AMA-Agent

The authors didn't just find the problem; they built a solution called AMA-Agent. They fixed the memory system using two clever tricks:

1. The Causality Graph (The "Cause-and-Effect Map")

  • The Old Way: Most AI memory works like a magnet. It grabs pieces of text that look similar to the question. If you ask "What happened after the crash?", it might grab a paragraph that mentions "crash" but talks about a car accident, not the software crash.
  • The New Way: AMA-Agent builds a flowchart of cause and effect. It doesn't just store text; it stores the logic.
    • Analogy: Instead of a pile of loose papers, imagine a family tree. If you ask, "Who is the great-grandfather?", the system doesn't guess based on similar names; it traces the exact line of descent. It knows that Action A caused State B, which led to Action C. This preserves the "truth" of what happened.
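The family-tree idea can be sketched in a few lines of Python. The events and their names below are invented for illustration; the point is that the answer comes from following explicit "caused_by" edges, not from matching similar-looking text:

```python
# Minimal sketch of a causality graph: events as nodes, explicit cause edges.
# Event names and descriptions are invented for illustration.

events = {
    "A": {"desc": "agent clicks 'deploy'", "caused_by": None},
    "B": {"desc": "server returns 500", "caused_by": "A"},
    "C": {"desc": "agent rolls back release", "caused_by": "B"},
}

def root_cause(event_id):
    """Walk the cause-and-effect chain back to its origin."""
    while events[event_id]["caused_by"] is not None:
        event_id = events[event_id]["caused_by"]
    return event_id

print(root_cause("C"))  # → A
```

Asked "what ultimately caused the rollback?", the system traces C → B → A rather than guessing from keywords.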

2. Tool-Augmented Retrieval (The "Swiss Army Knife")

  • The Old Way: The AI tries to find the answer purely by "feel," retrieving whatever stored text looks most similar to the question.
  • The New Way: If the AI isn't sure, it stops guessing and uses tools.
    • It can run a keyword search (like Ctrl+F in a document).
    • It can run a code script to count things or check specific numbers.
    • It can traverse the flowchart (the Causality Graph) to find the exact step where something changed.
    • Analogy: Instead of trying to remember the phone number of a friend from 10 years ago, the AI opens its phone book, types the name, and gets the exact number. It doesn't rely on fuzzy memory; it relies on tools.
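The tool idea above can be sketched as a couple of exact operations over a log. This is a hedged illustration, not the paper's implementation, and the log entries and tool names are invented:

```python
# Hedged sketch of tool-augmented retrieval: instead of answering from fuzzy
# similarity, the agent calls an exact tool. The log below is invented.

log = [
    {"step": 1, "action": "click", "id": "button_5"},
    {"step": 2, "status": "error", "code": 404},
    {"step": 3, "action": "retry", "id": "button_5"},
]

def keyword_search(term):
    """Ctrl+F over the log: return every entry mentioning the term."""
    return [e for e in log if term in str(e)]

def count(predicate):
    """Run a tiny 'script' instead of estimating a number from memory."""
    return sum(1 for e in log if predicate(e))

print(len(keyword_search("button_5")))        # exact hits, not guesses
print(count(lambda e: e.get("code") == 404))  # exact count of 404 errors
```

Each tool returns a provably correct answer over the stored log, which is the whole point: look it up, don't recall it.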

📉 The Results

When they tested this new system (AMA-Agent) against the old ones:

  • Old Systems: Scored around 45% accuracy. They were getting lost in the details and forgetting the logic.
  • AMA-Agent: Scored 57.22%.
  • The Gap: This might not sound like a huge jump, but in AI research, beating the best existing system by roughly 11 percentage points is a massive victory. It suggests that how you organize memory matters more than how big the AI's brain is.

💡 The Takeaway

This paper teaches us that to build AI agents that can truly work for us in the real world (fixing code, managing databases, navigating robots), we can't just make them "chat" better. We need to give them structured, logical memory that understands cause and effect, not just vague similarities.

In short: Don't just give the AI a bigger notebook; give it a better filing system and a set of tools to look up the facts.
