AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

This paper introduces AMA-Bench, a benchmark for evaluating long-horizon memory in agentic applications, built from both real-world and synthetic machine-generated trajectories. It also proposes AMA-Agent, a causality-driven memory system that addresses the limitations of current similarity-based retrieval methods and significantly outperforms existing baselines.

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

Published 2026-03-05

🧠 The Big Idea: The "Forgetful Super-Intelligence"

Imagine you hire a brilliant, super-smart personal assistant (an AI Agent) to help you with a massive, multi-week project. This assistant is incredibly smart at solving problems right now, but they have a terrible memory.

If you ask them, "What did we decide three days ago?" or "What was the state of the database before we clicked that button?", they might guess, hallucinate, or just say, "I don't know."

This paper argues that current AI agents are failing not because they aren't smart enough, but because their "filing cabinets" (memory systems) are broken. They are trying to remember complex, machine-generated tasks using tools designed for chatting with humans.

🚫 The Problem: Chatting vs. Doing

The authors point out a major mismatch:

  • Old Benchmarks (The "Chat" Test): Most tests for AI memory are like asking a friend, "Do you remember what we talked about yesterday?" These tests focus on casual conversation, where people repeat themselves, use slang, and talk about feelings.
  • Real Agent Work (The "Job" Test): Real-world AI agents (like ones that browse the web, write code, or play games) don't chat. They generate machine logs. These are strict, technical records like {"status": "error", "code": 404} or {"action": "click", "id": "button_5"}.
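The mismatch is easy to see side by side. Here is a minimal sketch in Python, using invented example records (none of these come from the paper), of a "chat" memory versus a machine-generated agent log, and the kind of precise question only the latter can support:

```python
# Hypothetical illustration: the same incident as chat text vs. a machine log.
# All records below are invented examples, not data from the paper.

chat_memory = [
    "User: hey, that page crashed again yesterday, so annoying!",
    "Assistant: Oh no! Was it the same 404 thing as before?",
]

# Agent trajectories are strict, structured records of actions and states.
agent_log = [
    {"step": 1, "action": "click", "id": "button_5"},
    {"step": 2, "status": "error", "code": 404},
]

# A memory system for agents must answer precise questions, e.g.:
# "Which action immediately preceded the 404 error?"
error_step = next(e["step"] for e in agent_log if e.get("code") == 404)
preceding = next(e for e in agent_log if e["step"] == error_step - 1)
print(preceding["action"], preceding["id"])  # → click button_5
```

The chat version can only be matched fuzzily; the log version can be queried exactly.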

The Analogy:
Imagine trying to find a specific screw in a garage.

  • The Old Way (Chat Memory): You ask the AI to remember the story of how you got the screw. "Well, I was thinking about the garden, then I saw a red truck..." The AI gets lost in the story and forgets the screw.
  • The Real Way (Agent Memory): You need the AI to remember the exact coordinates of the screw in a digital blueprint. The story is irrelevant; the data is everything.

Current AI memory systems are like a librarian who only knows how to organize books by the plot summary, but the AI agent needs a librarian who can organize blueprints and code.

🛠️ The Solution: AMA-Bench (The New Test)

To fix this, the researchers built AMA-Bench (Agent Memory with Any length). Think of this as a new, much harder driving test for AI.

Instead of asking the AI to remember a conversation, AMA-Bench gives it a long, complex mission (like "Navigate a robot through a maze" or "Fix a bug in a software project") and asks specific questions about the technical details of the journey.

The Test has two parts:

  1. Real-World Data: They took actual logs from real AI agents doing real jobs (like browsing websites or playing games) and asked experts to write questions about them.
  2. Synthetic Data: They built a "simulator" where they can create infinite scenarios of any length to test how well the AI remembers things after 100,000 steps.
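To make the synthetic half concrete, here is a toy sketch of the idea (an invented example, not the paper's actual simulator): generate a trajectory of any length where the correct answer to a memory question is known by construction, so grading is automatic even at 100,000 steps.

```python
import random

# Invented toy simulator: the "counter" is hidden state that a memory system
# would have to track across the whole trajectory. Because we build the log
# ourselves, the ground-truth answer is known by construction.

def make_trajectory(n_steps, seed=0):
    rng = random.Random(seed)  # seeded so trajectories are reproducible
    counter = 0
    log = []
    for step in range(n_steps):
        action = rng.choice(["increment", "noop"])
        if action == "increment":
            counter += 1
        log.append({"step": step, "action": action, "counter": counter})
    question = f"What was the counter after step {n_steps - 1}?"
    return log, question, counter  # counter is the verifiable answer

log, question, answer = make_trajectory(100_000)
assert log[-1]["counter"] == answer  # answer is checkable by construction
```

This is what "any length" buys you: the test scales to arbitrarily long horizons without needing a human to write or check answers.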

The Result: When they ran this new test, even the smartest AI models (like GPT-5) scored poorly. They realized that simply making the AI "smarter" or giving it a bigger "brain" (more context) wasn't working. The memory system itself was the bottleneck.

🏆 The New Champion: AMA-Agent

The authors didn't just find the problem; they built a solution called AMA-Agent. They fixed the memory system using two clever tricks:

1. The Causality Graph (The "Cause-and-Effect Map")

  • The Old Way: Most AI memory works like a magnet. It grabs pieces of text that look similar to the question. If you ask "What happened after the crash?", it might grab a paragraph that mentions "crash" but talks about a car accident, not the software crash.
  • The New Way: AMA-Agent builds a flowchart of cause and effect. It doesn't just store text; it stores the logic.
    • Analogy: Instead of a pile of loose papers, imagine a family tree. If you ask, "Who is the great-grandfather?", the system doesn't guess based on similar names; it traces the exact line of descent. It knows that Action A caused State B, which led to Action C. This preserves the "truth" of what happened.
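The family-tree idea can be sketched in a few lines of Python. The events and their names below are invented for illustration; the point is that the answer comes from following explicit "caused_by" edges, not from matching similar-looking text:

```python
# Minimal sketch of a causality graph: events as nodes, explicit cause edges.
# Event names and descriptions are invented for illustration.

events = {
    "A": {"desc": "agent clicks 'deploy'", "caused_by": None},
    "B": {"desc": "server returns 500", "caused_by": "A"},
    "C": {"desc": "agent rolls back release", "caused_by": "B"},
}

def root_cause(event_id):
    """Walk the cause-and-effect chain back to its origin."""
    while events[event_id]["caused_by"] is not None:
        event_id = events[event_id]["caused_by"]
    return event_id

print(root_cause("C"))  # → A
```

Asked "what ultimately caused the rollback?", the system traces C → B → A rather than guessing from keywords.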

2. Tool-Augmented Retrieval (The "Swiss Army Knife")

  • The Old Way: The AI tries to find the answer purely by "feel," retrieving whatever stored text looks most similar to the question.
  • The New Way: If the AI isn't sure, it stops guessing and uses tools.
    • It can run a keyword search (like Ctrl+F in a document).
    • It can run a code script to count things or check specific numbers.
    • It can traverse the flowchart (the Causality Graph) to find the exact step where something changed.
    • Analogy: Instead of trying to remember the phone number of a friend from 10 years ago, the AI opens its phone book, types the name, and gets the exact number. It doesn't rely on fuzzy memory; it relies on tools.
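The tool idea above can be sketched as a couple of exact operations over a log. This is a hedged illustration, not the paper's implementation, and the log entries and tool names are invented:

```python
# Hedged sketch of tool-augmented retrieval: instead of answering from fuzzy
# similarity, the agent calls an exact tool. The log below is invented.

log = [
    {"step": 1, "action": "click", "id": "button_5"},
    {"step": 2, "status": "error", "code": 404},
    {"step": 3, "action": "retry", "id": "button_5"},
]

def keyword_search(term):
    """Ctrl+F over the log: return every entry mentioning the term."""
    return [e for e in log if term in str(e)]

def count(predicate):
    """Run a tiny 'script' instead of estimating a number from memory."""
    return sum(1 for e in log if predicate(e))

print(len(keyword_search("button_5")))        # exact hits, not guesses
print(count(lambda e: e.get("code") == 404))  # exact count of 404 errors
```

Each tool returns a provably correct answer over the stored log, which is the whole point: look it up, don't recall it.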

📉 The Results

When they tested this new system (AMA-Agent) against the old ones:

  • Old Systems: Scored around 45% accuracy. They were getting lost in the details and forgetting the logic.
  • AMA-Agent: Scored 57.22%.
  • The Gap: This might not sound like a huge jump, but in AI research, beating the best existing system by roughly 11 percentage points is a massive victory. It suggests that how you organize memory matters more than how big the AI's brain is.

💡 The Takeaway

This paper teaches us that to build AI agents that can truly work for us in the real world (fixing code, managing databases, navigating robots), we can't just make them "chat" better. We need to give them structured, logical memory that understands cause and effect, not just vague similarities.

In short: Don't just give the AI a bigger notebook; give it a better filing system and a set of tools to look up the facts.
