Imagine you are the new manager of a massive, chaotic, 100-person company. You have thousands of emails, Slack messages, and meeting notes spanning an entire year. People are constantly changing their minds, updating plans, arguing in different group chats, and working on projects that overlap in confusing ways.
Now, imagine you ask your AI assistant: "Who was responsible for the final version of the marketing budget, and what was the exact date we approved it?"
If your AI is like current systems, it might get lost. It might confuse the draft with the final version, mix up who said what in a group of 20 people, or forget that a rule changed three months ago.
This paper introduces EverMemBench, a new "stress test" for AI memory, designed specifically to see if AI can handle this kind of real-world, multi-person chaos.
Here is the breakdown using simple analogies:
1. The Problem: The "Single-Thread" vs. The "Web"
Most current AI memory tests are like a one-on-one diary. They ask: "We talked for a long time. Do you remember what I said on Tuesday?"
- The Reality: Real life isn't a diary; it's a busy subway station. Hundreds of people are talking at once. Information is scattered across different groups (the "Marketing Group," the "Engineering Group"). People change their minds, and a decision made in one group affects another group.
- The Gap: Current AI benchmarks don't test if the AI can navigate this subway station. They only test if it can read a long book.
2. The Solution: EverMemBench (The "Chaos Simulator")
The authors built a fake company with 170 employees and 5 massive projects spanning domains like a tech startup, a bank, and a marketing firm. They generated over 4 million words of simulated chat history.
- The Analogy: Think of this as a giant, living simulation game. They didn't just write a story; they programmed the characters to have personalities, skills, and roles. They made them argue, revise plans, and update rules over time.
- The Goal: To see if an AI can act like a competent human employee who remembers who said what, when, and why, even when the information is buried in a sea of noise.
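To make the "chaos simulator" idea concrete, here is a minimal sketch of how persona-driven agents might populate a multi-channel chat log. Everything here (the `Persona` fields, the channel name, the message format) is invented for illustration; it is not the paper's actual schema.

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    """A simulated employee (hypothetical fields, not the paper's schema)."""
    name: str
    role: str
    style: str  # e.g. "casual", "formal"

def simulate_channel(channel: str, members: list[Persona],
                     turns: int, seed: int = 0) -> list[dict]:
    """Generate a toy chat log: each turn, a random member posts a message."""
    rng = random.Random(seed)
    log = []
    for t in range(turns):
        speaker = rng.choice(members)
        log.append({
            "channel": channel,
            "turn": t,
            "speaker": speaker.name,
            "text": f"[{speaker.style}] {speaker.role} update #{t}",
        })
    return log

team = [Persona("Sarah", "engineer", "casual"),
        Persona("Raj", "manager", "formal")]
log = simulate_channel("#marketing", team, turns=5)
print(len(log), log[0]["channel"])  # → 5 #marketing
```

Scale this up to 170 personas, multiple overlapping channels, and a year of turns, and you get the "sea of noise" the benchmark buries its answers in.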
3. The Three Big Challenges (The "Boss Levels")
The paper tests the AI on three specific skills, which they call "Dimensions":
A. Fine-Grained Recall (The "Detective")
- The Test: "Find the specific link to the final budget document."
- The Trap: There are 10 links to budget documents. One is a draft, one is a Figma design, and one is the final Confluence link. They were all sent by the same person, just two days apart.
- The Failure: Current AIs often grab the wrong link because it looks similar or was mentioned more recently. They can't tell the difference between a "draft" and a "final version."
- The Analogy: It's like asking a detective to find the original murder weapon in a room full of identical-looking knives. Current AIs grab the first shiny knife they see.
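The draft-vs-final trap can be sketched with toy data (the messages and links below are invented for illustration): a naive retriever that just returns the most recent message mentioning "budget" grabs the wrong link, while one that also requires the qualifier "final" finds the right one.

```python
# Toy message log: same sender, similar-looking links, days apart (invented data).
messages = [
    {"day": 1, "sender": "Raj",
     "text": "Budget draft: https://confluence.example/budget-draft"},
    {"day": 2, "sender": "Raj",
     "text": "Final budget approved: https://confluence.example/budget-final"},
    {"day": 3, "sender": "Raj",
     "text": "Design mockups: https://figma.example/budget-mockup"},
]

def naive_retrieve(query_word: str) -> str:
    """Return the link from the most recent message mentioning the word."""
    hits = [m for m in messages if query_word in m["text"].lower()]
    return hits[-1]["text"].split()[-1]  # recency wins, even when it's wrong

def qualified_retrieve(query_word: str, qualifier: str) -> str:
    """Also require the qualifier (e.g. 'final') in the same message."""
    hits = [m for m in messages
            if query_word in m["text"].lower()
            and qualifier in m["text"].lower()]
    return hits[-1]["text"].split()[-1]

print(naive_retrieve("budget"))               # → the day-3 Figma mockup link
print(qualified_retrieve("budget", "final"))  # → the actual final document
```

The point is not that keyword filtering solves the benchmark; it is that "most recent" and "most similar" are exactly the shortcuts that fail here.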
B. Memory Awareness (The "Smart Assistant")
- The Test: "The boss wants to use a new tool, but does that break our old rules?"
- The Trap: The AI needs to remember a rule established months ago ("We only use Tool X") and realize that the new request violates it, even if the boss sounds very convincing.
- The Failure: AIs often forget the rule or get tricked by the boss's urgency. They act like a sycophant who agrees with everything, rather than a smart assistant who says, "Wait, we agreed on this last year."
- The Analogy: It's like a GPS that ignores the "Do Not Enter" sign you set last week because the traffic looks good right now.
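One way to picture "memory awareness" is a rule ledger consulted before every new request. This is a toy sketch with invented rule text and a crude keyword check, not the paper's method:

```python
from datetime import date

# Long-standing team rules, each with the date it was set (invented example).
rules = [
    {"set_on": date(2024, 1, 15),
     "text": "Only Tool X is approved for data exports."},
]

def check_request(request: str) -> str:
    """Flag a request that conflicts with a remembered rule instead of complying."""
    for rule in rules:
        # Toy conflict test: the request is about a tool, but not the approved one.
        if "tool" in request.lower() and "Tool X" not in request:
            return f"Wait - on {rule['set_on']} we agreed: {rule['text']}"
    return "OK, proceeding."

print(check_request("Let's switch data exports to Tool Y, it's urgent!"))
```

A sycophantic assistant skips the ledger lookup entirely; the benchmark tests whether the model performs it even when the request sounds urgent.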
C. Profile Understanding (The "Chameleon")
- The Test: "Write an email for 'Sarah'."
- The Trap: Sarah is a casual, emoji-loving engineer. The AI needs to write like her, not like a generic robot.
- The Failure: Even if the AI knows the facts, it writes in a stiff, formal tone because it forgot Sarah's personality.
- The Analogy: It's like hiring an actor who knows the script perfectly but forgets to wear the costume and speak with the character's accent. They sound like a robot reading a script, not the character.
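Profile understanding amounts to conditioning generation on a stored persona, not just stored facts. A minimal sketch (the profile fields and the prompt template are invented):

```python
# Hypothetical persona store: voice attributes alongside the role facts.
profiles = {
    "Sarah": {"role": "engineer", "tone": "casual",
              "quirks": "loves emoji, short sentences"},
    "Raj": {"role": "manager", "tone": "formal",
            "quirks": "full paragraphs, no slang"},
}

def style_prompt(author: str, task: str) -> str:
    """Build a generation prompt that carries the author's voice, not just facts."""
    p = profiles[author]
    return (f"Write as {author} ({p['role']}). Tone: {p['tone']}. "
            f"Style notes: {p['quirks']}. Task: {task}")

print(style_prompt("Sarah", "email the team about the budget approval"))
```

The failure mode in the benchmark is the equivalent of dropping the `tone` and `quirks` fields: the facts survive, the costume does not.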
4. The Results: The "Reality Check"
When they ran these tests on the smartest AI models available (like GPT-4 and Gemini):
- The Good: They are great at remembering simple facts if the answer is right in front of them.
- The Bad: They fall apart when things get complex.
- Multi-Party Confusion: If three people are talking, the AI gets confused about who owns which idea.
- Time Travel: They struggle to understand that a plan from January is different from a plan in June. They treat time like a flat list, not a timeline with updates.
- The "Needle in a Haystack" isn't the problem: The problem isn't that the memory is too long; it's that the AI doesn't know how to search through the mess.
5. Why This Matters
This paper is a wake-up call. It says: "We can't just make AI read longer books. We need to teach it how to be a human teammate."
To build AI that can actually work in offices, hospitals, or schools, we need systems that understand:
- Context: Who is talking to whom?
- Evolution: How did this idea change over time?
- Identity: Who is this person, and how do they speak?
In short: EverMemBench is the "driver's license test" for AI memory. It proves that while AI can read a whole library, it still struggles to navigate the messy, changing, multi-person reality of a real workplace.