Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues
This paper introduces EverMemBench, the first benchmark designed to evaluate long-horizon memory in multi-party collaborative dialogues. It reveals that current LLM systems struggle with multi-hop reasoning, temporal versioning, and implicit relevance retrieval in realistic, complex interaction scenarios.