Imagine you are hiring a brilliant, super-fast personal assistant (let's call them "AI") to help you build a massive, complex house (your software project). You want this AI to not only understand blueprints but also remember every conversation you've had about the project over the last few months.
The Problem: The "Forgetful" Genius
Recently, these AI assistants have gotten incredibly smart. They can write code, fix bugs, and understand complex instructions. But there's a catch: they have a limited "short-term memory" (called a context window).
If you talk to them for 50 or 100 turns about your house project, the conversation gets so long that the AI starts to forget the beginning. It's like trying to read a 500-page novel, but the book only lets you see the last 50 pages at a time. The AI might remember you wanted a red door, but it forgets why you wanted it, or that you already decided to paint the walls blue. It gets confused, makes mistakes, and wastes a lot of energy trying to read the whole book every time you ask a question.
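The "last 50 pages" behavior can be sketched in a few lines. This is a toy illustration of a fixed context window, not any real model's implementation; the function name and window size are made up for the example.

```python
# Toy sketch of a "context window": the assistant only ever sees the
# most recent slice of the conversation, so early decisions fall out.
# (Illustrative only -- real models window by tokens, not turns.)

def visible_context(conversation, window_size=4):
    """Return only the last `window_size` turns; older turns are 'forgotten'."""
    return conversation[-window_size:]

chat = [
    "User: I want a red door.",        # turn 1 -- will fall out of the window
    "User: Paint the walls blue.",
    "User: Add a skylight.",
    "User: Make the kitchen bigger.",
    "User: Widen the driveway.",
]

print(visible_context(chat))  # the red-door turn is no longer visible
```

Ask a question now, and the assistant has no trace of the red door ever being requested.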
The Missing Piece: A New Test Drive
Until now, researchers didn't have a good way to test how well these AIs handle long, messy, real-world conversations about code. Most tests were like asking the AI simple, short questions. They didn't simulate the chaos of a real development project where requirements change, people ask follow-up questions, and information is scattered everywhere.
This paper introduces LoCoEval (Long-Horizon Conversational Context Management Evaluation). Think of this as a new, super-difficult driving test specifically designed for AI assistants working on software projects.
How the Test Works (The Recipe)
The researchers didn't just make up random questions. They built a machine that creates realistic scenarios:
- The Script: They took real code projects and figured out what information was needed to build a specific feature (like a "format date" function).
- The Distraction: They intentionally added "noise" and wrong information to the conversation, just like real life. Maybe the user says, "Let's use a red door," then later says, "Actually, I think blue is better," and then asks, "Wait, what color did we decide?"
- The Marathon: They created incredibly long conversations (up to 256,000 words!).
- The Trap: At the end of the conversation, they ask the AI to build the feature or answer a specific question that requires remembering details from 50 turns ago.
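The four steps above can be sketched as a tiny scenario generator. Everything here is hypothetical (the function name, the turn counts, the wording of the turns); it just shows the shape of the recipe: bury an early decision under noise, revise it late, then probe.

```python
# A toy version of the benchmark recipe: an early decision, a pile of
# distractor turns, a late revision, and a final probe question that
# requires the full decision history. Names/counts are illustrative.

def build_scenario(num_distractors=50):
    turns = ["User: Let's use a red door."]          # the original decision
    # "Noise": filler turns that bury the decision deep in the history.
    turns += [f"User: Unrelated question #{i} about the garden."
              for i in range(num_distractors)]
    # A late revision the assistant must reconcile with the early turn.
    turns.append("User: Actually, I think blue is better.")
    # The "trap": a probe whose answer depends on turns long ago.
    turns.append("User: Wait, what color did we decide?")
    return turns

scenario = build_scenario()
print(len(scenario))  # 53 turns: decision + 50 distractors + revision + probe
```

A correct answer ("blue") requires tracking both the original decision and its revision across the whole history.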
The Results: Who Passed?
They tested 7 different AI strategies (including big models like GPT and specialized memory systems) on this test.
- The "Raw" AI: When they just let the AI read the whole conversation without help, it struggled. It got lost in the noise, forgot critical details, and the cost to run it was astronomical (like paying for a library of books just to read one page).
- The "General" Memory Systems: Some AIs tried to use a "memory bank" (like a notebook) to summarize the chat. Surprisingly, the simple ones worked better than the fancy, complex ones. The fancy ones often got confused because they weren't designed to handle code files mixed into the chat history.
- The Winner (Mem0R): The researchers built a new version of a memory system called Mem0R. Imagine a standard notebook that only writes down what you say. Mem0R is like a notebook that not only writes down what you say but also physically links your words to the blueprints and tools on your desk.
- Analogy: If you say, "Change the door," a normal AI just remembers the word "door." Mem0R remembers "Door" and points directly to the specific door file in the project folder. This allowed it to win the test, outperforming all other methods.
Why This Matters
This paper is a wake-up call. It shows that while AI is great at short tasks, it's currently terrible at long, complex software projects where context is everything.
- The Benchmark: LoCoEval is now the standard "ruler" for measuring how good an AI is at remembering long conversations about code.
- The Solution: The new method (Mem0R) proves that if we teach AIs to link their conversation memory directly to the actual code files, they become much more reliable.
In a Nutshell
The authors built a realistic, chaotic, long-winded test to see if AI assistants can remember what they talked about weeks ago while building software. They found that current AIs are easily overwhelmed, but a new method that connects "chat memory" directly to "code files" works much better. This helps developers build better tools for the future.