Imagine you are trying to find a specific recipe for "Grandma's Apple Pie" using a massive, ever-changing library of cookbooks.
In the world of computer science, Information Retrieval (IR) is exactly that: a system designed to find the right "recipe" (document) for a specific question. Researchers typically test these systems using a benchmark. Think of a benchmark as a fixed, frozen snapshot of that library. You ask a question, the computer searches, and you see if it found the right page.
But here's the problem: Real life isn't frozen. Cookbooks get updated, recipes get moved to new sections, and sometimes entire chapters are thrown out and replaced by a different author. This is called Temporal Drift.
This paper, "Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks," asks a simple but crucial question: If we freeze a library today, will the tests we run on it still work next year when the library has changed?
The authors decided to test this using FreshStack, a benchmark focused on LangChain (a popular toolkit for building AI applications). They treated the documentation for LangChain like a living, breathing organism that changes every day.
Here is the story of their experiment, broken down into simple concepts:
1. The Time Travel Experiment
The researchers took two snapshots of the LangChain documentation library:
- Snapshot A (October 2024): The "old" library.
- Snapshot B (October 2025): The "new" library, one year later.
They took 203 specific questions (queries) that people asked in 2024 and tried to answer them using both libraries.
The Big Surprise:
You might expect that because the library changed so much (67% of the LangChain documentation was reorganized or deleted!), most questions would become unanswerable.
- Reality: 202 out of 203 questions were still fully answerable in the 2025 library!
The Magic Trick (Content Migration):
Why didn't the questions break? Because the information didn't disappear; it moved.
Imagine you were looking for a specific tool in the "Garden Shed" (LangChain). In 2024, it was there. In 2025, the Garden Shed was reorganized, and that tool was moved to the "Greenhouse" (a competitor framework called LlamaIndex).
The retrieval system didn't fail, because it could still find the tool in its new location. The "knowledge" migrated to a different repository, but the answer was still there.
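The answerability check above can be sketched in a few lines. This is a hypothetical simplification, not the paper's actual pipeline: we assume each 2024 query comes with short "nuggets" (snippets a correct answer must contain) and ask whether every nugget still appears somewhere in the 2025 corpus, using naive normalized substring matching as a stand-in for real relevance judgment.

```python
# Hypothetical sketch of an answerability check across two snapshots.
# Assumption: queries carry "nuggets" (key answer snippets), and a query
# stays answerable if every nugget survives somewhere in the new corpus.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for naive matching."""
    return " ".join(text.lower().split())

def nugget_supported(nugget: str, corpus: list[str]) -> bool:
    """A nugget survives if any document still contains it."""
    target = normalize(nugget)
    return any(target in normalize(doc) for doc in corpus)

def still_answerable(nuggets: list[str], corpus: list[str]) -> bool:
    """A query stays answerable only if all of its nuggets survive."""
    return all(nugget_supported(n, corpus) for n in nuggets)

# Toy example: the snippet moved to a different "repository" (document),
# but it is still present in the 2025 corpus, so the query survives.
corpus_2025 = [
    "LlamaIndex docs: use a text splitter to chunk long documents.",
    "Chroma docs: persist embeddings to disk with a local client.",
]
print(still_answerable(
    ["use a text splitter to chunk long documents"], corpus_2025))
# → True
```

The key design point mirrors the paper's finding: the check cares only about whether the content exists *anywhere* in the corpus, not about which repository it lives in, which is exactly why migrated content keeps queries answerable.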
2. The "Shuffle" of the Deck
The researchers looked closely at where the answers came from.
- In 2024: Half of the answers came directly from the main LangChain repository.
- In 2025: The main LangChain repository provided less than a quarter of the answers. The rest were scattered across other repositories like LlamaIndex, Chroma, and Transformers.
It's like a game of musical chairs. The players (the documents) moved to different chairs (repositories), but the game (answering the question) could still be played.
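That "musical chairs" shift is easy to quantify: tally which repository each supporting document comes from in each snapshot and compare the shares. The source labels below are invented for illustration (the paper reports roughly half from LangChain in 2024 versus under a quarter in 2025); only the counting logic is the point.

```python
from collections import Counter

# Hypothetical per-answer source repositories for each snapshot.
# These labels are made up to mirror the reported trend, not real data.
sources_2024 = ["langchain", "langchain", "langchain", "langchain",
                "chroma", "llamaindex", "transformers", "llamaindex"]
sources_2025 = ["langchain", "llamaindex", "llamaindex", "chroma",
                "chroma", "transformers", "llamaindex", "transformers"]

def share(sources: list[str], repo: str) -> float:
    """Fraction of supporting documents drawn from one repository."""
    return Counter(sources)[repo] / len(sources)

print(f"LangChain share 2024: {share(sources_2024, 'langchain'):.1%}")
print(f"LangChain share 2025: {share(sources_2025, 'langchain'):.1%}")
# → LangChain share 2024: 50.0%
# → LangChain share 2025: 12.5%
```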
3. Did the Search Engines Get Confused?
The final question was: Did the AI search engines get confused by all this moving around?
They tested various "search engines" (different AI models) on both the 2024 and 2025 libraries. They wanted to see if the "best" search engine in 2024 was still the "best" in 2025.
- The Result: No confusion at all. The best model in 2024 was still the best in 2025, and the rankings stayed almost exactly the same.
- The Analogy: Imagine a race where the track changes slightly (some curves are tighter, some hills are steeper). You might expect the fastest runner to change. But in this study, the same runners finished in the same order. The correlation between the rankings was incredibly high (97.8%).
This means that even though the "library" changed drastically, the quality of the search tools remained consistent. If a model was good at finding answers in 2024, it was still good in 2025, even if the answers were in different places.
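Ranking stability of this kind is usually measured with a rank correlation such as Kendall's tau. Here is a minimal, self-contained sketch; the model scores are invented, and the paper's reported 97.8% figure refers to its own correlation computation, not this toy data.

```python
# Hedged sketch of a ranking-stability check using Kendall's tau:
# +1.0 means two score lists order the models identically,
# -1.0 means they order them in exact reverse.

def kendall_tau(xs: list[float], ys: list[float]) -> float:
    """Kendall rank correlation between two paired score lists (no ties)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:        # pair ordered the same way in both lists
                concordant += 1
            elif s < 0:      # pair ordered oppositely
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical retrieval scores for five models on each snapshot.
scores_2024 = [0.61, 0.55, 0.48, 0.43, 0.30]
scores_2025 = [0.58, 0.54, 0.49, 0.40, 0.31]
print(kendall_tau(scores_2024, scores_2025))  # → 1.0 (identical ordering)
```

A value near 1.0, like the one the paper reports, means you could have picked your model on the old snapshot and made the same choice on the new one.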
The Takeaway
The paper concludes with a comforting message for developers and researchers:
"Don't panic about your benchmarks getting stale."
Even in a fast-moving, chaotic world of technical code where documentation is constantly being rewritten, moved, or deleted, well-designed retrieval benchmarks can still be reliable. The information might "migrate" to new homes, but as long as the search system is smart enough to look in the right places, the answers are still "Fresh."
In short: The library moved the furniture, but the search engine still found the book. The test is still valid!