Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

This paper introduces OAKS, a benchmark designed to evaluate large language models' ability to adapt to continuously evolving knowledge streams, revealing that current state-of-the-art models and agentic memory systems struggle with accurate state-tracking and are highly susceptible to distraction in dynamic environments.

Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo

Published Tue, 10 Ma

Imagine you are hiring a personal assistant to help you manage a very chaotic, fast-moving life. Every day, new information arrives: your boss changes the meeting time, your friend moves to a new house, and the weather forecast updates every hour.

Your goal is to test if this assistant can keep up. Can they instantly forget the old info and remember the new info without getting confused? Can they answer questions correctly right now, even if the answer was different five minutes ago?

This paper introduces a new test called OAKS (Online Adaptation to Continual Knowledge Streams) to see if modern AI models (Large Language Models) can actually do this.

Here is the breakdown of their findings using simple analogies:

1. The Test: A Story That Keeps Changing

The researchers created two "storybooks" for the AI to read, but with a twist:

  • The Synthetic Book (OAKS-B): Imagine a story where a character named "Bob" moves from the kitchen to the living room, then to the garage, then back to the kitchen, and then to the basement. Every few sentences, Bob moves again. The AI is asked, "Where is Bob?" after every single sentence.
  • The Real Book (OAKS-N): They took real novels (like Pride and Prejudice or Frankenstein) and broke them into small chunks. In these stories, characters' feelings, relationships, and locations change constantly.

The Challenge: The AI has to read the story as it comes, chunk by chunk. It cannot go back and re-read the whole book. It has to answer the question based only on what it has read so far, updating its answer every time the story changes.
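To make this streaming setup concrete, here is a toy Python sketch (my own illustration, not the authors' code) of an OAKS-B-style stream: each chunk overwrites one fact, and the question must be answered from the latest state only. The entity/location structure here is an assumption about what such a synthetic stream looks like.

```python
# Toy OAKS-B-style stream: each event updates one fact, and the question
# "Where is Bob?" must be answered after every event from the latest state.
stream = [
    ("Bob walks into the kitchen.",    ("Bob", "kitchen")),
    ("Bob moves to the living room.",  ("Bob", "living room")),
    ("Bob heads to the garage.",       ("Bob", "garage")),
    ("Bob returns to the kitchen.",    ("Bob", "kitchen")),
    ("Bob goes down to the basement.", ("Bob", "basement")),
]

def run_stream(stream):
    """Process chunks one at a time; an ideal online tracker simply
    overwrites the old fact and answers from the current state."""
    state = {}       # entity -> current location
    answers = []
    for chunk, (entity, location) in stream:
        state[entity] = location       # the old location is now stale
        answers.append(state[entity])  # answer "Where is Bob?" right now
    return answers

print(run_stream(stream))
```

The point of the sketch is what an ideal learner would do: its answer matches the most recent update at every single step, no matter how often the fact flips back and forth.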

2. The Results: The AI Gets Lost in the Noise

The researchers tested 14 different AI models, including the smartest ones available today (like Gemini 3 and Qwen). The results were surprising and a bit disappointing:

  • The "Forgetful" Problem: Even the smartest models struggled. They often got the answer right at the beginning, but when the story changed, they either didn't update (stuck on the old answer) or updated too much (changed their answer when they shouldn't have).
  • The "Distracted" Problem: As the story got longer, the AI started to get confused by all the text it had already read. It's like trying to follow a friend's update about their new job while they keep circling back to their old job, the weather, and what they had for lunch. The AI often lost track of the current fact.
  • The Numbers: The best models only got about 66% of the answers right on the synthetic test and 75% on the real novels. That sounds okay, but for a "super-intelligent" AI, it means they are failing nearly 1 out of every 3 or 4 times in a dynamic situation.

3. Why Did They Fail? (The "Over-Thinker" vs. The "Stubborn" Robot)

The researchers analyzed how the AI failed and found that the errors fall into two main "personality types":

  • The "Over-Thinker" (Volatility): Some models were too jumpy. If the story mentioned a character moving, they would immediately change their answer, even if the story later said, "Just kidding, he stayed put." They couldn't distinguish between a temporary mention and a permanent change.
  • The "Stubborn" Robot (Obstinacy): Other models were too slow. They would keep saying "Bob is in the kitchen" even after the story clearly stated, "Bob is now in the basement." They refused to let go of the old information.
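These two failure modes can be counted mechanically. Here is a minimal sketch of how one *might* score them (my own simplified definitions, not necessarily the paper's exact metrics): a step is "volatile" if the model changed its answer when the true answer stayed put, and "obstinate" if the model kept its answer when the true answer changed.

```python
def volatility_obstinacy(preds, golds):
    """Toy error breakdown over an answer sequence.
    volatile:  gold answer stayed the same, but the model changed its answer
    obstinate: gold answer changed, but the model kept its old answer"""
    volatile = obstinate = 0
    for t in range(1, len(golds)):
        gold_changed = golds[t] != golds[t - 1]
        pred_changed = preds[t] != preds[t - 1]
        if pred_changed and not gold_changed:
            volatile += 1
        if gold_changed and not pred_changed:
            obstinate += 1
    return volatile, obstinate

golds       = ["kitchen", "kitchen", "basement", "basement"]
jumpy_model = ["kitchen", "garage",  "basement", "basement"]  # over-thinker
stuck_model = ["kitchen", "kitchen", "kitchen",  "kitchen"]   # stubborn

print(volatility_obstinacy(jumpy_model, golds))  # volatile at step 1
print(volatility_obstinacy(stuck_model, golds))  # obstinate at step 2
```

The "jumpy" model scores one volatile step (it switched to the garage for no reason), while the "stuck" model scores one obstinate step (it never followed Bob to the basement).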

4. Did "Thinking Harder" Help?

The researchers tried turning on a "Thinking Mode" (where the AI pauses to reason before answering).

  • Good News: It helped the AI get better at complex logic puzzles (like comparing two characters).
  • Bad News: It didn't fix the core problem. The AI still got distracted by the long stream of text. Thinking harder didn't stop them from forgetting the most recent update.

5. The "Retrieval" Shortcut Didn't Work Either

The researchers tried giving the AI a "search engine" (RAG) so it could look back at the story to find the answer, rather than just remembering it.

  • The Result: It didn't help much. When the story is constantly changing, searching for the "right" piece of information is like trying to find a specific page in a book that is being rewritten while you are reading it. The AI often picked the wrong page or got confused by the search results.
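A tiny sketch shows why this goes wrong (my own illustration of the general failure mode, not the paper's RAG setup): a retriever that ranks chunks only by word overlap has no notion of "now," so stale chunks can outrank the one containing the current fact.

```python
import re

# Three story chunks: the first two are stale, the last holds the current fact.
chunks = [
    "Bob walks into the kitchen and starts cooking.",  # stale
    "Bob chats in the kitchen about the weather.",     # stale
    "Later, Bob moves down to the basement.",          # current fact
]
query = "Where is Bob right now?"

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(chunks, query, k=2):
    # Recency-blind lexical retriever: rank purely by word overlap.
    # Python's sort is stable, so ties keep original (oldest-first) order.
    return sorted(chunks, key=lambda c: len(tokens(c) & tokens(query)),
                  reverse=True)[:k]

# Every chunk matches the query only on "bob", so all three tie -- and the
# two stale kitchen chunks are returned first, crowding out the basement.
print(retrieve(chunks, query))
```

Nothing in the similarity score says "prefer the latest update," so when facts keep changing, the retrieved context is full of outdated or conflicting snippets that the model must still disentangle on its own.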

The Big Takeaway

Current AI is great at reading a static book and answering questions about it. But if you put that same AI in a real-world scenario where facts change every second (like a stock market, a live news feed, or a real-time conversation), it falls apart.

It's like having a librarian who has read every book in the world but gets confused if you ask them, "What is the weather right now?" because they are still reciting the weather report from last week.

Conclusion: We need a new generation of AI that doesn't just "know" things, but can live in a changing world, updating its memory in real-time without getting distracted or stubborn. We aren't there yet.