Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

This paper introduces OAKS, a benchmark designed to evaluate large language models' ability to adapt to continuously evolving knowledge streams, revealing that current state-of-the-art models and agentic memory systems struggle with accurate state-tracking and are highly susceptible to distraction in dynamic environments.

Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo

Published Tue, 10 Ma

Imagine you are hiring a personal assistant to help you manage a very chaotic, fast-moving life. Every day, new information arrives: your boss changes the meeting time, your friend moves to a new house, and the weather forecast updates every hour.

Your goal is to test if this assistant can keep up. Can they instantly forget the old info and remember the new info without getting confused? Can they answer questions correctly right now, even if the answer was different five minutes ago?

This paper introduces a new test called OAKS (Online Adaptation to Continual Knowledge Streams) to see if modern AI models (Large Language Models) can actually do this.

Here is the breakdown of their findings using simple analogies:

1. The Test: A Story That Keeps Changing

The researchers created two "storybooks" for the AI to read, but with a twist:

  • The Synthetic Book (OAKS-B): Imagine a story where a character named "Bob" moves from the kitchen to the living room, then to the garage, then back to the kitchen, and then to the basement. Every few sentences, Bob moves again. The AI is asked, "Where is Bob?" after every single sentence.
  • The Real Book (OAKS-N): They took real novels (like Pride and Prejudice or Frankenstein) and broke them into small chunks. In these stories, characters' feelings, relationships, and locations change constantly.

The Challenge: The AI has to read the story as it comes, chunk by chunk. It cannot go back and re-read the whole book. It has to answer the question based only on what it has read so far, updating its answer every time the story changes.
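To make this streaming setup concrete, here is a toy Python sketch (my own illustration, not the authors' code) of an OAKS-B-style stream: each chunk overwrites one fact, and the question must be answered from the latest state only. The entity/location structure here is an assumption about what such a synthetic stream looks like.

```python
# Toy OAKS-B-style stream: each event updates one fact, and the question
# "Where is Bob?" must be answered after every event from the latest state.
stream = [
    ("Bob walks into the kitchen.",    ("Bob", "kitchen")),
    ("Bob moves to the living room.",  ("Bob", "living room")),
    ("Bob heads to the garage.",       ("Bob", "garage")),
    ("Bob returns to the kitchen.",    ("Bob", "kitchen")),
    ("Bob goes down to the basement.", ("Bob", "basement")),
]

def run_stream(stream):
    """Process chunks one at a time; an ideal online tracker simply
    overwrites the old fact and answers from the current state."""
    state = {}       # entity -> current location
    answers = []
    for chunk, (entity, location) in stream:
        state[entity] = location       # the old location is now stale
        answers.append(state[entity])  # answer "Where is Bob?" right now
    return answers

print(run_stream(stream))
```

The point of the sketch is what an ideal learner would do: its answer matches the most recent update at every single step, no matter how often the fact flips back and forth.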

2. The Results: The AI Gets Lost in the Noise

The researchers tested 14 different AI models, including the smartest ones available today (like Gemini 3 and Qwen). The results were surprising and a bit disappointing:

  • The "Forgetful" Problem: Even the smartest models struggled. They often got the answer right at the beginning, but when the story changed, they either didn't update (stuck on the old answer) or updated too much (changed their answer when they shouldn't have).
  • The "Distracted" Problem: As the story got longer, the AI started to get confused by all the text it had already read. It's like trying to follow a friend's update about their new job while they keep circling back to their old job, the weather, and what they had for lunch. The AI often lost track of the current fact.
  • The Numbers: The best models only got about 66% of the answers right on the synthetic test and 75% on the real novels. That sounds okay, but for a "super-intelligent" AI, it means they are failing nearly 1 out of every 3 or 4 times in a dynamic situation.

3. Why Did They Fail? (The "Over-Thinker" vs. The "Stubborn" Robot)

The researchers analyzed how the AI failed and found that the errors fall into two main "personality types":

  • The "Over-Thinker" (Volatility): Some models were too jumpy. If the story mentioned a character moving, they would immediately change their answer, even if the story later said, "Just kidding, he stayed put." They couldn't distinguish between a temporary mention and a permanent change.
  • The "Stubborn" Robot (Obstinacy): Other models were too slow. They would keep saying "Bob is in the kitchen" even after the story clearly stated, "Bob is now in the basement." They refused to let go of the old information.
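These two failure modes can be counted mechanically. Here is a minimal sketch of how one *might* score them (my own simplified definitions, not necessarily the paper's exact metrics): a step is "volatile" if the model changed its answer when the true answer stayed put, and "obstinate" if the model kept its answer when the true answer changed.

```python
def volatility_obstinacy(preds, golds):
    """Toy error breakdown over an answer sequence.
    volatile:  gold answer stayed the same, but the model changed its answer
    obstinate: gold answer changed, but the model kept its old answer"""
    volatile = obstinate = 0
    for t in range(1, len(golds)):
        gold_changed = golds[t] != golds[t - 1]
        pred_changed = preds[t] != preds[t - 1]
        if pred_changed and not gold_changed:
            volatile += 1
        if gold_changed and not pred_changed:
            obstinate += 1
    return volatile, obstinate

golds       = ["kitchen", "kitchen", "basement", "basement"]
jumpy_model = ["kitchen", "garage",  "basement", "basement"]  # over-thinker
stuck_model = ["kitchen", "kitchen", "kitchen",  "kitchen"]   # stubborn

print(volatility_obstinacy(jumpy_model, golds))  # volatile at step 1
print(volatility_obstinacy(stuck_model, golds))  # obstinate at step 2
```

The "jumpy" model scores one volatile step (it switched to the garage for no reason), while the "stuck" model scores one obstinate step (it never followed Bob to the basement).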

4. Did "Thinking Harder" Help?

The researchers tried turning on a "Thinking Mode" (where the AI pauses to reason before answering).

  • Good News: It helped the AI get better at complex logic puzzles (like comparing two characters).
  • Bad News: It didn't fix the core problem. The AI still got distracted by the long stream of text. Thinking harder didn't stop them from forgetting the most recent update.

5. The "Retrieval" Shortcut Didn't Work Either

The researchers tried giving the AI a "search engine" (RAG) so it could look back at the story to find the answer, rather than just remembering it.

  • The Result: It didn't help much. When the story is constantly changing, searching for the "right" piece of information is like trying to find a specific page in a book that is being rewritten while you are reading it. The AI often picked the wrong page or got confused by the search results.
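A tiny sketch shows why this goes wrong (my own illustration of the general failure mode, not the paper's RAG setup): a retriever that ranks chunks only by word overlap has no notion of "now," so stale chunks can outrank the one containing the current fact.

```python
import re

# Three story chunks: the first two are stale, the last holds the current fact.
chunks = [
    "Bob walks into the kitchen and starts cooking.",  # stale
    "Bob chats in the kitchen about the weather.",     # stale
    "Later, Bob moves down to the basement.",          # current fact
]
query = "Where is Bob right now?"

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(chunks, query, k=2):
    # Recency-blind lexical retriever: rank purely by word overlap.
    # Python's sort is stable, so ties keep original (oldest-first) order.
    return sorted(chunks, key=lambda c: len(tokens(c) & tokens(query)),
                  reverse=True)[:k]

# Every chunk matches the query only on "bob", so all three tie -- and the
# two stale kitchen chunks are returned first, crowding out the basement.
print(retrieve(chunks, query))
```

Nothing in the similarity score says "prefer the latest update," so when facts keep changing, the retrieved context is full of outdated or conflicting snippets that the model must still disentangle on its own.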

The Big Takeaway

Current AI is great at reading a static book and answering questions about it. But if you put that same AI in a real-world scenario where facts change every second (like a stock market, a live news feed, or a real-time conversation), it falls apart.

It's like having a librarian who has read every book in the world but gets confused if you ask them, "What is the weather right now?" because they are still reciting the weather report from last week.

Conclusion: We need a new generation of AI that doesn't just "know" things, but can live in a changing world, updating its memory in real-time without getting distracted or stubborn. We aren't there yet.