RIVER: A Real-Time Interaction Benchmark for Video LLMs

This paper introduces RIVER, a benchmark and framework designed to evaluate and improve the real-time interactive capabilities of video large language models. It targets their current limitations in online processing, long-term memory, and proactive anticipation through a three-task system: Retrospective Memory, Live-Perception, and Proactive Anticipation.

Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

Published 2026-03-05

Imagine you are watching a movie with a friend who has never seen it before. You want to chat about it while it's playing, not after it's finished.

  • Friend A (The Old Way): This friend waits until the movie is over, then says, "Okay, what was that thing the guy did 20 minutes ago?" They have to rewind the whole movie in their head to remember. They are great at summarizing the whole story, but they are terrible at chatting in real-time.
  • Friend B (The New Way): This friend watches the movie as it happens. They can say, "Look, the lion just sat down!" or "Wait, did you see where I put my keys 5 minutes ago?" or even "I think the hero is about to jump!" They are reacting to the now, remembering the past, and guessing the future all at once.

This paper introduces RIVER, a new "test" (benchmark) designed to see how good AI models are at being Friend B.

The Problem: The "Offline" AI

Currently, most of the powerful AI video models are like Friend A. They are "offline." They need to see the entire video file before they can answer a single question. If you try to talk to them while a video is streaming (like on a live camera feed), they get confused, forget what happened 30 seconds ago, or just freeze. They are great at analyzing a finished movie, but bad at living in the moment.

The Solution: The RIVER Test

The authors created RIVER (Real-tIme intERaction Benchmark for Video LLMs). Think of RIVER as a video game level designed specifically to test if an AI can handle a live conversation.

The test has three main "levels" or challenges (a small code sketch after the list shows how all three fit into one live loop):

  1. The "Retrospective Memory" Level (Looking Back):

    • The Challenge: The AI is watching a video. Suddenly, you ask, "What color was the shirt the guy was wearing 5 minutes ago?"
    • The Goal: The AI needs to have a good memory. It shouldn't just guess; it needs to recall specific details from the past without seeing the whole video at once.
    • Analogy: It's like playing a game of "20 Questions" about a story you are hearing for the first time, but you have to remember details from the beginning of the story while listening to the end.
  2. The "Live-Perception" Level (Living in the Now):

    • The Challenge: You ask, "What is happening right this second?"
    • The Goal: The AI must process the video frame-by-frame as it arrives and answer instantly. No waiting, no rewinding.
    • Analogy: This is like a sports commentator. They have to describe the goal as the ball crosses the line, not 10 seconds later.
  3. The "Proactive Response" Level (Predicting the Future):

    • The Challenge: You tell the AI, "Tell me the moment the dog starts barking."
    • The Goal: The AI has to keep watching silently until the specific event happens, then speak up immediately. It's not just answering; it's waiting for the right moment.
    • Analogy: Imagine you are a security guard watching a camera. You are told, "Yell 'Intruder!' the second someone opens the back door." You have to wait patiently and then react instantly.
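
To make these three challenges concrete, here is a minimal sketch of what a single streaming interaction loop might look like. This is not the paper's code or API: the StreamingVideoLLM class, its ingest_frame / answer / should_respond methods, and the toy "dog" stream are all hypothetical stand-ins, used only to show the shape of the three tasks (look back, describe now, wait and then speak up).

```python
# Minimal sketch of a RIVER-style streaming interaction loop.
# Every name here (the class, its methods, the toy frame data) is a
# hypothetical stand-in for illustration -- it is NOT the paper's code.

from dataclasses import dataclass, field


@dataclass
class StreamingVideoLLM:
    """Toy stand-in for a streaming video LLM with a rolling memory."""
    memory: list = field(default_factory=list)  # (timestamp, frame summary) pairs

    def ingest_frame(self, timestamp: int, summary: str) -> None:
        # A real model would encode pixels and compress its memory;
        # here we just keep a text summary of every frame seen so far.
        self.memory.append((timestamp, summary))

    def answer(self, question: str) -> str:
        # Retrospective memory / live perception: answer only from what
        # has been seen up to now -- no access to future frames.
        now = self.memory[-1][0]
        return f"[t={now}s] Q: {question} (answered from {len(self.memory)} frames seen)"

    def should_respond(self, trigger: str, latest_summary: str) -> bool:
        # Proactive response: stay silent until the trigger event appears.
        return trigger in latest_summary


def run_stream(model: StreamingVideoLLM, frames, trigger: str) -> None:
    """Feed frames one at a time, the way a live camera feed arrives."""
    for t, summary in frames:
        model.ingest_frame(t, summary)

        if t == 30:
            # Live perception: describe the current moment, instantly.
            print(model.answer("What is happening right this second?"))

        if model.should_respond(trigger, summary):
            # Proactive response: speak up the moment the event happens.
            print(f"[t={t}s] The dog just started {trigger}!")
            break

    # Retrospective memory: recall a detail from earlier in the stream.
    print(model.answer("What was the dog doing at the start of the video?"))


# Toy stream: one frame summary every 5 seconds; the dog starts barking at t=40.
frames = [(t, "dog sleeping" if t < 40 else "dog barking") for t in range(0, 60, 5)]
run_stream(StreamingVideoLLM(), frames, trigger="barking")
```

The constraint this sketch makes visible is the one RIVER cares about: the model only ever sees frames up to the current moment, so remembering, describing, and reacting all have to happen online, without rewinding or peeking ahead.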

The Results: Who Passed the Test?

The researchers tested many different AI models on this new RIVER game.

  • The "Big Brains" (Offline Models): Models like GPT-4o are very smart. If you give them the whole video file, they ace the test. But if you try to talk to them while the video is streaming, they struggle. They forget things quickly and can't react fast enough.
  • The "Specialized" Models: Some models were built specifically for streaming. They are better at the "Live" and "Proactive" parts, but they often have short memories. They might forget what happened an hour ago.
  • The "New Training" (The Fix): The authors didn't just test; they built a training dataset (a practice book) specifically for this kind of real-time interaction. When they taught an AI using this new book, the AI got much better at:
    • Remembering the past (Long-term memory).
    • Reacting to the present (Live perception).
    • Waiting for the right moment to speak (Proactive response).

Why Does This Matter?

Right now, we are moving toward a world where AI helps us in real time:

  • Robots: A robot that needs to understand a human's hand gestures while they are moving, not after they stop.
  • Augmented Reality (AR): Glasses that tell you, "That's a rare bird!" the moment you look at it.
  • Safety: A system that watches a factory floor and instantly shouts, "Stop! That machine is about to break!"

RIVER is the first ruler that can accurately measure if an AI is ready for this real-world, real-time life. It shows us that while AI is getting smarter, it still needs to learn how to "live in the moment" rather than just "reviewing the past."

In a Nutshell

The paper says: "We built a new test called RIVER to see if AI can chat with us while watching a video, not just after. We found that most AIs are bad at this, but we also found a way to train them to get much better at remembering the past, seeing the present, and predicting the future."