RIVER: A Real-Time Interaction Benchmark for Video LLMs

This paper introduces RIVER, a benchmark and framework designed to evaluate and improve the real-time interactive capabilities of video large language models. It targets their current limitations in online processing, long-term memory, and proactive anticipation through a three-task system: Retrospective Memory, Live-Perception, and Proactive Anticipation.

Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

Published 2026-03-05

Imagine you are watching a movie with a friend who has never seen it before. You want to chat about it while it's playing, not after it's finished.

  • Friend A (The Old Way): This friend waits until the movie is over, then says, "Okay, what was that thing the guy did 20 minutes ago?" They have to rewind the whole movie in their head to remember. They are great at summarizing the whole story, but they are terrible at chatting in real-time.
  • Friend B (The New Way): This friend watches the movie as it happens. They can say, "Look, the lion just sat down!" or "Wait, did you see where I put my keys 5 minutes ago?" or even "I think the hero is about to jump!" They are reacting to the now, remembering the past, and guessing the future all at once.

This paper introduces RIVER, a new "test" (benchmark) designed to see how good AI models are at being Friend B.

The Problem: The "Offline" AI

Currently, most of the powerful AI video models are like Friend A. They are "offline." They need to see the entire video file before they can answer a single question. If you try to talk to them while a video is streaming (like on a live camera feed), they get confused, forget what happened 30 seconds ago, or just freeze. They are great at analyzing a finished movie, but bad at living in the moment.

The Solution: The RIVER Test

The authors created RIVER (Real-tIme intERaction Benchmark for Video LLMs). Think of RIVER as a video game level designed specifically to test if an AI can handle a live conversation.

The test has three main "levels" or challenges (a small code sketch after the list shows how all three fit into one live loop):

  1. The "Retrospective Memory" Level (Looking Back):

    • The Challenge: The AI is watching a video. Suddenly, you ask, "What color was the shirt the guy was wearing 5 minutes ago?"
    • The Goal: The AI needs to have a good memory. It shouldn't just guess; it needs to recall specific details from the past without seeing the whole video at once.
    • Analogy: It's like playing a game of "20 Questions" about a story you are hearing for the first time, but you have to remember details from the beginning of the story while listening to the end.
  2. The "Live-Perception" Level (Living in the Now):

    • The Challenge: You ask, "What is happening right this second?"
    • The Goal: The AI must process the video frame-by-frame as it arrives and answer instantly. No waiting, no rewinding.
    • Analogy: This is like a sports commentator. They have to describe the goal as the ball crosses the line, not 10 seconds later.
  3. The "Proactive Response" Level (Predicting the Future):

    • The Challenge: You tell the AI, "Tell me the moment the dog starts barking."
    • The Goal: The AI has to keep watching silently until the specific event happens, then speak up immediately. It's not just answering; it's waiting for the right moment.
    • Analogy: Imagine you are a security guard watching a camera. You are told, "Yell 'Intruder!' the second someone opens the back door." You have to wait patiently and then react instantly.
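
To make these three challenges concrete, here is a minimal sketch of what a single streaming interaction loop might look like. This is not the paper's code or API: the StreamingVideoLLM class, its ingest_frame / answer / should_respond methods, and the toy "dog" stream are all hypothetical stand-ins, used only to show the shape of the three tasks (look back, describe now, wait and then speak up).

```python
# Minimal sketch of a RIVER-style streaming interaction loop.
# Every name here (the class, its methods, the toy frame data) is a
# hypothetical stand-in for illustration -- it is NOT the paper's code.

from dataclasses import dataclass, field


@dataclass
class StreamingVideoLLM:
    """Toy stand-in for a streaming video LLM with a rolling memory."""
    memory: list = field(default_factory=list)  # (timestamp, frame summary) pairs

    def ingest_frame(self, timestamp: int, summary: str) -> None:
        # A real model would encode pixels and compress its memory;
        # here we just keep a text summary of every frame seen so far.
        self.memory.append((timestamp, summary))

    def answer(self, question: str) -> str:
        # Retrospective memory / live perception: answer only from what
        # has been seen up to now -- no access to future frames.
        now = self.memory[-1][0]
        return f"[t={now}s] Q: {question} (answered from {len(self.memory)} frames seen)"

    def should_respond(self, trigger: str, latest_summary: str) -> bool:
        # Proactive response: stay silent until the trigger event appears.
        return trigger in latest_summary


def run_stream(model: StreamingVideoLLM, frames, trigger: str) -> None:
    """Feed frames one at a time, the way a live camera feed arrives."""
    for t, summary in frames:
        model.ingest_frame(t, summary)

        if t == 30:
            # Live perception: describe the current moment, instantly.
            print(model.answer("What is happening right this second?"))

        if model.should_respond(trigger, summary):
            # Proactive response: speak up the moment the event happens.
            print(f"[t={t}s] The dog just started {trigger}!")
            break

    # Retrospective memory: recall a detail from earlier in the stream.
    print(model.answer("What was the dog doing at the start of the video?"))


# Toy stream: one frame summary every 5 seconds; the dog starts barking at t=40.
frames = [(t, "dog sleeping" if t < 40 else "dog barking") for t in range(0, 60, 5)]
run_stream(StreamingVideoLLM(), frames, trigger="barking")
```

The constraint this sketch makes visible is the one RIVER cares about: the model only ever sees frames up to the current moment, so remembering, describing, and reacting all have to happen online, without rewinding or peeking ahead.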

The Results: Who Passed the Test?

The researchers tested many different AI models on this new RIVER game.

  • The "Big Brains" (Offline Models): Models like GPT-4o are very smart. If you give them the whole video file, they ace the test. But if you try to talk to them while the video is streaming, they struggle. They forget things quickly and can't react fast enough.
  • The "Specialized" Models: Some models were built specifically for streaming. They are better at the "Live" and "Proactive" parts, but they often have short memories. They might forget what happened an hour ago.
  • The "New Training" (The Fix): The authors didn't just test; they built a training dataset (a practice book) specifically for this kind of real-time interaction. When they taught an AI using this new book, the AI got much better at:
    • Remembering the past (Long-term memory).
    • Reacting to the present (Live perception).
    • Waiting for the right moment to speak (Proactive response).

Why Does This Matter?

Right now, we are moving toward a world where AI helps us in real time:

  • Robots: A robot that needs to understand a human's hand gestures while they are moving, not after they stop.
  • Augmented Reality (AR): Glasses that tell you, "That's a rare bird!" the moment you look at it.
  • Safety: A system that watches a factory floor and instantly shouts, "Stop! That machine is about to break!"

RIVER is the first ruler that can accurately measure if an AI is ready for this real-world, real-time life. It shows us that while AI is getting smarter, it still needs to learn how to "live in the moment" rather than just "reviewing the past."

In a Nutshell

The paper says: "We built a new test called RIVER to see if AI can chat with us while watching a video, not just after. We found that most AIs are bad at this, but we also found a way to train them to get much better at remembering the past, seeing the present, and predicting the future."