Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China), Zhuoran Jin (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China), Yupu Hao (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China), Yubo Chen (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China), Kang Liu (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China), Yulong Ao (Beijing Academy of Artificial Intelligence), Jun Zhao (The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China)

Published 2026-03-13

The Big Problem: The "Stop-and-Go" Bottleneck

Imagine you are watching a live sports broadcast with a friend who is an expert analyst.

The Old Way (Interleaved): Every time a new play happens on the field, your friend has to stop watching, turn to you, explain what just happened, write it down, and then turn back to the screen to watch the next play.
- The Result: By the time they finish talking about the first play, the second play has already happened. They miss details. If the game gets fast, they get overwhelmed, forget what happened five minutes ago, and start giving wrong answers. This is what current AI models do: they watch a little, talk a little, watch a little, talk a little. They can't do both at once.
The Paper's Solution (Think While Watching): Imagine your friend is a super-human who can watch the game and talk to you at the exact same time. They keep a running mental notebook. As the game plays, they jot down quick notes ("Player A is tired," "The ball is on the left"). When you ask a question, they instantly flip to the right page in their notebook and answer, all while their eyes never leave the screen.

The Core Idea: "Segment-Level Memory"

The paper proposes a system called Think While Watching. Here is how it works, broken down into three simple concepts:

1. The "Post-it Note" System (Segment-Level Memory)

Instead of trying to remember the entire video in high definition (which is impossible for a computer to hold in its short-term memory), the AI breaks the video into small chunks called segments (like 30-second clips).

The Analogy: Think of the video as a long movie. Every time a new 30-second clip plays, the AI doesn't try to memorize every pixel. Instead, it writes a Post-it note.
What's on the note? Just the important stuff: "A magician in a black coat," "The judge clapped," "The train is heading north."
The Benefit: These notes are stored in a "Memory Bank." When you ask a question later (even 10 minutes into the video), the AI doesn't need to re-watch the whole movie. It just looks at its Post-it notes to find the answer. This prevents Memory Erosion (forgetting the beginning of the video).

2. The "Dual-Track" Highway (Parallel Processing)

Most AI models are like a single-lane road: they can either Watch (ingest video) or Think (generate text), but not both. This causes traffic jams (latency).

The Innovation: The authors built a "Dual-Track Highway."
- Track A (The Eyes): Continuously watches the video and writes Post-it notes.
- Track B (The Brain): Simultaneously reads the notes and answers your questions.
The Result: The AI never stops watching to think, and it never stops thinking to watch. This solves the Serialization Bottleneck, making the AI feel truly real-time.
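The two tracks can be pictured with an ordinary producer/consumer setup. This is only a toy analogy using Python threads; the paper's actual mechanism operates inside the model, and the class and method names below (`DualTrack`, `push_segment`, `answer`) are made up for illustration.

```python
import queue
import threading


class DualTrack:
    """Toy analogy: one thread keeps 'watching' while the caller 'thinks'."""

    def __init__(self):
        self.notes = []                      # shared list of (segment_index, note)
        self._lock = threading.Lock()
        self._inbox = queue.Queue()          # segments waiting to be ingested
        self._watcher = threading.Thread(target=self._watch, daemon=True)
        self._watcher.start()

    def _watch(self):
        # Track A (The Eyes): runs on its own thread, so answering never blocks it.
        index = 0
        while True:
            segment = self._inbox.get()
            if segment is None:              # sentinel: stop watching
                self._inbox.task_done()
                break
            with self._lock:
                self.notes.append((index, segment))
            index += 1
            self._inbox.task_done()

    def push_segment(self, description):
        self._inbox.put(description)

    def wait_until_ingested(self):
        # Only used here to make the demo deterministic.
        self._inbox.join()

    def answer(self, keyword):
        # Track B (The Brain): reads a snapshot of the notes written so far.
        with self._lock:
            snapshot = list(self.notes)
        return [i for i, note in snapshot if keyword in note]

    def close(self):
        self._inbox.put(None)
        self._watcher.join()


tracks = DualTrack()
tracks.push_segment("magician in a black coat")
tracks.push_segment("the judge clapped")
tracks.wait_until_ingested()
first = tracks.answer("judge")

tracks.push_segment("the judge stands up")
tracks.wait_until_ingested()
second = tracks.answer("judge")
tracks.close()
```

The second answer sees one more note than the first, because the watcher kept ingesting between the two questions.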

3. The "Training Camp" (Three-Stage Learning)

You can't just teach an AI to do this overnight. The authors created a special training camp with three levels:

Level 1 (The Note-Taker): Teaches the AI how to watch a short clip and write a good summary note.
Level 2 (The Conversationalist): Teaches the AI how to handle a conversation where you ask multiple questions, forcing it to use its old notes to answer new questions.
Level 3 (The Marathon Runner): Teaches the AI how to handle very long videos (like a whole lecture or a movie) without getting confused by distractions or forgetting the start.

Why This Matters (The Results)

The paper tested this new method against existing AI models on two major benchmarks (StreamingBench and OVO-Bench).

Accuracy: The new method got significantly better at answering questions about live video streams. It didn't forget the beginning of the video like the old models did.
Efficiency: Because the answers draw on compact notes (the Post-it notes), the AI did not need to generate as much text to explain its reasoning. In multi-turn tests it produced 56% fewer output tokens while keeping accuracy competitive with the baseline.
Speed: It reduced the "Time to First Token" (how long you wait for the first word of an answer) by 92.6% compared to batch processing.

Summary in One Sentence

"Think While Watching" is a new AI system that acts like a super-attentive human: it continuously watches a video, takes quick notes on a digital notepad, and answers your questions in real time, without pausing its viewing to think and without forgetting what happened at the start.

1. Problem Statement

Multimodal Large Language Models (MLLMs) have achieved strong performance in offline video understanding, where the entire video is available before inference. However, they struggle in online streaming scenarios (e.g., live broadcasting, robotic assistants, monitoring) characterized by:

Continuously arriving streams: Videos are not fully available at the start; segments arrive sequentially.
Multi-turn interaction: Users can ask questions at arbitrary timestamps, requiring the model to answer based only on observed history.
Limitations of existing approaches: Current streaming methods typically use an interleaved perception-generation paradigm (process a segment, generate text, process next segment). This causes two critical issues:
1. Memory Erosion: As the stream grows, the model forgets early visual cues because interleaving generation with perception disrupts long-range dependency modeling.
2. Serialization Bottleneck: Autoregressive text generation halts further video ingestion, causing increasing latency (queueing delay) as the number of interaction turns accumulates.
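To make the queueing delay concrete, here is a small back-of-the-envelope simulation. It is a sketch under invented assumptions, not a measurement from the paper: segments arrive at a fixed interval, a single worker both ingests and generates in the interleaved case, and all timing values (`seg_interval`, `ingest_time`, `gen_time`, `question_every`) are hypothetical. In the decoupled case the sketch simply assumes generation never occupies the ingestion worker.

```python
def ingestion_lag(num_segments, seg_interval, ingest_time, gen_time,
                  question_every, decoupled):
    """Return, per segment, how long after arrival its ingestion finishes."""
    worker_free_at = 0.0
    lags = []
    for i in range(num_segments):
        arrival = i * seg_interval
        start = max(arrival, worker_free_at)   # wait if the worker is still busy
        finish = start + ingest_time
        lags.append(finish - arrival)
        worker_free_at = finish
        # Hypothetical workload: a question arrives after every
        # `question_every`-th segment. In the interleaved pipeline, answering
        # occupies the same worker and stalls the ingestion of later segments.
        if not decoupled and (i + 1) % question_every == 0:
            worker_free_at += gen_time
    return lags


# Illustrative numbers only (seconds).
settings = dict(num_segments=6, seg_interval=1.0, ingest_time=0.25,
                gen_time=3.0, question_every=2)
interleaved = ingestion_lag(decoupled=False, **settings)
decoupled = ingestion_lag(decoupled=True, **settings)

print("interleaved lag per segment:", interleaved)
print("decoupled lag per segment:  ", decoupled)
```

With these made-up numbers, the interleaved lag climbs as turns accumulate, while the decoupled lag stays at the ingestion time of a single segment.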

2. Methodology: Think While Watching (TWW)

The authors propose Think While Watching, a memory-anchored framework that decouples perception (watching) from generation (thinking) to enable real-time, multi-turn reasoning.

A. Core Architecture

Segment-Level Memory: Instead of processing frame-by-frame, the video is divided into segments ( $S_1, S_2, \dots$ ). For every observed segment, the model explicitly writes a compact memory note (e.g., key entities, actions, scene changes) to a persistent memory bank.
Decoupled Ingestion and Generation: The system allows the model to continue "watching" (ingesting new segments) while "thinking" (generating answers to previous questions). This is achieved via a Dual KV Cache mechanism, separating the visual input stream from the text decoding stream.
Implicit Retrieval: When a question arrives, the model answers by attending to the current question, dialogue history, and the relevant subset of the memory bank, rather than re-processing raw video frames.
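As a rough illustration of the memory bank, the sketch below stores one text note per segment and builds an answering context from the notes, dialogue history, and question. It is a simplification under stated assumptions: the paper describes implicit retrieval through attention, whereas this stand-in filters notes explicitly by timestamp, and the names `MemoryNote`, `MemoryBank`, and `build_context` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    segment_index: int
    end_time: float   # seconds since stream start when the segment ended
    text: str         # compact note: key entities, actions, scene changes


@dataclass
class MemoryBank:
    notes: list = field(default_factory=list)

    def write(self, end_time, text):
        """Append one note for the segment that just finished."""
        note = MemoryNote(len(self.notes), end_time, text)
        self.notes.append(note)
        return note

    def visible_at(self, question_time):
        """Only notes for segments fully observed by question_time."""
        return [n for n in self.notes if n.end_time <= question_time]


def build_context(bank, dialogue_history, question, question_time):
    """Assemble notes + history + question; raw frames are never revisited."""
    lines = [f"[S{n.segment_index + 1}] {n.text}"
             for n in bank.visible_at(question_time)]
    lines += dialogue_history
    lines.append(f"Q: {question}")
    return "\n".join(lines)


bank = MemoryBank()
bank.write(30.0, "A magician in a black coat walks on stage.")
bank.write(60.0, "The judge clapped.")
bank.write(90.0, "The train is heading north.")

context = build_context(bank, ["Q: Who is on stage? A: A magician."],
                        "What did the judge do?", question_time=65.0)
print(context)
```

A question asked at 65 seconds sees only the first two notes, which mirrors the requirement that answers rely solely on observed history.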

B. Training Strategy

To enforce strict causality and long-term consistency, the authors designed a Three-Stage Training pipeline with a specialized dataset:

Stage 1 (Single-Round CoT): Trains the model to write memory notes for individual segments and answer single-round questions.
Stage 2 (Multi-Round CoT): Trains the model to maintain consistency across multiple turns, ensuring later answers correctly reference earlier memory notes without peeking at future data.
Stage 3 (Long-Range Capability): Focuses on long videos (100–300+ frames) to improve long-term evidence recall, uncertainty handling (deferring answers when evidence is insufficient), and distractor robustness (ignoring irrelevant segments).

Key Technical Components:

Segment-Level Streaming Causal Mask: A custom attention mask that prevents generated tokens from attending to segments that had not yet arrived when they were generated, or to later generated tokens, ensuring strict streaming causality.
Streaming Positional Encoding (MRoPE): The authors decouple positional encoding for input and output streams. Input segments use cumulative offsets based on arrival time, while output tokens start from 0 independently. This allows new input segments to be assigned correct positions even while the output length is unknown.
Adaptive Attention Backend: To handle the non-standard causal patterns (where query length $\neq$ key length), the system dynamically switches between Flash Attention (for standard causal steps) and memory-efficient attention (for custom streaming masks).
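The first two components can be sketched in plain Python. This is a minimal sketch under several assumptions that go beyond the text: positions are one-dimensional (real MRoPE carries separate temporal and spatial components), all generated tokens share one output stream that starts at 0, and input tokens do not attend to generated text. The names `Token`, `assign_positions`, `can_attend`, and `build_mask` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    stream: str    # "in" for video-segment tokens, "out" for generated tokens
    segment: int   # "in": own segment index; "out": last segment received so far
    position: int  # position id within the token's own stream


def assign_positions(segment_lengths, output_schedule):
    """segment_lengths[i] is the token count of segment i.
    output_schedule is a list of (last_segment_seen, num_tokens) pairs."""
    tokens = []
    offset = 0                       # cumulative offset over arrived input tokens
    for seg, length in enumerate(segment_lengths):
        for _ in range(length):
            tokens.append(Token("in", seg, offset))
            offset += 1
    out_pos = 0                      # output positions start from 0 independently
    for last_seen, count in output_schedule:
        for _ in range(count):
            tokens.append(Token("out", last_seen, out_pos))
            out_pos += 1
    return tokens


def can_attend(query, key):
    if key.stream == "in":
        if query.stream == "in":
            return key.position <= query.position   # ordinary causal order
        return key.segment <= query.segment         # only segments already received
    if query.stream == "out":
        return key.position <= query.position       # no future generated tokens
    return False  # assumption: input tokens ignore generated text


def build_mask(tokens):
    """Boolean matrix: mask[q][k] is True when query q may attend to key k."""
    return [[can_attend(q, k) for k in tokens] for q in tokens]


# Three segments of 2 tokens each; 2 tokens generated after segment 0,
# then 1 more token generated after segment 2 has arrived.
tokens = assign_positions([2, 2, 2], [(0, 2), (2, 1)])
mask = build_mask(tokens)
```

Because input and output positions are counted separately, a newly arrived segment can be given its positions without knowing how many output tokens will eventually be generated.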

3. Key Contributions

Framework: Proposes Think While Watching, the first framework to maintain persistent segment-level memory and decouple perception from generation for online multi-turn video reasoning.
Dataset: Constructs a three-stage, stage-aligned Streaming Chain-of-Thought (CoT) dataset containing over 9,000 instances (including long-range dialogues from YouTube) specifically designed to train streaming causality and memory retention.
Efficiency & Performance: Demonstrates that decoupling ingestion and generation significantly reduces latency and token usage while improving accuracy compared to interleaved baselines.

4. Experimental Results

The method was evaluated on StreamingBench and OVO-Bench using Qwen3-VL backbones (2B, 4B, 8B).

Single-Round Accuracy:
- Improved accuracy by 2.6% on StreamingBench and 3.79% on OVO-Bench (Qwen3-VL-4B) compared to the baseline Thinking model.
- Significantly outperformed naive online baselines (e.g., Qwen3-VL-4B online dropped ~40% in accuracy without TWW; TWW recovered this gap).
Multi-Round Efficiency:
- In multi-turn protocols, TWW maintained competitive accuracy while reducing output tokens by 56% compared to the baseline.
- Latency: Reduced Time-To-First-Token (TTFT) by 92.6% compared to batch processing, effectively eliminating the serialization bottleneck.
Offline Transfer: The streaming-trained model also showed improved performance on offline long-video benchmarks (Video-MME, LV-Bench), indicating that streaming supervision enhances general temporal reasoning.

5. Significance

Solving the "Streaming Gap": This work addresses the critical gap between offline MLLM capabilities and the requirements of real-world streaming applications (live streaming, robotics).
Paradigm Shift: It moves away from the "interleaved" paradigm (which causes memory decay and latency) to a "parallel" paradigm (watching while thinking), which is essential for scalable, real-time multimodal agents.
Practical Deployment: By reducing token usage and latency, the method makes real-time video reasoning feasible on standard hardware, paving the way for interactive video assistants that can handle long, dynamic streams without losing context.

In summary, Think While Watching provides a robust architectural and training solution for MLLMs to reason over continuous video streams, effectively balancing long-term memory retention with real-time responsiveness.