Thinking in Streaming Video

Imagine you are watching a live cooking show with a friend who is asking you questions about what's happening.

The Old Way (Batch Processing):
Most current AI video models work like a student who waits until the entire movie is over before they are allowed to open their notebook and write an answer. They watch the whole 2-hour video, then spend a long time thinking, and finally say, "Okay, I know what happened."

The Problem: If your friend asks, "Where did the chef put the cutting board?" while the chef is still chopping, the AI has to wait until the end of the video to answer. By then the moment has passed, and the answer arrives too late to be useful in real time. Also, trying to remember every single frame of a 2-hour video takes up a massive amount of brainpower (memory).

The New Way (ThinkStream):
The paper introduces ThinkStream, which works more like a human watching that same cooking show. Instead of waiting, you watch, think, and speak as the video plays.

Here is how it works, broken down with simple analogies:

1. The "Watch–Think–Speak" Loop

Imagine you are a detective watching a crime scene unfold on a live feed.

Watch: You see a new piece of evidence (a video chunk) arrive.
Think: Instead of freezing, you immediately update your mental notes. You say to yourself, "Okay, the chef just moved the board to the sink. That's important."
Speak (or Stay Silent): You then decide: "Do I have enough info to answer my friend's question yet?"
- If Yes: You speak up immediately: "He put it near the sink!"
- If No: You stay silent and keep watching, waiting for the next clue.

This happens in a split second, over and over, as the video streams. The AI doesn't wait for the end; it reasons while it watches.

2. The "Magic Backpack" (Reasoning-Compressed Streaming Memory)

Here is the biggest challenge: If you watch a video for an hour, you can't remember every single pixel of every frame. Your brain (and the computer's memory) would explode.

The Old Problem: Traditional AI tries to save every single photo (visual token) from the video in its memory. Eventually, the backpack gets so heavy it can't move.
The ThinkStream Solution: ThinkStream uses a "Magic Backpack."
- As the video plays, the AI looks at the old scenes. Once it understands what happened (e.g., "The chef moved the board"), it writes a short summary note (a reasoning trace) in its backpack.
- Then, it throws away the heavy, detailed photos of that old scene to make space.
- The Magic: The summary note is tiny but keeps the meaning that matters from the photo. The AI keeps the "story" but deletes the "heavy pictures." This way, it can watch a 10-hour video without its backpack ever getting too heavy.

3. Training the AI (Streaming Reinforcement Learning)

How do you teach an AI to do this? You can't just tell it "be smart." You have to train it like a video game character.

The Game: The AI plays a game where it watches a video stream.
The Rules (Rewards):
1. Don't talk too early: If it answers before it has enough clues, it loses points.
2. Don't talk too late: If it waits too long, it loses points.
3. Be right: If it finally answers, it must be correct.
The Result: Through thousands of tries, the AI learns the perfect rhythm: Watch a bit, think a bit, decide if I know the answer, and then speak.

Why Does This Matter?

Real-Time Assistants: Imagine a robot butler helping you in your kitchen. If you drop a plate, it needs to know immediately and tell you, not wait until you finish cooking dinner to say, "Oh, you dropped a plate."
Low Cost: Because it throws away the "heavy photos" and keeps only the "light summaries," its memory use and per-step cost stay roughly flat as the video gets longer, so it can run on smaller, cheaper computers.
Better Memory: It can remember the story of a long video (like a whole day in a factory) without needing a supercomputer to store every second of footage.

In a nutshell: ThinkStream teaches AI to stop being a passive student who waits for the test to end, and start being an active observer who thinks, updates, and acts in real time, all while keeping its memory light and efficient.

tags) that integrates the new evidence with the accumulated context.
3. Speak: Based on the updated reasoning state, the model decides on an action ( $a_t$ ):
- <silent>: Continue observing (insufficient evidence).
- <response>: Output the final answer (sufficient evidence accumulated).

This structure ensures strict streaming causality, where reasoning evolves alongside the input stream.
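The loop can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names watch_think_speak, StreamState, and toy_policy are hypothetical, video chunks are stood in for by short text captions, and the "model" is a keyword rule. What it shows is the causal structure: at each step the policy sees only the current chunk and the thoughts accumulated so far, and it either stays silent or responds.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple

# Action tokens named in the summary; everything else here is hypothetical.
SILENT = "<silent>"
RESPONSE = "<response>"


@dataclass
class StreamState:
    question: str
    thoughts: List[str] = field(default_factory=list)  # accumulated reasoning traces


# A policy maps (state so far, new chunk) -> (thought, action, answer or None).
Policy = Callable[[StreamState, str], Tuple[str, str, Optional[str]]]


def watch_think_speak(
    chunks: Iterable[str], question: str, policy: Policy
) -> Tuple[Optional[int], Optional[str]]:
    """Run the loop; return (step index of the response, answer) or (None, None)."""
    state = StreamState(question)
    for t, chunk in enumerate(chunks):                   # Watch: one chunk at a time
        thought, action, answer = policy(state, chunk)   # Think: uses past state only
        state.thoughts.append(thought)
        if action == RESPONSE:                           # Speak: enough evidence
            return t, answer
        # otherwise the action is SILENT: keep watching
    return None, None


def toy_policy(state: StreamState, chunk: str) -> Tuple[str, str, Optional[str]]:
    """Keyword rule standing in for a real model (assumption for illustration)."""
    thought = f"saw: {chunk}"
    if "sink" in chunk:
        return thought, RESPONSE, "near the sink"
    return thought, SILENT, None


chunks = [
    "chef chops onions",
    "chef lifts the board",
    "board placed by the sink",
    "chef washes hands",
]
step, answer = watch_think_speak(
    chunks, "Where did the chef put the cutting board?", toy_policy
)
print(step, answer)
```

The response is emitted at the third chunk, before the stream ends, and the fourth chunk is never needed to answer.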

B. Reasoning-Compressed Streaming Memory (RCSM)

To address the memory bottleneck of long videos, the authors introduce RCSM.

Concept: Instead of retaining all dense visual tokens in the KV cache, the model treats intermediate reasoning traces as compact semantic memory.
Mechanism: As the video stream progresses, outdated visual tokens are evicted from the KV cache. However, the reasoning tokens generated during those steps are retained as long-term "semantic anchors."
Benefit: Many dense visual tokens are replaced by a much shorter run of reasoning tokens that summarize them, keeping the effective context length and inference cost stable regardless of video duration.
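A toy sketch of the eviction policy makes the saving concrete. Everything specific here is an assumption for illustration: the class name StreamingMemory, the window of 4 chunks, and the per-chunk counts of 256 visual and 16 reasoning tokens do not come from the paper. In this simplified version the reasoning tokens still accumulate, only far more slowly than visual tokens would; how the actual system bounds them is not described in this summary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class StreamingMemory:
    """Keeps visual tokens for recent chunks only; keeps all reasoning tokens."""
    visual_window: int  # number of recent chunks whose visual tokens stay cached (assumed)
    visual: Deque[List[int]] = field(default_factory=deque)
    reasoning: List[List[int]] = field(default_factory=list)

    def step(self, visual_tokens: List[int], reasoning_tokens: List[int]) -> None:
        self.visual.append(visual_tokens)
        self.reasoning.append(reasoning_tokens)
        while len(self.visual) > self.visual_window:
            self.visual.popleft()  # evict the oldest chunk's visual tokens

    def context_length(self) -> int:
        return sum(len(v) for v in self.visual) + sum(len(r) for r in self.reasoning)


VISUAL_PER_CHUNK = 256    # hypothetical
REASONING_PER_CHUNK = 16  # hypothetical
STEPS = 100

mem = StreamingMemory(visual_window=4)
for _ in range(STEPS):
    mem.step([0] * VISUAL_PER_CHUNK, [0] * REASONING_PER_CHUNK)

rcsm_len = mem.context_length()
baseline_len = STEPS * (VISUAL_PER_CHUNK + REASONING_PER_CHUNK)  # keep everything
print(rcsm_len, baseline_len)
```

Under these assumed numbers the context after 100 chunks holds 2,624 tokens instead of 27,200, because only the last four chunks contribute visual tokens.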

C. Streaming Reinforcement Learning with Verifiable Rewards (RLVR)

To train the model to generate appropriate reasoning traces and timing decisions, the authors employ a specialized RLVR framework:

Reward Components:
- Accuracy Reward ( $R_{acc}$ ): Ensures the final answer matches the ground truth (using deterministic formats like multiple-choice or binary questions for verifiability).
- Format Reward ( $R_{format}$ ): Enforces strict adherence to the <think>/<response>/<silent> structure.
- Time Reward ( $R_{time}$ ): Penalizes premature responses or excessive latency by measuring the temporal discrepancy between the model's response and the ground-truth event time.
Optimization: Uses Group Relative Policy Optimization (GRPO) to align the policy with these streaming constraints.
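The reward and the group-relative step can be illustrated as follows. The summary does not give the reward weights, the shape of the time reward, or the tolerance, so the linear time decay, the weights (1.0, 0.5, 0.5), and the 2-second tolerance below are all assumptions. The advantage computation follows the usual GRPO recipe of normalizing each rollout's reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev
from typing import List


def time_reward(t_response: float, t_event: float, tolerance: float = 2.0) -> float:
    """Assumed form: 1 at the event time, decaying linearly to 0 at +/- tolerance."""
    return max(0.0, 1.0 - abs(t_response - t_event) / tolerance)


def total_reward(
    correct: bool,
    well_formed: bool,
    t_response: float,
    t_event: float,
    w_acc: float = 1.0,     # hypothetical weight
    w_format: float = 0.5,  # hypothetical weight
    w_time: float = 0.5,    # hypothetical weight
) -> float:
    r_acc = 1.0 if correct else 0.0
    r_format = 1.0 if well_formed else 0.0
    r_time = time_reward(t_response, t_event)
    return w_acc * r_acc + w_format * r_format + w_time * r_time


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same question; the ground-truth event is at t = 10 s.
rollouts = [
    (True, True, 10.0),   # correct and on time
    (True, True, 11.0),   # correct, one second late
    (False, True, 10.0),  # on time but wrong
    (True, True, 6.0),    # correct but four seconds early
]
rewards = [total_reward(c, f, t, t_event=10.0) for c, f, t in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

With these assumed weights, the correct and on-time rollout receives the largest advantage and the wrong answer the smallest, so the policy update pushes toward answers that are both right and well timed.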

D. Efficient Streaming Inference Backend

To support the dynamic eviction of visual tokens and the insertion of reasoning tokens, the authors built a custom inference backend using CUDA Graphs.

It separates the process into a Prefill Phase (eager mode for new visual tokens) and a Decode-and-Prune Phase (captured as a replayable CUDA graph).
This allows for efficient in-place memory shifting and KV cache manipulation, enabling high-throughput decoding with dynamic context updates.
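A replayed CUDA graph needs buffers whose addresses and shapes stay fixed, which is one reason eviction is done by shifting entries inside a preallocated cache instead of reallocating it. The pure-Python sketch below only illustrates that in-place shift; FixedKVBuffer is a hypothetical stand-in using a list of labels, not the authors' backend, and no GPU code is involved.

```python
from typing import List, Optional


class FixedKVBuffer:
    """Toy preallocated cache: capacity never changes, only the live length does."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[str]] = [None] * capacity
        self.length = 0  # number of live entries at the front of the buffer

    def append(self, tokens: List[str]) -> None:
        if self.length + len(tokens) > len(self.slots):
            raise ValueError("buffer full")
        self.slots[self.length:self.length + len(tokens)] = tokens
        self.length += len(tokens)

    def evict(self, start: int, end: int) -> None:
        """Drop live entries [start, end) by shifting the tail left in place."""
        removed = end - start
        tail = self.slots[end:self.length]
        self.slots[start:start + len(tail)] = tail
        for i in range(self.length - removed, self.length):
            self.slots[i] = None  # clear the slots freed at the end
        self.length -= removed

    def live(self) -> List[Optional[str]]:
        return self.slots[:self.length]


buf = FixedKVBuffer(capacity=8)
buf.append(["v1", "v1", "t1"])  # chunk 1: two visual tokens, one reasoning token
buf.append(["v2", "v2", "t2"])  # chunk 2
buf.evict(0, 2)                 # evict chunk 1's visual tokens, keep its reasoning token
print(buf.live(), len(buf.slots))
```

After the eviction the reasoning token t1 sits at the front, followed by chunk 2, and the buffer still has its original eight slots.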

3. Key Contributions

Watch–Think–Speak Paradigm: A novel formulation of streaming video understanding as an incremental reasoning and interaction process, allowing models to decide when to respond based on accumulated evidence.
ThinkStream Framework & RCSM: A unified system that uses reasoning traces as compressed semantic memory to replace evicted visual tokens, solving the long-horizon memory problem without sacrificing coherence.
Streaming RLVR: A training scheme using automatically verifiable rewards to align incremental reasoning and response timing with real-time interaction requirements.
Dataset & Infrastructure: Construction of a large-scale dataset (110K cold-start instances, 9K RLVR instances) with time-grounded reasoning traces and an efficient CUDA-based streaming inference engine.

4. Experimental Results

The framework was evaluated on multiple streaming and offline benchmarks using a compact 3B parameter model (based on Qwen2.5-VL).

Streaming Performance:
- On OVO-Bench, ThinkStream-3B achieved an average score of 59.66, well above its base model (51.00) and other online models such as Streamo-3B (51.64) and the larger Flash-VStream-7B (28.37).
- On StreamingBench Real-Time, it scored 75.00, surpassing proprietary models like GPT-4o (73.28) and all other open-source online MLLMs.
Offline Performance:
- Despite aggressive visual token eviction, ThinkStream-3B maintained strong performance on offline benchmarks (VideoMME: 61.9, LongVideoBench: 56.4), indicating that it preserves long-horizon understanding capabilities.
Efficiency & Latency:
- Throughput: The custom CUDA Graph engine achieved a roughly 5x speedup in token decoding compared to standard eager implementations (e.g., 154 tokens/s vs. 30 tokens/s at batch size 1).
- Latency: While baseline models violated real-time thresholds (latency > 1.0s) as video length increased, ThinkStream maintained a flat end-to-end latency below 0.5s (required for 2 FPS inputs), regardless of video duration.
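The efficiency figures above can be checked with a little arithmetic. The throughput numbers (154 and 30 tokens/s) and the 2 FPS input rate come from the text; treating the whole per-frame budget as decode time, with no allowance for prefill, is a simplifying assumption, and the helper names are hypothetical.

```python
def frame_budget_s(fps: float) -> float:
    """Time available per frame before the next one arrives."""
    return 1.0 / fps


def tokens_within_budget(tokens_per_s: float, budget_s: float) -> int:
    """Upper bound on tokens decodable in the budget (ignores prefill time)."""
    return int(tokens_per_s * budget_s)


budget = frame_budget_s(2.0)                      # 2 FPS input
speedup = 154 / 30                                # CUDA Graph engine vs. eager
graph_tokens = tokens_within_budget(154, budget)
eager_tokens = tokens_within_budget(30, budget)

print(f"budget per frame: {budget:.2f} s")
print(f"speedup: {speedup:.2f}x")
print(f"tokens per frame budget: {graph_tokens} vs. {eager_tokens}")
```

At 2 FPS each step has 0.5 s, which is why 0.5 s is the real-time threshold. Under the stated assumption that leaves room for at most 77 decoded tokens per frame with the graph engine, against 15 in eager mode, so the speedup directly bounds how long a reasoning trace can be while keeping up with the stream.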

5. Significance

ThinkStream moves multimodal AI from passive, batch-oriented video analysis toward active, real-time reasoning. By treating reasoning traces as a form of compressed memory, it addresses the scalability problem of long video streams. This enables the deployment of intelligent agents in dynamic environments (e.g., robotics, real-time monitoring, and interactive assistants) that can "think while watching," make timely decisions, and operate efficiently within strict hardware constraints. The release of the code, models, and dataset should further support research in streaming video reasoning.
