Thinking in Streaming Video

This paper introduces ThinkStream, a framework that enables real-time video reasoning through an incremental "Watch–Think–Speak" paradigm, supported by a novel Reasoning-Compressed Streaming Memory mechanism and a specialized reinforcement learning scheme to achieve low-latency, high-performance understanding in continuous video streams.

Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu

Published 2026-03-16

Imagine you are watching a live cooking show with a friend who is asking you questions about what's happening.

The Old Way (Batch Processing):
Most current AI video models work like a student who waits until the entire movie is over before they are allowed to open their notebook and write an answer. They watch the whole 2-hour video, then spend a long time thinking, and finally say, "Okay, I know what happened."

  • The Problem: If your friend asks, "Where did the chef put the cutting board?" while the chef is still chopping, the AI has to wait until the end of the video to answer. By then, the moment has passed, and the answer arrives too late to be helpful in real time. Trying to remember every single frame of a 2-hour video also takes up a massive amount of brainpower (memory).

The New Way (ThinkStream):
The paper introduces ThinkStream, which works more like a human watching that same cooking show. Instead of waiting, you watch, think, and speak as the video plays.

Here is how it works, broken down with simple analogies:

1. The "Watch–Think–Speak" Loop

Imagine you are a detective watching a crime scene unfold on a live feed.

  • Watch: You see a new piece of evidence (a video chunk) arrive.
  • Think: Instead of freezing, you immediately update your mental notes. You say to yourself, "Okay, the chef just moved the board to the sink. That's important."
  • Speak (or Stay Silent): You then decide: "Do I have enough info to answer my friend's question yet?"
    • If Yes: You speak up immediately: "He put it near the sink!"
    • If No: You stay silent and keep watching, waiting for the next clue.

This happens in a split second, over and over, as the video streams. The AI doesn't wait for the end; it reasons while it watches.
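
The loop above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's implementation: the function names (watch_think_speak, toy_think), the use of text strings as stand-ins for video chunks, and the keyword-based "thinking" are all assumptions made for the example. A real system would call a video-language model at the Think step.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical signature: (chunk, notes so far, question) -> (new note, answer or None)
ThinkFn = Callable[[str, List[str], str], Tuple[str, Optional[str]]]


def watch_think_speak(
    chunks: Iterable[str], question: str, think: ThinkFn
) -> Iterator[Tuple[int, str]]:
    """Process a stream chunk by chunk, speaking only when an answer is ready."""
    notes: List[str] = []
    for step, chunk in enumerate(chunks):             # Watch: a new chunk arrives
        note, answer = think(chunk, notes, question)  # Think: update mental notes
        notes.append(note)
        if answer is not None:                        # Speak: only if ready
            yield step, answer
        # Otherwise stay silent and keep watching.


def toy_think(chunk: str, notes: List[str], question: str) -> Tuple[str, Optional[str]]:
    """Stand-in for a model: 'answers' once the chunk mentions the sink."""
    note = f"saw: {chunk}"
    answer = "near the sink" if "sink" in chunk else None
    return note, answer


stream = ["chef chops onions", "chef moves board to sink", "chef stirs pot"]
question = "Where did the chef put the cutting board?"
responses = list(watch_think_speak(stream, question, toy_think))
print(responses)
```

In this toy run the loop stays silent on the first chunk, answers on the second (index 1), and stays silent again on the third, without ever waiting for the stream to end.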

2. The "Magic Backpack" (Reasoning-Compressed Streaming Memory)

Here is the biggest challenge: If you watch a video for an hour, you can't remember every single pixel of every frame. Your brain (and the computer's memory) would explode.

  • The Old Problem: Traditional AI tries to save every single photo (visual token) from the video in its memory. Eventually, the backpack gets so heavy it can't move.
  • The ThinkStream Solution: ThinkStream uses a "Magic Backpack."
    • As the video plays, the AI looks at the old scenes. Once it understands what happened (e.g., "The chef moved the board"), it writes a short summary note (a reasoning trace) in its backpack.
    • Then, it throws away the heavy, detailed photos of that old scene to make space.
    • The Magic: The summary note is tiny but keeps the key meaning of the scene. The AI keeps the "story" but deletes the "heavy pictures." This way, its backpack stays light even when the video runs for hours.
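
A minimal sketch of the backpack idea, under stated assumptions: the class name StreamingMemory, the max_visual_chunks budget, and the simple "evict the oldest chunk" rule are all hypothetical choices for illustration. The summary above does not say exactly when or how ThinkStream discards visual tokens, so treat this as one plausible reading rather than the paper's mechanism.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class StreamingMemory:
    """Keeps recent visual tokens in full, older scenes only as short notes."""

    max_visual_chunks: int = 2  # hypothetical budget for "heavy photos"
    visual: Deque[List[float]] = field(default_factory=deque)
    notes: List[str] = field(default_factory=list)

    def add(self, visual_tokens: List[float], reasoning_note: str) -> None:
        """Store a new chunk with its note, then evict the oldest visual tokens."""
        self.visual.append(visual_tokens)
        self.notes.append(reasoning_note)  # notes are never evicted
        while len(self.visual) > self.max_visual_chunks:
            self.visual.popleft()          # throw away the oldest "photos"

    def visual_token_count(self) -> int:
        return sum(len(tokens) for tokens in self.visual)


memory = StreamingMemory(max_visual_chunks=2)
for step in range(5):
    fake_tokens = [0.0] * 100  # pretend each chunk costs 100 visual tokens
    memory.add(fake_tokens, f"note for chunk {step}")

print(memory.visual_token_count(), len(memory.notes))
```

After five chunks, only the two most recent chunks' visual tokens remain (200 instead of 500), while all five notes are kept. The visual cost stays bounded by the budget no matter how long the stream runs; only the much smaller notes keep growing.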

3. Training the AI (Streaming Reinforcement Learning)

How do you teach an AI to do this? You can't just tell it "be smart." You have to train it like a video game character.

  • The Game: The AI plays a game where it watches a video stream.
  • The Rules (Rewards):
    1. Don't talk too early: If it answers before it has enough clues, it loses points.
    2. Don't talk too late: If it waits too long, it loses points.
    3. Be right: When it does answer, the answer must be correct to earn points.
  • The Result: Through thousands of tries, the AI learns the perfect rhythm: Watch a bit, think a bit, decide if I know the answer, and then speak.
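
The three rules can be combined into a single toy scoring function. The formula below is an assumption made for illustration: the summary does not give the paper's actual reward, and the names (streaming_reward, evidence_step, late_penalty) and the specific point values are hypothetical.

```python
from typing import Optional


def streaming_reward(
    answer_step: Optional[int],
    answer: Optional[str],
    evidence_step: int,
    truth: str,
    late_penalty: float = 0.1,  # hypothetical cost per step of delay
) -> float:
    """Toy reward: punish early answers, discount late ones, require correctness."""
    if answer_step is None:
        return 0.0                       # never spoke: no points
    if answer_step < evidence_step:
        return -1.0                      # rule 1: talked before the clue appeared
    correctness = 1.0 if answer == truth else 0.0   # rule 3: be right
    delay = answer_step - evidence_step
    return max(correctness - late_penalty * delay, -1.0)  # rule 2: lateness costs


on_time = streaming_reward(1, "near the sink", evidence_step=1, truth="near the sink")
too_early = streaming_reward(0, "near the sink", evidence_step=1, truth="near the sink")
too_late = streaming_reward(4, "near the sink", evidence_step=1, truth="near the sink")
print(on_time, too_early, round(too_late, 2))
```

Here a correct answer given exactly when the evidence appears scores highest, an answer given before the evidence is penalized even if it happens to match, and a correct but delayed answer earns partial credit.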

Why Does This Matter?

  • Real-Time Assistants: Imagine a robot butler helping you in your kitchen. If you drop a plate, it needs to know immediately and tell you, not wait until you finish cooking dinner to say, "Oh, you dropped a plate."
  • Low Cost: Because it throws away the "heavy photos" and keeps only the "light summaries," it needs less memory, so it can run on smaller, cheaper computers without slowing down as the video gets longer.
  • Better Memory: It can remember the story of a long video (like a whole day in a factory) without needing a supercomputer to store every second of footage.

In a nutshell: ThinkStream teaches AI to stop being a passive student who waits for the movie to end, and start being an active observer who thinks, updates, and acts in real time, all while keeping its memory light and efficient.
