Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

This paper introduces Think-as-You-See (TaYS), a unified framework that enables concurrent, streaming Chain-of-Thought reasoning in Large Vision-Language Models. By decoupling visual encoding from textual reasoning, TaYS outperforms traditional batch and interleaved approaches in both accuracy and latency for real-time video understanding.

Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen

Published 2026-03-09

The Big Problem: The "Wait-and-See" Bottleneck

Imagine you are watching a live cooking show on TV.

  • The Old Way (Batch Inference): Current AI models act like a very diligent but slow student. They watch the entire video from start to finish, pause the show, and then say, "Okay, now I will think about what happened." They start reasoning only after the video is completely over.

    • The Result: If the video is 10 minutes long, the AI waits 10 minutes before saying a single word. By the time it speaks, the "live" moment is gone. It's like trying to comment on a soccer game after the final whistle blows.
  • The "Naive" Streaming Way: Some newer models try to watch and think at the same time, but they do it clumsily. They watch one second, stop to write a sentence, watch the next second, stop to write another sentence.

    • The Result: It's like a driver who has to stop the car completely every time they want to check the rearview mirror. It's safer than waiting, but it's still jerky, slow, and inefficient.

The Solution: Think-as-You-See (TaYS)

The authors propose a new framework called Think-as-You-See (TaYS). Imagine a professional sports commentator who is so skilled they can describe the play while it is happening, without ever missing a beat or stopping the flow.

TaYS allows the AI to watch and think simultaneously, just like a human does. As the video frames arrive, the AI processes them and generates thoughts instantly.

How It Works: The Three Magic Tricks

To make this happen, the researchers built three specific "gadgets" inside the AI's brain:

1. The "One-Way Glass" (Streaming Attention Mask)

  • The Problem: In a normal video, if the AI looks at the future (what happens next), it cheats. It's like a student peeking at the answer key before taking the test.
  • The Fix: TaYS installs a "one-way glass" in the AI's attention mechanism. The AI can look at everything it has already seen, but it is physically blocked from seeing the future frames. This forces the AI to reason based only on the current reality, keeping its thoughts grounded in what is actually happening right now.
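The "one-way glass" idea can be sketched as a simple mask-building function. Here, each token (a video frame or a thought) carries the timestamp at which it became available, and a token may attend only to tokens that arrived no later than itself. The timestamp-based rule and the function name are illustrative assumptions, not the paper's exact mask construction.

```python
def streaming_mask(arrival_times):
    """Build a 'one-way glass' attention mask.

    arrival_times[i] is the time at which token i (a video frame
    or a generated thought token) became available. mask[i][j] is
    True when token i is allowed to attend to token j -- i.e.
    only when j arrived no later than i. Future frames are blocked.

    Illustrative sketch; the paper's exact mask may differ.
    """
    n = len(arrival_times)
    return [
        [arrival_times[j] <= arrival_times[i] for j in range(n)]
        for i in range(n)
    ]

# Frames arrive at t=0, 1, 2; a thought token generated at t=1
# (the last entry) may see the frames at t=0 and t=1, but the
# frame at t=2 is behind the glass.
mask = streaming_mask([0, 1, 2, 1])
```

In a real model this boolean mask would be converted to additive `-inf` entries before the softmax, but the causal rule is the same.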

2. The "Dual Address Book" (Decoupled Positional Encoding)

  • The Problem: Imagine you are writing a story where you are adding new pages (video frames) and new sentences (thoughts) at the same time. If you use one single numbering system, the numbers get confused. "Is page 5 the 5th second of the video, or the 5th thought?" This confusion makes the AI lose its place in time.
  • The Fix: TaYS gives the video and the thoughts their own separate "address books." The video frames have their own timeline, and the thoughts have their own. They can grow independently without bumping into each other, ensuring the AI always knows exactly when something happened relative to its thoughts.
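The "dual address book" can be sketched as two independent position counters, one per stream. Interleaving new thoughts never shifts the positions of the video frames, and vice versa. The token tags and counter scheme below are illustrative assumptions, not the paper's exact positional encoding.

```python
def assign_positions(tokens):
    """Give frames and thoughts separate 'address books'.

    Each token is tagged "frame" or "thought". Each stream keeps
    its own counter, so positions grow independently: inserting a
    thought does not renumber the video timeline.

    Illustrative sketch; the paper's actual encoding may differ.
    """
    counters = {"frame": 0, "thought": 0}
    positions = []
    for kind in tokens:
        positions.append((kind, counters[kind]))
        counters[kind] += 1
    return positions

# Interleaved arrival order: frame, frame, thought, frame, thought
pos = assign_positions(["frame", "frame", "thought", "frame", "thought"])
```

With a single shared counter, the third frame would sit at position 3; here it stays at frame-position 2 no matter how many thoughts were interleaved.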

3. The "Two-Lane Highway" (Parallel Dual KV-Cache)

  • The Problem: In old systems, the AI has a single-lane road for memory. It has to finish loading the video into memory before it can start writing thoughts. This creates a traffic jam.
  • The Fix: TaYS builds a two-lane highway.
    • Lane 1 (Visual): The AI is constantly downloading new video frames into its memory.
    • Lane 2 (Reasoning): At the exact same time, the AI is generating thoughts based on what it just downloaded.
    • Because these lanes are separate but connected, the AI never has to stop to "load" the video. It's like a chef who is chopping vegetables (watching) while simultaneously stirring the pot (thinking).
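The two-lane highway can be sketched as a cache with two append-only lanes: frame ingestion writes to one, thought generation writes to the other, and attention at each step reads the union of both. This is a deliberately simplified assumption-laden sketch; a real KV-cache stores per-layer key/value tensors, not raw strings, and the class and method names are invented here for illustration.

```python
class DualKVCache:
    """Two-lane memory: a visual lane and a reasoning lane.

    Frame ingestion appends to the visual lane while decoding
    appends to the reasoning lane; neither lane waits for the
    other, so the model never stalls to 'load' the video.

    Simplified sketch -- real caches hold key/value tensors.
    """

    def __init__(self):
        self.visual = []      # lane 1: encoded video frames
        self.reasoning = []   # lane 2: generated thought tokens

    def ingest_frame(self, frame):
        self.visual.append(frame)

    def append_thought(self, token):
        self.reasoning.append(token)

    def context(self):
        # What attention can read at this step: both lanes together.
        return self.visual + self.reasoning

cache = DualKVCache()
cache.ingest_frame("frame_0")
cache.append_thought("the chef picks up a knife")
cache.ingest_frame("frame_1")  # no stall: the lanes grow independently
```

The key property is that `ingest_frame` and `append_thought` can interleave in any order, which is exactly what lets watching and thinking proceed in parallel.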

Why Does This Matter? (The Results)

The researchers tested TaYS on a benchmark called VideoEspresso (a test of video reasoning skills). Here is what happened:

  • Speed: The "Time-to-First-Token" (how long it takes to say the first word) dropped from 10.6 seconds (waiting for the whole video) to near zero. It's instant.
  • Accuracy: The AI got 2.9% more accurate. Because it wasn't waiting until the end, it didn't "forget" the beginning of the video (a problem called "temporal drift").
  • Realism: The AI's thoughts matched the video events much more closely. Instead of saying "The guy cooked the whole meal," it could say, "Right now, he is chopping the onions," with perfect timing.

The Bottom Line

Think-as-You-See changes AI from a "post-game analyst" who waits for the game to end, into a "live commentator" who understands the action as it unfolds.

This is a massive step toward real-time AI that can help robots navigate the world, assist in live surgeries, or manage traffic systems, where waiting for the "whole picture" before thinking is simply too slow to be useful.