Imagine you are trying to remember a very long movie so you can answer specific questions about it later.
The Problem: The "Overloaded Brain"
Current AI models trying to understand long videos are like students trying to study for a massive exam by reading every single word of a 1,000-page book at once. They try to keep every detail in their short-term memory (called the KV cache).
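A quick back-of-the-envelope calculation shows why this memory fills up so fast. The specific numbers here (frame rate, tokens per frame, context budget) are illustrative assumptions, not figures from the paper:

```python
# How fast does the KV cache fill up for a long video?
video_minutes = 60
frames = video_minutes * 60 * 1          # sampled at 1 frame per second
tokens_per_frame = 196                   # a common visual-token count per frame
total_tokens = frames * tokens_per_frame

context_budget = 128_000                 # a typical LLM context window
print(total_tokens)                      # 705600 tokens for one hour of video
print(total_tokens / context_budget)     # several times over budget
```

So even at a modest one frame per second, a single hour of video overflows a typical context window several times over, and raising the tokens per frame only makes it worse.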
The researchers found a weird glitch: when they tried to give the AI more detail (more visual "tokens" per frame) to make it smarter, the AI actually got dumber.
- The Glitch: Instead of remembering the whole movie, the AI started acting like it only remembered the end of the movie. It was like a student who, when asked about the beginning of the story, just guessed based on the last chapter because their brain was so full of recent details that the old ones got fuzzy.
- The Result: If you asked, "What happened in the first scene?" the AI would look at the last scene and get it wrong.
The Solution: MemStream (The Smart Librarian)
The authors created a new system called MemStream. Think of it as hiring a super-smart librarian to organize the movie for you. They solved the problem in two clever ways:
1. Adaptive Key Selection (AKS) – "The Trash Can for Boring Parts"
Instead of trying to remember every single pixel of every frame (which is like trying to memorize the color of every brick in a wall), the AI now acts like a smart editor.
- How it works: As the video plays, the AI looks at the frames. If two frames are almost identical (like a person standing still for 5 seconds), it says, "I don't need to remember both of these; they are redundant." It throws away the boring, repetitive parts and keeps only the unique, important details.
- The Analogy: Imagine you are taking notes on a lecture. Instead of writing down every "um" and "ah," you only write down the key concepts. This keeps your notebook (memory) clean and focused, so you don't get overwhelmed.
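The note-taking idea above can be sketched in a few lines. This is a minimal illustration of redundancy-based frame filtering, assuming each frame is summarized as a feature vector; the cosine-similarity test and the threshold are illustrative choices, not the paper's exact criterion:

```python
import numpy as np

def select_key_frames(frame_features, similarity_threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame.

    frame_features: list of 1-D feature vectors, one per frame.
    Frames too similar to the previously kept frame are treated as
    redundant and dropped.
    """
    kept_indices = []
    last_kept = None
    for i, feat in enumerate(frame_features):
        if last_kept is None:
            kept_indices.append(i)
            last_kept = feat
            continue
        cos_sim = np.dot(feat, last_kept) / (
            np.linalg.norm(feat) * np.linalg.norm(last_kept)
        )
        if cos_sim < similarity_threshold:  # frame is "new enough" to keep
            kept_indices.append(i)
            last_kept = feat
    return kept_indices

# A toy video: three near-identical frames, then a scene change.
scene_a = np.ones(16)
scene_b = np.concatenate([np.ones(8), -np.ones(8)])
frames = [scene_a, scene_a + 0.001, scene_a + 0.002, scene_b]
print(select_key_frames(frames))  # → [0, 3]: the near-duplicates are dropped
```

The key design choice is comparing each frame to the *last kept* frame rather than its immediate neighbor, so a slow drift across many frames still eventually registers as new content.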
2. Retrieval Mixture-of-Experts (MoE) – "The Panel of Judges"
When you ask the AI a question (e.g., "How many cucumbers did the character pick?"), the AI needs to find the right part of the video to answer.
- The Old Way: The AI tried to find the answer using only its own internal memory. Sometimes, it would look in the wrong place because its internal "search engine" was biased toward the end of the video.
- The New Way: MemStream brings in external experts (other specialized AI models) to help.
- Expert A (The Internal AI) says: "I think it's in the middle."
- Expert B (An external visual model) says: "I see a cucumber scene near the start."
- The Decision: Instead of picking just one, MemStream combines their opinions. It's like a panel of judges voting. If two experts agree on a specific scene, that's the one they go with. This makes the search much more accurate and less likely to be biased toward the end of the video.
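The "panel of judges" can be sketched as a simple score fusion. Each expert assigns a relevance score to every video segment; the scores are softmax-normalized so no single expert dominates, then averaged. The expert names and the equal-weight average are illustrative assumptions, not the paper's exact mixture:

```python
import numpy as np

def combine_expert_scores(expert_scores, weights=None):
    """Fuse per-segment relevance scores from several retrieval 'experts'.

    expert_scores: dict mapping expert name -> list of scores, one per
    video segment. Returns the index of the winning segment.
    """
    names = list(expert_scores)
    if weights is None:  # equal weighting by default (an assumption)
        weights = {name: 1.0 / len(names) for name in names}
    fused = np.zeros(len(next(iter(expert_scores.values()))))
    for name in names:
        s = np.asarray(expert_scores[name], dtype=float)
        probs = np.exp(s - s.max())      # softmax-normalize each expert
        probs /= probs.sum()
        fused += weights[name] * probs
    return int(np.argmax(fused))

# Segment 0 = start, 1 = middle, 2 = end of the video.
scores = {
    "internal_attention": [0.2, 0.9, 0.8],  # biased toward the end
    "external_visual":    [0.9, 0.3, 0.1],  # spots the scene near the start
    "caption_matcher":    [0.8, 0.2, 0.4],
}
print(combine_expert_scores(scores))  # → 0: two experts agree on the start
```

Because two of the three experts vote for the opening segment, the fused score picks it even though the internal model's end-of-video bias points elsewhere.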
The Result
By cleaning up the memory (AKS) and using a team of experts to search it (MoE), MemStream can watch long videos without getting confused.
- Real-world win: In a test, when asked about a video of someone picking vegetables, the old AI guessed "6 cucumbers" (looking at the wrong part), while MemStream correctly identified "3 cucumbers" by finding the exact moment it happened.
In a Nutshell
MemStream stops the AI from trying to memorize everything and instead teaches it to filter out the noise and ask for help from specialists when it needs to find an answer. This allows it to handle long, complex videos with much better accuracy.