From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture that leverages Fuzzy-Trace Theory and a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, thereby enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia

Published 2026-03-03

Imagine you are trying to remember a whole movie you watched yesterday to answer a tricky question about it.

If you try to remember every single frame (every blink, every background detail), your brain gets overwhelmed, and you can't find the important stuff quickly. That's like the current "Vision-Centric" AI models: they try to save everything, get slow, and get confused.

On the other hand, if you only remember a text summary like "A guy walked into a room and yelled," you save space, but you lose the details. If someone asks, "What color was his shirt?" or "Did he drop a cup?", you have no idea. That's the "Text-Centric" models: they are fast but often make things up (hallucinate) because they forgot the visual proof.

The Paper's Solution: MM-Mem
This paper introduces a new AI memory system called MM-Mem. It's inspired by how human brains actually work, based on a theory called Fuzzy-Trace Theory.

Think of MM-Mem not as a single hard drive, but as a three-story library or a pyramid that organizes your memories in a smart way.

1. The Three Levels of Memory (The Pyramid)

  • Level 1: The Sensory Buffer (The "Raw Footage" Basement)
    • What it is: This is where the AI keeps the "verbatim" details. Think of it as a warehouse full of raw video clips and exact subtitles.
    • Analogy: It's like keeping the original, unedited 4K video files on a hard drive. It's huge and detailed, but you don't look here unless you really need to.
  • Level 2: The Episodic Stream (The "Highlight Reel" Middle Floor)
    • What it is: This is a summary of events. The AI groups similar moments together.
    • Analogy: This is like a "Best Of" DVD or a highlight reel. Instead of remembering every second of a soccer game, you remember "Goal scored at 10 mins," "Red card at 45 mins." It captures the gist of what happened without the noise.
  • Level 3: The Symbolic Schema (The "Book Index" Top Floor)
    • What it is: This is the high-level abstract knowledge. It's a map of the story, characters, and main themes.
    • Analogy: This is the table of contents or the index at the back of a book. It tells you where to find things. "The villain appears in Chapter 3." It doesn't have the details, but it knows the structure.
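As a concrete sketch, the three floors can be pictured as nested data structures, where each level keeps pointers down to the evidence below it. All class and field names here are illustrative stand-ins, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SensoryBuffer:
    """Level 1: verbatim frames and subtitles, keyed by timestamp."""
    frames: dict  # timestamp -> raw frame features / exact subtitle text

@dataclass
class EpisodicEvent:
    """Level 2: one entry in the 'highlight reel'."""
    start: float
    end: float
    summary: str       # e.g. "Goal scored at 10 mins"
    frame_refs: list   # timestamps pointing back into the sensory buffer

@dataclass
class SymbolicSchema:
    """Level 3: the 'book index' of characters, themes, and structure."""
    facts: dict        # e.g. {"main_character": "John"}
    event_refs: dict   # fact -> indices of events that support it

@dataclass
class PyramidalMemory:
    buffer: SensoryBuffer
    stream: list       # list of EpisodicEvent
    schema: SymbolicSchema
```

The key design point is that nothing is ever orphaned: a fact in the schema can be traced to the events that support it, and an event can be traced to the raw frames, so the "proof" is always one hop away.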

2. How It Builds Memory (The "Smart Sifter")

The paper introduces a special tool called SIB-GRPO. Imagine you are a chef making a soup.

  • Old way: Throw everything into the pot (too much, messy). Or, just guess the recipe from memory (tastes wrong).
  • MM-Mem way: You have a smart strainer. As you cook, you constantly ask: "Do I need this ingredient for the final taste?" If it's redundant (like adding more salt to a soup that's already salty), you throw it away. If it's a key flavor, you keep it.
  • The Result: The AI learns to keep only the "flavor" (important meaning) and discard the "water" (boring, repetitive details), saving space while keeping the taste perfect.
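The real SIB-GRPO module is trained, not hand-coded, but the "strainer" intuition can be illustrated with a toy redundancy filter that keeps a frame embedding only if it differs enough from everything already kept. The `sift` function and the 0.9 threshold are made up for this example:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sift(embeddings, threshold=0.9):
    """Toy stand-in for the bottleneck: keep a frame only if it is
    sufficiently different from everything already kept ("flavor"),
    and drop near-duplicates ("water")."""
    kept = []
    for e in embeddings:
        if all(cosine(e, k) < threshold for k in kept):
            kept.append(e)
    return kept
```

The learned version goes further: instead of a fixed similarity threshold, it asks whether a detail helps answer downstream questions, which is exactly the "final taste" test in the soup analogy.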

3. How It Finds Answers (The "Drill-Down" Strategy)

When you ask the AI a question, it doesn't just dump all its memory on you. It uses a Top-Down strategy, guided by a "confidence meter" (technically, the entropy of its answer distribution: high entropy means low confidence).

  • Step 1: It starts at the Top Floor (Symbolic Schema). "Do I know the answer from the index?"
    • Example: "Who is the main character?" -> "Yes, it's John." (Fast! Done.)
  • Step 2: If the AI feels uncertain (the confidence meter drops), it "drills down" to the Middle Floor (Episodic Stream).
    • Example: "Did John wear a hat?" -> The index doesn't say. The AI checks the highlight reel. "Ah, I see a scene where he wears a hat."
  • Step 3: If it's still unsure, it goes all the way down to the Basement (Sensory Buffer).
    • Example: "What color was the hat?" -> The highlight reel just said "hat." The AI now pulls up the specific raw video frame to check the exact shade of blue.
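The three steps above can be sketched as an entropy-gated cascade. This is a hypothetical mock-up rather than the paper's code: `answer` walks from the most abstract memory level down, stopping as soon as the entropy of the answer distribution falls below a confidence threshold:

```python
import math

def entropy(probs):
    """Shannon entropy of an answer distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def answer(question, levels, max_entropy=0.5):
    """Top-down drill-down: `levels` is a list of (name, query_fn) pairs
    ordered from most abstract (schema) to most detailed (raw buffer).
    Each query_fn returns (guess, probability distribution)."""
    for name, query in levels:
        guess, probs = query(question)
        if entropy(probs) <= max_entropy:
            return guess, name  # confident enough: stop here
    return guess, name          # fall back to the deepest level's answer
```

With stub query functions, a question the index can't settle (a 50/50 answer distribution, entropy 1 bit) falls through to the episodic level, while a confident schema answer would never touch the raw video, which is where the energy savings come from.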

Why is this cool?
It saves energy. The AI doesn't waste time looking at the raw video for simple questions. It only dives deep into the details when it's necessary to be precise.

The Big Picture

This paper solves the problem of AI getting "dumb" with long videos.

  • Old AI: Either forgets the details (text-only) or gets confused by too much data (video-only).
  • MM-Mem: Acts like a human. It remembers the story (gist) efficiently, but keeps the evidence (verbatim) ready in the back pocket just in case you ask for proof.

It's the difference between a student who memorized a textbook word-for-word (and is slow to answer) versus a student who understands the concepts, knows where to look in the book for details, and can answer both simple and complex questions perfectly.