From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture that leverages Fuzzy-Trace Theory and a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, thereby enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia

Published 2026-03-03

Imagine you are trying to remember a whole movie you watched yesterday to answer a tricky question about it.

If you try to remember every single frame (every blink, every background detail), your brain gets overwhelmed, and you can't find the important stuff quickly. That's like the current "Vision-Centric" AI models: they try to save everything, get slow, and get confused.

On the other hand, if you only remember a text summary like "A guy walked into a room and yelled," you save space, but you lose the details. If someone asks, "What color was his shirt?" or "Did he drop a cup?", you have no idea. That's the "Text-Centric" models: they are fast but often make things up (hallucinate) because they forgot the visual proof.

The Paper's Solution: MM-Mem
This paper introduces a new AI memory system called MM-Mem. It's inspired by how human brains actually work, based on a theory called Fuzzy-Trace Theory.

Think of MM-Mem not as a single hard drive, but as a three-story library or a pyramid that organizes your memories in a smart way.

1. The Three Levels of Memory (The Pyramid)

  • Level 1: The Sensory Buffer (The "Raw Footage" Basement)
    • What it is: This is where the AI keeps the "verbatim" details. Think of it as a warehouse full of raw video clips and exact subtitles.
    • Analogy: It's like keeping the original, unedited 4K video files on a hard drive. It's huge and detailed, but you don't look here unless you really need to.
  • Level 2: The Episodic Stream (The "Highlight Reel" Middle Floor)
    • What it is: This is a summary of events. The AI groups similar moments together.
    • Analogy: This is like a "Best Of" DVD or a highlight reel. Instead of remembering every second of a soccer game, you remember "Goal scored at 10 mins," "Red card at 45 mins." It captures the gist of what happened without the noise.
  • Level 3: The Symbolic Schema (The "Book Index" Top Floor)
    • What it is: This is the high-level abstract knowledge. It's a map of the story, characters, and main themes.
    • Analogy: This is the table of contents or the index at the back of a book. It tells you where to find things. "The villain appears in Chapter 3." It doesn't have the details, but it knows the structure.
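As a concrete sketch, the three floors can be pictured as nested data structures, where each level keeps pointers down to the evidence below it. All class and field names here are illustrative stand-ins, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SensoryBuffer:
    """Level 1: verbatim frames and subtitles, keyed by timestamp."""
    frames: dict  # timestamp -> raw frame features / exact subtitle text

@dataclass
class EpisodicEvent:
    """Level 2: one entry in the 'highlight reel'."""
    start: float
    end: float
    summary: str       # e.g. "Goal scored at 10 mins"
    frame_refs: list   # timestamps pointing back into the sensory buffer

@dataclass
class SymbolicSchema:
    """Level 3: the 'book index' of characters, themes, and structure."""
    facts: dict        # e.g. {"main_character": "John"}
    event_refs: dict   # fact -> indices of events that support it

@dataclass
class PyramidalMemory:
    buffer: SensoryBuffer
    stream: list       # list of EpisodicEvent
    schema: SymbolicSchema
```

The key design point is that nothing is ever orphaned: a fact in the schema can be traced to the events that support it, and an event can be traced to the raw frames, so the "proof" is always one hop away.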

2. How It Builds Memory (The "Smart Sifter")

The paper introduces a special tool called SIB-GRPO. Imagine you are a chef making a soup.

  • Old way: Throw everything into the pot (too much, messy). Or, just guess the recipe from memory (tastes wrong).
  • MM-Mem way: You have a smart strainer. As you cook, you constantly ask: "Do I need this ingredient for the final taste?" If it's redundant (like adding more salt to a soup that's already salty), you throw it away. If it's a key flavor, you keep it.
  • The Result: The AI learns to keep only the "flavor" (important meaning) and discard the "water" (boring, repetitive details), saving space while keeping the taste perfect.
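The real SIB-GRPO module is trained, not hand-coded, but the "strainer" intuition can be illustrated with a toy redundancy filter that keeps a frame embedding only if it differs enough from everything already kept. The `sift` function and the 0.9 threshold are made up for this example:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sift(embeddings, threshold=0.9):
    """Toy stand-in for the bottleneck: keep a frame only if it is
    sufficiently different from everything already kept ("flavor"),
    and drop near-duplicates ("water")."""
    kept = []
    for e in embeddings:
        if all(cosine(e, k) < threshold for k in kept):
            kept.append(e)
    return kept
```

The learned version goes further: instead of a fixed similarity threshold, it asks whether a detail helps answer downstream questions, which is exactly the "final taste" test in the soup analogy.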

3. How It Finds Answers (The "Drill-Down" Strategy)

When you ask the AI a question, it doesn't just dump all its memory on you. It uses a Top-Down strategy, guided by a "confidence meter" (technically, the entropy of its answer distribution: high entropy means low confidence).

  • Step 1: It starts at the Top Floor (Symbolic Schema). "Do I know the answer from the index?"
    • Example: "Who is the main character?" -> "Yes, it's John." (Fast! Done.)
  • Step 2: If the AI feels uncertain (the confidence meter drops), it "drills down" to the Middle Floor (Episodic Stream).
    • Example: "Did John wear a hat?" -> The index doesn't say. The AI checks the highlight reel. "Ah, I see a scene where he wears a hat."
  • Step 3: If it's still unsure, it goes all the way down to the Basement (Sensory Buffer).
    • Example: "What color was the hat?" -> The highlight reel just said "hat." The AI now pulls up the specific raw video frame to check the exact shade of blue.
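The three steps above can be sketched as an entropy-gated cascade. This is a hypothetical mock-up rather than the paper's code: `answer` walks from the most abstract memory level down, stopping as soon as the entropy of the answer distribution falls below a confidence threshold:

```python
import math

def entropy(probs):
    """Shannon entropy of an answer distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def answer(question, levels, max_entropy=0.5):
    """Top-down drill-down: `levels` is a list of (name, query_fn) pairs
    ordered from most abstract (schema) to most detailed (raw buffer).
    Each query_fn returns (guess, probability distribution)."""
    for name, query in levels:
        guess, probs = query(question)
        if entropy(probs) <= max_entropy:
            return guess, name  # confident enough: stop here
    return guess, name          # fall back to the deepest level's answer
```

With stub query functions, a question the index can't settle (a 50/50 answer distribution, entropy 1 bit) falls through to the episodic level, while a confident schema answer would never touch the raw video, which is where the energy savings come from.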

Why is this cool?
It saves energy. The AI doesn't waste time looking at the raw video for simple questions. It only dives deep into the details when it's necessary to be precise.

The Big Picture

This paper solves the problem of AI getting "dumb" with long videos.

  • Old AI: Either forgets the details (text-only) or gets confused by too much data (video-only).
  • MM-Mem: Acts like a human. It remembers the story (gist) efficiently, but keeps the evidence (verbatim) ready in the back pocket just in case you ask for proof.

It's the difference between a student who memorized a textbook word-for-word (and is slow to answer) versus a student who understands the concepts, knows where to look in the book for details, and can answer both simple and complex questions perfectly.