From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
This paper introduces MM-Mem, a cognition-inspired pyramidal multimodal memory architecture for long-horizon video agents. Drawing on Fuzzy-Trace Theory, MM-Mem applies a Semantic Information Bottleneck to progressively distill verbatim visual details into abstract semantic schemas, enabling efficient long-horizon video understanding through hierarchical storage and entropy-driven retrieval.
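To make the architecture concrete, below is a minimal Python sketch of the pyramidal-memory idea: a verbatim level that overflows into progressively more abstract gist levels, plus an entropy-based retrieval heuristic. All names (`PyramidalMemory`, `_distill`, `retrieve`), the mean-pooling "distillation", and the softmax-entropy routing rule are illustrative assumptions, not the paper's actual API or method.

```python
# Illustrative sketch only; the paper's real distillation and retrieval
# mechanisms are learned, not the simple heuristics shown here.
import math
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    embedding: list[float]  # feature vector of the stored item
    payload: str            # verbatim frame caption or distilled gist

@dataclass
class PyramidalMemory:
    # levels[0] holds verbatim entries; higher levels hold progressively
    # more abstract gists produced by the distillation step.
    levels: list[list[MemoryEntry]] = field(default_factory=lambda: [[], [], []])
    capacity: int = 8  # per-level budget before distillation triggers

    def write(self, entry: MemoryEntry) -> None:
        self.levels[0].append(entry)
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity:
                # Compress the oldest half of this level into one gist entry
                # at the next level (a stand-in for the semantic bottleneck).
                half = self.capacity // 2
                chunk, self.levels[i] = self.levels[i][:half], self.levels[i][half:]
                self.levels[i + 1].append(self._distill(chunk))

    @staticmethod
    def _distill(chunk: list[MemoryEntry]) -> MemoryEntry:
        # Placeholder: mean-pool embeddings and join payloads; a real system
        # would use a learned summarizer to produce the semantic schema.
        dim = len(chunk[0].embedding)
        pooled = [sum(e.embedding[d] for e in chunk) / len(chunk) for d in range(dim)]
        return MemoryEntry(pooled, " | ".join(e.payload for e in chunk))

    def retrieve(self, query: list[float], k: int = 3) -> list[MemoryEntry]:
        # One plausible reading of "entropy-driven retrieval": score each
        # level by the entropy of its softmax-normalized query similarities
        # and search the most decisive (lowest-entropy) level first.
        def sims(entries: list[MemoryEntry]) -> list[float]:
            return [sum(q * x for q, x in zip(query, e.embedding)) for e in entries]

        def entropy(scores: list[float]) -> float:
            exps = [math.exp(s) for s in scores]
            z = sum(exps)
            return -sum((x / z) * math.log(x / z + 1e-12) for x in exps)

        results: list[MemoryEntry] = []
        for level in sorted((l for l in self.levels if l), key=lambda l: entropy(sims(l))):
            ranked = sorted(level, key=lambda e: -sum(q * x for q, x in zip(query, e.embedding)))
            results.extend(ranked[: k - len(results)])
            if len(results) >= k:
                break
        return results
```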