The Big Picture: From "Short Clips" to "A Whole Life"
Imagine you are trying to understand a person's life.
- Old Way (Current AI): Most AI models today are like people who only watch movie trailers. They see 30-second clips, understand what happens in that clip, and then forget everything. They are great at answering, "What color was the car in this 5-second clip?" but terrible at answering, "What did this person eat for breakfast three weeks ago?"
- The Problem: Real life isn't a movie trailer. It's a continuous stream with huge gaps. You sleep, you go to work, you travel, and you don't record every second. Current AI gets confused when asked to remember things from days or months ago because it tries to hold the entire video in its brain at once, which causes it to crash or hallucinate (make things up).
The Solution: MM-Lifelong (The "Life Log" Dataset)
The researchers created a new dataset called MM-Lifelong. Think of this as a giant, messy diary made of video.
Instead of just showing a model a 10-minute movie, they gave it:
- Day Scale: A full day of a gamer playing a video game (24 hours of continuous play).
- Week Scale: A week of someone's daily life from their own camera (sleeping, eating, working).
- Month Scale: 51 days of a live streamer's life, but with huge gaps (they stream for 10 hours, then disappear for 3 days, then come back).
The Key Challenge: The dataset forces the AI to deal with "The Missing Time." The AI has to remember that on Day 1, the streamer bought a red hat, and on Day 15, they wore that same red hat, even though there were 13 days of "black screen" (unrecorded time) in between.
The Two Big Failures of Current AI
The paper tested the smartest AI models available and found two ways they fail at this "Life Log" task:
The "Overstuffed Backpack" (Working Memory Bottleneck):
Imagine trying to carry a backpack filled with 100 hours of video. As you add more video, the backpack gets so heavy and full that you can't think anymore. The AI tries to read the whole video at once, gets overwhelmed by "noise" (irrelevant details), and starts guessing. It's like trying to find a specific needle in a haystack by staring at the whole haystack at once; you just get dizzy.
The "Lost in the Library" (Global Localization Collapse):
Imagine an AI trying to find a specific book in a library that is the size of a city. If it tries to walk through every single aisle (every frame of the video) to find the book, it gets lost. Current "Agent" AIs (robots that try to search) often give up when the timeline is too long and sparse.
The Hero: ReMA (The "Smart Librarian")
To fix this, the authors built a new system called ReMA (Recursive Multimodal Agent).
Instead of trying to carry the whole library in its head, ReMA acts like a super-smart librarian with a filing system.
- How it works:
- Summarize: It watches the video in chunks and writes a short, smart summary of what happened, putting it into a "Memory Bank" (like a filing cabinet).
- Ask & Search: When you ask a question ("Where did the streamer sing that song?"), it doesn't re-watch the whole video. It checks its filing cabinet first.
- Zoom In: If the summary isn't enough, it goes back and re-watches only the specific 5-minute clip where the song might have been played.
- Update: It updates its notes and tries again.
The Analogy:
- Old AI: Tries to memorize every single word of a 1,000-page book to answer one question. It gets a headache and gives the wrong answer.
- ReMA: Reads the book, writes a detailed index and summary notes, and when you ask a question, it looks up the page number in the index, flips to that page, and reads just that paragraph.
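To make the librarian loop concrete, here is a minimal toy sketch of that Summarize → Search → Zoom In flow in Python. This is purely illustrative: the real ReMA uses a multimodal LLM to summarize video chunks and reason over them, while here "summarizing" is just collecting keywords and "video" is a list of text events. All function and variable names are invented for this sketch.

```python
# Toy sketch of a ReMA-style memory loop (illustrative only; the real
# agent summarizes video with a multimodal LLM, not keyword matching).

def summarize(chunk):
    # Stand-in for an LLM summarizer: keep the distinct words seen.
    return {word for event in chunk for word in event.split()}

def build_memory_bank(stream, chunk_size=3):
    # Step 1 (Summarize): process the stream chunk by chunk and file
    # each short summary into a "Memory Bank".
    bank = []
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        bank.append({"span": (i, i + len(chunk)), "summary": summarize(chunk)})
    return bank

def answer(query, stream, bank):
    # Step 2 (Ask & Search): check the filing cabinet first,
    # never re-reading the whole raw stream.
    keywords = set(query.lower().split())
    for entry in bank:
        if keywords & entry["summary"]:
            # Step 3 (Zoom In): re-watch only the matching chunk.
            start, end = entry["span"]
            for event in stream[start:end]:
                if keywords & set(event.split()):
                    return event
    # Step 4 (Update): a real agent would refine its notes/query and retry.
    return None

# A week of "life-log" events, most of them irrelevant to any one question.
stream = [
    "streamer buys red hat", "streamer eats lunch", "streamer plays game",
    "streamer sleeps", "streamer sings song", "streamer wears red hat",
]
bank = build_memory_bank(stream, chunk_size=3)
print(answer("song", stream, bank))  # -> streamer sings song
```

The point of the sketch is the access pattern, not the matching logic: the cost of answering a question scales with the size of the summaries plus one small chunk, not with the full 100-hour stream.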
The Results
When they tested this new "Librarian" (ReMA) against the "Backpack Carriers" (standard AI):
- Standard AI: Got about 15% of the answers right. They were mostly guessing.
- ReMA: Got about 18-19% of the answers right.
- Why is 19% better? In a task this hard (finding needles in a haystack of 100 hours of video), jumping from 15% to 19% is a massive leap. It proves that organizing memory is more important than just having a bigger brain.
The Takeaway
This paper teaches us that to build AI that can truly understand our lives (like a personal assistant that remembers your habits from last month), we can't just make the AI "look" at more video. We have to teach it how to take notes, organize its memories, and know when to look up old information.
We need AI that doesn't just "see" the world, but lives in it by building a persistent, organized story of what happened, even when the camera is off.