Agentic Very Long Video Understanding

Imagine you have a friend who wears smart glasses 24/7. They record everything they see and hear for an entire week: every conversation, every meal, every time they walk into a room, and every time they lose their keys.

Now, imagine asking that friend: "Who was sitting next to me when we took the taxi on Tuesday, and did we talk about the dog before or after that?"

Answering that by rewatching 50 hours of video would be impractical for a human, and it is just as hard for current AI. Most AI models behave like people with very short-term memory; they can only hold a few minutes of video in their "mind" at once. If you show them a whole week, they get overwhelmed and forget the beginning by the time they reach the end.

This paper introduces EGAgent, a new kind of AI detective designed specifically to solve this "long-term memory" problem.

Here is how it works, using some simple analogies:

1. The Problem: The "Firehose" of Memory

Think of a week-long video stream as a massive firehose of water (information). Current AI tries to drink from this firehose by taking a sip every now and then (sampling a few frames). But if the answer to your question is hidden in a tiny drop of water that happened 3 days ago, the AI misses it. It's like trying to find a specific needle in a haystack by only looking at the top inch of the hay.

2. The Solution: The "Social Rolodex" (The Entity Graph)

Instead of trying to remember every single second of the video, EGAgent builds a Social Rolodex (called an Entity Scene Graph).

Imagine a giant, digital notebook where the AI doesn't write down every frame of the video. Instead, it only writes down the important connections:

Who: Jake, Lucia, Shure.
What: The Car, The Dog, The Kitchen.
When: "Tuesday at 2 PM."
How they relate: "Jake talked to Lucia," "Jake used the Car."

This notebook is organized like a map. It doesn't care about the background scenery; it only cares about the relationships between people and things over time. This is the "Entity Graph."

3. The Detective: The "Planning Agent"

When you ask a question, EGAgent doesn't just guess. It uses a Planning Agent (a smart project manager) that breaks your big question into small, manageable clues.

The Analogy:
Imagine you are a detective trying to solve a mystery. You don't just stare at the crime scene for 10 hours. You:

Check the logs: "Who was in the room at 2 PM?" (Audio Search)
Look at the photos: "Who was standing near the car?" (Visual Search)
Check the Rolodex: "Did Jake ever talk to Shure about the car?" (Entity Graph Search)

EGAgent does this automatically. It asks itself: "To answer this, I need to know who was in the car. Let me check the graph first." If the graph says "Jake and Shure were in the car on Tuesday," it then goes to the video to find the exact moment to confirm.

4. The Superpower: "Time Travel"

The coolest part of this system is that it understands time.

Old AI: "I see a car."
EGAgent: "I see a car, and I know that Jake used that car between 2:00 PM and 2:15 PM on Tuesday, and Shure was talking to him during that time."

Because it stores these relationships with timestamps, it can answer complex questions like: "How many times did I drink water this week?" or "Who did I talk to right before I went to the grocery store?"

5. The Results: Smarter than the Rest

The researchers tested EGAgent on a dataset called EgoLife, which is exactly that week-long video of people living their lives.

Previous AI: Got about 37% of the questions right. It got lost in the details or forgot who was who.
EGAgent: Got 57.5% of the questions right.

It didn't just get lucky; it improved most on the hardest questions, the ones that required connecting dots across different days (e.g., "Who did I meet on Monday that I also saw on Friday?").

Summary

Think of EGAgent as the difference between a camcorder and a biographer.

A camcorder just records everything blindly. If you ask it a question, it has to re-watch the whole tape.
A biographer (EGAgent) watches the tape, takes notes on the important relationships, builds a timeline, and creates a map. When you ask a question, the biographer doesn't need to re-watch the whole tape; they look at their notes and their map to find the right moment, then check only that part of the recording to confirm the answer.

This technology is a huge step toward creating personal AI assistants that can actually remember your life, help you find lost items, or remind you of conversations you had weeks ago, just like a human friend would.

1. Problem Definition

The paper addresses the challenge of Very Long Video Understanding, specifically targeting continuous, longitudinal streams of egocentric video (e.g., from smart glasses like Ray-Ban Meta) that span days or weeks.

Limitations of Current Methods: Existing approaches, including Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), struggle with:
- Context Window Constraints: They cannot process hours of video input simultaneously.
- Temporal Coherence: They fail to maintain coherent reasoning about entities and their relationships over extended time horizons (e.g., tracking a habit over a week).
- Compositional Reasoning: They lack the ability to perform multi-hop reasoning across different modalities (visual, audio, text) and timeframes (e.g., "Who did I talk to every time I used my car this week?").
The Gap: While benchmarks like EgoLife offer week-long datasets, current agents cannot effectively link information across days or handle "lulls" in video streams without losing entity identity.

2. Methodology: EGAgent

The authors propose EGAgent, an enhanced agentic framework centered on Entity Scene Graphs. The system decomposes complex queries into sub-tasks and leverages structured search and reasoning tools.

A. Core Component: Entity Scene Graph

Unlike unstructured captions, EGAgent constructs a time-aware graph $G = (V, E)$:

Nodes ($V$): Entities categorized as Person, Object, or Location.
Edges ($E$): Relationships such as talks-to, interacts-with, mentions, and uses.
Temporal Annotation: Crucially, every edge is annotated with a temporal interval $(t_{\text{start}}, t_{\text{end}})$ indicating when the relationship holds.
Construction: The graph is built incrementally from fused audio transcripts and visual scene descriptions (captions) using an LLM. It is stored in a SQLite database, allowing for efficient SQL-based querying.
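To make the storage idea concrete, here is a minimal sketch of how such a graph might be laid out in SQLite. The table names, column names, the choice of seconds as the time unit, and the helper functions are all assumptions made for illustration; the paper's actual schema may differ.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    type TEXT NOT NULL CHECK (type IN ('Person', 'Object', 'Location'))
);
CREATE TABLE edges (
    source_id INTEGER NOT NULL REFERENCES entities(id),
    relation  TEXT NOT NULL,   -- e.g. talks-to, interacts-with, mentions, uses
    target_id INTEGER NOT NULL REFERENCES entities(id),
    t_start   REAL NOT NULL,   -- assumed unit: seconds since recording start
    t_end     REAL NOT NULL,
    CHECK (t_start <= t_end)
);
""")

def add_entity(name, etype):
    """Insert an entity if it is new and return its id."""
    conn.execute("INSERT OR IGNORE INTO entities (name, type) VALUES (?, ?)",
                 (name, etype))
    return conn.execute("SELECT id FROM entities WHERE name = ?",
                        (name,)).fetchone()[0]

def add_edge(src, relation, dst, t_start, t_end):
    """Insert one temporally annotated relationship."""
    conn.execute("INSERT INTO edges VALUES (?, ?, ?, ?, ?)",
                 (src, relation, dst, t_start, t_end))

jake = add_entity("Jake", "Person")
shure = add_entity("Shure", "Person")
lucia = add_entity("Lucia", "Person")
car = add_entity("Car", "Object")
add_edge(jake, "uses", car, 100.0, 1000.0)
add_edge(shure, "talks-to", jake, 400.0, 600.0)     # overlaps the car use
add_edge(lucia, "talks-to", jake, 2000.0, 2100.0)   # after the car use

def talkers_while_using(user, obj):
    """People who talked to `user` during an interval in which `user` used `obj`."""
    rows = conn.execute("""
        SELECT DISTINCT p.name
        FROM edges AS u
        JOIN edges AS t ON t.target_id = u.source_id
                       AND t.relation = 'talks-to'
                       AND t.t_start < u.t_end AND u.t_start < t.t_end
        JOIN entities AS p ON p.id = t.source_id
        WHERE u.relation = 'uses' AND u.source_id = ? AND u.target_id = ?
        ORDER BY p.name
    """, (user, obj)).fetchall()
    return [r[0] for r in rows]

print(talkers_while_using(jake, car))
```

The self-join on overlapping intervals is what turns a two-hop question ("who talked to Jake while he was using the car?") into a single query, which is the kind of lookup a time-annotated edge table makes cheap.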

B. The Agentic Framework

EGAgent operates via a Planning Agent that orchestrates a multi-step reasoning loop (Algorithm 1):

Decomposition: The Planning Agent breaks a complex user query $Q$ into a sequence of sub-tasks.
Tool Selection & Retrieval: For each sub-task, the agent selects one of three specialized tools:
- Visual Search Tool: Uses hybrid semantic/attribute search (SQL + vector embeddings) on video frames sampled at 1 FPS.
- Audio Transcript Search Tool: Searches transcribed speech (using either LLM-based semantic search or BM25 lexical search).
- Entity Graph Search Tool: Executes SQL queries on the entity graph to find specific relationships, filtering by time, entity type, and relation type. It employs a "strict-to-relaxed" strategy, starting with exact matches and progressively broadening time windows or relaxing constraints if no results are found.
Analysis: An Analyzer Tool (LLM) filters retrieved data, extracts evidence, and updates the Working Memory.
Synthesis: A VQA Agent consumes the accumulated cross-modal evidence from the working memory to generate the final answer.
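The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's Algorithm 1: the planner and analyzer are hard-coded stand-ins for LLM calls, only the graph tool is included (visual and audio search are omitted), the graph is a small in-memory list rather than a database, and the widening steps of 0, 600, and 3600 seconds are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str
    relation: str
    target: str
    t_start: float  # assumed unit: seconds
    t_end: float

# Toy graph for illustration only.
GRAPH = [
    Edge("Jake", "uses", "Car", 7200.0, 8100.0),
    Edge("Shure", "talks-to", "Jake", 9000.0, 9300.0),
]

def graph_search(relation, window, pad_steps=(0.0, 600.0, 3600.0)):
    """Strict-to-relaxed search: exact window first, then progressively wider."""
    lo, hi = window
    for pad in pad_steps:
        hits = [e for e in GRAPH
                if e.relation == relation
                and e.t_start < hi + pad and lo - pad < e.t_end]
        if hits:
            return hits, pad
    return [], pad_steps[-1]

@dataclass
class WorkingMemory:
    evidence: list = field(default_factory=list)

def plan(query):
    """Stand-in for the LLM Planning Agent: returns (tool, args) sub-tasks."""
    window = (7000.0, 8500.0)
    return [("graph", {"relation": "uses", "window": window}),
            ("graph", {"relation": "talks-to", "window": window})]

# Only the graph tool is wired in; visual and audio tools would be added here.
TOOLS = {"graph": graph_search}

def analyze(hits, pad):
    """Stand-in for the Analyzer Tool: turn raw hits into evidence strings."""
    note = "exact window" if pad == 0 else f"window widened by {pad:.0f}s"
    return [f"{e.source} {e.relation} {e.target} "
            f"[{e.t_start:.0f}-{e.t_end:.0f}s] ({note})" for e in hits]

def answer(query):
    """Decompose, retrieve, analyze, and accumulate evidence in working memory."""
    memory = WorkingMemory()
    for tool, args in plan(query):
        hits, pad = TOOLS[tool](**args)
        memory.evidence.extend(analyze(hits, pad))
    return memory.evidence  # a VQA agent would turn this into the final answer

for line in answer("Who talked to Jake around the time he used the car?"):
    print(line)
```

In this toy run the "uses" sub-task matches inside the exact window, while the "talks-to" sub-task only matches after the window is widened once, which is the behavior the strict-to-relaxed strategy is meant to provide.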

3. Key Contributions

Entity Graph Representation: Introduction of a temporally annotated entity scene graph that explicitly models people, places, objects, and their dynamic relationships over long time horizons, enabling structured indexing.
Agentic Framework (EGAgent): A novel pipeline that integrates the entity graph with visual and audio search tools, allowing for compositional, multi-hop reasoning across modalities.
State-of-the-Art Performance: Demonstration that structured reasoning over entity graphs significantly outperforms uniform sampling and standard RAG approaches on very long video benchmarks.
Efficiency: The method achieves high performance while processing significantly fewer frames (e.g., 10x fewer than some baselines) by relying on the graph for coarse temporal localization.
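One way to picture coarse temporal localization is the sketch below: intervals returned by the graph are merged, and only frames inside them are passed on for visual inspection. The 50-hour duration and 1 FPS sampling come from the text; the function names and the example intervals are hypothetical, and the paper may select frames differently.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals given in seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def frames_to_inspect(intervals, fps=1):
    """Frame indices (sampled at `fps`) that fall inside any merged interval."""
    frames = []
    for start, end in merge_intervals(intervals):
        frames.extend(range(int(start * fps), int(end * fps) + 1))
    return frames

# 50 hours of video sampled at 1 FPS.
total_frames = 50 * 3600

# Hypothetical intervals returned by a graph query.
candidate = frames_to_inspect([(7200, 7260), (7250, 7300), (90000, 90030)])

print(f"{len(candidate)} of {total_frames} frames need visual inspection")
```

The point of the sketch is only the ordering of work: a cheap structured lookup narrows the time range first, and the expensive visual step runs on what remains.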

4. Experimental Results

The system was evaluated on two benchmarks: EgoLifeQA (50 hours of continuous egocentric video) and Video-MME (Long).

EgoLifeQA Performance:
- EGAgent achieved 57.5% accuracy, setting a new state-of-the-art.
- It outperformed the previous best (EgoButler) by 20.6% overall.
- Critical Gains: The model showed massive improvements in categories requiring multi-hop relational reasoning:
  - RelationMap: +32% improvement over previous SOTA.
  - TaskMaster: +39.7% improvement.
- Ablation: Removing the entity graph caused a significant drop in performance, particularly in RelationMap and TaskMaster, indicating that the graph is what drives the gains on multi-hop relational questions.
Video-MME (Long) Performance:
- Achieved 74.1% accuracy, competitive with the strongest baselines (Gemini 2.5 Pro).
- Notably, EGAgent matched the performance of AdaVideoRAG while processing 10x fewer frames, highlighting its efficiency in long-context scenarios.

5. Significance and Impact

Enabling Always-On AI: This work provides a technical pathway for "always-on" personal AI assistants (e.g., smart glasses) to understand user lives over weeks, not just minutes.
Beyond Context Windows: It demonstrates that for very long videos, structured representation (graphs) combined with agentic planning is superior to simply extending LLM context windows or using naive retrieval.
Cross-Modal Reasoning: The results indicate that linking visual, audio, and relational data through a structured graph is important for answering complex, longitudinal questions about human habits and social interactions.
Scalability: The use of a lightweight SQLite database for the graph (only ~2MB for 50 hours of video) suggests this approach is highly scalable for real-world deployment compared to storing massive vector embeddings for every frame.
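A back-of-the-envelope calculation gives a sense of scale for that comparison. The 50 hours, 1 FPS sampling, and ~2MB graph size come from the text; the embedding dimension (768) and float32 storage are assumptions chosen only for illustration, so the resulting ratio is indicative rather than a figure from the paper.

```python
HOURS = 50            # from the text
FPS = 1               # frame sampling rate, from the text
GRAPH_MB = 2          # approximate graph size, from the text
EMBED_DIM = 768       # assumption: a typical embedding width, not from the paper
BYTES_PER_FLOAT = 4   # assumption: float32 storage

frames = HOURS * 3600 * FPS
embedding_mb = frames * EMBED_DIM * BYTES_PER_FLOAT / 1e6
ratio = embedding_mb / GRAPH_MB

print(f"frames: {frames}")
print(f"per-frame embeddings: {embedding_mb:.1f} MB")
print(f"roughly {ratio:.0f}x the size of the graph")
```

Under these assumptions, raw per-frame embeddings occupy a few hundred megabytes, against a couple of megabytes for the graph.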

In conclusion, EGAgent shifts the emphasis from "processing all frames" to "reasoning over structured entity interactions," addressing a central bottleneck in very long video understanding: keeping track of entities and their relationships coherently over time.