FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

FocusGraph is a novel framework for embodied long video question answering that combines a lightweight Scene-Caption LLM Selector for identifying query-relevant clips and a training-free Patch-wise Sparse-Flow Retention method for keyframe selection, achieving state-of-the-art performance on egocentric benchmarks while significantly reducing inference time.

Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

Published 2026-03-05

Imagine you are trying to solve a mystery, but instead of a few clues, you are handed a 3-hour security camera tape of someone's entire day. Your job is to answer a specific question like, "What did the person do right before they picked up the coffee cup?"

If you tried to watch the whole tape at normal speed, you'd get tired, bored, and likely miss the crucial moment. If you tried to watch every single second in slow motion, your brain (or in this case, a computer) would crash from the sheer amount of data.

This is the problem FocusGraph solves. It's a new way for AI to watch long videos without getting overwhelmed. Here is how it works, broken down into simple analogies:

1. The Problem: The "Information Flood"

Current AI models (called Multimodal Large Language Models) are like brilliant detectives who can read and understand text perfectly. But when you show them a long video, they try to look at every single frame (every split-second image).

  • The Issue: It's like asking a detective to read every single page of a 1,000-page book to find one specific sentence. It takes forever, costs a lot of money (computing power), and the detective often gets confused by all the extra noise.

2. The Solution: FocusGraph's Two-Step Strategy

FocusGraph acts like a smart Editor and a Spotlight. It doesn't watch the whole video; it figures out where to look and what to look at.

Step 1: The "Summary Editor" (Scene-Caption LLM Selector)

Instead of showing the AI the raw video frames, FocusGraph first breaks the video into short chunks (clips).

  • The Analogy: Imagine you have a 3-hour movie. Instead of showing the AI the movie, you ask a fast, lightweight assistant to write a one-sentence summary for every 10-second chunk.
    • Chunk 1 Summary: "Person walks into kitchen."
    • Chunk 2 Summary: "Person opens fridge."
    • Chunk 3 Summary: "Person grabs milk."
  • The Magic: The AI then looks at these text summaries (which are tiny and easy to read) and asks, "Which of these summaries sounds like it contains the answer to my question?"
  • The Result: It ignores the boring parts (like the person walking down the hall for 5 minutes) and only keeps the interesting clips (like the fridge scene). This is the "Graph" part: it organizes these summaries into a map of objects and actions.
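The selection step above can be sketched in a few lines of Python. This is a toy stand-in, not the paper's code: a real system would caption each clip with a lightweight vision model and ask an LLM which summaries match the question, whereas here plain keyword overlap plays the role of the LLM Selector, and the summaries are assumed to already exist.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def select_relevant_clips(clip_summaries, question, top_k=2):
    """Score each clip summary by word overlap with the question and
    return the indices of the top_k most relevant clips."""
    q_words = tokenize(question)
    scored = sorted(
        ((len(q_words & tokenize(s)), i) for i, s in enumerate(clip_summaries)),
        reverse=True,
    )
    return [i for _, i in scored[:top_k]]

summaries = [
    "Person walks into kitchen.",
    "Person opens fridge.",
    "Person grabs milk.",
]
print(select_relevant_clips(summaries, "Where did the person get the milk?", top_k=1))  # → [2]
```

The point of the design is that the expensive model never touches raw frames at this stage: it reads (or here, matches against) a handful of short text lines, which is why the filtering step is cheap.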

Step 2: The "Spotlight" (PSFR Algorithm)

Now the AI has a few short clips, but they still have too many frames. We need to pick the best few pictures to show the main detective.

  • The Analogy: Imagine you have a video of a bird flying. If you take a photo every second, 90% of the photos look exactly the same. You only need a photo when the bird flaps its wings or changes direction.
  • The Method: FocusGraph uses a clever trick called Patch-wise Sparse-Flow Retention (PSFR). It doesn't need to be "taught" how to do this; it just watches for movement and change.
    • It divides the screen into a grid.
    • It tracks tiny dots (corners) on objects.
    • If the dots move significantly or disappear (meaning the scene changed), it says, "Aha! This is a key moment! Take a picture!"
    • If the dots stay still, it says, "Nothing new here, skip this frame."
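The keep/skip loop can be sketched as follows. One hedge up front: real PSFR tracks sparse corner points (the "tiny dots") across frames, while this toy version approximates "movement in a patch" with the mean absolute pixel change per grid cell, purely for illustration.

```python
# Toy sketch of patch-wise keyframe retention (an illustration, not the
# paper's PSFR algorithm): divide each frame into a grid, measure how much
# each patch changed since the last kept frame, and keep the frame only
# when some patch moved past a threshold.

def patch_motion(prev, curr, grid=2):
    """Split two equal-size grayscale frames (2D lists) into grid x grid
    patches and return the largest mean absolute difference over patches."""
    h, w = len(prev), len(prev[0])
    ph, pw = h // grid, w // grid
    worst = 0.0
    for gy in range(grid):
        for gx in range(grid):
            total, count = 0, 0
            for y in range(gy * ph, (gy + 1) * ph):
                for x in range(gx * pw, (gx + 1) * pw):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            worst = max(worst, total / count)
    return worst

def select_keyframes(frames, threshold=10.0):
    """Keep the first frame, then keep a frame only when some patch
    changed more than `threshold` since the last kept frame."""
    kept = [0]
    for i in range(1, len(frames)):
        if patch_motion(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept

still = [[0] * 8 for _ in range(8)]
moved = [[0] * 8 for _ in range(8)]
for y in range(4):          # a bright object appears in the top-left patch
    for x in range(4):
        moved[y][x] = 200
print(select_keyframes([still, still, moved, moved]))  # → [0, 2]
```

The usage example shows the "90% of photos look the same" intuition: two identical frames in a row contribute nothing, so only the frame where the scene actually changed gets kept.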

3. The Final Answer

Now, the main AI detective (the big MLLM) only has to look at:

  1. The text summaries to know where to look.
  2. A handful of key photos (the ones where things actually happened) to see the details.
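Putting the two pieces together, the final input handed to the big MLLM might look like the sketch below. The function name `build_prompt` and the `frame_i` labels are illustrative assumptions, not the paper's API; the point is only that the model receives a compact bundle of summaries plus a few keyframes instead of the whole video.

```python
# Illustrative sketch (hypothetical names): bundle the question, the
# selected clip summaries, and references to the kept keyframes into one
# compact prompt for the multimodal LLM.

def build_prompt(question, summaries, keyframe_ids):
    """Assemble a text prompt from the retrieved summaries and keyframes."""
    context = "\n".join(f"- {s}" for s in summaries)
    frames = ", ".join(f"frame_{i}" for i in keyframe_ids)
    return (
        f"Question: {question}\n"
        f"Relevant clip summaries:\n{context}\n"
        f"Attached keyframes: {frames}\n"
        "Answer using only the evidence above."
    )

prompt = build_prompt(
    "What did the person grab from the fridge?",
    ["Person opens fridge.", "Person grabs milk."],
    [42, 57],
)
print(prompt)
```

Because the prompt is tiny compared to thousands of raw frames, the final answering step stays fast regardless of how long the original video was.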

Why is this a big deal?

  • Speed: It's like going from reading a whole library to reading a single index card. The AI answers questions much faster (in seconds instead of minutes).
  • Smarter: Because it focuses on the right moments, it doesn't get confused by irrelevant details. It actually gets better at answering hard questions about long videos.
  • Efficient: It saves a massive amount of computer power, making it possible to run these smart agents on devices that aren't supercomputers.

In short: FocusGraph teaches AI to stop staring at the whole ocean and instead learn to spot the specific waves that matter. It turns a 3-hour movie into a 5-minute highlight reel, ensuring the AI never misses the plot twist.