FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

FocusGraph is a novel framework for embodied long video question answering that combines a lightweight Scene-Caption LLM Selector for identifying query-relevant clips and a training-free Patch-wise Sparse-Flow Retention method for keyframe selection, achieving state-of-the-art performance on egocentric benchmarks while significantly reducing inference time.

Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

Published 2026-03-05

Imagine you are trying to solve a mystery, but instead of a few clues, you are handed a 3-hour security camera tape of someone's entire day. Your job is to answer a specific question like, "What did the person do right before they picked up the coffee cup?"

If you tried to watch the whole tape at normal speed, you'd get tired, bored, and likely miss the crucial moment. If you tried to watch every single second in slow motion, your brain (or in this case, a computer) would crash from the sheer amount of data.

This is the problem FocusGraph solves. It's a new way for AI to watch long videos without getting overwhelmed. Here is how it works, broken down into simple analogies:

1. The Problem: The "Information Flood"

Current AI models (called Multimodal Large Language Models) are like brilliant detectives who can read and understand text perfectly. But when you show them a long video, they try to look at every single frame (every split-second image).

  • The Issue: It's like asking a detective to read every single page of a 1,000-page book to find one specific sentence. It takes forever, costs a lot of money (computing power), and the detective often gets confused by all the extra noise.

2. The Solution: FocusGraph's Two-Step Strategy

FocusGraph acts like a smart Editor and a Spotlight. It doesn't watch the whole video; it figures out where to look and what to look at.

Step 1: The "Summary Editor" (Scene-Caption LLM Selector)

Instead of showing the AI the raw video frames, FocusGraph first breaks the video into short chunks (clips).

  • The Analogy: Imagine you have a 3-hour movie. Instead of showing the AI the movie, you ask a fast, lightweight assistant to write a one-sentence summary for every 10-second chunk.
    • Chunk 1 Summary: "Person walks into kitchen."
    • Chunk 2 Summary: "Person opens fridge."
    • Chunk 3 Summary: "Person grabs milk."
  • The Magic: The AI then looks at these text summaries (which are tiny and easy to read) and asks, "Which of these summaries sounds like it contains the answer to my question?"
  • The Result: It ignores the boring parts (like the person walking down the hall for 5 minutes) and only keeps the interesting clips (like the fridge scene). This is the "Graph" part: it organizes these summaries into a map of objects and actions.
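The selection step above can be sketched in a few lines of Python. This is a toy stand-in, not the paper's code: a real system would caption each clip with a lightweight vision model and ask an LLM which summaries match the question, whereas here plain keyword overlap plays the role of the LLM Selector, and the summaries are assumed to already exist.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def select_relevant_clips(clip_summaries, question, top_k=2):
    """Score each clip summary by word overlap with the question and
    return the indices of the top_k most relevant clips."""
    q_words = tokenize(question)
    scored = sorted(
        ((len(q_words & tokenize(s)), i) for i, s in enumerate(clip_summaries)),
        reverse=True,
    )
    return [i for _, i in scored[:top_k]]

summaries = [
    "Person walks into kitchen.",
    "Person opens fridge.",
    "Person grabs milk.",
]
print(select_relevant_clips(summaries, "Where did the person get the milk?", top_k=1))  # → [2]
```

The point of the design is that the expensive model never touches raw frames at this stage: it reads (or here, matches against) a handful of short text lines, which is why the filtering step is cheap.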

Step 2: The "Spotlight" (PSFR Algorithm)

Now the AI has a few short clips, but they still have too many frames. We need to pick the best few pictures to show the main detective.

  • The Analogy: Imagine you have a video of a bird flying. If you take a photo every second, 90% of the photos look exactly the same. You only need a photo when the bird flaps its wings or changes direction.
  • The Method: FocusGraph uses a clever trick called Patch-wise Sparse-Flow Retention (PSFR). It doesn't need to be "taught" how to do this; it just watches for movement and change.
    • It divides the screen into a grid.
    • It tracks tiny dots (corners) on objects.
    • If the dots move significantly or disappear (meaning the scene changed), it says, "Aha! This is a key moment! Take a picture!"
    • If the dots stay still, it says, "Nothing new here, skip this frame."
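The keep/skip loop can be sketched as follows. One hedge up front: real PSFR tracks sparse corner points (the "tiny dots") across frames, while this toy version approximates "movement in a patch" with the mean absolute pixel change per grid cell, purely for illustration.

```python
# Toy sketch of patch-wise keyframe retention (an illustration, not the
# paper's PSFR algorithm): divide each frame into a grid, measure how much
# each patch changed since the last kept frame, and keep the frame only
# when some patch moved past a threshold.

def patch_motion(prev, curr, grid=2):
    """Split two equal-size grayscale frames (2D lists) into grid x grid
    patches and return the largest mean absolute difference over patches."""
    h, w = len(prev), len(prev[0])
    ph, pw = h // grid, w // grid
    worst = 0.0
    for gy in range(grid):
        for gx in range(grid):
            total, count = 0, 0
            for y in range(gy * ph, (gy + 1) * ph):
                for x in range(gx * pw, (gx + 1) * pw):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            worst = max(worst, total / count)
    return worst

def select_keyframes(frames, threshold=10.0):
    """Keep the first frame, then keep a frame only when some patch
    changed more than `threshold` since the last kept frame."""
    kept = [0]
    for i in range(1, len(frames)):
        if patch_motion(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept

still = [[0] * 8 for _ in range(8)]
moved = [[0] * 8 for _ in range(8)]
for y in range(4):          # a bright object appears in the top-left patch
    for x in range(4):
        moved[y][x] = 200
print(select_keyframes([still, still, moved, moved]))  # → [0, 2]
```

The usage example shows the "90% of photos look the same" intuition: two identical frames in a row contribute nothing, so only the frame where the scene actually changed gets kept.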

3. The Final Answer

Now, the main AI detective (the big MLLM) only has to look at:

  1. The text summaries to know where to look.
  2. A handful of key photos (the ones where things actually happened) to see the details.
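Putting the two pieces together, the final input handed to the big MLLM might look like the sketch below. The function name `build_prompt` and the `frame_i` labels are illustrative assumptions, not the paper's API; the point is only that the model receives a compact bundle of summaries plus a few keyframes instead of the whole video.

```python
# Illustrative sketch (hypothetical names): bundle the question, the
# selected clip summaries, and references to the kept keyframes into one
# compact prompt for the multimodal LLM.

def build_prompt(question, summaries, keyframe_ids):
    """Assemble a text prompt from the retrieved summaries and keyframes."""
    context = "\n".join(f"- {s}" for s in summaries)
    frames = ", ".join(f"frame_{i}" for i in keyframe_ids)
    return (
        f"Question: {question}\n"
        f"Relevant clip summaries:\n{context}\n"
        f"Attached keyframes: {frames}\n"
        "Answer using only the evidence above."
    )

prompt = build_prompt(
    "What did the person grab from the fridge?",
    ["Person opens fridge.", "Person grabs milk."],
    [42, 57],
)
print(prompt)
```

Because the prompt is tiny compared to thousands of raw frames, the final answering step stays fast regardless of how long the original video was.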

Why is this a big deal?

  • Speed: It's like going from reading a whole library to reading a single index card. The AI answers questions much faster (in seconds instead of minutes).
  • Smarter: Because it focuses on the right moments, it doesn't get confused by irrelevant details. It actually gets better at answering hard questions about long videos.
  • Efficient: It saves a massive amount of computer power, making it possible to run these smart agents on devices that aren't supercomputers.

In short: FocusGraph teaches AI to stop staring at the whole ocean and instead learn to spot the specific waves that matter. It turns a 3-hour movie into a 5-minute highlight reel, ensuring the AI never misses the plot twist.