MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

This paper introduces MA-EgoQA, a benchmark and dataset of 1,700 questions across five categories that evaluates whether AI models can understand and reason over multiple long-horizon egocentric videos from embodied agents. It also proposes a baseline model, EgoMAS, whose results highlight current limitations in system-level multi-agent understanding.

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

Published Wed, 11 Ma

Imagine you are the manager of a busy household with six different robots living with you. Each robot has a camera strapped to its head (like a GoPro) and records everything it sees, hears, and does, 24/7, for a whole week.

Now, imagine you walk in and ask a simple question:

"Who was the last person to use the paper towel on Day 5, and why did they think the coffee machine was broken?"

To answer this, you can't just ask one robot. You have to piece together clues from all six robots, cross-reference their memories, figure out who was talking to whom, and understand what they were thinking at the time.

This is exactly the problem the paper MA-EgoQA tackles. Here is the breakdown in simple terms:

1. The Problem: The "Too Much Information" Bottleneck

In the future, we won't just have one smart assistant; we'll have teams of them (in homes, factories, or hospitals).

  • The Challenge: If you have six robots recording video for seven days, that's 266 hours of video. That's like watching 11 days of non-stop TV!
  • The Current Failure: Even the smartest AI models today get overwhelmed. If you feed them all that video at once, they get confused, like a student trying to read six different textbooks simultaneously while someone is shouting questions at them. They miss the small details, forget who said what, and can't connect the dots between different people's perspectives.

2. The Solution: A New "Exam" (MA-EgoQA)

The researchers created a new test called MA-EgoQA (Multi-Agent Egocentric Question Answering).

  • The Dataset: They used real video data from six people living in a house for a week.
  • The Questions: They generated 1,700 tricky questions that require looking at multiple people's videos to answer.
    • Example: "What did Alice think Jake was doing while she was cooking?" (This requires Theory of Mind—guessing what someone else is thinking).
    • Example: "Who coordinated the cleaning of the kitchen while the others were dancing?" (This requires Task Coordination).
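To make the dataset concrete, here is a minimal sketch of what a single question record might look like. The field names and file paths are illustrative assumptions, not the paper's actual schema; the point is that each question explicitly spans more than one agent's video.

```python
# Hypothetical sketch of one MA-EgoQA question record.
# Field names are illustrative assumptions, not the paper's real schema.
question = {
    "id": "q_0042",
    "category": "theory_of_mind",  # one of the five question categories
    "question": "What did Alice think Jake was doing while she was cooking?",
    "agents_involved": ["alice", "jake"],  # answering needs multiple agents' videos
    "day": 3,
}

# Answering requires cross-referencing every video stream listed in
# "agents_involved", not just one agent's recording.
relevant_streams = [
    f"videos/{agent}_day{question['day']}.mp4"
    for agent in question["agents_involved"]
]
print(relevant_streams)
```

Note how the record itself encodes the core difficulty of the benchmark: no single stream in `relevant_streams` contains the full answer.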

3. The New Strategy: The "Librarian" Approach (EgoMAS)

Instead of trying to force the AI to read everything at once, the researchers proposed a new method called EgoMAS. Think of it as a smart Librarian rather than a super-fast reader.

Here is how EgoMAS works:

  1. The Shared Memory (The Index): Instead of keeping the raw video, the system creates a "summary index" of events. It writes down: "At 2:00 PM, Jake was in the kitchen cooking, while Alice was in the living room dancing." It organizes this by Who, What, Where, When, and How.
  2. The Dynamic Retrieval (The Search): When you ask a question, the Librarian doesn't read the whole book.
    • It first checks the Shared Index to find the relevant time and place.
    • Then, it sends a targeted request to the specific robot that was there: "Hey Jake, what were you doing at 2:00 PM?"
    • It combines these specific answers to give you the final result.
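The three steps above can be sketched in a few lines of code. This is a toy illustration of the "Librarian" idea, assuming a simple event index with who/what/where/when fields; the class names, fields, and entries are hypothetical, not EgoMAS's actual data structures.

```python
from dataclasses import dataclass

# Toy sketch of the Librarian pattern: a compact shared index of events
# replaces raw video, and questions are answered by filtering the index
# before querying any specific agent. All names here are hypothetical.

@dataclass
class Event:
    agent: str   # who
    action: str  # what
    place: str   # where
    time: str    # when

# 1. The shared memory: short textual summaries instead of raw footage.
shared_index = [
    Event("jake",  "cooking pasta",  "kitchen",     "Day 3 14:00"),
    Event("alice", "dancing",        "living room", "Day 3 14:00"),
    Event("jake",  "washing dishes", "kitchen",     "Day 3 15:00"),
]

def retrieve(index, **filters):
    """2. Dynamic retrieval: keep only events matching every filter."""
    return [e for e in index
            if all(getattr(e, key) == value for key, value in filters.items())]

# 3. Targeted follow-up: only the matching agent's video gets re-examined.
hits = retrieve(shared_index, agent="jake", time="Day 3 14:00")
for e in hits:
    print(f"Ask {e.agent}'s video: what happened in the {e.place} at {e.time}?")
```

The key design choice this illustrates is that the expensive step (re-watching video) happens only after the cheap step (filtering a small text index) has narrowed things down to one agent and one moment.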

The Analogy:

  • Old Way: Trying to solve a mystery by reading every single page of a 1,000-page novel at the same time. You get lost in the details.
  • EgoMAS Way: Using the Table of Contents to find the right chapter, then asking the specific character involved in that scene what happened.

4. The Results: Why It Matters

The researchers tested this against the world's most powerful AI models (like Gemini and GPT-5).

  • The Result: The "Big Brains" (the massive AI models) struggled, getting only about 37% of the answers right. They were too distracted by the sheer volume of data.
  • The Winner: EgoMAS, even with a smaller, simpler brain, scored significantly higher (over 41%).
  • The Lesson: It's not about having the biggest memory; it's about having the best way to organize and retrieve that memory.

Summary

This paper tells us that for AI teams to work together effectively in the real world, they can't just be "smart." They need to be organized. They need a system that can:

  1. Summarize long periods of time.
  2. Connect the actions of different people.
  3. Retrieve the exact right piece of information when asked.

Without this kind of "system-level understanding," our future teams of robots will be great at doing tasks but terrible at explaining what happened or why. MA-EgoQA is the first step toward building robots that can truly collaborate and communicate like a human team.