MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

This paper introduces MA-EgoQA, a benchmark and dataset of 1,700 questions across five categories that evaluates whether AI models can understand and reason over multiple long-horizon egocentric videos from embodied agents. It also proposes a baseline model, EgoMAS, whose results highlight current limitations in system-level multi-agent understanding.

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

Published Wed, 11 Ma

Imagine you are the manager of a busy household with six different robots living with you. Each robot has a camera strapped to its head (like a GoPro) and records everything it sees, hears, and does, 24/7, for a whole week.

Now, imagine you walk in and ask a simple question:

"Who was the last person to use the paper towel on Day 5, and why did they think the coffee machine was broken?"

To answer this, you can't just ask one robot. You have to piece together clues from all six robots, cross-reference their memories, figure out who was talking to whom, and understand what they were thinking at the time.

This is exactly the problem the paper MA-EgoQA tackles. Here is the breakdown in simple terms:

1. The Problem: The "Too Much Information" Bottleneck

In the future, we won't just have one smart assistant; we'll have teams of them (in homes, factories, or hospitals).

  • The Challenge: If you have six robots recording video for seven days, that's 266 hours of video. That's like watching 11 days of non-stop TV!
  • The Current Failure: Even the smartest AI models today get overwhelmed. If you feed them all that video at once, they get confused, like a student trying to read six different textbooks simultaneously while someone is shouting questions at them. They miss the small details, forget who said what, and can't connect the dots between different people's perspectives.

2. The Solution: A New "Exam" (MA-EgoQA)

The researchers created a new test called MA-EgoQA (Multi-Agent Egocentric Question Answering).

  • The Dataset: They used real video data from six people living in a house for a week.
  • The Questions: They generated 1,700 tricky questions that require looking at multiple people's videos to answer.
    • Example: "What did Alice think Jake was doing while she was cooking?" (This requires Theory of Mind—guessing what someone else is thinking).
    • Example: "Who coordinated the cleaning of the kitchen while the others were dancing?" (This requires Task Coordination).
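To make the dataset concrete, here is a minimal sketch of what a single question record might look like. The field names and file paths are illustrative assumptions, not the paper's actual schema; the point is that each question explicitly spans more than one agent's video.

```python
# Hypothetical sketch of one MA-EgoQA question record.
# Field names are illustrative assumptions, not the paper's real schema.
question = {
    "id": "q_0042",
    "category": "theory_of_mind",  # one of the five question categories
    "question": "What did Alice think Jake was doing while she was cooking?",
    "agents_involved": ["alice", "jake"],  # answering needs multiple agents' videos
    "day": 3,
}

# Answering requires cross-referencing every video stream listed in
# "agents_involved", not just one agent's recording.
relevant_streams = [
    f"videos/{agent}_day{question['day']}.mp4"
    for agent in question["agents_involved"]
]
print(relevant_streams)
```

Note how the record itself encodes the core difficulty of the benchmark: no single stream in `relevant_streams` contains the full answer.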

3. The New Strategy: The "Librarian" Approach (EgoMAS)

Instead of trying to force the AI to read everything at once, the researchers proposed a new method called EgoMAS. Think of it as a smart Librarian rather than a super-fast reader.

Here is how EgoMAS works:

  1. The Shared Memory (The Index): Instead of keeping the raw video, the system creates a "summary index" of events. It writes down: "At 2:00 PM, Jake was in the kitchen cooking, while Alice was in the living room dancing." It organizes this by Who, What, Where, When, and How.
  2. The Dynamic Retrieval (The Search): When you ask a question, the Librarian doesn't read the whole book.
    • It first checks the Shared Index to find the relevant time and place.
    • Then, it sends a targeted request to the specific robot that was there: "Hey Jake, what were you doing at 2:00 PM?"
    • It combines these specific answers to give you the final result.
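The three steps above can be sketched in a few lines of code. This is a toy illustration of the "Librarian" idea, assuming a simple event index with who/what/where/when fields; the class names, fields, and entries are hypothetical, not EgoMAS's actual data structures.

```python
from dataclasses import dataclass

# Toy sketch of the Librarian pattern: a compact shared index of events
# replaces raw video, and questions are answered by filtering the index
# before querying any specific agent. All names here are hypothetical.

@dataclass
class Event:
    agent: str   # who
    action: str  # what
    place: str   # where
    time: str    # when

# 1. The shared memory: short textual summaries instead of raw footage.
shared_index = [
    Event("jake",  "cooking pasta",  "kitchen",     "Day 3 14:00"),
    Event("alice", "dancing",        "living room", "Day 3 14:00"),
    Event("jake",  "washing dishes", "kitchen",     "Day 3 15:00"),
]

def retrieve(index, **filters):
    """2. Dynamic retrieval: keep only events matching every filter."""
    return [e for e in index
            if all(getattr(e, key) == value for key, value in filters.items())]

# 3. Targeted follow-up: only the matching agent's video gets re-examined.
hits = retrieve(shared_index, agent="jake", time="Day 3 14:00")
for e in hits:
    print(f"Ask {e.agent}'s video: what happened in the {e.place} at {e.time}?")
```

The key design choice this illustrates is that the expensive step (re-watching video) happens only after the cheap step (filtering a small text index) has narrowed things down to one agent and one moment.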

The Analogy:

  • Old Way: Trying to solve a mystery by reading every single page of a 1,000-page novel at the same time. You get lost in the details.
  • EgoMAS Way: Using the Table of Contents to find the right chapter, then asking the specific character involved in that scene what happened.

4. The Results: Why It Matters

The researchers tested this against the world's most powerful AI models (like Gemini and GPT-5).

  • The Result: The "Big Brains" (the massive AI models) struggled, getting only about 37% of the answers right. They were too distracted by the sheer volume of data.
  • The Winner: EgoMAS, even with a smaller, simpler brain, scored significantly higher (over 41%).
  • The Lesson: It's not about having the biggest memory; it's about having the best way to organize and retrieve that memory.

Summary

This paper tells us that for AI teams to work together effectively in the real world, they can't just be "smart." They need to be organized. They need a system that can:

  1. Summarize long periods of time.
  2. Connect the actions of different people.
  3. Retrieve the exact right piece of information when asked.

Without this kind of "system-level understanding," our future teams of robots will be great at doing tasks but terrible at explaining what happened or why. MA-EgoQA is the first step toward building robots that can truly collaborate and communicate like a human team.