Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

This paper introduces EcoG-Bench, a rigorous bilingual benchmark for egocentric co-speech grounding. It reveals a large performance gap between humans and state-of-the-art multimodal large language models (MLLMs), and shows that multimodal interface limitations, rather than reasoning deficits, are what hinder the alignment of speech with pointing gestures in situated collaboration.

Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang

Published 2026-03-10

Imagine you are playing a game of "Simon Says" with a robot, but with a twist: you aren't allowed to give full instructions.

The Problem: The "Vague Commander"

In most robot training games today, humans give very specific orders like, "Pick up the red apple on the left side of the table." The robot just reads the text, finds the red apple, and grabs it. It's like solving a crossword puzzle where every clue is spelled out.

But in real life, humans are lazy (in a good way!). We follow the "Principle of Least Effort." We don't say, "Pick up the red apple." We say, "Pick up that," while pointing at it.

The paper calls this "Deictic Grounding." It's when a word like "that" or "this" only makes sense because of when you say it and what you are doing with your hand at that exact moment. If you say "that" while pointing at a cup, it means the cup. If you say "that" a second later while pointing at a spoon, it means the spoon.
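The core of deictic grounding can be sketched as a timing problem: resolve each deictic word to whichever object is being pointed at when the word is spoken. Here is a minimal illustrative sketch (the function name, data format, and numbers are invented for this example, not taken from the paper):

```python
# Illustrative sketch: resolve a deictic word ("this"/"that") to the object
# whose pointing gesture is closest in time to the moment the word is spoken.

def resolve_deictic(word_time, pointing_events):
    """Return the object pointed at nearest in time to the spoken word.

    word_time: when the deictic word was spoken, in seconds.
    pointing_events: list of (timestamp_seconds, object_label) tuples.
    """
    if not pointing_events:
        return None
    nearest = min(pointing_events, key=lambda ev: abs(ev[0] - word_time))
    return nearest[1]

# "that" at 2.5 s resolves to the cup; "that" a second later, to the spoon.
events = [(2.5, "cup"), (3.5, "spoon")]
print(resolve_deictic(2.5, events))  # cup
print(resolve_deictic(3.4, events))  # spoon
```

The same word maps to different objects purely because of when it is uttered, which is exactly what makes this hard for models that don't track timing.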

The Challenge: The Robot's Blind Spot

The authors found that current AI systems (Multimodal Large Language Models, or MLLMs) are terrible at this. They are great at reading the text, but they are "deaf" to the timing of the gesture.

Think of it like this:

  • The Human: Sees a video of a person pointing at a strawberry and saying, "Put this in that." The human instantly knows: "This" = the strawberry at 2.5 seconds. "That" = the bowl at 3.1 seconds.
  • The AI: Hears "Put this in that." It sees a video with a strawberry and a bowl. It guesses. It might grab the wrong strawberry, or put it in the wrong bowl, or do it at the wrong time. It treats the video like a slideshow and the audio like a separate podcast, failing to sync them up perfectly.

The Solution: EcoG-Bench (The "Strict Teacher")

To fix this, the researchers built a new test called EcoG-Bench.

Imagine a strict teacher grading a student's homework. The teacher doesn't just care if the answer is "right" in a general sense. They care about three things happening simultaneously:

  1. WHAT: Did you pick the right object? (e.g., The strawberry, not the apple).
  2. WHERE: Did you point to the exact pixel on the screen where the object is?
  3. WHEN: Did you identify the exact millisecond the person pointed at it?

If you get the object right but point to the wrong spot, or get the timing wrong by a split second, you get a zero. This is called "Strict Executability."
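An all-or-nothing score like this can be sketched as a single gate over all three checks. The tolerances and data format below are invented for illustration; the paper's actual thresholds may differ:

```python
import math

# Illustrative sketch of "strict executability": a prediction scores 1 only
# if the object label, the 2-D point, AND the timestamp are all correct
# within tolerance; any single miss scores 0. Tolerances here are made up.

def strict_score(pred, gold, px_tol=20.0, sec_tol=0.5):
    """pred/gold: dicts with 'label', 'point' (x, y), and 'time' (seconds)."""
    if pred["label"] != gold["label"]:          # WHAT: wrong object -> 0
        return 0
    dx = pred["point"][0] - gold["point"][0]
    dy = pred["point"][1] - gold["point"][1]
    if math.hypot(dx, dy) > px_tol:             # WHERE: too far off -> 0
        return 0
    if abs(pred["time"] - gold["time"]) > sec_tol:  # WHEN: out of sync -> 0
        return 0
    return 1

gold = {"label": "strawberry", "point": (320, 240), "time": 2.5}
good = {"label": "strawberry", "point": (325, 238), "time": 2.6}
late = {"label": "strawberry", "point": (325, 238), "time": 4.0}
print(strict_score(good, gold))  # 1
print(strict_score(late, gold))  # 0
```

Gating on all three criteria is what makes the benchmark "strict": partial credit for getting two out of three right is deliberately withheld.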

The Results: The "Gap"

The results were shocking:

  • Humans: Scored nearly 97%. We are naturals at this.
  • Top AI Models: Scored around 17%. They are failing miserably.

Why? Because the AI is trying to guess the meaning of "that" without really listening to the rhythm of the pointing. It's like trying to dance to a song without hearing the beat; you might know the steps, but you'll be out of sync.

The "Magic Fix": Giving the AI a Clue

Here is the most interesting part. The researchers asked: "Is the AI stupid, or is the way we are showing it the video the problem?"

They ran a diagnostic test. Instead of giving the AI a raw video file (which is like a blurry, fast-moving stream), they gave it:

  1. Still photos taken at specific times (like a comic book).
  2. A transcript of the speech that included exact timestamps for every word (like a karaoke screen showing exactly when each word appears).
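The structured input described above can be sketched as timestamped keyframes plus a word-level transcript, rendered into plain text a model can read. The format and names below are illustrative assumptions, not the paper's actual interface:

```python
# Illustrative sketch of "structured input": instead of a raw video stream,
# the model gets still frames sampled at known times plus a transcript with
# per-word timestamps, so the timing cues are explicit rather than hidden.

transcript = [
    {"word": "Put",  "start": 2.1},
    {"word": "this", "start": 2.5},
    {"word": "in",   "start": 2.9},
    {"word": "that", "start": 3.1},
]
keyframes = [2.0, 2.5, 3.0, 3.5]  # seconds at which still frames were taken

def build_prompt(transcript, keyframes):
    """Render keyframe times and timestamped words as plain text."""
    lines = [f"[frame @ {t:.1f}s]" for t in keyframes]
    lines.append("Transcript: " + " ".join(
        f"{w['word']}({w['start']:.1f}s)" for w in transcript))
    return "\n".join(lines)

print(build_prompt(transcript, keyframes))
```

With the timestamps spelled out like this, aligning "this" at 2.5 s to the frame captured at 2.5 s becomes a text-matching task rather than a perception task, which is one plausible reading of why performance jumps.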

The Result? The AI's performance jumped from 17% to 43%.

The Metaphor:
Imagine trying to catch a ball thrown in the dark.

  • Raw Video (The Old Way): You are in the dark, trying to catch the ball by feeling the air. You miss a lot.
  • Structured Input (The New Way): Someone shines a flashlight on the ball and tells you, "The ball is coming at 3 seconds." Suddenly, you catch it much more often.

The AI isn't necessarily "dumb"; it just needs the timing cues to be highlighted. The current video interfaces hide the precise moment the gesture happens, making it impossible for the AI to learn the connection between the word "this" and the finger pointing at it.

The Big Picture

This paper is a wake-up call. It tells us that to build robots that can truly collaborate with humans (like a co-worker or a dance partner), we can't just make them smarter at reading. We have to teach them to listen with their eyes.

We need to build systems that understand that "this" isn't just a word; it's a moment in time where a hand moves and a voice speaks. Until we fix the "timing" part of the AI's brain, robots will always be clumsy partners in our daily lives.