Here is an explanation of the paper using simple language and everyday analogies.
The Big Picture: The "Guessing Game" Robot
Imagine a robot working in a kitchen with a human. The robot needs to know what the human is going to do before they actually finish the action, so it can help them.
- The Problem: The robot only sees the first few seconds of the action (the "prefix"). It's like seeing someone reach for a handle and having to guess if they are opening the fridge, the oven, or a cupboard.
- The Danger: If the robot guesses too confidently and is wrong, it might grab the wrong thing, spill something, or get in the way. This is dangerous.
- The Old Way: Most robots just pick the "best guess" (the top answer) and commit to it immediately. If the robot is 90% sure, it acts. But what if that 90% confidence is a lie? What if the robot is actually very confused?
The New Idea: "Decision-Aware" Uncertainty
This paper introduces a new way to test Vision-Language Models (VLMs). These are smart AI systems that can "see" a video and "read" a description to guess what's happening.
The authors argue that for robots to be safe, they shouldn't just ask, "What is the most likely action?" They should ask, "How sure are you, and should I wait or ask for help?"
The Core Experiment: The "Multiple Guesses" Trick
Since we can't peek inside the AI's brain to see its math, the researchers used a clever trick called Stochastic Sampling.
The Analogy: The Committee of Experts
Imagine you ask one expert, "What is this person doing?" They give you one answer. You don't know if they are guessing or sure.
So, the researchers asked the same AI model the same question 5 times in a row, but with a tiny bit of "randomness" (like rolling a die) each time.
- Run 1: "They are opening the fridge."
- Run 2: "They are opening the fridge."
- Run 3: "They are taking a bottle."
- Run 4: "They are opening the fridge."
- Run 5: "They are putting food away."
If the AI gives the same answer every time, it's confident. If it gives a different answer every time, it's uncertain.
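The agreement check above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's actual code; the function name and the five sample strings are my own.

```python
from collections import Counter

def sample_agreement(answers):
    """Given repeated samples of the model's answer to the same question,
    return the majority answer and its agreement rate (0 to 1).
    A rate near 1.0 suggests confidence; a low rate suggests uncertainty."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# The five hypothetical runs from the example above:
runs = ["open fridge", "open fridge", "take bottle",
        "open fridge", "put food away"]
answer, agreement = sample_agreement(runs)
# answer == "open fridge", agreement == 0.6 (3 of 5 runs agree)
```

A robot could treat the agreement rate itself as a rough confidence score: 5/5 identical answers means "act", 2/5 means "ask".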
The Three Ways to Combine the Answers
The researchers tried three different ways to combine these 5 guesses into one final decision:
- The "Voting" Method (Consistency): They just took the most common answer. If 3 out of 5 said "Fridge," the robot picks "Fridge."
- The "Weighted" Method: They listened to the AI's own confidence score. If the AI said "Fridge" with 99% confidence, that vote counted more than a guess made with 50% confidence.
- The "Pairwise" Method (PairRank): This is the most complex one. Instead of looking at the top answer, it looks at how the AI ranked all the options against each other (e.g., "Is Fridge better than Bottle? Is Bottle better than Oven?"). It builds a global map of preferences.
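The three strategies can be sketched as simple aggregation functions. These are illustrative toy versions, assuming the model returns answer strings, self-reported confidence scores, and ranked option lists; the paper's actual PairRank aggregation may be more sophisticated than the Borda-count-style tally used here.

```python
from collections import Counter, defaultdict

def vote(samples):
    """Consistency: pick the most common answer across the samples."""
    return Counter(samples).most_common(1)[0][0]

def weighted_vote(samples, confidences):
    """Weighted: each answer's votes are summed by self-reported confidence,
    so a 99%-sure guess counts roughly twice a 50%-sure one."""
    scores = defaultdict(float)
    for answer, conf in zip(samples, confidences):
        scores[answer] += conf
    return max(scores, key=scores.get)

def pairwise_rank(rankings):
    """Pairwise sketch: across sampled rankings (best option first),
    an option earns a point for every option it beats; the option
    that wins the most pairwise comparisons wins overall."""
    wins = defaultdict(int)
    for ranking in rankings:
        for position, option in enumerate(ranking):
            wins[option] += len(ranking) - 1 - position
    return max(wins, key=wins.get)

samples = ["fridge", "fridge", "bottle"]
confidences = [0.9, 0.6, 0.5]
rankings = [["fridge", "bottle", "oven"],
            ["fridge", "oven", "bottle"],
            ["bottle", "fridge", "oven"]]
# vote(samples) -> "fridge"; weighted_vote(...) -> "fridge" (1.5 vs 0.5)
# pairwise_rank(rankings) -> "fridge" (5 pairwise wins vs 3 and 1)
```

The point of the pairwise version is that it uses information the other two throw away: even a run where "fridge" came second still tells you "fridge beats oven".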
The Big Discovery: Accuracy vs. Safety
The researchers found something surprising: just because an AI is good at picking the right answer doesn't mean it knows when it's wrong.
- The "Sharp" Strategy (PairRank): This method was very decisive. It picked one answer and gave it a huge confidence score.
- Pros: It's great at filtering out bad guesses. If the robot is unsure, this method says "I don't know" very loudly.
- Cons: When it is wrong, it is overconfidently wrong. It might say "I'm 99% sure this is the fridge!" when it's actually the oven. This is dangerous for a robot.
- The "Smooth" Strategy (Voting/Weighted): These methods were more humble. They spread the confidence out.
- Pros: They are safer. If the robot is confused, it admits it by giving similar confidence to multiple options.
- Cons: The robot may struggle to commit to a single action, because the scores are spread so close together.
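The sharp-versus-smooth contrast can be made concrete with Shannon entropy, a standard way to measure how spread out a confidence distribution is (the specific numbers below are invented for illustration, not taken from the paper):

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a confidence distribution.
    Near 0 means all confidence is piled on one answer (sharp);
    higher values mean confidence is spread out (smooth)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

sharp  = {"fridge": 0.97, "oven": 0.02, "cupboard": 0.01}  # decisive
smooth = {"fridge": 0.40, "oven": 0.35, "cupboard": 0.25}  # humble

# entropy(sharp) is roughly 0.22 bits; entropy(smooth) roughly 1.56 bits.
# If "fridge" turns out to be wrong, the sharp model was confidently
# wrong, while the smooth model had already flagged its own doubt.
```

A downstream safety check can exploit this: high entropy is a machine-readable way of saying "I'm confused".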
The "Decision Gate" (The Safety Valve)
The paper proposes a new rule for robots called Confidence-Gated Interaction.
Instead of the robot just acting on the top guess, it checks its confidence score:
- High Confidence? -> Go ahead and act.
- Low Confidence? -> Stop! Ask the human, "Hey, are you opening the fridge or the oven?"
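The gate itself is a one-line rule. Here is a minimal sketch; the function name and the 0.8 threshold are my own, and in practice the threshold would have to be tuned per strategy, since sharp and smooth methods put their scores on very different scales.

```python
def confidence_gate(prediction, confidence, threshold=0.8):
    """Act on the prediction only if confidence clears the threshold;
    otherwise stop and hand the decision back to the human."""
    if confidence >= threshold:
        return ("act", prediction)
    return ("ask", f"Are you {prediction}? I'm only {confidence:.0%} sure.")

# High confidence: the robot proceeds on its own.
# confidence_gate("opening the fridge", 0.95) -> ("act", "opening the fridge")
# Low confidence: the robot defers to the human.
# confidence_gate("opening the fridge", 0.45) -> ("ask", ...)
```

Raising the threshold trades fewer wrong actions for more interruptions, which is exactly the sharp-versus-smooth tension described next.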
The study showed that different guessing methods (the three strategies above) change how often the robot asks for help.
- The "Sharp" method might ask for help too rarely (risking a crash).
- The "Smooth" method might ask for help too often (annoying the human).
The Takeaway
You can't just look at how often an AI gets the answer right (Accuracy). You have to look at how it handles uncertainty.
For a robot to work safely with humans, it needs to be a humble expert, not a confident guesser. It needs to know when to say, "I'm not sure, let's wait," rather than confidently doing the wrong thing. This paper gives us the tools to measure exactly how "humble" or "confident" an AI really is before we let it drive a robot.