IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

The paper introduces IRIS, a training-free method that leverages real-time eye-tracking data to resolve ambiguities in open-ended Visual Question Answering, more than doubling response accuracy on ambiguous queries across a range of Large Vision-Language Models.

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

Published 2026-02-19

Imagine you are looking at a busy photo of a kitchen counter. There are three different apples, two mugs, and a banana. You point at the photo and ask a smart AI, "What color is that?"

The AI is confused. It doesn't know which "that" you mean. Is it the red apple? The green one? The brown mug? In the world of Artificial Intelligence, this is called referential ambiguity. The AI is like a helpful but slightly absent-minded librarian who knows everything about books but can't tell which book you are pointing at when you just say, "I want that one."

This paper introduces a new tool called IRIS (Intent Resolution via Inference-time Saccades) to solve this problem. Here is how it works, explained simply:

The Core Idea: "Look, Don't Just Listen"

For decades, scientists have known that when people speak, their eyes usually land on what they are talking about before or while they say the words. If you are about to ask about the red apple, your eyes likely darted to it a split second before you opened your mouth.

IRIS is a system that lets the AI "see" your eyes in real-time. Instead of just listening to your question, it watches where your eyes land. It uses your gaze as a secret hint to figure out exactly what you are talking about.

How It Works (The "Magic Trick")

Think of IRIS as a spotlight operator for the AI. (A short code sketch of this pipeline follows the list.)

  1. The Setup: You sit in front of a screen with a camera tracking your eyes. You look at a picture and ask a question out loud (e.g., "Is that healthy?").
  2. The Clue: As you ask the question, the system records exactly where your eyes were looking.
  3. The Filter: The system is smart enough to know that not all eye movements matter. It focuses only on the split second right around when you start speaking. It ignores the time you spent looking around the room before you decided to ask.
  4. The Boost: The system takes those specific eye locations and draws little "X" marks on the image, then shows this marked-up image to the AI.
  5. The Result: The AI sees the "X" marks, realizes, "Ah! The user is looking at the green apple, not the red one!" and gives you the correct answer.
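
To make these steps concrete, here is a minimal Python sketch of the filter-and-mark idea. The half-second window, the red "X" style, and all function names are illustrative assumptions for this summary, not the authors' exact parameters or code.

```python
# Minimal sketch of an IRIS-style pipeline (illustrative, not the
# authors' implementation). Window size and marker style are assumed.
from PIL import Image, ImageDraw

def filter_gaze_to_speech_onset(gaze_samples, speech_onset, window=0.5):
    """Keep only gaze points near the moment the user starts speaking.

    gaze_samples: list of (timestamp_sec, x_px, y_px) tuples
    speech_onset: timestamp in seconds when the question begins
    window:       seconds around onset to keep (assumed value)
    """
    return [(x, y) for t, x, y in gaze_samples
            if speech_onset - window <= t <= speech_onset + window]

def overlay_gaze_markers(image_path, points, size=12):
    """Draw a red 'X' at each retained gaze point and return the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x, y in points:
        draw.line([(x - size, y - size), (x + size, y + size)],
                  fill="red", width=4)
        draw.line([(x - size, y + size), (x + size, y - size)],
                  fill="red", width=4)
    return img
```

Notice that nothing here touches the model itself: the only thing that changes is the image it is shown.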

Why This Is a Big Deal

The researchers tested this with 500 different questions and 10 different types of advanced AI models. Here is what they found:

  • The "Ambiguity" Problem: When the question was vague (like "What is that?" with multiple options), the AI was only right about 35% of the time on its own. It was guessing.
  • The "IRIS" Solution: When they added the eye-tracking data, the AI's accuracy skyrocketed to 77%. They more than doubled the success rate!
  • No Training Required: The best part? They didn't have to re-teach the AI or change its brain. They just gave it a new piece of information (your eye movements) at the moment it was answering. It works like a plug-and-play upgrade for any existing smart AI (see the sketch after this list).
  • The "Unambiguous" Test: When the question was clear (e.g., "What color is the only apple?"), the eye-tracking didn't change anything. This shows the gaze hint doesn't interfere when the question already contains enough information on its own.
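
Here is a hedged sketch of that plug-and-play idea, reusing the two helpers from the earlier snippet. The `vlm` argument and the `query_vlm` name are hypothetical stand-ins for whatever off-the-shelf vision-language model you already have; nothing about the model is retrained.

```python
# Sketch of the inference-time, plug-and-play idea: only the input image
# changes, never the model. `vlm` is a hypothetical callable standing in
# for any existing vision-language model.

def answer_with_iris(vlm, image_path, question, gaze_samples, speech_onset):
    """Answer an ambiguous question using gaze-annotated input."""
    points = filter_gaze_to_speech_onset(gaze_samples, speech_onset)
    marked = overlay_gaze_markers(image_path, points)
    # The model is untouched; it simply sees a marked-up image.
    return vlm(marked, question)

# Hypothetical usage:
# answer = answer_with_iris(query_vlm, "kitchen.jpg",
#                           "What color is that?", gaze_log, onset_time)
```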

A Real-World Analogy

Imagine you are in a crowded room with a friend, and you both want to order a drink.

  • Without IRIS: You say, "I'll have the red one." The waiter looks at the bar and sees three red drinks (a soda, a cocktail, and a juice). The waiter has to guess, or ask, "Which red one?"
  • With IRIS: You say, "I'll have the red one," while your eyes are locked on the cocktail. The waiter (the AI) sees your eyes, instantly knows you mean the cocktail, and brings it to you without asking a single follow-up question.

The Future

The name IRIS is both an acronym (Intent Resolution via Inference-time Saccades) and a nod to the iris of the human eye. The authors believe this technology will be huge for Augmented Reality (AR) and Virtual Reality (VR). Imagine wearing smart glasses that know exactly what you are looking at and talking about, allowing you to have natural, fluid conversations with AI assistants without having to point or use complicated commands.

In short: IRIS teaches AI to pay attention to where you look, not just what you say, turning a confused robot into a mind-reading companion.
