IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

The paper introduces IRIS, a training-free method that leverages real-time eye-tracking data to resolve ambiguities in open-ended Visual Question Answering, more than doubling response accuracy on ambiguous queries across a range of Large Vision-Language Models.

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

Published 2026-02-19

Imagine you are looking at a busy photo of a kitchen counter. There are three different apples, two mugs, and a banana. You point at the photo and ask a smart AI, "What color is that?"

The AI is confused. It doesn't know which "that" you mean. Is it the red apple? The green one? The brown mug? In the world of Artificial Intelligence, this is called referential ambiguity. The AI is like a helpful but slightly absent-minded librarian who knows everything about books but can't tell which book you are pointing at when you just say, "I want that one."

This paper introduces a new tool called IRIS (Intent Resolution via Inference-time Saccades) to solve this problem. Here is how it works, explained simply:

The Core Idea: "Look, Don't Just Listen"

For decades, scientists have known that when people speak, their eyes usually land on what they are talking about before or while they say the words. If you are about to ask about the red apple, your eyes likely darted to it a split second before you opened your mouth.

IRIS is a system that lets the AI "see" your eyes in real-time. Instead of just listening to your question, it watches where your eyes land. It uses your gaze as a secret hint to figure out exactly what you are talking about.

How It Works (The "Magic Trick")

Think of IRIS as a spotlight operator for the AI. (A short code sketch of this pipeline follows the list.)

  1. The Setup: You sit in front of a screen with a camera tracking your eyes. You look at a picture and ask a question out loud (e.g., "Is that healthy?").
  2. The Clue: As you ask the question, the system records exactly where your eyes were looking.
  3. The Filter: The system is smart enough to know that not all eye movements matter. It focuses only on the split second right around when you start speaking. It ignores the time you spent looking around the room before you decided to ask.
  4. The Boost: The system takes those specific eye locations and draws little "X" marks on the image, then shows this marked-up image to the AI.
  5. The Result: The AI sees the "X" marks, realizes, "Ah! The user is looking at the green apple, not the red one!" and gives you the correct answer.
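
To make these steps concrete, here is a minimal Python sketch of the filter-and-mark idea. The half-second window, the red "X" style, and all function names are illustrative assumptions for this summary, not the authors' exact parameters or code.

```python
# Minimal sketch of an IRIS-style pipeline (illustrative, not the
# authors' implementation). Window size and marker style are assumed.
from PIL import Image, ImageDraw

def filter_gaze_to_speech_onset(gaze_samples, speech_onset, window=0.5):
    """Keep only gaze points near the moment the user starts speaking.

    gaze_samples: list of (timestamp_sec, x_px, y_px) tuples
    speech_onset: timestamp in seconds when the question begins
    window:       seconds around onset to keep (assumed value)
    """
    return [(x, y) for t, x, y in gaze_samples
            if speech_onset - window <= t <= speech_onset + window]

def overlay_gaze_markers(image_path, points, size=12):
    """Draw a red 'X' at each retained gaze point and return the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x, y in points:
        draw.line([(x - size, y - size), (x + size, y + size)],
                  fill="red", width=4)
        draw.line([(x - size, y + size), (x + size, y - size)],
                  fill="red", width=4)
    return img
```

Notice that nothing here touches the model itself: the only thing that changes is the image it is shown.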

Why This Is a Big Deal

The researchers tested this with 500 different questions and 10 different types of advanced AI models. Here is what they found:

  • The "Ambiguity" Problem: When the question was vague (like "What is that?" with multiple options), the AI was only right about 35% of the time on its own. It was guessing.
  • The "IRIS" Solution: When they added the eye-tracking data, the AI's accuracy skyrocketed to 77%. They more than doubled the success rate!
  • No Training Required: The best part? They didn't have to re-teach the AI or change its brain. They just gave it a new piece of information (your eye movements) at the moment it was answering. It works like a plug-and-play upgrade for any existing smart AI (see the sketch after this list).
  • The "Unambiguous" Test: When the question was clear (e.g., "What color is the only apple?"), the eye-tracking didn't change anything. This shows the gaze hint doesn't interfere when the question already contains enough information on its own.
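
Here is a hedged sketch of that plug-and-play idea, reusing the two helpers from the earlier snippet. The `vlm` argument and the `query_vlm` name are hypothetical stand-ins for whatever off-the-shelf vision-language model you already have; nothing about the model is retrained.

```python
# Sketch of the inference-time, plug-and-play idea: only the input image
# changes, never the model. `vlm` is a hypothetical callable standing in
# for any existing vision-language model.

def answer_with_iris(vlm, image_path, question, gaze_samples, speech_onset):
    """Answer an ambiguous question using gaze-annotated input."""
    points = filter_gaze_to_speech_onset(gaze_samples, speech_onset)
    marked = overlay_gaze_markers(image_path, points)
    # The model is untouched; it simply sees a marked-up image.
    return vlm(marked, question)

# Hypothetical usage:
# answer = answer_with_iris(query_vlm, "kitchen.jpg",
#                           "What color is that?", gaze_log, onset_time)
```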

A Real-World Analogy

Imagine you are in a crowded room with a friend, and you both want to order a drink.

  • Without IRIS: You say, "I'll have the red one." The waiter looks at the bar and sees three red drinks (a soda, a cocktail, and a juice). The waiter has to guess, or ask, "Which red one?"
  • With IRIS: You say, "I'll have the red one," while your eyes are locked on the cocktail. The waiter (the AI) sees your eyes, instantly knows you mean the cocktail, and brings it to you without asking a single follow-up question.

The Future

The name IRIS is both an acronym (Intent Resolution via Inference-time Saccades) and a nod to the iris of the human eye. The authors believe this technology will be huge for Augmented Reality (AR) and Virtual Reality (VR). Imagine wearing smart glasses that know exactly what you are looking at and talking about, allowing you to have natural, fluid conversations with AI assistants without having to point or use complicated commands.

In short: IRIS teaches AI to pay attention to where you look, not just what you say, turning a confused robot into a mind-reading companion.
