Imagine you are talking to a very smart, but slightly rigid, robot assistant. You show it a picture of a messy living room with three different sofas, two coffee tables, and a pile of books.
You ask the robot: "What color is this sofa?"
A human would immediately realize you haven't pointed to a specific sofa. They might say, "Which one? The blue one by the window, the beige one near the TV, or the red one in the corner?" Or, if you were pointing at the blue one, they would just say, "It's blue."
But current AI models (Vision-Language Models) often act like a confident but clueless tourist. Even though there are three sofas, they might just pick one at random and say, "It's blue!" They are so eager to give an answer that they ignore the fact that the question was actually impossible to answer correctly without more info.
This paper, titled rAQUA, introduces a new way to teach these robots how to handle confusion, just like a human would.
The Problem: The "Overconfident Robot"
Right now, most AI models are trained on clear, simple questions like, "What color is the sky?" (Answer: Blue). They are great at that. But in the real world, questions are often messy and vague.
The researchers found that even the strongest AI models (like GPT-5 or Gemini) fail when faced with ambiguity. They can't tell when to:
- Just answer (if it's obvious).
- Guess based on context (if it's likely).
- List all options (if there are a few choices).
- Ask for clarification (if it's totally confusing).
Instead, they usually just guess, acting overconfident even when they are lost.
The Solution: The "Ambiguity Menu" (AQUA)
To fix this, the authors created a new dataset called AQUA (Ambiguous Visual Question Answering). Think of this dataset as a training manual that teaches the robot four specific "strategies" for dealing with confusion, depending on how bad the confusion is.
They categorize every confusing question into four levels:
- Level 0: The Clear Day.
- Scenario: You ask, "What is on the plate?" and there is only one pizza.
- Robot Strategy: Just answer. "It's a pizza." No thinking needed.
- Level 1: The "Obvious" Hint.
- Scenario: You ask, "What is this made of?" in a room full of furniture, but there is only one giant, shiny wooden table in the center.
- Robot Strategy: Use common sense. "This" clearly means the big table. Answer: "It's wood."
- Level 2: The "Menu" Situation.
- Scenario: You ask, "What color are these apples?" and there are exactly two apples: one red, one green.
- Robot Strategy: Don't guess! List them all. "One is red, and the other is green."
- Level 3: The "Help Me" Situation.
- Scenario: You ask, "What color is this car?" in a parking lot with 20 identical red cars.
- Robot Strategy: Stop! You can't guess. Politely ask, "Which car do you mean? The one on the left or the right?"
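The four levels above boil down to a simple decision rule: count how many objects the question could refer to, check whether context singles one out, and pick a strategy. Here is a toy sketch of that rule in Python; the function name, the "few options" cutoff, and the strategy labels are illustrative, since the paper's levels are assigned by annotators rather than a hard-coded rule.

```python
def choose_strategy(num_matches: int, context_singles_out_one: bool = False) -> str:
    """Map the number of plausible referents for a question onto one of the
    four AQUA-style response strategies (a simplification for illustration)."""
    if num_matches == 1:
        return "answer"            # Level 0: one referent, just answer
    if context_singles_out_one:
        return "infer_and_answer"  # Level 1: context makes one referent obvious
    if num_matches <= 3:           # hypothetical cutoff for "a few" options
        return "enumerate"         # Level 2: list an answer for each candidate
    return "ask_clarification"     # Level 3: too many candidates, ask which one

# The scenarios above, in order:
print(choose_strategy(1))                                 # one pizza
print(choose_strategy(5, context_singles_out_one=True))   # the giant table
print(choose_strategy(2))                                 # two apples
print(choose_strategy(20))                                # 20 identical cars
```

The point of the sketch is that "handling ambiguity" is not one skill but a branching decision, and the model has to learn where the branches are.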
The Training: Teaching the Robot to "Think"
The researchers took open-source AI models and trained them on this new dataset. They didn't just teach them the answers; they taught them the strategy.
They used a two-step training process:
- Supervised Fine-Tuning (SFT): Like a teacher showing the student the right answers for each level. "If you see 20 cars, ask a question. If you see 1, answer."
- GRPO (Group Relative Policy Optimization, the "Reward System"): This is like a video game. The AI tries to answer. If it picks the right strategy (e.g., asking for help when it should), it gets a "high score." If it guesses wrong, it gets a penalty. This teaches the model to choose the right behavior automatically.
The Results: From Clueless to Clever
When they tested the new models:
- Before: The models were like a bull in a china shop. They would smash through ambiguity by giving confident, wrong answers.
- After: The models became like a skilled waiter. If the order is clear, they bring the food. If the customer is pointing at two dishes, they ask, "Which one?" If there are three, they list the options.
Even small models trained on this new method beat much larger, expensive, "closed-source" models (like the ones from big tech companies) at handling confusion.
The Big Takeaway
The paper proves that being smart isn't just about knowing facts; it's about knowing when you don't know.
By teaching AI models to recognize different types of confusion and respond with the right strategy (answering, listing, or asking), we can make them much more useful in the real world, where questions are rarely perfect and pictures are rarely simple. It's the difference between a robot that blindly guesses and a robot that actually understands the conversation.