Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment

Imagine you are hiring a new art critic to judge the quality of photographs. You want this critic to be as reliable as a seasoned human expert. However, you've noticed two major problems with your current AI critics:

The "Wobbly" Critic: Sometimes the AI is super confident, giving a clear score like "4.5/5." Other times, it's totally confused, giving wildly different scores like "2.0," "4.8," and "3.1" for the same picture depending on how you ask. The current training methods treat these "wobbly" guesses just as seriously as the confident ones, which messes up the learning process.
The "Text-Only" Critic: The AI is great at writing fancy descriptions about a photo ("The lighting is warm, the composition is balanced..."), but it often ignores the actual visual flaws. It might give a high score to a blurry photo just because it wrote a nice paragraph about the idea of the photo, rather than actually seeing that the image is fuzzy.

Enter Q-Hawkeye.

Think of Q-Hawkeye as a super-vision training program designed to fix these two flaws. It uses a clever training method called "Reinforcement Learning" (think of it as a game where the AI gets points for good answers and loses points for bad ones), but it adds two special "power-ups" to make the AI smarter and more reliable.

Power-Up 1: The "Confidence Filter" (Uncertainty-Aware Optimization)

The Analogy: Imagine a classroom where students are taking a test.

Student A answers every question with a steady hand and a clear voice.
Student B is shaking, guessing wildly, and giving different answers every time you ask the same question.

In the old training method, the teacher (the AI trainer) would give Student A and Student B equal weight when correcting their mistakes. This is bad because Student B's wild guesses just add noise and confusion.

How Q-Hawkeye fixes it:
Q-Hawkeye acts like a smart teacher who notices Student B is shaking. It says, "Okay, Student B, your answer is too shaky. I'm going to listen to you less right now so your confusion doesn't mess up the whole class."

It asks the AI to look at the same photo multiple times (like taking a poll).
If the AI gives different scores each time (high uncertainty), Q-Hawkeye turns down the volume on that lesson.
If the AI is consistent (low uncertainty), it turns up the volume.
Result: The AI learns mostly from its stable, confident moments and pays much less attention to the noisy, confusing ones.

Power-Up 2: The "Blindfold Test" (Perception-Aware Optimization)

The Analogy: Imagine you are teaching someone to spot a fake painting.

Old Method: You show them a fake painting and ask, "Is this good?" They might say, "It looks like a sunset, sunsets are nice, so I'll give it a 5." They are judging based on the story of the painting, not the paint itself.
Q-Hawkeye's Method: You show them the original, beautiful painting. Then, you show them a version where you've smeared the paint, added scratches, and made it blurry. You ask, "What about this one?"

If the AI is truly "seeing" the image, it should immediately say, "Whoa, this one is terrible! It's blurry and scratched!"
If the AI is just guessing based on text, it might say, "It's still a sunset, so it's a 4.8."

How Q-Hawkeye fixes it:
Q-Hawkeye puts the AI through a "Blindfold Test" (strictly speaking nothing is hidden, so "Distortion Test" is the more accurate name).

It shows the AI a clean photo and a damaged version of the same photo.
It demands that the AI's reaction to the damaged photo be drastically different from the clean one.
If the AI gives them similar scores, it gets a penalty. It has to prove it can actually see the difference between a clear image and a blurry one.
Result: The AI stops relying on "textbook descriptions" and starts paying attention to the actual pixels, noise, and blur. It learns to trust its eyes, not just its vocabulary.

The Grand Result

By combining these two strategies, Q-Hawkeye creates an Image Quality Assessment AI that is:

Stable: It doesn't flip-flop between scores.
Visual: It actually looks at the picture to judge it, not just the words it writes about it.
General: It works great on new types of photos it hasn't seen before, whether they are AI-generated, taken with a shaky phone, or heavily edited.

In short, Q-Hawkeye teaches the AI to be a reliable, sharp-eyed judge rather than a confident but confused guesser. It's like upgrading from a student who memorized the answer key to a master artist who can spot a flaw in a painting from a mile away.

1. Problem Statement

Image Quality Assessment (IQA) aims to predict perceptual quality scores that align with human judgments. While recent methods leveraging Multimodal Large Language Models (MLLMs) and Reinforcement Learning (RL) have shown promise, the authors identify two critical reliability limitations in existing RL-based approaches (specifically those using Group Relative Policy Optimization, or GRPO):

Uniform Advantage Weighting Ignoring Uncertainty: Existing GRPO-based methods apply uniform update strengths across all training samples. However, the model's predictive stability varies significantly; some images yield consistent scores, while others produce broad, unstable distributions. Treating these unstable samples equally injects noisy gradients into the optimization process, undermining training reliability.
Over-reliance on Textual Reasoning: Current methods often prioritize text-grounded reasoning and description generation over actual visual perception. Consequently, models may rely on dataset regularities or language priors rather than genuine visual evidence, leading to scores that are not truly grounded in the image content (e.g., assigning high scores to degraded images if the text description is plausible).

2. Methodology: Q-Hawkeye

The authors propose Q-Hawkeye, a reliable visual policy optimization framework built upon the Qwen2.5-VL-7B model. It redesigns the RL learning signal through two core strategies: Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization.

A. Uncertainty-Aware Dynamic Optimization

To address the issue of unstable samples, Q-Hawkeye introduces a mechanism to estimate and mitigate predictive uncertainty.

Uncertainty Estimation: For each input image, the model performs $K$ rollouts (generating $K$ reasoning trajectories and scores). The uncertainty $u$ is estimated as the variance of these predicted scores within the group.
Dynamic Reweighting: A sample-specific weight $w(u) = \exp(-\tau \tilde{u})$ is calculated from the normalized uncertainty $\tilde{u}$.
- Low Uncertainty (Stable samples): Receive higher weights, reinforcing reliable judgments.
- High Uncertainty (Unstable samples): Receive lower weights, suppressing their gradient contribution to prevent noisy updates.
Integration: This weight is applied to the advantage term in the GRPO objective, $\tilde{A}_k = w \cdot A_k$, which damps noisy signals during policy updates rather than treating every sample equally.
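The reweighting steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the min-max normalization of $u$ across the batch, the value of `tau`, and the toy scores and rewards are all assumptions made for the example (the summary only says the uncertainty is "normalized").

```python
import math
from statistics import mean, pstdev, pvariance


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uncertainty_weights(score_groups, tau=1.0, eps=1e-6):
    """One weight per image: w = exp(-tau * u_tilde), u = variance of K scores."""
    u = [pvariance(scores) for scores in score_groups]
    lo, hi = min(u), max(u)
    # Assumption: min-max normalization over the batch; the source does not
    # specify how the uncertainty is normalized.
    u_tilde = [(x - lo) / (hi - lo + eps) for x in u]
    return [math.exp(-tau * x) for x in u_tilde]


def reweighted_advantages(score_groups, reward_groups, tau=1.0):
    """Scale every advantage in a group by that image's uncertainty weight."""
    weights = uncertainty_weights(score_groups, tau)
    return [
        [w * a for a in group_advantages(rewards)]
        for w, rewards in zip(weights, reward_groups)
    ]


# Toy batch: one stable image and one "wobbly" image, K = 4 rollouts each.
scores = [[4.4, 4.5, 4.6, 4.5], [2.0, 4.8, 3.1, 4.0]]
rewards = [[1.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]  # hypothetical rewards

weights = uncertainty_weights(scores)
advantages = reweighted_advantages(scores, rewards)
print(weights)
print(advantages)
```

In this toy batch the stable image keeps its full weight while the wobbly one is scaled down, so its rollouts move the policy less. A larger `tau` would make that suppression steeper.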

B. Perception-Aware Optimization

To ensure the model relies on visual evidence rather than text priors, the authors introduce a loss function that explicitly enforces sensitivity to visual degradations.

Paired Data Construction: For every training image, a degraded version $I_{deg}$ is generated using specific distortions (Noise, Blur, JPEG, Darken). A "double-check filter" (using an MLLM and human experts) ensures the degradation is perceptible.
Implicit Perception Loss: The model is trained to produce distinguishable output distributions for the original image $I$ and the degraded image $I_{deg}$.
- KL Divergence Maximization: The objective maximizes the KL divergence between the policy distributions $\pi_\theta(\cdot|I, q)$ and $\pi_\theta(\cdot|I_{deg}, q)$. This forces the model to react to visual changes.
- Double Entropy Regularization: To prevent the model from simply increasing randomness (high entropy) to maximize the KL term, an entropy regularization term is added to constrain the output distributions under both conditions, ensuring sharp and stable predictions.
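To make the interaction between the KL term and the entropy term concrete, here is a small sketch in plain Python. It is a simplification under stated assumptions: the policy output is reduced to a single categorical distribution over five hypothetical score bins (in the real model the distributions are over generated tokens), and the coefficient `gamma`, the KL direction, and all function names are illustrative choices rather than values taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def perception_loss(logits_orig, logits_deg, gamma=0.01):
    """Loss to minimize: reward separation (KL), penalize diffuse outputs."""
    p = softmax(logits_orig)  # stands in for pi_theta(. | I, q)
    q = softmax(logits_deg)   # stands in for pi_theta(. | I_deg, q)
    # Assumption: gamma and the KL(p || q) direction are illustrative only.
    return -kl_divergence(p, q) + gamma * (entropy(p) + entropy(q))


# Hypothetical logits over score bins 1..5.
clean = [0.1, 0.2, 0.5, 2.0, 4.0]         # mass on high scores
degraded_seen = [4.0, 2.0, 0.5, 0.2, 0.1]  # model reacts to the degradation
degraded_ignored = list(clean)             # model ignores the degradation

print(perception_loss(clean, degraded_seen))
print(perception_loss(clean, degraded_ignored))
```

A model that shifts its distribution for the degraded image gets the lower loss, while one that answers identically for both inputs gains nothing from the KL term and pays only the entropy penalty.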

C. Overall Objective

The total loss function combines the standard GRPO objective (with uncertainty-reweighted advantages), a KL regularization against a reference policy, the Implicit Perception Loss (maximizing KL divergence between original and degraded inputs), and the entropy regularization terms.
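Written as a loss to minimize, the combination described above might take roughly the following shape. This is a schematic reading of the summary, not the paper's exact equation: the coefficients $\beta$, $\lambda$, and $\gamma$ are placeholder symbols, $H$ denotes entropy, and the direction of each KL term is an assumption.

$$
\mathcal{L}(\theta) = -\,\mathcal{J}_{\mathrm{GRPO}}(\theta; \tilde{A}) + \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) - \lambda\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot|I, q) \,\|\, \pi_\theta(\cdot|I_{deg}, q)\big) + \gamma \Big[ H\big(\pi_\theta(\cdot|I, q)\big) + H\big(\pi_\theta(\cdot|I_{deg}, q)\big) \Big]
$$

Here $\mathcal{J}_{\mathrm{GRPO}}(\theta; \tilde{A})$ is the GRPO objective evaluated with the uncertainty-reweighted advantages $\tilde{A}_k = w \cdot A_k$. The signs encode the roles of the terms: the reference-policy KL is penalized, the original-versus-degraded KL is rewarded, and the two entropies are penalized so that the model cannot inflate the perception term simply by becoming more random.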

3. Key Contributions

Unified Framework: Proposes Q-Hawkeye, the first RL-based IQA framework that simultaneously addresses predictive uncertainty and visual perceptual grounding.
Uncertainty-Aware Reweighting: Introduces a dynamic advantage reweighting strategy based on rollout score variance, which stabilizes training by down-weighting ambiguous samples.
Perception-Aware Optimization: Develops an Implicit Perception Loss using original-degraded image pairs to force the model to ground its quality judgments in visual evidence, rather than language priors.
Data-Efficient Generalization: Demonstrates that high performance can be achieved with single-dataset training (KonIQ) by optimizing the learning signal, rather than relying on massive multi-dataset training.

4. Experimental Results

The authors evaluated Q-Hawkeye on eight IQA benchmarks, including in-the-wild datasets (SPAQ, LIVE-Wild), synthetic distortions (KADID, CSIQ), and AI-generated images (AGIQA-3K).

Performance: Q-Hawkeye achieved state-of-the-art (SOTA) results, outperforming both traditional deep learning models (e.g., MUSIQ, ManIQA) and recent MLLM-based methods (e.g., Q-Align, Q-Insight, VisualQuality-R1).
- Average PLCC/SRCC: Achieved 80.0 / 76.2 across all datasets, surpassing the previous best (VisualQuality-R1 at 75.8/72.0).
Generalization: Notably, Q-Hawkeye was trained only on the KonIQ dataset yet outperformed methods trained on multiple datasets (e.g., DeQA-Score trained on 3-4 datasets). This highlights the effectiveness of the proposed reliability mechanisms.
Ablation Studies:
- Removing either the uncertainty or perception modules resulted in significant performance drops.
- The "Reverse" weighting (up-weighting high uncertainty) degraded performance, confirming the necessity of down-weighting unstable samples.
- The Perception-Aware module was shown to effectively shift the model's score distribution for degraded images, whereas models without it tended to assign similar scores to original and degraded images.
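For readers unfamiliar with the two metrics reported above, PLCC is the Pearson linear correlation between predicted scores and human mean opinion scores (MOS), and SRCC is the Spearman rank correlation, which only measures whether the ordering agrees. The sketch below computes both in plain Python; the function names and the toy data are made up for illustration, and it skips the nonlinear score mapping that some IQA evaluations apply before computing PLCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: predictions preserve the MOS ordering but not the exact values.
mos = [1.2, 2.5, 3.1, 4.0, 4.7]
pred = [1.0, 2.0, 3.5, 3.9, 5.0]

print(pearson(mos, pred))
print(spearman(mos, pred))
```

Because the toy predictions rank the images in the same order as the human scores, SRCC is perfect here even though the values themselves differ, which is why the two metrics are usually reported together.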

5. Significance

Q-Hawkeye shifts the focus of RL for IQA from plain reward maximization to reliable policy optimization. In doing so, it targets two issues common in MLLM-based scorers: unstable predictions and judgments that are not grounded in the image (a form of "hallucination").

Reliability: It accounts for the model's predictive uncertainty during training, so that unstable samples contribute less to each policy update and are less likely to corrupt the policy.
Grounding: It enforces a "visual-first" approach, ensuring that quality assessments are derived from actual image content rather than textual shortcuts.
Efficiency: It demonstrates that sophisticated RL strategies can yield better generalization than simply scaling up training data, offering a more efficient path for training robust IQA models.

The code and dataset are open-sourced, facilitating further research into reliable visual policy optimization.
