Imagine you are a food critic reviewing a complex dish.
The Old Way (Traditional IQA):
You take a bite and give the dish a single number, like "7 out of 10." It's quick, but it doesn't tell the chef why it's a 7. Is the sauce too salty? Is the steak undercooked? Is the presentation messy? You just gave a score, but you didn't explain the details.
The "Smart" Way (Current MLLMs):
You try to be more helpful. You say, "The steak is good, but the sauce is a bit salty, and the plating is messy." This is better! You are using natural language to describe the quality.
The Problem:
Even with this smart description, you are still vague. When you say "the sauce is salty," you aren't pointing to exactly which sauce, or where it sits on the plate. If the chef tries to fix it, they might redo the whole dish instead of just the sauce. In the world of images, current AI models can describe an image as "blurry" or "bright," but they often can't point to the exact blurry spot or the overexposed area. They lack precision.
The New Solution: "Grounding-IQA"
This paper introduces a new way for AI to judge image quality called Grounding-IQA. Think of it as upgrading the food critic from someone who just talks to someone who can point and touch.
The authors created a system that combines Image Quality Assessment (judging how good a picture is) with Grounding (pointing to specific objects by drawing a box around them).
They broke this down into two simple games:
1. The "Point-and-Tell" Game (GIQA-DES)
Instead of just saying, "The photo is blurry," the AI must say:
"The person's hands [points to hands] are blurry, but the mountain in the background [points to mountain] is sharp."
It forces the AI to not only describe the quality but also draw a digital box around the specific part of the image it's talking about.
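Concretely, a grounded description couples text with box coordinates. The paper's exact data schema isn't reproduced here, so the field names, placeholder syntax, and `[x1, y1, x2, y2]` box format below are illustrative assumptions:

```python
# Illustrative sketch of a GIQA-DES style record. Field names and the
# box format are assumptions, not the paper's exact schema; boxes are
# [x1, y1, x2, y2] pixel coordinates.
grounded_description = {
    "image": "photo_0001.jpg",
    "description": (
        "The person's hands <box1> are blurry, "
        "but the mountain in the background <box2> is sharp."
    ),
    "boxes": {
        "box1": [412, 530, 498, 610],  # region covering the hands
        "box2": [0, 40, 1024, 380],    # region covering the mountain
    },
}

# Sanity check: every box named in the text has coordinates attached.
for name in grounded_description["boxes"]:
    assert f"<{name}>" in grounded_description["description"]
```

The key point is that every quality claim in the text is tied to a specific region, so a downstream tool knows exactly which pixels the claim is about.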
2. The "Spot the Issue" Game (GIQA-VQA)
This is like a quiz where the AI has to answer questions about specific parts of the image.
- User: "Is the horse [points to horse] blurry?"
- AI: "Yes."
- User: "What is overexposed in this picture?"
- AI: "The window [points to window]."
Here, the AI has to understand the question, find the specific object, and give a precise answer, often pointing back to the location.
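A question-answer pair in this setting might be stored like the following. Again, the field names and the region syntax inside the question are assumptions for illustration, not the paper's exact format:

```python
# Illustrative GIQA-VQA examples (format is an assumption). A question
# may reference a region directly, and an answer may ground itself with
# a box of its own.
vqa_pairs = [
    {
        # The question points at a region (the horse).
        "question": "Is the horse <region>[120, 200, 400, 560]</region> blurry?",
        "answer": "Yes.",
    },
    {
        # The answer points back at a region (the window).
        "question": "What is overexposed in this picture?",
        "answer": "The window.",
        "answer_box": [610, 80, 790, 300],
    },
]
```

Note the two directions of grounding: in the first pair the location comes in with the question, and in the second the model must produce the location itself.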
How Did They Teach the AI? (The "Robot Chef" Pipeline)
You can't just ask an AI to do this perfectly right away. It needs training data. But labeling 160,000 images with text and drawing boxes around every single object is incredibly hard and expensive for humans.
So, the authors built an Automated Annotation Pipeline. Imagine a super-efficient robot chef:
- Reads the Menu: It takes existing descriptions of images (e.g., "The sky is blue, but the car is blurry").
- Identifies Ingredients: It uses a smart tool to find the "car" and the "sky" in the photo.
- Checks the Quality: It asks, "Is this specific car blurry?" If the answer is yes, it keeps it. If not, it ignores it.
- Draws the Boxes: It automatically draws a box around the blurry car and attaches the text "blurry" to that box.
- Serves the Data: It creates a massive dataset called GIQA-160K with 160,000 examples of these "point-and-tell" lessons.
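The robot-chef steps above can be sketched roughly like this. Here `detect` and `has_issue` are hypothetical stand-ins for the real tools in the pipeline (something like an open-vocabulary detector and a quality checker):

```python
# Minimal sketch of the automated annotation pipeline, under assumed
# interfaces (not the paper's actual code):
#   detect(image, noun)           -> [x1, y1, x2, y2] box, or None
#   has_issue(image, box, issue)  -> bool (does this region show the issue?)

def annotate(image, claims, detect, has_issue):
    """Turn plain quality claims into grounded annotations."""
    annotations = []
    for noun, issue in claims:               # e.g. ("car", "blurry")
        box = detect(image, noun)            # step 2: locate the object
        if box is None:
            continue                         # object not found: skip
        if has_issue(image, box, issue):     # step 3: verify the claim
            annotations.append(              # step 4: attach text to box
                {"object": noun, "issue": issue, "box": box}
            )
    return annotations

# Toy run with stub tools, mirroring the "blue sky, blurry car" example:
stub_boxes = {"car": [50, 60, 200, 180], "sky": [0, 0, 640, 100]}
result = annotate(
    image=None,
    claims=[("car", "blurry"), ("sky", "overexposed")],
    detect=lambda img, noun: stub_boxes.get(noun),
    has_issue=lambda img, box, issue: issue == "blurry",  # only the car passes
)
# Only the verified claim survives:
# result == [{"object": "car", "issue": "blurry", "box": [50, 60, 200, 180]}]
```

The filtering step (`has_issue`) is what keeps the dataset honest: a claim only becomes a training example if the quality check confirms it for that specific region.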
They also built GIQA-Bench, a final exam for the AI: 100 tricky images where human experts check whether the AI correctly pointed out the blurry parts or answered the questions.
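How do you check whether a predicted box "correctly points" at a region? One standard way to score localization is intersection-over-union (IoU); this is a common metric sketch, not necessarily GIQA-Bench's exact scoring rule:

```python
# Intersection-over-union for two boxes in [x1, y1, x2, y2] form.
# A common localization metric; shown here as background, not as the
# benchmark's official scoring code.

def iou(a, b):
    """Return overlap / union area of boxes a and b (0.0 to 1.0)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [0, 0, 10, 10]))    # identical boxes -> 1.0
print(iou([0, 0, 10, 10], [20, 20, 30, 30]))  # disjoint boxes  -> 0.0
```

A prediction is then typically counted as correct when its IoU with the human-drawn box clears some threshold (0.5 is a common choice in detection benchmarks).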
Why Does This Matter?
Think of it like the difference between a general doctor and a surgeon.
- The general doctor (old AI) says, "You have a stomach ache."
- The surgeon (Grounding-IQA) says, "You have inflammation specifically in the lower right quadrant of your abdomen."
By teaching AI to point exactly where the problem is, this new method allows for:
- Better Editing: If you want to fix a photo, the AI knows exactly which part to sharpen or brighten.
- Better Safety: In self-driving cars, the AI can say, "The pedestrian on the left is blurry and hard to see," rather than just "It's hard to see."
- More Trust: We trust the AI more when it can show us why it thinks an image is bad, rather than just giving a vague opinion.
In short, Grounding-IQA teaches AI to stop guessing and start pointing, making image quality assessment much more detailed, accurate, and useful for real-world tasks.