ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking

The Big Problem: The "Rounded Number" Habit

Imagine you ask a very smart, well-read robot (a Vision-Language Model) to rate the quality of a photograph on a scale of 1 to 5.

You show it two photos:

Photo A: A slightly blurry sunset.
Photo B: A slightly more blurry sunset.

You expect the robot to say, "Photo A is a 4.2, and Photo B is a 3.8."

But instead, the robot gets lazy with its math. It looks at both and says, "They both look okay, so I'll give them both a 4.0."

This is called "Discrete Collapse." The robot is so used to speaking in whole words (like "good," "bad," "okay") that it struggles to speak in precise numbers. It squashes all the subtle differences into just a few "buckets" (like 3.0, 4.0, 5.0), making it impossible to tell which photo is actually better.

The Solution: ME-IQA (The "Memory-Enhanced" Judge)

The authors created a system called ME-IQA to fix this. Think of it not as a new robot, but as a smart assistant that sits next to the robot and helps it make better decisions while it's working.

Here is how ME-IQA works, step-by-step:

1. The "Photo Album" (The Memory Bank)

Instead of judging a photo in a vacuum, ME-IQA opens a digital photo album (Memory Bank) filled with thousands of other images the robot has seen before, along with their "correct" scores.

The Anchor Album: This part has famous, standard photos (like a perfect apple or a blurry car) that everyone agrees on. It keeps the robot grounded.
The "Hard Case" Album: This part is dynamic. It fills up with tricky photos the robot recently struggled with. If the robot gets confused by a specific type of distortion, ME-IQA saves that example so the robot can learn from it next time.

2. The "Context Clue" (Retrieval)

When the robot sees a new photo, ME-IQA doesn't just look at the picture; it also looks at the reasoning the robot writes down about it.

Analogy: If the robot thinks, "This photo is blurry because of motion," ME-IQA quickly flips through the album to find other photos that are "blurry because of motion."
It pulls out a small group of similar neighbors to show the robot: "Hey, look at these photos. They are similar to yours. How do they compare?"

3. The "Taste Test" (Re-Ranking)

Instead of asking the robot for a number immediately, ME-IQA asks it a different question: "Which of these two photos looks better?"

The robot is much better at comparing two things side-by-side (like a judge in a cooking contest) than it is at guessing a number out of thin air.
ME-IQA gathers these "A vs. B" opinions from the neighbors in the album.

4. The "Final Verdict" (Fusion)

Now, ME-IQA takes the robot's original guess (which might be a boring "4.0") and mixes it with the "A vs. B" opinions from the album.

It uses a mathematical formula (Thurstone's model) to blend them.
Result: Instead of a flat "4.0," the robot might now say, "Based on the similar photos, this is actually a 4.15."
This creates a smoother, more sensitive score that can tell the difference between a 4.1 and a 4.2.

5. The "Reflection" (Learning on the Fly)

If the robot's new score is very different from its old guess, ME-IQA triggers a "Reflection."

Analogy: It's like a teacher saying, "Wait, you changed your mind? Let's write down why you changed your mind so you remember this next time."
This new insight gets added to the "Hard Case" album, making the system smarter for the very next photo it sees.

Why is this a big deal?

No Retraining: You don't have to teach the robot a new language. You just plug this "assistant" in, and it works immediately.
It's Fair: It stops the robot from giving everyone the same score. It makes the scores spread out naturally, just like human judges do.
It's Fast: It doesn't need to re-read the whole internet; it just grabs a few relevant examples from its memory to make a quick, smart decision.

In a Nutshell

ME-IQA is like giving a robot a personal librarian and a panel of peer reviewers. When the robot is about to give a lazy, rounded-off score, the librarian says, "Hold on, let's compare this to 32 similar photos we've seen before." The robot then compares them, realizes the subtle difference, and gives a much more precise, human-like rating.

1. Problem Statement

The paper addresses a critical limitation in using Reasoning-Induced Vision-Language Models (VLMs) for Image Quality Assessment (IQA). While VLMs that generate step-by-step reasoning before outputting a score show improved generalization, they suffer from "discrete collapse."

The Phenomenon: Instead of producing a continuous distribution of scores that reflects subtle perceptual differences, these models tend to output coarse, identical scalar values (e.g., clustering heavily around 3.0, 4.0, or 5.0).
The Cause: VLMs are pre-trained to generate discrete text tokens, not continuous perceptual quantities. When forced to predict numeric scores, they gravitate toward textually salient numbers, quantizing perception and losing sensitivity to fine-grained distortions.
Limitations of Existing Solutions:
- Token Probability Averaging: Lacks explicit comparative context.
- Pure Pairwise Comparison: Perceptually grounded but scales poorly (computationally expensive) for large datasets and is impractical for online testing.
- Static Anchors: Compare a query to a fixed set of references, which fails to adapt to distribution shifts or novel distortions.
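The effect of discrete collapse on ranking can be illustrated with a small sketch. The scores below are made-up values, not data from the paper, and snapping to the nearest whole number is only a stand-in for the model's tendency to emit textually salient numbers.

```python
from itertools import combinations

def tied_pair_fraction(scores):
    """Fraction of image pairs that receive exactly the same score."""
    pairs = list(combinations(scores, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# Hypothetical human mean opinion scores (MOS) for six images.
mos = [3.6, 3.8, 4.1, 4.2, 4.4, 2.9]

# A collapsed predictor (illustrative assumption): every score snaps
# to the nearest whole number.
collapsed = [float(round(s)) for s in mos]

print("distinct collapsed levels:", sorted(set(collapsed)))
print("tied pairs (MOS):", tied_pair_fraction(mos))
print("tied pairs (collapsed):", tied_pair_fraction(collapsed))
```

Every tied pair is a pair the model can no longer order, which is why collapse hurts rank-based metrics in particular.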

2. Methodology: ME-IQA

The authors propose ME-IQA, a plug-and-play, test-time framework that enhances existing reasoning-induced VLMs without retraining. It operates as a re-ranking module that mitigates discrete collapse through three core stages:

A. Hybrid Memory Bank Construction

ME-IQA maintains a dynamic memory bank ( $M$ ) composed of two parts to balance stability and adaptability:

Anchor Memory (AM): An offline, static bank constructed from labeled datasets (e.g., KONIQ-10K). It uses Ground Truth (GT)-stratified retrieval, dividing the quality scale [1, 5] into bins to ensure coverage across the entire spectrum. This provides a stable global scaffold.
Contrast Memory (CM): An online, dynamic bank that grows during inference. It stores "hard cases" or cases where the model's initial prediction was significantly corrected. This allows the system to adapt to distribution shifts and emerging artifacts.
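One way to picture the two-part memory is the sketch below. The class and function names (`Exemplar`, `HybridMemory`, `stratified_select`), the equal-width binning, and the bin count and per-bin quota are all assumptions made for illustration; the summary does not specify how the authors implement stratification.

```python
from dataclasses import dataclass, field

@dataclass
class Exemplar:
    image_id: str
    description: str
    score: float  # quality score on the [1, 5] scale

def stratified_select(labeled, num_bins=4, per_bin=1, lo=1.0, hi=5.0):
    """Pick up to `per_bin` exemplars from each equal-width bin of [lo, hi]."""
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for ex in labeled:
        # Clamp so that out-of-range scores and score == hi land in a valid bin.
        idx = min(max(int((ex.score - lo) / width), 0), num_bins - 1)
        bins[idx].append(ex)
    return [ex for b in bins for ex in b[:per_bin]]

@dataclass
class HybridMemory:
    anchor: list                                   # static, built offline
    contrast: list = field(default_factory=list)   # grows during inference

    def consolidate(self, ex):
        """Store a corrected hard case for future retrieval."""
        self.contrast.append(ex)

    def everything(self):
        return self.anchor + self.contrast

# Hypothetical labeled pool.
labeled = [
    Exemplar("img_a", "heavy compression blocks", 1.2),
    Exemplar("img_b", "severe defocus blur", 1.5),
    Exemplar("img_c", "visible sensor noise", 2.7),
    Exemplar("img_d", "slight softness", 3.3),
    Exemplar("img_e", "sharp, well exposed", 4.9),
]
memory = HybridMemory(anchor=stratified_select(labeled))
memory.consolidate(Exemplar("img_x", "motion blur at night", 2.1))
```

Keeping the anchor list fixed while only the contrast list grows mirrors the stability-versus-adaptability split described above.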

B. Reasoning-Aware Retrieval

For a new query image ( $x_i$ ):

The VLM generates an initial reasoning trace ( $\tilde{r}_i$ ) and a raw score ( $\tilde{s}_i$ ).
The reasoning trace is summarized into a concise, self-contained quality description ( $r_i$ ) focusing on semantic content and distortions.
This description is embedded and used to retrieve a neighborhood ( $N$ ) of $K$ similar exemplars from the Hybrid Memory (split between AM and CM) using cosine similarity.
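The retrieval step can be sketched as a cosine-similarity top-K search over both banks. The embedding vectors here are toy three-dimensional lists, and the `anchor_share` parameter that splits $K$ between AM and CM is a hypothetical knob; the summary only says the neighborhood is split between the two.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, bank, k):
    """bank: list of (item_id, embedding). Returns ids of the k most similar items."""
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

def retrieve(query_vec, anchor_bank, contrast_bank, k, anchor_share=0.5):
    """Take part of the neighborhood from each bank; anchors fill any shortfall."""
    k_contrast = min(len(contrast_bank), k - int(k * anchor_share))
    k_anchor = k - k_contrast
    return top_k(query_vec, anchor_bank, k_anchor) + top_k(query_vec, contrast_bank, k_contrast)

# Toy embeddings of quality descriptions (illustrative values only).
anchor_bank = [
    ("a_sharp", [1.0, 0.0, 0.0]),
    ("a_motion_blur", [0.0, 1.0, 0.1]),
    ("a_noise", [0.0, 0.0, 1.0]),
]
contrast_bank = [("c_motion_blur_night", [0.1, 0.9, 0.2])]
query = [0.0, 1.0, 0.0]  # stands in for the embedded description r_i

neighbors = retrieve(query, anchor_bank, contrast_bank, k=2)
print(neighbors)
```

Early in inference the Contrast Memory is empty, so in this sketch the whole neighborhood falls back to anchors.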

C. Probabilistic Re-Ranking (Thurstone's Case V)

Instead of asking the VLM for a direct score, the framework reframes the VLM as a probabilistic comparator:

Pairwise Comparison: The VLM compares the query image against each retrieved neighbor $j$ in $N$ and outputs the probability $y_{ij}$ that the query is better than the neighbor.
Fusion via Optimization: The system fuses these pairwise ordinal preferences with the initial mapped score ( $s_i$ ) by minimizing a loss function based on Thurstone's Case V model.
- The objective minimizes the Binary Cross-Entropy (BCE) between the predicted preference and the VLM's output, regularized by a quadratic tether to the initial score ( $s_i$ ).
- A closed-form approximation is used for efficiency, yielding a refined score $s^*_i$ .
Gated Reflection: If the refined score $s^*_i$ deviates significantly from the initial score $s_i$ (beyond a threshold $\epsilon$ ), the system triggers a "reflection" step. The VLM revises its quality description, and the corrected case is consolidated into the Contrast Memory (CM) for future use.
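The fusion and gating steps can be sketched as follows, under stated assumptions. Thurstone's Case V gives the preference probability $\Phi\big((s - s_j)/(\sqrt{2}\,\sigma)\big)$ for a query score $s$ against a neighbor score $s_j$; the sketch minimizes the BCE against the VLM outputs $y_{ij}$ plus a tether $\tfrac{\lambda}{2}(s - s_i)^2$. The values of $\sigma$, $\lambda$, and $\epsilon$ are hypothetical, and a simple ternary search stands in for the closed-form approximation, whose exact form this summary does not give.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def objective(s, s_init, neighbors, sigma=1.0, lam=1.0):
    """BCE over pairwise preferences plus a quadratic tether to s_init.

    neighbors: list of (neighbor_score, y), y = P(query better than neighbor).
    """
    tiny = 1e-9  # guards the logarithms against p == 0 or p == 1
    loss = 0.5 * lam * (s - s_init) ** 2
    for s_j, y in neighbors:
        p = normal_cdf((s - s_j) / (math.sqrt(2.0) * sigma))
        p = min(max(p, tiny), 1.0 - tiny)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

def refine_score(s_init, neighbors, lo=1.0, hi=5.0, iters=100, **kw):
    """Ternary search on [lo, hi]; valid because the objective is convex in s."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, s_init, neighbors, **kw) < objective(m2, s_init, neighbors, **kw):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def needs_reflection(s_init, s_refined, epsilon=0.5):
    """Gate: trigger reflection only when the refinement moved the score a lot."""
    return abs(s_refined - s_init) > epsilon

# Hypothetical query with a collapsed initial score of 4.0 and three neighbors.
s_init = 4.0
neighbors = [(4.3, 0.2), (3.9, 0.8), (4.0, 0.6)]
s_refined = refine_score(s_init, neighbors)
print(round(s_refined, 3), needs_reflection(s_init, s_refined))
```

The tether keeps the refined score close to the initial one when the pairwise evidence is weak, while consistent preferences pull it off the collapsed level.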

3. Key Contributions

Mitigation of Discrete Collapse: ME-IQA successfully transforms the coarse, discrete output of reasoning VLMs into dense, distortion-sensitive predictions that align closely with human Mean Opinion Scores (MOS).
Test-Time Memory Enhancement: It introduces a novel hybrid memory mechanism (Static Anchors + Dynamic Contrast) that operates purely at inference time, requiring no architectural changes or retraining of the base VLM.
Reasoning-Grounded Retrieval: Unlike previous methods that retrieve based on raw image features, ME-IQA retrieves neighbors based on reasoning summaries, ensuring that the retrieved context is semantically and perceptually aligned with the specific quality issues of the query.
Plug-and-Play Efficiency: The framework is compatible with any reasoning-induced VLM (open-source or proprietary) and offers a better efficiency-effectiveness trade-off compared to test-time scaling methods like majority voting.

4. Experimental Results

The authors evaluated ME-IQA across seven benchmarks (SPAQ, AGIQA, LIVEW, KADID, PIPAL, TID2013, CSIQ) and five VLM backbones (Q-Insight, VisualQuality-R1, EvoQuality, Doubao-Seed-1.6, and GPT-5).

Performance Gains: ME-IQA consistently outperformed strong reasoning baselines, non-reasoning IQA methods (e.g., Q-Align, MUSIQ), and test-time scaling alternatives (e.g., Majority Voting, Mean Aggregation).
- Example: On the KADID dataset, ME-IQA improved the PLCC of VisualQuality-R1 from 0.709 to 0.741.
- Weighted Average (WAVG): Achieved state-of-the-art results across all backbones, with significant gains in SRCC (Spearman Rank Correlation), indicating superior ordinal ranking.
Discrete Collapse Analysis:
- Histograms showed that while baselines produced sharp spikes at discrete levels, ME-IQA redistributed probability mass, creating smooth, MOS-like distributions.
- Jensen-Shannon Divergence from the human score distribution decreased, while Entropy and Effective Bins increased substantially (e.g., effective bins rose from roughly 10 to more than 70), indicating that quantization artifacts were largely removed.
Efficiency: ME-IQA is significantly faster than majority voting (a 2.4x speedup) and comparable in cost to pairwise comparison baselines (Compare2Score), while achieving higher accuracy.
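The distribution metrics named in the discrete collapse analysis can be computed from score histograms, as in this sketch. The bin count is arbitrary, and defining effective bins as the exponential of the Shannon entropy is an assumption; the summary does not state the paper's exact definition.

```python
import math

def histogram(scores, num_bins=100, lo=1.0, hi=5.0):
    """Normalized histogram of scores over [lo, hi]."""
    counts = [0] * num_bins
    for s in scores:
        idx = min(max(int((s - lo) / (hi - lo) * num_bins), 0), num_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def entropy(p):
    """Shannon entropy in nats; empty bins contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def effective_bins(p):
    """Assumed definition: exp(entropy), the number of equally used bins."""
    return math.exp(entropy(p))

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative scores: a collapsed predictor versus an evenly spread one.
collapsed = [3.0] * 5 + [4.0] * 5
spread = [1.0 + 0.04 * i + 0.02 for i in range(100)]

print(effective_bins(histogram(collapsed)))
print(effective_bins(histogram(spread)))
```

A predictor that uses only two score levels has two effective bins regardless of how fine the histogram is, which is the signature of collapse these metrics are meant to expose.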

5. Significance

ME-IQA offers a different way of applying VLMs to regression-style tasks such as IQA. By decoupling the "reasoning" capability from the "scoring" mechanism and injecting local ordinal constraints via memory retrieval, the method addresses the fundamental mismatch between discrete token generation and continuous perceptual assessment.

Its plug-and-play nature makes it highly practical for industry deployment, allowing existing VLMs to be upgraded to fine-grained quality assessors without the cost of retraining. Furthermore, the use of online memory consolidation suggests a path toward self-improving systems that adapt to new types of image distortions in real-time.