Imagine you are a food critic trying to rate a dish, but you've never seen the recipe, and the chef is nowhere to be found. You have to judge the quality of the meal just by looking at it and tasting a few bites. This is exactly what Blind Image Quality Assessment (BIQA) does for computers: it tries to judge how "good" an image looks without having the original, perfect version to compare it against.
For a long time, computers were like critics who only looked at the main plate. They missed the garnish, the lighting, or the fact that the plate was cracked. Newer methods tried to look at more things, but they often treated those extra clues as separate, unrelated tasks, leading to a confused judgment.
The paper you shared introduces DEFNet, a new "super-critic" that uses a clever three-part strategy to give a much more reliable rating. Here is how it works, broken down into simple concepts:
1. The "Team of Experts" Approach (Multitask Learning)
Imagine you are rating a photo, but instead of just one person doing it, you have a team of three specialists working together:
- The Quality Judge: The main expert who says, "This looks good/bad."
- The Scene Detective: An expert who identifies what is in the picture (e.g., "This is a sunset," or "This is a busy city").
- The Damage Inspector: An expert who spots what went wrong (e.g., "It's too blurry," or "The colors are washed out").
Previous methods asked these experts to work in separate rooms and just shout their answers to the main judge. DEFNet puts them in the same room. They talk to each other. The Scene Detective says, "Hey, this is a night scene, so it's supposed to be dark," and the Damage Inspector says, "But the noise here looks unnatural." By sharing this information, the main judge makes a much smarter decision.
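To make the "same room" idea concrete, here is a minimal sketch of a shared-backbone multitask model in plain numpy. All names, shapes, and the random "weights" are hypothetical toys, not DEFNet's actual architecture; the one structural point it illustrates is that the quality head reads both the shared features and the other two experts' outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned weights (hypothetical shapes, not DEFNet's real ones).
W_backbone = rng.normal(size=(64, 16))        # shared feature extractor
W_scene = rng.normal(size=(16, 5))            # scene head: 5 toy scene classes
W_distort = rng.normal(size=(16, 4))          # distortion head: 4 toy distortion types
W_quality = rng.normal(size=(16 + 5 + 4, 1))  # quality head sees the others' outputs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(image_vec):
    shared = np.tanh(image_vec @ W_backbone)   # features all three experts share
    scene_probs = softmax(shared @ W_scene)    # "Scene Detective"
    distort_probs = softmax(shared @ W_distort)  # "Damage Inspector"
    # "Quality Judge" reads the shared features AND the other experts' opinions,
    # instead of each expert shouting from a separate room.
    joint = np.concatenate([shared, scene_probs, distort_probs])
    quality = float(joint @ W_quality)
    return quality, scene_probs, distort_probs

q, scene, distort = predict(rng.normal(size=64))
```

In a real network these heads would be trained jointly, so gradients from the scene and distortion tasks also shape the shared features the quality judge relies on.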
2. The "Zoom-In and Zoom-Out" Strategy (Trustworthy Fusion)
Even with a team, you can miss details if you only look at the whole picture, or you can miss the big picture if you only look at tiny crumbs. DEFNet uses a two-level zoom strategy:
- Cross-Sub-Region (The Puzzle Piece Method): Imagine cutting the photo into four puzzle pieces. DEFNet looks at each piece individually to find local flaws (like a smudge on a specific face) and then stitches those observations together. It ensures no small detail is ignored.
- Local-Global (The Telescope and Microscope): It combines a "microscope" view (looking at fine details like texture) with a "telescope" view (looking at the overall composition and context). It balances the tiny details with the big picture so the computer doesn't get obsessed with a single pixel or ignore the whole scene.
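The two zoom levels can be sketched in a few lines. This is an illustrative toy, not the paper's fusion module: the quadrant split, the mean-stitching, and the `alpha` blend are all assumptions chosen to show the shape of the idea.

```python
import numpy as np

def subregion_scores(img, score_fn):
    """Score each quadrant separately (the 'puzzle piece' view)."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    quads = [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]
    return np.array([score_fn(q) for q in quads])

def fuse(img, local_fn, global_fn, alpha=0.5):
    # Cross-sub-region: stitch the four local judgments together.
    local = subregion_scores(img, local_fn).mean()
    # Local-global: blend fine detail with the whole-image view.
    return alpha * local + (1 - alpha) * global_fn(img)

# Toy scorers: higher-contrast patches score higher.
toy_local = lambda patch: float(patch.std())
toy_global = lambda im: float(im.std())

img = np.zeros((8, 8))
img[4:, 4:] = np.indices((4, 4)).sum(axis=0) % 2  # detail in one quadrant only
score = fuse(img, toy_local, toy_global)
```

Because the local pass scores each quadrant on its own, the detail hiding in one corner still contributes, even though the whole-image scorer mostly sees flat background.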
3. The "Confidence Meter" (Evidential Uncertainty)
This is the most creative part. Most AI models are like overconfident students: they give an answer even when they are guessing, and they don't tell you how sure they are.
DEFNet is different. It uses a concept called Evidential Learning. Think of it as the AI keeping a "Confidence Journal."
- When the AI sees a clear, perfect image, it says, "I am 100% sure this is a 5-star photo."
- When the image is weird or distorted in a way it hasn't seen before, it says, "I think this is a 3-star photo, BUT I'm only 60% sure because this looks strange."
It uses a special mathematical tool (Normal-Inverse Gamma distribution) to measure two types of doubt:
- Noise: "The image is just blurry." (Aleatoric uncertainty)
- Ignorance: "I've never seen a picture like this before." (Epistemic uncertainty)
By admitting when it's unsure, DEFNet avoids making wild guesses, making it much more reliable in real-world situations.
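The "Confidence Journal" has a standard mathematical form. Under a Normal-Inverse-Gamma output with parameters (gamma, nu, alpha, beta), the two kinds of doubt fall out in closed form; the formulas below are the usual ones from deep evidential regression, and the example parameter values are made up for illustration, not taken from DEFNet.

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Split a Normal-Inverse-Gamma prediction into its two doubts."""
    prediction = gamma                     # the predicted quality score
    aleatoric = beta / (alpha - 1)         # E[sigma^2]: noise baked into the image
    epistemic = beta / (nu * (alpha - 1))  # Var[mu]: the model's own ignorance
    return prediction, aleatoric, epistemic

# Confident case: lots of "virtual evidence" behind the guess (large nu, alpha).
p1, alea1, epi1 = nig_uncertainties(gamma=4.5, nu=50.0, alpha=20.0, beta=2.0)
# Unsure case: the same score guess, but barely any evidence behind it.
p2, alea2, epi2 = nig_uncertainties(gamma=4.5, nu=0.5, alpha=1.5, beta=2.0)
```

Both calls predict the same 4.5-star score, but the second returns a far larger epistemic term: that is the model writing "I've never seen a picture like this before" in its journal.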
Why Does This Matter?
In the real world, images are messy. They come from phone cameras, medical scanners, and security feeds, often with weird distortions.
- Old methods might confidently say a blurry medical scan is "perfect" because they didn't know how to measure their own doubt.
- DEFNet looks at the scene, checks the damage, zooms in and out, and then says, "This looks okay, but I'm not 100% sure because the lighting is weird."
The Bottom Line
The authors tested DEFNet on a wide range of images, from synthetically distorted pictures to real-world photos taken by people. The results showed that DEFNet judges quality better than almost any other current method. It's like upgrading from a critic who just guesses to a critic who has a team of experts, a zoom lens, and an honest confidence meter.
In short: DEFNet doesn't just look at the image; it understands the context, checks the details, and knows when to say, "I'm not sure," leading to smarter and safer image quality ratings.