Perceptual Quality Optimization of Image Super-Resolution

Imagine you have an old, blurry family photo that you want to print out in a huge size. When you try to enlarge it, the image usually gets pixelated and blocky. This is the problem Image Super-Resolution (SR) tries to solve: taking a small, blurry picture and magically filling in the missing details to make it look crisp and high-definition.

For a long time, computers were really good at making these pictures look "mathematically correct," but they often looked boring and fake to human eyes. They would smooth out textures (like skin or grass) until everything looked like plastic.

This paper introduces a new system called Efficient-PBAN that teaches computers to make pictures look good the way humans actually see them, not just the way math says they should.

Here is the breakdown using simple analogies:

1. The Problem: The "Perfectly Wrong" Artist

Imagine you hire an artist to redraw a blurry photo.

The Old Way (Distortion-Oriented): You tell the artist, "Make sure every pixel matches the original photo's average color exactly." The artist does this perfectly. The result is mathematically accurate, but the face looks like a smooth, wax mannequin. It's "correct" but lacks life.
The New Way (Perceptual-Oriented): You tell the artist, "Make it look real to a human eye, even if the pixels aren't mathematically perfect." The artist adds realistic wrinkles, hair strands, and texture.

The problem is that previous "realistic" methods were either too slow (like a slow-motion video editor) or unstable (sometimes adding weird, hallucinated details like a third eye).

2. The Solution: The "Human Eye" Coach (Efficient-PBAN)

The authors built a new tool called Efficient-PBAN. Think of this tool as a strict art critic who has seen thousands of photos and knows exactly what humans find beautiful.

The Training: They didn't just teach this critic with math. They built a massive library of photos created by different AI methods and asked 23 real humans to rate them. The critic (Efficient-PBAN) studied these human ratings to learn what "good" looks like.
The Magic Trick (Bi-Directional Attention): Usually, critics look at a photo in tiny, separate patches (like looking at a mosaic one tile at a time). This is slow and misses the big picture.
- The Analogy: Imagine looking at a painting. The old way is looking through a straw at one spot. The new way (Efficient-PBAN) is like having two pairs of eyes looking at the whole painting at once, comparing the blurry version with the sharp version simultaneously. It understands the relationship between the two instantly, making it fast and efficient.

3. How It Works in Practice

Once the "Human Eye Critic" is trained, it doesn't just sit there judging; it becomes a coach for the AI artist.

The AI artist tries to make a high-res image.
The Critic (Efficient-PBAN) looks at the result and says, "This looks a bit too smooth, add more texture here," or "This edge is too jagged, soften it."
The artist tries again, guided by the Critic's feedback.
They repeat this until the image looks perfect to a human.

This is called a "Closed-Loop." The AI isn't just guessing; it's constantly checking its work against human preferences.

4. The Results: The Best of Both Worlds

The paper tested this on two different AI artists.

Without the Critic: The images had clean outlines but looked a bit "plastic," with the fine details smoothed away.
With the Critic: The images had realistic textures (like the grain in wood or the fuzz on a peach) and looked much more natural.

The Trade-off:
There is a tiny catch. Sometimes, focusing too much on "looking real" makes the image slightly less "mathematically perfect" (a tiny drop in standard scores). However, the authors found a sweet spot: if they let the Critic guide the artist just enough, they get images that look incredibly real while still keeping the structure accurate.

Summary

Think of this paper as teaching a computer to stop being a calculator and start being an artist.

Old AI: "I calculated the average color of this pixel. Here is your result." (Boring, smooth, fake).
New AI (Efficient-PBAN): "I looked at what humans love, compared it to the original, and added the right amount of texture to make it pop." (Real, crisp, satisfying).

The authors even made their "Critic" and their "Human Rating Library" available for everyone to use, so other developers can build better, more realistic image enhancers.

1. Problem Statement

Single Image Super-Resolution (SR) aims to reconstruct High-Resolution (HR) images from Low-Resolution (LR) inputs. While deep learning methods (CNNs and Transformers) have significantly improved distortion-oriented metrics like PSNR and SSIM, they often fail to recover high-frequency details critical for human perception.

The Trade-off: Optimizing solely for signal fidelity leads to over-smoothed textures and unnatural appearances.
Limitations of Existing Solutions:
- Perceptual Losses (e.g., VGG-based, Adversarial): Often result in unstable textures or hallucinations.
- Diffusion Models: Achieve high realism but suffer from heavy computational costs and long inference times.
- Image Quality Assessment (IQA): Existing IQA metrics are typically trained on generic distortions (noise, blur) rather than SR-specific artifacts. Furthermore, current SR-specific metrics (e.g., PFIQA, PBAN) are patch-based, requiring extensive sampling and calculation, which makes them inefficient and unsuitable for end-to-end differentiable training as loss functions.

2. Methodology

The authors propose Efficient-PBAN (Efficient Perceptual Bi-directional Attention Network), a framework designed to optimize SR directly toward human-preferred quality. The approach consists of two main stages:

A. Dataset Construction

To address the lack of SR-specific training data, the authors constructed a new SR Quality Database:

Content: 720 SR images (~2K resolution) generated from 19 diverse HR references.
Coverage: Includes 19 state-of-the-art SR methods (GANs, Diffusion, Transformers, Flow-based, CNNs) across upscaling factors $\times2, \times3, \times4, \times8$.
Annotations: Subjective quality scores collected via a single-stimulus experiment following ITU-R BT.500-14 standards with 23 participants.
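
As a rough illustration of how per-image subjective scores are commonly aggregated, the sketch below averages the ratings an image received into a mean opinion score (MOS) with a normal-approximation 95% confidence interval. The 1-to-5 rating scale, the sample ratings, and the function name `mean_opinion_score` are assumptions made for illustration. Observer screening as described in ITU-R BT.500 is omitted, and the authors' exact processing is not specified in this summary.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (mos, ci95) for the list of ratings given to one image."""
    n = len(ratings)
    mos = statistics.fmean(ratings)
    if n < 2:
        return mos, 0.0  # no spread can be estimated from a single rating
    std = statistics.stdev(ratings)   # sample standard deviation
    ci95 = 1.96 * std / math.sqrt(n)  # normal-approximation half-width
    return mos, ci95

# Hypothetical ratings on an assumed 1-5 scale (not data from the paper).
ratings_by_image = {
    "img_a": [4, 5, 4, 4, 3],
    "img_b": [2, 1, 2, 3, 2],
}

scores = {name: mean_opinion_score(r) for name, r in ratings_by_image.items()}
for name, (mos, ci) in scores.items():
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```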

B. Efficient-PBAN Architecture

Unlike patch-based models, Efficient-PBAN operates at the image level for efficiency.

Feature Extraction: Uses a shared ResNet stem and Layer1 for both the SR prediction ($x_{SR}$) and the HR reference ($x_{HR}$). Beyond Layer1, the branches separate to capture distinct statistics.
PBA+ Block (Bi-directional Attention):
- Computes Query ($Q$), Key ($K$), and Value ($V$) representations for both inputs.
- Applies attention mechanisms along both the Height ($H$) and Width ($W$) dimensions to capture spatial dependencies.
- Fusion: Combines $HR \to SR$ and $SR \to HR$ attention outputs using a SubEC module (Sub-pixel and Sub-channel extraction) to enhance features with learnable upsampling and shuffling.
Quality Prediction Module: A prediction head with global pooling and fully connected layers regresses a perceptual score ($\hat{q}$).
Training Strategy:
- Stage 1: Pre-train Efficient-PBAN as a quality predictor using an L2 regression loss against human opinion scores.
- Stage 2: Integrate the trained network into SR optimization as a differentiable perceptual loss.
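
The bi-directional idea in the PBA+ block can be illustrated with a heavily simplified sketch: two sets of feature tokens attend to each other in both directions using scaled dot-product attention. This is not the authors' architecture. The learned Q/K/V projections are replaced by identity mappings, the separate Height/Width attention passes and the SubEC fusion are omitted, and the names `cross_attention` and `bidirectional_attention` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_attention(sr_tokens, hr_tokens):
    """Assumption: identity projections stand in for learned Q/K/V layers."""
    sr_attends_hr = cross_attention(sr_tokens, hr_tokens, hr_tokens)
    hr_attends_sr = cross_attention(hr_tokens, sr_tokens, sr_tokens)
    return sr_attends_hr, hr_attends_sr

# Toy feature tokens (hypothetical values, 2 channels each).
sr = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hr = [[1.0, 0.0], [0.0, 1.0]]
sr_ctx, hr_ctx = bidirectional_attention(sr, hr)
print(len(sr_ctx), len(hr_ctx))
```

Each output token is a convex combination of the other image's tokens, so the two outputs describe how each image's features relate to the other's. A real implementation would fuse them before the quality prediction head.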

C. Perceptual Optimization Loss

The SR model is trained using a combined loss function to balance fidelity and perception:
$\mathcal{L} = \alpha \times \frac{\mathcal{L}_D}{\mathcal{L}_D + \mathcal{L}_P} + \beta \times \frac{\mathcal{L}_P}{\mathcal{L}_D + \mathcal{L}_P}$
Where:

$\mathcal{L}_D$ : Distortion-oriented loss (e.g., an SSIM-based loss).
$\mathcal{L}_P$ : Perceptual loss derived from the Efficient-PBAN score.
$\alpha, \beta$ : Weighting ratios.

This formulation creates a closed loop in which the SR network is guided to maximize perceptual scores while maintaining structural integrity.
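
To make the arithmetic of the combined loss concrete, the sketch below evaluates the formula exactly as written above for scalar loss values. The function name `combined_loss`, the example numbers, and the zero-denominator convention are assumptions for illustration. In actual training, $\mathcal{L}_D$ and $\mathcal{L}_P$ would be differentiable tensors rather than Python floats.

```python
def combined_loss(loss_d, loss_p, alpha, beta):
    """Evaluate alpha * L_D / (L_D + L_P) + beta * L_P / (L_D + L_P)."""
    total = loss_d + loss_p
    if total == 0:
        return 0.0  # convention chosen for this sketch: both terms already zero
    return alpha * loss_d / total + beta * loss_p / total

# Hypothetical values: L_D = 1.0, L_P = 3.0, alpha = 2.0, beta = 4.0
# -> 2.0 * (1/4) + 4.0 * (3/4) = 0.5 + 3.0
print(combined_loss(1.0, 3.0, alpha=2.0, beta=4.0))  # 3.5
```

Because the two fractions sum to 1, the value for non-negative losses always lies between $\alpha$ and $\beta$, and it equals $\alpha$ whenever $\alpha = \beta$. How gradients are propagated through the normalizing denominator is not stated in this summary.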

3. Key Contributions

New SR Quality Database: A comprehensive dataset covering a wide range of modern SR methods with human opinion scores, specifically designed to train SR-customized perceptual metrics.
Efficient-PBAN: A lightweight, bi-directional attention network that predicts perceptual quality with strong correlation to human judgments. Crucially, it avoids patch-based sampling, enabling efficient image-level perception.
Differentiable Perceptual Optimization: The integration of the learned metric into the SR training loop as a differentiable loss, achieving superior perceptual quality without sacrificing computational feasibility.
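
Agreement between a learned quality metric and human judgments is commonly summarized with the Pearson and Spearman correlation coefficients. Whether the authors report exactly these statistics is an assumption here, and the scores below are made-up values for illustration. The sketch uses only the standard library.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical predicted scores and human opinion scores (not from the paper).
predicted = [0.62, 0.35, 0.80, 0.41, 0.55]
human = [3.9, 2.1, 4.6, 2.8, 3.0]
print(f"PLCC = {pearson(predicted, human):.3f}, SRCC = {spearman(predicted, human):.3f}")
```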

4. Experimental Results

The method was evaluated on the B100 and DIV2K datasets using two baseline SR models: CAMixerSR and LINF.

Quantitative Performance:
- Perceptual Metrics: Models optimized with Efficient-PBAN showed significant improvements in PFIQA, LPIPS, and the Efficient-PBAN score compared to Original and SSIM-only baselines.
- Distortion Metrics: While PSNR/SSIM saw a slight decrease (the expected trade-off), the joint optimization (SSIM + Efficient-PBAN) achieved the best balance, maintaining competitive fidelity while boosting visual quality.
- Ablation Study: Varying the ratio of distortion ($\alpha$) to perceptual ($\beta$) loss demonstrated that increasing $\beta$ significantly improves naturalness (SN) and reduces artifacts, though excessive $\beta$ can degrade structural fidelity.
Qualitative Performance: Visualizations confirmed that Efficient-PBAN-guided models recover finer textures and sharper edges, avoiding the over-smoothing typical of SSIM-optimized models.
Subjective Testing: Mean Opinion Score (MOS) tests confirmed the ranking: Joint Optimization > Efficient-PBAN only > SSIM only > Original, validating that the combined approach yields the most preferred visual results.

5. Significance

This work bridges the gap between Image Quality Assessment (IQA) and Image Super-Resolution (SR) optimization. By moving away from patch-based, non-differentiable metrics to an efficient, image-level, differentiable perceptual loss, the authors provide a practical paradigm for training SR models that align directly with human visual preferences. The proposed framework offers a viable alternative to heavy diffusion-based models, delivering high perceptual quality with lower computational overhead, making it suitable for real-world applications.