How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Imagine you have a beloved, old, scratched-up photograph of your grandmother. You want to fix it, but you don't just want to clean the scratches; you want to bring the colors back to life and make her smile look as warm as you remember.

For a long time, computers were like photocopiers. If the photo was blurry, they just made it less blurry. If it was missing a piece, they left a blank spot. They were safe, but boring.

Then generative AI arrived. These models are like magical artists. They don't just clean the photo; they imagine what the missing parts should look like. They can paint new fur on a dog, reconstruct a face, or add texture to a blurry building. It's amazing, but it's also a bit risky. Sometimes the artist gets too creative and paints a dog with six legs or changes your grandmother's nose into a potato.

This paper is a massive report card for these "magical artists." The researchers asked: How good are they really? Where do they fail? And how do we measure their success without just guessing?

Here is the breakdown of their findings using some everyday analogies:

1. The New Problem: "Too Much Imagination"

In the past, the main problem with AI was under-generation (the AI was too lazy and didn't add enough detail).

The Old Struggle: Trying to fill in a blank canvas with a tiny, sad paintbrush.
The New Struggle: The AI is now like an over-enthusiastic toddler with a box of crayons. It wants to add everything. It might add extra whiskers to a cat that didn't have them, or turn a simple fence into a complex, chaotic maze of railings.
The Finding: The biggest challenge now isn't making the image look real; it's making sure the AI doesn't lie about what's in the picture. It's the difference between a skilled restorer and a hallucinating dreamer.

2. The "Hard Mode" Test

The researchers didn't just test the AI on easy pictures. They built a gym for AI with specific obstacles:

The "Crowd" Challenge: Imagine trying to fix a photo of a stadium full of people. The AI often gets confused, turning faces into blobs or giving people three eyes.
The "Text" Challenge: If the photo has a sign that says "STOP," the AI might change it to "STO" or "STUP." It struggles to keep letters perfect.
The "Hand" Challenge: Hands are notoriously hard for AI. The model might give a person six fingers or twist a hand into a pretzel shape.
The "Old Film" Challenge: Fixing a movie reel from the 1920s is like trying to rebuild a castle from a pile of dust. The AI often fails because there is simply too much missing information.

3. The Different Types of "Artists"

The paper tested 20 different AI models, which fall into four main groups:

The Diffusion Models (The New Stars): These are the current champions. They are like master sculptors who can create incredibly realistic textures. However, they are a bit "moody." Sometimes they are too smooth (boring), and sometimes they go wild (hallucinating). They need very specific instructions (parameters) to get the job done right.
The GANs (The Old Guard): These are the veterans. They are reliable but often produce images that look a bit "plastic" or fake. They rarely hallucinate, but they also rarely create amazing new details.
The General Generators (The Wildcards): These are models designed to create anything from scratch (like making a picture of a cat from a text prompt). When you try to use them to fix photos, they are unpredictable. Sometimes they do a great job; other times, they change the entire identity of the person in the photo.
The PSNR-Oriented Models (The Cleaners): These are the traditional tools, trained to match the original pixels as closely as possible. They are very good at keeping the original image exactly as it is, but they can't "invent" new details to fix big holes.

4. The "Ruler" Problem

How do you know if an AI did a good job?

The Old Ruler: Previously, we used math-based scores (like PSNR) that measured how close the pixels were to the original. This is like judging a painting only by how many brushstrokes match the original. It misses the feeling.
The New Ruler: The researchers built a human-like judge. They asked real people to rate the images on four things:
- Detail: Is it too smooth or too messy?
- Sharpness: Is it blurry or too harsh?
- Semantics: Did the AI change the meaning (e.g., turning a dog into a cat)?
- Overall: Would you hang this on your wall?
They used these human ratings to train a new AI judge that can spot these subtle errors much better than old math formulas.

The Big Takeaway

We have come a long way. We can now restore old photos with stunning realism. But we have hit a new wall.

The AI is no longer struggling to see the image; it is struggling to control its imagination. The future of image restoration isn't about making the AI smarter; it's about teaching it restraint. We need an AI that knows exactly when to add a detail and when to leave it alone, ensuring that the restored photo is not just beautiful, but truthful.

In short: The magic is working, but we need to teach the magician to stop pulling rabbits out of hats when we just wanted to fix a broken vase.

1. Problem Statement

Generative Image Restoration (GIR) has achieved remarkable perceptual realism, often surpassing traditional methods in generating fine textures and details. However, the field lacks a systematic understanding of the practical capabilities and limitations of these models in real-world scenarios.

The Gap: Existing benchmarks primarily focus on synthetic degradations (e.g., Gaussian noise, simple blur) and holistic quality scores. They leave four areas unaddressed:
- Semantic Sensitivity: How models perform on specific content types (e.g., faces, text, hands) where human perception is highly sensitive to errors.
- Real-world Degradations: Complex, authentic degradations like old film, surveillance footage, and low-light conditions.
- Failure Modes: The shift from "under-generation" (lack of detail) to "over-generation" (hallucinated details, semantic inconsistencies) is not well quantified.
- Evaluation Metrics: Current Image Quality Assessment (IQA) methods often fail to detect semantic hallucinations or provide only a single score, masking specific failure causes.

2. Methodology

The authors propose a comprehensive, multi-dimensional evaluation framework involving dataset construction, model benchmarking, and a new IQA training pipeline.

A. Dataset Construction

The authors constructed a new, large-scale test set comprising 7,080 restored images generated by 20 different models. The dataset is balanced across two dimensions:

Semantic Categories (21 types): Ranging from human-centric (faces, hands/feet, crowds) to structural (architecture, vehicles), texture-rich (fur, fabric), and symbolic (text, comics).
Degradation Types (11 types): Covering both synthetic (compression, blur) and authentic real-world conditions (old photos, surveillance, low light, motion blur).

Data Sources: The test set draws on 147 high-quality images degraded via a modified RealESRGAN pipeline and 207 authentic degraded images from the wild.
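The reported counts can be cross-checked with simple arithmetic. The short sketch below assumes that every source image is restored exactly once by every model; the summary does not state this explicitly, but the numbers are consistent with it.

```python
# Sanity check of the dataset size reported above.
# Assumption: each source image is restored exactly once by each model.
synthetic_sources = 147   # high-quality images with synthetic degradation
authentic_sources = 207   # authentic degraded images from the wild
num_models = 20

total_sources = synthetic_sources + authentic_sources
restored_images = total_sources * num_models

print(total_sources)    # 354
print(restored_images)  # 7080
```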

B. Model Selection

The study evaluates 20 representative models across four architectural families:

Diffusion-based (e.g., SUPIR, DiffBIR, StableSR): Current mainstream for GIR.
General Generation Models (e.g., FLUX, Nano Banana): Open-ended generation models applied to restoration.
GAN-based (e.g., RealESRGAN, BSRGAN): Traditional perceptual restoration baselines.
PSNR-oriented (e.g., SwinIR, HAT): Deterministic models for sharpness/fidelity baselines.
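For readers unfamiliar with the metric behind the "PSNR-oriented" family, here is a minimal sketch of the standard PSNR definition, 10 · log10(MAX² / MSE). It works on flat sequences of pixel values and assumes an 8-bit range by default; the function name and the flat-list input are illustrative choices, not something taken from the paper.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no error at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by 10 grey levels, so the MSE is 100.
example = psnr([100, 100, 100, 100], [110, 90, 110, 90])
print(round(example, 2))  # 28.13
```

Because the score depends only on per-pixel differences, it rewards outputs that stay close to the reference and says nothing about whether added texture is plausible or semantically correct.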

C. Human Evaluation Protocol

Instead of a single score, the authors introduced a multi-dimensional scoring framework based on human annotation:

Detail: Bipolar scale (-3 to +3). Negative = under-generation (smooth); Positive = over-generation (hallucinated).
Sharpness: Bipolar scale (-3 to +3). Negative = blurry; Positive = over-sharpened/haloed.
Semantics: 0 to 4 scale. Measures structural correctness and semantic alignment (e.g., is the text legible? is the face distorted?).
Overall Quality: 0 to 4 scale. Willingness to accept the restoration.
Reliability: 56 professional annotators took part, achieving high inter-annotator agreement (Krippendorff's $\alpha > 0.7$).
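The four scales above can be captured in a small record type. The sketch below is one possible encoding, not the authors' annotation tool: the class and function names are hypothetical, a Detail score of 0 is assumed to mean "balanced", and higher Semantics and Overall scores are assumed to be better.

```python
from dataclasses import dataclass

# Allowed range for each dimension, as described in the protocol above.
RANGES = {
    "detail": (-3, 3),     # negative = under-generation, positive = over-generation
    "sharpness": (-3, 3),  # negative = blurry, positive = over-sharpened
    "semantics": (0, 4),   # assumed: higher is better
    "overall": (0, 4),     # assumed: higher is better
}

@dataclass(frozen=True)
class Annotation:
    detail: int
    sharpness: int
    semantics: int
    overall: int

    def __post_init__(self):
        # Reject scores that fall outside the documented scales.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

def detail_label(annotation):
    """Translate the bipolar Detail score into a failure direction."""
    if annotation.detail < 0:
        return "under-generation"
    if annotation.detail > 0:
        return "over-generation"
    return "balanced"  # assumption: 0 is the ideal midpoint

print(detail_label(Annotation(detail=2, sharpness=0, semantics=3, overall=3)))
# over-generation
```

The point of the bipolar scale is visible here: a single signed number records both how far a result is from ideal and in which direction it failed.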

D. New IQA Model

Using the human-annotated dataset, the authors trained a new multi-dimensional IQA model based on the DeQA-Score framework. This model predicts scores for Detail, Sharpness, Semantics, and Overall quality independently, allowing for diagnostic evaluation rather than just a single metric.
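To illustrate what "diagnostic evaluation" could look like downstream, the sketch below turns predicted per-dimension scores into a list of likely failure causes. It does not reproduce the authors' model; the function name and both thresholds (`band`, `min_semantics`) are hypothetical values chosen only for the example.

```python
def diagnose(detail, sharpness, semantics, band=1.0, min_semantics=2.0):
    """Map predicted scores to human-readable failure causes.

    detail and sharpness use the bipolar -3..+3 scales; semantics uses 0..4.
    band and min_semantics are illustrative thresholds, not values from the paper.
    """
    issues = []
    if detail <= -band:
        issues.append("under-generated detail")
    elif detail >= band:
        issues.append("over-generated detail")
    if sharpness <= -band:
        issues.append("blurry")
    elif sharpness >= band:
        issues.append("over-sharpened")
    if semantics < min_semantics:
        issues.append("semantic error")
    return issues

print(diagnose(detail=2.1, sharpness=0.2, semantics=1.0))
# ['over-generated detail', 'semantic error']
```

A single overall score would flag this image as poor without saying why; separate dimensions make the cause explicit.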

3. Key Contributions

First Fine-Grained GIR Benchmark: A systematic study covering the joint impact of semantic content and degradation types, revealing that model performance is highly dependent on the specific scene.
Paradigm Shift Identification: The paper identifies a critical evolution in failure modes: the field has moved from the problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation).
Diagnostic Evaluation Framework: A multi-dimensional scoring system that distinguishes between under-generation, over-generation, and semantic hallucinations, providing insights that single-score metrics miss.
Novel IQA Model: A trained model that significantly outperforms existing state-of-the-art IQA methods (e.g., CLIP-IQA, MANIQA) in correlating with human judgments on generative restoration tasks, specifically in detecting semantic inconsistencies.

4. Key Results & Findings

A. Semantic-Dependent Behavior

High Variance: Models perform exceptionally well on content such as animal fur and cartoons but struggle significantly with small faces, crowds, hands/feet, and text.
The "Over-Generation" Problem: While GAN-based and PSNR-oriented models rarely over-generate (staying in the "under-generation" zone), diffusion-based models frequently oscillate between under-generation and over-generation. They often hallucinate excessive textures (e.g., too many wrinkles on skin, extra railings in street views) or distort geometry.
Semantic Sensitivity: Diffusion models show high upper-bound scores (best-case scenarios) but suffer from low lower-bound scores (worst-case scenarios) in complex semantic scenes, indicating instability.
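The upper-bound and lower-bound comparison can be sketched as follows. This assumes the bounds are simply the best and worst scores a model achieves across scenes, which may differ from the paper's exact definition. The model names and numbers are made up for illustration and are not results from the study.

```python
def score_bounds(scores_by_model):
    """Map each model to (lower_bound, upper_bound, spread) of its scores."""
    bounds = {}
    for model, scores in scores_by_model.items():
        if not scores:
            continue  # skip models with no scores rather than guessing
        lower, upper = min(scores), max(scores)
        bounds[model] = (lower, upper, upper - lower)
    return bounds

# Made-up per-scene scores on the 0..4 scale, for illustration only.
toy_scores = {
    "hypothetical_diffusion": [0.5, 2.0, 3.5],
    "hypothetical_gan": [1.5, 2.0, 2.5],
}
bounds = score_bounds(toy_scores)
print(bounds["hypothetical_diffusion"])  # (0.5, 3.5, 3.0)
print(bounds["hypothetical_gan"])        # (1.5, 2.5, 1.0)
```

A wide spread with a high upper bound is the instability pattern described above: excellent best cases paired with poor worst cases.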

B. Degradation-Dependent Behavior

Information Deficiency: Models struggle most with degradations that involve irreversible information loss, such as motion blur, surveillance footage, and old film.
Generalization: Diffusion-based models generally show the strongest robustness across different degradation types compared to GANs and PSNR models, but they still fail when input information is too sparse.

C. Parameter Sensitivity

Configuration Matters: For diffusion-based models, performance is highly sensitive to parameter configurations (e.g., CFG scale, noise levels). There is no single "optimal" setting; the best configuration depends heavily on the specific semantic scene and degradation type.
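One way to see why no single setting wins is to pick the best configuration separately for each scene. The sketch below does this for a list of (scene, CFG scale, score) records; the record layout, the CFG values, and the scores are all hypothetical and serve only to show the selection logic.

```python
def best_config_per_scene(runs):
    """Return {scene: (config, score)} keeping the highest-scoring run per scene."""
    best = {}
    for scene, config, score in runs:
        if scene not in best or score > best[scene][1]:
            best[scene] = (config, score)
    return best

# Made-up runs: (scene, cfg_scale, overall score). Not results from the paper.
toy_runs = [
    ("text", 2.0, 2.9),
    ("text", 7.5, 1.4),
    ("fur", 2.0, 2.1),
    ("fur", 7.5, 3.3),
]
best = best_config_per_scene(toy_runs)
print(best)  # {'text': (2.0, 2.9), 'fur': (7.5, 3.3)}
```

In this toy data a low guidance scale suits text while a high one suits fur, so any global default would be a compromise for at least one scene.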

D. IQA Performance

Superior Correlation: The authors' trained IQA model achieves an SRCC of 0.662 and PLCC of 0.677 against human annotations, significantly outperforming existing methods (which scored around 0.3–0.4).
Diagnostic Capability: Unlike traditional IQA, the new model can identify why an image is bad (e.g., distinguishing between a blurry image and a semantically hallucinated one).
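SRCC and PLCC are the Spearman rank and Pearson linear correlation coefficients between predicted and human scores. The sketch below computes both with the standard library only, using average ranks for ties. It omits any score-fitting step that an IQA evaluation protocol might apply before computing PLCC, so it should be read as the textbook definitions rather than the authors' exact procedure.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))

human = [1, 2, 3, 4]
predicted = [1, 4, 9, 16]  # monotonic but non-linear in the human scores
print(round(spearman(human, predicted), 3))  # 1.0
print(round(pearson(human, predicted), 3))   # 0.984
```

The example shows why both numbers are reported: SRCC checks whether images are ranked in the same order as humans rank them, while PLCC also penalizes a non-linear relationship between predicted and human scores.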

5. Significance and Future Directions

Redefining Evaluation: The paper argues that the era of "pixel-level fidelity" or "single-score realism" is over. Future evaluation must prioritize semantic consistency and controllability to prevent hallucinations.
Guidance for Development: The findings suggest that future GIR models need:
- Better mechanisms to control generative strength to avoid over-generation.
- Task-specific architectures rather than relying solely on general-purpose generation models.
- Adaptive parameter tuning based on scene semantics.
Foundation for Agents: The multi-dimensional evaluation framework provides a necessary foundation for developing agent-based image restoration systems that can diagnose failures and adjust strategies dynamically.

In conclusion, this work serves as a critical reality check for the Generative Image Restoration field, highlighting that while models have become incredibly powerful at generating details, they have not yet mastered the control required to ensure those details are semantically correct and faithful to the original input.