FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

Imagine you have a blurry, pixelated photo of your favorite memory. You want to make it crisp and clear again, but you don't want to just "guess" what the missing parts look like (which might make your dog look like a cat) or just "stretch" the pixels (which makes it look blocky).

This is the challenge of Image Super-Resolution (SR). The paper introduces a new AI tool called FiDeSR that solves this problem in a single, lightning-fast step.

Here is how FiDeSR works, explained through simple analogies:

The Problem: The "One-Step" Dilemma

Imagine you are a chef trying to recreate a complex dish from a blurry description.

Old AI methods (Multi-step): These chefs taste the soup, add salt, taste again, add pepper, taste again... They do this 200 times. The result is great, but it takes forever.
New AI methods (One-step): These chefs try to guess the perfect seasoning in one single toss. It's super fast, but they often mess up. They either make the soup too salty (losing the original flavor/fidelity) or forget the spices entirely (losing the fine details).

FiDeSR is the new chef who can get the perfect dish in one toss, balancing speed, flavor, and texture.

The Three Secret Ingredients of FiDeSR

FiDeSR uses three special techniques to ensure the photo looks real and sharp.

1. The "Spotlight" (Detail-Aware Weighting)

The Analogy: Imagine you are painting a masterpiece. If you paint the whole canvas with the same amount of effort, the background might get too muddy, and the tiny details on the face might get lost.
How FiDeSR does it: During training, FiDeSR puts a "spotlight" on the hard parts of the image. It compares its restored result with the sharp original and says, "Hey, this edge of the building is still fuzzy, and this fabric texture is still off. I'm going to put most of my effort there."
It spends less effort on the easy, smooth parts (like a blue sky) and concentrates on the tricky, detailed areas where mistakes usually happen.

2. The "Second Opinion" (Latent Residual Refinement)

The Analogy: Imagine you are taking a math test. You write down your answer (the first guess), but you know you might have made a small calculation error. Instead of just handing it in, you have a smart tutor (the LRRB) who looks at your answer and your scratch paper, finds the tiny mistake, and whispers, "Actually, change this number by a tiny bit."
How FiDeSR does it: The AI makes a first guess at the missing details. Then, a special "refinement block" checks that guess. It doesn't start over; it just fixes the tiny errors and adds the missing "crunch" to the image, ensuring the details aren't blurry or weird.

3. The "Frequency Tuner" (Latent Frequency Injection)

The Analogy: Think of an image like a song.

Low Frequencies are the bass and drums (the structure, the shape, the big picture).
High Frequencies are the cymbals and violins (the fine textures, the hair strands, the fabric weave).
Sometimes, when you restore a song, you get the bass right but the cymbals sound flat. Or you get the cymbals loud but the bass is wobbly.
How FiDeSR does it: FiDeSR has a special knob that lets it tune the bass and the cymbals separately.
It strengthens the Low Frequencies to make sure the building doesn't look wobbly or distorted.
It boosts the High Frequencies to make sure the leaves on the tree look sharp and crisp.
It mixes them back together perfectly so the image looks both stable and detailed.

Why is this a Big Deal?

Speed: Because it does all this in one step (instead of 200), it is incredibly fast. You could restore a photo almost instantly.
Balance: Most fast methods make the photo look "fake" or "plastic." FiDeSR keeps the photo looking real (high fidelity) while making it sharp (high detail).
No Retraining Needed: The "Frequency Tuner" (Ingredient #3) is applied when the photo is restored, so it works without retraining the whole AI. You can just turn the knobs to get more detail or more stability depending on what you like.

The Bottom Line

FiDeSR is like a magic photo restorer that doesn't just guess what's missing. It uses a "spotlight" to focus on the hard parts, a "tutor" to fix tiny mistakes, and a "sound mixer" to balance the structure and the texture. The result? A photo that stays faithful to the original, but in high definition, created in the blink of an eye.

1. Problem Statement

Real-world Image Super-Resolution (Real-ISR) aims to restore high-quality (HQ) images from low-quality (LQ) inputs suffering from unknown and complex degradations. While diffusion models have achieved state-of-the-art results, they face a critical trade-off between fidelity (structural accuracy) and perceptual quality (detail richness).

Existing one-step diffusion models (designed for efficiency) suffer from two main limitations:

Structural Fidelity Degradation: Due to Variational Autoencoder (VAE) conditioning and the prediction of a single global residual, these models often produce structural distortions and low-frequency (LF) inconsistencies.
Insufficient High-Frequency (HF) Detail Restoration: Compressing the iterative denoising process into a single step leads to a loss of fine textures and high-frequency details. The models often fail to recover the HF information lost during the noise injection process or the VAE compression.

2. Methodology: FiDeSR Framework

FiDeSR proposes a high-fidelity, detail-preserving one-step diffusion framework that integrates three key technical components to address the above challenges without requiring multi-step sampling.

A. Detail-Aware Weighting (DAW) Strategy

Goal: To adaptively emphasize regions where the model struggles (high error) during training, preventing overfitting to already well-reconstructed areas.
Mechanism:
- Detail Map ( $D$ ): Generated by applying spatial operators (Sobel, Laplacian, and Variance filters) to the Ground Truth (HQ) image to capture edges, contrast, and texture variance.
- Error Map ( $E$ ): Computed as a weighted combination of pixel-wise error ( $L1$ ) and perceptual error ( $LPIPS$ ) between the restored and HQ images.
- Weight Map ( $W_{DAW}$ ): Calculated as the element-wise product $D \odot E$ .
Application: This weight map modulates both the reconstruction loss ( $L_{rec}$ ) and the Classifier Score Distillation (CSD) regularization loss ( $L_{reg}$ ), so that the training signal is concentrated on structurally complex and perceptually difficult regions.
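
The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the 3x3 kernels, the equal-weight average of the three normalized filter responses, the mixing factor `alpha`, and all function names are assumptions made here. The perceptual (LPIPS) term is passed in as a precomputed map, because computing it needs a pretrained network that is out of scope for this sketch.

```python
import numpy as np

def _filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel, using edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def _normalize(x, eps=1e-8):
    """Rescale a non-negative map to roughly [0, 1]."""
    x = x - x.min()
    return x / (x.max() + eps)

def detail_map(hq):
    """D: edges (Sobel), contrast (Laplacian), and local variance of the HQ image."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
    box = np.full((3, 3), 1.0 / 9.0)
    grad = np.hypot(_filter3x3(hq, sobel_x), _filter3x3(hq, sobel_x.T))
    lap = np.abs(_filter3x3(hq, laplacian))
    mean = _filter3x3(hq, box)
    var = np.maximum(_filter3x3(hq ** 2, box) - mean ** 2, 0.0)
    # Assumption: the three responses are normalized and averaged equally.
    return (_normalize(grad) + _normalize(lap) + _normalize(var)) / 3.0

def error_map(restored, hq, perceptual=None, alpha=0.5):
    """E: weighted mix of pixel-wise L1 error and a perceptual error map."""
    l1 = np.abs(restored - hq)
    if perceptual is None:
        # Placeholder: no perceptual term available in this sketch.
        perceptual = np.zeros_like(l1)
    return alpha * l1 + (1.0 - alpha) * perceptual

def daw_weight(restored, hq, perceptual=None):
    """W_DAW = D (element-wise product) E."""
    return detail_map(hq) * error_map(restored, hq, perceptual)

def weighted_l1(restored, hq, weight):
    """Reconstruction loss modulated by the weight map."""
    return float(np.mean(weight * np.abs(restored - hq)))
```

Because the weight is a product, a region contributes to the loss only when it is both detailed and poorly restored: a flat sky with a large error and a sharp edge that is already correct both receive weights near zero.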

B. Latent Residual Refinement Block (LRRB)

Goal: To correct the unstable and coarse residual predictions inherent in single-step latent diffusion models.
Mechanism:
- Unlike standard residual learning (e.g., ESRGAN) which operates in pixel space, LRRB operates in the latent space.
- It takes the concatenation of the LQ latent ( $z_L$ ) and the initial coarse residual ( $r$ ) predicted by the U-Net.
- Using a Residual-in-Residual Dense Block (RRDB) architecture, it learns an adaptive correction term ( $\Delta r$ ).
- Refinement: The final residual is $r' = r + \Delta r$ , which is then subtracted from $z_L$ to produce a refined latent $z_r = z_L - r'$ . This two-stage prediction (a coarse residual followed by a learned correction) still runs within a single diffusion step and allows more precise adjustments to the residual than a single global prediction.
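
The data flow of the refinement can be sketched as follows. The RRDB network itself is not reproduced here: `make_toy_correction` is a hypothetical stand-in (a small random 1x1 channel-mixing map) that only shows where the learned correction would plug in, and the `(C, H, W)` latent layout is an assumption.

```python
import numpy as np

def refine_residual(z_l, r, correction_fn):
    """Apply r' = r + delta_r and z_r = z_L - r' on (C, H, W) latents."""
    features = np.concatenate([z_l, r], axis=0)  # (2C, H, W) input to the block
    delta_r = correction_fn(features)            # (C, H, W) adaptive correction
    r_refined = r + delta_r
    z_r = z_l - r_refined
    return z_r, r_refined

def make_toy_correction(channels, scale=0.01, seed=0):
    """Hypothetical placeholder for the RRDB: a random 1x1 convolution."""
    rng = np.random.default_rng(seed)
    weight = scale * rng.standard_normal((channels, 2 * channels))

    def correction_fn(features):
        # A 1x1 convolution is a per-pixel linear map over channels.
        return np.einsum("oc,chw->ohw", weight, features)

    return correction_fn
```

With a correction that returns zeros, the function reduces to the plain one-shot update $z_L - r$ , which makes the role of $\Delta r$ as a small additive fix explicit.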

C. Latent Frequency Injection Module (LFIM)

Goal: To enhance perceptual details and maintain structural fidelity during inference without retraining.
Mechanism:
- After the diffusion step and LRRB refinement, the latent representation is decomposed into Low-Frequency ( $\Delta_{LP}$ ) and High-Frequency ( $\Delta_{HP}$ ) components using FFT-based Butterworth filters.
- Selective Injection: The module selectively injects these components back into the latent space using two gates:
  - Spatial Gate ( $M_{sp}$ ): Identifies detailed vs. flat regions based on a detail map derived from the LQ image.
  - Channel Gate ( $M_{ch}$ ): Analyzes frequency energy ratios across latent channels.
- Strategy: LF injection stabilizes global structure and tone, while HF injection enhances textures and edges. This allows for flexible control over the balance between fidelity and detail.
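
A rough NumPy sketch of the decomposition and gated injection is given below. The Butterworth low-pass response $1 / (1 + (d / d_0)^{2n})$ is standard; everything else is an assumption made for illustration, including the cutoff and order values, the energy-ratio form of the channel gate, the way the two gates are multiplied together, the strengths `lam_lf` and `lam_hf`, and the function names. The spatial gate is taken as an input, since the summary only states that it comes from a detail map of the LQ image.

```python
import numpy as np

def butterworth_lowpass(h, w, cutoff=0.25, order=2):
    """Butterworth low-pass response on the FFT frequency grid (cycles/sample)."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return 1.0 / (1.0 + (radius / cutoff) ** (2 * order))

def split_frequencies(z, cutoff=0.25, order=2):
    """Split a (C, H, W) latent into low- and high-frequency parts."""
    h, w = z.shape[-2:]
    low_filter = butterworth_lowpass(h, w, cutoff, order)
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * low_filter, axes=(-2, -1)).real
    high = z - low  # same as filtering with (1 - low_filter)
    return low, high

def channel_gate(low, high, eps=1e-8):
    """Assumed channel gate: share of high-frequency energy per channel."""
    e_low = np.sum(low ** 2, axis=(-2, -1))
    e_high = np.sum(high ** 2, axis=(-2, -1))
    return (e_high / (e_low + e_high + eps))[:, None, None]

def inject(z, spatial_gate, lam_lf=0.1, lam_hf=0.3, cutoff=0.25, order=2):
    """Add gated low/high-frequency components back into the latent."""
    low, high = split_frequencies(z, cutoff, order)
    # Assumption: both gates are combined by element-wise multiplication.
    gate = spatial_gate[None, :, :] * channel_gate(low, high)
    return z + gate * (lam_lf * low + lam_hf * high)
```

In this sketch, raising `lam_hf` favors texture and edges while raising `lam_lf` favors global structure and tone, and setting both to zero returns the latent unchanged, which mirrors the inference-time control described above.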

3. Key Contributions

FiDeSR Framework: A novel one-step diffusion SR framework that successfully bridges the gap between high structural fidelity and rich perceptual detail, overcoming the limitations of existing one-step methods.
Three Core Components:
- DAW: A training strategy that dynamically weights losses based on spatial difficulty and detail richness.
- LRRB: A latent-space refinement block that corrects coarse residual predictions, reducing high-frequency noise and artifacts.
- LFIM: An inference-time module that adaptively injects frequency components to enhance details while preserving structure.
State-of-the-Art Performance: FiDeSR achieves superior results across full-reference (PSNR, SSIM, LPIPS), no-reference (MANIQA), and distribution-level (FID) metrics, outperforming both one-step and competitive multi-step diffusion methods.

4. Experimental Results

Datasets: Evaluated on synthetic (DIV2K) and real-world (RealSR, DRealSR) benchmarks.
Quantitative Performance:
- FiDeSR achieves the best FID scores among all compared methods (e.g., 127.97 on DRealSR vs. 130.48 for StableSR), indicating the closest alignment with real image distributions.
- It consistently outperforms other one-step methods (SinSR, OSEDiff, PiSA-SR) in perceptual metrics (MANIQA, MUSIQ) while maintaining competitive PSNR/SSIM.
- It surpasses multi-step methods (StableSR, SeeSR) in efficiency (1 step vs. 200 and 50 steps, respectively) while matching or exceeding their perceptual quality.
Qualitative Performance: Visual comparisons show FiDeSR produces sharper textures and more faithful structures compared to competitors, which often suffer from over-smoothing (OSEDiff), structural distortion (AddSR), or excessive artifacts (PiSA-SR).
Efficiency: Despite adding LRRB and LFIM, FiDeSR maintains a competitive inference time (~0.078 s) and parameter count (~1.29B), comparable to other one-step models.

5. Significance

FiDeSR demonstrates that one-step diffusion models can achieve high-fidelity and detail-rich restoration without the computational burden of multi-step sampling. By integrating frequency-aware guidance (LFIM) and residual refinement (LRRB) with a difficulty-aware training strategy (DAW), the paper provides a new direction for efficient Real-ISR. This work suggests that the perception-distortion trade-off can be effectively managed in a single-step framework, paving the way for real-time, high-quality image restoration applications in video and multi-modal tasks.