Imagine you have an old, blurry, and scratched-up photograph of a family reunion. You want to restore it so it looks crisp and new again. This is the job of Image Super-Resolution (SR).
For a long time, computers tried to fix these photos by interpolation: mathematically averaging nearby pixels to guess what the missing ones should look like. But the result often looked like a smooth, plastic painting, either too perfect or full of weird, fake details (like a dog with six legs).
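To see why the "old math" approach looks plastic, here is a minimal sketch of bilinear interpolation written from scratch (a toy grayscale grid, not any real SR library). Each new pixel is just a weighted average of its four nearest source pixels, so a sharp edge becomes a smooth gradient and no genuinely new detail ever appears.

```python
def bilinear_upscale(img, factor):
    """Upscale a 2-D grid of pixel values by linear interpolation.

    img: list of rows of numbers (grayscale pixels).
    Each output pixel is a weighted average of its four nearest
    source pixels -- smooth, but no new detail is invented.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        # Map the output coordinate back into source space.
        sy = min(y / factor, h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(x / factor, w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

# A 2x2 "photo" with a hard edge between dark (0) and bright (100).
tiny = [[0, 100],
        [0, 100]]
big = bilinear_upscale(tiny, factor=2)
# The hard edge is smeared into a gradient -- the "plastic painting" effect.
print(big[0])  # [0.0, 50.0, 100.0, 100.0]
```

The 50.0 in the middle is the giveaway: the method can only blend what is already there, never restore the crisp edge the original scene actually had.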
Recently, computers started using "AI artists" (called Diffusion Models) that are great at creating new images from scratch. But when you ask them to fix a blurry photo, they sometimes get confused. They might hallucinate a beach scene where there was actually a living room, or they might mix up the big picture with the tiny details.
This paper introduces DTPSR, a new way to guide these AI artists. Here is how it works, explained simply:
1. The Problem: The "Confused Chef"
Imagine you hire a chef to cook a complex meal, but you only give them a vague note: "Make something with a dog and a ball."
The chef might make a delicious meal, but the dog might be made of soup, or the ball might be floating in the sky. The chef didn't know how the dog should look (its shape) or what the fur should feel like (its texture). They got the "idea" but missed the "details."
Existing AI methods are like this chef. They get a general description, but they mix up the big layout (where the dog is) with the tiny textures (the fur), leading to messy results.
2. The Solution: The "Disentangled Recipe"
The authors of this paper say: "Let's stop giving the chef one vague note. Let's give them a structured, step-by-step recipe that separates different types of information."
They call this Disentangled Textual Priors. Think of it as breaking the instructions down into three distinct layers:
- Layer 1: The Blueprint (Global Context)
- Analogy: "There is a dog in a grassy field."
- What it does: This tells the AI the big picture. Where are the objects? What is the scene? This ensures the dog stays on the grass and doesn't float in the sky.
- Layer 2: The Shape & Color (Low-Frequency)
- Analogy: "The dog is brown and white, medium-sized, and sitting."
- What it does: This handles the "smooth" parts. It defines the general shape and colors without worrying about individual hairs yet.
- Layer 3: The Texture & Edges (High-Frequency)
- Analogy: "The fur is fluffy, the nose is wet, and the eyes are shiny."
- What it does: This handles the "crunchy" details. It adds the sharp edges, the rough texture of the grass, and the shiny reflection in the eye.
By separating these instructions, the AI doesn't get confused. It builds the scene layer by layer, just like an architect builds a house (foundation first, then walls, then paint).
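The three layers above can be pictured as a simple structured object instead of one mixed-up caption. This is a hypothetical sketch (the class and field names are mine, not the paper's API) of how a disentangled prior might be packaged as three separate conditioning signals for a diffusion model:

```python
from dataclasses import dataclass

@dataclass
class DisentangledPrior:
    """One 'recipe' split into the three layers DTPSR describes."""
    global_context: str   # the blueprint: scene and object layout
    low_frequency: str    # smooth structure: shapes, sizes, colors
    high_frequency: str   # crunchy detail: textures and edges

    def as_conditioning(self) -> dict:
        """Package each layer separately, so a diffusion model could
        consume them as distinct signals rather than one vague note."""
        return {
            "global": self.global_context,
            "low_freq": self.low_frequency,
            "high_freq": self.high_frequency,
        }

prior = DisentangledPrior(
    global_context="a dog sitting in a grassy field",
    low_frequency="the dog is brown and white, medium-sized",
    high_frequency="fluffy fur, wet nose, shiny eyes, rough grass",
)
cond = prior.as_conditioning()
print(sorted(cond))  # ['global', 'high_freq', 'low_freq']
```

The point of the structure is exactly the "recipe" idea: because the layers never share one text string, layout instructions cannot leak into texture instructions and vice versa.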
3. The Secret Ingredient: The "DisText-SR" Cookbook
To teach the AI this new way of thinking, the authors had to create a massive new textbook. They built a dataset called DisText-SR.
They collected 95,000 photos and, for every single one, wrote three different descriptions:
- A sentence about the whole scene.
- A sentence about the shape of every object.
- A sentence about the texture of every object.
They used other AI tools to automatically write these descriptions, creating a massive library of "perfect recipes" for the DTPSR model to learn from.
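The annotation loop could look something like the sketch below. This is an assumption about the shape of the pipeline, not the authors' actual tooling: the `caption` function here is a stub standing in for a real vision-language model that would be queried once per layer.

```python
# Three instructions, one per layer of the "recipe".
PROMPTS = {
    "global": "Describe the overall scene and object layout.",
    "low_freq": "Describe each object's shape, size, and color.",
    "high_freq": "Describe each object's surface texture and edges.",
}

def caption(image_path: str, instruction: str) -> str:
    # Stand-in for a call to an automatic captioning model.
    return f"[{instruction}] for {image_path}"

def annotate(image_paths):
    """Produce one three-part description record per image."""
    dataset = []
    for path in image_paths:
        record = {"image": path}
        for layer, instruction in PROMPTS.items():
            record[layer] = caption(path, instruction)
        dataset.append(record)
    return dataset

records = annotate(["photo_0001.jpg", "photo_0002.jpg"])
print(len(records), sorted(records[0]))
# 2 ['global', 'high_freq', 'image', 'low_freq']
```

Run over tens of thousands of images, a loop like this yields the "library of perfect recipes" the model trains on, with no human writing captions by hand.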
4. The Safety Net: The "Editor"
Even with a great recipe, the AI might still make a mistake (like giving the dog a third ear). To fix this, the authors added a Multi-Branch Guidance system.
Think of this as having three different editors checking the work:
- Editor 1 checks: "Is the layout right?" (If not, fix the global scene).
- Editor 2 checks: "Are the shapes correct?" (If not, fix the colors and sizes).
- Editor 3 checks: "Are the textures realistic?" (If not, fix the fur and edges).
If the AI starts to "hallucinate" (make up nonsense), these editors catch it immediately and say, "No, that's wrong, try again," but they do it specifically for that type of error.
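Numerically, the three "editors" can be sketched in the style of classifier-free guidance (a common diffusion technique; the paper's exact formulation may differ, and all numbers below are made-up toys). Each branch compares the model's prediction with and without its own textual layer, and nudges the result in that direction with its own strength:

```python
def multi_branch_guidance(eps_uncond, branch_preds, weights):
    """Combine an unconditional noise prediction with per-branch
    conditional predictions. Each branch 'editor' pushes the sample
    toward what its own layer of text says the image should be."""
    guided = eps_uncond
    for name, eps_cond in branch_preds.items():
        guided = guided + weights[name] * (eps_cond - eps_uncond)
    return guided

# Toy scalars standing in for noise-prediction tensors.
eps_uncond = 0.0
branch_preds = {"global": 1.0, "low_freq": 0.5, "high_freq": 0.2}
weights = {"global": 2.0, "low_freq": 1.5, "high_freq": 1.0}

print(multi_branch_guidance(eps_uncond, branch_preds, weights))  # 2.95
```

Because every branch gets its own weight, a layout mistake can be corrected firmly (a big `global` weight) without also over-sharpening textures, which is exactly the "fix that type of error specifically" behavior described above.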
The Result
When you put all this together, DTPSR creates photos that look incredibly real.
- Before: The AI might turn a blurry wall into a weird ocean texture because it got confused.
- After: The AI knows the wall is a wall (Global), it's beige (Low-Frequency), and it has a rough brick texture (High-Frequency).
In short: This paper teaches AI to stop guessing the whole picture at once. Instead, it teaches the AI to look at the big picture, then the shapes, and finally the tiny details, using a special set of instructions to make sure everything fits together perfectly. The result is a photo restoration that is sharp, realistic, and free of weird AI mistakes.