Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

Imagine you are trying to solve a jigsaw puzzle, but someone has thrown away half the pieces, smudged the picture with mud, and maybe even cut out the corners. This is what scientists call an inverse problem: trying to figure out what the original picture looked like based on a broken, messy version of it.

In the world of AI, Diffusion Models are like incredibly talented artists who have memorized millions of pictures. If you ask them to draw a face, they can do it beautifully. But if you ask them to "fix this broken photo," they often guess. They might draw a face that looks real, but it's the wrong person, or the wrong expression. They are guessing in the dark.

This paper introduces a clever new trick to help these AI artists solve the puzzle without needing to go back to art school. The authors call it "Inference-Time Search with Side Information."

Here is how it works, broken down into simple concepts:

1. The Problem: The "Guessing Game"

Usually, when an AI tries to fix a blurry photo, it just follows a standard path. It's like a hiker walking down a single trail in the fog. If the trail leads to a cliff (a bad guess), the hiker falls. The AI often gets stuck because there are too many possibilities for what the original image could be.

2. The Secret Weapon: "Side Information"

The authors realized that in real life, we rarely solve puzzles in a vacuum.

If you are trying to restore an old photo of your grandfather, you might have a new photo of him to compare it to.
If you are trying to restore a blurry picture of a dog, you might have a text description that says, "It's a Golden Retriever sitting by a lake."

This extra info is called Side Information. The problem is that teaching an AI to use this extra info usually requires training it on millions of specific pairs (e.g., "Blurry Dog + Text Description = Clear Dog"). That takes forever and is expensive.

3. The Solution: The "Search Party"

Instead of teaching the AI a new language, the authors say: "Let's just send out a search party."

Imagine the AI is a scout trying to find a lost hiker (the original image).

Old Way: The scout picks one path and walks it until the end. If they get lost, they are stuck.
New Way (The Paper's Method): The scout sends out 8 different teams (particles) at the same time. Each team takes a slightly different path.

As the teams walk, they carry a Reward Scorecard that grades their progress against the "Side Information."

If a team is walking toward a path that looks like the "Golden Retriever" description, they get a high score.
If a team is walking toward a path that looks like a "Cat," they get a low score.

4. The Magic Trick: "Fork and Join"

This is where the paper gets really smart. They don't just let the teams walk randomly. They use a strategy called Recursive Fork-Join Search (RFJS).

Think of it like a game of "Telephone" mixed with a family reunion:

The Fork (Exploration): Every so often, the teams split up. Some teams branch off to try wild, new ideas. This ensures they don't all get stuck in the same wrong place.
The Join (Exploitation): At specific checkpoints, the teams compare notes. The teams that are doing well (high scores) get to "clone" themselves. The teams that are doing poorly get sent home.
The Result: The "bad" paths die out, and the "good" paths get stronger and more numerous. By the time they reach the end, almost all the teams are walking the same correct path, guided by the side information.

5. Why This is a Big Deal

No Retraining: You don't need to teach the AI anything new. You just plug this "Search Party" module into existing AI tools. It works like a universal adapter.
Works with Anything: Whether your side info is a text prompt, a reference photo, or even a different type of medical scan (like a second MRI taken with a different contrast), the system treats them all the same way. It just asks, "Does this look like the side info?"
Better Results: In experiments, this method fixed blurry faces, restored missing parts of images, and sharpened medical scans much better than previous methods, especially when the damage was severe.

The Bottom Line

The paper is essentially saying: "Don't just let the AI guess blindly. Give it a hint, send out multiple guesses, and let the best guesses survive and multiply."

It's like hiring a team of detectives instead of a single detective. If one detective gets the wrong clue, the others might still find the truth. And by constantly checking their work against the "Side Information" (the clues), they ensure they are solving the right case, not just a case.

1. Problem Statement

The paper addresses inverse problems in image reconstruction (e.g., inpainting, super-resolution, deblurring) where the goal is to recover a latent signal $x_0$ from noisy or partial observations $y = A(x_0) + \sigma z$. Here $A$ is the forward (measurement) operator, $\sigma$ is the noise level, and $z$ is the measurement noise.
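To make the measurement model concrete, here is a minimal sketch of $y = A(x_0) + \sigma z$ for a one-dimensional "box inpainting" operator. The helper names (`box_mask_operator`, `observe`), the Gaussian noise, and the toy signals are assumptions made for illustration and do not come from the paper. The sketch also shows why such an operator loses information: two different signals can produce the same clean measurement.

```python
import random

def box_mask_operator(x, start, stop):
    """Toy forward operator A: zero out a contiguous box of a 1-D signal."""
    return [0.0 if start <= i < stop else v for i, v in enumerate(x)]

def observe(x0, operator, sigma, rng):
    """Simulate y = A(x0) + sigma * z, assuming z is standard Gaussian noise."""
    return [a + sigma * rng.gauss(0.0, 1.0) for a in operator(x0)]

def A(x):
    """Mask positions 2 and 3 (an assumed box location for this toy)."""
    return box_mask_operator(x, 2, 4)

rng = random.Random(0)
x_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_b = [1.0, 2.0, -9.0, 7.5, 5.0, 6.0]  # differs from x_a only inside the box

# Both signals yield the same noiseless measurement, so the measurement
# alone cannot tell them apart.
same_measurement = A(x_a) == A(x_b)
y = observe(x_a, A, sigma=0.05, rng=rng)
```

Any information about the masked region has to come from somewhere else, such as the prior or side information.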

The Challenge: In severely ill-posed settings (where $A$ is non-injective), standard diffusion-based solvers (like DPS or DAPS) often fail to recover the specific ground truth because the posterior distribution is multimodal. Unconstrained sampling may converge to a plausible but incorrect solution.
The Gap: Existing diffusion solvers typically ignore side information ($S$), that is, auxiliary data perceptually related to the target (e.g., a reference image of the same person, a text description, or a different MRI contrast).
Limitations of Current Solutions:
- Training-based approaches: Training a conditional diffusion model $p(X|Y, S)$ requires massive paired datasets and locks the model to a specific modality of side information.
- Gradient-based guidance: Methods like Reward Gradient Guidance (RGG) require differentiable reward functions and backpropagation through the denoiser at every step, which is computationally expensive, sensitive to hyperparameters, and prone to artifacts.

Core Question: How can we leverage a pre-trained unconditional diffusion prior to solve inverse problems with arbitrary side information at inference time without retraining?

2. Methodology

The authors propose a plug-and-play, training-free inference-time search framework that integrates side information via a reward function.

A. Modeling Side Information via Reward Tilting

Instead of learning a new conditional distribution, the authors model the side-information-conditioned distribution $p_{0|S}(x_0|s)$ as a reward-tilted version of the pre-trained unconditional prior $p_0(x_0)$.

Assumption: The conditional distribution is approximated as:
$p_{0|S}(x_0 | s) \propto p_0(x_0) \exp\left(\frac{r(x_0; s)}{\tau}\right)$
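where $r(x_0; s)$ is a reward function measuring the consistency between a candidate reconstruction and the side information $s$, and $\tau$ is a temperature parameter. A smaller $\tau$ concentrates the distribution on high-reward candidates, while a larger $\tau$ keeps it closer to the unconditional prior.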
Modality Agnostic: This formulation works with any side information (text, images, features) as long as a pre-trained reward model exists to score the alignment.
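The effect of the temperature $\tau$ can be seen in a small sketch. If candidates are drawn from the prior $p_0$, weighting each one by $\exp(r/\tau)$ and normalizing gives self-normalized importance weights for the tilted distribution. The function name `tilted_weights` and the reward values below are illustrative assumptions, not part of the paper.

```python
import math

def tilted_weights(rewards, tau):
    """Normalized weights proportional to exp(r / tau), computed stably."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(rewards)  # subtract the max so exp() never overflows
    unnorm = [math.exp((r - m) / tau) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]

rewards = [0.2, 0.5, 0.9]                  # hypothetical reward scores
sharp = tilted_weights(rewards, tau=0.05)  # low temperature
flat = tilted_weights(rewards, tau=10.0)   # high temperature
```

With a low temperature nearly all the weight lands on the highest-reward candidate; with a high temperature the weights stay close to uniform, so the prior dominates.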

B. Inference-Time Search Algorithms

To sample from this tilted distribution without backpropagating reward gradients through the diffusion network, the authors employ particle-based search strategies. They maintain a set of $N$ particles (candidate reconstructions) and use the reward function to guide resampling.

Two specific strategies are introduced:

Greedy Search (GS): Periodically resamples particles (every $B$ steps) by selecting the top candidates based on the reward. This is a "Best-of-N" approach applied at intervals.
Recursive Fork-Join Search (RFJS): A hierarchical strategy designed to balance exploration and exploitation.
- Fork: At intermediate time steps, particles are grouped into smaller clusters (e.g., of size $N/2$ or $N/4$) and resampled independently. This preserves diversity and explores different structural hypotheses.
- Join: At larger intervals (every $B$ steps), all particles are resampled together to exploit the best global candidates.
- Benefit: RFJS prevents the "mode collapse" often seen in greedy resampling while avoiding the inefficiency of pure random sampling.
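The difference between the two strategies can be sketched with deterministic top-k selection. This is one plausible reading of the description above, not the authors' implementation: the function names, the `keep` parameter, the clone-to-refill rule, and the fork/join schedule in `rfjs_resample` are all assumptions, and the actual method may resample stochastically.

```python
def greedy_resample(particles, rewards, keep):
    """Keep the `keep` highest-reward particles and clone them to refill the pool."""
    order = sorted(range(len(particles)), key=lambda i: rewards[i], reverse=True)
    survivors = [particles[i] for i in order[:keep]]
    return [survivors[i % keep] for i in range(len(particles))]

def fork_resample(particles, rewards, group_size, keep):
    """Fork: resample inside disjoint groups so each group keeps its own best."""
    out = []
    for start in range(0, len(particles), group_size):
        grp = particles[start:start + group_size]
        rw = rewards[start:start + group_size]
        out.extend(greedy_resample(grp, rw, min(keep, len(grp))))
    return out

def rfjs_resample(particles, rewards, t, join_every, fork_every, group_size, keep):
    """Assumed schedule: join across the whole pool every `join_every` steps,
    fork within groups every `fork_every` steps, otherwise leave the pool alone."""
    if t % join_every == 0:
        return greedy_resample(particles, rewards, keep)
    if t % fork_every == 0:
        return fork_resample(particles, rewards, group_size, keep)
    return particles

particles = list("abcdefgh")  # stand-ins for candidate reconstructions
rewards = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.4]
forked = fork_resample(particles, rewards, group_size=4, keep=1)
joined = greedy_resample(particles, rewards, keep=1)
```

In this toy pool, the join step collapses everything onto `b`, while the fork step keeps both `b` and `h` alive even though `h` scores lower than several discarded particles. That surviving second hypothesis is the diversity the fork step is meant to protect.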

C. Integration with Solvers

The framework is modular and can be applied on top of existing solvers (DPS, DAPS, MPGD).

At each diffusion step $t$, the algorithm:
1. Proposes candidate particles.
2. Estimates the clean image $\hat{x}_0$ from the noisy state.
3. Computes the reward $r(\hat{x}_0; s)$.
4. Resamples particles based on the reward (using GS or RFJS logic).
5. Continues the reverse diffusion process with the resampled particles.
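The five steps above can be sketched as a loop that wraps any solver. Everything here is a hypothetical stand-in: `reward_guided_sampling` and `keep_top_half` are illustrative names, the particles are scalars rather than images, and the identity `propose` and `denoise` callables take the place of a real solver update (such as DPS) and a real clean-image estimate.

```python
def keep_top_half(particles, rewards):
    """Toy resampling rule: clone the better half of the pool over the worse half."""
    n = len(particles)
    keep = max(1, n // 2)
    order = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    survivors = [particles[i] for i in order[:keep]]
    return [survivors[i % keep] for i in range(n)]

def reward_guided_sampling(particles, n_steps, resample_every,
                           propose, denoise, reward, resample=keep_top_half):
    """Run the five-step loop; every callable is supplied by the caller."""
    for t in range(n_steps, 0, -1):
        particles = [propose(x, t) for x in particles]   # 1. propose candidates
        if t % resample_every == 0:
            x0_hat = [denoise(x, t) for x in particles]  # 2. estimate clean image
            rewards = [reward(x0) for x0 in x0_hat]      # 3. score against side info
            particles = resample(particles, rewards)     # 4. resample
        # 5. the next iteration continues from the (possibly resampled) pool
    return particles

side_info = 5.0  # hypothetical scalar "side information"
final = reward_guided_sampling(
    particles=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    n_steps=6,
    resample_every=2,
    propose=lambda x, t: x,   # identity: no solver update in this toy
    denoise=lambda x, t: x,   # identity: the state is treated as already clean
    reward=lambda x0: -abs(x0 - side_info),
)
```

Because the solver, the clean-image estimate, and the reward enter only as callables, swapping in a different solver or reward model does not change the loop. In this toy run, three rounds of greedy resampling leave every particle equal to the value that matches the side information.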

3. Key Contributions

Training-Free Framework: A novel method to incorporate arbitrary side information into diffusion-based inverse solvers without retraining the diffusion model or collecting paired datasets.
Reward-Tilted Modeling: A principled mathematical abstraction that decouples the measurement model from the side information, allowing the use of off-the-shelf reward models (e.g., FaceID networks, CLIP, ImageReward).
Search Algorithms: The introduction of RFJS, a recursive search strategy that effectively balances exploration (diversity of solutions) and exploitation (convergence to high-reward solutions), outperforming standard greedy or Best-of-N approaches.
Gradient-Free Advantage: The method supports non-differentiable and black-box reward functions, overcoming the computational and stability limitations of gradient-based guidance.

4. Experimental Results

The authors evaluated the framework across diverse inverse problems and side information types.

Tasks: Box inpainting, Super-resolution (up to 32x), Motion/Gaussian/Nonlinear/Blind deblurring.
Side Information Types:
- Reference Images: Reconstructing a face from a noisy observation using another image of the same person (different pose/lighting).
- Text Descriptions: Reconstructing images from noisy inputs guided by text prompts (e.g., "golden retriever").
- Medical Imaging: MRI reconstruction using complementary contrasts (PD vs. PDFS).
Baselines: Compared against DPS, BlindDPS, DAPS, MPGD, Best-of-N (BoN), and Reward Gradient Guidance (RGG).
Key Findings:
- Performance: RFJS and GS consistently outperformed all baselines in perceptual quality and identity preservation (measured by FaceSimilarity and CLIPScore).
- Ill-Posed Settings: The improvements were most significant in severely ill-posed problems (e.g., 32x super-resolution, heavy masking) where standard solvers failed to preserve identity or semantic content.
- Metric Discrepancy: The paper highlights that classical metrics (PSNR, SSIM, LPIPS) often fail to capture the semantic improvements achieved by the search method. For instance, RFJS might have a slightly lower PSNR than a baseline but significantly better identity preservation (lower FaceSimilarity loss).
- Efficiency: The computational cost scales linearly with the number of particles but remains feasible because the particles can be processed in parallel. RFJS provided better quality than GS with comparable runtime.
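The linear scaling noted under Efficiency can be illustrated with a back-of-the-envelope call count. The cost model is an assumption rather than something reported in the paper: one denoiser call per particle per step, and one reward call per particle at each resampling event. The function name `search_cost` and the step counts are likewise hypothetical.

```python
def search_cost(n_particles, n_steps, resample_every):
    """Count network calls under the assumed cost model described above."""
    denoiser_calls = n_particles * n_steps
    reward_calls = n_particles * (n_steps // resample_every)
    return denoiser_calls, reward_calls

single = search_cost(n_particles=1, n_steps=1000, resample_every=10)
eight = search_cost(n_particles=8, n_steps=1000, resample_every=10)
```

Under this model, going from one particle to eight multiplies both counts by eight. Since the particles are independent between resampling events, those calls can be batched, which is why wall-clock time can grow more slowly than the raw call count.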

5. Significance

Practicality: The "plug-and-play" nature allows researchers and practitioners to immediately enhance existing diffusion pipelines with side information without the cost of training new models.
Generality: By treating side information as a reward signal, the method unifies the handling of diverse modalities (text, images, medical scans) under a single algorithmic framework.
Paradigm Shift: It shifts the focus from learning conditional distributions to searching for optimal samples at inference time, leveraging the power of pre-trained unconditional models. This approach is particularly valuable in domains where paired data is scarce or impossible to obtain (e.g., specific medical modalities or rare events).
Robustness: The search-based approach is less sensitive to hyperparameter choices than gradient-guided methods and works with non-differentiable rewards, making it more reliable for real-world applications.

In conclusion, this work demonstrates that inference-time search is a powerful mechanism to inject side information into diffusion models, significantly improving reconstruction fidelity in challenging, ill-posed inverse problems while maintaining flexibility and avoiding expensive retraining.