Single Image Reflection Separation via Dual Prior Interaction Transformer

Imagine you are trying to take a photo of a beautiful painting inside a museum, but you have to shoot it through a glass case. The glass reflects the lights from the ceiling and your own face, creating a messy "ghost" image that sits on top of the painting. Your goal is to digitally remove that ghost so you can see the painting clearly again.

This is the problem of Single Image Reflection Removal. It's incredibly hard because the camera only sees one jumbled picture (the painting + the reflection), and it has to guess which pixels belong to the painting and which belong to the ghost.

Here is a simple breakdown of how the paper's method, DPIT, tackles this problem using two main tricks: the "Smart Filter" (LLCN) and the "Double-Brain Team" (DSCRT).

1. The Problem: Too Much Guesswork

Previous methods tried to solve this by either:

Using a "General Brain": A pre-trained AI that knows what objects look like (like a general knowledge encyclopedia). This helps, but it's too vague. It knows "that's a building," but it doesn't know exactly which pixels are the building and which are the reflection.
Using a "Specialist Brain": A network trained specifically to remove reflections. But these are often huge, slow, and expensive to run.

The authors realized that relying on just one of these wasn't enough. They needed a way to get a detailed, fine-grained guess of the painting without needing a massive, slow computer.

2. Trick #1: The "Smart Filter" (Local Linear Correction Network)

Instead of asking the AI to paint the clean image from scratch (which is like asking an artist to recreate a masterpiece from memory), they asked the AI to adjust the existing messy photo.

The Analogy: Imagine you have a photo that is slightly too dark and slightly too red. Instead of repainting the whole photo, you just turn a few dials: "Make the shadows 10% brighter" and "Reduce the red by 5%."
How it works: The authors built a lightweight tool called LLCN. It looks at the messy photo and calculates two simple things for every single pixel:
1. Scale ( $s$ ): How much of this pixel should we keep? (e.g., "Keep 80% of this pixel, it's probably the painting.")
2. Bias ( $b$ ): How much brightness/color should we add or subtract? (e.g., "This pixel is too bright because of a reflection, dim it down.")
The Result: This creates a "Transmission Prior", a rough draft of the clean image. Because the AI only has to learn how to tweak the image rather than create it from scratch, it needs far fewer parameters and much less computation while still producing an accurate draft.

3. Trick #2: The "Double-Brain Team" (Dual-Prior Interaction)

Now, the system has two sources of information:

The General Brain: A powerful AI that knows the big picture (semantics).
The Smart Filter: The rough draft created by the LLCN (fine details).

The challenge is: How do you mix these two without them getting confused or slowing each other down?

The Old Way: Most methods just mashed the two brains together, forcing them to talk to each other constantly. This is like trying to have a conversation in a crowded room where everyone is shouting; it's computationally expensive and messy.
The New Way (DSCRT): The authors designed a Channel Reorganization system.
- The Analogy: Imagine a relay race with two runners (the two brains). Instead of them running side-by-side and bumping into each other, the coach (the algorithm) cuts their batons in half.
- The Swap: The coach gives the first half of Runner A's baton and the first half of Runner B's baton to Runner A, and gives the two second halves to Runner B.
- The Magic: Now each runner carries a mix of both brains' information. Each can focus on a specific job (separating the layers) while still having access to the other's knowledge.
- The Benefit: This "reorganization" allows the AI to separate the reflection from the image much more efficiently, using less computing power than previous methods while getting better results.

4. The Final Result

By combining the Smart Filter (which gives a great starting guess) with the Double-Brain Team (which efficiently mixes that guess with general knowledge), the DPIT system achieves State-of-the-Art results.

It's leaner: It uses fewer parameters and fewer calculations (FLOPs) than the competitors it is compared against.
It's clearer: It removes reflections better, leaving behind sharp details of the real scene without blurring or leaving ghostly artifacts.
It's versatile: It works on everything from photos of windows in a forest to reflections on a coffee mug in a store.

In summary: The paper teaches us that to fix a messy photo, you don't need to rebuild the whole world from scratch. You just need a smart way to tweak the existing pixels and a clever way to let two different types of AI "brains" share their strengths without getting in each other's way.

1. Problem Statement

Single image reflection removal (SIRR) aims to recover the transmission layer (the scene behind glass) from a single mixed image containing both transmission and reflection layers.

Core Challenge: The problem is ill-posed because a single mixed image contains limited information, making it difficult to accurately separate the layers without residual artifacts, color distortions, or incomplete removal.
Limitations of Existing Methods:
- Coarse Guidance: Current methods rely on "general priors" (from pre-trained models like VGG or Swin) or "task-specific priors" (text prompts, reflection estimation). These provide only coarse-grained perception of the transmission content, limiting restoration quality.
- Efficiency vs. Accuracy Trade-off: Using high-performance networks to generate transmission priors is computationally expensive and inflexible. Conversely, lightweight networks often fail to provide accurate guidance.
- Complex Interaction: Existing dual-stream interaction mechanisms (e.g., concatenation-based attention) often incur high computational overhead while failing to fully exploit the complementarity between different types of priors.

2. Methodology

The authors propose DPIT (Dual-Prior Interaction Transformer), a framework that integrates a fine-grained transmission prior with a general prior via a novel interaction mechanism.

A. Local Linear Correction Network (LLCN) for Transmission Prior

Instead of directly regenerating the transmission layer (pixel generation), the authors propose a Local Linear Correction Model (LLCM) based on pixel selection.

Formulation: The transmission prior $\hat{T}_{prior}$ is modeled as a linear transformation of the input mixed image $I$ :
$\hat{T}_{prior} = sI + b$
where the scaling map $s$ and the bias map $b$ are predicted by the network for every pixel and every channel, so the product $sI$ is element-wise.
Architecture:
- Uses a pre-trained ConvNeXt-Base as the backbone to extract deep semantic features.
- Two parallel decoders generate the scaling map ( $s$ ) and bias map ( $b$ ).
- Advantage: By learning transformation parameters rather than generating pixels from scratch, LLCN achieves high-quality prior generation with significantly fewer parameters and lower computational cost.
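
The local linear correction above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `scale` and `bias` arrays stand in for the outputs of the two decoders, their values are made up for illustration, and the final clipping to [0, 1] is an added assumption that the summary does not mention.

```python
import numpy as np

def local_linear_correction(image, scale, bias):
    """Apply T_prior = s * I + b element-wise.

    All three arrays share the shape (height, width, channels), so every
    pixel and channel gets its own scale and bias. The clip to [0, 1] is
    an assumption made for this sketch, not something the source states.
    """
    if not (image.shape == scale.shape == bias.shape):
        raise ValueError("image, scale and bias must share one shape")
    return np.clip(scale * image + bias, 0.0, 1.0)

# Toy 2x2 RGB mixed image with values in [0, 1].
mixed = np.full((2, 2, 3), 0.5)

# Hypothetical decoder outputs: keep 80% of every pixel ...
scale = np.full((2, 2, 3), 0.8)
# ... and darken only the top-left pixel, as if a reflection sat there.
bias = np.zeros((2, 2, 3))
bias[0, 0, :] = -0.1

prior = local_linear_correction(mixed, scale, bias)
```

Here the top-left pixel becomes 0.8 × 0.5 − 0.1 = 0.3 and the others 0.8 × 0.5 = 0.4. A global linear model would use a single scale and bias for the whole image, which could not treat the reflected region differently from the rest.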

B. Dual-Stream Channel Reorganization Transformer (DSCRT)

This module fuses the General Prior (extracted via a pre-trained Swin Transformer) and the Transmission Prior (from LLCN) to perform layer separation.

Dual-Stream Channel Reorganization Attention Block (DSCRAB):
- Channel Reorganization: The input features from the two streams (Left/General, Right/Transmission) are split along the channel dimension. The first half of channels from both streams are concatenated to form a Generation Stream, and the second halves form an Exchange Stream.
- Dual-Stream Attention:
  - Intra-stream Self-Attention: Computes attention within the Generation Stream to capture long-range dependencies.
  - Cross-stream Attention: Uses the Generation Stream as the Query and the Exchange Stream as Key/Value. This leverages the complementarity of heterogeneous features (General vs. Transmission) and the exclusivity of layer separation objectives.
- Efficiency: This design simplifies the attention target, allowing effective feature separation and complementation without the heavy computational cost of full dual-stream self-attention mechanisms used in previous works (like DSIT).
Hierarchical Fusion: The network performs same-layer fusion (using DSCRAB) and cross-layer fusion (propagating features from deeper to shallower layers) to refine the transmission ( $\hat{T}$ ), reflection ( $\hat{R}$ ), and a learnable residual term ( $\hat{\Phi}$ ).
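
The channel reorganization and the two attention paths described for DSCRAB can be sketched roughly as follows. This is a simplified illustration under several assumptions: features are flattened to (tokens, channels) arrays, attention is single-head with no learned query/key/value projections, and the helper names are hypothetical. How the block combines the two attention outputs is not specified in this summary, so the sketch stops at computing them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention over (tokens, channels) arrays."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def reorganize_channels(general, transmission):
    """Split both streams in half along channels and regroup the halves.

    First halves  -> generation stream
    Second halves -> exchange stream
    """
    half = general.shape[-1] // 2
    generation = np.concatenate(
        [general[:, :half], transmission[:, :half]], axis=-1)
    exchange = np.concatenate(
        [general[:, half:], transmission[:, half:]], axis=-1)
    return generation, exchange

rng = np.random.default_rng(0)
tokens, channels = 16, 8
general = rng.standard_normal((tokens, channels))       # general-prior stream
transmission = rng.standard_normal((tokens, channels))  # transmission-prior stream

generation, exchange = reorganize_channels(general, transmission)

# Intra-stream self-attention: the generation stream attends to itself.
intra = attention(generation, generation, generation)
# Cross-stream attention: generation is the query, exchange is key/value.
cross = attention(generation, exchange, exchange)
```

Both reorganized streams keep the original channel width, so each attention call works on a single stream of that width rather than on a concatenation of both full streams.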

C. Loss Functions

The model is optimized using a weighted combination of:

Pixel Reconstruction Loss ( $L_{pix}$ ): MSE between predicted and ground truth layers.
Gradient Reconstruction Loss ( $L_{grad}$ ): L1 loss on gradients to preserve structural integrity.
Perceptual Loss ( $L_{per}$ ): VGG-19 feature matching for perceptual quality.
Reconstruction Consistency Loss ( $L_{rec}$ ): Ensures $I \approx \hat{T} + \hat{R} + \hat{\Phi}(\hat{T}, \hat{R})$ , capturing non-linear residuals.
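
As a rough illustration of how these terms combine, the sketch below implements the pixel, gradient, and reconstruction consistency terms in NumPy. Several details are assumptions, since this summary does not give them: the weights `w_pix`, `w_grad`, and `w_rec` are placeholders, the gradient term is applied to the transmission layer only, and the reconstruction term uses a mean squared error. The perceptual term is omitted because it needs a pre-trained VGG-19.

```python
import numpy as np

def pixel_loss(pred, target):
    """Mean squared error between a predicted and a ground-truth layer."""
    return float(np.mean((pred - target) ** 2))

def image_gradients(img):
    """Forward differences along width and height of an (H, W, C) image."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return dx, dy

def gradient_loss(pred, target):
    """L1 distance between the image gradients of two layers."""
    pdx, pdy = image_gradients(pred)
    tdx, tdy = image_gradients(target)
    return float(np.mean(np.abs(pdx - tdx)) + np.mean(np.abs(pdy - tdy)))

def reconstruction_loss(mixed, t_hat, r_hat, phi_hat):
    """Penalize deviation from I = T + R + Phi (MSE is an assumption)."""
    return float(np.mean((mixed - (t_hat + r_hat + phi_hat)) ** 2))

def total_loss(mixed, t_hat, r_hat, phi_hat, t_gt, r_gt,
               w_pix=1.0, w_grad=1.0, w_rec=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    pix = pixel_loss(t_hat, t_gt) + pixel_loss(r_hat, r_gt)
    grad = gradient_loss(t_hat, t_gt)
    rec = reconstruction_loss(mixed, t_hat, r_hat, phi_hat)
    return w_pix * pix + w_grad * grad + w_rec * rec

# Toy data: a purely additive mixture with no residual.
rng = np.random.default_rng(0)
t_gt = 0.7 * rng.random((4, 4, 3))
r_gt = 0.3 * rng.random((4, 4, 3))
mixed = t_gt + r_gt
no_residual = np.zeros_like(mixed)

perfect = total_loss(mixed, t_gt, r_gt, no_residual, t_gt, r_gt)
degraded = total_loss(mixed, t_gt + 0.1, r_gt, no_residual, t_gt, r_gt)
```

With a perfect prediction every term vanishes, so `perfect` is zero, while shifting the predicted transmission layer by 0.1 raises both the pixel and the reconstruction terms.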

3. Key Contributions

DPIT Framework: A novel dual-prior interaction approach that successfully integrates fine-grained transmission priors with general priors, achieving State-of-the-Art (SOTA) performance.
LLCN & LLCM: Introduction of a lightweight transmission prior generator based on a local linear correction model ($\hat{T}_{prior} = sI + b$). This shifts the paradigm from "pixel generation" to "pixel selection," achieving superior performance under strict parameter budgets.
DSCRAB Mechanism: A new attention mechanism that reorganizes dual-stream features at the channel level. It effectively exploits feature complementarity and layer separation exclusivity, achieving high performance with significantly reduced computational complexity compared to existing attention-based methods.

4. Experimental Results

The method was evaluated on five real-world benchmark datasets: Real20, Objects, Postcard, Wild, and Nature.

Quantitative Performance:
- Average PSNR/SSIM: DPIT achieved an average of 27.21 dB PSNR and 0.924 SSIM, outperforming the compared methods (e.g., DSIT, RDNet, DSRNet).
- Efficiency: DPIT uses 131.54M trainable parameters and 191.35G FLOPs.
  - Compared to RDNet (315.89M params), DPIT uses 41.6% of the parameters while gaining 0.49 dB in PSNR.
  - Compared to DSIT (233.09G FLOPs), DPIT reduces computational cost by 17.9% while improving PSNR by 0.50 dB.
Qualitative Performance:
- Visual comparisons show DPIT removes reflections more completely (especially in complex scenes like bridges and night restaurants) while preserving fine textures and structural details better than competitors.
- It also demonstrates superior capability in separating the reflection layer itself, not just the transmission layer.
Ablation Studies:
- LLCN: The local linear correction model ($\hat{T}_{prior} = sI + b$) outperformed global linear models and direct pixel generation methods by a significant margin (e.g., +1.63 dB over global linear).
- DSCRAB: The proposed attention block outperformed other interaction modules (MLP, MuGI, DAIB) in both accuracy and efficiency.
- Prior Integration: Adding the transmission prior to any interaction module consistently improved performance (gains of 0.36–1.42 dB), validating the importance of fine-grained guidance.
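
The efficiency percentages quoted under Quantitative Performance follow directly from the reported parameter and FLOP counts. The short check below reproduces them; the variable names are chosen here for illustration.

```python
# Figures as reported in the summary above.
dpit_params_m = 131.54   # DPIT trainable parameters, in millions
rdnet_params_m = 315.89  # RDNet parameters, in millions
dpit_flops_g = 191.35    # DPIT FLOPs, in billions
dsit_flops_g = 233.09    # DSIT FLOPs, in billions

# Share of RDNet's parameter count that DPIT uses.
param_ratio = dpit_params_m / rdnet_params_m
# Relative FLOPs saving of DPIT with respect to DSIT.
flops_reduction = (dsit_flops_g - dpit_flops_g) / dsit_flops_g

print(f"DPIT / RDNet parameters: {param_ratio:.1%}")      # 41.6%
print(f"FLOPs reduction vs. DSIT: {flops_reduction:.1%}")  # 17.9%
```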

5. Significance

This paper addresses a critical bottleneck in single image reflection removal: the lack of fine-grained guidance in existing prior-based methods.

Paradigm Shift: It challenges the standard "end-to-end pixel generation" approach for prior modeling, proposing a more efficient "parameter-based selection" strategy (LLCM).
Efficiency-Accuracy Balance: It demonstrates that high-performance restoration does not necessarily require massive computational resources. By optimizing the attention mechanism (DSCRAB) and the prior generation strategy (LLCN), DPIT achieves SOTA results with fewer parameters than RDNet and fewer FLOPs than DSIT.
Generalizability: The framework of combining task-specific priors with general priors via efficient interaction mechanisms offers a promising direction for other low-level vision tasks involving image decomposition or restoration.

