Adaptive Language-Aware Image Reflection Removal Network

This paper proposes ALANet, an adaptive language-aware network that removes complex image reflections. It combines filtering and optimization strategies to mitigate the negative impact of inaccurate machine-generated language descriptions, and introduces a new CRLAV dataset for evaluation.

Siyan Fang, Yuntao Wang, Jinpu Zhang, Ziwen Li, Yuehuan Wang

Published 2026-03-09

Imagine you are trying to take a photo of a beautiful garden through a dirty, reflective window. You want to see the flowers (the Transmission Layer), but the glass is showing you a reflection of the street outside (the Reflection Layer). The result is a messy, confusing picture where the flowers and the street are blended together.

For a long time, computers have struggled to separate these two layers, especially when the reflection is strong or the scene is complex.

This paper introduces a new AI system called ALANet (Adaptive Language-Aware Network) that solves this problem by using language as a helper, but with a very clever twist: it doesn't panic if the helper lies.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Confused Assistant"

Imagine you ask a friend to help you clean the window. You say, "Look for the red flowers."

  • Scenario A (Perfect): Your friend sees the red flowers and helps you clean around them perfectly.
  • Scenario B (The Problem): Your friend is looking at the reflection and thinks the reflection is the flower. They say, "I see red flowers on the street!" and try to clean the street instead of the window.

Previous AI models were like that friend. If you gave them a description of the image (e.g., "There is a car and a tree"), they would blindly trust it. But because the reflection messes up the AI's ability to "see" the image, the description it generates is often wrong. If the AI follows a wrong description, it makes the photo worse than if it had no help at all.

2. The Solution: The "Smart Detective" (ALANet)

The authors built ALANet to be a smart detective that doesn't just blindly follow instructions. It uses two main strategies: Filtering and Optimization.

Strategy A: The "Skeptic Filter" (LCAM)

Think of this as a debate inside the AI's brain.

  • Team Visual: The AI looks at the pixels and says, "I see a car here."
  • Team Language: The AI reads the text and says, "The text says there is a tree here."

If the text is accurate, Team Language wins, and the AI focuses on the tree. But if the text is wrong (e.g., it says "tree" but the pixels clearly show a "car"), the Filtering Strategy kicks in. It acts like a referee, saying, "Wait, the visual evidence is stronger here. Let's ignore the text for this part and trust what we see."

  • The Metaphor: It's like a GPS that says, "Turn left," but you see a giant wall blocking the road. A smart driver (ALANet) ignores the GPS and stops, rather than crashing into the wall.
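The "skeptic filter" idea can be sketched as a confidence gate: measure how well the language feature agrees with the visual feature, and scale the language's contribution by that agreement. This is a minimal toy sketch of the concept, not the paper's actual LCAM code; the function names and the cosine-similarity gate are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def gated_fusion(visual_feat, lang_feat):
    """Down-weight the language feature when it disagrees with the visuals.

    The gate is the clipped cosine similarity between the two features:
    agreement -> gate near 1 (trust the text), conflict -> gate near 0
    (fall back on the pixels alone).
    """
    gate = max(cosine(visual_feat, lang_feat), 0.0)  # clip negative similarity to 0
    return visual_feat + gate * lang_feat, gate

v = np.array([1.0, 0.0, 0.0])
# Matching description: the gate stays high and the text contributes.
fused, g = gated_fusion(v, np.array([0.9, 0.1, 0.0]))
# Contradictory description: the gate collapses and the text is ignored.
fused_bad, g_bad = gated_fusion(v, np.array([-1.0, 0.0, 0.0]))
```

In the second call the fused feature falls back to the pure visual feature, which is exactly the "ignore the GPS, trust your eyes" behavior described above.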

Strategy B: The "Translator" (ALCM)

Sometimes the language isn't totally wrong, just a bit "off" or vague.

  • The Metaphor: Imagine the language is a rough sketch, and the image is a high-definition photo. The Optimization Strategy acts like a translator who takes the rough sketch and tweaks it to match the photo perfectly. It adjusts the language so it fits the visual reality, ensuring the AI doesn't get confused by slight mismatches.
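One simple way to realize this "translator" idea is to let the language feature attend over the visual features and then blend the attended visual summary back in, so the text is nudged toward what the pixels actually show. The sketch below is a toy illustration under that assumption; `refine_language` and the blend factor `alpha` are hypothetical names, not the paper's ALCM API.

```python
import numpy as np

def refine_language(lang_feat, visual_feats, alpha=0.5):
    """Nudge a vague language feature toward the visual evidence.

    Uses the language feature as a query over the visual features
    (softmax attention), then blends the attended visual summary back
    into the language feature, so small text/image mismatches are
    corrected rather than propagated.
    """
    scores = visual_feats @ lang_feat        # (N,) similarity per image region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    visual_summary = weights @ visual_feats  # what the image actually shows
    return (1 - alpha) * lang_feat + alpha * visual_summary

visual = np.array([[1.0, 0.0], [0.0, 1.0]])  # two image regions
rough = np.array([0.8, 0.1])                 # a slightly "off" description
refined = refine_language(rough, visual)
```

After refinement, the language feature sits between the rough sketch and the visual summary, which is the "tweak the sketch to match the photo" behavior from the metaphor.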

Strategy C: The "Spotlight" (LSCA)

Once the AI trusts the language, it uses it like a flashlight.

  • If the text says, "There is a yellow pillar," the AI uses that clue to shine a spotlight specifically on the yellow pillar in the image. It then separates the pillar (the real object) from the reflection of the pillar. This helps the AI untangle complex scenes where everything is mixed together.
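The "spotlight" can be pictured as language-guided spatial attention: score every pixel's feature against the text embedding and normalize the scores into a soft mask that peaks on the described object. This is one plausible reading of LSCA, sketched as a toy, not the paper's implementation.

```python
import numpy as np

def language_spotlight(pixel_feats, text_feat):
    """Turn a text embedding into a soft spatial mask over the image.

    Each pixel's feature is scored against the text embedding; a softmax
    over all positions yields a "spotlight" that concentrates where the
    described object (e.g. the yellow pillar) actually lives.
    """
    h, w, c = pixel_feats.shape
    scores = pixel_feats.reshape(-1, c) @ text_feat  # one score per pixel
    mask = np.exp(scores - scores.max())
    mask /= mask.sum()                               # softmax over positions
    return mask.reshape(h, w)

# 2x2 feature map where only the top-left pixel matches the text.
feats = np.zeros((2, 2, 3))
feats[0, 0] = [1.0, 1.0, 1.0]
mask = language_spotlight(feats, np.array([1.0, 1.0, 1.0]))
# The spotlight concentrates on the matching pixel.
```

The resulting mask can then weight the visual features so the network focuses on the real object rather than its reflection.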

3. The New Training Ground: The "CRLAV" Dataset

To teach this new AI, the researchers couldn't just use normal photos. They needed a training ground that simulated real-world messiness.

  • They created a new dataset called CRLAV.
  • The Twist: They took real photos and paired them with descriptions that were intentionally wrong, confused, or incomplete.
  • Why? To teach the AI that sometimes the "helper" (the language) is unreliable, and it needs to learn how to handle that without breaking.
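A tiny illustration of the "intentionally wrong description" idea: start from a correct caption and randomly swap words for distractors, so the model sees unreliable text during training. This is purely hypothetical pseudof the concept, not the CRLAV construction pipeline; `corrupt_caption`, `vocab`, and `p_swap` are made-up names.

```python
import random

def corrupt_caption(caption, vocab, p_swap=0.5, rng=None):
    """Randomly replace caption words with distractors from a vocabulary,
    so a model trained on the result learns not to trust text blindly."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible example
    out = []
    for word in caption.split():
        if rng.random() < p_swap:
            out.append(rng.choice([v for v in vocab if v != word]))
        else:
            out.append(word)
    return " ".join(out)

clean = "a car and a tree"
noisy = corrupt_caption(clean, vocab=["car", "tree", "wall", "pillar"])
```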

4. The Results

When they tested ALANet against other top AI models:

  • With perfect language: It did a great job, just like the others.
  • With bad language: This is where it shone. While other models degraded badly and produced garbage images when given wrong descriptions, ALANet kept working. It filtered out the bad advice and still managed to remove the reflections.
  • The Verdict: It is the first system that can handle the "messy reality" of the world, where descriptions aren't always perfect.

Summary

Think of ALANet as a smart window cleaner who brings a friend with a checklist.

  • Old cleaners would follow the checklist blindly, even if the friend was looking at the wrong window.
  • ALANet checks the checklist against the actual window. If the friend is wrong, ALANet says, "Thanks, but I see a car, not a tree," and cleans the car. If the friend is a bit vague, ALANet asks for clarification.

This makes the technology much more robust and ready for real-world use, where perfect descriptions are rare.