Adaptive Dynamic Dehazing via Instruction-Driven and Task-Feedback Closed-Loop Optimization for Diverse Downstream Task Adaptation

This paper proposes a novel adaptive dynamic dehazing framework that utilizes a closed-loop optimization mechanism combining task performance feedback and text-based instruction guidance to enable real-time, training-free adaptation of dehazing outputs for diverse downstream vision tasks.

Yafei Zhang, Shuaitian Song, Huafeng Li, Shujuan Wang, Yu Liu

Published 2026-03-09

Imagine you are a photographer trying to take a picture on a very foggy day. In the past, photo editors (or "dehazing" software) had one goal: make the picture look as clear and pretty as possible for a human eye. They would wipe away the fog until the image looked sharp.

The Problem:
Here's the catch: a photo that looks beautiful to a human isn't always the best photo for a robot.

  • If a self-driving car needs to see a pedestrian, it doesn't care about the "pretty colors"; it needs the edges of the person to be super sharp so it doesn't hit them.
  • If a security camera is trying to count people, it needs the background to be distinct, not just "clear."
  • If a robot is trying to guess how far away a tree is, it needs specific depth details that a pretty photo might actually blur out.

Old software was like a rigid chef who cooked the same perfect meal for everyone. If you asked for a spicy dish, they couldn't change it without cooking a whole new meal from scratch (retraining the model).

The Solution: The "Smart, Adaptable Chef"
This paper introduces a new system called ADeT-Net. Think of it as a smart, adaptable chef who can change their cooking style instantly based on who is eating and what they want, without needing to go back to culinary school.

Here is how it works, using two main "superpowers":

1. The "Feedback Loop" (The Taste-Tester)

Imagine the chef cooks a dish (removes the fog) and serves it to a "taste-tester" (the downstream task, like the self-driving car).

  • If the car says, "I can't see the stop sign clearly," the chef doesn't just ignore it.
  • The chef immediately tweaks the recipe while cooking. Maybe they add more contrast to the edges or sharpen the colors of the sign.
  • This happens in real-time. The system learns from the robot's reaction and adjusts the image instantly to make the robot happy.
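To make the "taste-tester" idea concrete, here is a minimal sketch of task-feedback adaptation. Everything in it is invented for illustration (the toy `dehaze` knobs, the edge-loving `task_score`, the greedy search) and is not the paper's actual ADeT-Net architecture; it only shows the shape of the loop: adjust the dehazing output until the downstream task's score goes up.

```python
import numpy as np

# Hypothetical sketch -- toy stand-ins for the real dehazing network
# and downstream task, just to illustrate the feedback loop.

def dehaze(image, contrast=1.0, sharpness=0.0):
    """Toy dehazer: boost contrast around the mean, then add a crude
    Laplacian-style sharpening term along the row axis."""
    out = (image - image.mean()) * contrast + image.mean()
    lap = np.zeros_like(out)
    lap[1:-1] = 2 * out[1:-1] - out[:-2] - out[2:]
    return np.clip(out + sharpness * lap, 0.0, 1.0)

def task_score(image):
    """Stand-in for downstream feedback: a 'detector' that rewards
    strong edges (mean gradient magnitude along rows)."""
    return float(np.abs(np.diff(image, axis=0)).mean())

def closed_loop_adapt(image, steps=20, step_size=0.05):
    """Greedy coordinate search: nudge each knob in the direction
    that raises the downstream task's score."""
    params = {"contrast": 1.0, "sharpness": 0.0}
    for _ in range(steps):
        for key in params:
            base = task_score(dehaze(image, **params))
            trial = dict(params)
            trial[key] += step_size
            if task_score(dehaze(image, **trial)) > base:
                params = trial
    return params

# A low-contrast "foggy" patch: a gentle 8x8 brightness ramp.
hazy = np.linspace(0.4, 0.6, 64).reshape(8, 8)
tuned = closed_loop_adapt(hazy)
before = task_score(dehaze(hazy))
after = task_score(dehaze(hazy, **tuned))
```

Note the design choice: the loop never retrains `dehaze` itself; it only tunes its output knobs against live feedback, which is the training-free, real-time flavor the paper claims.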

2. The "Instruction Manual" (The Customer's Order)

Sometimes, the robot doesn't know exactly what it needs, or a human operator wants to give a specific command.

  • You can type a note to the chef: "Hey, I need this image to be great for finding lost dogs," or "Make it perfect for measuring distances."
  • The system reads this text (using a language understanding tool called BERT) and adjusts the image to match that specific request. It's like the chef reading a special order slip and knowing exactly how to plate the food.
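As a rough sketch of the "order slip" idea: the paper uses BERT to embed the instruction, but to keep this example self-contained a toy keyword embedder stands in for it. All names here (`encode_instruction`, `instruction_to_params`, the keyword vocabulary, the specific knob mappings) are made up for illustration.

```python
# Hypothetical sketch: a bag-of-keywords embedding stands in for a
# real BERT sentence embedding; the mapping to knobs is invented.

KEYWORDS = {"edge": 0, "depth": 1, "color": 2}  # toy instruction vocabulary

def encode_instruction(text):
    """Stand-in for BERT: count keyword hits into a 3-d vector."""
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for key, idx in KEYWORDS.items():
            if key in word:
                vec[idx] += 1.0
    return vec

def instruction_to_params(vec):
    """Map the instruction embedding to dehazing knobs: edge-focused
    requests raise sharpness, depth-focused ones raise contrast."""
    return {
        "sharpness": 0.5 * vec[0],
        "contrast": 1.0 + 0.5 * vec[1],
        "saturation": 1.0 + 0.5 * vec[2],
    }

params = instruction_to_params(encode_instruction("sharpen edges for detection"))
```

In the real system, a learned network plays the role of `instruction_to_params`, translating the language embedding into modulation signals for the dehazing features rather than three scalar knobs.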

How They Work Together (The Closed Loop)

The magic happens because these two powers talk to each other in a closed loop:

  1. The Chef makes an initial clear image.
  2. The Taste-Tester (the robot) tries to do its job (detect a car, segment a road).
  3. The Customer (the text instruction) says what they want.
  4. The system combines the Taste-Tester's feedback ("This edge is too blurry for me") with the Customer's order ("Focus on depth") to tweak the image one last time before it's final.
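The four steps above can be sketched as a single fusion update. This is a hypothetical, deliberately tiny version (the `fuse` function, the knob names, and the weighting are all assumptions, not the paper's formulation): start from what the instruction asks for, then shift each knob by the task's feedback signal.

```python
# Hypothetical fusion step: blend the instruction's requested settings
# with a correction driven by downstream task feedback.

def fuse(instruction_params, feedback_signal, weight=0.5):
    """One closed-loop update: instruction sets the target, feedback
    nudges each knob it has an opinion about."""
    return {
        k: instruction_params[k] + weight * feedback_signal.get(k, 0.0)
        for k in instruction_params
    }

instruction = {"contrast": 1.2, "sharpness": 0.4}  # "focus on depth"
feedback = {"sharpness": 0.6}                      # "this edge is too blurry"
final = fuse(instruction, feedback)
```

The point of the closed loop is that neither signal wins outright: the instruction supplies the intent, the feedback supplies the correction, and the final image reflects both.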

Why This is a Big Deal

  • No Re-training: Old methods required training a new, separate model for every single task (one for cars, one for security, one for drones). This new method is a "Swiss Army Knife": you use the same tool for everything, and it adapts on the fly.
  • Real-Time: It doesn't pause to retrain or swap in a new model; it adjusts the image on the fly while the image is being processed.
  • Better Results: In their experiments, this method helped downstream models see better, detect objects more accurately, and estimate distances more precisely than prior methods, all while keeping the images pleasant for humans too.

In a nutshell:
This paper gives foggy-image software a brain and a voice. Instead of just "cleaning" a picture, it asks, "Who is going to use this picture, and what do they need?" and then instantly reshapes the image to be the perfect tool for that specific job.