Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing

Fourier-RWKV is a novel image dehazing framework that achieves state-of-the-art performance with linear computational complexity. It integrates spatial, frequency-domain, and semantic-relation perception mechanisms to effectively model non-uniform haze while remaining fast enough for real-time deployment.

Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang

Published 2026-02-17

Imagine you are trying to take a beautiful photo of a city skyline, but a thick, uneven fog has rolled in. Some parts of the fog are dense and gray, while other parts are thin and wispy. Your goal is to "dehaze" the image—to digitally remove the fog and reveal the crisp, clear city underneath.

This is the challenge of Image Dehazing. For a long time, computers struggled with this because fog isn't just a uniform blanket; it's messy, uneven, and changes from spot to spot.

The paper introduces a new AI model called Fourier-RWKV. Think of it as a "super-smart photo editor" that doesn't just guess how to clean the image but understands the physics of fog and the structure of the picture simultaneously.

Here is how it works, explained through simple analogies:

1. The Problem with Old Methods

  • The "Blind Painter" (CNNs): Early AI models were like painters who only looked at a tiny dot on the canvas at a time. They could fix a small smudge, but they couldn't see the whole picture to understand how the fog connected across the entire image.
  • The "Overworked Librarian" (Transformers): Newer models (Transformers) are like librarians who read every single book in the library to find a connection. They are great at seeing the big picture, but if the library is huge (a high-resolution photo), they get overwhelmed and take forever to finish. They are too slow for real-time use.

2. The Solution: A "Multi-State" Detective

The authors created Fourier-RWKV, which acts like a detective with three different "super-senses" working together. Instead of just looking at the photo, it looks at it in three different ways at once.

Sense 1: The "Shape-Shifter" (Spatial-Form Perception)

  • The Analogy: Imagine trying to clean a window with a rag. If the dirt is in a straight line, you wipe straight. If the dirt is in a weird, jagged shape, you have to twist your wrist and move the rag in a specific way to hit every spot.
  • How it works: Old AI models used a "rigid" wipe (a fixed pattern). This new model uses DQ-Shift, a "shape-shifting" tool. It looks at the fog and instantly changes its shape to fit the uneven patches of haze, ensuring it cleans every nook and cranny without missing anything.
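The idea behind a shape-adapting shift can be sketched in a few lines of numpy. This is a toy illustration only: in the paper the per-pixel offsets are predicted by the network, while here they are handed in as an argument, and `adaptive_shift` is a hypothetical name, not the actual DQ-Shift implementation.

```python
import numpy as np

def adaptive_shift(img, offsets):
    """Mix each pixel with a neighbour chosen per pixel.

    img:     (H, W) grayscale image
    offsets: (H, W, 2) integer per-pixel (dy, dx) shifts -- a stand-in
             for the offsets a real model would learn from the haze.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ny = np.clip(ys + offsets[..., 0], 0, H - 1)  # clamp at the border
    nx = np.clip(xs + offsets[..., 1], 0, W - 1)
    neighbour = img[ny, nx]
    return 0.5 * (img + neighbour)  # simple 50/50 mix with the neighbour

# A "rigid" shift (every pixel looks one step left) is the special case
# the older models were stuck with; adaptive offsets can vary per pixel.
img = np.arange(16.0).reshape(4, 4)
rigid = np.zeros((4, 4, 2), dtype=int)
rigid[..., 1] = -1
out = adaptive_shift(img, rigid)
```

The only difference between the "rigid wipe" and the "shape-shifter" is where `offsets` comes from: a constant pattern versus a value computed from the image itself.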

Sense 2: The "Music Conductor" (Frequency-Domain Perception)

  • The Analogy: Imagine a song. The fog is like a low, rumbling bass note that drowns out the melody. The actual image details (buildings, trees) are the high-pitched instruments.
  • How it works: Most AI looks at the photo as a grid of pixels (like looking at a painting). This model uses Fourier Mix to listen to the "music" of the image. It separates the "bass" (the fog) from the "melody" (the clear image). Because fog mostly lives in the low-frequency "bass" notes, the model can easily identify and remove it while keeping the high-frequency details sharp. This allows it to see the "whole song" (global context) instantly without getting tired.
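The "bass versus melody" separation can be demonstrated with a standard Fourier transform. Below is a hand-crafted low-frequency filter as a sketch of the concept; the paper's Fourier Mix learns its filtering rather than using a fixed mask, and `suppress_low_freq` is an illustrative name, not the paper's operator.

```python
import numpy as np

def suppress_low_freq(img, radius=4, strength=0.5):
    """Dampen low-frequency content (where haze energy concentrates)
    while leaving high-frequency detail untouched.

    A toy hand-crafted filter; a learned model would choose the
    per-frequency weights itself.
    """
    F = np.fft.fftshift(np.fft.fft2(img))      # spectrum, DC in the centre
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    dist = np.hypot(ys - H / 2, xs - W / 2)    # distance from the DC term
    mask = np.where(dist < radius, strength, 1.0)  # shrink only the "bass"
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# A perfectly flat "haze veil" is pure low frequency, so the filter
# attenuates it everywhere at once -- no sliding window needed.
haze = np.full((32, 32), 10.0)
out = suppress_low_freq(haze)
```

Note that one FFT touches every pixel, which is what gives this kind of operator its instant global view of the image.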

Sense 3: The "Translator" (Semantic-Relation Perception)

  • The Analogy: Imagine a construction crew. The "Encoder" team is digging the foundation, and the "Decoder" team is building the roof. If they don't talk to each other, the roof might not fit the foundation, leading to a wobbly house.
  • How it works: In many AI models, the "digging" team and the "building" team get out of sync, causing blurry spots or weird artifacts. This model uses a Semantic Bridge Module (SBM). It acts as a translator, constantly checking in with both teams to make sure they are speaking the same language. It ensures that the details the model is trying to restore match perfectly with the original structure of the image.
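A minimal sketch of the "translator" idea is a gated skip connection: blend the encoder's feature with the decoder's feature, weighted by how well they agree. This is a hypothetical simplification for intuition, not the paper's SBM, and `semantic_bridge` is an invented name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_bridge(enc_feat, dec_feat):
    """Blend encoder and decoder features with an agreement gate.

    Where the two feature maps "speak the same language" (same sign,
    large product), the gate leans toward the encoder's structural
    detail; where they disagree, it leans toward the decoder.
    """
    agreement = sigmoid(enc_feat * dec_feat)   # per-element agreement score
    return agreement * enc_feat + (1.0 - agreement) * dec_feat

# When the encoder is silent and the decoder is active, the gate sits
# at its neutral midpoint and simply averages the two.
enc = np.zeros((2, 2))
dec = np.ones((2, 2))
out = semantic_bridge(enc, dec)
```

The point of the sketch: the bridge is not a plain copy of encoder features into the decoder (an ordinary skip connection) but a data-dependent negotiation between the two.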

3. Why is this a Big Deal?

  • Speed vs. Quality: Usually, you have to choose between a model that is fast but blurry, or one that is slow but perfect. Fourier-RWKV is like a Formula 1 car that drives on dirt roads. It is incredibly fast (linear complexity, meaning its cost grows in step with the number of pixels rather than exploding with the square of that number) but still produces museum-quality results.
  • Real-World Ready: It works amazingly well on real-world photos where the fog is uneven and messy, not just on perfect computer-generated test images.
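The gap between linear and quadratic cost is worth seeing in numbers. The back-of-envelope sketch below counts abstract operations, not the actual model's FLOPs; `pairwise_cost` and `linear_cost` are illustrative stand-ins for attention-style versus RWKV-style mixing.

```python
# Per-image cost of pairwise (attention-style) mixing vs a single-pass
# (RWKV-style) mixer, treating every pixel as one token.

def pairwise_cost(n_tokens):
    """O(n^2): every token compares itself against every other token."""
    return n_tokens * n_tokens

def linear_cost(n_tokens):
    """O(n): each token is processed once per pass."""
    return n_tokens

n = 1024 * 1024                      # a 1024x1024 photo: ~1M tokens
ratio = pairwise_cost(n) // linear_cost(n)
# At this resolution the quadratic mixer does about a million times
# more work per image than the linear one -- and the gap widens as
# the photo grows.
```

Both models slow down on bigger photos; the difference is that the linear one slows down proportionally, while the quadratic one becomes impractical.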

Summary

Fourier-RWKV is a new way for computers to clear up foggy images. Instead of just squinting at pixels, it:

  1. Adapts its shape to clean uneven fog (Shape-Shifter).
  2. Listens to the frequency of the image to separate fog from details (Music Conductor).
  3. Translates instructions between different parts of the AI to keep the image consistent (Translator).

The result is a tool that is fast enough to run on a phone but smart enough to restore a photo that looks like it was taken on a crystal-clear day, even if it was originally taken in a thick storm.
