Physics-Guided VLM Priors for All-Cloud Removal

This paper introduces PhyVLM-CR, a unified framework that combines Vision-Language Model (VLM) semantic priors with physical scattering parameters to remove both thin and thick clouds from optical remote sensing imagery. It requires no explicit cloud-type segmentation and aims for high-fidelity, hallucination-free surface reconstruction.

Liying Xu, Huifang Li, Huanfeng Shen

Published Tue, 10 Ma

Imagine you are looking at a beautiful landscape through a window, but the window is covered in a messy mix of fog and thick, heavy raindrops. Some parts are just a light mist that makes the view blurry, while other parts are so thick you can't see anything at all.

For a long time, scientists trying to "clean" these satellite images (photos of Earth taken from space) had to use two completely different tools:

  1. The "Wiper" for the light mist (to clear up the blur).
  2. The "Painter" for the heavy rain (to guess and paint in what's hidden underneath).

The problem? The line between "mist" and "heavy rain" isn't a sharp edge; it's a messy gradient. When scientists tried to switch from the wiper to the painter, they often made mistakes at the boundary, leaving ugly seams or painting things that didn't exist (like a fake river where a mountain should be).

The New Solution: The "Smart Detective" (PhyVLM-CR)

The authors of this paper, Liying Xu and her team, created a new method called PhyVLM-CR. Think of it as hiring a Smart Detective who knows both the laws of physics and the art of storytelling, but uses them in a very specific, safe way.

Here is how their "Smart Detective" works, broken down into simple steps:

1. The "Imagination" Step (The VLM)

First, they ask a powerful AI (called a Vision-Language Model, or VLM) to look at the cloudy photo and say, "What do you think is under there?"

  • The Analogy: Imagine the AI is a creative writer. If you show it a photo of a cloudy forest, it might write a story describing a forest with trees, a river, and a bird.
  • The Catch: The writer is great at imagination but terrible at facts. It might accidentally paint a dragon in the river or change the color of the trees. If we just used the writer's story, the photo would look fake.

2. The "Reality Check" Step (The Physics)

Instead of letting the writer's story become the final photo, the team uses the writer's story as a hint. They take the writer's ideas and run them through a strict Physics Calculator.

  • The Analogy: Think of the Physics Calculator as a strict editor. The writer says, "There's a dragon!" The editor checks the laws of light and atmosphere and says, "No, that's impossible. The light doesn't bend that way. But, the writer was right about the shape of the trees."
  • The team extracts scattering parameters (how the light bounces off the clouds) and a "Confidence Map" from the writer's guess.
    • High Confidence: "The writer is right here; the light matches reality." -> Keep the real physics.
    • Low Confidence: "The writer is hallucinating; the light doesn't match." -> Ignore the writer's guess.
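The "reality check" above can be made concrete with the classic simplified scattering model used for haze and thin cloud, I = J·t + A·(1 − t), where I is the observed image, J the clear surface, t the transmission (how much light gets through), and A the airlight. The sketch below is our illustration of the idea, not the paper's actual implementation: the function names, the Gaussian confidence score, and the parameter values are all assumptions.

```python
import numpy as np

def recover_surface(I, t, A, t_min=0.1):
    """Invert the simplified scattering model I = J*t + A*(1 - t).

    I : observed cloudy image, values in [0, 1]
    t : transmission map in [0, 1] (1 = clear sky, 0 = opaque cloud)
    A : airlight (scalar or per-channel)
    """
    t_safe = np.maximum(t, t_min)  # avoid dividing by ~0 under thick cloud
    return np.clip((I - A * (1.0 - t_safe)) / t_safe, 0.0, 1.0)

def confidence_map(I, J_vlm, t, A, sigma=0.1):
    """Score the VLM's guess J_vlm by re-rendering it through the physics
    model and comparing against what the satellite actually observed."""
    I_pred = J_vlm * t + A * (1.0 - t)       # what we'd see if J_vlm were true
    residual = np.abs(I - I_pred)
    return np.exp(-(residual / sigma) ** 2)  # ~1: physics agrees; ~0: hallucination
```

In this sketch, a "dragon in the river" produces a large residual between the re-rendered prediction and the real observation, so its confidence drops toward zero and the guess is ignored.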

3. The "Seamless Blend" Step (The Magic Glue)

This is the most clever part. Instead of cutting the image into pieces (one piece for mist, one for rain), the method uses the Confidence Map as a dimmer switch.

  • Where the clouds are thin: The "dimmer" is turned up for the Physics. It cleans the blur but keeps the real colors and details exactly as they are.
  • Where the clouds are thick: The "dimmer" is turned up for "time travel". Since the cloud is too thick to see through, the system borrows a cloud-free photo of the same spot taken on a different day and blends it in.
  • The Result: Because the "dimmer" changes smoothly from 0 to 100, there are no hard lines or seams. The transition from "cleaned mist" to "reconstructed rain" is invisible.
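The "dimmer switch" above is, in spirit, a per-pixel weighted average driven by the confidence map. Here is a minimal sketch of that idea (the function name and the exact weighting are our assumption for illustration, not the paper's code):

```python
import numpy as np

def dimmer_blend(physics_corrected, reference, confidence):
    """Per-pixel 'dimmer switch' blend.

    physics_corrected : image cleaned via the scattering model (thin cloud)
    reference         : clear image of the same spot from another day (thick cloud)
    confidence        : map in [0, 1]; 1 trusts physics, 0 trusts the reference
    """
    # A smooth convex combination: as confidence slides from 1 to 0,
    # the output fades continuously from one source to the other,
    # so there is no hard boundary (and hence no visible seam).
    return confidence * physics_corrected + (1.0 - confidence) * reference
```

Because the weight varies continuously per pixel, the transition between "cleaned mist" and "reconstructed cover" is gradual rather than a cut-and-paste edge.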

Why is this a big deal?

  • No More "Fake" Art: Previous AI methods often hallucinated (made up) fake buildings or trees. This method uses the AI only as a guide, not the final artist, so the result is always grounded in reality.
  • No More "Seams": Old methods had to guess exactly where the cloud changed from thin to thick, often making mistakes. This method flows naturally, like water, handling the messy middle ground perfectly.
  • Better Accuracy: In their tests, this method produced much clearer, more accurate images than traditional methods, preserving the true colors of the land while removing the clouds.

In short: They taught an AI to be a "Creative Assistant" that suggests what might be there, but they forced it to obey the strict "Laws of Physics" to ensure the final picture is real, accurate, and seamless. It's the best of both worlds: human-like imagination guided by scientific truth.