Imagine you are trying to solve a giant, complex jigsaw puzzle, but someone has thrown away 90% of the pieces. All you have left are a few scattered pieces, and you need to reconstruct the entire picture perfectly.
In the world of cameras, this is exactly what Multispectral Demosaicing is.
The Problem: The "Broken" Camera
Standard cameras (like the one in your phone) take pictures using three colors: Red, Green, and Blue. But special cameras used in surgery or self-driving cars need to see many more colors (wavelengths) to spot tumors or detect ice on the road.
To do this, these cameras place a special mosaic-like filter (a "spectral filter array") over the sensor, so that each pixel sees only one specific color.
- The Result: The raw image coming out of the camera looks like a patchwork quilt. It's missing 90% of the color information for every single pixel.
- The Goal: We need an algorithm to "guess" the missing colors and fill in the gaps to create a sharp, full-color image.
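The "patchwork" measurement above can be sketched as a masking operation. This is a minimal toy model (the 4×4 tile layout and raster-order band assignment are assumptions for illustration; real spectral filter arrays vary):

```python
import numpy as np

def mosaic_mask(h, w, n_bands, tile=4):
    """Binary mask for a toy spectral filter array (SFA).

    Each pixel in a repeating tile x tile cell passes exactly one of
    n_bands spectral bands. The raster-order band layout here is a
    hypothetical choice for illustration.
    """
    cell = np.arange(tile * tile).reshape(tile, tile) % n_bands
    band_map = np.tile(cell, (h // tile + 1, w // tile + 1))[:h, :w]
    return np.stack([(band_map == b).astype(float) for b in range(n_bands)])

# A 16-band scene measured through the mosaic: each pixel records only
# 1 of its 16 band values, so most of the data is simply never captured.
x = np.random.rand(16, 64, 64)    # hypothetical full spectral image
M = mosaic_mask(64, 64, n_bands=16)
y = M * x                         # the raw "patchwork" measurement
missing = 1.0 - M.mean()          # fraction of values never recorded
```

Demosaicing is the inverse problem: recover `x` given only `y` and knowledge of the mask `M`.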
The Old Ways: Why They Failed
- The "Blender" Approach (Classical Methods): Imagine trying to guess the missing colors by averaging the neighboring pixels. It's fast, but the result is blurry. Fine details, like tiny blood vessels in brain surgery, get lost in the blur.
- The "Teacher" Approach (Supervised Learning): This is like training a student by showing them thousands of matched "before" (raw patchwork) and "after" (perfect) pictures. The student learns well... but only if you have the "after" pictures to begin with.
- The Catch: In real life (inside a human body, or on a moving car), capturing those perfect "after" pictures is impossible or takes days. It's like trying to teach a chef to cook the perfect steak using a photo of the finished dish, when nobody has ever cooked one. You're stuck in a "chicken-and-egg" problem.
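The "Blender" approach can be made concrete with a toy mask-aware box filter. This is a sketch of the general idea, not any specific classical algorithm: each missing pixel takes the average of the measured pixels in a small window, which inevitably smears sharp detail.

```python
import numpy as np

def interpolate_band(measured, mask, k=5):
    """Fill one sparsely-sampled band by averaging observed neighbours.

    measured: (H, W) array, zero where nothing was recorded.
    mask:     (H, W) boolean array, True where a value was recorded.
    Each output pixel is the mean of the measured values inside a
    k x k window (a mask-normalized box filter).
    """
    h, w = measured.shape
    pad = k // 2
    m = np.pad(measured, pad)
    v = np.pad(mask.astype(float), pad)
    out = np.zeros_like(measured)
    for i in range(h):
        for j in range(w):
            win_m = m[i:i + k, j:j + k]
            win_v = v[i:i + k, j:j + k]
            out[i, j] = win_m.sum() / max(win_v.sum(), 1.0)
    return out

# Sample a flat region on a sparse grid, then fill in the gaps.
mask = np.zeros((13, 13), dtype=bool)
mask[::4, ::4] = True                  # one sample every 4 pixels
measured = np.where(mask, 3.0, 0.0)
filled = interpolate_band(measured, mask)
```

On a flat region this works perfectly; on edges and textures, the same averaging is exactly what produces the blur described above.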
The New Solution: PEFD (The "Perspective" Trick)
The authors, Andrew Wang and Mike Davies, came up with a clever way to teach the computer without needing the perfect "after" pictures. They call their method PEFD.
Here is how it works, using two main tricks:
Trick 1: The "Moving Camera" Metaphor
Imagine you are taking a photo of a tree.
- If you move the camera slightly to the left, the tree shifts.
- If you tilt the camera up, the tree looks different because of perspective (the top looks smaller than the bottom).
The authors realized that nature is consistent. Even if you tilt or rotate your camera, the tree is still the same tree. The math behind this is called Perspective-Equivariance.
Instead of just looking at the blurry patchwork image once, the computer pretends to move the camera around (tilting, rotating, shifting). It asks: "If I tilt the camera, how should the missing pieces change to still look like a real tree?"
By forcing the computer to be consistent with these "imaginary camera moves," it can figure out the missing details that were hidden in the gaps. It's like solving the puzzle by realizing that the missing pieces must fit the shape of the tree, no matter how you look at it.
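This consistency check can be sketched as an "equivariant imaging"-style training loss. In the toy version below, a circular shift stands in for the paper's perspective (homography) transforms, and `f` and `A` are placeholders for the reconstruction network and the mosaic measurement operator; the real method differs in both respects.

```python
import numpy as np

def equivariance_loss(f, A, y, shift):
    """Toy equivariance-consistency loss.

    f:     reconstruction function (stand-in for the trained network)
    A:     measurement operator (stand-in for the mosaic mask)
    shift: a simple translation standing in for perspective transforms

    If f is consistent with the "moving camera", then transforming a
    reconstruction, re-measuring it, and reconstructing again should
    give back the transformed image.
    """
    x_hat = f(y)                                 # first reconstruction
    x_t = np.roll(x_hat, shift, axis=(-2, -1))   # "move the camera"
    y_t = A(x_t)                                 # re-measure the moved scene
    x_tt = f(y_t)                                # reconstruct again
    return float(np.mean((x_tt - x_t) ** 2))     # penalize inconsistency
```

Minimizing this loss over many random transforms is what forces the network's guesses for the missing pixels to agree with the scene, no matter how the camera is "moved".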
Trick 2: The "Expert Intern" (Fine-Tuning)
Usually, when you try to solve a puzzle from scratch, you start with zero knowledge. That takes forever and often fails.
The authors started with a pre-trained "Foundation Model." Think of this as a super-smart intern who has already seen millions of photos of the world. They know what a car, a brain, or a leaf usually looks like.
- The Problem: This intern only knows how to handle standard 3-color photos, not the 16-color "patchwork" photos.
- The Fix: Instead of firing the intern and hiring a new one, they fine-tuned the expert. They kept the intern's vast knowledge of the world but taught them specifically how to fill in the missing patches of this new, weird puzzle.
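One way to picture this fine-tuning, as a toy linear sketch rather than the authors' actual architecture: keep a pretrained 3-channel "backbone" frozen, and train only small adapter layers that translate between the 16-band data and the backbone's RGB world.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "foundation model": a fixed 3-channel -> 3-channel operator,
# a stand-in for a pretrained RGB restoration network. Never updated.
W_backbone = rng.standard_normal((3, 3))

def backbone(x3):
    """Apply the frozen pretrained model to a (3, H, W) image."""
    return np.einsum('oc,chw->ohw', W_backbone, x3)

# Trainable adapters: project 16 spectral bands into the backbone's
# 3-channel world and back out again. Only these get fine-tuned.
W_in = rng.standard_normal((3, 16)) * 0.1    # 16 bands -> 3 channels
W_out = rng.standard_normal((16, 3)) * 0.1   # 3 channels -> 16 bands

def adapted_model(x16):
    """Run a (16, H, W) spectral image through the adapted pipeline."""
    x3 = np.einsum('oc,chw->ohw', W_in, x16)
    return np.einsum('oc,chw->ohw', W_out, backbone(x3))

trainable_params = W_in.size + W_out.size    # only the small adapters
```

The point of the sketch: the "intern's" world knowledge (the backbone) is reused untouched, while only a small number of new parameters must be learned for the 16-band puzzle.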
The Result: Magic Without Magic
By combining the "Moving Camera" trick with the "Expert Intern," PEFD can:
- See the Invisible: It recovers sharp details (like tiny blood vessels) that other methods blur out.
- Keep the Colors Right: It doesn't just guess the shape; it guesses the correct spectral colors, which is vital for medical diagnosis.
- Work Without a Teacher: It learns entirely from the blurry, broken images alone, without needing expensive, perfect ground-truth data.
The Bottom Line
Think of PEFD as a super-smart detective who can reconstruct a crime scene from a few blurry, scattered clues. Instead of needing a photo of the crime scene to learn how to solve it, the detective uses their knowledge of how the world works (physics and geometry) and the fact that the scene looks consistent from different angles to fill in the blanks.
This allows surgeons to see tumors clearly and self-driving cars to "see" better in bad weather, all without needing impossible-to-get reference photos.