Multi-illuminant Color Constancy via Multi-scale Illuminant Estimation and Fusion

This paper proposes a multi-scale deep learning framework for multi-illuminant color constancy. A tri-branch convolutional network and an attentional fusion module represent the illuminant map as a linear combination of multi-grained components, effectively handling the impact of image scale and achieving state-of-the-art performance.

Hang Luo, Rongwei Li, Jinxing Liang

Published 2026-03-02
📖 4 min read · ☕ Coffee break read

🌅 The Problem: The "Bad Lighting" Camera

Imagine you are taking a photo of a beautiful red apple.

  • If you take the photo under a yellow streetlamp, the apple looks orange.
  • If you take it under a blue twilight sky, the apple looks purple.

Your human brain is amazing. Even if the light changes, your brain automatically knows, "That's still a red apple," and corrects the color in your mind. This is called Color Constancy.

However, cameras are not that smart. They just record the light hitting the sensor. If the light is weird, the whole photo looks weird (too yellow, too blue, or too green).

🏙️ The Complication: A Room with Many Lights

Most previous computer programs tried to fix this by assuming the whole room has one single light source (like one big ceiling bulb). They would calculate the color of that one bulb and fix the whole image.
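To make the one-bulb assumption concrete, here is a minimal sketch of the classic single-illuminant ("von Kries"-style) correction such programs apply. The function name and numpy setup are illustrative, not from the paper; the point is that one global gain per color channel is applied identically to every pixel.

```python
import numpy as np

def correct_single_illuminant(image, illuminant):
    """Classic single-illuminant ("one big bulb") correction.

    image:      H x W x 3 float array, linear RGB in [0, 1]
    illuminant: length-3 array, the estimated RGB color of the one light
    """
    illuminant = np.asarray(illuminant, dtype=np.float64)
    # One global gain per channel (anchored to green so brightness stays
    # roughly stable), applied identically to every pixel in the photo.
    gain = illuminant[1] / illuminant
    return np.clip(image * gain, 0.0, 1.0)

# Example: undo a yellowish cast (the light has more red/green than blue).
image = np.random.rand(4, 4, 3)
corrected = correct_single_illuminant(image, illuminant=[1.0, 0.9, 0.6])
```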

But real life is messy! Imagine a room with:

  • A warm yellow lamp in the corner.
  • A cool blue window on the left.
  • A bright white spotlight in the center.

If you try to fix the whole room with just one color correction, the apple near the window will look wrong, and the apple near the lamp will look wrong. This is the Multi-Illuminant problem. The camera needs to know that different parts of the photo need different fixes.
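In code, the multi-illuminant fix swaps the single global gain for a per-pixel illuminant map: every pixel gets its own correction. A minimal sketch, again with illustrative names, assuming we somehow already know the light color at each pixel (estimating that map is exactly what the rest of the paper is about):

```python
import numpy as np

def correct_with_illuminant_map(image, illuminant_map):
    """Per-pixel correction for mixed lighting.

    image:          H x W x 3 linear RGB
    illuminant_map: H x W x 3 estimated light color at EVERY pixel
    """
    # Each pixel gets its own gain, anchored to its own green channel.
    gains = illuminant_map[..., 1:2] / illuminant_map
    return np.clip(image * gains, 0.0, 1.0)
```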

🔍 The Old Way vs. The New Way

The Old Way (Deep Learning):
Previous AI methods looked at the whole picture at a single fixed scale and tried to guess the lighting for every single pixel at once. It's like trying to paint a detailed landscape by looking at it from 100 miles away through a telescope: you might get the big shapes right, but you miss the tiny details.

The New Way (This Paper's Solution):
The authors (Hang Luo, Rongwei Li, and Jinxing Liang) realized that scale matters.

  • If you zoom out (a coarse, small-scale view), you see the big picture: "Okay, the left side is generally blue, and the right side is generally yellow."
  • If you zoom in (a fine, large-scale view), you see the tiny details: "Ah, this specific leaf is catching a tiny reflection of a red car."

They proposed that the perfect lighting map is a mixture of these different views: a linear combination of coarse, medium, and fine illuminant estimates.
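In symbols (our notation, not necessarily the paper's exact formulation), the final illuminant map at each pixel x is a per-pixel weighted mix of the scale-specific estimates:

L_fused(x) = w_coarse(x) · L_coarse(x) + w_medium(x) · L_medium(x) + w_fine(x) · L_fine(x)

where the three weights at each pixel are nonnegative and sum to 1. Who picks the weights? That's the "smart manager" coming up below.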

🛠️ The Solution: The "Three-Chef Kitchen"

The authors built a system with three parallel chefs (a "Tri-branch network") working in a kitchen to fix the photo. (A rough code sketch of the three branches follows the list.)

  1. Chef Small (The Coarse Chef): Looks at a tiny, blurry version of the photo. They are good at seeing the big trends. "The whole left side is dark and blue."
  2. Chef Medium: Looks at a medium-sized version. They see the structure. "The shadow under the table is green."
  3. Chef Large (The Fine Chef): Looks at the full, high-definition photo. They see the tiny details. "That specific pixel on the apple is reflecting a red light."
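Here is a minimal PyTorch-style sketch of the tri-branch idea. Assumptions worth flagging: the `Branch` class, the layer sizes, and the downsampling factors (0.25, 0.5, 1.0) are illustrative stand-ins, not the paper's actual architecture; the real network is deeper and more carefully designed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One 'chef': sees the image at one scale, predicts an RGB illuminant map."""
    def __init__(self, scale):
        super().__init__()
        self.scale = scale  # 1.0 = full resolution, 0.25 = tiny blurry version
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Shrink the input to this chef's working scale (if not full size).
        xs = x if self.scale == 1.0 else F.interpolate(
            x, scale_factor=self.scale, mode='bilinear', align_corners=False)
        est = self.net(xs)
        # Upsample the estimate back to full resolution so branches can be fused.
        return F.interpolate(est, size=(h, w), mode='bilinear', align_corners=False)

# Chef Small (coarse), Chef Medium, and Chef Large (fine).
branches = nn.ModuleList([Branch(0.25), Branch(0.5), Branch(1.0)])
image = torch.rand(1, 3, 128, 128)
estimates = [b(image) for b in branches]  # three candidate illuminant maps
```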

🤝 The "Smart Manager" (Attentional Fusion)

If you just asked the three chefs to mix their ideas together, it might be a mess. You need a Smart Manager (called the Attentional Illuminant Fusion Module).

  • The Manager looks at every single pixel in the photo.
  • For a pixel in a blurry, big area, the Manager says, "I trust Chef Small more here."
  • For a pixel on a sharp edge, the Manager says, "I trust Chef Large more here."
  • The Manager creates a "weight map" (a voting system) to decide exactly how much to listen to each chef for every single pixel. (A rough sketch of this voting step follows the list.)
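Continuing the sketch above, here is one simple way to implement the manager as per-pixel softmax attention over the three candidate maps. This is a hedged illustration of the voting idea, not the paper's actual Attentional Illuminant Fusion Module:

```python
import torch
import torch.nn as nn

class FusionManager(nn.Module):
    """The 'smart manager': per-pixel softmax weights over the branches."""
    def __init__(self, num_branches=3):
        super().__init__()
        # Looks at the image plus all candidate maps, outputs one score per branch.
        self.attn = nn.Sequential(
            nn.Conv2d(3 + 3 * num_branches, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_branches, 1),
        )

    def forward(self, image, estimates):
        stacked = torch.stack(estimates, dim=1)        # (B, K, 3, H, W)
        feats = torch.cat([image] + estimates, dim=1)  # (B, 3 + 3K, H, W)
        weights = self.attn(feats).softmax(dim=1)      # (B, K, H, W), sums to 1
        # Linear combination: each pixel's final light is a weighted vote.
        fused = (weights.unsqueeze(2) * stacked).sum(dim=1)
        return fused, weights

manager = FusionManager()
# Uses `image` and `estimates` from the tri-branch sketch above.
fused_map, weight_map = manager(image, estimates)
```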

🧪 The Results

They tested this on a massive dataset of photos with mixed lighting.

  • The Score: They measured how close the corrected colors were to the "true" colors, using something called "Mean Angular Error" (sketched in code after this list).
  • The Win: Their method beat all the other top methods. It was like getting a 98% on a test where everyone else got 90%.
  • Visual Proof: When they fixed the photos, the colors looked natural and realistic, whereas other methods left some parts looking weirdly tinted.
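Angular error treats each light color as a direction in RGB space and measures the angle between the estimated and true directions, so brightness differences don't count against you. A minimal numpy sketch (the function name is ours) for the multi-illuminant case, averaged over all pixels:

```python
import numpy as np

def mean_angular_error(estimated, ground_truth):
    """Mean Angular Error in degrees between two H x W x 3 illuminant maps.

    0 degrees = perfect; each pixel's error is the angle between its
    estimated and true RGB light color, regardless of brightness.
    """
    est = estimated.reshape(-1, 3)
    gt = ground_truth.reshape(-1, 3)
    cos = np.sum(est * gt, axis=1) / (
        np.linalg.norm(est, axis=1) * np.linalg.norm(gt, axis=1) + 1e-9)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean()
```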

🚀 In a Nutshell

Instead of trying to guess the lighting for a whole complex scene with one giant guess, this paper says: "Let's look at the scene from three different distances, let three different AI models guess the lighting, and then let a smart manager blend their best guesses together for every single pixel."

This approach allows computers to see the world with the same color-perfect vision that humans have, even in rooms with a dozen different light bulbs.