Imagine you are trying to take a perfect photo of a foggy night street. You have two cameras:
- The Night Vision Camera (Infrared): It sees heat. It can spot a person hiding in the bushes or a car engine running, but the picture looks like a blurry, gray ghost.
- The Regular Camera (Visible): It sees light and color. It can show the texture of the brick wall and the color of the traffic sign, but in the fog, everything is just a white blur.
The Goal: You want to smash these two photos together to get one "Super Photo" that shows the heat and the details. This is called Image Fusion.
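At its most naive, "smashing two photos together" is just a per-pixel average. Here's a deliberately simple sketch (not the paper's method, which is a learned network) assuming grayscale images with values in [0, 1]:

```python
import numpy as np

# Naive fusion baseline: a per-pixel weighted average of the infrared and
# visible images. Real fusion networks learn far richer combination rules,
# but this shows the basic shape of the problem.
def naive_fuse(infrared, visible, w=0.5):
    """Fuse two grayscale images of shape (H, W) in [0, 1] by averaging."""
    return np.clip(w * infrared + (1 - w) * visible, 0.0, 1.0)

ir = np.zeros((4, 4))
ir[1, 1] = 1.0                  # a single "hot" pixel the IR camera sees
vis = np.full((4, 4), 0.2)      # flat, foggy-looking visible scene
fused = naive_fuse(ir, vis)
print(fused[1, 1])              # 0.6 -- the hot pixel survives, but diluted
```

The dilution of that hot pixel is exactly why simple averaging isn't enough: the rest of this write-up is about learning a smarter combination.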
The Problem: The "Robot Chef" vs. The "Human Food Critic"
For years, scientists tried to build a "Robot Chef" (an AI) to mix these photos. But the Robot Chef was following a very strict, boring rulebook. It was told: "Make the numbers match perfectly. If the pixel brightness is off by 0.01, you failed."
The problem? Robots are bad at judging what looks good to humans.
The Robot Chef might produce a photo that scores 10/10 on its math test but looks weird, blurry, or has strange "ghosting" artifacts to a human eye. It's like a chef who perfectly measures the salt but serves a dish that tastes like soap because they ignored the human palate.
The Solution: Teaching the Robot to Listen to Humans
This paper introduces a new way to teach the Robot Chef. Instead of just checking math, they taught it to listen to Human Food Critics.
Here is how they did it, step-by-step:
1. The "Taste Test" Dataset (The Human Feedback)
First, the researchers needed a massive library of "Good" and "Bad" fused photos.
- They took thousands of photo pairs and mixed them using 11 different existing AI methods.
- They hired experts (human critics) to taste-test these photos. The critics didn't just say "Good" or "Bad"; they gave scores on specific things:
  - Did we keep the heat signature? (Thermal Retention)
  - Is the texture of the road clear? (Texture Retention)
  - Are there weird glitches or ghosts? (Artifacts)
  - Is it sharp? (Sharpness)
- The Magic Trick: They used a super-smart AI (GPT-4o) to help grade the thousands of photos, but they trained it first on the experts' notes. So now, the AI can act like a human critic, spotting glitches and scoring photos automatically.
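To make the "taste test" concrete, here is a hypothetical sketch of one scored entry in such a feedback dataset. Every field name below is invented for illustration; the paper's actual schema is not shown here:

```python
# One hypothetical entry in the human-feedback dataset. Field names are
# made up for illustration, not taken from the paper.
sample = {
    "ir_path": "pairs/0001_ir.png",
    "vis_path": "pairs/0001_vis.png",
    "fused_path": "fused/method_03/0001.png",
    "scores": {                    # ratings on the axes described above
        "thermal_retention": 4,
        "texture_retention": 3,
        "artifacts": 2,            # low score = lots of glitches
        "sharpness": 4,
    },
}

def overall(scores):
    """Collapse the per-axis ratings into one scalar quality score."""
    return sum(scores.values()) / len(scores)

print(overall(sample["scores"]))   # 3.25
```

Keeping the per-axis scores (rather than only the overall number) is what lets the judge later say *why* a photo is bad, not just *that* it is bad.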
2. The "Smart Judge" (The Reward Model)
They built a special AI judge called a Reward Model.
- Think of this as a Taste Test Robot.
- You feed it the Infrared photo, the Visible photo, and the new Fused photo.
- It looks at them and says: "This one gets a 4 out of 5 for texture, but a 2 out of 5 because there's a weird blur on the car."
- It also draws a Heatmap (an "X marks the spot" map) showing exactly where the weird blur is.
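The judge's interface can be sketched with crude hand-written proxies. To be clear: the real reward model is a trained neural network, and the formulas below are my stand-ins, not anything from the paper. They only illustrate the input/output contract: three images go in, per-axis scores and a heatmap come out.

```python
import numpy as np

# Hand-written proxies standing in for a learned reward model. They only
# demonstrate the interface: (ir, vis, fused) -> per-axis scores + heatmap.
def judge(ir, vis, fused):
    grads = lambda img: np.abs(np.gradient(img.astype(float)))
    scores = {
        # How much of the infrared's bright (hot) content survived?
        "thermal_retention": float(np.minimum(fused, ir).sum() / (ir.sum() + 1e-8)),
        # How much of the visible image's edge detail survived?
        "texture_retention": float(np.sum(grads(fused)) / (np.sum(grads(vis)) + 1e-8)),
        # Overall sharpness: mean local gradient magnitude.
        "sharpness": float(np.mean(grads(fused))),
    }
    # "X marks the spot": pixels where the fusion matches neither input.
    heatmap = np.minimum(np.abs(fused - ir), np.abs(fused - vis))
    return scores, heatmap

ir = np.zeros((8, 8)); ir[4, 4] = 1.0     # one hot pixel
vis = np.tile([0.2, 0.8], (8, 4))         # stripy "texture"
fused = 0.5 * ir + 0.5 * vis
scores, heatmap = judge(ir, vis, fused)
print(scores["thermal_retention"], heatmap.shape)
```

Notice that the naive 50/50 fusion already loses thermal retention here; a learned judge would penalize exactly that.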
3. The "Training Camp" (Reinforcement Learning)
Now comes the real magic. They didn't just tell the Robot Chef the answer; they let it learn by trial and error, guided by the Smart Judge.
- The Game: The Robot Chef tries to fuse an image.
- The Score: The Smart Judge grades it.
- The Lesson: If the Chef makes a mistake (like blurring a car), the Judge gives a low score. The Chef learns, "Oh, I shouldn't do that!"
- The Secret Sauce (GRPO): They used a technique called Group Relative Policy Optimization. Imagine a classroom exercise where a student hands in several attempts at the same puzzle (the AI generates a group of candidate fusions for one image pair). The teacher (the Reward Model) doesn't grade each attempt against an absolute standard; she compares them to each other: "This attempt handled the car details better than the rest, so it gets a bonus." This pushes the AI to keep beating its own previous attempts, specifically on the things humans care about.
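Stripped of the classroom analogy, the group-relative part of GRPO boils down to one small computation: normalize each attempt's reward against the group's own mean and standard deviation. A minimal sketch (variable names are mine):

```python
import numpy as np

# The group-relative step of GRPO in one function: score a group of
# attempts, then express each reward relative to the group's own
# mean and standard deviation. This is the standard normalized form.
def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four fusion attempts on the same image pair, graded by the reward model:
adv = group_relative_advantages([2.0, 4.0, 3.0, 3.0])
print(adv)  # the best attempt gets a positive advantage, the worst a negative one
```

Attempts better than the group average get a positive advantage (reinforced), worse ones get a negative advantage (discouraged), which is why the model keeps outdoing itself without needing an absolute "perfect score" to aim at.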
The Result: A Photo You Actually Want to Look At
When they tested this new method:
- Mathematically: It scored higher than all the old methods on standard tests.
- Visually: It looked much better to humans. The cars were sharper, the fog was clearer, and there were fewer weird "ghost" glitches.
- Real World: When the fused photos were fed into downstream systems like self-driving perception or security cameras, those systems detected pedestrians in the fog more reliably than before.
The Big Picture Analogy
Imagine you are training a dog.
- Old Way: You teach the dog to sit using a clicker that only clicks if the dog's paw is exactly 3 inches off the ground. The dog sits, but it looks stiff and unnatural.
- New Way (This Paper): You hire a professional dog trainer (the Human Feedback) to watch the dog. You build a "Clicker" (Reward Model) that clicks not just for the paw height, but for "Does this look like a happy, natural dog?" You then use a special training technique (GRPO) where the dog learns by comparing its best tricks against its worst ones, constantly improving until it performs exactly how a human trainer wants.
In short: This paper bridges the gap between "computer math" and "human eyes," creating fused images that look great to us, not just to a calculator.