The Big Problem: The "Over-Enthusiastic Student"
Imagine you have a brilliant student (a Pre-trained Visual Model, or PVM) who has studied millions of photos of cats, dogs, and cars in daylight. This student is a genius at recognizing things in the sun.
Now, you want this student to learn a new skill: seeing in the dark using infrared cameras (which see heat) and seeing through fog.
The Old Way (Full Fine-Tuning):
In the past, researchers tried to teach this student by forcing them to re-learn everything from scratch using a small set of new "dark and foggy" photos.
- The Result: The student gets confused. They memorize the specific fog patterns in the training photos so well that they fail when they see a different kind of fog. They also forget the general knowledge they learned about cats and dogs in the sun.
- The Analogy: It's like a chef who memorizes a specific recipe for a cake perfectly but forgets how to bake any cake if the ingredients change slightly. They are "overfitting"—they are too focused on the details of the practice test to pass the real exam.
The New Solution: IV-tuning (The "Smart Guide")
The authors of this paper propose IV-tuning. Instead of making the student re-learn everything, they keep the student's original brain frozen (so they don't forget their general knowledge) and just give them a few specialized notes (called "Prompts") to help them adapt to the new situation.
Think of it like handing a seasoned detective a magnifying glass and a pair of thermal-imaging goggles without making them re-learn how to walk or talk.
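To make "frozen brain plus sticky notes" concrete, here is a minimal numpy sketch of prompt tuning, the general technique the paper builds on. The shapes, the single frozen layer, and the names `prompts` and `forward` are all illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": one pretrained linear layer standing in for the whole PVM.
# In prompt tuning these weights never receive gradient updates.
W_frozen = rng.standard_normal((16, 16))

# Trainable prompts: a handful of extra token vectors prepended to the input.
# These few vectors are the only parameters that would be trained.
prompts = np.zeros((4, 16))              # 4 prompt tokens, embedding dim 16

def forward(image_tokens):
    """Prepend the prompt tokens, then run the frozen layer over the sequence."""
    seq = np.concatenate([prompts, image_tokens], axis=0)
    return seq @ W_frozen

tokens = rng.standard_normal((10, 16))   # 10 "image patch" tokens
out = forward(tokens)
print(out.shape)                         # (14, 16): 4 prompt + 10 patch tokens
```

The backbone's general knowledge stays intact because `W_frozen` never changes; only the small `prompts` array adapts to the new domain.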
How It Works: The "Two-Stream" Strategy
The authors observe that Visible Light (what our eyes see) and Infrared (heat signatures) carry very different kinds of information, so each gets its own processing stream.
- Visible Light is like a high-definition photo: It has lots of sharp edges, textures, and fine details (like the fur on a cat).
- The Strategy: The system uses "convolutions" (small sliding filters) to sharpen these details, just like a photo editor enhancing a picture.
- Infrared is like a heat map: It doesn't have sharp edges; it shows broad, glowing shapes (like a warm blob where a person is standing).
- The Strategy: The system treats this gently. It uses simple linear projections (plain weighted sums) to pass the information through without inventing detail.
- The Analogy: If you try to sharpen a blurry heat map with a high-definition filter, you ruin the image. IV-tuning knows that infrared is "low-frequency" (smooth and broad), so it doesn't try to force sharp edges onto it. It preserves the "glow."
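A toy numpy sketch of the two streams, assuming the simplest possible stand-ins: a hand-written 3x3 high-pass convolution for the visible branch and a single matrix multiply for the infrared branch (both hypothetical, not the paper's exact layers):

```python
import numpy as np

def sharpen(visible):
    """Visible stream: a 3x3 Laplacian-like convolution that boosts
    edges and texture (high-frequency detail)."""
    k = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
    h, w = visible.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(visible[i:i + 3, j:j + 3] * k)
    return out

def project(infrared, W):
    """Infrared stream: a plain linear projection that remixes values
    smoothly, preserving the broad low-frequency 'glow'."""
    return infrared @ W

rng = np.random.default_rng(1)
vis = rng.standard_normal((8, 8))     # toy visible patch
ir = rng.standard_normal((8, 8))      # toy infrared patch
W = rng.standard_normal((8, 8)) * 0.1

edges = sharpen(vis)                  # (6, 6) edge map
heat = project(ir, W)                 # (8, 8) smoothly remapped heat values
```

Running the sharpening kernel on the infrared patch would exaggerate noise rather than recover edges, which is exactly why the two modalities get different treatment.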
The Secret Sauce: The "Modality-Aware Prompter"
The core of their invention is a module called the Modality-Aware Prompter.
- The "Prompt": Imagine the student is taking a test. The "Prompt" is a sticky note the teacher puts on the desk saying, "Hey, remember, in this room, the walls are hot, but the floor is cold. Look for heat, not just shapes."
- The "Cascade": The system puts these sticky notes at every single layer of the student's brain. As the student processes the image deeper and deeper, the notes get updated to give more specific advice.
- The "Rank-Adaptive" Fusion:
- In the early layers of the brain, the information is simple and repetitive. The system uses a compact, efficient fusion (like a quick summary).
- In the deep layers, the information is complex and diverse. The system switches to a rich, detailed fusion (like a full essay) to make sure no important details are lost.
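The cascade and the rank-adaptive fusion can be sketched together in a few lines of numpy. The rank schedule, layer count, and the `make_fuser` helper below are invented for illustration; the idea they show is real: each layer refines the previous layer's prompt, and the fusion gets higher-rank (more expressive) with depth:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16
ranks = [2, 4, 8, 16]   # hypothetical schedule: compact early, rich deep

def make_fuser(rank):
    """Low-rank fusion: project concatenated RGB+IR features down to
    `rank` dimensions, then back up. Small rank = quick summary;
    large rank = detailed blend."""
    down = rng.standard_normal((2 * dim, rank)) * 0.1
    up = rng.standard_normal((rank, dim)) * 0.1
    return lambda rgb, ir: np.concatenate([rgb, ir]) @ down @ up

fusers = [make_fuser(r) for r in ranks]

rgb = rng.standard_normal(dim)
ir = rng.standard_normal(dim)
prompt = np.zeros(dim)
for layer, fuse in enumerate(fusers):
    # Cascade: each layer's prompt is an update of the previous one.
    prompt = prompt + fuse(rgb, ir)
    print(f"layer {layer}: fusion rank {ranks[layer]}")
```

The low-rank factorization is why the early, "quick summary" layers are so cheap: a rank-2 fuser here has 32·2 + 2·16 = 96 weights versus 1,024 for the full-rank one.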
Why Is This Better? (The Results)
The paper tested this on three difficult tasks:
- Finding the most important object (Salient Object Detection).
- Labeling every pixel (Semantic Segmentation).
- Finding and boxing objects (Object Detection).
The Wins:
- Less Memory, More Brains: They trained only about 3% of the model's parameters. It's like training a whole army by only teaching the generals, while the soldiers (the frozen backbone) already know how to fight.
- No Overfitting: Because they didn't force the model to re-learn everything, it didn't memorize the training data. It generalized better to new, unseen scenarios.
- Speed & Cost: It uses less computer memory and trains faster than the old "re-learn everything" methods.
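Back-of-the-envelope arithmetic shows how small a 3% trainable budget really is. The parameter counts below are illustrative round numbers (roughly a ViT-Base-sized backbone), not figures from the paper:

```python
# Illustrative parameter budget for a frozen-backbone setup.
frozen_backbone = 86_000_000      # pretrained PVM weights, never updated
trainable_extras = 2_600_000      # prompts + fusion modules, the only
                                  # parameters the optimizer ever touches

total = frozen_backbone + trainable_extras
fraction = trainable_extras / total
print(f"trainable fraction: {fraction:.1%}")   # ~2.9%
```

Optimizer state (momentum, variance estimates) is only kept for the trainable 3%, which is where most of the memory and speed savings come from.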
The Bottom Line
IV-tuning is a smart way to take a powerful AI that was trained on sunny, clear days and teach it to work in the dark and fog. Instead of forcing the AI to forget its past and re-learn everything (which makes it clumsy and prone to mistakes), it simply gives the AI specialized, gentle instructions on how to interpret heat and low-light images.
It's the difference between rewriting a dictionary to learn a new language versus adding a few helpful footnotes to an existing, perfect dictionary. The result is a smarter, faster, and more adaptable system.