Imagine you are a security guard watching a live feed of a busy street. You have two cameras: a Visible Camera (like your eyes, seeing colors and shapes in the day) and an Infrared Camera (like night-vision goggles, seeing heat signatures of people and cars even in the dark).
Usually, security systems use both cameras together to create the perfect picture: the clear shapes from the visible camera and the heat data from the infrared one.
The Problem:
What happens if the Infrared camera breaks or is missing at night?
Most existing AI systems try to "hallucinate" or guess what the missing heat picture should look like. They try to paint a new image from scratch. This is like a painter trying to guess what a person looks like in the dark just by looking at a photo of them in the sun. The result is often blurry, weird, or full of fake details (like a ghost appearing where no one is).
The Solution: "Missing No More"
The authors of this paper propose a smarter way to handle a missing infrared camera. Instead of trying to paint a new heat image, they use a Dictionary and a Translator.
Here is how it works, broken down into simple analogies:
1. The Shared Dictionary (The Universal Translator)
Imagine you have a giant dictionary of "building blocks" (atoms).
- The Old Way: You try to build a house (the image) using two different sets of bricks (Visible bricks and Infrared bricks) that don't quite fit together.
- The New Way: The authors create one single set of bricks that both cameras agree on.
- When the Visible camera sees a tree, it breaks the tree down into these specific bricks.
- When the Infrared camera sees a hot engine, it also breaks it down into the same bricks, just arranged differently.
- Why this helps: Because both cameras speak the same "brick language," we can translate information from one to the other without losing the structure.
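The "shared brick language" above is, in technical terms, sparse coding over a single dictionary shared by both modalities. Here is a minimal numpy sketch of that idea, assuming a toy random dictionary and a simple greedy matching-pursuit encoder (the paper's actual dictionary is learned from data; the atoms, sizes, and encoder here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shared dictionary: 16 atoms, each an 8x8 patch flattened
# to a 64-dim vector. These are the shared "bricks"; here they are just
# random unit vectors so the example is self-contained.
D = rng.standard_normal((64, 16))
D /= np.linalg.norm(D, axis=0)  # normalize each atom (column)

def sparse_code(x, D, k=3):
    """Greedy matching pursuit: describe x using at most k dictionary atoms."""
    coeffs = np.zeros(D.shape[1])
    residual = x.copy()
    for _ in range(k):
        idx = np.argmax(np.abs(D.T @ residual))  # best-matching brick
        coeffs[idx] += D[:, idx] @ residual      # how strongly it is used
        residual = x - D @ coeffs                # what is still unexplained
    return coeffs

# Two "views" of the same scene, each built from (and encoded with) the
# SAME dictionary -- the key point of the shared brick language.
visible_patch = D @ (rng.standard_normal(16) * (rng.random(16) < 0.2))
infrared_patch = D @ (rng.standard_normal(16) * (rng.random(16) < 0.2))

c_vis = sparse_code(visible_patch, D)
c_ir = sparse_code(infrared_patch, D)
print(c_vis.shape, c_ir.shape)  # → (16,) (16,)
```

Because both coefficient vectors index the same 16 atoms, information can be moved between modalities just by rearranging coefficients, without ever touching raw pixels.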
2. The Translation Process (The Coefficient Domain)
Instead of trying to generate a whole new picture (which is messy), the AI works with the blueprints of the bricks (the coefficients: the list of which bricks are used and how strongly each one contributes).
- Step 1: Encode. The AI looks at the Visible image and says, "Okay, this tree is made of Brick A, Brick B, and Brick C."
- Step 2: Translate. It asks, "If this were a heat image, how would those same bricks be arranged?" It doesn't guess the whole picture; it just rearranges the blueprints.
- Step 3: The "Smart Editor" (The LLM). Here is the clever part. The AI uses a frozen Large Language Model, meaning its weights are never updated during training (like a very smart, but quiet, editor). It doesn't write the picture; it just gives a tiny nudge.
- Analogy: Imagine you are translating a book. You get the draft, but it feels a bit flat. You ask a literary critic (the LLM), "Does this scene feel warm enough?" The critic doesn't rewrite the book; they just say, "Make the fire a little brighter here." The AI then adjusts the blueprints slightly to make the heat feel more realistic.
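The three steps above can be sketched as a short data-flow in numpy. Everything here is a stand-in: `W_translate` plays the role of the paper's learned translator (here just a random near-identity matrix), and `llm_nudge` is a placeholder for the frozen LLM's adjustment (the real system queries a language model; this function only shows that the output is the same blueprint, slightly corrected):

```python
import numpy as np

rng = np.random.default_rng(1)
n_atoms = 16

# Hypothetical learned translator: a linear map from visible-coefficient
# space to infrared-coefficient space. Random near-identity for illustration.
W_translate = 0.1 * rng.standard_normal((n_atoms, n_atoms)) + np.eye(n_atoms)

def llm_nudge(coeffs):
    """Placeholder for the frozen LLM's edit: a small, bounded residual
    correction to the coefficients, not a rewrite of the whole blueprint."""
    return coeffs + 0.05 * np.tanh(coeffs)

c_vis = rng.standard_normal(n_atoms)      # Step 1: encode the visible image
c_ir_draft = W_translate @ c_vis          # Step 2: translate the blueprint
c_ir = llm_nudge(c_ir_draft)              # Step 3: the editor's tiny nudge

# The nudge is deliberately small: the final heat blueprint stays close
# to the translated draft rather than being invented from scratch.
nudge_size = np.linalg.norm(c_ir - c_ir_draft)
```

The design point this mirrors: the LLM never generates pixels or coefficients on its own; it only perturbs a translation that already respects the shared dictionary.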
3. The Final Assembly (Fusion)
Now, the AI takes the original Visible blueprint and the newly "translated" Heat blueprint.
- It mixes them together intelligently. If there's a sharp edge (like a car bumper), it keeps the clear shape from the Visible camera. If there's a hot spot (like a person's body), it uses the heat data from the translated blueprint.
- Finally, it uses the Shared Dictionary to rebuild the image. Because the blueprints were consistent from the start, the final picture is sharp, natural, and doesn't have those weird "ghost" artifacts.
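The assembly step can be sketched with a standard coefficient-domain fusion rule: for each atom, keep whichever modality expresses it more strongly ("max-absolute" selection), then rebuild the patch with the shared dictionary. The paper's actual mixing rule may be more sophisticated; this is the common sparse-fusion baseline, with a random illustrative dictionary:

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((64, 16))
D /= np.linalg.norm(D, axis=0)  # shared dictionary (illustrative atoms)

c_vis = rng.standard_normal(16) * (rng.random(16) < 0.3)  # visible blueprint
c_ir = rng.standard_normal(16) * (rng.random(16) < 0.3)   # translated heat blueprint

# Max-absolute fusion: per atom, keep the modality that uses it more
# strongly -- sharp edges survive from the visible side, hot spots from
# the infrared side.
c_fused = np.where(np.abs(c_vis) >= np.abs(c_ir), c_vis, c_ir)

# Rebuild the patch from the fused blueprint with the SAME dictionary,
# so both kinds of information land in one consistent picture.
fused_patch = D @ c_fused
print(fused_patch.shape)  # → (64,)
```

Because fusion happens on coefficients of one shared dictionary rather than on two incompatible pixel predictions, the reconstruction cannot introduce structures that neither blueprint contained, which is why the "ghost" artifacts disappear.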
Why is this a big deal?
- No More "Fake" Pictures: Old methods tried to generate a whole new image, which often looked fake. This method just rearranges existing information, so it stays true to reality.
- Explainable: Because they are working with "bricks" (dictionary atoms) instead of magic pixels, we can actually see how the AI made its decision. It's not a black box; it's a logical process.
- Works with Just One Camera: You don't need to wait for the broken infrared camera to be repaired. You can take a photo with just the visible camera, and the AI will "fill in the blanks" with heat data that actually makes sense.
In Summary:
Instead of trying to paint a missing heat map from scratch (which leads to mistakes), this method translates the visible image into a shared language, uses a smart editor to tweak the heat details, and then rebuilds the perfect image. It's like having a master architect who can look at a blueprint for a house and instantly tell you exactly where the heating pipes should go, even if you only have the drawing for the walls.