Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

The Big Problem: Blurry Night Vision

Imagine you are trying to read a street sign at night using a thermal camera (which sees heat instead of light). In the real world, these images are often blurry, fuzzy, or distorted. This happens for two main reasons:

The Lens: The camera might be slightly out of focus, or the camera itself is shaking (motion blur).
The Physics: Heat doesn't always match the shape of objects. For example, a car engine is very hot, but the heat might "bleed" out past the edges of the car, making the car look like a glowing blob rather than a sharp vehicle.

Most current AI tools are trained on fake data. They are like students who only studied from a textbook of perfect, clean drawings. When they try to fix a real, messy photo, they get confused. They might sharpen the edges but lose the heat information, or they might make the heat look sharp but distort the shape of the object.

The Solution: A New Teacher and a New Student

The authors of this paper built two things to fix this: a new textbook (a dataset) and a new student (an AI model).

1. The New Textbook: FLIR-IISR

Think of this as a "Real-World Training Camp."

What they did: Instead of using computers to simulate blurry images, they went out into 6 different cities, across 3 seasons, and took 1,457 real photos with a high-end thermal camera.
The Trick: For each scene they captured a super-clear photo (High Resolution) and a matching "bad" photo (Low Resolution) that was degraded in the camera itself, either by throwing the lens out of focus or by catching objects in motion.
Why it matters: Now, the AI has a "Ground Truth" pair. It can see exactly what the blurry mess looked like and what the clear version should have looked like. It's like having a "Before and After" photo album of real-life disasters, which helps the AI learn how to fix them properly.

2. The New Student: Real-IISR

This is the AI model. It is "autoregressive," meaning it builds the image in a sequence of steps, with each step depending on what it has already produced.

The Analogy: Imagine painting a picture. A normal AI might try to paint the whole canvas at once, which often leads to a muddy mess. Real-IISR paints scale-by-scale. It starts by sketching the rough outline of the scene, then fills in the big shapes, and finally adds the tiny details (like the texture of a brick wall or the glow of a hot pipe). This step-by-step approach prevents the AI from getting overwhelmed.
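For readers who like to see the idea in code, the scale-by-scale painting can be sketched roughly as below. This is a toy NumPy illustration, not the authors' model: it only shows how an image can be split into coarse-to-fine layers and rebuilt by stacking them. The function names and the scale schedule (1, 2, 4, 8) are assumptions made up for the example; in the real system a neural network predicts each layer instead of reading it off the clean image.

```python
import numpy as np

def downsample(x, size):
    """Average-pool a square array down to size x size (side must divide evenly)."""
    factor = x.shape[0] // size
    return x.reshape(size, factor, size, factor).mean(axis=(1, 3))

def upsample(x, size):
    """Nearest-neighbour upsample a square array to size x size."""
    factor = size // x.shape[0]
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def build_residual_pyramid(image, scales):
    """Split an image into layers, coarsest first.

    Each layer stores only what the coarser layers failed to explain.
    """
    size = image.shape[0]
    approx = np.zeros_like(image)
    layers = []
    for s in scales:
        layer = downsample(image - approx, s)
        layers.append(layer)
        approx = approx + upsample(layer, size)
    return layers

def reconstruct(layers, size):
    """Paint the canvas scale by scale: rough sketch first, fine detail last."""
    canvas = np.zeros((size, size))
    for layer in layers:
        canvas += upsample(layer, size)
    return canvas

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 8))
    layers = build_residual_pyramid(img, scales=(1, 2, 4, 8))
    print([layer.shape for layer in layers])
    print(np.allclose(reconstruct(layers, 8), img))
```

Because the last layer is at full resolution, adding all layers back together recovers the original image; stopping early gives a progressively blurrier version of it.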

The Three Superpowers of Real-IISR

To make sure the AI doesn't just make things look sharp but also keeps the heat patterns physically plausible, the authors gave it three special tools:

A. The "Heat & Shape" GPS (Thermal-Structural Guidance)

The Problem: In thermal images, the "hot spot" (like a running engine) doesn't always line up perfectly with the "edge" (the outline of the car). If the AI only looks at the heat, it might draw a car that is too round. If it only looks at the edge, it might miss the heat.
The Fix: This module acts like a GPS that holds two maps at once: a Heat Map and an Edge Map. It constantly checks both. If the heat is bleeding out, the GPS says, "Wait, the edge is over here, keep the heat contained!" This ensures the object looks like a car, not a glowing cloud.

B. The "Smart Dictionary" (Condition-Adaptive Codebook)

The Problem: AI models usually use a fixed dictionary of "pixels" to rebuild images. But a blurry pixel caused by motion looks different from a blurry pixel caused by an out-of-focus lens. Using the same dictionary for both is like trying to fix a broken vase and a torn shirt with the exact same glue.
The Fix: Real-IISR has a Smart Dictionary that changes its definitions on the fly. If it sees motion blur, it swaps in "motion-friendly" pixels. If it sees heat noise, it swaps in "heat-friendly" pixels. It adapts its vocabulary to the specific mess it's trying to clean up.

C. The "Thermostat Rule" (Thermal Order Consistency Loss)

The Problem: In a thermal image, hotter things should show up brighter than colder things (at least in the usual "white-hot" display). If the AI accidentally makes a cold rock look brighter than a hot engine, the picture no longer matches the physical scene.
The Fix: The AI is given a strict rule: "If Region A is hotter than Region B in the clear reference photo, it MUST also be brighter in the AI's output." It doesn't care about the exact temperature number, but it strictly enforces the order. This prevents the AI from creating weird, glowing artifacts that don't make sense physically.

The Result

When they tested this new system against the best existing methods:

Sharper Edges: The outlines of cars and people were much clearer.
Better Heat: The hot spots stayed hot and didn't bleed into the cold areas.
Realism: The images looked like they were taken by a high-end camera, not a computer program guessing.

Summary

The paper's message, in short: stop training AI on fake, perfect data. Give it a real-world dataset of messy thermal photos, and teach it to fix them step-by-step using a system that respects both the shape of objects and the laws of heat.

It's like upgrading from a student who only memorized a dictionary to a master restorer who understands the history, the material, and the physics of the painting they are fixing.

1. Problem Statement

Infrared Image Super-Resolution (IISR) is critical for perception tasks like autonomous driving and surveillance under low-light or adverse conditions. However, existing methods face two fundamental challenges when applied to real-world scenarios:

Lack of Realistic Datasets: Current IISR methods are typically trained on synthetic datasets (e.g., downsampled fusion datasets) that fail to capture the complex, coupled degradations of real infrared imaging, such as spatially varying optical blur, sensor noise, and thermal drift.
Inadequate Degradation Modeling:
- Diffusion models often rely on fixed degradation priors and stochastic sampling, which struggle with the spatially heterogeneous blur and noise specific to infrared sensors.
- Visual Autoregressive (VAR) models are primarily designed for visible light. They often fail to account for the weak correlation between thermal intensity and structural edges in infrared images, leading to boundary distortion and thermal drift (where heat sources do not align with object contours).
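To make the first point concrete, the sketch below shows a generic synthetic degradation of the kind such training sets rely on: one global Gaussian blur, fixed-stride downsampling, and additive Gaussian noise. It is an assumed, textbook-style pipeline; the function names, kernel width, scale factor, and noise level are illustrative and are not taken from the paper or from any specific dataset.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma=1.5):
    """Separable Gaussian blur with edge padding; the same kernel is used everywhere."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def synthetic_degrade(hr, scale=4, sigma=1.5, noise_std=0.01, seed=0):
    """Blur, subsample by `scale`, then add i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    lr = blur(hr, sigma)[::scale, ::scale]
    return lr + rng.normal(0.0, noise_std, size=lr.shape)

if __name__ == "__main__":
    hr = np.random.default_rng(1).random((32, 32))
    lr = synthetic_degrade(hr)
    print(hr.shape, "->", lr.shape)
```

Every pixel here sees the same blur kernel and the same noise statistics, which is exactly what real defocus, motion blur, and sensor behavior do not guarantee.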

2. Methodology: Real-IISR Framework

The authors propose Real-IISR, a unified autoregressive framework designed to reconstruct fine-grained thermal structures and clear backgrounds scale-by-scale. The framework consists of three core components:

A. Thermal-Structural Guidance (TSG) Module

Problem Addressed: The inherent mismatch between thermal radiation (heat sources) and structural edges (object boundaries). For example, a car engine is hot but its thermal radiation may not perfectly match the car's contour.
Mechanism: The module constructs two auxiliary representations from the low-resolution input: a Heat Map (semantic heat-source info) and an Edge Map (geometric boundaries).
Implementation: Using pre-trained encoders (DINOv3), these maps are fused via an adaptive attention gate. This fused guidance is injected into the autoregressive backbone via cross-attention, ensuring the model aligns thermal distributions with spatial boundaries to prevent structural distortion.
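The fusion and injection steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the gate is assumed to be a sigmoid over a linear map of the concatenated features, the cross-attention is single-head with no learned projections, and all names (`gated_fusion`, `cross_attention`, `w`, `b`) are hypothetical. Feature extraction with the pre-trained encoder is omitted; the inputs stand in for already-encoded heat and edge tokens.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(heat, edge, w, b):
    """Blend heat and edge features with a per-element gate in (0, 1).

    heat, edge: (N, D) token features; w: (2D, D); b: (D,).
    """
    gate = sigmoid(np.concatenate([heat, edge], axis=-1) @ w + b)
    fused = gate * heat + (1.0 - gate) * edge
    return fused, gate

def cross_attention(queries, guidance):
    """Single-head cross-attention: backbone tokens attend to guidance tokens.

    queries: (M, D); guidance: (N, D); returns (M, D).
    """
    d = queries.shape[-1]
    scores = queries @ guidance.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ guidance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heat, edge = rng.random((5, 4)), rng.random((5, 4))
    w, b = rng.normal(size=(8, 4)), np.zeros(4)
    fused, gate = gated_fusion(heat, edge, w, b)
    tokens = rng.random((3, 4))
    print(cross_attention(tokens, fused).shape)
```

A gate value near 1 lets the heat cue dominate at that position and a value near 0 favors the edge cue, so the blend can vary from token to token.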

B. Condition-Adaptive Codebook (CAC)

Problem Addressed: Real-world infrared images suffer from non-uniform degradations (defocus, motion blur, noise) that induce quantization bias in standard Vector Quantization (VQ) models. Static codebooks cannot adapt to these varying conditions, leading to over-smoothed textures.
Mechanism: Instead of a static table lookup, CAC dynamically modulates discrete embeddings based on degradation-aware priors (thermal distributions and edge structures).
Implementation: It applies low-rank perturbations to the base code embeddings conditioned on the input state. This allows the same discrete index to decode into different embedding vectors depending on the specific degradation, enhancing texture realism and robustness.
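One way to picture a low-rank, condition-dependent perturbation is the sketch below. The exact parameterization is not given in this summary, so the form used here, E(c) = E_base + A diag(cW) B, is an assumption, as are the class name, the attribute names, and the random initialization. The point it illustrates is only that a fixed discrete index can decode to different vectors under different conditions.

```python
import numpy as np

class ConditionAdaptiveCodebook:
    """Toy codebook whose embeddings shift with a conditioning vector."""

    def __init__(self, num_codes, dim, cond_dim, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.base = rng.normal(size=(num_codes, dim))      # static code embeddings
        self.a = 0.1 * rng.normal(size=(num_codes, rank))  # low-rank factor (K, r)
        self.b = 0.1 * rng.normal(size=(rank, dim))        # low-rank factor (r, D)
        self.w_cond = rng.normal(size=(cond_dim, rank))    # condition -> r scales

    def delta(self, cond):
        """Perturbation of rank at most r, scaled by the condition."""
        scale = cond @ self.w_cond                 # shape (r,)
        return self.a @ (scale[:, None] * self.b)  # shape (K, D)

    def embeddings(self, cond):
        return self.base + self.delta(cond)

    def decode(self, indices, cond):
        """Look up discrete indices in the condition-modulated table."""
        return self.embeddings(cond)[indices]

if __name__ == "__main__":
    cb = ConditionAdaptiveCodebook(num_codes=16, dim=8, cond_dim=4, rank=2)
    idx = np.array([3, 3, 7])
    defocus = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical condition vectors
    motion = np.array([0.0, 1.0, 0.0, 0.0])
    print(np.abs(cb.decode(idx, defocus) - cb.decode(idx, motion)).max())
```

Keeping the perturbation low-rank means the extra parameters grow with r(K + D) rather than with K times D, which is the usual motivation for this kind of factorization.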

C. Thermal Order Consistency Loss ($L_{TOC}$)

Problem Addressed: Standard pixel-wise losses (like MSE) fail to restore physically accurate thermal distributions because real-world degradations cause spatial shifts and local temperature compression.
Mechanism: Enforces a monotonic relationship between temperature and pixel intensity. It ensures that if a region is hotter in the High-Resolution (HR) ground truth, it must be brighter in the Super-Resolution (SR) output, regardless of absolute pixel values.
Implementation: A patch-wise loss function that penalizes inverted thermal ordering between adjacent patches. This maintains physical consistency even when there is slight spatial misalignment between LR and HR pairs.
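A patch-wise ordering penalty of this kind can be sketched as a simple ranking hinge. The paper's exact formula is not reproduced in this summary, so the version below is an assumption: it averages each image over non-overlapping patches, compares horizontally and vertically adjacent patches, and penalizes the SR output only when the sign of its difference contradicts the HR reference. The function names and the patch size are illustrative.

```python
import numpy as np

def patch_means(img, patch):
    """Mean intensity of each non-overlapping patch (ragged borders are cropped)."""
    h, w = img.shape
    h_crop, w_crop = h - h % patch, w - w % patch
    blocks = img[:h_crop, :w_crop].reshape(h_crop // patch, patch, w_crop // patch, patch)
    return blocks.mean(axis=(1, 3))

def thermal_order_loss(sr, hr, patch=4):
    """Penalize adjacent patch pairs whose brightness order in SR is inverted vs HR."""
    m_sr, m_hr = patch_means(sr, patch), patch_means(hr, patch)
    terms = []
    for axis in (0, 1):  # vertical and horizontal neighbours
        d_hr = np.diff(m_hr, axis=axis)
        d_sr = np.diff(m_sr, axis=axis)
        # Positive only where SR moves in the opposite direction to HR.
        terms.append(np.maximum(0.0, -np.sign(d_hr) * d_sr).ravel())
    return float(np.concatenate(terms).mean())

if __name__ == "__main__":
    hr = np.random.default_rng(0).random((16, 16))
    print(thermal_order_loss(hr, hr))             # identical ordering
    print(thermal_order_loss(2.0 * hr + 3.0, hr)) # monotone rescaling
    print(thermal_order_loss(-hr, hr))            # fully inverted ordering
```

Because only the ordering of patch averages matters, a monotone brightness rescaling of the output costs nothing, while an inverted hot and cold pair is penalized in proportion to how far it is inverted. Averaging over patches is also what gives the tolerance to small spatial shifts.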

3. Key Contributions

A. FLIR-IISR Dataset

The authors constructed what they describe as the first large-scale, real-world paired dataset for IISR:

Scale: 1,457 paired LR–HR images.
Acquisition: Captured using a FLIR T1050sc camera (1024×768 resolution) across 6 cities, 3 seasons, and 12 scene categories.
Degradations: Real-world degradations generated via automated focus variation (defocus) and object motion (motion blur), covering both optical and sensor-level imperfections.
Format: Stored in lossless BMP to preserve radiometric fidelity.

B. Unified Autoregressive Framework

Real-IISR is presented as the first framework to integrate thermal priors directly into an autoregressive generation process, specifically addressing the unique physics of infrared imaging (thermal-structural misalignment and radiometric drift).

C. Comprehensive Benchmarking

The paper provides extensive evaluations on both the new FLIR-IISR dataset and the existing M3FD dataset, establishing a new standard for real-world IISR research.

4. Experimental Results

Quantitative Performance

No-Reference Metrics: Real-IISR achieved state-of-the-art (SOTA) performance in MUSIQ (perceptual quality) and MANIQA on both FLIR-IISR and M3FD datasets. For example, on FLIR-IISR Set5, it scored 59.90 on MUSIQ, 7.14 points above the previous best (VARSR at 52.76).
Reference-Based Metrics: It achieved the highest PSNR (28.51 dB) and SSIM (0.8278) on FLIR-IISR, while maintaining competitive LPIPS (0.1615), indicating a superior balance between structural fidelity and perceptual quality.
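For readers less familiar with the reference-based numbers, PSNR follows the standard definition 10·log10(peak² / MSE). The sketch below computes it and inverts it to a root-mean-square error. The 8-bit peak value of 255 is an assumption; this summary does not state the intensity range used in the paper's evaluation, and the helper names are illustrative.

```python
import math
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = float(np.mean((ref - est) ** 2))
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

def rmse_from_psnr(psnr_db, peak=255.0):
    """Invert the PSNR formula to get the root-mean-square error."""
    return peak / (10.0 ** (psnr_db / 20.0))

if __name__ == "__main__":
    ref = np.zeros((4, 4), dtype=np.uint8)
    est = np.ones((4, 4), dtype=np.uint8)  # off by one gray level everywhere
    print(round(psnr(ref, est), 2))
    print(round(rmse_from_psnr(28.51), 2))
```

Under the 8-bit assumption, a PSNR of 28.51 dB corresponds to an RMSE of roughly 9.6 gray levels, which gives a feel for the size of the remaining pixel-level error.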

Qualitative Performance

Visual Analysis: Compared to diffusion-based and other autoregressive methods, Real-IISR produces sharper edges and more faithful heat distributions.
Artifact Reduction: It effectively mitigates "thermal peak drift" (where hot spots shift away from objects) and boundary distortion, which are common in competing methods.
Efficiency: Despite a large parameter count (1144.6M), Real-IISR runs at 2.45 FPS (about 0.41 s per image) on an A800 GPU, faster than the diffusion-based methods it is compared against, which the authors attribute to its deterministic token-level prediction.

Ablation Studies

Removing TSG led to blurred edges and misaligned thermal boundaries.
Removing CAC resulted in unstable textures and inconsistent heat distributions.
Removing $L_{TOC}$ caused thermal peak drift and local temperature compression.
Replacing the VAR backbone with a diffusion-based architecture resulted in significant performance drops, which the authors take as evidence that autoregressive generation is better suited to the discrete, structured nature of infrared imaging.

5. Significance and Impact

Bridging the Gap: This work addresses the critical gap between synthetic training data and real-world infrared applications, providing a dataset and method that generalize to actual deployment scenarios.
Physical Consistency: By enforcing thermal order consistency, the method ensures that reconstructed images are not just visually sharp but also radiometrically reliable, which is crucial for thermal monitoring and safety-critical systems.
Foundation for Future Work: The release of the FLIR-IISR dataset and the Real-IISR code provides a unified foundation for future research in real-world infrared restoration, autonomous driving, and thermal surveillance.

In summary, the paper presents a holistic solution to real-world IISR by combining a high-fidelity dataset with a novel autoregressive architecture that explicitly models the unique physical constraints of infrared imaging.