RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

The Big Problem: The "Amnesiac" AI

Imagine you are a detective trying to solve a crime in a city you've never visited. You have a blurry, low-resolution photo of the scene. To make it clear, you need to guess what the missing details look like.

Most current AI super-resolution models are like amnesiac detectives. Every time they look at a new photo, they treat it as if it's the first time they've ever seen that street. They have to work out the basics all over again: "Oh, the sky is usually at the top, the road is at the bottom, and buildings are in the middle." They spend a huge amount of mental energy re-discovering these basic facts for every single image, which is inefficient and sometimes leads to mistakes.

This is especially a problem for infrared cameras (used in self-driving cars and night surveillance). Many of these cameras face the same view every day (like a traffic camera on a highway). The layout never changes, but current AI models don't "remember" this. They are statistically naive, wasting their brainpower on things they should already know.

The Solution: The "Local Guide" and the "Memory Book"

The authors of the RPT-SR paper decided to fix this by giving the AI a "cheat sheet" and a "local guide." They created a new mechanism called Regional Prior Attention.

Think of the system as having two distinct types of workers (tokens) working together:

The Memory Book (Regional Prior Token):
Imagine a permanent, learnable notebook that sits on the desk. This notebook doesn't care about today's specific traffic or weather. Instead, it learns the permanent layout of the scene over time.
- Analogy: It's like a map of a city that knows, "The highway is always at the bottom of the frame, and the sky is always at the top." It remembers the "skeleton" of the scene.
The Local Guide (Local Token):
This is a worker who looks at the current blurry photo. They see the specific details: "Today, there is a red truck here, and a pedestrian there." They capture the unique, changing content of the moment.

How They Work Together (The Magic Trick)

In the old models, the AI tried to guess the details using only the blurry photo (the Local Guide). In the new RPT-SR model, the Local Guide and the Memory Book hold hands and talk to each other.

The Process: The AI takes the "Local Guide's" observations of the current image and mixes them with the "Memory Book's" knowledge of the scene's layout.
The Result: The AI doesn't have to guess where the road is; the Memory Book tells it, "The road is here." This frees up the AI's brain to focus entirely on making the texture of the road and the details of the truck look sharp and realistic.

It's like hiring a local tour guide (Local Token) who knows the current traffic, but giving them a GPS (Regional Prior) that already knows the map. The guide doesn't waste time asking, "Which way is North?" because the GPS already told them. They can just focus on driving smoothly.

Why This Matters for Infrared

Infrared cameras (which see heat or light through fog) are often low-resolution because high-resolution sensors are incredibly expensive. Super-resolution is the software trick to make cheap sensors look like expensive ones.

The researchers tested this on two very different types of infrared light:

LWIR (Long-Wave): Sees heat (like a thermal camera).
SWIR (Short-Wave): Sees reflected light (like a camera that can see through smoke).

Even though these two types of cameras "see" the world in completely different ways, the RPT-SR model performed strongly on both. This suggests that the model isn't just memorizing heat patterns; it's learning the structural rules of the scene (where things usually sit), which should carry over to other fixed-view cameras.

The Results

When they tested this new AI against the best existing models:

It looked better: The images were sharper, with fewer weird artifacts (like blurry ghosts or ringing edges).
It was smarter: It didn't waste energy re-learning the layout of the road or the sky.
It was versatile: It worked on both heat-sensing cameras and smoke-penetrating cameras.

In a Nutshell

RPT-SR is a new type of AI that stops trying to re-invent the wheel for every image. Instead, it remembers the permanent layout of the scene (like a map) and combines that memory with the current details. This allows it to turn blurry, low-quality infrared images into sharper, more detailed pictures than earlier models could produce, at only a small extra computational cost. It's the difference between a detective who forgets the map every day and one who has a reliable map in their pocket.

1. Problem Statement

The Challenge of Fixed-Viewpoint Infrared Imaging:
While general-purpose Super-Resolution (SR) models, particularly Vision Transformers (ViTs), have achieved state-of-the-art results in natural image restoration, they exhibit fundamental inefficiencies in specific infrared (IR) scenarios. These scenarios include traffic surveillance, autonomous driving, and roadside monitoring, which operate from fixed or nearly static viewpoints.

Key Issues Identified:

Structural Amnesia: Existing models treat every frame as an independent input, failing to exploit the strong, persistent spatial priors inherent in fixed-view scenes (e.g., the road is always at the bottom, the sky at the top).
Redundant Learning: Because these models do not explicitly encode scene layout, they must implicitly relearn the same spatial regularities for every frame. This wastes the model's "attention budget" on low-information regions and slows convergence.
Inefficiency: The dynamic global context modeling capability of modern Transformers becomes a liability in static environments, as the model expends significant capacity rediscovering redundant information rather than focusing on frame-specific details.

2. Methodology: RPT-SR

The authors propose RPT-SR (Regional Prior attention Transformer), a novel architecture designed to explicitly encode scene layout information into the attention mechanism.

Core Concept: Dual-Token Framework

The architecture introduces a Regional Prior Attention (RPA) mechanism that fuses two distinct types of information carriers:

Learnable Regional Prior (R.P.) Tokens (Static):
- These act as a persistent memory for the scene's global structure.
- They are learnable parameters indexed by macro-window locations.
- They are shared across all images in the dataset and optimized end-to-end over training epochs to capture the invariant spatial layout (e.g., the statistical distribution of textures and objects in each region of the frame).
Local Tokens (Dynamic):
- Generated from the current input image to capture frame-specific content and unique details.
- These are distilled from the feature map of the specific input.
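
The two token types can be illustrated with a small NumPy sketch. It is not the authors' implementation: the average pooling used to distill local tokens, the function name `make_local_tokens`, and all sizes are assumptions chosen for illustration. The point it shows is that the prior table is a single array shared by every frame, while local tokens are recomputed per frame.

```python
import numpy as np

def make_local_tokens(features: np.ndarray, macro: int) -> np.ndarray:
    """Average-pool a (H, W, C) feature map into one token per macro-window.

    Average pooling is an assumption of this sketch; the summary only says
    that local tokens are distilled from the input's feature map.
    """
    h, w, c = features.shape
    assert h % macro == 0 and w % macro == 0, "pad the feature map first"
    gh, gw = h // macro, w // macro
    # Split each spatial axis into (grid index, offset inside the window).
    windows = features.reshape(gh, macro, gw, macro, c)
    return windows.mean(axis=(1, 3))  # shape (gh, gw, C)

rng = np.random.default_rng(0)
H, W, C, MACRO = 8, 12, 4, 4  # hypothetical sizes

# Static side: one learnable vector per macro-window location. During
# training an optimizer would update this table; it is shared by all images.
prior_tokens = rng.normal(scale=0.02, size=(H // MACRO, W // MACRO, C))

# Dynamic side: tokens recomputed from each input frame's features.
frame_a = rng.normal(size=(H, W, C))
frame_b = rng.normal(size=(H, W, C))
local_a = make_local_tokens(frame_a, MACRO)
local_b = make_local_tokens(frame_b, MACRO)

print(local_a.shape, prior_tokens.shape)  # (2, 3, 4) (2, 3, 4)
```

Both frames index the same `prior_tokens` grid, so whatever the table learns at a given location is available to every frame at that location.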

Architecture Flow

Shallow Feature Stem: Converts the Low-Resolution (LR) input to a feature map without absolute positional encoding to maintain size agnosticism.
Deep RPA Body: The core consists of cascaded RPA Blocks.
- Token Fusion: At each macro-window location, the Local Token and the Regional Prior Token are concatenated to form a Dynamic Token.
- Hierarchical Attention: The model employs a hierarchical windowing strategy, similar to the window-based attention of SwinIR but extended with the macro-window dynamic tokens.
- Attention Mechanism:
  - Stage 1: Dynamic tokens undergo self-attention to exchange global information.
  - Stage 2: Refined dynamic tokens are prepended to the window tokens. The attention mechanism then processes the concatenated sequence, allowing the static prior to dynamically modulate the reconstruction of local details.
Reconstruction Head: Aggregates features, upsamples via pixel-shuffle, and outputs the High-Resolution (HR) image.
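
The fusion and two-stage attention steps above can be sketched in NumPy as follows. This is a simplified reading of the description, not the paper's code: it uses a single attention head with no learned query/key/value projections, and the projection `w_fuse` that maps the concatenated token back to the channel width is an assumption, as are the function names and sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def rpa_block_sketch(window_tokens, local_tokens, prior_tokens, w_fuse):
    """window_tokens: (N, T, C) for N macro-windows of T tokens each.
    local_tokens, prior_tokens: (N, C). w_fuse: (2C, C), an assumed projection.
    """
    # Token fusion: concatenate local and prior tokens, project back to C.
    dynamic = np.concatenate([local_tokens, prior_tokens], axis=-1) @ w_fuse
    # Stage 1: dynamic tokens attend to each other across all macro-windows.
    dynamic = attention(dynamic, dynamic, dynamic)            # (N, C)
    # Stage 2: prepend each refined dynamic token to its own window's tokens
    # and run attention over the concatenated sequence.
    seq = np.concatenate([dynamic[:, None, :], window_tokens], axis=1)
    out = attention(seq, seq, seq)                            # (N, 1 + T, C)
    # Keep only the window tokens; they now carry prior-modulated features.
    return out[:, 1:, :]

rng = np.random.default_rng(0)
N, T, C = 6, 16, 8  # hypothetical sizes
windows = rng.normal(size=(N, T, C))
local = windows.mean(axis=1)            # assumed pooling for local tokens
prior = rng.normal(size=(N, C))         # stands in for learned prior tokens
w_fuse = rng.normal(size=(2 * C, C)) / np.sqrt(2 * C)

out = rpa_block_sketch(windows, local, prior, w_fuse)
print(out.shape)  # (6, 16, 8)
```

Because the prepended token acts as an extra key and value in Stage 2, changing `prior` changes every output token in the window, which is the sense in which the static prior modulates local reconstruction.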

3. Key Contributions

Regional Prior Attention (RPA): A novel attention mechanism that implements a dual-token architecture. It explicitly fuses persistent, static priors with dynamic, frame-specific tokens to encode spatial priors of fixed-viewpoint scenes, effectively solving the "structural amnesia" problem.
Broad Applicability & Versatility: Unlike most prior work, which focuses on a single band, the authors validate RPT-SR across two physically distinct infrared spectra:
- Long-Wave Infrared (LWIR): Captures thermal radiation (emitted light).
- Short-Wave Infrared (SWIR): Captures reflected light (similar to the visible spectrum, but penetrates fog and smoke).
- RPT-SR achieves state-of-the-art (SOTA) performance on both, suggesting that the regional prior mechanism learns underlying structural regularities that hold regardless of imaging physics.
Efficiency: The approach achieves significant performance gains with only a marginal increase in computational cost (FLOPs) and parameters compared to baseline models.
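
A back-of-the-envelope calculation shows why a design of this kind can stay cheap. Every number below is hypothetical and not taken from the paper, and the sketch assumes one prior vector per macro-window per block and one prepended token per window; it ignores any extra projection layers.

```python
def prior_token_params(grid_h: int, grid_w: int, dim: int, num_blocks: int) -> int:
    """Parameters added by one learnable vector per macro-window per block."""
    return grid_h * grid_w * dim * num_blocks

def stage2_cost_ratio(tokens_per_window: int) -> float:
    """Relative attention cost when one token is prepended to each window.

    Self-attention cost grows with the square of the sequence length,
    so the ratio is ((T + 1) / T) ** 2.
    """
    t = tokens_per_window
    return ((t + 1) / t) ** 2

# Hypothetical configuration, chosen only to illustrate the arithmetic.
GRID_H, GRID_W = 8, 8          # macro-window grid
DIM = 180                      # channel width
NUM_BLOCKS = 6                 # number of RPA blocks
BASELINE_PARAMS = 12_000_000   # size of an assumed baseline model

extra = prior_token_params(GRID_H, GRID_W, DIM, NUM_BLOCKS)
print(extra)                                     # 69120
print(f"{100 * extra / BASELINE_PARAMS:.2f}%")   # 0.58%
print(f"{stage2_cost_ratio(64):.3f}")            # 1.031
```

Under these assumptions the prior tokens add well under one percent to the parameter count, and the longer attention sequence costs about three percent more per window.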

4. Experimental Results

The model was evaluated on multiple datasets: M3FD and TNO (LWIR), and RASMD (SWIR), covering both $\times2$ and $\times4$ upscaling tasks.

Quantitative Performance:

Metrics: Evaluated using LPIPS (perceptual similarity), MUSIQ, and MANIQA (no-reference perceptual quality), in addition to traditional PSNR/SSIM.
LWIR (M3FD $\times4$): RPT-SR achieved the best scores in LPIPS (0.1038; lower is better) and MANIQA (0.2621; higher is better), outperforming strong baselines such as SwinIR, HAT, and DAT.
SWIR (RASMD $\times4$): RPT-SR set a new SOTA with an LPIPS of 0.1535.
Generalization: The model maintained competitive performance on the TNO dataset (cross-dataset generalization) and across both scaling factors ($\times2$ and $\times4$).
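
For readers unfamiliar with the traditional metrics listed above, PSNR is computed from the mean squared error between the ground-truth image and the reconstruction. The sketch below is a generic implementation, not the paper's evaluation script; the 8-bit peak value of 255 and the tiny example images are assumptions.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2-D images.

    Images are given as lists of rows; max_value is the largest possible
    pixel value (255 assumes 8-bit data).
    """
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est) or not ref:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 example: every pixel is off by exactly one gray level (MSE = 1).
hr = [[10, 200], [128, 64]]
sr = [[11, 199], [129, 63]]
print(round(psnr(hr, sr), 2))  # 48.13
```

PSNR rewards pixel-wise fidelity, which is why it is reported alongside perceptual metrics such as LPIPS that track visual similarity more closely.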

Qualitative Performance:

Visual comparisons show RPT-SR reconstructs sharper details and more plausible textures than competitors.
It excels at preserving structural integrity in human silhouettes and building facades, without introducing the ringing artifacts or over-sharpening common in other methods.
It effectively mitigates noise amplification in low-contrast scenes.

Ablation Study:

Baseline (Local only): Performs well but lacks structural guidance.
Static Only (Prior only): Improves some metrics but fails to reconstruct fine, frame-specific details.
Full RPT (Fusion): Combining both tokens yields the best results, confirming that frame-specific local information must be modulated by persistent regional priors for optimal reconstruction.

5. Significance

Paradigm Shift for Fixed-View SR: The paper challenges the assumption that general-purpose SR models are optimal for all scenarios. It demonstrates that for fixed-view applications (surveillance, autonomous driving), explicitly encoding scene layout is more efficient than relying solely on dynamic global context.
Cost-Effective High-Res IR: By enabling high-quality super-resolution on low-cost, low-resolution IR sensors, RPT-SR offers a practical alternative to expensive high-resolution hardware, which is crucial for deploying robust all-weather perception systems.
Cross-Modal Robustness: The success across both LWIR and SWIR spectra suggests the method captures fundamental geometric and structural regularities of scenes, making it a versatile tool for diverse infrared imaging applications.