Deformation-Free Cross-Domain Image Registration via Position-Encoded Temporal Attention

This paper introduces GPEReg-Net, a framework for deformation-free cross-domain image registration. It factorizes images into a domain-invariant scene component and an appearance component, recombines them via AdaIN, and adds a position-encoded temporal attention mechanism, achieving state-of-the-art accuracy and efficiency on retinal and synthetic benchmarks.

Yiwen Wang, Jiahao Qin

Published 2026-03-03

Imagine you have two photos of the same city street, but they were taken under very different conditions. One is a crisp, high-definition photo taken at noon (the Fixed Image), and the other is a blurry, yellow-tinted photo taken at sunset (the Moving Image).

Your goal is to take the sunset photo and make it look exactly like the noon photo—matching the buildings' positions perfectly while also changing the colors to match the bright daylight.

Usually, computers try to do this by "warping" the image, stretching and squishing pixels like taffy to make them fit. This is slow, computationally heavy, and often fails if the lighting is too different.
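To make the "warping" concrete: classic registration predicts a dense deformation field, a per-pixel offset telling the computer where to fetch each pixel from. Here is a minimal toy sketch (plain NumPy, nearest-neighbour sampling for brevity; real methods use bilinear interpolation and learn the field):

```python
import numpy as np

def warp_with_deformation_field(moving, field):
    """Resample every pixel of `moving` at a per-pixel offset given by
    a dense deformation `field` of shape (H, W, 2) = (dy, dx).
    Nearest-neighbour sampling keeps this sketch short; it is a toy
    illustration, not any paper's actual implementation."""
    h, w = moving.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + field[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + field[..., 1]).astype(int), 0, w - 1)
    return moving[src_y, src_x]

# Shift a tiny 4x4 image one pixel to the left using a constant field.
img = np.arange(16, dtype=float).reshape(4, 4)
field = np.zeros((4, 4, 2))
field[..., 1] = 1.0  # every pixel samples from one column to its right
warped = warp_with_deformation_field(img, field)
```

Predicting and applying a field like this for every pixel is exactly the expensive step that GPEReg-Net avoids.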

This paper introduces a new, smarter way to do it called GPEReg-Net. Here is how it works, explained through simple analogies:

1. The "Recipe vs. The Ingredients" (Factorization)

The core idea is that every image is actually made of two separate things:

  • The Scene (The Blueprint): This is the structure of the world. It's the shape of the buildings, the trees, and the roads. It doesn't care if the photo is black and white, sepia, or neon green.
  • The Appearance (The Paint): This is the style. It's the lighting, the color temperature, and the contrast.

The Old Way: Trying to stretch the "Paint" to fit the "Blueprint" while simultaneously trying to fix the "Blueprint." The two jobs get tangled together, and the result is a mess.

The New Way (GPEReg-Net): The computer acts like a master chef who separates the ingredients from the recipe.

  1. It looks at the sunset photo and strips away all the "sunset paint," leaving only the raw Blueprint (the shape of the street).
  2. It looks at the noon photo and extracts only the Paint (the bright daylight colors).
  3. It takes the sunset Blueprint and instantly "pours" the noon Paint over it.

Because it doesn't have to stretch pixels around (no "deformation field"), it's much faster and cleaner. It's like swapping a car's body paint without ever taking the car apart.
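The "pour the Paint over the Blueprint" step is what AdaIN (Adaptive Instance Normalization) does: strip each channel of its mean and standard deviation (the paint), then re-scale with the statistics of the other image. A minimal NumPy sketch of the idea, not the paper's exact layer:

```python
import numpy as np

def adain(scene_feat, appearance_feat, eps=1e-5):
    """Adaptive Instance Normalization over (C, H, W) feature maps:
    remove the scene features' per-channel mean/std (the 'paint'),
    then apply the appearance features' statistics instead."""
    c_mean = scene_feat.mean(axis=(1, 2), keepdims=True)
    c_std = scene_feat.std(axis=(1, 2), keepdims=True)
    s_mean = appearance_feat.mean(axis=(1, 2), keepdims=True)
    s_std = appearance_feat.std(axis=(1, 2), keepdims=True)
    normalized = (scene_feat - c_mean) / (c_std + eps)  # blueprint only
    return normalized * s_std + s_mean                  # repainted

rng = np.random.default_rng(0)
blueprint = rng.normal(2.0, 3.0, size=(8, 16, 16))  # "sunset" structure
paint = rng.normal(0.5, 0.1, size=(8, 16, 16))      # "noon" appearance
out = adain(blueprint, paint)
```

After the swap, `out` keeps the blueprint's spatial pattern but carries the paint's per-channel statistics, which is the whole trick.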

2. The "Time-Traveling GPS" (Position-Encoded Attention)

The paper mentions "sequential acquisitions." Imagine you aren't just looking at one photo, but a video stream of frames.

The model has a special feature called Position-Encoded Temporal Attention. Think of this as a Time-Traveling GPS.

  • When the computer looks at the current frame, it doesn't just look at that single moment.
  • It checks its "memory" of the previous few frames (the neighbors).
  • It uses a "GPS" to know exactly where it is in time.
  • This helps the computer understand that a tree moving slightly in frame 5 is the same tree in frame 6, even if the lighting changes. It keeps the video smooth and consistent, like a director ensuring continuity between movie shots.
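The steps above can be sketched as standard attention over a window of frame features, with a sinusoidal position encoding playing the role of the time "GPS." This is a hypothetical single-head toy (identity projections, no learned weights), not the paper's exact mechanism:

```python
import numpy as np

def sinusoidal_pe(t_steps, dim):
    """Classic sinusoidal position encoding: tells the model where
    each frame sits in time."""
    pos = np.arange(t_steps)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((t_steps, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def temporal_attention(frames):
    """Each frame attends to its neighbours in the window.
    `frames` has shape (T, D). Toy sketch: queries/keys are the
    position-encoded features themselves."""
    T, D = frames.shape
    x = frames + sinusoidal_pe(T, D)      # inject the time "GPS"
    scores = x @ x.T / np.sqrt(D)         # frame-to-frame relatedness
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over time
    return weights @ frames               # blend each frame with neighbours

rng = np.random.default_rng(1)
clip = rng.normal(size=(6, 32))           # 6 frames, 32-dim features each
fused = temporal_attention(clip)
```

Because each output frame is a weighted blend of the whole window, a tree seen in frame 5 keeps informing frame 6 even when the lighting shifts.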

3. The Results: Faster and Sharper

The authors tested this on two very different challenges:

  • Retinal Eye Scans: Taking photos of the inside of an eye, where the tissue is semi-rigid and lighting varies wildly.
  • Synthetic Textures: Computer-generated patches of patterns with random rotations and shifts.

The Outcome:

  • Better Quality: Their method produced sharper, more accurate images than previous state-of-the-art methods (which tried to stretch pixels).
  • Much Faster: It runs 1.87 times faster than the previous best method (SAS-Net).
  • Real-Time: It's fast enough to be used in live medical procedures or real-time video processing.

Summary

In short, instead of trying to force two mismatched images to fit together by stretching them like rubber bands, this new method says: "Let's just take the shape from one and the color from the other, and snap them together."

By separating the structure from the style and using a time-aware GPS to keep things consistent, they created a system that is faster, more accurate, and works across different types of cameras and lighting conditions.