NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures

Imagine you are a detective trying to figure out what changed in a city over the last year. You have two photos: one taken last January and one taken this January. Your job is to point out exactly where new buildings went up, where trees were cut down, or where a flood happened.

This is the job of Remote Sensing Change Detection. But it's tricky. The photos might be slightly crooked (like taking a picture from a slightly different angle), the lighting might be different (sunny vs. cloudy), or the seasons might have changed the color of the grass. These "fake changes" can confuse even the smartest computer programs.

For a while, the hottest new technology for this job was called Mamba (a type of State Space Model). Think of Mamba like a very efficient, single-file line of people passing a note down a long hallway. It's fast and great for reading long stories, but because it reads things one by one in a line, it sometimes struggles to understand the shape of things in a 2D photo, like the exact outline of a building.

The authors of this paper, which introduces NeXt2Former-CD, decided to try a different approach. They asked: "What if we don't use the new 'line' technology, but instead use the best old-school tools, just upgraded with the latest superpowers?"

Here is how their new system works, explained with simple analogies:

1. The Super-Smart Eyes (The Backbone)

Instead of teaching the computer to learn from scratch, they gave it DINOv3 glasses.

The Analogy: Imagine hiring a detective who has already memorized every single building, tree, and road in the world from a massive library of photos. They don't need to be taught what a "roof" looks like; they already know it instantly.
The Tech: They used a ConvNeXt model (a modern version of a classic photo-recognizer) whose weights were pre-trained with DINOv3 on a massive image dataset. This gives the system a huge head start.

2. The "Wiggle-Room" Comparison (Deformable Attention)

This is the secret sauce. When comparing the two photos, the computer needs to match a house in Photo A to the same house in Photo B. But what if the photos are slightly shifted?

The Analogy: Imagine trying to match two jigsaw puzzles that are slightly misaligned. A rigid computer might say, "These pieces don't match!" because they are off by a millimeter.
The Solution: The authors used Deformable Attention. Think of this as giving the computer "elastic fingers." If the computer sees a roof in the first photo, its "fingers" can stretch and wiggle slightly to grab the matching roof in the second photo, even if it's a tiny bit off-center. This makes the system much more tolerant of the "crooked photos" problem.

3. The Master Editor (Mask2Former Decoder)

Once the computer finds the differences, it needs to draw a clean map of exactly where the changes are.

The Analogy: Imagine the computer has a rough sketch of the changes. The Mask2Former decoder is like a professional editor with a fine-tipped pen. It looks at the rough sketch and traces the edges perfectly, ensuring the new building looks like a building and not a jagged, messy blob. It also ignores the "noise" (like shadows or seasonal color changes) so it only highlights the real changes.

The Results: Why It Matters

The authors tested their "Detective with Elastic Fingers" against the current champions (the Mamba models) on three major datasets (like a final exam).

Accuracy: Their system won. It found more changes and made fewer mistakes. It was better at drawing clean lines around buildings and ignoring fake changes caused by seasons.
Speed: You might think, "If it's so smart and uses elastic fingers, it must be slow, right?" Surprisingly, no. Even though the system is much bigger and does more computation, it runs at roughly the same speed as the Mamba models on modern graphics cards. It's like a heavy truck that still keeps pace with a motorcycle on the highway.

The Big Takeaway

For a while, everyone thought the only way forward was to use these new "State Space" (Mamba) models. This paper says: "Wait a minute! If we combine the best pre-trained eyes, flexible matching, and a sharp editor, we can actually do better than the new trend, without sacrificing speed."

It's a reminder that sometimes, the best innovation isn't inventing a completely new engine, but rather tuning the existing one to perfection.

1. Problem Statement

Remote Sensing Change Detection (CD) aims to identify semantic changes between bi-temporal images (pre- and post-event). Despite recent advances, the field faces several critical challenges:

Pseudo-changes: Distinguishing true semantic changes from artifacts caused by illumination variations, seasonal effects, noise, and imperfect image co-registration.
Spatial Misalignment: Even with orthorectification, bi-temporal pairs often exhibit small residual spatial offsets and object displacements.
Architectural Trade-offs:
- CNNs: Strong local inductive biases but limited receptive fields for global context.
- Transformers: Excellent global modeling but computationally expensive (quadratic complexity) on high-resolution imagery.
- State Space Models (SSMs/Mamba): Recently popular for their linear scaling and efficiency in long-context modeling. However, they require serializing 2D features into 1D scan orders, which can disrupt spatial locality and boundary alignment depending on the traversal strategy.
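The serialization issue in the last bullet can be made concrete with a small sketch. It assumes a plain row-by-row (raster) scan and a hypothetical feature-map width of 64; actual SSM-based models may use several scan directions, so this illustrates only the general effect, not any specific model's traversal.

```python
def row_major_index(row, col, width):
    """Position of pixel (row, col) in a row-by-row (raster) scan."""
    return row * width + col


def scan_gap(pixel_a, pixel_b, width):
    """How far apart two pixels land once the 2D grid is flattened to 1D."""
    return abs(row_major_index(*pixel_a, width) - row_major_index(*pixel_b, width))


width = 64  # hypothetical feature-map width, chosen for illustration

# Left-right neighbours stay adjacent in the sequence...
horizontal_gap = scan_gap((10, 10), (10, 11), width)
# ...but up-down neighbours end up a full row apart.
vertical_gap = scan_gap((10, 10), (11, 10), width)

print(horizontal_gap, vertical_gap)
```

Two pixels that touch vertically in the image are separated by a whole row in the sequence, which is one way a 1D scan can weaken spatial locality along object boundaries.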

The authors argue that while SSMs are efficient, modern convolutional and attention-based architectures, when combined with strong pre-training, may offer a superior alternative that retains strong 2D inductive biases and improves accuracy without giving up practical inference speed.

2. Methodology: NeXt2Former-CD

The proposed framework is an end-to-end Siamese network designed to handle temporal reasoning while tolerating spatial shifts. It consists of three main components:

A. Siamese DINOv3 Backbone (Encoder)

Architecture: Utilizes a ConvNeXt-Large encoder initialized with weights from DINOv3 (pre-trained on the massive LVD-1689M web dataset).
Mechanism: Two parallel branches share weights to process the pre-change ($I_1$) and post-change ($I_2$) images.
Output: Extracts multi-scale feature maps at strides of 4, 8, 16, and 32 pixels. The DINOv3 initialization provides robust, transferable semantic representations.
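The Siamese, multi-scale idea can be sketched in a few lines. This is a toy illustration only: `toy_encoder` is a hypothetical stand-in that block-averages a single-channel image, not ConvNeXt, and the 64x64 input size is an assumption. What it does show is that one encoder is applied to both dates and that each stride yields a feature map of size (H / stride, W / stride).

```python
STRIDES = (4, 8, 16, 32)  # strides named in the text


def block_mean(image, stride):
    """Average non-overlapping stride x stride blocks of a 2D list."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height - stride + 1, stride):
        row = []
        for c in range(0, width - stride + 1, stride):
            block = [image[r + i][c + j] for i in range(stride) for j in range(stride)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def toy_encoder(image):
    """Hypothetical stand-in for the backbone: one feature map per stride."""
    return [block_mean(image, s) for s in STRIDES]


def siamese_encode(encoder, image_t1, image_t2):
    """Apply the same encoder (i.e. shared weights) to both dates."""
    return encoder(image_t1), encoder(image_t2)


# Pre-change image is empty; post-change image gains an 8x8 "building".
image_t1 = [[0.0] * 64 for _ in range(64)]
image_t2 = [row[:] for row in image_t1]
for r in range(8):
    for c in range(8):
        image_t2[r][c] = 1.0

feats_t1, feats_t2 = siamese_encode(toy_encoder, image_t1, image_t2)
shapes = [(len(f), len(f[0])) for f in feats_t1]
print(shapes)
```

Because both branches run the same function, any difference between `feats_t1` and `feats_t2` comes from the images themselves rather than from the encoder.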

B. Spatiotemporal Feature Interaction

To address the challenges of noise and misalignment, the authors introduce a two-stage interaction module at each scale:

Feature Rectify Module (FRM): Inspired by Sigma, this module computes channel and spatial weights based on the concatenation of features from both time steps. It "rectifies" features to highlight regions of interest and suppress pseudo-changes (e.g., seasonal noise).
Feature Fusion Module (FFM): Instead of standard cross-attention, the authors employ Deformable Attention.
- Rationale: Deformable attention allows adaptive sampling around spatial locations, making it highly effective at handling the residual spatial offsets and boundary misalignments common in bi-temporal remote sensing pairs.
- Output: Produces a fused feature map $Z_i$ for each scale.
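The adaptive sampling behind deformable attention can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the paper's FFM: features are scalars on a tiny 5x5 grid, and the offsets and scores are supplied by hand, whereas in deformable attention they are predicted from the query features by learned layers. All function names are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D grid at a fractional (y, x), clamping to the border."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_sample(feature, ref_point, offsets, scores):
    """Weighted sum of samples taken at ref_point + offset."""
    weights = softmax(scores)
    ref_y, ref_x = ref_point
    return sum(
        w * bilinear_sample(feature, ref_y + off_y, ref_x + off_x)
        for w, (off_y, off_x) in zip(weights, offsets)
    )


# Post-change features: the "roof" sits at (2, 3), one pixel right of
# where the pre-change query at (2, 2) expects it.
feature_t2 = [[0.0] * 5 for _ in range(5)]
feature_t2[2][3] = 1.0

rigid = bilinear_sample(feature_t2, 2, 2)  # fixed lookup misses the roof
shifted = deformable_sample(
    feature_t2, (2, 2), offsets=[(0.0, 0.0), (0.0, 1.0)], scores=[0.0, 0.0]
)
print(rigid, shifted)
```

The rigid lookup returns nothing at the query location, while the version with a one-pixel offset picks up the displaced response. That is the sense in which the sampling tolerates small residual shifts.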

C. Mask2Former Decoder

Architecture: Adapts the Mask2Former decoder, which uses a pixel decoder for high-resolution embeddings and a transformer decoder with learnable query embeddings.
Query-to-Pixel Aggregation: For each query, the model predicts class logits and a soft mask. These per-query predictions are aggregated into a dense pixel-wise change map using a log-sum-exp operation.
Hybrid Loss Function: To improve optimization stability and ensure complete pixel coverage, the training objective combines:
1. Set Loss ($L_{set}$): Standard Hungarian matching loss from Mask2Former (classification + mask losses).
2. Pixel-wise Loss ($L_{pixel}$): A dense weighted cross-entropy loss applied to the aggregated logits against the ground truth.
- Weighting: $\lambda_{set} = 0.1$, $\lambda_{pixel} = 10$.
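A rough sketch of the aggregation and the weighted objective is given below. Two things are assumptions rather than details taken from the paper: that each query's contribution at a pixel is its class logit plus its mask logit, and that the total loss is the weighted sum $\lambda_{set} L_{set} + \lambda_{pixel} L_{pixel}$. The function names and the numeric inputs are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def aggregate_pixel_logit(class_logits, mask_logits_at_pixel):
    """Combine per-query scores into one change logit for a single pixel.

    Assumption: a query's contribution is class logit + mask logit.
    """
    combined = [c + m for c, m in zip(class_logits, mask_logits_at_pixel)]
    return logsumexp(combined)


def hybrid_loss(set_loss, pixel_loss, lambda_set=0.1, lambda_pixel=10.0):
    """Assumed form of the objective, using the weights quoted in the text."""
    return lambda_set * set_loss + lambda_pixel * pixel_loss


# Two queries: the first is confident about this pixel, the second is not.
pixel_logit = aggregate_pixel_logit([2.0, -3.0], [4.0, -1.0])

# Made-up loss values, only to show how the weights rescale each term.
total = hybrid_loss(set_loss=2.0, pixel_loss=0.05)
print(pixel_logit, total)
```

Log-sum-exp behaves like a smooth maximum, so the pixel logit is dominated by whichever query responds most strongly there while staying differentiable with respect to all queries.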

3. Key Contributions

Architectural Re-evaluation: The paper challenges the recent trend of exclusively adopting SSMs (Mamba) for CD, demonstrating that optimized 2D convolutional and attention-based architectures can outperform them.
Deformable Attention for Temporal Fusion: The integration of deformable attention in the fusion stage specifically addresses the problem of residual co-registration noise and object displacement, a weakness in standard attention and rigid SSM scanning strategies.
Foundation Model Integration: Successfully leverages DINOv3-pretrained ConvNeXt for remote sensing CD, showing that large-scale self-supervised pre-training significantly boosts performance.
Hybrid Supervision: The combination of set-based (query) and dense pixel-wise losses provides a robust training signal that improves convergence and boundary precision.

4. Experimental Results

The method was evaluated on three standard benchmarks: LEVIR-CD, WHU-CD, and CDD.

Performance: NeXt2Former-CD achieved State-of-the-Art (SOTA) results across all datasets, outperforming recent Mamba-based baselines (including M-CD, ChangeMamba, and CDMamba).
- Example (LEVIR-CD): Achieved an F1 score of 0.955 and IoU of 0.914, narrowly surpassing the previous best, M-CD (F1: 0.954, IoU: 0.911).
- Generalization: Showed consistent improvements in Overall Accuracy (OA) as well.
Efficiency: Despite having a significantly larger parameter count (392M vs. 69.8M for M-CD) and higher GFLOPs, the inference latency on an RTX 5090 GPU was comparable (36.79 ms vs. 33.84 ms). This is attributed to the high parallelism of convolutional and attention layers on modern GPUs.
Qualitative Analysis:
- Boundary Precision: The model produces smoother, more accurate boundaries for large structures compared to the jagged edges of M-CD.
- Noise Suppression: Effectively reduces false positives in unchanged areas caused by seasonal variations.
- Completeness: Detects changed objects more completely in complex scenes where baselines miss significant portions.
Ablation Studies:
- Replacing Deformable Attention with standard Cross-Attention resulted in lower F1/IoU, confirming its necessity for handling spatial shifts.
- The Hybrid Loss configuration yielded better metrics than using only Cross-Entropy or only Set Loss.
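For readers less familiar with the metrics quoted in this section, the sketch below shows how F1 and IoU for the change class follow from pixel counts, and how the two are tied together by IoU = F1 / (2 - F1) when both are computed from the same counts. The counts are hypothetical and chosen only so that F1 comes out at 0.955; they are not taken from the paper.

```python
def f1_and_iou(tp, fp, fn):
    """F1 and IoU for the change class from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return f1, iou


def iou_from_f1(f1):
    """Identity linking the two metrics when computed from the same counts."""
    return f1 / (2 - f1)


# Hypothetical pixel counts, picked to give F1 = 0.955.
f1, iou = f1_and_iou(tp=955, fp=45, fn=45)
print(round(f1, 3), round(iou, 3), round(iou_from_f1(0.955), 3))
```

Under this identity, an F1 of 0.955 corresponds to an IoU of about 0.914, which matches the LEVIR-CD figures reported above. It also means IoU differences tend to look slightly larger than the matching F1 differences.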

5. Significance

The paper provides compelling evidence that State Space Models are not the only path forward for efficient remote sensing change detection. By combining:

Strong Pre-training (DINOv3),
Robust 2D Inductive Biases (ConvNeXt),
Geometric Flexibility (Deformable Attention), and
Advanced Decoding (Mask2Former),

the authors demonstrate that modern vision architectures can achieve superior accuracy and competitive inference speeds. This work motivates a broader re-examination of architecture choices for high-resolution remote sensing, suggesting that well-optimized 2D designs remain highly competitive against SSM-centric approaches.