NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

NeighborMAE is a self-supervised learning framework that improves Earth Observation image representations by exploiting the spatial dependencies between neighboring images. It jointly reconstructs masked image pairs and uses a dynamic heuristic to set mask ratios and loss weights, outperforming existing baselines across a range of downstream tasks.

Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen

Published 2026-03-04

Imagine you are trying to learn how to recognize different types of trees in a forest.

The Old Way (Traditional AI):
Most current AI models learn by looking at one single photo of a tree at a time. They are given a picture, and the computer is told to "guess" what the missing parts of the tree look like based on the parts it can see. It's like trying to solve a jigsaw puzzle while wearing blinders, only looking at one tiny corner of the puzzle at a time. While this works okay, it misses the bigger picture. It doesn't know that the tree in the photo is part of a larger forest, or that the ground next to it is usually covered in moss.

The Problem:
The Earth is continuous. A satellite doesn't just take one photo; it takes thousands of overlapping photos as it flies over. These neighboring photos are like pieces of a giant, seamless mosaic. But existing AI models treat each photo as an isolated island, ignoring the fact that the landscape flows from one image to the next.

The New Solution: NeighborMAE
The authors behind NeighborMAE came up with a clever idea: why not teach the AI to look at two neighboring photos at the same time?

Think of it like this:

  • The Old Way: You are shown a photo of a house, and you have to guess what the roof looks like because part of it is covered by a cloud. You have to guess based only on the walls you can see.
  • NeighborMAE: You are shown the house and the photo of the house next door (or the same house taken a few seconds later). Even if the roof is hidden in the first photo, you might see the roof clearly in the second photo, or you might see the garden next door which gives you a clue about the style of the house.

How It Works (The Magic Tricks):

  1. The "Neighboring" Pair: The system grabs two satellite images that overlap slightly. It stitches them together in the computer's mind so the AI understands they are neighbors.
  2. The "Masking" Game: To make the AI learn hard, the system covers up (masks) random parts of both images with digital "clouds."
  3. The Challenge: The AI has to fill in the parts hidden by the clouds. But here's the twist: if a part of the house is hidden in Image A but visible in Image B, the AI has to use that information to guess what's missing in Image A. It's like solving a puzzle where you have two different views of the same scene to help you.
  4. Smart Difficulty: The system is smart about how hard it makes the game. If the two images overlap a lot (showing almost the same thing), it covers up more of the image to make it harder. If they are very different, it covers up less. This keeps the AI on its toes.
  5. No Cheating: Sometimes, if the two images are almost identical, the AI might try to "cheat" by just copying the visible part from Image B to fill the hole in Image A. The authors added a special rule (a "loss weight") to punish the AI if it just copies and pastes without actually understanding the scene. It forces the AI to learn the structure of the world, not just memorize patterns.
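To make the "smart difficulty" and "no cheating" ideas concrete, here is a minimal numpy sketch of the plumbing: random patch masking on a pair of images, a mask ratio that grows with their overlap, and a loss weight that shrinks when the pair overlaps heavily. The specific formulas (`adaptive_mask_ratio`, the `1 / (1 + alpha * overlap)` weight) are illustrative assumptions, not the paper's exact heuristics.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_mask_ratio(overlap, base=0.6, span=0.3):
    """More overlap between the pair -> harder game (higher mask ratio).
    `overlap` is the fraction of shared ground footprint, in [0, 1].
    This linear schedule is an assumption for illustration."""
    return base + span * overlap

def mask_patches(num_patches, ratio, rng):
    """Randomly hide a fraction of patch tokens, MAE-style."""
    n_hidden = int(round(num_patches * ratio))
    idx = rng.permutation(num_patches)
    hidden = np.zeros(num_patches, dtype=bool)
    hidden[idx[:n_hidden]] = True
    return hidden

def weighted_reconstruction_loss(pred, target, hidden, overlap, alpha=1.0):
    """Mean squared error on hidden patches, down-weighted when the pair
    overlaps heavily -- a stand-in for the paper's anti-copying loss weight."""
    weight = 1.0 / (1.0 + alpha * overlap)   # illustrative assumption
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return weight * per_patch[hidden].mean()

# Toy pair of images, each represented as 16 patch embeddings of dim 8.
num_patches, dim = 16, 8
img_a = rng.normal(size=(num_patches, dim))
img_b = rng.normal(size=(num_patches, dim))

overlap = 0.5                                  # pretend 50% shared footprint
ratio = adaptive_mask_ratio(overlap)           # 0.6 + 0.3 * 0.5 = 0.75
mask_a = mask_patches(num_patches, ratio, rng)
mask_b = mask_patches(num_patches, ratio, rng)

# A real model would reconstruct each image from the visible tokens of
# BOTH images; here a dummy prediction just exercises the loss plumbing.
pred_a = rng.normal(size=(num_patches, dim))
loss = weighted_reconstruction_loss(pred_a, img_a, mask_a, overlap)
```

The point of the down-weighting is the same as step 5 above: when the two views are nearly identical, a copied answer is too easy, so its reward is reduced.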

Why Does This Matter?
By learning from neighbors, the AI builds a much stronger mental map of the world. It understands that a road doesn't just stop at the edge of a photo; it continues. It understands that a forest has a texture that flows across boundaries.

The Results:
When they tested this new AI on real-world tasks—like counting buildings, detecting fires, or mapping forests—it performed significantly better than the old "single-photo" models. It was even competitive with massive, super-complex models that use many different types of sensors, but NeighborMAE did it using just standard RGB (color) photos.

In a Nutshell:
NeighborMAE is like teaching a student geography not by showing them isolated flashcards of cities, but by showing them a map where they can see how one city connects to the next. It turns the AI from a myopic observer into a global thinker, using the natural continuity of the Earth to learn faster and smarter.