NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

NeighborMAE is a self-supervised learning framework that improves Earth Observation image representations by exploiting the spatial dependencies between neighboring images. It jointly reconstructs masked image pairs and uses a dynamic heuristic to set mask ratios and loss weights, outperforming existing baselines across a range of downstream tasks.

Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen

Published 2026-03-04

Imagine you are trying to learn how to recognize different types of trees in a forest.

The Old Way (Traditional AI):
Most current AI models learn by looking at one single photo of a tree at a time. They are given a picture, and the computer is told to "guess" what the missing parts of the tree look like based on the parts it can see. It's like trying to solve a jigsaw puzzle while wearing blinders, only looking at one tiny corner of the puzzle at a time. While this works okay, it misses the bigger picture. It doesn't know that the tree in the photo is part of a larger forest, or that the ground next to it is usually covered in moss.

The Problem:
The Earth is continuous. A satellite doesn't just take one photo; it takes thousands of overlapping photos as it flies over. These neighboring photos are like pieces of a giant, seamless mosaic. But existing AI models treat each photo as an isolated island, ignoring the fact that the landscape flows from one image to the next.

The New Solution: NeighborMAE
The authors behind NeighborMAE came up with a clever idea: why not teach the AI to look at two neighboring photos at the same time?

Think of it like this:

  • The Old Way: You are shown a photo of a house, and you have to guess what the roof looks like because part of it is covered by a cloud. You have to guess based only on the walls you can see.
  • NeighborMAE: You are shown the house and the photo of the house next door (or the same house taken a few seconds later). Even if the roof is hidden in the first photo, you might see the roof clearly in the second photo, or you might see the garden next door which gives you a clue about the style of the house.

How It Works (The Magic Tricks):

  1. The "Neighboring" Pair: The system grabs two satellite images that overlap slightly. It stitches them together in the computer's mind so the AI understands they are neighbors.
  2. The "Masking" Game: To make the AI learn hard, the system covers up (masks) random parts of both images with digital "clouds."
  3. The Challenge: The AI has to fill in the parts hidden by the clouds. But here's the twist: if a part of the house is hidden in Image A but visible in Image B, the AI has to use that information to guess what's missing in Image A. It's like solving a puzzle where you have two different views of the same scene to help you.
  4. Smart Difficulty: The system is smart about how hard it makes the game. If the two images overlap a lot (showing almost the same thing), it covers up more of the image to make it harder. If they are very different, it covers up less. This keeps the AI on its toes.
  5. No Cheating: Sometimes, if the two images are almost identical, the AI might try to "cheat" by just copying the visible part from Image B to fill the hole in Image A. The authors added a special rule (a "loss weight") to punish the AI if it just copies and pastes without actually understanding the scene. It forces the AI to learn the structure of the world, not just memorize patterns.
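To make the "smart difficulty" and "no cheating" ideas concrete, here is a minimal numpy sketch of the plumbing: random patch masking on a pair of images, a mask ratio that grows with their overlap, and a loss weight that shrinks when the pair overlaps heavily. The specific formulas (`adaptive_mask_ratio`, the `1 / (1 + alpha * overlap)` weight) are illustrative assumptions, not the paper's exact heuristics.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_mask_ratio(overlap, base=0.6, span=0.3):
    """More overlap between the pair -> harder game (higher mask ratio).
    `overlap` is the fraction of shared ground footprint, in [0, 1].
    This linear schedule is an assumption for illustration."""
    return base + span * overlap

def mask_patches(num_patches, ratio, rng):
    """Randomly hide a fraction of patch tokens, MAE-style."""
    n_hidden = int(round(num_patches * ratio))
    idx = rng.permutation(num_patches)
    hidden = np.zeros(num_patches, dtype=bool)
    hidden[idx[:n_hidden]] = True
    return hidden

def weighted_reconstruction_loss(pred, target, hidden, overlap, alpha=1.0):
    """Mean squared error on hidden patches, down-weighted when the pair
    overlaps heavily -- a stand-in for the paper's anti-copying loss weight."""
    weight = 1.0 / (1.0 + alpha * overlap)   # illustrative assumption
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return weight * per_patch[hidden].mean()

# Toy pair of images, each represented as 16 patch embeddings of dim 8.
num_patches, dim = 16, 8
img_a = rng.normal(size=(num_patches, dim))
img_b = rng.normal(size=(num_patches, dim))

overlap = 0.5                                  # pretend 50% shared footprint
ratio = adaptive_mask_ratio(overlap)           # 0.6 + 0.3 * 0.5 = 0.75
mask_a = mask_patches(num_patches, ratio, rng)
mask_b = mask_patches(num_patches, ratio, rng)

# A real model would reconstruct each image from the visible tokens of
# BOTH images; here a dummy prediction just exercises the loss plumbing.
pred_a = rng.normal(size=(num_patches, dim))
loss = weighted_reconstruction_loss(pred_a, img_a, mask_a, overlap)
```

The point of the down-weighting is the same as step 5 above: when the two views are nearly identical, a copied answer is too easy, so its reward is reduced.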

Why Does This Matter?
By learning from neighbors, the AI builds a much stronger mental map of the world. It understands that a road doesn't just stop at the edge of a photo; it continues. It understands that a forest has a texture that flows across boundaries.

The Results:
When they tested this new AI on real-world tasks—like counting buildings, detecting fires, or mapping forests—it performed significantly better than the old "single-photo" models. It was even competitive with massive, super-complex models that use many different types of sensors, but NeighborMAE did it using just standard RGB (color) photos.

In a Nutshell:
NeighborMAE is like teaching a student geography not by showing them isolated flashcards of cities, but by showing them a map where they can see how one city connects to the next. It turns the AI from a myopic observer into a global thinker, using the natural continuity of the Earth to learn faster and smarter.