NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation

This paper introduces NeuCo-Bench, a benchmark framework that standardizes the evaluation of neural embeddings for Earth Observation. It combines a fixed-size embedding pipeline, a bias-mitigating challenge mode, and a balanced scoring system, and is accompanied by the release of the SSL4EO-S12-downstream dataset and results from a 2025 CVPR workshop challenge.

Rikard Vinge, Isabelle Wittmann, Jannik Schneider, Michael Marszalek, Luis Gilch, Thomas Brunschwiler, Conrad M Albrecht

Published 2026-03-16

Imagine you have a massive library of satellite photos of the Earth. These photos are huge, detailed, and come in different "seasons" and "spectrums" (like infrared or radar). Storing and sending all this data is like trying to mail a library of encyclopedias to a friend; it's slow, expensive, and clogs up the mail system.

For a long time, scientists tried to compress these photos by making them look smaller but still recognizable to the human eye (like JPEGs). But computers don't care if a photo looks pretty; they care if the photo contains the right information to solve a problem.

This paper introduces NeuCo-Bench, a new "test drive" for a smarter way to shrink this data.

The Core Idea: The "Summary Note" vs. The "Photo Album"

Think of the satellite data as a 1,000-page photo album of a forest.

  • Old Way (JPEG): You shrink the photo album so it fits in a backpack, but you still keep every single photo, just with lower quality.
  • New Way (NeuCo-Bench): Instead of sending the whole album, you hire a super-smart AI to read the whole album and write a one-page summary note. This note doesn't look like a photo; it's just a list of numbers (an "embedding").

The goal of NeuCo-Bench is to answer: "Can this one-page summary note tell us everything we need to know about the forest?"
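To get a feel for why the "summary note" matters, here is a minimal sketch of the size difference. The patch shape below (13 spectral bands at 264×264 pixels, as in SSL4EO-S12-style Sentinel-2 data) and the 1,024-number embedding length are illustrative assumptions, not exact figures from the paper:

```python
import numpy as np

# Assumed, illustrative sizes: a Sentinel-2 style patch
# (13 spectral bands, 264 x 264 pixels) vs. a 1,024-number embedding.
patch = np.zeros((13, 264, 264), dtype=np.float32)   # the "photo album"
embedding = np.zeros(1024, dtype=np.float32)         # the "summary note"

ratio = patch.size / embedding.size
print(f"Raw values per patch: {patch.size:,}")
print(f"Embedding length:     {embedding.size:,}")
print(f"Compression factor:   {ratio:.0f}x fewer numbers")
```

Even with these rough numbers, the embedding carries hundreds of times fewer values than the raw patch, which is exactly the trade the benchmark probes: how much task-relevant information survives that shrinkage.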

How the Test Works (The "Blind Taste Test")

The authors created a framework to test these summary notes. Here is the analogy:

  1. The Contestants (The Compressors): Different AI models try to turn the massive satellite photos into these tiny summary notes (embeddings).
  2. The Judges (The Tasks): The authors have a list of questions they want to answer about the forest, such as:
    • "How much wood is in this forest?" (Biomass)
    • "Is this a cornfield or a soybean field?" (Crops)
    • "Is it cloudy?" (Clouds)
    • "Is this city getting hotter?" (Heat Islands)
  3. The Secret Sauce (Hidden Tasks): In their big competition (the "CVPR EarthVision Challenge"), the contestants didn't know which questions they would be asked. They just had to make the best possible summary note. This prevented them from "cramming" for a specific test.
  4. The Grading (Linear Probing): To see if the summary note is good, the judges try to answer the questions using only that note. They use a very simple, fast calculator (a "linear probe") to see if the numbers in the note correlate with the answer.
    • Analogy: If the summary note says "High Green, Low Cloud," and the question is "Is it cloudy?", a good note should make the calculator say "No."
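The "simple calculator" step can be sketched in a few lines. This is a toy linear probe on synthetic embeddings (the data, dimensions, and "cloudiness" task are invented for illustration; NeuCo-Bench's actual probe setup may differ in detail), showing the key point: the embedding stays frozen, and only one linear fit is trained per task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 patches, each summarized as a 16-number embedding.
# One hidden direction in the embedding space encodes "cloudiness".
n, d = 200, 16
embeddings = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[3] = 1.0                                   # cloudiness lives in dim 3
cloudy = (embeddings @ true_w > 0).astype(float)  # binary label

# Linear probe = one least-squares fit on frozen embeddings (no fine-tuning).
X = np.hstack([embeddings, np.ones((n, 1))])      # add a bias column
w, *_ = np.linalg.lstsq(X, cloudy, rcond=None)
pred = (X @ w > 0.5).astype(float)
accuracy = (pred == cloudy).mean()
print(f"Linear-probe accuracy: {accuracy:.2f}")
```

If the probe answers the question well from the note alone, the embedding "contains" that information; if not, the compression threw it away.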

The Scoring System: The "Consistency Trophy"

How do you rank the contestants?

  • Accuracy: Did they get the answer right?
  • Stability: Did they get the right answer every time, or did they get lucky once and fail the next time?

NeuCo-Bench uses a special scoring formula that rewards consistency. If a model is great at predicting crops but terrible at predicting clouds, it gets a lower score than a model that is "okay" at everything. It's like a sports league where the team that wins the most games consistently is ranked higher than the team that wins one big game and loses the rest.
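One simple way to capture "reward consistency" in code is to take the mean score across tasks and subtract the spread. This is an illustrative stand-in, not NeuCo-Bench's actual published formula:

```python
import numpy as np

def consistency_score(task_scores):
    """Illustrative only: reward high average accuracy across tasks
    while penalizing large spread between tasks."""
    scores = np.asarray(task_scores, dtype=float)
    return scores.mean() - scores.std()

specialist = [0.95, 0.40, 0.45, 0.50]   # great at one task, weak elsewhere
generalist = [0.70, 0.68, 0.72, 0.69]   # merely "okay" at everything

print(f"Specialist: {consistency_score(specialist):.3f}")
print(f"Generalist: {consistency_score(generalist):.3f}")
```

Under any scoring of this flavor, the steady generalist outranks the one-hit specialist, which is the behavior the benchmark is designed to encourage.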

What They Found

The paper ran this test with 23 teams and many different AI models. Here are the takeaways:

  • The "Foundation Models" Won: The best summary notes came from massive, pre-trained AI models (like "TerraMind") that had already learned a lot about the Earth. They were like students who had read every book in the library before the test.
  • Size Matters (But Not Too Much): There is a "Goldilocks" size for these notes. If the note is too short (too compressed), it forgets important details. If it's too long, it's nearly as heavy as the original photo album. The sweet spot in their experiments was around 1,024 numbers.
  • Simple is Better: Surprisingly, you don't need a complex, heavy calculator to read the summary note. A simple, fast calculator worked just as well as a complex one for the best notes. This means these notes are efficient and easy to use.

Why This Matters

Imagine a future where satellites send back these tiny "summary notes" instead of huge photos.

  • Speed: They travel instantly.
  • Storage: You can store millions of years of Earth data on a single hard drive.
  • Privacy: Because the note is just a list of numbers and not a picture, you can't easily reconstruct the original image to spy on someone's backyard. It's a "privacy-preserving" way to monitor the planet.

In short: NeuCo-Bench is a new rulebook and a scoreboard that helps scientists figure out how to shrink our planet's data into tiny, useful "summary notes" that computers can use instantly to solve real-world problems like climate change and disaster response.
