Imagine you have a super-smart robot that has spent its entire childhood reading every geography textbook, looking at millions of satellite photos, and learning how the Earth looks from space. This robot is a Geospatial Foundation Model (GFM). It's like a world-class detective who knows the general rules of the planet but hasn't yet been trained on the specific, tricky cases of ecology.
This paper is essentially a report card on how well we can take these "generalist" robots and teach them to solve very specific, high-stakes ecological mysteries. The researchers wanted to see if these robots could do better than the old-school, "dumb" models (like ResNet) that were trained on everyday internet photos of cats and cars, not on satellite imagery.
Here is the breakdown of their three main "mystery cases":
1. The Three Mysteries They Solved
The team tested the robots on three different ecological challenges:
Case A: The Forest Detective (Leaf & Canopy Mapping)
- The Goal: Figure out exactly what kind of trees are in a forest (pine vs. oak) and how thick the "roof" of leaves is.
- The Analogy: Imagine trying to tell the difference between a dense pine forest and a sparse oak grove just by looking at a blurry photo from a drone. The old models (ResNet) were like a person squinting in the dark, guessing. The new AI models (Prithvi and TerraMind) were like having a pair of high-tech night-vision goggles. They saw the details clearly and got the answer right 20% more often than the old models.
Case B: The Peatland Hunter (Finding Spongy Ground)
- The Goal: Find "peatlands"—those soggy, mossy wetlands that act like giant carbon sponges, storing huge amounts of climate-warming gas.
- The Analogy: Peat moss looks reddish-brown from space, but so do many other plants. It's like trying to find a specific red apple in a basket full of red tomatoes and red peppers.
- The Twist: The researchers found that if they gave the robot more senses (not just a camera, but also radar and elevation maps), it became a much better hunter. It was like giving the detective a metal detector and a map of underground pipes, not just a pair of eyes. The sketch after this list shows the simplest version of that idea in code.
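In practice, "more senses" usually just means more input channels: the optical, radar, and elevation layers are stacked into one tensor before the network ever sees them. Here is a minimal, self-contained sketch of that channel-stacking idea in PyTorch. Every shape, channel count, and the toy model itself are invented for illustration; the paper's models are vastly larger and more sophisticated.

```python
import torch
import torch.nn as nn

# Hypothetical layers for one 64x64 tile of ground. Channel counts are
# illustrative: 4 optical bands, 2 radar bands, 1 elevation band.
optical   = torch.rand(4, 64, 64)   # the "eyes"
radar     = torch.rand(2, 64, 64)   # the "metal detector"
elevation = torch.rand(1, 64, 64)   # the "map of underground pipes"

# The simplest multimodal fusion: stack everything into one 7-channel
# input so the network can weigh all of its senses at once.
x = torch.cat([optical, radar, elevation], dim=0).unsqueeze(0)  # (1, 7, 64, 64)

# A toy peatland classifier: class 1 = peatland, class 0 = not peatland.
model = nn.Sequential(
    nn.Conv2d(7, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 2),
)

logits = model(x)
print(logits.shape)  # torch.Size([1, 2])
```

The design point is simply that the first convolution accepts 7 channels instead of the usual 3, so the radar and elevation "senses" get a say in every prediction.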
Case C: The Time Traveler (Generating Missing Data)
- The Goal: Fill in the gaps that appear when clouds block a satellite's view, leaving holes in the data record.
- The Analogy: Imagine you are watching a movie, but the screen flickers off for a few seconds. The TerraMind model is like a super-creative editor who can look at the scene before and after the flicker and guess exactly what happened in the missing seconds. It successfully "filled in the blanks" to create a complete map of the land, even when the original data was missing. The toy sketch below shows the crudest version of this "look before and after" trick.
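To make the flickering-movie analogy concrete, here is a deliberately crude stand-in in Python. Everything in it is invented for illustration: the numbers, the cloud mask, and above all the fill rule, which just averages the "before" and "after" frames. TerraMind's actual generative gap-filling is far more sophisticated, synthesizing a plausible scene rather than blending two.

```python
import numpy as np

# Three hypothetical snapshots of the same 4x4 patch of land over time
# (think of the values as a vegetation index between 0 and 1).
before = np.full((4, 4), 0.30)  # last week's clear view
after  = np.full((4, 4), 0.50)  # next week's clear view
today  = np.full((4, 4), 0.42)  # today's image...

# ...except a cloud blocked part of it. True marks a missing pixel.
cloud_mask = np.zeros((4, 4), dtype=bool)
cloud_mask[1:3, 1:3] = True
today[cloud_mask] = np.nan

# Fill the blanks by blending the scenes before and after the "flicker".
filled = today.copy()
filled[cloud_mask] = 0.5 * (before[cloud_mask] + after[cloud_mask])

print(np.isnan(filled).any())  # False: the gap is gone
```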
2. The Results: Who Won the Race?
- The Old Guard (ResNet): This is the "standard" model. It's dependable, but it often gets confused when the scenery changes. It's like a student who memorized the textbook but fails the test if the questions are phrased differently.
- The New Stars (Prithvi & TerraMind): These are the foundation models. They learned from so much data that they understood the "language" of the Earth, and only needed a short fine-tuning course on each specific mystery (the sketch after this list shows the basic recipe).
- Prithvi was great, like a top-tier student.
- TerraMind was the valedictorian. It performed slightly better, especially when the researchers gave it extra tools (like radar data). It proved that having a "multimodal" brain (seeing with eyes, ears, and touch) is better than just having eyes.
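What does "teaching" a generalist model a specific case actually look like in code? The standard recipe is fine-tuning: keep the pretrained body, bolt on a small new head for the task. The sketch below uses an ordinary ImageNet ResNet from torchvision purely as a stand-in backbone, since the exact loading code for Prithvi and TerraMind isn't shown here; the class count is invented for illustration.

```python
import torch.nn as nn
import torchvision.models as models

# A "generalist" backbone that already learned from huge amounts of data.
# (Stand-in only: the paper's foundation models are pretrained on
# satellite imagery, which is exactly why they start with a head start.)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the hard-won general knowledge...
for param in backbone.parameters():
    param.requires_grad = False

# ...and bolt on a small, trainable head for one specific mystery:
# say, 5 tree-species classes for the forest-mapping case (hypothetical).
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head's weights will be updated during training.
trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']
```

Because only the tiny head is trained, the model keeps its general understanding of imagery while specializing cheaply on the new task.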
3. The Catch: Why They Didn't Get 100%
Even though the new robots were amazing, they weren't perfect. The paper points out a few reasons why:
- The "Blurry Photo" Problem: The satellite images they used were 10 meters per pixel. That's like trying to identify a specific bird species from a photo taken from a plane; you can see the tree, but you can't see the bird's feathers. The AI needs higher-resolution "photos" to see the tiny details of nature.
- The "Bad Map" Problem: Sometimes the "answer key" (the labels) the researchers used to train the AI was wrong or too vague. If you teach a robot with a blurry map, it will draw a blurry map. The AI is only as good as the homework it's given.
- The "Underground" Blind Spot: The AI can see the surface (the moss), but it can't see what's happening under the ground (the water table or carbon content). It's like trying to diagnose a patient's health just by looking at their skin, without being able to feel their pulse or take an X-ray.
The Big Takeaway
This paper tells us that AI is ready to help save the planet, but we need to treat it like a brilliant intern, not a magic wand.
If we give these foundation models the right data, the right tools (like radar and elevation), and high-quality "homework" (accurate labels), they can outperform our old methods by a huge margin. They are the future of ecological mapping, helping us track forests and wetlands faster and more accurately than ever before. But to get the best results, we need to keep feeding them better data and sharper eyes.