REMSA: Foundation Model Selection for Remote Sensing via a Constraint-Aware Agent

The Big Problem: The "Remote Sensing" Supermarket is a Mess

Imagine you are a chef trying to cook a specific dish (like a "flood detection stew"). You need a very specific type of pot, a specific heat source, and ingredients that match your recipe.

Now, imagine walking into a massive, chaotic supermarket called Remote Sensing. This store has over 160 different types of "super-pots" (these are the AI models). Some pots are made for radar, some for optical cameras, some for hyperspectral sensors. Some are huge and need a nuclear power plant to run; others are tiny and fit in a backpack.

The problem? The labels are scattered. Some are in old magazines, some are in code repositories, and some are written in confusing technical jargon. If you ask a human to find the perfect pot for your specific stew, they might spend weeks reading manuals, get confused, or pick the wrong one.

The Solution: Meet "Remsa" (Your Personal Shopping Agent)

The authors of this paper built a smart assistant named Remsa. Think of Remsa as a super-smart, constraint-aware personal shopper who knows exactly what you need.

Instead of you wandering the aisles, you just tell Remsa: "I need a pot for flood detection using radar data, but I only have a laptop (no supercomputer) and I need it to be fast."

Remsa doesn't just guess. It goes through a strict, logical process to find the best match.

How Remsa Works: The 4-Step Dance

Remsa uses a special "brain" (a Large Language Model) and a massive, organized "library" (a database) to do its job. Here is how it works:

1. The Library: The "RS-FMD" (The Organized Catalog)

Before Remsa can help, the authors had to organize the messy supermarket. They built the RS-FMD, a structured database of over 160 models.

The Analogy: Imagine taking all those scattered magazine clippings and messy labels and turning them into a perfectly organized digital catalog. Every pot now has a clear tag saying: Size, Power Needs, Best Use Case, and Price.
How they did it: They used AI to read the messy papers and automatically fill in the catalog tags, but they added a "confidence score." If the AI's confidence in a tag (like "How many layers does this model have?") fell below a set threshold, the tag was flagged for a human to double-check.

2. The Interpreter: The "Translator"

When you type a question like "I need to find oil spills in the ocean," Remsa's Interpreter translates your casual English into a strict checklist.

The Analogy: It's like a translator who turns "I want a car that's fast and cheap" into a specific list: Max Speed > 100mph, Price < $20k, Fuel Type: Gas. It turns vague wishes into hard constraints.

3. The Orchestrator: The "Traffic Cop"

This is the most important part. Remsa doesn't just search and guess. It acts like a traffic cop directing the flow of information.

Step A (Retrieval): It quickly grabs a list of 50 potential models from the catalog that might work.
Step B (Filtering): It immediately throws out the ones that break your hard rules (e.g., "This model needs a supercomputer, but you only have a laptop").
Step C (The "Wait, I need more info" Moment): If the list is still too long or the AI is confused, Remsa stops and asks you a clarifying question: "You mentioned 'fast.' Do you mean fast training or fast processing?" This is the Constraint-Aware part: it knows when it needs more details to make a good decision.
Step D (Ranking): It uses its "brain" to read the remaining candidates and rank them from best to worst, explaining why it picked them.
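
Steps A to D can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` fields, the `max_candidates` threshold for "the list is still too long," and the `retrieve`, `ask_user`, and `rank` callables are all hypothetical stand-ins for Remsa's tools. The cap of three clarification rounds follows the limit given in the technical summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    modality: str         # e.g. "SAR" or "optical" (hypothetical field)
    params_millions: int  # rough proxy for compute needs (hypothetical field)


def satisfies(model: Model, constraints: dict) -> bool:
    """Step B: return False if the model breaks any hard rule."""
    if "modality" in constraints and model.modality != constraints["modality"]:
        return False
    limit = constraints.get("max_params_millions")
    if limit is not None and model.params_millions > limit:
        return False
    return True


def select(
    constraints: dict,
    retrieve: Callable[[dict], list],
    ask_user: Callable[[list], dict],
    rank: Callable[[list, dict], list],
    max_candidates: int = 2,  # assumed threshold for "too many candidates"
    max_rounds: int = 3,      # at most three clarification rounds
) -> list:
    for round_no in range(max_rounds + 1):
        candidates = retrieve(constraints)                                  # Step A
        candidates = [m for m in candidates if satisfies(m, constraints)]   # Step B
        if len(candidates) <= max_candidates or round_no == max_rounds:
            break
        # Step C: merge the user's answer into the constraints and try again.
        constraints = {**constraints, **ask_user(candidates)}
    return rank(candidates, constraints)                                    # Step D


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Model("model-a", "SAR", 90),
    Model("model-b", "SAR", 300),
    Model("model-c", "optical", 90),
    Model("model-d", "SAR", 25),
    Model("model-e", "SAR", 600),
]
```

Calling `select` with `{"modality": "SAR"}` leaves four candidates after filtering, so the loop asks one question. If the answer adds a size limit of 100 million parameters, two candidates remain and the loop proceeds to ranking.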

4. The Reporter: The "Explainable Guide"

Finally, Remsa gives you a report. It doesn't just say "Here is Model X." It says: "I picked Model X because it handles radar data well, fits on your laptop, and is great for oil spills. However, Model Y is slightly more accurate but too heavy for your computer."

Why is this a Big Deal? (The Results)

The authors tested Remsa against other methods:

The "Naive" Agent: A bot that just searches without thinking or asking questions. (Like a robot that grabs the first 3 pots it sees).
The "Dense Retrieval" System: A system that just matches your request to catalog entries by how similar they sound, with no rule-checking or ranking afterwards. (Like a search engine that returns pots that resemble your description but never checks whether they fit your stove).
The "Unstructured" System: A bot that reads the messy papers without a catalog. (Like asking a human to read 160 books to find one answer).

The Verdict: Remsa won every time.

It was more accurate.
It handled complex constraints better.
It provided better explanations.

Even when they tested it with different "brains" (different AI models), Remsa's structure made it work better than the others.

The Bottom Line

Remsa is a tool that turns the impossible task of choosing the right AI model for satellite data into a simple conversation.

Before: You were lost in a library with no index, trying to find a needle in a haystack.
Now: You have a smart librarian (Remsa) who has organized the whole library, understands your specific needs, asks you smart questions, and hands you the perfect book with a note explaining why it's the best choice.

This makes advanced AI for Earth observation (like monitoring climate change or disasters) more accessible to people who are not computer science experts.

1. Problem Statement

The rapid proliferation of Remote Sensing Foundation Models (RSFMs) has created a significant bottleneck for practitioners. While hundreds of models exist (e.g., vision-only encoders, vision-language models, multimodal architectures), selecting the most suitable model for a specific task is increasingly difficult due to:

Information Fragmentation: Model documentation is scattered across papers, GitHub repositories, and model cards in unstructured formats.
Complex Constraints: RS tasks involve intricate trade-offs between data modalities (SAR, multispectral, hyperspectral), spatial/spectral/temporal resolutions, computational resources, and specific downstream tasks (e.g., change detection vs. land cover mapping).
Lack of Automation: Current selection processes are manual, time-consuming, error-prone, and lack reproducibility. Existing benchmarks focus on fixed task performance rather than matching models to user-specific deployment constraints.

2. Methodology

The authors propose Remsa, a constraint-aware Large Language Model (LLM) agent designed to automate RSFM selection. The system relies on two core components: a structured database and a modular agent workflow.

A. The RSFM Database (RS-FMD)

To enable machine-readable selection, the authors constructed RS-FMD, the first structured, schema-guided database covering over 160 RSFMs.

Schema Design: The database uses a comprehensive schema capturing model identifiers, architecture details (backbone, layers), supported modalities, pretraining strategies (datasets, masking ratios), and benchmark performance.
Semi-Automated Curation: Due to the scale of documentation, a semi-automated pipeline was used. It employs an LLM-based extraction process (inspired by OneKE) to parse unstructured sources (papers, model cards) into structured JSON.
Confidence Scoring: A confidence score is calculated for each extracted field based on the LLM's generation probability and self-consistency across multiple sampling rounds. Fields with low confidence ($<0.75$) are flagged for human verification, ensuring high data reliability.
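
The confidence check can be illustrated with a short sketch. The summary does not say how the two signals are combined, so the equal weighting below is an assumption, as are the function names and the use of a geometric-mean token probability as the "generation probability." Only the 0.75 threshold comes from the text.

```python
import math
from collections import Counter

THRESHOLD = 0.75  # fields scoring below this go to human verification


def generation_probability(token_logprobs):
    """Geometric-mean token probability of one sampled answer."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def self_consistency(samples):
    """Majority answer and the fraction of sampling rounds that agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def field_confidence(samples, logprobs_per_sample, weight=0.5):
    """Combine both signals; the 50/50 weighting is an assumption."""
    answer, agreement = self_consistency(samples)
    # Average the generation probability over the rounds that produced the majority answer.
    probs = [
        generation_probability(lp)
        for s, lp in zip(samples, logprobs_per_sample)
        if s == answer
    ]
    gen_prob = sum(probs) / len(probs)
    return answer, weight * gen_prob + (1 - weight) * agreement


def needs_review(confidence):
    return confidence < THRESHOLD
```

For example, if three sampling rounds return "12", "24", "12" for a layer count and each answer has a token probability of 0.5, the combined score is about 0.58, so the field would be flagged for a human.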

B. The Remsa Agent Architecture

Remsa operates as a modular agent that orchestrates tools to interpret user intent and select models.

Interpreter: Parses free-text user queries into structured constraints (e.g., application type, required modality, compute budget).
Task Orchestrator: A control loop that manages the selection workflow based on the current state (available constraints, candidate count, confidence scores).
Tool Suite:
- Retrieval Tool: Uses Sentence-BERT embeddings and FAISS to perform dense retrieval from RS-FMD, generating an initial candidate set based on semantic similarity.
- Ranking Tool: Refines candidates using a hybrid approach:
  - Rule-Based Filtering: Removes candidates violating hard constraints (e.g., missing sensor support).
  - In-Context LLM Ranking: Uses few-shot prompting to re-rank candidates based on nuanced trade-offs (e.g., efficiency vs. accuracy) without fine-tuning the LLM.
- Clarification Generator: If constraints are ambiguous or confidence is low, the agent asks the user targeted questions (up to 3 rounds) to refine the query.
- Explanation Generator: Produces transparent, human-readable justifications for the top-ranked models, citing specific metadata and trade-offs.
Memory: A lightweight vector database stores past interactions to support personalization and long-term refinement.
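
The Retrieval Tool's dense retrieval step can be sketched as a similarity search over embeddings. In the sketch below, hand-written three-dimensional vectors stand in for Sentence-BERT embeddings, and a brute-force cosine search stands in for a FAISS index. The entry names and vectors are hypothetical and chosen only to make the ordering easy to follow.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k):
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda entry: cosine(query_vec, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical database entries with toy embeddings.
INDEX = [
    {"name": "fm-sar-flood", "embedding": [0.9, 0.1, 0.0]},
    {"name": "fm-optical-flood", "embedding": [0.8, 0.3, 0.1]},
    {"name": "fm-hyper-crop", "embedding": [0.0, 0.2, 0.9]},
    {"name": "fm-multi", "embedding": [0.6, 0.6, 0.2]},
]

# Toy query embedding pointing along the first axis.
candidates = retrieve([1.0, 0.0, 0.0], INDEX, k=3)
```

Retrieval alone only orders entries by semantic similarity. It does not check whether a candidate supports the required sensor, which is why the Ranking Tool's rule-based filtering follows it.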

3. Key Contributions

RS-FMD: The first structured, schema-guided database of 160+ RSFMs, serving as a foundational resource for automated model selection and comparison.
Remsa Agent: A novel, modular LLM agent that integrates structured metadata grounding, dense retrieval, in-context ranking, and interactive clarification to solve complex, constraint-heavy selection problems.
Expert-Centered Benchmark: The creation of a new evaluation protocol comprising 100 realistic query scenarios and 3,000 expert-scored evaluations. This benchmark assesses systems across 7 criteria (Application Compatibility, Modality Match, Performance, Efficiency, Popularity, Generalizability, Recency).

4. Experimental Results

The authors evaluated Remsa against three baselines:

Remsa-Naive: Same tools but without adaptive orchestration (single-step execution).
DB-Retrieval: Pure dense retrieval without ranking or reasoning.
Unstructured-RAG: A standard RAG approach using unstructured text descriptions.

Key Findings:

Performance: Remsa consistently outperformed all baselines across multiple LLM backbones (GPT-4.1, DeepSeek3.2, LLaMA-3.3-70B).
- Average Top-1 Score: Remsa achieved 75.76 (GPT-4.1) vs. 72.67 for Remsa-Naive and 67.37 for DB-Retrieval.
- Top-1 Hit Rate: Remsa achieved 21.33%, significantly higher than DB-Retrieval (12.00%) and Unstructured-RAG (13.33%).
- High-Quality Hit Rate: Remsa reached 40.00%, where a high-quality hit means the selected model has an expert score $\ge 80$.
Robustness: The performance gains were consistent across different LLM backbones, suggesting the improvement stems from the agent's architecture (orchestration and structured grounding) rather than the specific LLM used.
Sensitivity Analysis: Removing "Application Compatibility" or "Modality Match" from the scoring rubric caused significant performance drops, indicating that Remsa prioritizes functional suitability. Removing "Efficiency" or "Popularity" slightly improved scores, suggesting these criteria sometimes favor established models over technically optimal ones.
Latency: Remsa has a higher latency (~31.7 s) compared to retrieval-only methods (~0.77 s) due to multi-turn reasoning, but this overhead yields significantly higher accuracy and relevance.
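
The three Top-1 metrics above can be computed from per-query results as in the sketch below. The definitions are partly assumed: the summary states only that a high-quality hit means an expert score of at least 80, so treating a "hit" as the system's top pick matching the expert's highest-scored model is an assumption. The record fields and the four sample queries are invented for illustration and are not results from the paper.

```python
def top1_metrics(results, high_quality=80):
    """Aggregate Top-1 metrics over a list of per-query result records."""
    n = len(results)
    return {
        # Mean expert score of the system's first-ranked model.
        "avg_top1_score": sum(r["top1_score"] for r in results) / n,
        # Share of queries where the first-ranked model is the expert's best (assumed definition).
        "top1_hit_rate": sum(r["top1"] == r["expert_best"] for r in results) / n,
        # Share of queries where the first-ranked model scores at or above the threshold.
        "high_quality_hit_rate": sum(r["top1_score"] >= high_quality for r in results) / n,
    }


# Invented sample records, not data from the benchmark.
SAMPLE = [
    {"top1": "m1", "expert_best": "m1", "top1_score": 90},
    {"top1": "m2", "expert_best": "m5", "top1_score": 70},
    {"top1": "m3", "expert_best": "m4", "top1_score": 85},
    {"top1": "m6", "expert_best": "m7", "top1_score": 55},
]

metrics = top1_metrics(SAMPLE)
```

On this sample, the high-quality hit rate exceeds the Top-1 hit rate because a pick can score well with the experts without being their single best choice, which matches the gap between the two rates reported above.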

5. Significance

Bridging the Gap: Remsa addresses the critical gap between the abundance of RSFMs and the practical difficulty of deploying them. It transforms model selection from a manual literature review into an automated, reproducible workflow.
Constraint-Awareness: Unlike previous benchmarks that focus solely on accuracy, Remsa explicitly handles operational constraints (compute, data availability, sensor types), making it highly relevant for real-world industry and scientific applications.
Transparency and Trust: By providing structured justifications and leveraging a curated database, Remsa offers transparent decision-making, which is crucial for high-stakes applications like disaster response and environmental monitoring.
Community Resource: The release of RS-FMD and the benchmark dataset provides a standardized foundation for future research in automated AI model selection for remote sensing and beyond.

In conclusion, Remsa demonstrates that combining structured knowledge bases with constraint-aware agentic workflows significantly enhances the ability to select the right foundation model for complex remote sensing tasks, outperforming both naive retrieval and unstructured RAG approaches.