OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

OmniEarth-Bench is the first multimodal benchmark designed to holistically evaluate Earth system intelligence across all six spheres and their interactions through 109 expert-curated tasks, revealing that current state-of-the-art multimodal large language models struggle significantly with these complex, cross-sphere challenges.

Fengxiang Wang, Mingshuo Chen, Xuming He, Yi-Fan Zhang, Yueying Li, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Junchao Gong, Di Wang, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang

Published 2026-02-17

Imagine the Earth not just as a blue marble, but as a giant, complex machine made of six different "departments" working together: the Atmosphere (air), the Lithosphere (rocks), the Oceansphere (water), the Cryosphere (ice), the Biosphere (living things), and the Human-activity sphere (our cities and farms).
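To keep those six "departments" straight, here is a tiny illustrative Python sketch of the taxonomy. The enum names and comments are my own labels for readability, not identifiers from the benchmark itself.

```python
from enum import Enum
from itertools import combinations

class Sphere(Enum):
    """The six Earth-system spheres covered by OmniEarth-Bench (labels illustrative)."""
    ATMOSPHERE = "air"                   # weather, clouds, storms
    LITHOSPHERE = "rocks"                # soil, landslides, earthquakes
    OCEANSPHERE = "water"                # oceans, rivers, lakes
    CRYOSPHERE = "ice"                   # glaciers, snow, sea ice
    BIOSPHERE = "living things"          # forests, animals, ecosystems
    HUMAN_ACTIVITY = "cities and farms"  # infrastructure, agriculture

# Cross-sphere questions pair spheres up, e.g. ATMOSPHERE + OCEANSPHERE
# for rain-driven river flooding. Six spheres give 15 possible pairs:
print(len(list(combinations(Sphere, 2))))  # 15
```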

For a long time, the smartest AI models (called Multimodal Large Language Models, or MLLMs) have been tested on how well they understand pictures and text. But these tests were like giving a pilot a driving test: they only checked whether the AI could recognize a stop sign (human activity) or read a weather map (atmosphere). They never asked the AI to explain how a landslide (rocks) affects a river (water), which then floods a city (humans).

Enter "OmniEarth-Bench."

Think of OmniEarth-Bench as the ultimate "Earth Science Olympiad" for AI. It's the first test that forces these AI models to prove they understand the entire planet and how all its departments talk to each other.

Here is a breakdown of what the paper is about, using some everyday analogies:

1. The Problem: The "Silos" of Knowledge

Previously, AI benchmarks were like specialized training camps.

  • One camp taught AI how to count cars in a city (Human sphere).
  • Another taught it to identify clouds (Atmosphere).
  • But no one taught them how to connect the dots. If it rains heavily (Atmosphere), the soil gets wet (Lithosphere), the river swells (Oceansphere), and the city floods (Human sphere).
  • The Gap: Existing AI models are like students who memorized the dictionary but can't write a story. They know what a "flood" looks like, but they don't understand why it happened or what caused it.

2. The Solution: The "Earth Doctor" Exam

The researchers created OmniEarth-Bench, a massive 29,855-question exam designed by 20 Earth scientists with Ph.D.s, assisted by 45 helpers.

  • The Curriculum: Instead of just asking "What is this cloud?", the exam asks complex questions like: "Based on the soil moisture, the river flow, and the snow melting, will this town flood tomorrow?"
  • The Ingredients: They didn't just use textbook pictures. They fed the AI real, raw data from 33 different sources: satellite images, seismic waveforms (the ground vibrations recorded during earthquakes), and ocean sensors. It's like giving the AI a stethoscope, a thermometer, and a seismograph all at once.
  • The Difficulty: The questions are organized into four levels of difficulty, from "What do you see?" (Perception) to "Explain the chain reaction of events" (Scientific Reasoning).
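Putting those ingredients together, a single exam item might look something like the sketch below. This is a minimal illustration assuming a simple multiple-choice format; the field names and the sample question are hypothetical, not the benchmark's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class ExamItem:
    """Hypothetical shape of one OmniEarth-Bench question (illustrative only)."""
    spheres: list[str]   # which of the six spheres the question touches
    modality: str        # data type, e.g. "satellite_image" or "seismic_waveform"
    level: str           # difficulty tier, from "perception" to "scientific_reasoning"
    question: str        # the prompt shown to the model
    choices: list[str]   # multiple-choice options
    answer: str          # letter of the correct option

flood_item = ExamItem(
    spheres=["atmosphere", "lithosphere", "oceansphere", "human_activity"],
    modality="satellite_image",
    level="scientific_reasoning",
    question=("Given rising soil moisture, increasing river discharge, and "
              "rapid upstream snowmelt, will this town flood tomorrow?"),
    choices=["A. Very unlikely", "B. Possible", "C. Very likely"],
    answer="C",
)
```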

3. The Results: The AI Got an "F"

The researchers tested 9 of the smartest AI models available today (including giants like GPT-4o and Gemini).

  • The Score: The results were shocking. None of the models scored above 35%. In fact, some of the most advanced models got questions completely wrong, sometimes even refusing to answer because they were too confused.
  • The Analogy: Imagine giving a medical student a patient with a broken leg and a fever. A smart student should ask, "Did they fall?" or "Is there an infection?" But these AI models were like students who just guessed "It's a broken leg" without looking at the fever, or guessed "It's a fever" without seeing the cast. They couldn't connect the symptoms to the whole body.
  • The "Refusal" Issue: Some models were so cautious that when they didn't know the answer, they said, "I can't decide." While this sounds honest, in a test, it counts as a wrong answer. Others just guessed blindly, which is worse.

4. Why This Matters

This paper isn't just about grading AI; it's a wake-up call.

  • Current AI is "Surface Level": Today's AI is great at recognizing patterns (like seeing a picture of a tiger). But Earth science is about processes (like understanding how a tiger's diet affects the forest, which affects the soil).
  • The Need for Specialists: The paper concludes that we can't just make AI "bigger" (adding more brain power) to solve this. We need to teach them Earth Science. We need to build models that are trained specifically on how the planet works, not just on general internet text.

The Takeaway

OmniEarth-Bench is a mirror held up to Artificial Intelligence. It shows us that while our AI is getting very good at "seeing" the world, it is still very bad at "understanding" the world.

Just as a pilot needs to understand aerodynamics, not just how to push buttons, our future AI tools for climate change, disaster relief, and farming need to understand the deep, interconnected dance of the Earth's six spheres. Until they pass this "Earth Science Olympiad," we can't fully trust them to make critical decisions about our planet's future.
