OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

OmniEarth-Bench is the first multimodal benchmark designed to holistically evaluate Earth system intelligence across all six spheres and their interactions through 109 expert-curated tasks, revealing that current state-of-the-art multimodal large language models struggle significantly with these complex, cross-sphere challenges.

Fengxiang Wang, Mingshuo Chen, Xuming He, Yi-Fan Zhang, Yueying Li, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Junchao Gong, Di Wang, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang

Published 2026-02-17

Imagine the Earth not just as a blue marble, but as a giant, complex machine made of six different "departments" working together: the Atmosphere (air), the Lithosphere (rocks), the Oceansphere (water), the Cryosphere (ice), the Biosphere (living things), and the Human-activity sphere (our cities and farms).
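To keep those six "departments" straight, here is a tiny illustrative Python sketch of the taxonomy. The enum names and comments are my own labels for readability, not identifiers from the benchmark itself.

```python
from enum import Enum
from itertools import combinations

class Sphere(Enum):
    """The six Earth-system spheres covered by OmniEarth-Bench (labels illustrative)."""
    ATMOSPHERE = "air"                   # weather, clouds, storms
    LITHOSPHERE = "rocks"                # soil, landslides, earthquakes
    OCEANSPHERE = "water"                # oceans, rivers, lakes
    CRYOSPHERE = "ice"                   # glaciers, snow, sea ice
    BIOSPHERE = "living things"          # forests, animals, ecosystems
    HUMAN_ACTIVITY = "cities and farms"  # infrastructure, agriculture

# Cross-sphere questions pair spheres up, e.g. ATMOSPHERE + OCEANSPHERE
# for rain-driven river flooding. Six spheres give 15 possible pairs:
print(len(list(combinations(Sphere, 2))))  # 15
```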

For a long time, the smartest AI models (called Multimodal Large Language Models, or MLLMs) have been tested on how well they understand pictures and text. But these tests were like giving a pilot a driving test: they only checked whether the AI could recognize a stop sign (human activity) or read a weather map (atmosphere). They never asked the AI to explain how a landslide (rocks) affects a river (water), which then floods a city (humans).

Enter "OmniEarth-Bench."

Think of OmniEarth-Bench as the ultimate "Earth Science Olympiad" for AI. It's the first test that forces these AI models to prove they understand the entire planet and how all its departments talk to each other.

Here is a breakdown of what the paper is about, using some everyday analogies:

1. The Problem: The "Silos" of Knowledge

Previously, AI benchmarks were like specialized training camps.

  • One camp taught AI how to count cars in a city (Human sphere).
  • Another taught it to identify clouds (Atmosphere).
  • But no one taught them how to connect the dots. If it rains heavily (Atmosphere), the soil gets wet (Lithosphere), the river swells (Oceansphere), and the city floods (Human sphere).
  • The Gap: Existing AI models are like students who memorized the dictionary but can't write a story. They know what a "flood" looks like, but they don't understand why it happened or what caused it.

2. The Solution: The "Earth Doctor" Exam

The researchers created OmniEarth-Bench, a massive 29,855-question exam designed by 20 Earth scientists with Ph.D.s, assisted by 45 helpers.

  • The Curriculum: Instead of just asking "What is this cloud?", the exam asks complex questions like: "Based on the soil moisture, the river flow, and the snow melting, will this town flood tomorrow?"
  • The Ingredients: They didn't just use textbook pictures. They fed the AI real, raw data from 33 different sources: satellite images, seismic waveforms (the ground vibrations recorded during earthquakes), and ocean sensors. It's like giving the AI a stethoscope, a thermometer, and a seismograph all at once.
  • The Difficulty: The questions are organized into four levels of difficulty, from "What do you see?" (Perception) to "Explain the chain reaction of events" (Scientific Reasoning).
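Putting those ingredients together, a single exam item might look something like the sketch below. This is a minimal illustration assuming a simple multiple-choice format; the field names and the sample question are hypothetical, not the benchmark's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class ExamItem:
    """Hypothetical shape of one OmniEarth-Bench question (illustrative only)."""
    spheres: list[str]   # which of the six spheres the question touches
    modality: str        # data type, e.g. "satellite_image" or "seismic_waveform"
    level: str           # difficulty tier, from "perception" to "scientific_reasoning"
    question: str        # the prompt shown to the model
    choices: list[str]   # multiple-choice options
    answer: str          # letter of the correct option

flood_item = ExamItem(
    spheres=["atmosphere", "lithosphere", "oceansphere", "human_activity"],
    modality="satellite_image",
    level="scientific_reasoning",
    question=("Given rising soil moisture, increasing river discharge, and "
              "rapid upstream snowmelt, will this town flood tomorrow?"),
    choices=["A. Very unlikely", "B. Possible", "C. Very likely"],
    answer="C",
)
```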

3. The Results: The AI Got an "F"

The researchers tested 9 of the smartest AI models available today (including giants like GPT-4o and Gemini).

  • The Score: The results were shocking. None of the models scored above 35%. In fact, some of the most advanced models got questions completely wrong, sometimes even refusing to answer because they were too confused.
  • The Analogy: Imagine giving a medical student a patient with a broken leg and a fever. A smart student should ask, "Did they fall?" or "Is there an infection?" But these AI models were like students who just guessed "It's a broken leg" without looking at the fever, or guessed "It's a fever" without seeing the cast. They couldn't connect the symptoms to the whole body.
  • The "Refusal" Issue: Some models were so cautious that when they didn't know the answer, they said, "I can't decide." While this sounds honest, in a test, it counts as a wrong answer. Others just guessed blindly, which is worse.

4. Why This Matters

This paper isn't just about grading AI; it's a wake-up call.

  • Current AI is "Surface Level": Today's AI is great at recognizing patterns (like seeing a picture of a tiger). But Earth science is about processes (like understanding how a tiger's diet affects the forest, which affects the soil).
  • The Need for Specialists: The paper concludes that we can't just make AI "bigger" (adding more brain power) to solve this. We need to teach them Earth Science. We need to build models that are trained specifically on how the planet works, not just on general internet text.

The Takeaway

OmniEarth-Bench is a mirror held up to Artificial Intelligence. It shows us that while our AI is getting very good at "seeing" the world, it is still very bad at "understanding" the world.

Just as a pilot needs to understand aerodynamics, not just how to push buttons, our future AI tools for climate change, disaster relief, and farming need to understand the deep, interconnected dance of the Earth's six spheres. Until they pass this "Earth Science Olympiad," we can't fully trust them to make critical decisions about our planet's future.
