CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning

The paper introduces CarbonBench, the first standardized benchmark comprising over 1.3 million global observations from 567 sites, designed to rigorously evaluate and compare zero-shot spatial transfer learning methods for upscaling terrestrial carbon fluxes across diverse, unseen ecosystems and climate regimes.

Aleksei Rozanov, Arvind Renganathan, Yimeng Zhang, Vipin Kumar

Published Wed, 11 Ma

Imagine the Earth as a giant, breathing organism. Every day, forests, grasslands, and oceans inhale carbon dioxide (CO₂) and exhale it back out. Scientists call this the "carbon flux." To fight climate change, we need to know exactly how much carbon the Earth is storing or releasing.

However, we only have a few "microphones" (called Eddy Covariance towers) placed on the ground to listen to this breathing, and there are only about 567 of them scattered across the entire planet. They are like a handful of microphones in a massive stadium: they capture perfect sound in their immediate spot, but we have no idea what's happening in the empty seats, the VIP boxes, or the other side of the field.

The Problem: The "Zero-Shot" Guessing Game
Scientists want to use these few microphones to guess the sound of the entire stadium. This is called "upscaling."

The tricky part is that every part of the stadium is different. A microphone in a tropical rainforest hears a very different rhythm than one in a frozen tundra. If you train a computer model to listen to the rainforest, it will likely fail miserably when you ask it to guess what the tundra sounds like, because it has never "heard" the tundra before.

In machine learning terms, this is a "Zero-Shot" problem: The model must make a prediction about a place it has never seen, with no practice data from that specific location.

The Solution: CarbonBench
The authors of this paper, a team from the University of Minnesota, realized that while scientists were trying to solve this, they were all playing different games with different rules. Some used different maps, some measured different things, and no one could agree on who was actually the best at guessing.

So, they built CarbonBench. Think of this as the "Olympics for Carbon Guessing."

Here is what CarbonBench does, using simple analogies:

  1. The Massive Dataset (The Scoreboard): They gathered data from 567 real-world "microphones" (towers) spanning 24 years. They combined this with satellite photos (to see what the plants look like) and weather data (to see if it's hot, cold, or raining). This creates a massive, harmonized library of 1.3 million daily observations.
  2. The Rules of the Game (The Evaluation): They created strict rules to test the models.
    • The "Vegetation" Test: Train the model on forests, then test it on grasslands.
    • The "Climate" Test: Train the model on tropical zones, then test it on polar zones.
    • The "Zero-Shot" Rule: The model is strictly forbidden from seeing the test locations during training. It has to guess purely based on what it learned elsewhere.
  3. The Athletes (The Models): They tested various "athletes" (computer algorithms) to see who wins.
    • The Veterans: Old-school tree-based models (like XGBoost), which are reliable but sometimes stubborn.
    • The Time-Travelers: Advanced AI models (like Transformers and LSTMs) that look at patterns over time, not just a single snapshot.
    • The Specialists: New models designed specifically to handle "domain shifts" (moving from one environment to another).
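The "Vegetation" and "Climate" tests above boil down to grouped, zero-shot splits: every site sharing a vegetation type (or climate zone) is held out together, so the model never sees any data from the test domain. Here is a minimal sketch of that idea using scikit-learn's `LeaveOneGroupOut`; the toy data, feature columns, and group labels are illustrative stand-ins, not CarbonBench's actual data loader or API.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-in for tower data: rows = daily observations,
# columns = weather/satellite features; groups = vegetation class per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # hypothetical predictors
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=300)
groups = np.repeat(["forest", "grassland", "tundra"], 100)

splitter = LeaveOneGroupOut()
for train_idx, test_idx in splitter.split(X, y, groups):
    held_out = groups[test_idx][0]
    # Zero-shot rule: no observation from the held-out class in training.
    assert held_out not in set(groups[train_idx])
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    print(f"held-out {held_out}: R^2 = {model.score(X[test_idx], y[test_idx]):.2f}")
```

The key design point is that the split is by *group*, not by random rows: a random shuffle would leak tundra days into the training set and quietly turn a zero-shot test into an easy interpolation test.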

What Did They Find?

  • Time Matters: Models that look at the recent history of weather and vegetation (time-series models) made much better guesses than models that looked at only a single day's snapshot. Knowing a plant's history helps you predict its future far better than a single glance.
  • The "Specialist" Wins: One model called TAM-RL was the MVP. It didn't just get the average right; it was the most consistent. It rarely made "catastrophic failures" (guessing wildly wrong numbers) in the hardest-to-reach places like the Arctic or deep tropics.
  • The Hard Part: Predicting the net balance (how much carbon is actually stored vs. released) is incredibly hard. It's like trying to guess the exact weight of a person by subtracting two huge numbers (food eaten minus waste produced); a tiny error in either number creates a huge error in the final answer.
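The subtraction problem above can be made concrete with arithmetic. The net flux is the small difference between two large gross fluxes (roughly, respiration released minus photosynthesis absorbed), so a modest relative error in either big term becomes a huge relative error in the net. The flux magnitudes below are made-up round numbers for illustration, not values from the paper.

```python
# Hypothetical daily gross fluxes (gC/m^2/day); values are illustrative.
gpp = 10.0              # carbon taken up by photosynthesis
reco = 9.0              # carbon released by respiration
nee_true = reco - gpp   # net flux: -1.0 (net uptake)

# A modest 5% error on each gross term...
gpp_est = gpp * 0.95
reco_est = reco * 1.05
nee_est = reco_est - gpp_est             # -0.05

abs_err = abs(nee_est - nee_true)        # 0.95 gC/m^2/day
rel_err = abs_err / abs(nee_true)        # 95% error on the net flux
print(f"true NEE = {nee_true}, estimated NEE = {nee_est:.2f}, "
      f"relative error = {rel_err:.0%}")
```

A 5% error on each gross term turns into a roughly 95% error on the net balance, which is why the benchmark finds the net flux so much harder to predict than the gross fluxes themselves.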

Why Should You Care?
CarbonBench isn't just a computer science project; it's a tool for saving the planet.

  • Better Climate Policies: Governments need accurate numbers to decide how much carbon they can emit. If the models are bad, policies will be wrong.
  • Finding the Blind Spots: By seeing where the models fail (e.g., in tropical rainforests), scientists know exactly where they need to build more towers or send more satellites.
  • A New Standard: It gives researchers a common language. Instead of arguing about who is best, they can now all run their code on CarbonBench and see who actually wins.

In a Nutshell:
The Earth is breathing, but we can only hear a few spots. CarbonBench is the new training ground and scoreboard that teaches computers how to listen to the whole planet, ensuring that when we try to fix the climate, we aren't just guessing in the dark.