COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Imagine you are trying to describe a specific place on Earth to a friend who has never seen it. You tell them, "It's a flat, green field with a small hill in the middle."

If you asked a traditional computer program to draw this, it would likely give you one specific image: a perfectly flat green field with a perfectly round hill. It would be a "safe" guess, the average of all possible fields. But in reality, that description could be a sunny wheat field in France, a rainy pasture in Ireland, or a snowy meadow in Canada. The description is the same, but the reality is different.

This is the problem COP-GEN solves.

The Problem: The "One Answer" Trap

Earth observation (satellite data) is messy. We have optical cameras (like the one in your phone), radar (which sees through clouds), elevation maps (topography), and land-cover maps.

The relationship between these is one-to-many.

Input: "A forest on a mountain."
Output: It could be a sunny summer forest, a foggy winter forest, a snow-covered forest, or a forest in a storm.

Older AI models act like a strict librarian who only gives you the "average" book. If you ask for a forest, they give you a blurry picture that looks like every forest and no forest at all. They collapse all the possibilities into one safe answer.

The Solution: COP-GEN (The "Imaginative Artist")

COP-GEN is a new AI model designed by researchers at the University of Edinburgh and the European Space Agency. Instead of trying to guess the one right answer, it learns the entire range of possibilities.

Think of COP-GEN not as a calculator, but as a creative artist who understands the rules of physics.

If you show it a map of a mountain and a forest, it doesn't just draw one picture.
It says, "Ah, I can paint a sunny version, a foggy version, or a stormy version. All of these are physically possible."
It generates multiple, diverse, and realistic versions of the same scene.

How It Works: The "Universal Translator"

The world of satellite data is like a group of people speaking different languages:

Optical cameras speak "Visible Light."
Radar speaks "Microwaves."
Elevation maps speak "Height."
Land cover speaks "Types of Ground."

Most AI models struggle to translate between these languages, especially if the data comes in different sizes (some images are high-res, some are low-res).

COP-GEN uses a clever trick called Latent Diffusion Transformers.

The Translator: It first translates each of these "languages" into the same kind of compact code (called "latent tokens"), using a separate encoder for each data type. It's like converting French, German, and Japanese into a universal "Morse code" that the AI understands.
The Artist: It then uses a powerful "Transformer" (a type of AI brain good at understanding context) to mix these codes together.
The Magic: When you ask it to generate an image, it doesn't just copy-paste. It "denoises" the secret code, slowly turning random static into a clear picture, while respecting the rules you gave it.

Why This Matters: The "Weather Forecast" Analogy

Imagine you are a disaster manager. You have a radar image of a storm, but the optical camera is blocked by clouds. You need to know what the ground looks like underneath to plan a rescue.

Old AI: "Here is the average ground." (It might look like a muddy mess, and it misses the specific details needed for rescue.)
COP-GEN: "Here are five possible scenarios. In Scenario A, it's a muddy river. In Scenario B, it's a flooded road. In Scenario C, it's a dry field."

By giving you options instead of a single guess, COP-GEN helps humans understand the uncertainty. It tells you, "It could be this, or it could be that," which is much more useful for making real-world decisions.

The "Zero-Shot" Superpower

The coolest part? COP-GEN is a universal translator that doesn't need to be retrained for every new job.

Want to turn a map into a photo? Done.
Want to turn a photo into a radar image? Done.
Want to fill in missing colors in a photo? Done.

It's like a Swiss Army knife for satellite data. You don't need a different tool for every job; you just tell it what you have and what you want, and it figures out the rest.

Summary

COP-GEN is a breakthrough because it stops trying to force the chaotic, changing Earth into a single, static box. Instead, it embraces the chaos. It understands that for every piece of data, there are many valid realities. By generating many possibilities instead of one average, it creates a more honest, flexible, and useful tool for understanding our planet.

It's the difference between a robot that says, "I am 100% sure this is a field," and an artist who says, "Based on what I see, it could be any of these beautiful, realistic fields."

1. Problem Statement

Earth Observation (EO) applications increasingly rely on integrating data from heterogeneous sensors (optical, radar, elevation, land-cover). A fundamental challenge in this domain is the non-injective nature of cross-modal mappings: a single set of conditioning variables (e.g., terrain elevation and land cover) can correspond to multiple physically plausible observations (e.g., different atmospheric conditions, illumination angles, or spectral appearances).

Limitation of Deterministic Models: Existing generative models for EO often rely on deterministic architectures (e.g., masked autoencoders or UNet-based diffusion). These models tend to collapse toward the conditional mean, producing blurry or averaged outputs that fail to capture the inherent uncertainty and variability of real-world scenes.
Evaluation Mismatch: Standard EO benchmarks use single-reference, pointwise metrics (e.g., MAE, PSNR). These metrics penalize stochastic models that generate diverse, valid outputs, because a sampled output rarely matches the single ground-truth image pixel for pixel, even if the generated distribution is physically correct.
Data Heterogeneity: Current multimodal models often force all data onto a fixed spatial grid, requiring aggressive resampling that destroys native sensor resolutions and physical relationships.
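
The evaluation mismatch can be illustrated with a toy example that is not taken from the paper. It assumes a single pixel that, for the same conditioning, is either dark (0.0) or bright (1.0) with equal probability. A predictor that always outputs the conditional mean (0.5) is compared against one that samples from the correct two-mode distribution. Squared error is used because PSNR is derived from it; all names and values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-to-many target: same conditioning, two equally likely outcomes.
modes = np.array([0.0, 1.0])
truth = rng.choice(modes, size=10_000)

# Deterministic predictor: always outputs the conditional mean.
mean_pred = np.full_like(truth, 0.5)

# Stochastic predictor: samples from the correct distribution,
# independently of which mode the single reference happened to take.
sample_pred = rng.choice(modes, size=truth.shape)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    return float(10.0 * np.log10(max_value ** 2 / mse(a, b)))

mse_mean = mse(mean_pred, truth)      # every error is 0.5, so MSE is 0.25
mse_sample = mse(sample_pred, truth)  # wrong mode about half the time

print(f"conditional mean: MSE={mse_mean:.3f}, PSNR={psnr(mean_pred, truth):.2f} dB")
print(f"correct sampler:  MSE={mse_sample:.3f}, PSNR={psnr(sample_pred, truth):.2f} dB")
```

In this sketch the blurry mean scores better on both metrics even though it never produces a value that actually occurs, while the sampler reproduces the true distribution exactly.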

2. Methodology: COP-GEN Architecture

COP-GEN is a multimodal latent diffusion transformer designed to model the joint probability distribution of heterogeneous Copernicus data at their native resolutions.

A. Data and Modalities

The model is trained on a global dataset of 1,017,469 paired samples derived from MajorTOM, encompassing:

Optical: Sentinel-2 Level 1C and Level 2A (various spectral bands at 10m, 20m, and 60m resolutions).
Radar: Sentinel-1 RTC (VV and VH polarizations).
Topography: Digital Elevation Model (DEM, 30m).
Semantic: Land Use/Land Cover (LULC, 10m).
Metadata: Geolocation (Lat-Lon) and Acquisition Timestamp.

B. Latent Representation Learning

To handle heterogeneous resolutions without aggressive resampling, COP-GEN employs modality-specific Variational Autoencoders (VAEs):

Each modality (or resolution group) is encoded into a compact latent space using independent VAEs trained with a composite loss (L1, MSE, LPIPS, KL-divergence, and adversarial loss).
This preserves the native spatial structure of each sensor (e.g., 10m bands remain distinct from 60m bands).
Scalar inputs (Lat-Lon, Time) are tokenized as global embeddings (Cartesian unit vectors for coordinates; sine-cosine for time).
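
A minimal sketch of the scalar encodings described above. The summary only states "Cartesian unit vectors for coordinates; sine-cosine for time", so the function names, the use of day-of-year as the time variable, and the annual period are assumptions made for illustration.

```python
import numpy as np

def latlon_to_unit_vector(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a 3D point on the unit sphere.

    Unlike raw degrees, this has no discontinuity at the antimeridian.
    """
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    return np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

def day_of_year_to_sincos(day_of_year, period=365.25):
    """Cyclic time encoding (assumed annual period): late December
    ends up next to early January."""
    angle = 2.0 * np.pi * day_of_year / period
    return np.array([np.sin(angle), np.cos(angle)])

# Two points on either side of the antimeridian map to nearby vectors.
east = latlon_to_unit_vector(10.0, 179.9)
west = latlon_to_unit_vector(10.0, -179.9)
print(np.linalg.norm(east - west))
```

The motivation for both encodings is continuity: inputs that are physically close (longitudes 179.9 and -179.9, or 31 December and 1 January) stay close in the embedding.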

C. Unified Transformer Backbone

Tokenization: Image latents are patchified into tokens. Each modality is assigned a modality-specific diffusion timestep ($t^{(i)}$), encoded as dedicated tokens.
Architecture: A U-shaped Vision Transformer (U-ViT) with 20 layers, 1024-dimensional embeddings, and 16 attention heads.
Training Objective: The model learns to jointly predict noise for all modalities simultaneously using the standard DDPM $\epsilon$-prediction objective.
Any-to-Any Conditioning: By controlling the diffusion timestep of specific modalities, the model can perform flexible generation:
- Unconditional: All modalities generated from noise.
- Conditional: Some modalities fixed at $t=0$ (observed), others denoised from noise. This enables zero-shot translation (e.g., DEM + LULC $\to$ Optical) without task-specific retraining.
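
The per-modality timestep mechanism can be sketched as follows. This is not the authors' implementation: the linear beta schedule, the number of steps `T`, the latent shapes, the independent sampling of training timesteps, and the unweighted averaging in `eps_loss` are all assumptions, and the denoising network itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Linear beta schedule: a common DDPM default, assumed here.
betas = np.linspace(1e-4, 0.02, T)
# alpha_bar[t] for t = 0..T, with alpha_bar[0] = 1 so that t = 0 means "clean".
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def add_noise(x0, t):
    """Forward-diffuse a clean latent x0 to timestep t; return (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Hypothetical latents (channels, height, width); sizes may differ per modality.
latents = {
    "dem": rng.standard_normal((4, 4, 4)),
    "lulc": rng.standard_normal((4, 12, 12)),
    "s2_optical": rng.standard_normal((4, 12, 12)),
}

# Training: each modality gets its own timestep (independent draws assumed).
train_t = {name: int(rng.integers(0, T + 1)) for name in latents}
train_inputs = {name: add_noise(x0, train_t[name]) for name, x0 in latents.items()}

# Zero-shot DEM + LULC -> optical: observed modalities stay clean at t = 0,
# the target modality starts from pure noise at t = T.
cond_t = {"dem": 0, "lulc": 0, "s2_optical": T}
start = {
    name: latents[name] if cond_t[name] == 0
    else rng.standard_normal(latents[name].shape)
    for name in latents
}

def eps_loss(eps_pred, eps_true):
    """Epsilon-prediction loss: squared error averaged over every element
    of every modality (the weighting across modalities is an assumption)."""
    total = sum(np.sum((eps_pred[n] - eps_true[n]) ** 2) for n in eps_true)
    count = sum(eps_true[n].size for n in eps_true)
    return float(total / count)
```

Because the timestep is a per-modality input rather than a global one, the same trained network can serve any conditioning pattern: choosing which entries of `cond_t` are 0 selects the task at inference time.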

3. Key Contributions

Stochastic-by-Design Modeling: COP-GEN explicitly models the one-to-many relationship in EO data, generating diverse, physically consistent realizations rather than collapsing to a conditional mean.
Native-Resolution Multimodal Fusion: Unlike prior works that resample data to a common grid, COP-GEN processes modalities at their native resolutions via modality-specific tokenization, preserving physical fidelity.
Flexible Any-to-Any Generation: The architecture supports zero-shot translation between any subset of modalities (e.g., infilling missing spectral bands, translating radar to optical, or generating metadata from imagery) within a single unified framework.
Novel Evaluation Protocol: The authors introduce a "Peak-Capability" (Oracle) evaluation metric. Instead of averaging performance over all generated samples, they select the best of several generations for each tile to measure the upper bound of the model's representational capacity, acknowledging the stochastic nature of the task.
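
A minimal sketch of how a best-of-N ("peak") score differs from a plain average, using MAE. The array layout, the function name `peak_and_mean_mae`, and the choice to average the per-tile minima across tiles are assumptions; the paper's exact protocol may differ.

```python
import numpy as np

def peak_and_mean_mae(samples, reference):
    """Compare best-of-N and average MAE.

    samples:   array of shape (n_tiles, n_samples, H, W), generated outputs
    reference: array of shape (n_tiles, H, W), one ground truth per tile
    Returns (peak_mae, mean_mae), each averaged over tiles.
    """
    # MAE of every sample against its tile's reference: (n_tiles, n_samples).
    per_sample = np.abs(samples - reference[:, None]).mean(axis=(2, 3))
    peak = per_sample.min(axis=1).mean()  # best sample per tile
    mean = per_sample.mean()              # all samples weighted equally
    return float(peak), float(mean)

# Toy data: 2 tiles, 3 samples each; sample k is a constant image of value k.
reference = np.zeros((2, 2, 2))
samples = np.stack([np.full((2, 2, 2), float(k)) for k in range(3)], axis=1)

peak, mean = peak_and_mean_mae(samples, reference)
print(peak, mean)  # 0.0 1.0
```

The peak score rewards a model whose sample set contains the reference, while the mean score penalizes every valid but different sample.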

4. Results and Analysis

Experiments were conducted on a large-scale global dataset, comparing COP-GEN against TerraMind (a strong deterministic baseline).

Quantitative Performance (Peak Capability)

Optical & Radar: COP-GEN significantly outperforms TerraMind in Peak MAE and PSNR for generating S2L1C, S2L2A, and S1RTC, demonstrating superior ability to capture high-fidelity details.
DEM Reconstruction: COP-GEN achieves an MAE of 26.80 compared to TerraMind's 145.62, highlighting its ability to model complex topographic variations.
Geolocation: TerraMind performs slightly better on Lat-Lon regression, which is consistent with its deterministic collapse to a single estimate, while COP-GEN produces a distribution of plausible locations.

Qualitative & Distributional Analysis

Output Diversity: COP-GEN generates diverse scenes with varying illumination, atmospheric conditions, and spectral appearances while respecting terrain constraints. TerraMind produces visually identical, "blurry" outputs.
Uncertainty Calibration: As conditioning information increases (e.g., adding LULC, then Radar, then Timestamp), COP-GEN's output distribution systematically narrows toward the ground truth, demonstrating that the model learns to modulate uncertainty based on available data.
Spatial Priors: When conditioned only on LULC (e.g., "Trees"), COP-GEN predicts geolocations concentrated in forested regions globally, whereas TerraMind collapses to a few specific memorized locations.
Band Infilling: The model successfully reconstructs missing spectral bands (e.g., generating 20m/60m bands from 10m inputs) and auxiliary modalities (DEM, LULC) from partial inputs.

5. Significance and Future Work

Significance:
COP-GEN establishes a new paradigm for Earth Observation generative modeling by shifting from deterministic reconstruction to stochastic distribution modeling. It addresses the physical reality that remote sensing data is inherently uncertain and underdetermined. The paper also argues for a shift in evaluation metrics, moving away from single-reference pointwise comparisons toward distributional and peak-capability analyses.

Limitations & Future Directions:

Metadata Influence: Currently, geolocation and timestamp conditioning have a limited impact on the output (e.g., snow can appear in tropical regions), likely due to token imbalance in the loss function.
Joint Training: The current scheme generates all modalities jointly. Future work may explore stochastic modality dropout to improve marginal understanding of individual sensors.
Temporal Dynamics: Future iterations aim to model temporal sequences explicitly to simulate Earth system dynamics rather than static snapshots.
Scalability: The authors plan to extend the model to higher resolutions and additional sensor types.

In conclusion, COP-GEN provides a principled framework for generating multimodal Earth observation data that respects physical constraints while capturing the inherent variability of the natural world, offering a robust alternative to deterministic foundation models.