Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

This paper introduces Any2Any, a unified latent diffusion framework that enables efficient and generalizable arbitrary modality translation in remote sensing by projecting heterogeneous inputs into a shared geometrically aligned latent space, supported by the newly proposed million-scale RST-1M dataset.

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

Published 2026-03-05

Imagine you are a detective trying to solve a mystery about a specific neighborhood on Earth. You have a toolbox full of different "eyes" to look at this neighborhood:

  1. The Daylight Eye (RGB): Takes beautiful, colorful photos like a standard camera.
  2. The Night-Vision Eye (SAR): Uses radar to see through clouds and darkness, but the images look like grainy, black-and-white static.
  3. The Heat-Sensing Eye (NIR): Sees near-infrared light (just beyond visible red, not literal heat), which healthy plants reflect strongly, making it great for spotting vegetation.
  4. The Super-Sharp Eye (PAN): Sees in black and white but with incredible detail.
  5. The Rainbow Eye (MS): Sees many specific colors of light we can't usually see.

The Problem:
In the real world, these eyes rarely work together perfectly. Sometimes you only have the Night-Vision Eye (because it's cloudy), but you need the Daylight Eye to see what the buildings actually look like. Sometimes you have the Rainbow Eye, but you need the Super-Sharp Eye.

Previously, scientists tried to solve this by building a separate "translator" for every single pair of eyes.

  • Need to translate Night-Vision to Daylight? Build Translator A.
  • Need to translate Heat to Rainbow? Build Translator B.
  • Need to translate Super-Sharp to Heat? Build Translator C.

If you have 5 types of eyes, you need 20 different translators! This is expensive, slow, and if you want to translate between two eyes you haven't seen before, you're stuck. It's like having a dictionary that only translates English to French, and another that only translates French to German, but no single book that can translate anything to anything.
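The bookkeeping above is simple combinatorics, sketched here in a few lines of Python (the modality names come from this article; the counts follow from ordered source-target pairs):

```python
# For N sensor modalities, the pair-by-pair approach needs one dedicated
# translator per ordered (source, target) pair; Any2Any replaces them all
# with a single shared model.
modalities = ["RGB", "SAR", "NIR", "PAN", "MS"]

n = len(modalities)
pairwise_models = n * (n - 1)   # one translator per ordered pair
unified_models = 1              # one shared model covers every pair

print(pairwise_models)  # 20
print(unified_models)   # 1
```

Note that the pairwise count grows quadratically: adding a sixth sensor would push it from 20 to 30 dedicated translators, while the unified model stays at one.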

The Solution: Any2Any
The researchers behind this paper, "Any2Any," built a Universal Translator and a massive Training Library to fix this.

1. The Training Library (RST-1M)

Imagine trying to teach a student to translate languages, but you only have a few scattered sentences. They will never learn the rules well.
The authors created RST-1M, a massive library containing 1.2 million pairs of images. It's like a giant photo album where every single neighborhood is photographed by all 5 different "eyes" at the exact same time, perfectly aligned. This gives the AI a perfect reference to learn how the world looks through every lens.
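To make "perfectly aligned" concrete, here is an illustrative sketch of what one co-registered training sample could look like. This is NOT the actual RST-1M schema; the field names, patch size, and band counts are invented for illustration:

```python
# Hypothetical layout of one aligned multi-sensor sample: the same scene
# captured by all five "eyes" on the same spatial grid. Values here are
# just (height, width, channels) shape tuples, not real imagery.
H, W = 256, 256  # assumed patch size, purely for illustration

sample = {
    "scene_id": "scene_000042",  # hypothetical identifier
    "rgb": (H, W, 3),            # optical, 3 color channels
    "sar": (H, W, 1),            # radar backscatter, 1 channel
    "nir": (H, W, 1),            # near-infrared, 1 channel
    "pan": (H, W, 1),            # panchromatic, 1 sharp channel
    "ms":  (H, W, 4),            # multispectral, e.g. 4 bands (assumed)
}

# "Perfectly aligned" means every modality shares the same spatial grid:
aligned = all(shape[:2] == (H, W)
              for key, shape in sample.items() if key != "scene_id")
print(aligned)  # True
```

Pixel-level alignment is what lets the model learn direct cross-sensor correspondences instead of having to guess which building in one image matches which blob in another.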

2. The Universal Translator (The Model)

Instead of building 20 different translators, they built one single brain (a unified AI model) that can handle any translation. Here is how it works, using a simple analogy:

  • The "Secret Code" Room (Latent Space):
    Imagine all 5 types of cameras take their photos and run them through a special machine that converts them into a universal "Secret Code."

    • The Night-Vision photo becomes Code A.
    • The Daylight photo becomes Code B.
    • Even though the original photos look totally different, their "Secret Codes" live in the same room and describe the same neighborhood.
    • This is the Shared Latent Space. It strips away the weirdness of the specific camera and focuses on the actual geography (the roads, buildings, trees).
  • The Translator (The Diffusion Model):
    Once the AI has the "Secret Code" of the source image (e.g., Night-Vision), a shared diffusion model generates the "Secret Code" of the target image (e.g., Daylight), starting from noise and refining it step by step while guided by the source code. It's like taking a sketch of a house and asking, "If I colored this in, what would it look like?"

  • The "Fine-Tuning" Tool (Residual Adapters):
    Sometimes, the Night-Vision camera sees things slightly differently than the Daylight camera (maybe the radar bounces off a roof differently than light does). The AI might get the general shape right but the texture wrong.
    To fix this, they added tiny, lightweight "patches" (Adapters) for each camera type. Think of these as specialized glasses. If the AI is translating to a Daylight photo, it puts on the "Daylight Glasses" to correct the colors and details. If it's translating to Night-Vision, it swaps to the "Radar Glasses."
    Crucially, these glasses are so light that they don't slow the AI down.
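The three pieces above fit together as a single pipeline. Here is a toy structural sketch of that flow; every name is invented for illustration, and the shared backbone (a latent diffusion model in the paper) is stubbed out rather than implemented:

```python
# Toy sketch of the Any2Any flow: encode into a shared latent space,
# translate with one shared backbone, then apply a lightweight
# per-modality adapter. All logic here is a placeholder.
def encode(image, modality):
    # Modality-specific encoder -> shared latent "Secret Code".
    return {"code": image, "source": modality}

def shared_translator(code, target):
    # One shared backbone predicts the target modality's latent code
    # (the real model does this via latent diffusion; stubbed here).
    return {"code": code["code"], "target": target}

# Tiny per-modality residual adapters (the "glasses"); identity stubs here.
ADAPTERS = {m: (lambda z: z) for m in ["RGB", "SAR", "NIR", "PAN", "MS"]}

def translate(image, src, tgt):
    z = encode(image, src)             # project into the shared latent space
    z_tgt = shared_translator(z, tgt)  # same brain for every (src, tgt) pair
    return ADAPTERS[tgt](z_tgt)        # swap in the target's tiny adapter

result = translate("sar_patch", "SAR", "RGB")
print(result["target"])  # RGB
```

The key design point is visible in `translate`: only the final adapter lookup depends on the target modality, so the expensive shared backbone is reused for every direction.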

Why is this a Big Deal?

  1. One Size Fits All: You don't need 20 translators anymore. You have one "Any2Any" brain. If you want to translate from Eye A to Eye B, or Eye C to Eye D, you just use the same brain.
  2. Zero-Shot Magic: Because the AI learned the "Secret Code" of the world so well, it can even translate between two types of eyes it never saw paired together during training.
    • Example: If the AI learned how "Night-Vision" looks and how "Rainbow" looks, but never saw them side-by-side, it can still figure out how to turn Night-Vision into Rainbow by using the "Secret Code" as a bridge. It's like being fluent in both English and Japanese: even if you've never seen a direct English-to-Japanese translation, you can produce one because both languages map to the same underlying ideas.
  3. Efficiency: Training and storing one shared model is far cheaper than training and storing up to 20 separate pairwise translators, saving massive amounts of compute and storage.
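The zero-shot claim can be made concrete with a little set arithmetic. Because every encoder and decoder targets one common latent space, any source can in principle be chained with any target, including pairs never trained together. The training pairs below are hypothetical, not the paper's actual setup:

```python
# Why a shared latent space enables zero-shot pairs (illustrative):
# every translation direction is reachable at test time, even if only a
# few (source, target) pairs appeared side-by-side during training.
modalities = {"RGB", "SAR", "NIR", "PAN", "MS"}

# Hypothetical subset of pairs seen during training -- NOT the paper's setup.
trained_pairs = {("SAR", "RGB"), ("RGB", "NIR"), ("PAN", "MS")}

all_pairs = {(s, t) for s in modalities for t in modalities if s != t}
zero_shot_pairs = all_pairs - trained_pairs

print(len(all_pairs))                    # 20
print(("SAR", "MS") in zero_shot_pairs)  # True: unseen together, still reachable
```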

The Result

The paper shows that this new system is much better at creating realistic images than the old "pair-by-pair" methods. It produces clearer, more accurate images of our planet, no matter which sensors are available.

In short: They built a massive library of perfectly matched photos and taught a single, super-smart AI to speak the "language of the Earth" in any dialect (sensor) you throw at it, making Earth observation faster, cheaper, and more flexible.