Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

This paper introduces Any2Any, a unified latent diffusion framework that enables efficient and generalizable arbitrary modality translation in remote sensing by projecting heterogeneous inputs into a shared geometrically aligned latent space, supported by the newly proposed million-scale RST-1M dataset.

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

Published 2026-03-05

Imagine you are a detective trying to solve a mystery about a specific neighborhood on Earth. You have a toolbox full of different "eyes" to look at this neighborhood:

  1. The Daylight Eye (RGB): Takes beautiful, colorful photos like a standard camera.
  2. The Night-Vision Eye (SAR): Uses radar to see through clouds and darkness, but the images look like grainy, black-and-white static.
  3. The Heat-Sensing Eye (NIR): Sees near-infrared light (just beyond visible red, not literal heat), which healthy plants reflect strongly, making it great for spotting vegetation.
  4. The Super-Sharp Eye (PAN): Sees in black and white but with incredible detail.
  5. The Rainbow Eye (MS): Sees many specific colors of light we can't usually see.

The Problem:
In the real world, these eyes rarely work together perfectly. Sometimes you only have the Night-Vision Eye (because it's cloudy), but you need the Daylight Eye to see what the buildings actually look like. Sometimes you have the Rainbow Eye, but you need the Super-Sharp Eye.

Previously, scientists tried to solve this by building a separate "translator" for every single pair of eyes.

  • Need to translate Night-Vision to Daylight? Build Translator A.
  • Need to translate Heat to Rainbow? Build Translator B.
  • Need to translate Super-Sharp to Heat? Build Translator C.

If you have 5 types of eyes, you need 20 different translators! This is expensive, slow, and if you want to translate between two eyes you haven't seen before, you're stuck. It's like having a dictionary that only translates English to French, and another that only translates French to German, but no single book that can translate anything to anything.
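The bookkeeping above is simple combinatorics, sketched here in a few lines of Python (the modality names come from this article; the counts follow from ordered source-target pairs):

```python
# For N sensor modalities, the pair-by-pair approach needs one dedicated
# translator per ordered (source, target) pair; Any2Any replaces them all
# with a single shared model.
modalities = ["RGB", "SAR", "NIR", "PAN", "MS"]

n = len(modalities)
pairwise_models = n * (n - 1)   # one translator per ordered pair
unified_models = 1              # one shared model covers every pair

print(pairwise_models)  # 20
print(unified_models)   # 1
```

Note that the pairwise count grows quadratically: adding a sixth sensor would push it from 20 to 30 dedicated translators, while the unified model stays at one.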

The Solution: Any2Any
The researchers behind this paper, "Any2Any," built a Universal Translator and a massive Training Library to fix this.

1. The Training Library (RST-1M)

Imagine trying to teach a student to translate languages, but you only have a few scattered sentences. They will never learn the rules well.
The authors created RST-1M, a massive library containing 1.2 million pairs of images. It's like a giant photo album where every single neighborhood is photographed by all 5 different "eyes" at the exact same time, perfectly aligned. This gives the AI a perfect reference to learn how the world looks through every lens.
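To make "perfectly aligned" concrete, here is an illustrative sketch of what one co-registered training sample could look like. This is NOT the actual RST-1M schema; the field names, patch size, and band counts are invented for illustration:

```python
# Hypothetical layout of one aligned multi-sensor sample: the same scene
# captured by all five "eyes" on the same spatial grid. Values here are
# just (height, width, channels) shape tuples, not real imagery.
H, W = 256, 256  # assumed patch size, purely for illustration

sample = {
    "scene_id": "scene_000042",  # hypothetical identifier
    "rgb": (H, W, 3),            # optical, 3 color channels
    "sar": (H, W, 1),            # radar backscatter, 1 channel
    "nir": (H, W, 1),            # near-infrared, 1 channel
    "pan": (H, W, 1),            # panchromatic, 1 sharp channel
    "ms":  (H, W, 4),            # multispectral, e.g. 4 bands (assumed)
}

# "Perfectly aligned" means every modality shares the same spatial grid:
aligned = all(shape[:2] == (H, W)
              for key, shape in sample.items() if key != "scene_id")
print(aligned)  # True
```

Pixel-level alignment is what lets the model learn direct cross-sensor correspondences instead of having to guess which building in one image matches which blob in another.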

2. The Universal Translator (The Model)

Instead of building 20 different translators, they built one single brain (a unified AI model) that can handle any translation. Here is how it works, using a simple analogy:

  • The "Secret Code" Room (Latent Space):
    Imagine all 5 types of cameras take their photos and run them through a special machine that converts them into a universal "Secret Code."

    • The Night-Vision photo becomes Code A.
    • The Daylight photo becomes Code B.
    • Even though the original photos look totally different, their "Secret Codes" live in the same room and describe the same neighborhood.
    • This is the Shared Latent Space. It strips away the weirdness of the specific camera and focuses on the actual geography (the roads, buildings, trees).
  • The Translator (The Diffusion Model):
    Once the AI has the "Secret Code" of the source image (e.g., Night-Vision), a shared diffusion model generates the "Secret Code" of the target image (e.g., Daylight), starting from noise and refining it step by step while guided by the source code. It's like taking a sketch of a house and asking, "If I colored this in, what would it look like?"

  • The "Fine-Tuning" Tool (Residual Adapters):
    Sometimes, the Night-Vision camera sees things slightly differently than the Daylight camera (maybe the radar bounces off a roof differently than light does). The AI might get the general shape right but the texture wrong.
    To fix this, they added tiny, lightweight "patches" (Adapters) for each camera type. Think of these as specialized glasses. If the AI is translating to a Daylight photo, it puts on the "Daylight Glasses" to correct the colors and details. If it's translating to Night-Vision, it swaps to the "Radar Glasses."
    Crucially, these glasses are so light that they don't slow the AI down.
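The three pieces above fit together as a single pipeline. Here is a toy structural sketch of that flow; every name is invented for illustration, and the shared backbone (a latent diffusion model in the paper) is stubbed out rather than implemented:

```python
# Toy sketch of the Any2Any flow: encode into a shared latent space,
# translate with one shared backbone, then apply a lightweight
# per-modality adapter. All logic here is a placeholder.
def encode(image, modality):
    # Modality-specific encoder -> shared latent "Secret Code".
    return {"code": image, "source": modality}

def shared_translator(code, target):
    # One shared backbone predicts the target modality's latent code
    # (the real model does this via latent diffusion; stubbed here).
    return {"code": code["code"], "target": target}

# Tiny per-modality residual adapters (the "glasses"); identity stubs here.
ADAPTERS = {m: (lambda z: z) for m in ["RGB", "SAR", "NIR", "PAN", "MS"]}

def translate(image, src, tgt):
    z = encode(image, src)             # project into the shared latent space
    z_tgt = shared_translator(z, tgt)  # same brain for every (src, tgt) pair
    return ADAPTERS[tgt](z_tgt)        # swap in the target's tiny adapter

result = translate("sar_patch", "SAR", "RGB")
print(result["target"])  # RGB
```

The key design point is visible in `translate`: only the final adapter lookup depends on the target modality, so the expensive shared backbone is reused for every direction.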

Why is this a Big Deal?

  1. One Size Fits All: You don't need 20 translators anymore. You have one "Any2Any" brain. If you want to translate from Eye A to Eye B, or Eye C to Eye D, you just use the same brain.
  2. Zero-Shot Magic: Because the AI learned the "Secret Code" of the world so well, it can even translate between two types of eyes it never saw paired together during training.
    • Example: If the AI learned how "Night-Vision" looks and how "Rainbow" looks, but never saw them side-by-side, it can still figure out how to turn Night-Vision into Rainbow by using the "Secret Code" as a bridge. It's like being fluent in both English and Japanese: even if you've never seen a direct English-to-Japanese translation, you can produce one because both languages map to the same underlying ideas.
  3. Efficiency: Training and storing one shared model is far cheaper than training and storing up to 20 separate pairwise translators, saving massive amounts of compute and storage.
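The zero-shot claim can be made concrete with a little set arithmetic. Because every encoder and decoder targets one common latent space, any source can in principle be chained with any target, including pairs never trained together. The training pairs below are hypothetical, not the paper's actual setup:

```python
# Why a shared latent space enables zero-shot pairs (illustrative):
# every translation direction is reachable at test time, even if only a
# few (source, target) pairs appeared side-by-side during training.
modalities = {"RGB", "SAR", "NIR", "PAN", "MS"}

# Hypothetical subset of pairs seen during training -- NOT the paper's setup.
trained_pairs = {("SAR", "RGB"), ("RGB", "NIR"), ("PAN", "MS")}

all_pairs = {(s, t) for s in modalities for t in modalities if s != t}
zero_shot_pairs = all_pairs - trained_pairs

print(len(all_pairs))                    # 20
print(("SAR", "MS") in zero_shot_pairs)  # True: unseen together, still reachable
```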

The Result

The paper shows that this new system is much better at creating realistic images than the old "pair-by-pair" methods. It produces clearer, more accurate images of our planet, no matter which sensors are available.

In short: They built a massive library of perfectly matched photos and taught a single, super-smart AI to speak the "language of the Earth" in any dialect (sensor) you throw at it, making Earth observation faster, cheaper, and more flexible.