Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

This paper proposes the Self-Supervised Semantic Bridge (SSB), a framework that leverages self-supervised visual encoders to create a shared geometric latent space for diffusion bridge models, enabling high-fidelity, spatially faithful unpaired image-to-image translation without cross-domain supervision.

Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis

Published 2026-02-19

Imagine you are a master translator, but instead of translating words between French and English, you are translating images between two completely different worlds.

Let's say you want to turn an MRI scan (which looks like a fuzzy, gray ghost of a body) into a CT scan (which looks like a sharp, high-contrast X-ray). Or you want to turn a photo of a horse into a zebra, or a summer landscape into a winter wonderland.

The problem? You don't have a "dictionary" that pairs every single MRI with its matching CT scan. You have a pile of MRIs and a separate pile of CTs, but they aren't lined up. This is called Unpaired Image-to-Image Translation.

For a long time, computers struggled with this. They either got the shapes wrong (the zebra stripes appeared, but the horse's legs vanished) or got the colors right while making a mess of the anatomy.

Enter the new method from this paper: SSB (Self-Supervised Semantic Bridge).

Here is how it works, using a simple analogy:

The Problem: The "Lost in Translation" Effect

Imagine you are trying to translate a book from a language you don't speak (Source) to another language you don't speak (Target), and you have no dictionary.

  • Old Method A (The Adversarial Approach): You hire two people. One tries to fake a translation, and the other tries to catch them lying. They argue back and forth until they agree on a translation. But if the "fake" translator gets too creative, the story loses its original meaning.
  • Old Method B (The Inversion Approach): You try to turn the book into pure noise (static) and then try to rebuild it in the new language. But because the "noise" is messy, you often lose the plot details. The story comes back, but the characters are in the wrong places.

The Solution: The "Universal Blueprint" (The Semantic Bridge)

The authors of this paper realized that while the appearances of an MRI and a CT scan are totally different (gray vs. white, fuzzy vs. sharp), the skeleton underneath is exactly the same. The liver is in the same spot; the spine curves the same way.

They built a Semantic Bridge. Think of this bridge as a Universal Blueprint or a 3D Wireframe.

  1. The Smart Architect (The Encoder):
    They used a special AI (a self-supervised vision model called DINO) that acts like a super-architect. This architect doesn't care about the paint color, the wallpaper, or the lighting. It only cares about the structure.

    • If you show it a photo of a horse, it ignores the brown fur and sees the "Horse Shape."
    • If you show it a photo of a zebra, it ignores the black-and-white stripes and sees the "Horse Shape."
    • If you show it an MRI, it ignores the gray fuzz and sees the "Body Shape."
    • If you show it a CT, it ignores the white bones and sees the "Body Shape."

    This architect creates a Shared Blueprint (a "latent space") that is identical for both the horse and the zebra, or the MRI and the CT.
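    The shared-blueprint idea can be sketched in a few lines of numpy. This is a toy stand-in, not the paper's DINO encoder: the "encoder" below just keeps edge structure and throws away intensity, so the same shape rendered in two very different "domains" lands on nearly the same feature map, while a different shape does not.

    ```python
    import numpy as np

    # One underlying shape, rendered in two very different "domains" (styles).
    yy, xx = np.mgrid[:32, :32]
    circle = ((yy - 16) ** 2 + (xx - 16) ** 2 < 64).astype(float)
    square = np.zeros((32, 32)); square[8:24, 8:24] = 1.0

    rng = np.random.default_rng(0)
    mri_circle = 0.4 * circle + 0.3 + 0.01 * rng.standard_normal(circle.shape)
    ct_circle = circle.copy()   # sharp, high-contrast rendering
    ct_square = square.copy()

    def structure_features(img):
        """Toy 'semantic' encoder: keep only edge structure, drop intensity."""
        gy, gx = np.gradient(img)
        edges = np.hypot(gx, gy)
        return (edges > 0.25 * edges.max()).astype(float)

    def overlap(a, b):
        """IoU of two binary structure maps."""
        return (a * b).sum() / max(np.maximum(a, b).sum(), 1.0)

    same_shape = overlap(structure_features(mri_circle), structure_features(ct_circle))
    diff_shape = overlap(structure_features(mri_circle), structure_features(ct_square))
    print(same_shape > diff_shape)   # the blueprint ignores the "paint"
    ```

    The real encoder is a learned vision transformer, of course; the point of the toy is only that a representation built from structure, not intensity, maps both "domains" of the same shape to (nearly) the same place.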

  2. The Construction Crew (The Diffusion Bridge):
    Once the blueprint is made, the computer doesn't need to guess how to turn an MRI into a CT. It just needs to say: "Here is the blueprint. Now, build a CT scan that fits this blueprint."

    • It takes the MRI, extracts the blueprint.
    • It uses that blueprint to guide a "construction crew" (a diffusion model) to build a brand new CT scan from scratch, ensuring the bones and organs are in the exact right places.
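    Those two steps can be cartooned as follows (all functions here are illustrative toys, not the paper's actual model): extract a structure map from the fuzzy source, then let an iterative "denoiser" build a CT-style image starting from pure noise, guided only by that map. It never sees a paired CT.

    ```python
    import numpy as np

    def extract_blueprint(img):
        """Toy encoder: recover the shape mask from a fuzzy, low-contrast scan."""
        return (img > img.mean()).astype(float)

    def build_ct(blueprint, rng, steps=20):
        """Toy 'construction crew': start from noise and iteratively denoise
        toward a sharp rendering that fits the blueprint (the conditioning)."""
        x = rng.standard_normal(blueprint.shape)   # pure static
        target = 0.95 * blueprint + 0.02           # CT-style look for this shape
        for _ in range(steps):
            x = x + 0.3 * (target - x)             # one refinement step
        return x

    rng = np.random.default_rng(0)
    yy, xx = np.mgrid[:32, :32]
    mask = ((yy - 16) ** 2 + (xx - 16) ** 2 < 64).astype(float)
    mri = 0.4 * mask + 0.3 + 0.01 * rng.standard_normal(mask.shape)  # fuzzy source

    ct = build_ct(extract_blueprint(mri), rng)
    # Anatomy preserved: the generated image tracks the true underlying shape.
    corr = np.corrcoef(ct.ravel(), mask.ravel())[0, 1]
    print(corr)
    ```

    The real system replaces the hand-coded `target` with a learned diffusion bridge, but the division of labor is the same: the blueprint pins down *where* everything goes, and the generator only decides *how it looks*.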

Why is this a Big Deal?

1. No "Pairing" Required:
You don't need to find a patient who has both an MRI and a CT scan taken at the exact same time. You can just throw a million MRIs and a million CTs into the system, and the "Architect" will figure out the shared shapes on its own. This is like learning to translate French and English just by reading a million French books and a million English books, without ever seeing a sentence that is written in both.

2. It Works on "Weird" Data (Out-of-Distribution):
Imagine you trained your translator on French books about cats, but then someone hands you a French book about space travel. Old methods would get confused and output gibberish.
Because SSB focuses on the structure (the grammar and logic) rather than just the specific words (the appearance), it can handle new types of MRI scans it has never seen before. It knows that "a kidney looks like a kidney" even if the contrast is weird.

3. It's a One-Way Street (Scalability):
If you want to add a third language (say, Ultrasound), you don't need to pair Ultrasound with MRI and Ultrasound with CT. You just teach the Architect how to read Ultrasound blueprints. Now, the bridge connects everything automatically. It's much cheaper and faster to scale.
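The scaling argument can be made concrete with a toy "hub" (all names and the linear "styles" here are illustrative assumptions, not the paper's API): each domain contributes one encoder into the shared blueprint space and one decoder out of it, so adding a domain is one new entry rather than a new model per existing pair.

```python
def make_domain(gain, bias):
    """Toy domain: appearance = a linear 'style' applied to the blueprint."""
    encode = lambda img: (img - bias) / gain   # strip appearance -> blueprint
    decode = lambda bp: gain * bp + bias       # re-apply the target appearance
    return {"enc": encode, "dec": decode}

domains = {"mri": make_domain(0.4, 0.3), "ct": make_domain(1.0, 0.0)}

def translate(pixel, src, dst):
    blueprint = domains[src]["enc"](pixel)     # into the shared space
    return domains[dst]["dec"](blueprint)      # out into the target domain

# Adding a third domain touches nothing that already exists:
domains["ultrasound"] = make_domain(0.6, 0.2)

# A structure value of 0.7, seen through the MRI style, translated to CT:
print(translate(0.4 * 0.7 + 0.3, "mri", "ct"))   # ~0.7
```

With N domains, pairwise translation needs on the order of N² models; the hub needs only N encoder/decoder pairs, which is why adding Ultrasound is cheap.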

Real-World Magic

The paper shows this working on:

  • Medical Imaging: Turning fuzzy MRIs into sharp CTs so doctors can plan surgery without needing a CT scan (which uses radiation).
  • Art & Editing: Turning a summer photo into a winter one, or changing a horse into a zebra, while keeping the horse's pose and the background perfectly intact.

The Catch (Limitations)

The system is great at changing the "skin" (color, texture, style) while keeping the "bones" (shape, structure) the same.

  • It struggles with big changes: If you ask it to turn a small lizard into a giant dragon, it gets confused. The "blueprint" says "lizard," but the prompt says "dragon." The system tries to keep the lizard shape, so the result might look like a weird, stretched lizard-dragon hybrid. It's too stubborn to completely change the geometry.
  • Abstract inputs: If you try to translate a stick-figure drawing into a photo, it might fail because the "blueprint" isn't detailed enough to start with.

The Bottom Line

SSB is like a master translator who ignores the accent and the vocabulary and focuses entirely on the story's plot. By building a shared "structural blueprint" between two different worlds, it allows computers to translate images with incredible accuracy, even when they've never seen the two images paired together before. It's a huge step forward for medical imaging and creative AI editing.
