Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction

This paper proposes Robust-MMR, a self-supervised pre-training framework that combines asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to learn domain-invariant medical vision-language representations. Across multiple benchmarks, it significantly improves cross-domain performance and robustness to perturbations.

Melika Filvantorkaman, Mohsen Piri

Published 2026-02-23

Imagine you are training a brilliant medical student to read X-rays and understand doctors' notes.

In the old way of doing this (the "standard" method), you would show the student thousands of X-rays and reports from one specific hospital. You'd teach them to recognize a broken bone or a tumor based on exactly how that hospital's machines look and how their doctors write.

The Problem:
When you send this student to a different hospital, things go wrong.

  • The X-ray machines are different brands (maybe the images look grainier or brighter).
  • The doctors write their notes in a different style (maybe they use different abbreviations or sentence structures).
  • The student freezes. They can't tell a broken bone from a shadow because they were only taught to recognize the specific look of the first hospital's data. They learned the "accent" of the data, not the actual medical facts.

The Solution: "Robust-MMR"
The authors of this paper propose a new training method called Robust-MMR. Think of it as a "survival training" boot camp for AI models. Instead of just teaching the model to fill in the blanks, they teach it to fill in the blanks even when the world is falling apart around it.

Here is how they do it, using simple analogies:

1. The "Blindfold and Noise" Game (Asymmetric Perturbation)

In standard training, the AI sees a clear picture and a clear sentence, and has to guess the missing parts.
In Robust-MMR, the trainers deliberately mess things up:

  • The Image: They take the X-ray and add static noise, blur it, or cut out a chunk of it (simulating a bad scanner or a damaged film).
  • The Text: They take the doctor's report, delete random sentences, or swap words for synonyms (simulating a messy handwritten note or a different doctor's style).
  • The Twist: Sometimes they ruin the image but leave the text perfect. Other times, they ruin the text but leave the image clear.

The Lesson: The AI learns that it can't rely on just one source. If the picture is blurry, it must read the text to understand what's happening. If the text is missing, it must look at the picture. It learns to be a detective who can solve a crime even if half the evidence is missing.
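The "twist" above, corrupting exactly one modality at a time, can be sketched in a few lines of Python. This is an illustrative toy (the function names, noise model, and probabilities are my assumptions, not the paper's actual implementation, which operates on tensors inside a masked-reconstruction pipeline):

```python
import random

def perturb_image(pixels, noise_level=0.3):
    """Add random 'static' to a flattened grayscale image (values in [0, 1])."""
    return [min(1.0, max(0.0, p + random.uniform(-noise_level, noise_level)))
            for p in pixels]

def perturb_text(tokens, drop_prob=0.3):
    """Randomly delete tokens, simulating a messy or truncated report."""
    return [t for t in tokens if random.random() > drop_prob]

def asymmetric_perturb(image, tokens):
    """Corrupt exactly one modality per sample, leaving the other clean,
    so the model must lean on whichever input is still intact."""
    if random.random() < 0.5:
        return perturb_image(image), tokens   # noisy image, clean text
    return image, perturb_text(tokens)        # clean image, noisy text
```

Because only one side is ever degraded, every training sample still contains one reliable signal, which is what lets the model learn a cross-modal fallback instead of just memorizing noise.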

2. The "Universal Translator" (Domain-Consistency)

Imagine you have two students: one trained in New York and one in Tokyo.

  • The New York student sees a "heart attack" on an X-ray and hears the word "infarction."
  • The Tokyo student sees the same heart attack on a different machine and hears the word "MI."

Standard AI might think these are two different diseases because the words and images look different.
Robust-MMR forces the AI to realize: "Wait, even though the machine looks different and the word is different, the meaning is the same."

They use a special rule that says: "If two cases mean the same thing medically, your brain must treat them as identical, no matter where they came from." This teaches the AI to ignore the "accent" (the scanner brand or writing style) and focus on the "soul" (the actual disease).
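In code, a rule like "treat these as identical" usually becomes a consistency loss: a penalty on the distance between the model's embeddings of the same case seen through two different domains. A minimal sketch, assuming a simple mean-squared distance (the paper's actual loss and embedding model will differ):

```python
def consistency_loss(emb_a, emb_b):
    """Mean squared distance between embeddings of the SAME medical case
    observed under two different 'domains' (scanner brands, writing styles).
    Driving this toward zero forces the model to ignore the domain 'accent'."""
    assert len(emb_a) == len(emb_b), "embeddings must share a dimension"
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)) / len(emb_a)
```

During training, this penalty is added to the usual reconstruction objective: identical cases from New York and Tokyo are pushed to the same spot in embedding space, while genuinely different diseases remain free to stay apart.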

3. The "Safety Net" (Modality Resilience)

In the real world, data is often incomplete. You might have an X-ray but no report, or a report but a blurry photo.
Standard AI often crashes or guesses wildly when data is missing.
Robust-MMR trains the AI to be resilient. It practices scenarios where one part of the data is totally gone. It learns to say, "I can't see the image, but the text tells me enough to make a safe guess," or vice versa.
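One common way to practice "one part of the data is totally gone" is modality dropout plus a fusion step that tolerates a missing input. The sketch below is a hypothetical illustration of that pattern (names and the averaging fusion are my assumptions, not the paper's architecture):

```python
import random

def fuse(img_feat, txt_feat):
    """Average whichever modality features are available,
    instead of crashing when one is None (missing)."""
    available = [f for f in (img_feat, txt_feat) if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    dim = len(available[0])
    return [sum(f[i] for f in available) / len(available) for i in range(dim)]

def modality_dropout(img_feat, txt_feat, p=0.25):
    """With probability p, drop one modality entirely during training,
    so the model rehearses the 'X-ray but no report' scenario."""
    r = random.random()
    if r < p:
        return None, txt_feat      # pretend the image is missing
    if r < 2 * p:
        return img_feat, None      # pretend the report is missing
    return img_feat, txt_feat
```

Because the fusion step already handles a `None` modality at training time, the deployed model degrades gracefully, rather than guessing wildly, when a real-world record arrives with only an image or only a report.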

The Results: Why Does This Matter?

The paper tested this new "survival-trained" AI against the old "classroom-trained" AI.

  • In the classroom (same hospital): Both did well.
  • In the real world (different hospital, bad equipment, messy notes): The old AI failed. The new Robust-MMR AI kept its cool.
    • It got 3.8% more questions right on average when moving to new hospitals.
    • When the images were noisy or the text was cut off, the old AI's performance dropped significantly, while the new AI stayed strong.
    • In a "retrieval" test (finding the right report for an image), the old AI got lost in the noise, while the new AI found the right answer almost immediately.

The Bottom Line

This paper argues that we shouldn't just teach AI to be smart; we need to teach it to be tough.

By intentionally making the training data messy, incomplete, and different during the learning phase, the AI learns to ignore the "noise" of the real world (different machines, different doctors) and focus on the "signal" (the actual medical truth). This makes the AI much safer and more reliable when it's finally deployed in a real hospital, where things are rarely perfect.
