MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

This paper proposes MoD-DPO, a Modality-Decoupled Direct Preference Optimization framework that mitigates cross-modal hallucinations in omni-modal LLMs. By enforcing modality-specific invariance and sensitivity through regularization and language-prior debiasing, it significantly improves perception accuracy and hallucination resistance.

Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

Published 2026-03-04

Imagine you have a brilliant, super-smart assistant named "Omni." Omni can see videos, hear audio, and read text all at the same time. It's like a detective who can look at a crime scene photo, listen to the 911 call, and read the police report simultaneously.

But Omni has a weird flaw: It sometimes hallucinates.

If you show Omni a video of a silent cat and ask, "Is the cat meowing?", Omni might say, "Yes, I hear a loud meow!" Why? Because in its training data, cats and meowing are often paired together. Its brain is so used to the idea of a cat meowing that it ignores the fact that the video is actually silent. It's relying too much on its "textbook knowledge" and not enough on what's actually happening in front of its eyes and ears.

This paper introduces a new training method called MoD-DPO to fix this. Think of it as a specialized "reality check" course for Omni.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Spurious Correlation" Trap

Imagine you are taking a test. The question is about a picture of a beach.

  • The Wrong Way: You answer based on what you expect to be there (sand, waves, seagulls) because you've seen a million beach photos before. You ignore the fact that the photo is actually black and white and has no water.
  • The Hallucination: Omni does this. It hears a dog bark in the audio, so when you ask about the video, it "sees" a dog, even if the video shows a cat. It's mixing up the senses.

2. The Solution: "Modality Decoupling" (The "Blindfold" Exercise)

The authors created a training game called MoD-DPO. The goal is to teach Omni to trust the right sense for the right question.

They use two main tricks, which we can call "The Blindfold" and "The Noise Machine."
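MoD-DPO builds on Direct Preference Optimization (DPO), which trains the model to prefer a "chosen" answer over a "rejected" one relative to a frozen reference model. Before looking at the two tricks, here is a minimal sketch of the standard DPO loss in plain Python. The log-probability arguments are made-up scalars standing in for real model outputs, not calls to an actual model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the (summed) log-probability of the chosen or
    rejected answer under the policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the chosen answer is preferred.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy favors the chosen answer more strongly.
easy = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy clearly prefers chosen
hard = dpo_loss(-5.0, -1.0, -2.0, -2.0)  # policy prefers rejected
```

When the policy has no preference at all (margin of zero), the loss sits at log 2; it falls toward zero as the chosen answer pulls ahead.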

Trick A: The Blindfold (Invariance)

  • The Scenario: You ask Omni, "What do you see in this video?"
  • The Training: The researchers take the audio track of the video and replace it with random noise or silence (corrupting the irrelevant sense).
  • The Lesson: Omni is punished if its answer changes just because the audio got weird. It learns: "Wait, I'm being asked about the video. The audio doesn't matter. I need to stay focused on the picture, even if the sound is garbage."
  • Real-life analogy: Imagine a chef tasting a soup. If someone starts playing loud jazz music in the kitchen, the chef shouldn't suddenly think the soup tastes like jazz. The chef must remain "invariant" (unchanged) to the irrelevant noise.
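The paper's exact invariance regularizer isn't reproduced in this post, but the idea above can be sketched as a toy penalty: score the same answer once with the clean input and once with the irrelevant modality corrupted, and punish any shift. The function name and quadratic form here are illustrative assumptions:

```python
def invariance_penalty(logp_clean, logp_corrupted_irrelevant):
    """Penalize any shift in the answer's log-probability when the
    modality NOT being asked about (e.g. the audio track, for a
    question about the video) is replaced with noise or silence.
    The penalty is zero exactly when the answer is unchanged.
    """
    return (logp_clean - logp_corrupted_irrelevant) ** 2
```

A chef-like model scores zero here no matter how loud the jazz gets; a model that lets the garbage audio sway its visual answer pays a growing cost.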

Trick B: The Noise Machine (Sensitivity)

  • The Scenario: You ask Omni, "What do you hear in this video?"
  • The Training: This time, they keep the audio but scramble the video (corrupting the relevant sense).
  • The Lesson: Omni is punished if it doesn't notice that the video is broken. It learns: "If the video is messed up, my answer about what I see should change drastically. I need to be sensitive to the quality of the input I'm being asked about."
  • Real-life analogy: If you are trying to listen to a friend in a quiet room, and someone suddenly starts screaming, you should immediately realize the environment has changed. You shouldn't keep pretending everything is calm.
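Sensitivity is the mirror image, and can be sketched the same way: when the relevant modality is corrupted, the model's confidence in its original answer should drop, so we penalize it only if the drop is too small. The hinge form and the `margin` parameter are illustrative assumptions, not the paper's published loss:

```python
def sensitivity_penalty(logp_clean, logp_corrupted_relevant, margin=1.0):
    """Penalize the model if its confidence does NOT fall by at least
    `margin` nats when the modality being asked about is corrupted
    (e.g. the video is scrambled for a "what do you see?" question).
    """
    drop = logp_clean - logp_corrupted_relevant  # should be large
    return max(0.0, margin - drop)
```

A model that barely reacts to a scrambled video (a tiny drop) gets penalized; one that notices the broken input (a big drop) pays nothing.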

3. The "Language Bias" Penalty

Omni is also a language model, meaning it loves to guess based on words. If you ask, "Did the dog bark?", it might just say "Yes" because "dog" and "bark" go together in its dictionary.

The authors added a "Language Prior Debiasing" penalty.

  • The Analogy: Imagine a student who always guesses the answer based on the first word of the question. The teacher says, "If you answer based only on the words without looking at the chart, you get a zero."
  • The Result: This forces Omni to stop guessing based on text habits and actually look at the video or listen to the audio to find the truth.
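One common way to implement this kind of debiasing is to score each answer a second time with the video and audio removed entirely ("text-only"), and subtract that blind score from the preference margin, so the model only gets credit for evidence the media actually provides. Whether MoD-DPO uses exactly this form isn't stated in this post; the sketch below, including the `alpha` weight, is an illustrative assumption:

```python
def debiased_margin(logp_chosen, logp_rejected,
                    text_logp_chosen, text_logp_rejected, alpha=0.5):
    """Subtract the text-only ("blind") log-probabilities from the
    preference margin, so credit only flows to answers grounded in
    the actual video/audio rather than in word co-occurrence habits.
    """
    grounded_chosen = logp_chosen - alpha * text_logp_chosen
    grounded_rejected = logp_rejected - alpha * text_logp_rejected
    return grounded_chosen - grounded_rejected
```

Notice what happens to a hallucination like "dog + bark": if the rejected answer is already highly probable from the words alone, subtracting its blind score widens the margin against it, even when the raw scores are tied.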

4. The Data: Creating the "Trick Questions"

To teach this, they didn't just use normal videos. They built a massive dataset of "trick questions" (18,000+ of them).

  • They took a video of a dog barking.
  • They paired it with a video of a silent cat.
  • They asked: "Is the dog barking?"
  • Correct Answer: "No, that's a cat."
  • Wrong Answer (Hallucination): "Yes, I hear a dog."

By training on thousands of these mismatched scenarios, Omni learns to stop guessing and start paying attention.
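The dataset recipe above can be sketched as a tiny assembly function. The paper's actual pipeline for its 18,000+ examples is not detailed here, so the labels, question template, and field names below are all hypothetical:

```python
def make_trick_example(video, audio, probe):
    """Assemble one preference pair from deliberately mismatched media.

    `video` and `audio` are short labels for what each track actually
    contains; `probe` is the event the question asks about.
    """
    question = f"Is there {probe} in this clip?"
    truly_present = probe in (video, audio)
    if truly_present:
        chosen, rejected = "Yes.", "No."
    else:
        # The grounded answer describes what is really there.
        chosen = (f"No. The video shows {video} "
                  f"and the audio contains {audio}.")
        rejected = "Yes."  # the co-occurrence-driven hallucination
    return {"question": question, "chosen": chosen, "rejected": rejected}

ex = make_trick_example(video="a silent cat", audio="silence",
                        probe="a dog barking")
```

The key design choice is that the "rejected" answer is exactly the one a text-habit guesser would give, so every training pair pushes directly against the spurious correlation.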

The Result

After this training, Omni becomes much more reliable.

  • Before: "I see a dog barking!" (Even though it was a cat).
  • After: "No, I see a cat, and it is silent."

Summary

MoD-DPO is like a rigorous training camp for a multi-sensory AI. It teaches the AI to:

  1. Ignore distractions: If you ask about sight, don't let sound confuse you.
  2. Notice changes: If the thing you're looking at changes, admit it.
  3. Stop guessing: Don't rely on your textbook knowledge; look at the evidence.

The result is an AI that is less likely to lie about what it sees and hears, making it a much more trustworthy assistant for the real world.