MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

This paper proposes MoD-DPO, a Modality-Decoupled Direct Preference Optimization framework that mitigates cross-modal hallucinations in omni-modal LLMs. By enforcing modality-specific invariance and sensitivity through regularization and language-prior debiasing, it significantly improves perception accuracy and hallucination resistance.

Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

Published 2026-03-04

Imagine you have a brilliant, super-smart assistant named "Omni." Omni can see videos, hear audio, and read text all at the same time. It's like a detective who can look at a crime scene photo, listen to the 911 call, and read the police report simultaneously.

But Omni has a weird flaw: It sometimes hallucinates.

If you show Omni a video of a silent cat and ask, "Is the cat meowing?", Omni might say, "Yes, I hear a loud meow!" Why? Because in its training data, cats and meowing are often paired together. Its brain is so used to the idea of a cat meowing that it ignores the fact that the video is actually silent. It's relying too much on its "textbook knowledge" and not enough on what's actually happening in front of its eyes and ears.

This paper introduces a new training method called MoD-DPO to fix this. Think of it as a specialized "reality check" course for Omni.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Spurious Correlation" Trap

Imagine you are taking a test. The question is about a picture of a beach.

  • The Wrong Way: You answer based on what you expect to be there (sand, waves, seagulls) because you've seen a million beach photos before. You ignore the fact that the photo is actually black and white and has no water.
  • The Hallucination: Omni does this. It hears a dog bark in the audio, so when you ask about the video, it "sees" a dog, even if the video shows a cat. It's mixing up the senses.

2. The Solution: "Modality Decoupling" (The "Blindfold" Exercise)

The authors created a training game called MoD-DPO. The goal is to teach Omni to trust the right sense for the right question.

They use two main tricks, which we can call "The Blindfold" and "The Noise Machine."
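MoD-DPO builds on Direct Preference Optimization (DPO), which trains the model to prefer a "chosen" answer over a "rejected" one relative to a frozen reference model. Before looking at the two tricks, here is a minimal sketch of the standard DPO loss in plain Python. The log-probability arguments are made-up scalars standing in for real model outputs, not calls to an actual model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the (summed) log-probability of the chosen or
    rejected answer under the policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the chosen answer is preferred.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy favors the chosen answer more strongly.
easy = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy clearly prefers chosen
hard = dpo_loss(-5.0, -1.0, -2.0, -2.0)  # policy prefers rejected
```

When the policy has no preference at all (margin of zero), the loss sits at log 2; it falls toward zero as the chosen answer pulls ahead.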

Trick A: The Blindfold (Invariance)

  • The Scenario: You ask Omni, "What do you see in this video?"
  • The Training: The researchers take the audio track of the video and replace it with random noise or silence (corrupting the irrelevant sense).
  • The Lesson: Omni is punished if its answer changes just because the audio got weird. It learns: "Wait, I'm being asked about the video. The audio doesn't matter. I need to stay focused on the picture, even if the sound is garbage."
  • Real-life analogy: Imagine a chef tasting a soup. If someone starts playing loud jazz music in the kitchen, the chef shouldn't suddenly think the soup tastes like jazz. The chef must remain "invariant" (unchanged) to the irrelevant noise.
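The paper's exact invariance regularizer isn't reproduced in this post, but the idea above can be sketched as a toy penalty: score the same answer once with the clean input and once with the irrelevant modality corrupted, and punish any shift. The function name and quadratic form here are illustrative assumptions:

```python
def invariance_penalty(logp_clean, logp_corrupted_irrelevant):
    """Penalize any shift in the answer's log-probability when the
    modality NOT being asked about (e.g. the audio track, for a
    question about the video) is replaced with noise or silence.
    The penalty is zero exactly when the answer is unchanged.
    """
    return (logp_clean - logp_corrupted_irrelevant) ** 2
```

A chef-like model scores zero here no matter how loud the jazz gets; a model that lets the garbage audio sway its visual answer pays a growing cost.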

Trick B: The Noise Machine (Sensitivity)

  • The Scenario: You ask Omni, "What do you hear in this video?"
  • The Training: This time, they keep the audio but scramble the video (corrupting the relevant sense).
  • The Lesson: Omni is punished if it doesn't notice that the video is broken. It learns: "If the video is messed up, my answer about what I see should change drastically. I need to be sensitive to the quality of the input I'm being asked about."
  • Real-life analogy: If you are trying to listen to a friend in a quiet room, and someone suddenly starts screaming, you should immediately realize the environment has changed. You shouldn't keep pretending everything is calm.
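Sensitivity is the mirror image, and can be sketched the same way: when the relevant modality is corrupted, the model's confidence in its original answer should drop, so we penalize it only if the drop is too small. The hinge form and the `margin` parameter are illustrative assumptions, not the paper's published loss:

```python
def sensitivity_penalty(logp_clean, logp_corrupted_relevant, margin=1.0):
    """Penalize the model if its confidence does NOT fall by at least
    `margin` nats when the modality being asked about is corrupted
    (e.g. the video is scrambled for a "what do you see?" question).
    """
    drop = logp_clean - logp_corrupted_relevant  # should be large
    return max(0.0, margin - drop)
```

A model that barely reacts to a scrambled video (a tiny drop) gets penalized; one that notices the broken input (a big drop) pays nothing.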

3. The "Language Bias" Penalty

Omni is also a language model, meaning it loves to guess based on words. If you ask, "Did the dog bark?", it might just say "Yes" because "dog" and "bark" go together in its dictionary.

The authors added a "Language Prior Debiasing" penalty.

  • The Analogy: Imagine a student who always guesses the answer based on the first word of the question. The teacher says, "If you answer based only on the words without looking at the chart, you get a zero."
  • The Result: This forces Omni to stop guessing based on text habits and actually look at the video or listen to the audio to find the truth.
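One common way to implement this kind of debiasing is to score each answer a second time with the video and audio removed entirely ("text-only"), and subtract that blind score from the preference margin, so the model only gets credit for evidence the media actually provides. Whether MoD-DPO uses exactly this form isn't stated in this post; the sketch below, including the `alpha` weight, is an illustrative assumption:

```python
def debiased_margin(logp_chosen, logp_rejected,
                    text_logp_chosen, text_logp_rejected, alpha=0.5):
    """Subtract the text-only ("blind") log-probabilities from the
    preference margin, so credit only flows to answers grounded in
    the actual video/audio rather than in word co-occurrence habits.
    """
    grounded_chosen = logp_chosen - alpha * text_logp_chosen
    grounded_rejected = logp_rejected - alpha * text_logp_rejected
    return grounded_chosen - grounded_rejected
```

Notice what happens to a hallucination like "dog + bark": if the rejected answer is already highly probable from the words alone, subtracting its blind score widens the margin against it, even when the raw scores are tied.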

4. The Data: Creating the "Trick Questions"

To teach this, they didn't just use normal videos. They built a massive dataset of "trick questions" (18,000+ of them).

  • They took a video of a dog barking.
  • They paired it with a video of a silent cat.
  • They asked: "Is the dog barking?"
  • Correct Answer: "No, that's a cat."
  • Wrong Answer (Hallucination): "Yes, I hear a dog."

By training on thousands of these mismatched scenarios, Omni learns to stop guessing and start paying attention.
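The dataset recipe above can be sketched as a tiny assembly function. The paper's actual pipeline for its 18,000+ examples is not detailed here, so the labels, question template, and field names below are all hypothetical:

```python
def make_trick_example(video, audio, probe):
    """Assemble one preference pair from deliberately mismatched media.

    `video` and `audio` are short labels for what each track actually
    contains; `probe` is the event the question asks about.
    """
    question = f"Is there {probe} in this clip?"
    truly_present = probe in (video, audio)
    if truly_present:
        chosen, rejected = "Yes.", "No."
    else:
        # The grounded answer describes what is really there.
        chosen = (f"No. The video shows {video} "
                  f"and the audio contains {audio}.")
        rejected = "Yes."  # the co-occurrence-driven hallucination
    return {"question": question, "chosen": chosen, "rejected": rejected}

ex = make_trick_example(video="a silent cat", audio="silence",
                        probe="a dog barking")
```

The key design choice is that the "rejected" answer is exactly the one a text-habit guesser would give, so every training pair pushes directly against the spurious correlation.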

The Result

After this training, Omni becomes much more reliable.

  • Before: "I see a dog barking!" (Even though it was a cat).
  • After: "No, I see a cat, and it is silent."

Summary

MoD-DPO is like a rigorous training camp for a multi-sensory AI. It teaches the AI to:

  1. Ignore distractions: If you ask about sight, don't let sound confuse you.
  2. Notice changes: If the thing you're looking at changes, admit it.
  3. Stop guessing: Don't rely on your textbook knowledge; look at the evidence.

The result is an AI that is less likely to lie about what it sees and hears, making it a much more trustworthy assistant for the real world.