Imagine you are trying to solve a mystery in a room. A regular AI might look at the clues (the video) and guess the answer. But sometimes, the most important clue is a sound—a creaking floorboard, a whisper, or a specific song playing in the background.
Current "Omnimodal" AIs (models that can see and hear) are like detectives who have been trained to look at photos, but when you hand them a video with sound, they get confused. They often ignore the audio or, worse, the sound makes them forget what they saw. They end up guessing wrong because they aren't using their ears and eyes together effectively.
OmniVideo-R1 is a new training method designed to fix this. It teaches the AI to become a true "detective" that uses both its eyes and ears to solve problems. Here is how it works, broken down into simple concepts:
1. The Problem: The "Distracted Detective"
Imagine a detective who is so used to looking at crime scene photos that when you play them a recording of the crime, they get distracted. They might say, "I see a red car," but miss the fact that the car's engine was making a specific knocking sound that identifies the exact model.
The paper shows that even the smartest current AIs (like Qwen3-Omni) often perform worse when they try to use audio and video together compared to just using video. They have a "bias" where they ignore the sound.
2. The Solution: Two-Step Training
The authors created a two-step training camp to teach the AI how to think properly.
Step 1: The "Highlighter" Game (Query-Intensive Grounding)
The Analogy: Imagine you are reading a long, boring book and someone asks you a specific question. Before you answer, you have to highlight the exact sentences in the book that prove your answer.
How the AI learns:
Instead of just giving the AI the answer, the researchers taught it to pause and say, "Wait, let me find the part of the video where the answer is hiding."
- The AI learns to point to specific moments in the video (e.g., "Between 0:10 and 0:15, the person drops the cup").
- It then writes a short caption for that moment.
- The Trick: They didn't need humans to do this highlighting. They used a "self-check" system. If the AI highlights a part of the video and the description it writes matches what actually happened in that clip, it gets a reward. If it highlights the wrong part, it learns to try again. This teaches the AI to look for evidence before guessing.
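The self-check loop above can be sketched in a few lines. This is an illustrative toy, not the paper's actual implementation: the function names (`grounding_reward`, `verify_fn`) and the 0-or-1 reward shape are assumptions made for clarity.

```python
# Hypothetical sketch of the "self-check" grounding reward described above.
# All names and the exact reward shape are illustrative, not from the paper.

def grounding_reward(predicted_span, predicted_caption, verify_fn, threshold=0.5):
    """Reward the model only if its own caption for the highlighted clip
    matches what actually happens in that clip.

    predicted_span:    (start_sec, end_sec) the model claims contains the answer
    predicted_caption: the model's short description of that moment
    verify_fn:         a checker (e.g., a separate model pass) returning a
                       0..1 agreement score between caption and clip content
    """
    start, end = predicted_span
    if start < 0 or end <= start:
        return 0.0  # malformed highlight: no reward, the model must try again
    agreement = verify_fn(predicted_span, predicted_caption)
    # Reward evidence-finding: only a caption that matches the clip pays off.
    return 1.0 if agreement >= threshold else 0.0
```

Because the reward comes from checking the model's caption against the clip itself, no human annotator has to mark the "correct" highlight in advance.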
Step 2: The "Blindfold Test" (Modality-Attentive Fusion)
The Analogy: Imagine a chef judging a soup. Smelling it alone tells them something; tasting it with their nose pinched tells them something else. But they only get the full flavor when smell and taste work together.
How the AI learns:
The researchers played a game with the AI:
- Scenario A: Show the AI the video with sound.
- Scenario B: Show the AI the video without sound (silent).
- Scenario C: Show the AI only the sound (no video).
The AI gets a special bonus point only if it solves the mystery better in Scenario A (Video + Sound) than in Scenario B or C. This forces the AI to realize: "Hey, I can't solve this just by looking! I need to listen to the sound to get the full picture." It forces the two senses to work as a team rather than fighting each other.
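The three-scenario game boils down to a simple comparison rule, which can be sketched as follows. This is a minimal illustration under assumed names (`fusion_bonus` and its score arguments are not from the paper): the bonus pays out only when the combined-modality answer beats both single-modality answers.

```python
# Hypothetical sketch of the modality-comparison bonus described above.
# Function and argument names are illustrative, not from the paper.

def fusion_bonus(score_av, score_video_only, score_audio_only, bonus=1.0):
    """Pay a bonus only when the audio+video answer (Scenario A) is
    strictly better than both the silent-video answer (Scenario B) and
    the audio-only answer (Scenario C).

    Each score is any quality measure of the model's answer in that
    scenario (e.g., 1.0 for correct, 0.0 for wrong).
    """
    if score_av > score_video_only and score_av > score_audio_only:
        return bonus  # the two senses genuinely helped each other
    return 0.0        # no credit for leaning on one sense alone
```

Because the bonus is strict (A must beat B and C), the model cannot earn it by ignoring the audio track or by letting the audio drown out what it saw.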
3. The Result: A Super-Detective
After this training, the AI (OmniVideo-R1) became much better at:
- Finding the right clues: It knows exactly when to look and when to listen.
- Combining senses: It understands that a "scream" in the audio changes the meaning of a "running person" in the video.
- Not forgetting how to see: Even though it learned to listen, it didn't forget how to watch. It actually got better at watching videos too, because it learned to focus on the most important parts.
Why This Matters
Think of the world as a movie, not a silent film. Real life has sound and sight happening at the same time. OmniVideo-R1 is a breakthrough because it teaches AI to stop treating sound as an "add-on" and start treating it as a critical partner in understanding what is happening.
In short: OmniVideo-R1 teaches AI to stop guessing and start investigating, using both its eyes and ears to find the truth.