Imagine you have a silent movie. It's beautiful, but something is missing: the sound. You want to add music, footsteps, or dialogue that fits perfectly with what's happening on screen. This is the job of Video-to-Audio (V2A) generation.
Computers have been getting better at this for a long time, but the results often sound robotic, out of sync, or just "off." The model might make a dog bark when the video shows a cat, or the sound might start a second too late.
This paper introduces V2A-DPO, a new "tuning" method that teaches these AI models to listen to human taste and create audio that feels natural, immersive, and high-quality.
Here is how it works, broken down into simple concepts:
1. The Problem: The "Blind" Composer
Imagine an AI trying to compose a soundtrack for a movie.
- Old AI: It knows the rules of music theory (technical stuff), but it doesn't know if the music feels right. It might play a sad song during a comedy scene because it got the technical details right but missed the "vibe."
- The Goal: We want the AI to stop just following rules and start understanding human preference. Does this sound make me feel like I'm really there? Is it clear? Does it match the action perfectly?
2. The Solution: The "Taste Tester" (AudioScore)
To teach the AI, you need a teacher. The authors created a system called AudioScore. Think of this as a super-smart Taste Tester or a Film Critic.
Instead of just asking humans to listen to thousands of clips (which takes forever and costs a lot of money), they built a robot critic. This critic checks three things:
- The Script Check: Does the sound match the video? (If a car crashes, does it sound like a crash?)
- The Timing Check: Is the sound perfectly synced? (Does the drum hit exactly when the drummer's stick hits the drum?)
- The "Vibe" Check: Is the sound clear, rich, and immersive? Does it sound like a professional recording or a cheap phone call?
This "Taste Tester" gives every generated sound a score, deciding if it's "Good," "Medium," or "Bad."
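To make the idea concrete, here is a minimal sketch of how three sub-scores could be combined into one label. The function name, weights, and thresholds are illustrative assumptions for this article, not the paper's actual AudioScore implementation.

```python
# Hypothetical sketch: averaging three 0-1 sub-scores (semantic match,
# timing/sync, and audio quality) and bucketing the result.
# The equal weighting and cutoffs here are assumptions, not the paper's.

def audio_score(semantic: float, sync: float, quality: float) -> str:
    """Average three 0-1 sub-scores and bucket the result into a label."""
    overall = (semantic + sync + quality) / 3
    if overall >= 0.7:
        return "Good"
    if overall >= 0.4:
        return "Medium"
    return "Bad"

# A clip that matches the video, is well-synced, and sounds clean:
print(audio_score(0.9, 0.8, 0.85))  # → Good
# A clip with roughly the right content but poor timing and muddy audio:
print(audio_score(0.8, 0.2, 0.1))   # → Bad
```

The point is simply that a single automatic judge can turn several "checks" into one preference label, cheaply and at scale.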
3. The Training: The "Taste-Test Tournament"
Now that they have a way to score the sounds, they need to train the AI. They use a method called DPO (Direct Preference Optimization).
Imagine a cooking competition:
- The AI chef makes two versions of a dish (audio) for the same video.
- The "Taste Tester" (AudioScore) judges them.
- The AI learns: "Okay, Version A was 'Good' and Version B was 'Bad.' Next time, I'll try to make Version A."
The paper's innovation here is scale. They used their robot critic to generate 48,000 pairs of "Good vs. Bad" examples automatically. This is like having the AI practice on 48,000 cooking competitions in a single afternoon, learning exactly what humans prefer without needing a human to taste every single one.
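For readers who want the math behind the "cooking competition," here is a toy sketch of the standard DPO loss for one "Good vs. Bad" pair. The variable names are illustrative, and the real method operates on a generative audio model's likelihoods, not these scalar stand-ins.

```python
import math

# Hedged sketch of the standard DPO objective for a single preference pair.
# logp_* are the tuned model's log-likelihoods for the preferred ("good")
# and rejected ("bad") audio; ref_logp_* are the frozen reference model's.

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Pushes the model to raise its likelihood of the preferred audio
    (relative to the reference model) above that of the rejected audio.
    """
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    # -log(sigmoid(beta * margin)): small when the margin is large.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the tuned model prefers the good sample more strongly than the
# reference did, the margin is positive and the loss shrinks:
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))  # margin = 3.0, modest loss
print(dpo_loss(-9.0, -5.0, -7.0, -6.0))  # margin = -3.0, larger loss
```

No reward model is queried during training itself; the "Taste Tester" is only needed once, up front, to label which of the two samples in each pair is the winner.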
4. The Secret Sauce: "Curriculum Learning" (Learning by Difficulty)
This is the cleverest part. If you try to teach a student by giving them the hardest math problems first, they will get frustrated and quit. You start with easy problems and move to hard ones.
The authors applied this to the AI:
- Stage 1 (Easy): The AI practices on pairs where the difference between "Good" and "Bad" is obvious. (e.g., One sound is a clear bell, the other is static noise). The AI learns the basics quickly.
- Stage 2 (Hard): Once the AI is good at the basics, it moves to the "Hard Mode" pairs. These are subtle differences (e.g., two sounds that are both good, but one is slightly more immersive).
- The Human Touch: They also added a small batch of real human-annotated data for the final stage, specifically to teach the AI about aesthetic appeal—that "magical" feeling that makes a sound perfect.
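The staging above can be sketched in a few lines: sort pairs by how far apart their scores are, then train on the wide-gap (easy) pairs before the narrow-gap (hard) ones. The pair format and the gap threshold here are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative curriculum split: "easy" pairs have an obvious quality gap,
# "hard" pairs a subtle one. Scores mimic the AudioScore labels described
# above; the 0.4 threshold is an assumed value.

pairs = [
    {"good_score": 0.90, "bad_score": 0.10},  # clear bell vs. static noise
    {"good_score": 0.80, "bad_score": 0.70},  # both decent, one slightly better
    {"good_score": 0.85, "bad_score": 0.30},
]

def split_by_difficulty(pairs, easy_gap=0.4):
    """Easy pairs: large score gap. Hard pairs: small score gap."""
    easy = [p for p in pairs if p["good_score"] - p["bad_score"] >= easy_gap]
    hard = [p for p in pairs if p["good_score"] - p["bad_score"] < easy_gap]
    return easy, hard

easy, hard = split_by_difficulty(pairs)
# Stage 1 trains on `easy`; Stage 2 moves to `hard` (and, per the paper,
# mixes in a small set of human-annotated pairs for aesthetics).
print(len(easy), len(hard))  # → 2 1
```

As with a human student, the model first locks in the coarse distinctions, then spends its later training budget on the subtle ones that actually separate "good" from "great."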
5. The Results: From "Okay" to "Oscar-Winning"
They tested this new method on two existing AI models (named Frieren and MMAudio).
- Before: The models were decent but sometimes missed the mark on timing or quality.
- After (with V2A-DPO): The models became significantly better. They synced perfectly with the video, sounded clearer, and felt much more "real."
- The Comparison: They beat other top-tier models and even models trained with older, more complex methods.
The Big Picture
Think of V2A-DPO as taking a talented but inexperienced musician and giving them:
- A perfect ear (AudioScore) to judge their own playing.
- A giant library of practice sessions (48k preference pairs) to learn from.
- A smart teacher (Curriculum Learning) who starts them on easy songs and gradually moves them to complex symphonies.
The result? A computer that doesn't just generate noise, but creates soundtracks that make you believe you are really in the movie.