Imagine you are a director trying to make a movie. In the past, you had to hire two separate teams: one to film the video and another to record the sound. Often, these teams didn't talk to each other well. The result? A scene where a dog barks, but the sound comes out a second too late, or a car engine roars when the car is actually parked.
Enter JavisDiT++, a new AI model that acts like a super-smart, single-minded director who can film and record sound simultaneously, perfectly in sync, just by reading a simple script.
Here is how the paper pulls off its magic, broken down into everyday concepts:
1. The Problem: The "Bad Orchestra"
Current open-source AI models are like an orchestra where the violinists and drummers are playing from different sheet music. They might be in the same room, but they aren't listening to each other.
- The Gap: Big tech labs have amazing "conductors" (like Google's Veo3) that make near-perfect movies. But open-source models (the ones anyone can use and build on) usually produce videos where the audio and video feel "out of step," or just look a bit blurry and low-quality.
2. The Solution: A Unified Studio
The authors built JavisDiT++, which treats video and audio not as two separate things, but as one big, connected puzzle. They used three main tricks to fix the orchestra:
Trick #1: The "Specialized Chefs" (MS-MoE)
Imagine a kitchen where one chef tries to cook both a delicate soufflé (video) and a spicy curry (audio) at the same time using the same set of tools. The results are usually mediocre because the techniques clash.
JavisDiT++ changes the kitchen layout. They have one big table where the ingredients (data) from the video and audio can chat and mix together. But, when it comes time to actually cook (process the data), they send the video ingredients to a Video Chef and the audio ingredients to an Audio Chef.
- Why it works: The chefs can still talk to each other at the table to coordinate (e.g., "I'm chopping onions now, so you should hear a crunching sound"), but they use their own specialized tools to ensure the final dish tastes perfect. This makes the video look sharper and the sound clearer.
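For readers who want to peek under the hood, here is a rough Python sketch of the "specialized chefs" idea. It is not the paper's actual code: the class names, dimensions, and layer choices are invented for illustration. The point it shows is the split: one shared attention layer where video and audio tokens mix (the table), followed by separate feed-forward "experts" for each modality (the chefs).

```python
# Illustrative sketch only: joint attention + modality-specific experts.
import torch
import torch.nn as nn

class ModalitySpecializedBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Shared attention: every token can look at every other token,
        # whether it is a video patch or an audio frame.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Two specialized "chefs": separate feed-forward experts per modality.
        self.video_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens):
        # 1) Sit at the same table: concatenate and run joint self-attention.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        # 2) Cook separately: route each modality to its own expert.
        n_v = video_tokens.shape[1]
        v, a = x[:, :n_v], x[:, n_v:]
        v = v + self.video_expert(self.norm2(v))
        a = a + self.audio_expert(self.norm2(a))
        return v, a

# Example: a batch of 2 clips, 100 video tokens and 60 audio tokens each.
block = ModalitySpecializedBlock()
v, a = block(torch.randn(2, 100, 512), torch.randn(2, 60, 512))
print(v.shape, a.shape)  # torch.Size([2, 100, 512]) torch.Size([2, 60, 512])
```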
Trick #2: The "Shared Metronome" (TA-RoPE)
In music, if the drummer and the singer don't follow the same beat, the song falls apart. In AI, the "beat" is the timeline.
- The Old Way: Previous models tried to guess when the audio should match the video, often leading to a slight delay (like a bad karaoke machine).
- The New Way: JavisDiT++ gives the video and audio tokens (the digital building blocks) a shared metronome. They are stamped with the exact same time ID. If the video shows a bird flapping its wings at "Time 1," the sound of the wing flap is forced to happen at "Time 1" as well.
- The Result: Perfect synchronization. When a car crashes, the sound of crunching metal lands exactly on impact, not a split second later.
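For the curious, here is a rough sketch of what a "shared metronome" means in practice. Rotary position embeddings (RoPE) turn a position number into a rotation angle; the trick is to take that position from the same real-time clock for both modalities, so a video frame and an audio token from the same instant get the same angle. The frame rates and dimensions below are made up for illustration, not taken from the paper.

```python
# Illustrative sketch only: time-aligned rotary position IDs.
import torch

def rope_angles(time_ids, dim=64, base=10000.0):
    # Standard RoPE frequency schedule, applied to time in seconds
    # rather than to a per-modality token index.
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(time_ids, freqs)  # shape: (num_tokens, dim/2)

video_fps = 8               # e.g. 8 video frames per second (assumed)
audio_tokens_per_sec = 25   # e.g. 25 audio latent frames per second (assumed)
duration = 2.0              # seconds

# Both modalities are stamped with timestamps from the SAME clock.
video_time = torch.arange(int(duration * video_fps)) / video_fps
audio_time = torch.arange(int(duration * audio_tokens_per_sec)) / audio_tokens_per_sec

video_angles = rope_angles(video_time)
audio_angles = rope_angles(audio_time)

# A video frame at t = 1.0 s and an audio token at t = 1.0 s share identical
# angles, so attention naturally treats them as simultaneous events.
print(torch.allclose(video_angles[8], audio_angles[25]))  # True
```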
Trick #3: The "Human Taste Tester" (AV-DPO)
Even with good chefs and a metronome, the AI might still make weird choices, like a dog barking in a library.
- The Fix: The team taught the AI to understand human preference. They created a system where the AI generates two versions of a video, and a "judge" (a set of reward models) picks the one that looks and sounds better.
- The Learning: The AI learns from these wins and losses. It's like a student taking a test, seeing which answers got marked "correct" by a teacher, and adjusting their brain to get more "A's" next time. This ensures the final video isn't just technically correct, but actually pleasing to human eyes and ears.
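For the technically inclined, the "learning from wins and losses" step builds on Direct Preference Optimization (DPO). The sketch below shows the generic DPO loss on a toy batch of preference pairs; the paper adapts this idea to audio-video generation, so treat the exact form, the beta value, and the numbers here as illustrative rather than the authors' implementation.

```python
# Illustrative sketch only: the generic DPO preference loss.
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # How much more the current model prefers each clip than a frozen
    # reference copy of itself does.
    win_margin = logp_win - ref_logp_win
    lose_margin = logp_lose - ref_logp_lose
    # Push the model to favor the judge's "winner" over the "loser".
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()

# Toy example: log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(
    logp_win=torch.tensor([-5.0, -4.8, -6.1, -5.5]),
    logp_lose=torch.tensor([-5.2, -5.0, -6.0, -5.9]),
    ref_logp_win=torch.tensor([-5.1, -4.9, -6.0, -5.6]),
    ref_logp_lose=torch.tensor([-5.1, -4.9, -6.1, -5.8]),
)
print(loss)  # a scalar; lower means the model agrees more with the judge
```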
3. The Magic Ingredients
What makes this so impressive?
- Efficiency: They didn't need a supercomputer farm or a billion dollars. They built this on top of an existing video model (Wan2.1) and only used about 1 million examples to train it. That's like teaching a child to speak a new language with a small, high-quality book instead of a library of bad textbooks.
- Speed: Because they didn't build two separate models and try to glue them together, it runs fast. It's like having a single car with two engines working in harmony, rather than two cars tied together.
The Bottom Line
JavisDiT++ is a breakthrough because it proves you don't need massive, expensive systems to create high-quality, synchronized audio-video. By using a smarter kitchen layout (Specialized Chefs), a shared metronome (TA-RoPE), and a human taste tester (AV-DPO), they created an open-source model that rivals the best commercial tools, making it possible for anyone to generate realistic, sound-synced movies from a simple text prompt.
In short: It's the difference between a disjointed, out-of-sync amateur video and a professional movie where the sound and picture dance together perfectly.