JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Imagine you are directing a movie. In the old days, you'd film the scene first, then hire a sound editor to come in later and try to match the sound effects to the action. Sometimes the door slam happens a split second too late, or the dog barks before it even opens its mouth. It feels "off."

JavisDiT is like a new kind of director who doesn't film first and add the sound afterwards. Instead, they dream the entire scene into existence all at once, so that sound and movement are matched from the very first frame.

Here is a simple breakdown of how this paper achieves that magic:

1. The Core Idea: The "One-Brain, Two-Hands" Director

Most AI systems today are like a two-person team: one person makes the video, and another person tries to guess the sound later. They often miss the beat.

JavisDiT is a single "brain" (a Diffusion Transformer) that has two hands working in perfect unison. When it thinks of a "robot fighting a dog," it doesn't just draw the robot; it simultaneously "hears" the mechanical whirring and the dog's squeak. It generates the picture and the sound together, so they are naturally locked in sync.

2. The Secret Sauce: The "Spatio-Temporal GPS"

The biggest challenge is synchronization. If a car drives by on the left, the engine noise should come from the left and start exactly when the car appears.

The paper introduces a special module called HiST-Sypo. Think of this as a GPS and a Script Supervisor rolled into one.

The GPS (Spatial): It tells the AI where things are happening. "The dog is in the bottom right corner."
The Script Supervisor (Temporal): It tells the AI when things happen. "The dog starts barking at second 2 and stops at second 4."

Instead of just guessing, the AI uses this "GPS" to guide the generation. It's like a conductor waving a baton, telling the visual orchestra and the audio orchestra exactly when to play their notes so they never clash.

3. The New Playground: "JavisBench"

To test whether their new director was actually good, the authors needed a harder test track than the existing ones. Imagine testing a race car driver only on a straight, empty road. They might look fast, but you would never learn whether they can handle a curve.

Existing tests only had simple videos (like a person dancing or a bird chirping). Real life is messy: a busy street with cars honking, people talking, and dogs barking all at once.

So, they built JavisBench.

What is it? A library of over 10,000 video clips with text descriptions, more than half of which show complex, multi-event scenes.
Why is it special? It includes "chaos." It has scenes with multiple sounds happening at the same time (simultaneous events) and sounds coming from off-screen. It's the "final exam" for AI video generation.

4. The New Ruler: "JavisScore"

How do you measure if the sound and video are truly synced? The old rulers (metrics) were like using a stopwatch to time a dance; they were too clumsy for complex moves.

The authors invented JavisScore.

How it works: Instead of just checking if a sound starts, it breaks the video into tiny 2-second chunks. It asks, "Does the sound match the picture right now?"
The Analogy: Imagine a judge at a talent show. The old judges only watched for the obvious beats, such as when a sound starts. JavisScore watches every short stretch of the performance and grades each one by its weakest moments, so a brief slip cannot hide behind an otherwise well-synced clip.

5. The Results

When they put JavisDiT to the test:

Quality: The videos look sharp, and the sounds are clear.
Sync: The sounds track the actions more closely than in the systems it was compared against. If a glass breaks, you hear it when it shatters.
Complexity: It handles the messy, multi-sound scenes (like the busy street) better than the earlier systems it was tested against.

Summary

JavisDiT is a breakthrough because it stops treating video and audio as two separate problems. By using a "GPS-like" system to map out exactly where and when things happen, it creates a unified, realistic experience where the sound and vision feel like they were born together, not stitched together later.

They also gave the AI community a harder test track (JavisBench) and a better ruler (JavisScore), so that future models can be judged on realistic, complex scenes.

1. Problem Statement

The paper addresses the challenge of Joint Audio-Video Generation (JAVG), specifically the task of generating high-quality, synchronized audio and video content simultaneously from a single open-ended text prompt.

Current approaches suffer from two main limitations:

Quality vs. Synchronization Trade-off: Existing methods often rely on cascaded pipelines (generating audio then video, or vice versa), leading to error accumulation and poor synchronization. End-to-end joint models often lack the architectural strength to model fine-grained spatio-temporal alignment, resulting in audio that does not match the visual events (e.g., a dog barking out of sync with its mouth movement).
Evaluation Gaps: Existing benchmarks (e.g., AIST++, Landscape) are limited in diversity, focusing on simple scenarios (like dancing or nature sounds) and lacking complex, multi-event real-world scenarios. Furthermore, current metrics (like AV-Align) rely on optical flow and onset detection, which fail in complex scenes with subtle movements or multiple simultaneous sound sources.

2. Methodology: JavisDiT

The authors propose JavisDiT, a novel architecture based on the Diffusion Transformer (DiT) framework, designed to generate synchronized audio-video pairs in a unified manner.

A. Core Architecture

Backbone: Utilizes a shared AV-DiT architecture where audio and video branches share blocks to facilitate information exchange.
Attention Mechanisms:
- Spatio-Temporal Self-Attention (ST-SelfAttn): Processes intra-modal features (video-to-video, audio-to-audio) by sequentially applying attention along spatial and temporal dimensions to reduce computational cost while maintaining fine-grained modeling.
- Coarse-Grained Cross-Attention: Injects global semantic information from the text prompt (via T5 encoder) into both branches.
- Fine-Grained Spatio-Temporal Cross-Attention (ST-CrossAttn): The core innovation. It uses learned Spatio-Temporal Priors to guide the synchronization between specific visual regions and audio frequencies.
- Multi-Modality Bidirectional Cross-Attention (MM-BiCrossAttn): Enables direct interaction between the video and audio latent spaces to enhance fusion.
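To make the cost argument for ST-SelfAttn concrete, the sketch below counts query-key pairs for full attention over all video tokens versus attention applied first within each frame and then across frames. The latent size (16 frames of 256 tokens) and the function names are illustrative assumptions, not values from the paper.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = num_frames * tokens_per_frame
    return n * n


def factorized_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when spatial and temporal attention run in sequence."""
    # Spatial pass: tokens attend to each other within each frame.
    spatial = num_frames * tokens_per_frame ** 2
    # Temporal pass: each spatial position attends across frames.
    temporal = tokens_per_frame * num_frames ** 2
    return spatial + temporal


# Hypothetical latent size: 16 frames of 256 spatial tokens each.
T, S = 16, 256
full = full_attention_pairs(T, S)
factorized = factorized_attention_pairs(T, S)
print(full, factorized, round(full / factorized, 2))
```

For this hypothetical size the factorized form needs roughly 15 times fewer pairs; in general the ratio is T*S / (T + S).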

B. Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator

To achieve precise synchronization, JavisDiT introduces a HiST-Sypo Estimator that extracts two levels of priors from the text prompt:

Global Coarse-Grained Prior: The overall semantic framework of the event (e.g., "a robot fighting a dog").
Fine-Grained Spatio-Temporal Prior: Specific details regarding where (spatial location) and when (temporal onset/duration) events occur.
- Mechanism: Instead of generating explicit text prompts, the estimator outputs latent tokens ($N_s$ spatial tokens, $N_t$ temporal tokens) sampled from a Gaussian distribution conditioned on the text.
- Training: The estimator is trained using contrastive learning on synchronous video-audio pairs (positive) and synthesized asynchronous pairs (negative). Negative samples are created via augmentation strategies like random masking, temporal shifting, and source addition/removal.
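The mechanism and training signal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the reparameterized Gaussian sampling, the circular temporal shift, and the InfoNCE-style loss are stand-ins chosen for clarity, and all function names and scores are hypothetical. The summary does not state the exact loss or network used.

```python
import math
import random


def sample_prior_tokens(means, log_vars, rng):
    """Draw one latent token per (mean, log-variance) pair: mu + sigma * eps."""
    tokens = []
    for mu, log_var in zip(means, log_vars):
        sigma = [math.exp(0.5 * lv) for lv in log_var]
        tokens.append([m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)])
    return tokens


def temporal_shift(audio_frames, shift):
    """Build an asynchronous negative by circularly shifting audio in time."""
    shift %= len(audio_frames)
    return audio_frames[-shift:] + audio_frames[:-shift]


def contrastive_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss (assumed): small when the synced pair outscores negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


rng = random.Random(0)
means = [[0.0, 1.0], [2.0, -1.0]]        # two tokens of dimension 2 (hypothetical)
log_vars = [[-2.0, -2.0], [-2.0, -2.0]]
tokens = sample_prior_tokens(means, log_vars, rng)

shifted = temporal_shift(["a0", "a1", "a2", "a3"], shift=1)

# Hypothetical prior-to-clip similarity scores.
good = contrastive_loss(pos_score=0.9, neg_scores=[0.1, 0.2])
bad = contrastive_loss(pos_score=0.1, neg_scores=[0.9, 0.8])
print(shifted, good < bad)
```

The loss is low when the prior scores the synchronized pair above the shifted or otherwise corrupted pairs, which is the behavior contrastive training is meant to encourage.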

C. Training Strategy

The model employs a three-stage training strategy:

Audio Pretraining: Initializes the audio branch with weights from a video DiT (OpenSora) and trains on 0.8M audio-text pairs to ensure high-quality single-modal generation.
ST-Prior Training: Trains the HiST-Sypo Estimator on 0.6M synchronous triplets and synthesized negative samples to learn robust spatio-temporal alignment.
JAVG Training: Freezes the self-attention blocks and the estimator, training only the cross-attention modules (ST-CrossAttn and MM-BiCrossAttn) to learn joint generation and synchronization.
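One way to picture the three stages is as a table of which modules receive gradients. The sketch below is an assumption-laden illustration: the module names are hypothetical, and treating the whole audio branch as trainable in the first stage is a guess. Only the third stage's frozen and trained modules are spelled out in the description above.

```python
# Hypothetical module names; the real model's parameter names may differ.
MODULES = [
    "audio_branch.st_self_attn",
    "audio_branch.text_cross_attn",
    "video_branch.st_self_attn",
    "video_branch.text_cross_attn",
    "st_cross_attn",
    "mm_bi_cross_attn",
    "hist_sypo_estimator",
]

# Name prefixes that are trainable in each stage (everything else is frozen).
TRAINABLE_PREFIXES = {
    "audio_pretraining": ("audio_branch.",),        # assumption: whole audio branch
    "st_prior_training": ("hist_sypo_estimator",),
    "javg_training": ("st_cross_attn", "mm_bi_cross_attn"),
}


def trainable_flags(stage):
    """Map each module name to True if it is updated in the given stage."""
    prefixes = TRAINABLE_PREFIXES[stage]
    return {name: name.startswith(prefixes) for name in MODULES}


for stage in TRAINABLE_PREFIXES:
    updated = [name for name, on in trainable_flags(stage).items() if on]
    print(stage, "->", updated)
```

In a deep learning framework the same flags would typically be applied by toggling gradient tracking on each parameter group before building the optimizer.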

3. Key Contributions

A. JavisDiT Model

A unified DiT-based model that achieves state-of-the-art synchronization by explicitly modeling fine-grained spatio-temporal priors, moving beyond simple parameter sharing or coarse semantic alignment.

B. JavisBench Benchmark

A new, large-scale benchmark designed to address the limitations of existing datasets:

Scale: 10,140 high-quality, text-captioned sounding videos.
Diversity: Covers 5 dimensions (Event Scenario, Video Style, Sound Type, Spatial Composition, Temporal Composition) and 19 categories.
Complexity: Over 50% of samples feature complex, multi-event scenarios (e.g., simultaneous sounds, off-screen sources, sequential events) found in real-world applications but missing in previous benchmarks.

C. JavisScore Metric

A robust evaluation metric for audio-video synchronization that outperforms existing metrics (like AV-Align).

Mechanism: Uses ImageBind to compute semantic alignment between video and audio segments.
Strategy: It employs a sliding window approach and calculates the mean similarity of the 40% least synchronized frames within each window. This makes the metric sensitive to local desynchronization errors rather than being skewed by the majority of synchronized frames.
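The windowing strategy can be sketched as follows. This is a simplified illustration, not the reference implementation: it assumes per-frame audio-video similarities have already been computed (for example as cosine similarities of ImageBind embeddings), it averages the per-window scores into a single number, and the window and stride sizes are hypothetical.

```python
def windowed_sync_score(frame_sims, window, stride, worst_fraction=0.4):
    """Average over sliding windows of the mean of each window's lowest similarities."""
    if len(frame_sims) < window:
        raise ValueError("need at least one full window of frames")
    window_scores = []
    for start in range(0, len(frame_sims) - window + 1, stride):
        chunk = sorted(frame_sims[start:start + window])
        # Keep only the least synchronized frames of this window.
        k = max(1, round(worst_fraction * window))
        worst = chunk[:k]
        window_scores.append(sum(worst) / len(worst))
    # Assumption: the final score is the plain average of the window scores.
    return sum(window_scores) / len(window_scores)


# Hypothetical clip: mostly well synced, with a brief slip in frames 5 and 6.
frame_sims = [0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9]
score = windowed_sync_score(frame_sims, window=5, stride=5)
plain_mean = sum(frame_sims) / len(frame_sims)
print(round(score, 2), round(plain_mean, 2))
```

In this toy example the plain mean stays high because eight of ten frames are well synced, while the windowed score drops noticeably. This is the sensitivity to local desynchronization that the strategy is designed for.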

4. Experimental Results

Performance on JavisBench: JavisDiT outperforms existing SOTA methods (including MM-Diffusion, UniVerse-1, and cascaded approaches like FoleyCrafter) across the reported metrics.
- Synchronization: Achieves a JavisScore of 0.154, narrowly surpassing the previous best (0.151 for FoleyCrafter).
- Quality: Achieves superior FVD (204.1) and FAD (7.2) scores (lower is better for both), indicating high visual and audio fidelity.
Performance on Existing Benchmarks: The model also sets new records on AIST++ and Landscape datasets, demonstrating generalizability.
Ablation Studies:
- Replacing UNet with STDiT significantly improves quality.
- The HiST-Sypo module is critical for synchronization, providing larger gains than simple bidirectional attention.
- Increasing the number of prior tokens (up to 32) and injecting them via cross-attention yield the best results.
Human Evaluation: In blind preference tests against UniVerse-1, JavisDiT won 55.3% of the time for Audio-Video Alignment and 56.0% for Audio Quality, though it slightly trailed in video quality due to the backbone difference (OpenSora vs. Wan2.1).

5. Significance

Advancement in Multimodal Generation: JavisDiT establishes a new standard for JAVG by proving that fine-grained spatio-temporal priors are essential for realistic synchronization, moving the field beyond coarse semantic alignment.
Real-World Applicability: By introducing JavisBench and JavisScore, the paper provides the community with the necessary tools to evaluate and develop models for complex, real-world scenarios (e.g., movies, interactive media, virtual reality) rather than just simple, controlled datasets.
Future Directions: The work opens pathways for "X-conditional" generation (e.g., video-to-audio, audio-to-video, extension tasks) using a unified DiT architecture with dynamic masking strategies.
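The idea of covering several conditional tasks with one model through masking can be illustrated with a small sketch. The summary only names the tasks and mentions "dynamic masking strategies", so the mask layout, the task names, and the given_fraction parameter below are hypothetical.

```python
def condition_masks(task, video_len, audio_len, given_fraction=0.5):
    """Return (video_mask, audio_mask); True marks positions the model must generate."""
    if task == "joint":
        return [True] * video_len, [True] * audio_len
    if task == "video_to_audio":
        # Video is given as a condition; only audio is generated.
        return [False] * video_len, [True] * audio_len
    if task == "audio_to_video":
        # Audio is given as a condition; only video is generated.
        return [True] * video_len, [False] * audio_len
    if task == "extension":
        # The leading part of both streams is given; the rest is generated.
        v_cut = int(video_len * given_fraction)
        a_cut = int(audio_len * given_fraction)
        return ([i >= v_cut for i in range(video_len)],
                [i >= a_cut for i in range(audio_len)])
    raise ValueError(f"unknown task: {task}")


v2a_video, v2a_audio = condition_masks("video_to_audio", video_len=4, audio_len=6)
ext_video, ext_audio = condition_masks("extension", video_len=4, audio_len=6)
print(v2a_video, v2a_audio)
print(ext_video, ext_audio)
```

Under this layout, positions marked False would keep their clean (given) latents during denoising, and positions marked True would start from noise.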

In summary, JavisDiT represents a significant leap in synchronized audio-video generation by combining a powerful DiT backbone with a novel hierarchical prior estimation mechanism, validated by a rigorous new benchmark and metric.