SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

SpA2V is a novel two-stage framework that leverages inherent spatial auditory cues, such as loudness and frequency, to guide a multimodal large language model in creating video scene layouts, which then steer a pre-trained video generator, without any retraining, to produce spatially and semantically accurate videos from audio inputs.

Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen

Published 2026-03-17

Imagine you are sitting in a dark room, blindfolded. Someone plays a recording of a car engine roaring past you, moving from your right side to your left. Even without seeing anything, your brain instantly constructs a mental movie: "A car is speeding by, coming from the right and zooming past to the left."

Humans are naturally good at this "audio-to-movie" translation. We don't just hear what is making noise; we hear where it is and how it's moving.

SpA2V is a new computer program designed to do exactly what your brain does, but for AI. It takes a sound recording and generates a realistic video that matches not just the story of the sound, but the spatial choreography of it.

Here is how it works, broken down into simple steps with some creative analogies:

The Problem: The "Blind" AI

Most current AI video generators are like a listener who can name the instrument but has no idea where it sits in the room. Given a car sound, they might generate a video of a car. But if you play a recording of a car zooming from right to left, the AI often just shows a car sitting still or moving randomly. It misses the spatial cues (the "where" and "how").

The Solution: The Two-Stage "Director and Animator"

The authors of this paper created a framework called SpA2V (Spatial Audio-to-Video). Think of it as a two-person team working in a movie studio:

Stage 1: The "Audio Detective" (Video Planner)

Before drawing a single frame, the AI needs a script. In this stage, a multimodal large language model (MLLM) acts like a detective or a film director.

  • The Clues: The detective listens to the audio. It doesn't just say, "That's a car." It analyzes the physics of the sound:
    • Volume getting louder? The object is getting closer.
    • Pitch bending as it passes (the Doppler effect)? The object is moving fast.
    • Sound moving from left ear to right ear? The object is crossing the screen.
  • The Blueprint: Instead of writing a vague sentence like "A car drives by," the detective draws a precise Video Scene Layout (VSL). This is like a storyboard or a blueprint. It says: "At frame 1, put a car at the far right. At frame 5, move it to the far left. Make it bigger as it gets closer."
  • The Secret Sauce (In-Context Learning): To get really good at this, the detective is given a "cheat sheet" of previous examples. If the audio sounds like a guitar, the AI looks at how it handled a guitar recording before to remember the rules. This helps it avoid guessing wrong.
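In the paper, this cue-reading and blueprint-drawing is done by an MLLM, not by hand-written signal processing. Still, the core idea can be sketched with plain NumPy: estimate loudness (a proxy for distance) and the left/right level difference (a proxy for horizontal position) per frame, then turn them into a toy per-frame layout of bounding boxes. Every function name and mapping below is an illustrative assumption, not the paper's actual pipeline:

```python
import numpy as np

def spatial_cues(stereo, n_frames):
    """Split a stereo clip (shape (2, samples)) into n_frames windows and
    estimate per window: overall loudness (RMS) and the interaural level
    difference (ILD) between the right and left channels."""
    left, right = stereo
    win = stereo.shape[1] // n_frames
    cues = []
    for i in range(n_frames):
        l = left[i * win:(i + 1) * win]
        r = right[i * win:(i + 1) * win]
        rms_l = np.sqrt(np.mean(l ** 2) + 1e-12)
        rms_r = np.sqrt(np.mean(r ** 2) + 1e-12)
        loudness = (rms_l + rms_r) / 2          # louder -> closer
        ild_db = 20 * np.log10(rms_r / rms_l)   # positive -> object on the right
        cues.append({"loudness": float(loudness), "ild_db": float(ild_db)})
    return cues

def cues_to_layout(cues, width=320, height=240):
    """Turn cues into a toy Video Scene Layout: one centre/size box per frame.
    Horizontal position follows the ILD; box size follows loudness."""
    peak = max(c["loudness"] for c in cues) or 1.0
    layout = []
    for c in cues:
        # map ILD (clipped to -10..10 dB) onto the horizontal axis
        x = width * (0.5 + np.clip(c["ild_db"], -10, 10) / 20)
        size = 40 + 120 * (c["loudness"] / peak)  # louder -> bigger box
        layout.append({"cx": float(x), "cy": height / 2,
                       "w": float(size), "h": float(size)})
    return layout
```

Feeding this a clip whose right channel fades while the left grows yields boxes that slide from the right edge toward the left, exactly the "car crossing the screen" blueprint described above.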

Stage 2: The "Animator" (Video Generator)

Now that the director has the blueprint (the VSL), the second AI steps in. This is the Animator.

  • The Magic: The Animator is a pre-trained AI that is already great at making videos from text. But here, instead of just listening to text, it is locked to the blueprint.
  • The Result: It takes the "car moving from right to left" instructions from the blueprint and paints the actual video frames. Because it is following the strict blueprint from Stage 1, the car in the video actually moves from right to left, matching the sound perfectly.
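The paper couples the layout to the generator inside the model itself; as a rough, hypothetical illustration of the interface, the per-frame boxes can be rasterised into binary spatial masks that tell a generator where it is allowed to paint the object at each frame (the `layout_to_masks` name and the mask-based interface are assumptions for illustration):

```python
import numpy as np

def layout_to_masks(layout, width=320, height=240):
    """Rasterise a per-frame layout of centre/size boxes into binary masks.
    A layout-guided generator could use such masks to bias its attention so
    the object lands inside the box at each frame."""
    masks = []
    for box in layout:
        mask = np.zeros((height, width), dtype=bool)
        x0 = int(max(box["cx"] - box["w"] / 2, 0))
        x1 = int(min(box["cx"] + box["w"] / 2, width))
        y0 = int(max(box["cy"] - box["h"] / 2, 0))
        y1 = int(min(box["cy"] + box["h"] / 2, height))
        mask[y0:y1, x0:x1] = True
        masks.append(mask)
    return masks
```

Because the masks move frame by frame, anything the generator confines to them inherits the same right-to-left motion the audio described.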

Why is this a big deal?

Think of it like the difference between a bad dub and a real movie.

  • Old AI: You hear a car zooming past, but the video shows a car parked in a field. It feels fake and jarring.
  • SpA2V: You hear the car zooming, and the video shows the car zooming past the camera, changing size and position exactly as the sound suggests. It feels immersive and real.

The "Training-Free" Trick

Usually, teaching an AI a new job requires feeding it thousands of hours of data and retraining it from scratch (like going back to school). SpA2V is clever because it never retrains the heavy-lifting parts.

  • It uses a "frozen" brain (pre-trained models) that already knows how to draw and move things.
  • It just adds a new "instruction manual" (the Layout) to tell that brain exactly what to do.
  • This saves a massive amount of time and computing power, making the system efficient and easy to update.
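The "frozen brain" idea can be shown in a minimal PyTorch sketch, where a tiny two-layer network stands in for the real pre-trained generator and `layout_embedding` is a hypothetical encoded layout; the point is simply that no weight ever gets a gradient:

```python
import torch

# Stand-in for a large pre-trained video generator (the frozen "brain").
generator = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.Linear(64, 16),
)
for p in generator.parameters():
    p.requires_grad = False  # no retraining: the weights stay fixed

# Only the lightweight "instruction manual" (an encoded layout) is passed
# in at inference time; nothing is learned.
layout_embedding = torch.randn(1, 16)  # hypothetical encoded VSL
with torch.no_grad():
    frames_latent = generator(layout_embedding)
```

All the adaptation lives in the input (the layout), not in the weights, which is why swapping in a newer generator later requires no new training run.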

In Summary

SpA2V is like giving an AI a pair of "spatial ears." It listens to a sound, figures out the 3D movement of the objects making that sound, draws a precise map of where they should be, and then paints a video that perfectly matches that map. It turns a simple audio file into a spatially accurate movie, just like your brain does when you close your eyes and listen.
