Imagine you are at a busy orchestra concert. You want to know: "Which instrument is playing the high-pitched solo right now, and is the drummer still hitting the snare?"
To answer this, you don't just look at the stage; you also listen carefully. Sometimes, the visual clues are tricky (a flutist might be standing very still), but the sound is distinct.
This paper introduces a new AI system called QSTar (Query-guided Spatial–Temporal–Frequency Interaction) that acts like a super-smart concert critic. It's designed to answer questions about videos by combining what it sees, what it hears, and what it reads (the question).
Here is how it works, broken down into simple concepts:
1. The Problem: The "Late Arrival" Guest
Most existing AI systems for video questions are like a guest who arrives at the party after everyone has already started dancing.
- How they work: They watch the video and listen to the audio separately, make their own notes, and only at the very end do they look at the question to decide what to say.
- The flaw: By the time they look at the question, they've already formed a generic opinion. If the question asks about a specific subtle sound (like a quiet flute), the AI might have already ignored it because it was too busy looking at the big, obvious movements on stage.
2. The Solution: The "Sherlock Holmes" Approach (QSTar)
The authors built QSTar to be different. Instead of waiting until the end, QSTar brings the question to the very beginning of the investigation. It's like a detective who reads the case file before entering the crime scene, so they know exactly what clues to look for.
Here are the three "superpowers" QSTar uses:
A. The "Question-First" Filter (Query-Guided Multimodal Correlation)
Imagine you are looking for a red car in a parking lot.
- Old AI: Looks at every car, takes a picture of every single one, and then asks, "Was it red?"
- QSTar: Reads "Red Car" first. As it scans the lot, it instantly filters out the blue and green cars, focusing its energy only on the red ones.
- In the paper: The system takes the text question and uses it to "tune" the audio and video data immediately. It tells the audio system, "Listen for flutes," and the video system, "Look for a flute player," right from the start.
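To make the "question-first" idea concrete, here is a minimal, hypothetical sketch in plain Python (not the paper's exact architecture): an encoded question scores every frame of audio or video features, and a softmax turns those scores into attention weights applied before any answer reasoning happens. The feature dimensions and the `query_guided_filter` helper are illustrative assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_guided_filter(frames, query):
    """Re-weight each frame's feature vector by its relevance to the query,
    so question-relevant frames dominate everything downstream."""
    weights = softmax([dot(f, query) for f in frames])
    tuned = [[w * x for x in f] for w, f in zip(weights, frames)]
    return tuned, weights

# Toy 3-dim features: imagine dim 0 ~ "flute-like", dim 2 ~ "drum-like".
frames = [[0.9, 0.1, 0.0],   # frame where a flute sounds
          [0.1, 0.2, 0.8],   # frame dominated by a drum hit
          [0.8, 0.0, 0.1]]   # flute again
query = [1.0, 0.0, 0.0]      # the question asks about the flute

tuned, weights = query_guided_filter(frames, query)
```

With the flute-oriented query, the drum-dominated frame receives the smallest weight, which is exactly the "filter out the blue and green cars" behavior described above.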
B. The "Three-Dimensional Detective" (Spatial–Temporal–Frequency)
To understand music, you need to look at it in three ways, not just one. QSTar does this simultaneously:
- Spatial (Where?): It zooms in on the specific part of the video where the sound is coming from (e.g., the person holding the violin).
- Temporal (When?): It tracks time. It knows that the violin started playing at second 10 and stopped at second 15.
- Frequency (What does it sound like?): This is the secret sauce.
- The Analogy: Imagine a flute and a violin playing the same note. Visually, they might look similar (a person holding an instrument). But their "fingerprint" in the sound is totally different.
- QSTar looks at the frequency (the pitch and tone patterns) like a fingerprint scanner. It can tell, "Even though I can't see the flute player moving much, the high-frequency sound pattern proves a flute is playing."
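The "fingerprint scanner" intuition can be illustrated with a toy discrete Fourier transform (the paper's actual frequency branch operates on learned features, not raw DFTs like this): two tones that might look identical on stage are trivially separable once you check which frequency bin carries the energy.

```python
import math

def dominant_bin(signal):
    """Naive DFT magnitude scan; return the strongest frequency bin (< N/2)."""
    n = len(signal)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k

n = 64
# Two pure tones standing in for a "violin" (low) and a "flute" (high).
low_tone  = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 12 * t / n) for t in range(n)]

low_bin = dominant_bin(low_tone)
high_bin = dominant_bin(high_tone)
```

Even if both players look motionless on camera, `low_bin` and `high_bin` land in different places, which is the kind of evidence the frequency branch contributes.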
C. The "Prompting" Coach (Query Context Reasoning)
Before giving the final answer, QSTar has a "coach" moment. It uses a technique called prompting (similar to how you might ask a human, "Think about the type of instrument and how loud it is").
- It takes the specific keywords from your question (e.g., "how many," "which instrument," "when") and uses them to double-check its work.
- It ensures the final answer isn't just a guess, but a reasoned conclusion based on the specific constraints of the question.
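The keyword-driven "double-check" can be sketched as a tiny prompt builder. The keyword list and templates here are purely illustrative assumptions, not taken from the paper; the point is that the question type selects a constraint the final answer must satisfy.

```python
# Map question keywords to answer-type constraints (illustrative only).
QUESTION_TYPES = {
    "how many": "counting",
    "which": "identification",
    "when": "temporal",
    "where": "spatial",
}

def build_reasoning_prompt(question):
    """Match question keywords to a type and emit a constraint prompt
    that guides the final reasoning step."""
    q = question.lower()
    for keyword, qtype in QUESTION_TYPES.items():
        if keyword in q:
            return f"Task type: {qtype}. Answer must satisfy a '{keyword}' constraint."
    return "Task type: open. No keyword constraint detected."

prompt = build_reasoning_prompt("Which instrument is playing the solo?")
```

A "how many" question would instead select the counting template, so the model cannot answer a counting question with an instrument name.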
3. Why This Matters
The paper evaluated QSTar on MUSIC-AVQA, a large benchmark of musical performance videos paired with questions and answers.
- The Result: QSTar outperformed all previous state-of-the-art methods on the benchmark.
- The Win: It was especially good at tricky questions where the visual clues were weak (like a quiet instrument) or when multiple instruments were playing at once. It didn't just "see" the video; it truly "understood" the scene by listening, watching, and reading all at the same time.
Summary Metaphor
If old AI methods were like a tourist who takes a photo of a concert and then tries to guess what happened later, QSTar is like a musician in the audience. The musician reads the sheet music (the question) first, listens to the specific frequencies (frequency analysis), watches the conductor's hands (spatial/temporal), and knows exactly which instrument is playing the solo, even if the camera is blurry.
This new method proves that to truly understand a video, you have to let the question guide your eyes and ears from the very first second.