Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

This paper introduces the unbiased sliced Wasserstein RBF kernel with rotary positional embedding to effectively capture temporal relationships between audio and text, thereby mitigating exposure bias and significantly improving caption quality, diversity, and reasoning capabilities in audio-language tasks.

Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

Published 2026-02-27

The Big Problem: The "Robot" Who Only Reads the Script

Imagine you are teaching a robot to describe a sound (like a dog barking or a car honking).

  • The Training: You play the sound and show the robot the perfect sentence: "A dog barks loudly." The robot learns to copy this perfectly.
  • The Problem (Exposure Bias): When you ask the robot to describe a new sound on its own, it has to guess the next word based on what it just wrote. If it makes a tiny mistake early on (e.g., it writes "A cat meows..." instead of "A dog barks..."), it gets confused. Because it was only trained to follow the perfect script, it doesn't know how to recover from its own mistakes. It spirals into nonsense, like "A cat meows loudly at the moon until the sky turns green."

This is called caption degeneration. The robot gets stuck in a loop of bad guesses.

The Old Solution: The "Cosine Similarity" Compass (And Why It Failed)

Researchers tried to fix this by teaching the robot to check if its guess "feels right" compared to the sound. They used a tool called Cosine Similarity.

  • The Analogy: Imagine the sound and the sentence are two arrows on a map. Cosine similarity just checks if the arrows point in the same general direction.
  • The Flaw: This tool is too lazy. It ignores time.
    • Scenario: A sound of a drum beat followed by a cymbal crash.
    • Sentence A: "Drum then cymbal." (Correct order)
    • Sentence B: "Cymbal then drum." (Wrong order)
    • The Old Tool: "Hey, both sentences have the words 'drum' and 'cymbal'! They are 99% similar!"
    • Result: The robot picks the wrong sentence because the tool didn't care about the sequence of events.
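This order-blindness is easy to reproduce. In the sketch below, the word vectors are toy 3-d embeddings invented for illustration; with the common mean-pooling trick, a sentence becomes the average of its word vectors, so the two orderings collapse to the exact same vector:

```python
import numpy as np

# Toy word embeddings (hypothetical 3-d vectors, for illustration only).
emb = {
    "drum":   np.array([1.0, 0.2, 0.0]),
    "then":   np.array([0.1, 1.0, 0.1]),
    "cymbal": np.array([0.0, 0.3, 1.0]),
}

def mean_pool(words):
    """Average the word vectors -- a common way to embed a sentence."""
    return np.mean([emb[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = mean_pool(["drum", "then", "cymbal"])   # correct order
b = mean_pool(["cymbal", "then", "drum"])   # wrong order

print(cosine(a, b))  # -> 1.0: averaging erases word order entirely
```

The average of the same three vectors is identical no matter the order, so cosine similarity judges the two sentences to be a perfect match.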

The New Solution: The "Unbiased Sliced Wasserstein" (USW) Lens

The authors of this paper built a new, super-smart measuring tool called the USW-RBF Kernel. Think of it as a high-tech microscope that looks at both the content and the timing of the sound and the text.

Here is how it works, broken down into three parts:

1. The "Sliced" Approach (Looking at Shadows)

Calculating the exact distance between complex sounds and sentences is like trying to measure the volume of a squishy, irregular jellyfish. In high dimensions, the exact Wasserstein distance is expensive to compute and hard to estimate reliably.

  • The Analogy: Instead of measuring the whole jellyfish, the USW tool shines a light on it from many different angles, creating 2D shadows (slices). It measures the distance between the shadows.
  • Why it's good: It's fast and avoids the "curse of dimensionality" (getting lost in too many details).
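The shadow trick can be sketched in a few lines. This is a generic Monte Carlo sliced Wasserstein estimator between two point clouds, not the paper's exact kernel: project onto random directions, then solve the easy 1-D transport problem (which is just sorting) on each shadow.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein distance between
    two point clouds X, Y of shape (n, d) (equal n, for simplicity)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random directions on the unit sphere: the "angles of the light".
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project each cloud onto each direction -> 1-D "shadows".
    Xp = X @ theta.T   # shape (n, n_projections)
    Yp = Y @ theta.T
    # In 1-D, optimal transport reduces to sorting both shadows.
    Xp.sort(axis=0)
    Yp.sort(axis=0)
    return float(np.mean(np.abs(Xp - Yp) ** p)) ** (1 / p)

X = np.random.default_rng(1).normal(size=(50, 16))
print(sliced_wasserstein(X, X))        # identical clouds -> 0.0
print(sliced_wasserstein(X, X + 1.0))  # shifted cloud -> clearly > 0
```

The cost is dominated by sorting, so it scales gently with dimension, which is exactly how the slicing sidesteps the curse of dimensionality.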

2. The "Rotary Positional Embedding" (The Time-Stamp)

This is the magic ingredient that fixes the "time" problem.

  • The Analogy: Imagine the words in a sentence are beads on a string. The old tools just looked at the beads. The USW tool puts a GPS tracker on every bead. It knows that the "drum" bead is at position 1 and the "cymbal" bead is at position 2.
  • The Result: If the robot writes "Cymbal then drum," the GPS trackers show the beads are in the wrong order. The tool immediately says, "Nope, that doesn't match the sound!"
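Here is a minimal sketch of the rotation idea behind rotary positional embeddings, using toy 2-d features; the paper's actual embedding operates on the model's high-dimensional feature vectors, but the mechanism is the same: rotate each pair of features by an angle proportional to the token's position.

```python
import numpy as np

def rotate(vec, position, base=10000.0):
    """Rotary positional embedding (sketch): rotate each 2-D pair of
    features by an angle that grows with the token's position."""
    d = len(vec)
    out = vec.astype(float).copy()
    for i in range(0, d, 2):
        angle = position / base ** (i / d)
        c, s = np.cos(angle), np.sin(angle)
        x, y = out[i], out[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out

drum, cymbal = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# "drum then cymbal": drum at position 0, cymbal at position 1.
seq_a = [rotate(drum, 0), rotate(cymbal, 1)]
# "cymbal then drum": same words, swapped positions.
seq_b = [rotate(cymbal, 0), rotate(drum, 1)]
# Without the rotation, both sequences contain identical vectors;
# with it, "drum at position 1" differs from "drum at position 0".
print(np.allclose(seq_a[0], seq_b[1]))  # -> False
```

Because each word's vector now encodes where it sits in the sequence, a distance computed on these vectors can tell the two orderings apart.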

3. The "Unbiased" Guarantee (The Honest Judge)

In math, some tools make small, consistent errors (bias) that mess up the learning process.

  • The Analogy: Imagine a judge who always gives a slightly higher score to the home team. That's a biased judge. The USW tool is an unbiased judge: its random errors cancel out on average, so there is no systematic tilt toward the robot or the sound. This allows the robot to learn faster and more accurately using "stochastic" (randomized) methods, because on average every update pushes it in the right direction.
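A toy demonstration of what "bias" means here (this illustrates the general problem, not the paper's estimator): averaging raw samples is unbiased, but plugging a sample average into a nonlinear function, such as the exponential inside an RBF kernel, systematically overshoots by Jensen's inequality, a "home-team tilt" that no amount of averaging removes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0                                    # quantity being judged
trials = rng.exponential(true_mean, size=(100_000, 5))  # 5 samples/trial

# Unbiased judge: the sample mean scatters around the true value.
print(trials.mean())                               # ~2.0

# Biased judge: exp(-sample_mean) overshoots exp(-true_mean) on average,
# because the exponential is convex (Jensen's inequality). The gap does
# not shrink as we run more trials -- it is a consistent tilt.
plug_in = np.exp(-trials.mean(axis=1))
print(plug_in.mean(), "vs true", np.exp(-true_mean))
```

Designing the kernel estimator so this tilt is provably zero is what lets the authors train with noisy, randomized mini-batch estimates without the learning drifting off target.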

How They Fixed the Robot: The "Tasting Menu" Strategy

Even with the new measuring tool, the robot still needs a way to stop making mistakes during the final test.

  • The Old Way (Beam Search): The robot tries to find the single most likely path. If it takes a wrong turn, it's stuck.
  • The New Way (Stochastic Decoding + Reranking):
    1. The Chef: The robot acts like a chef making 30 different versions of a dish (30 different captions) using a bit of randomness (like adding a pinch of salt here or there).
    2. The Critic: The USW-RBF tool acts as the food critic. It tastes all 30 dishes.
    3. The Selection: It picks the one that matches the sound perfectly in both content and timing.
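The sample-and-rerank loop can be sketched with toy stand-ins; `sample_caption` and `toy_score` below are hypothetical placeholders for the real stochastic decoder and the USW-RBF scorer.

```python
import random

def sample_caption(rng):
    """Toy stand-in for a stochastic decoder: randomizes event order."""
    events = ["drum", "cymbal"]
    rng.shuffle(events)
    return " then ".join(events)

def toy_score(audio_events, caption):
    """Toy stand-in for the USW-RBF critic: rewards the correct order."""
    return 1.0 if caption == " then ".join(audio_events) else 0.0

def caption_audio(audio_events, n_candidates=30, seed=0):
    rng = random.Random(seed)
    # The chef: cook up many candidate captions with a dash of randomness.
    candidates = [sample_caption(rng) for _ in range(n_candidates)]
    # The critic: keep the candidate that best matches the audio.
    return max(candidates, key=lambda c: toy_score(audio_events, c))

print(caption_audio(["drum", "cymbal"]))  # -> "drum then cymbal"
```

A single greedy decode that guesses the wrong order first has no way back, but with 30 random candidates the correct ordering almost surely appears somewhere, and the critic fishes it out.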

The Results: Why Should We Care?

The authors tested this on two big datasets (AudioCaps and Clotho) and even on "Audio Reasoning" tasks (where the AI has to answer questions about sounds).

  • Better Descriptions: The captions are more descriptive and less repetitive.
  • Better Timing: The AI finally understands that "drum then cymbal" is different from "cymbal then drum."
  • Generalization: This tool isn't just for writing captions; it helps AI understand complex audio logic (like solving a puzzle about a sound).

Summary in One Sentence

The authors built a new mathematical "ruler" that measures not just what words are in a sentence, but when they happen, allowing AI to describe sounds with human-like accuracy and without getting confused by its own mistakes.
