TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

This paper introduces TW-Sound580K, a rigorously curated Taiwanese audio-text dataset built with a Verify-Generate-Critique protocol. Training the Tai-LALM model on it, combined with a dynamic arbitration strategy at inference time, significantly improves localized audio-language modeling performance.

Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee

Published 2026-03-06

Imagine you have a brilliant, super-smart robot that can listen to the world and tell you what it hears. This robot is trained on millions of hours of "standard" English and Mandarin. It's great at understanding a news anchor in a studio or a clear conversation in a quiet room.

But, if you take this robot to a bustling night market in Taiwan, or ask it to listen to a grandmother telling a story in a local dialect, it starts to get confused. It hears the unique rhythm of the local speech and the background sounds of street vendors, but because it's never heard them before, it treats them like "static noise." It might try to force a meaning onto sounds that don't fit, essentially "hallucinating" a story that isn't there.

This paper introduces a solution to that problem, consisting of three main parts: a new library of sounds, a quality control process, and a smart decision-maker.

1. The Library: TW-Sound580K

Think of the current AI models as students who only studied from textbooks written in a perfect, sterile classroom. They don't know how to handle the messy, real world.

The authors built a massive new library called TW-Sound580K. It contains over 580,000 audio clips specifically from Taiwan.

  • What's in it? It's not just people speaking clearly. It includes dialects, different accents, background noises like temple bells or market chatter, and emotional tones unique to the region.
  • The Goal: To teach the AI that these "messy" local sounds aren't errors; they are important clues to understanding the culture and the message.

2. The Quality Control: The "Verify-Generate-Critique" (VGC) Pipeline

Here is the tricky part: How do you teach an AI using data that is messy and full of dialects without teaching it bad habits?

Imagine you are hiring a team of translators to create a dictionary for a new language.

  • The Problem: If you just ask one translator to write down what they hear, they might make mistakes, especially with difficult dialects.
  • The Solution (The VGC Pipeline):
    1. Verify (The Double-Check): They use two different "ears" (two different speech recognition systems) to listen to the same clip. If both ears agree on what was said, it's good. If they disagree wildly, the clip is likely too noisy or confusing, so they throw it out.
    2. Generate (The Creative Writer): A super-smart "Teacher AI" listens to the clean clips and writes down descriptions. But instead of just writing text, it's forced to stick only to what it actually hears, preventing it from making things up.
    3. Critique (The Editor): The Teacher AI then reviews its own work, acting like a strict editor. It asks, "Did I describe this sound accurately, or did I just guess?" If it guessed, it deletes that part.

This process ensures that the AI learns from high-quality, accurate examples, not from its own mistakes.
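The Verify step above is essentially a cross-checking filter: transcribe each clip twice and keep it only when the two recognizers roughly agree. A minimal sketch of that idea in Python (the actual recognizers, agreement metric, and threshold used in the paper are not specified here; the word-level agreement score and the 0.8 cutoff below are illustrative assumptions):

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # delete a word
                        dp[j - 1] + 1,        # insert a word
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def agreement_score(hyp1, hyp2):
    """Symmetric word-level agreement between two ASR hypotheses (1.0 = identical)."""
    w1, w2 = hyp1.split(), hyp2.split()
    if not w1 and not w2:
        return 1.0
    return 1.0 - word_edit_distance(w1, w2) / max(len(w1), len(w2))

def verify(hyp1, hyp2, threshold=0.8):
    """Keep a clip only if the two recognizers roughly agree on its content."""
    return agreement_score(hyp1, hyp2) >= threshold
```

A clip whose two transcripts differ wildly scores near 0.0 and is discarded; near-identical transcripts score near 1.0 and pass through to the Generate step.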

3. The Smart Decision-Maker: Dynamic Arbitration

Even with a great library, the AI might still get stuck when it hears a really tricky dialect during a real conversation.

Imagine the AI is a detective trying to solve a mystery. Usually, it asks one witness (a speech recognition system) for the story. But sometimes, that witness is confused by the accent.

  • The New Strategy: The AI now has a "Chief Detective" (the Arbiter). When the witnesses give different versions of the story, the Chief Detective doesn't just pick one. It listens to the original audio again and asks, "Which version of the story makes the most sense given the sound I'm hearing right now?"
  • It uses a math trick called AC-PPL (Acoustically-Conditioned Perplexity) to measure how well a guess fits the sound. If the guess feels "off" compared to the audio, it rejects it and tries another one. This stops the AI from confidently stating nonsense.
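Mechanically, the arbitration above amounts to scoring each candidate transcript by its perplexity under a model conditioned on the audio, then keeping the lowest-perplexity one. A minimal sketch, where `score_fn` stands in for a hypothetical audio-conditioned language model that returns per-token log-probabilities (the real AC-PPL scoring model is not detailed here):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log): exp of mean NLL."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def arbitrate(candidates, score_fn):
    """Return the candidate transcript with the lowest audio-conditioned perplexity.

    candidates: transcript strings proposed by different recognizers.
    score_fn(text): per-token log-probs of `text` given the audio
                    (a placeholder for the actual model call).
    """
    return min(candidates, key=lambda c: perplexity(score_fn(c)))
```

A transcript that "fits" the audio gets high token probabilities, hence low perplexity, and wins the arbitration; a confident-sounding but acoustically mismatched guess gets penalized and rejected.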

The Results: Does it Work?

The authors tested their new AI, called Tai-LALM, on a tough exam designed for Taiwanese audio (the TAU Benchmark).

  • Before: The standard AI got about 42.6% correct.
  • After: The new AI, trained on the TW-Sound580K library and using the smart decision-making strategy, scored 49.1%.

That might not sound like a huge jump, but in the world of AI, a 6.5-percentage-point improvement (from 42.6% to 49.1%) is massive. It proves that by focusing on local, high-quality data and smart filtering, you can teach a global AI to understand local culture much better.

The Big Picture

This paper teaches us that you can't just make AI smarter by making it bigger (adding more parameters). Sometimes, you have to make it more specific. Just like a human needs to learn the local slang and customs to truly understand a community, AI needs a "local library" and a "local editor" to stop hallucinating and start understanding.

In short: They built a specialized school for AI using real Taiwanese sounds, hired a strict editor to clean up the lessons, and taught the AI to double-check its answers. The result is an AI that finally "gets" the local vibe.