Imagine you are trying to teach a robot to have a natural, real-time conversation with a human. You want the robot to be able to listen while it speaks, interject politely when it has a point to make, and say "uh-huh" while you're talking (backchanneling). This is called full-duplex conversation.
The problem is, most robots today are like people who can only talk when the other person is completely silent. If you try to interrupt them, they get confused or stop talking.
Why? Because the "textbooks" (training data) we use to teach them are terrible at showing how real people talk. Real conversations are messy: people talk over each other, they say "yeah" while you're mid-sentence, and there's often background music or noise. Most existing data cleaning tools treat these messy overlaps as "errors" and cut them out, leaving the robot with a sanitized, robotic version of human speech.
Enter Sommelier.
What is Sommelier?
Think of Sommelier as a high-end, automated audio sommelier (a wine expert). Just as a sommelier uncorks a bottle, filters out the sediment, and decants the wine to reveal its true flavor, Sommelier takes raw, messy, "wild" audio recordings (like podcasts or radio shows) and processes them into a high-quality "vintage" suitable for training advanced AI.
Here is how it works, broken down into simple steps:
1. The "Decanting" (Standardization & Cleaning)
Raw audio is like a bottle of wine with different corks, labels, and temperatures. Sommelier first standardizes everything. It converts all the audio to a common sample rate and format and normalizes its loudness, so the AI doesn't get confused by loud whispers or quiet shouting.
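Loudness normalization like this is usually just a gain adjustment toward a target level. Here is a minimal sketch (not the paper's actual code; the target level and function name are assumptions) that scales a clip to a fixed RMS loudness:

```python
import numpy as np

TARGET_RMS_DB = -20.0  # assumed target level in dBFS, not from the paper

def normalize_loudness(samples: np.ndarray, target_db: float = TARGET_RMS_DB) -> np.ndarray:
    """Scale a mono clip (floats in [-1, 1]) to the target RMS loudness."""
    rms = np.sqrt(np.mean(samples ** 2))
    if rms < 1e-8:                      # effectively silence: nothing to scale
        return samples
    gain = 10 ** (target_db / 20) / rms
    return np.clip(samples * gain, -1.0, 1.0)  # guard against clipping

# A quiet clip and a loud clip both end up at the same level.
quiet = 0.01 * np.sin(np.linspace(0, 100, 16000))
loud = 0.90 * np.sin(np.linspace(0, 100, 16000))
for clip in (quiet, loud):
    out = normalize_loudness(clip)
    print(round(20 * np.log10(np.sqrt(np.mean(out ** 2))), 1))  # -20.0 both times
```

Production pipelines typically use a perceptual loudness measure (e.g. LUFS) rather than raw RMS, but the idea is the same: every clip gets pulled to one consistent level.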
2. The "Guest List" (Speaker Diarization)
In a crowded party, it's hard to know who is saying what. Sommelier uses a smart system (called Sortformer) to act like a bouncer with a guest list. It figures out exactly who is speaking and when. Crucially, unlike older systems that get confused when two people talk at once, Sommelier is excellent at spotting short, quick interjections (like "Right!" or "I see") that happen while someone else is talking.
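The output of a diarizer boils down to "who spoke when" segments, from which overlaps and short interjections can be spotted. This is an illustrative sketch only (real systems like Sortformer work at the frame level; the segment format and backchannel threshold here are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float

def overlaps(segments):
    """Return (speaker_a, speaker_b, start, end) for every overlapping pair."""
    found = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            lo, hi = max(a.start, b.start), min(a.end, b.end)
            if a.speaker != b.speaker and lo < hi:
                found.append((a.speaker, b.speaker, lo, hi))
    return found

def is_backchannel(seg, max_dur=1.0):
    """Very short utterances (e.g. 'Right!', 'I see') are likely backchannels."""
    return (seg.end - seg.start) <= max_dur

timeline = [
    Segment("A", 0.0, 5.0),   # A holds the floor
    Segment("B", 2.0, 2.4),   # B interjects "Right!" mid-sentence
    Segment("B", 5.0, 9.0),   # B takes a full turn
]
print(overlaps(timeline))           # [('A', 'B', 2.0, 2.4)]
print(is_backchannel(timeline[1]))  # True
```

The key point from the text: instead of discarding that 2.0–2.4s overlap as an "error," the pipeline keeps it and labels it, so the model can learn what a backchannel sounds like.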
3. The "Unmixing" (Handling Overlaps)
This is Sommelier's superpower. In real life, people talk over each other. Old systems would just delete the overlapping part or guess the wrong words.
Sommelier acts like a magical audio mixer. When two voices overlap, it uses a special separation technique to "unmix" the audio. It takes the tangled knot of two voices and unties them, creating two clean, separate tracks. It then figures out which part belongs to Speaker A and which belongs to Speaker B, ensuring the AI learns that both people were speaking at the same time.
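After a separation model splits an overlapped region into two tracks, the remaining puzzle is which track belongs to which speaker. A common approach (sketched here with toy vectors; the actual method and embeddings in the paper may differ) is to compare a voice embedding of each separated track against reference embeddings taken from each speaker's clean, non-overlapped speech:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_tracks(track_embs, ref_embs):
    """Map each separated track to the reference speaker it most resembles."""
    return {
        track: max(ref_embs, key=lambda spk: cosine(emb, ref_embs[spk]))
        for track, emb in track_embs.items()
    }

# Toy 3-d vectors standing in for real speaker embeddings (e.g. x-vectors).
refs = {"A": np.array([1.0, 0.1, 0.0]), "B": np.array([0.0, 0.2, 1.0])}
tracks = {
    "track1": np.array([0.1, 0.1, 0.9]),  # closer to B's voice
    "track2": np.array([0.9, 0.0, 0.1]),  # closer to A's voice
}
print(assign_tracks(tracks, refs))  # {'track1': 'B', 'track2': 'A'}
```

Once assigned, both tracks keep their original timestamps, so the training data records that Speaker A and Speaker B were genuinely talking at the same time.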
4. The "Fact-Checking" (Ensemble ASR)
Once the audio is separated, Sommelier needs to turn it into text. But AI transcription tools sometimes "hallucinate" (make up words) or repeat themselves like a broken record.
Sommelier doesn't trust just one tool. It hires three different transcription experts (a team of AIs) to listen to the same audio.
- If two out of three agree on a word, it keeps it.
- If they disagree, it uses a smart voting system to pick the best one.
- It also has a "spell-checker" that removes repetitive nonsense.
This ensures the text the AI learns from is accurate and free of made-up words.
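The two ideas above can be sketched in a few lines. This is a toy version, not the paper's implementation: real ensemble ASR systems first align the transcripts (ROVER-style) before voting, whereas here we assume the three hypotheses are already aligned word-for-word, and the repetition filter is a simple run-length cap:

```python
from collections import Counter
from itertools import zip_longest

def majority_vote(transcripts):
    """At each word position, keep the word most transcripts agree on."""
    rows = [t.split() for t in transcripts]
    merged = []
    for words in zip_longest(*rows, fillvalue=None):
        word, _count = Counter(w for w in words if w).most_common(1)[0]
        merged.append(word)
    return " ".join(merged)

def collapse_repeats(text, max_run=2):
    """Drop a word once it has repeated more than `max_run` times in a row."""
    kept, run = [], 0
    for w in text.split():
        run = run + 1 if kept and w == kept[-1] else 1
        if run <= max_run:
            kept.append(w)
    return " ".join(kept)

hyps = [
    "the cat sat on the mat",
    "the cap sat on the mat",   # expert 2 misheard "cat"
    "the cat sat in the mat",   # expert 3 misheard "on"
]
print(majority_vote(hyps))                  # the cat sat on the mat
print(collapse_repeats("no no no no way"))  # no no way
```

Each single-model error gets outvoted two-to-one, and "broken record" loops get clipped before they reach the training set.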
5. The "Noise Filter" (Background Removal)
Sometimes the audio has background music or loud noise. Sommelier can detect this and, if necessary, use a tool to strip away the music, leaving only the human voices. It's smart enough to know not to do this if the music is part of the conversation, preserving the natural feel.
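The "know when not to strip" logic amounts to a gating decision. Here is a hypothetical sketch of that gate; the thresholds, signal names, and the exact criterion are assumptions, not the paper's:

```python
def should_strip_music(music_prob: float,
                       music_during_speech_only: bool,
                       threshold: float = 0.5) -> bool:
    """Decide whether to run music removal on a clip.

    music_prob: a classifier's confidence that music is present (0..1).
    music_during_speech_only: True if music appears only underneath speech
        (i.e. it is background); False if it also stands on its own, which
        suggests it is part of the content (e.g. a song being discussed)
        and should be preserved.
    """
    return music_prob >= threshold and music_during_speech_only

print(should_strip_music(0.9, True))   # True  -> background music, remove it
print(should_strip_music(0.9, False))  # False -> music is part of the show
print(should_strip_music(0.2, True))   # False -> probably no music at all
```

Only clips that pass this gate would be sent through a source-separation tool; everything else keeps its original audio, preserving the natural feel the text describes.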
The Result: A Better Robot
The authors tested this by teaching a robot named Moshi using data processed by Sommelier.
- Before: The robot was stiff. If you interrupted it, it kept talking. If you said "uh-huh," it ignored you.
- After: The robot became much more human-like. It learned to pause when you interrupted, to nod along (backchannel) while you spoke, and to handle the chaos of real conversation without getting lost.
Why Does This Matter?
Currently, the AI world is starving for high-quality data that shows how humans actually interact. We have plenty of data where people read scripts alone, but very little data showing the messy, overlapping, interrupting nature of real life.
Sommelier is the first open-source "factory" that can take massive amounts of messy, real-world audio and turn it into a clean, structured, high-fidelity dataset. It's like giving the AI community a massive library of real human conversations, finally allowing them to build robots that don't just answer questions, but actually converse.