TCG CREST System Description for the DISPLACE-M Challenge

The TCG CREST system placed fifth in the DISPLACE-M challenge's speaker diarization track. It demonstrated that a hybrid end-to-end Diarizen framework, built on WavLM embeddings with optimized agglomerative hierarchical clustering, significantly outperformed a modular SpeechBrain baseline, reducing the diarization error rate to 9.21% on the evaluation set.

Nikhil Raghav, Md Sahidullah

Published Tue, 10 Ma

Imagine you are walking into a bustling, noisy rural clinic. Two people are talking: a community health worker and a local resident. They are discussing health issues, but the room is loud, they talk over each other, and sometimes their voices sound very similar.

Your job is to be a super-listener. You need to listen to the whole conversation and write down exactly who said what and when. This is called Speaker Diarization.

This paper is a report from a team called TCG CREST who entered a competition (the DISPLACE-M Challenge) to build the best "super-listener" for this specific job. Here is how they did it, explained simply.

The Two Main Strategies

The team tried two different ways to solve the puzzle:

1. The "Modular Team" (SpeechBrain System)

Think of this as a relay race team where every runner has a specific job.

  • Runner 1 (VAD): First, a specialist listens to the audio and marks exactly where people are speaking and where there is silence. This is called "Voice Activity Detection."
  • Runner 2 (The Ear): Next, a smart system (ECAPA-TDNN) listens to those speech segments and creates a unique "voice fingerprint" for each speaker.
  • Runner 3 (The Organizer): Finally, a clustering algorithm looks at all the fingerprints and groups them. "Oh, this fingerprint looks like the first person, and this one looks like the second."

The Problem: If Runner 1 (the silence detector) makes a mistake, the whole team fails. The team found that if they used a perfect silence detector, this system was great. But if the detector was just "okay," the whole team stumbled.

2. The "All-in-One Super-Brain" (Diarizen System)

This is like a single, highly trained detective who does everything at once.

  • Instead of passing the baton, this system looks at the audio in small chunks (like 8 to 16 seconds).
  • It uses a massive, pre-trained AI brain (WavLM) to understand the sound, figure out who is speaking, and handle the noise all in one go.
  • It then uses a "backend" to organize the final list of who spoke when.

The Result: This "Super-Brain" was much stronger. It made fewer mistakes than the relay team, even when the audio was messy.
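
The chunking idea can be sketched in a few lines. The function name and the 8-second window / 4-second hop below are our illustrative choices, not Diarizen's actual configuration:

```python
def chunk_audio(samples, sr=16000, chunk_s=8.0, hop_s=4.0):
    """Slice a waveform into overlapping fixed-length chunks, the way
    end-to-end diarization systems process long recordings piecewise."""
    chunk = int(chunk_s * sr)
    hop = int(hop_s * sr)
    return [samples[i:i + chunk]
            for i in range(0, max(len(samples) - chunk, 0) + 1, hop)]

# 20 seconds of (dummy) 16 kHz audio -> four overlapping 8-second chunks.
chunks = chunk_audio([0.0] * (20 * 16000))
print(len(chunks), len(chunks[0]))  # 4 128000
```

Each chunk is processed by the model independently, and the backend then stitches the per-chunk speaker decisions into one timeline.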

The Secret Sauce: Smoothing the Edges

Even the best detective can get jittery. Sometimes the system might think a speaker stopped talking for a split second when they actually didn't, or it might get confused by a cough.

The team discovered a simple trick to fix this: The "Longer Look" Filter.
Imagine you are watching a movie and someone sneezes. You don't think the movie ended; you know it's just a tiny blip.

  • The default system looked at a tiny window of time (11 frames) to decide if a speaker was still talking.
  • The team realized that if they looked at a longer window (29 frames), the system could "see" past the tiny glitches and coughs. It smoothed out the conversation, realizing, "Ah, they didn't stop talking; they just paused for a second."

This simple change made their system significantly better.
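
Under the hood, this "longer look" is median filtering over the per-frame speech decisions. A toy sketch with SciPy, using the paper's two window sizes (11 vs. 29 frames) on a made-up activity sequence; the dropout length is chosen purely to make the difference visible:

```python
import numpy as np
from scipy.signal import medfilt

# Per-frame speaker-activity decisions: 1 = speaking, 0 = silent.
# One continuous utterance interrupted by a short 8-frame dropout
# (think: a cough, or a jittery model prediction).
activity = np.array([1] * 20 + [0] * 8 + [1] * 20)

smoothed_11 = medfilt(activity, kernel_size=11)  # the default window
smoothed_29 = medfilt(activity, kernel_size=29)  # the team's wider window

# The 11-frame filter keeps the spurious gap; the 29-frame filter
# "sees past" it and restores one continuous speech region.
print(int(smoothed_11.sum()), int(smoothed_29.sum()))  # 40 48
```

A median filter of length k can erase interruptions shorter than about k/2 frames, which is exactly why widening the window from 11 to 29 frames absorbs brief glitches the default setting let through.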

The Competition Results

  • The Goal: Reduce the "Diarization Error Rate" (DER). This is the percentage of conversation time the system gets wrong, whether by missing speech, hallucinating speech, or crediting the wrong speaker. Lower is better.
  • The Winner: The "Super-Brain" (Diarizen) with the "Longer Look" filter was the team's best entry.
    • It reduced errors by about 39% compared to their first attempt (the relay team).
    • On the final test, they got a score of 9.21% error.
  • The Ranking: Out of 11 teams, TCG CREST came in 5th place.
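
DER itself is simple arithmetic: the total duration of missed speech, falsely detected speech, and speech attributed to the wrong speaker, divided by the total reference speech time. A sketch with invented durations (not the paper's actual error breakdown):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a fraction of total reference speech time; inputs in seconds."""
    return (missed + false_alarm + confusion) / total_speech

# Toy example: 100 s of reference speech with small errors of each kind.
der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                             confusion=4.21, total_speech=100.0)
print(f"{der:.2%}")  # 9.21%
```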

What Did They Learn? (The Takeaway)

  1. The Silence Detector Matters: If you can't tell when people are talking, you can't tell who is talking. The "Modular Team" failed mostly because the silence detector wasn't perfect.
  2. One Big Brain vs. Many Small Ones: The "All-in-One" system (Diarizen) was much more robust against noise and overlapping voices than the step-by-step system.
  3. Patience Pays Off: Looking at a longer stretch of time (the 29-frame window) helped the system ignore tiny distractions and focus on the main conversation.

What's Next?

The team realized that while their "Super-Brain" is great, it still struggles with a few specific types of recordings. They suggest that in the future, they might try:

  • Teaching the system using the practice data (fine-tuning) to make it even smarter.
  • Combining the two teams: Maybe the "Modular Team" is better at some things and the "Super-Brain" is better at others. If they could combine their strengths, they might get the perfect score.

In short: The team built a smart system that listens to noisy rural clinic conversations. By using a powerful AI detective and teaching it to look at the "big picture" rather than getting distracted by tiny noises, they managed to figure out who said what better than most other teams in the competition.