TCG CREST System Description for the DISPLACE-M Challenge

The TCG CREST system placed fifth in the DISPLACE-M challenge's speaker diarization track. It demonstrated that a hybrid end-to-end Diarizen framework, built on WavLM embeddings with optimized agglomerative hierarchical clustering, significantly outperformed a modular SpeechBrain baseline, reducing the diarization error rate to 9.21% on the evaluation set.

Nikhil Raghav, Md Sahidullah

Published Tue, 10 Ma

Imagine you are walking into a bustling, noisy rural clinic. Two people are talking: a community health worker and a local resident. They are discussing health issues, but the room is loud, they talk over each other, and sometimes their voices sound very similar.

Your job is to be a super-listener. You need to listen to the whole conversation and write down exactly who said what and when. This is called Speaker Diarization.

This paper is a report from a team called TCG CREST who entered a competition (the DISPLACE-M Challenge) to build the best "super-listener" for this specific job. Here is how they did it, explained simply.

The Two Main Strategies

The team tried two different ways to solve the puzzle:

1. The "Modular Team" (SpeechBrain System)

Think of this as a relay race team where every runner has a specific job.

  • Runner 1 (VAD): First, a specialist listens to the audio and marks exactly where people are speaking and where there is silence. This is called "Voice Activity Detection."
  • Runner 2 (The Ear): Next, a smart system (ECAPA-TDNN) listens to those speech segments and creates a unique "voice fingerprint" for each speaker.
  • Runner 3 (The Organizer): Finally, a clustering algorithm looks at all the fingerprints and groups them. "Oh, this fingerprint looks like the first person, and this one looks like the second."

The Problem: If Runner 1 (the silence detector) makes a mistake, the whole team fails. The team found that if they used a perfect silence detector, this system was great. But if the detector was just "okay," the whole team stumbled.

2. The "All-in-One Super-Brain" (Diarizen System)

This is like a single, highly trained detective who does everything at once.

  • Instead of passing the baton, this system looks at the audio in small chunks (like 8 to 16 seconds).
  • It uses a massive, pre-trained AI brain (WavLM) to understand the sound, figure out who is speaking, and handle the noise all in one go.
  • It then uses a "backend" to organize the final list of who spoke when.

The Result: This "Super-Brain" was much stronger. It made fewer mistakes than the relay team, even when the audio was messy.
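
The chunking idea can be sketched in a few lines. The function name and the 8-second window / 4-second hop below are our illustrative choices, not Diarizen's actual configuration:

```python
def chunk_audio(samples, sr=16000, chunk_s=8.0, hop_s=4.0):
    """Slice a waveform into overlapping fixed-length chunks, the way
    end-to-end diarization systems process long recordings piecewise."""
    chunk = int(chunk_s * sr)
    hop = int(hop_s * sr)
    return [samples[i:i + chunk]
            for i in range(0, max(len(samples) - chunk, 0) + 1, hop)]

# 20 seconds of (dummy) 16 kHz audio -> four overlapping 8-second chunks.
chunks = chunk_audio([0.0] * (20 * 16000))
print(len(chunks), len(chunks[0]))  # 4 128000
```

Each chunk is processed by the model independently, and the backend then stitches the per-chunk speaker decisions into one timeline.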

The Secret Sauce: Smoothing the Edges

Even the best detective can get jittery. Sometimes the system might think a speaker stopped talking for a split second when they actually didn't, or it might get confused by a cough.

The team discovered a simple trick to fix this: The "Longer Look" Filter.
Imagine you are watching a movie and someone sneezes. You don't think the movie ended; you know it's just a tiny blip.

  • The default system looked at a tiny window of time (11 frames) to decide if a speaker was still talking.
  • The team realized that if they looked at a longer window (29 frames), the system could "see" past the tiny glitches and coughs. It smoothed out the conversation, realizing, "Ah, they didn't stop talking; they just paused for a second."

This simple change made their system significantly better.
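
Under the hood, this "longer look" is median filtering over the per-frame speech decisions. A toy sketch with SciPy, using the paper's two window sizes (11 vs. 29 frames) on a made-up activity sequence; the dropout length is chosen purely to make the difference visible:

```python
import numpy as np
from scipy.signal import medfilt

# Per-frame speaker-activity decisions: 1 = speaking, 0 = silent.
# One continuous utterance interrupted by a short 8-frame dropout
# (think: a cough, or a jittery model prediction).
activity = np.array([1] * 20 + [0] * 8 + [1] * 20)

smoothed_11 = medfilt(activity, kernel_size=11)  # the default window
smoothed_29 = medfilt(activity, kernel_size=29)  # the team's wider window

# The 11-frame filter keeps the spurious gap; the 29-frame filter
# "sees past" it and restores one continuous speech region.
print(int(smoothed_11.sum()), int(smoothed_29.sum()))  # 40 48
```

A median filter of length k can erase interruptions shorter than about k/2 frames, which is exactly why widening the window from 11 to 29 frames absorbs brief glitches the default setting let through.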

The Competition Results

  • The Goal: Reduce the "Diarization Error Rate" (DER). This is the percentage of conversation time the system gets wrong, whether by missing speech, hallucinating speech, or crediting the wrong speaker. Lower is better.
  • The Winner: The "Super-Brain" (Diarizen) with the "Longer Look" filter was the team's best entry.
    • It reduced errors by about 39% compared to their first attempt (the relay team).
    • On the final test, they got a score of 9.21% error.
  • The Ranking: Out of 11 teams, TCG CREST came in 5th place.
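
DER itself is simple arithmetic: the total duration of missed speech, falsely detected speech, and speech attributed to the wrong speaker, divided by the total reference speech time. A sketch with invented durations (not the paper's actual error breakdown):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a fraction of total reference speech time; inputs in seconds."""
    return (missed + false_alarm + confusion) / total_speech

# Toy example: 100 s of reference speech with small errors of each kind.
der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                             confusion=4.21, total_speech=100.0)
print(f"{der:.2%}")  # 9.21%
```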

What Did They Learn? (The Takeaway)

  1. The Silence Detector Matters: If you can't tell when people are talking, you can't tell who is talking. The "Modular Team" failed mostly because the silence detector wasn't perfect.
  2. One Big Brain vs. Many Small Ones: The "All-in-One" system (Diarizen) was much more robust against noise and overlapping voices than the step-by-step system.
  3. Patience Pays Off: Looking at a longer stretch of time (the 29-frame window) helped the system ignore tiny distractions and focus on the main conversation.

What's Next?

The team realized that while their "Super-Brain" is great, it still struggles with a few specific types of recordings. They suggest that in the future, they might try:

  • Teaching the system using the practice data (fine-tuning) to make it even smarter.
  • Combining the two teams: Maybe the "Modular Team" is better at some things and the "Super-Brain" is better at others. If they could combine their strengths, they might get the perfect score.

In short: The team built a smart system that listens to noisy rural clinic conversations. By using a powerful AI detective and teaching it to look at the "big picture" rather than getting distracted by tiny noises, they managed to figure out who said what better than most other teams in the competition.