Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction

This paper presents a competitive open-source cascaded system that combines EEND-VC speaker diarization with a fine-tuned Qwen3 ASR model. The system took first place in the DISPLACE-M challenge by effectively extracting medical conditions from overlapping, code-switched Hinglish clinical dialogues.

Séverin Baroudi, Yanis Labrak, Shashi Kumar, Joonas Kalda, Sergio Burdisso, Pawel Cyrta, Juan Ignacio Alvarez-Trejos, Petr Motlicek, Hervé Bredin, Ricard Marxer

Published Mon, 09 Ma

Imagine a busy, chaotic doctor's appointment in a rural village in India. The doctor speaks a mix of Hindi and English (called "Hinglish"), and the patient does too. They talk fast, they interrupt each other, they talk over each other, and the room is noisy.

Your goal? To listen to this recording and automatically write down exactly what medical problems the patient has.

This paper is about building a smart "digital scribe" that can handle this chaos better than previous automated systems could. Here is how they did it, explained simply:

1. The Problem: The "Cocktail Party" Chaos

Most computer programs that listen to speech are trained on clear, quiet conversations where one person speaks at a time. But real life is messy.

  • The Overlap: The doctor and patient talk at the same time. It's like two people shouting different stories over a loud radio.
  • The Mix: They switch between Hindi and English mid-sentence. It's like a chef switching between French and Italian recipes while cooking.
  • The Noise: The recording isn't in a studio; it's in a noisy clinic.

2. The Solution: A Three-Step Assembly Line

The team built a pipeline (a step-by-step process) to solve this. Think of it as a factory with three specialized workers.
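At its core, a cascade like this is just function composition: each stage's output feeds the next stage's input. The sketch below shows that shape only; the stage bodies are invented placeholders, not the paper's actual models or APIs.

```python
def diarize(audio):
    """Stage 1 placeholder: split audio into (speaker, segment) pairs.
    The real system runs EEND-VC here."""
    half = len(audio) // 2
    return [("doctor", audio[:half]), ("patient", audio[half:])]

def transcribe(segments):
    """Stage 2 placeholder: turn each speaker's segment into text.
    The real system runs a fine-tuned ASR model here."""
    return [(speaker, f"<transcript of {len(seg)} samples>")
            for speaker, seg in segments]

def extract_conditions(transcripts):
    """Stage 3 placeholder: pull medical conditions out of the text.
    The real system runs an extraction model here."""
    return ["fever"] if transcripts else []

def pipeline(audio):
    # The whole cascade is one composition: listen -> write -> read -> find.
    return extract_conditions(transcribe(diarize(audio)))

print(pipeline([0.0] * 16000))  # a 1-second dummy "recording"
```

The point of this structure is that each stage can be improved, swapped, or debugged on its own, which is exactly what the team exploits in the steps below.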

Step 1: The "Traffic Cop" (Speaker Diarization)

The Job: Figure out who is speaking and when.
The Old Way: Older systems guessed who was talking from simple voice patterns, but they got confused when voices overlapped.
The New Way (EEND-VC): The team used a new "Traffic Cop" AI. Instead of processing the audio one moment at a time, it analyzes short chunks with a neural network that can handle overlapping speech, then groups those chunks by speaker using a technique called vector clustering.

  • Analogy: Imagine a crowded dance floor. Old systems try to track one dancer at a time. This new system takes a photo of the whole floor, groups the dancers by their outfits (voice patterns), and instantly separates the "Doctor's Dance" from the "Patient's Dance," even when they are bumping into each other.
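The "vector clustering" half of that idea can be shown with a toy example: each short chunk of speech gets a voice-fingerprint vector (an embedding), and chunks whose vectors point in similar directions get grouped under one speaker label. The greedy threshold scheme and the embeddings below are made up for illustration; real embeddings come from a neural network, and EEND-VC's actual clustering is more sophisticated.

```python
import numpy as np

def cosine(a, b):
    """Similarity between two voice-fingerprint vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering sketch: attach a chunk to the most similar
    existing speaker if similarity clears the threshold, else open a
    new speaker. Returns one speaker label per chunk."""
    reps, labels = [], []           # one representative vector per speaker
    for emb in embeddings:
        emb = np.asarray(emb, dtype=float)
        sims = [cosine(emb, r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(emb)
            labels.append(len(reps) - 1)
    return labels

# Toy fingerprints: chunks 0 and 2 sound alike (same speaker), chunk 1 differs.
chunks = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
print(cluster_embeddings(chunks))  # → [0, 1, 0]
```

This is the "photo of the whole dance floor" step: grouping happens over the full set of chunks at once rather than one moment at a time.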

Step 2: The "Translator" (Speech-to-Text)

The Job: Turn the separated voices into written words.
The Challenge: The text is in Devanagari script (Hindi) mixed with English, and it has medical jargon.
The Fix: They took a powerful AI model (Qwen3) and gave it a crash course.

  • Domain Training: They fed it thousands of hours of Hindi medical conversations so it learned the specific words doctors use.
  • The "Spell-Check" (Error Correction): Sometimes the AI gets confused and writes "head ache" instead of "headache." They added a second AI (an LLM) that acts like a strict editor. It reads the whole conversation and fixes the typos and weird phrasing without changing the meaning.
  • Result: The resulting transcripts were 80% more accurate than the baseline.
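In the paper, the "strict editor" is a full LLM reading the whole conversation. As a stand-in, here is a toy rule-based corrector that fixes the one error class the text mentions, wrongly split compound words like "head ache". The merge table is invented for illustration.

```python
# Toy stand-in for the LLM "editor": merge wrongly split compounds.
# The word list is invented for illustration; the real system has an
# LLM read the whole conversation and fix errors in context.
MERGES = {("head", "ache"): "headache",
          ("stomach", "ache"): "stomachache"}

def correct(transcript: str) -> str:
    words = transcript.split()
    out, i = [], 0
    while i < len(words):
        pair = (words[i].lower(), words[i + 1].lower()) if i + 1 < len(words) else None
        if pair in MERGES:
            out.append(MERGES[pair])   # fuse the split compound
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(correct("I have a head ache since morning"))
# → I have a headache since morning
```

The LLM version generalizes far beyond a fixed table, but the contract is the same: text in, cleaner text out, meaning unchanged.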

Step 3: The "Diagnosis Detective" (Condition Extraction)

The Job: Read the text and pull out the specific medical conditions (e.g., "diabetes," "fever").
The Experiment: They tested two ways to do this:

  1. The Cascade (Text-Based): Listen → Write Text → Read Text → Find Disease.
  2. The End-to-End (Audio-Based): Listen → Find Disease directly (skipping the writing step).
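A toy version of route 1's final "Read Text → Find Disease" step: match a small condition lexicon against the transcript. Both the lexicon and the word-level matching are drastic simplifications invented here; the real system uses a learned extraction model.

```python
# Toy text-based condition extractor. The lexicon is invented for
# illustration; the real system uses a trained model, not a word list.
CONDITIONS = {"fever", "diabetes", "headache", "cough"}

def extract_conditions(transcript: str) -> list[str]:
    """Return the known conditions mentioned in the transcript, sorted."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return sorted(CONDITIONS & words)

print(extract_conditions("Patient reports fever and a dry cough."))
# → ['cough', 'fever']
```

The end-to-end route (option 2) skips the transcript entirely and maps audio straight to this kind of output, which is why it can exploit cues, like tone and hesitation, that never make it into text.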

The Big Surprise:

  • The Text-Based system (their open-source winner) was very good. It won the competition against 24 other teams!
  • However, the Audio-Based system (using a proprietary "black box" AI) was actually the best at finding diseases.
  • Analogy: The Text-Based system is like a human taking notes and then reading them to find the answer. The Audio-Based system is like a psychic who hears the tone of voice and the hesitation and knows the answer immediately. The "psychic" (Audio) is slightly better, but the "note-taker" (Text) is incredibly close and, crucially, free and open for everyone to use.

3. Why This Matters

  • Real-World Ready: This isn't a lab experiment. It works on real, noisy recordings from rural India.
  • Open Source: The team released all their code. It's like giving the recipe for the "Traffic Cop" and the "Translator" to the whole world, so other developers can build on it.
  • Privacy: Because they built a system that works well without needing to send data to a giant tech company's server, it's safer for patient privacy.

The Takeaway

The paper proves that you don't need a magic "all-in-one" AI to solve complex medical problems. You can build a team of specialized AIs (a Traffic Cop, a Translator, and a Detective) that work together. While the "magic" AI is slightly better at the very end, this team approach is robust, transparent, and good enough to win the world championship.

In short: They taught computers to listen to messy, mixed-language doctor visits, separate the voices, write down the words perfectly, and figure out what's wrong with the patient—all while keeping the code open for everyone to see.