Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction

This paper presents a competitive open-source cascaded system that combines EEND-VC speaker diarization with a fine-tuned Qwen3 ASR model. The system took first place in the DISPLACE-M challenge by effectively extracting medical conditions from overlapping, code-switched Hinglish clinical dialogues.

Séverin Baroudi, Yanis Labrak, Shashi Kumar, Joonas Kalda, Sergio Burdisso, Pawel Cyrta, Juan Ignacio Alvarez-Trejos, Petr Motlicek, Hervé Bredin, Ricard Marxer

Published Mon, 09 Ma

Imagine a busy, chaotic doctor's appointment in a rural village in India. The doctor speaks a mix of Hindi and English (called "Hinglish"), and the patient does too. They talk fast, they interrupt each other, they talk over each other, and the room is noisy.

Your goal? To listen to this recording and automatically write down exactly what medical problems the patient has.

This paper is about building a smart "digital scribe" that can handle this chaos better than previous automated systems could. Here is how they did it, explained simply:

1. The Problem: The "Cocktail Party" Chaos

Most computer programs that listen to speech are trained on clear, quiet conversations where one person speaks at a time. But real life is messy.

  • The Overlap: The doctor and patient talk at the same time. It's like two people shouting different stories over a loud radio.
  • The Mix: They switch between Hindi and English mid-sentence. It's like a chef switching between French and Italian recipes while cooking.
  • The Noise: The recording isn't in a studio; it's in a noisy clinic.

2. The Solution: A Three-Step Assembly Line

The team built a pipeline (a step-by-step process) to solve this. Think of it as a factory with three specialized workers.
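At its core, a cascade like this is just function composition: each stage's output feeds the next stage's input. The sketch below shows that shape only; the stage bodies are invented placeholders, not the paper's actual models or APIs.

```python
def diarize(audio):
    """Stage 1 placeholder: split audio into (speaker, segment) pairs.
    The real system runs EEND-VC here."""
    half = len(audio) // 2
    return [("doctor", audio[:half]), ("patient", audio[half:])]

def transcribe(segments):
    """Stage 2 placeholder: turn each speaker's segment into text.
    The real system runs a fine-tuned ASR model here."""
    return [(speaker, f"<transcript of {len(seg)} samples>")
            for speaker, seg in segments]

def extract_conditions(transcripts):
    """Stage 3 placeholder: pull medical conditions out of the text.
    The real system runs an extraction model here."""
    return ["fever"] if transcripts else []

def pipeline(audio):
    # The whole cascade is one composition: listen -> write -> read -> find.
    return extract_conditions(transcribe(diarize(audio)))

print(pipeline([0.0] * 16000))  # a 1-second dummy "recording"
```

The point of this structure is that each stage can be improved, swapped, or debugged on its own, which is exactly what the team exploits in the steps below.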

Step 1: The "Traffic Cop" (Speaker Diarization)

The Job: Figure out who is speaking and when.
The Old Way: Older systems guessed who was talking from simple voice patterns, but they got confused when voices overlapped.
The New Way (EEND-VC): The team used a new "Traffic Cop" AI. Instead of processing the audio one moment at a time, it analyzes short chunks with a neural network that can handle overlapping speech, then groups those chunks by speaker using a technique called vector clustering.

  • Analogy: Imagine a crowded dance floor. Old systems try to track one dancer at a time. This new system takes a photo of the whole floor, groups the dancers by their outfits (voice patterns), and instantly separates the "Doctor's Dance" from the "Patient's Dance," even when they are bumping into each other.
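The "vector clustering" half of that idea can be shown with a toy example: each short chunk of speech gets a voice-fingerprint vector (an embedding), and chunks whose vectors point in similar directions get grouped under one speaker label. The greedy threshold scheme and the embeddings below are made up for illustration; real embeddings come from a neural network, and EEND-VC's actual clustering is more sophisticated.

```python
import numpy as np

def cosine(a, b):
    """Similarity between two voice-fingerprint vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering sketch: attach a chunk to the most similar
    existing speaker if similarity clears the threshold, else open a
    new speaker. Returns one speaker label per chunk."""
    reps, labels = [], []           # one representative vector per speaker
    for emb in embeddings:
        emb = np.asarray(emb, dtype=float)
        sims = [cosine(emb, r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(emb)
            labels.append(len(reps) - 1)
    return labels

# Toy fingerprints: chunks 0 and 2 sound alike (same speaker), chunk 1 differs.
chunks = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
print(cluster_embeddings(chunks))  # → [0, 1, 0]
```

This is the "photo of the whole dance floor" step: grouping happens over the full set of chunks at once rather than one moment at a time.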

Step 2: The "Translator" (Speech-to-Text)

The Job: Turn the separated voices into written words.
The Challenge: The text is in Devanagari script (Hindi) mixed with English, and it has medical jargon.
The Fix: They took a powerful AI model (Qwen3) and gave it a crash course.

  • Domain Training: They fed it thousands of hours of Hindi medical conversations so it learned the specific words doctors use.
  • The "Spell-Check" (Error Correction): Sometimes the AI gets confused and writes "head ache" instead of "headache." They added a second AI (an LLM) that acts like a strict editor. It reads the whole conversation and fixes the typos and weird phrasing without changing the meaning.
  • Result: The resulting transcripts were 80% more accurate than the baseline.
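In the paper, the "strict editor" is a full LLM reading the whole conversation. As a stand-in, here is a toy rule-based corrector that fixes the one error class the text mentions, wrongly split compound words like "head ache". The merge table is invented for illustration.

```python
# Toy stand-in for the LLM "editor": merge wrongly split compounds.
# The word list is invented for illustration; the real system has an
# LLM read the whole conversation and fix errors in context.
MERGES = {("head", "ache"): "headache",
          ("stomach", "ache"): "stomachache"}

def correct(transcript: str) -> str:
    words = transcript.split()
    out, i = [], 0
    while i < len(words):
        pair = (words[i].lower(), words[i + 1].lower()) if i + 1 < len(words) else None
        if pair in MERGES:
            out.append(MERGES[pair])   # fuse the split compound
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(correct("I have a head ache since morning"))
# → I have a headache since morning
```

The LLM version generalizes far beyond a fixed table, but the contract is the same: text in, cleaner text out, meaning unchanged.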

Step 3: The "Diagnosis Detective" (Condition Extraction)

The Job: Read the text and pull out the specific medical conditions (e.g., "diabetes," "fever").
The Experiment: They tested two ways to do this:

  1. The Cascade (Text-Based): Listen → Write Text → Read Text → Find Disease.
  2. The End-to-End (Audio-Based): Listen → Find Disease directly (skipping the writing step).
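A toy version of route 1's final "Read Text → Find Disease" step: match a small condition lexicon against the transcript. Both the lexicon and the word-level matching are drastic simplifications invented here; the real system uses a learned extraction model.

```python
# Toy text-based condition extractor. The lexicon is invented for
# illustration; the real system uses a trained model, not a word list.
CONDITIONS = {"fever", "diabetes", "headache", "cough"}

def extract_conditions(transcript: str) -> list[str]:
    """Return the known conditions mentioned in the transcript, sorted."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return sorted(CONDITIONS & words)

print(extract_conditions("Patient reports fever and a dry cough."))
# → ['cough', 'fever']
```

The end-to-end route (option 2) skips the transcript entirely and maps audio straight to this kind of output, which is why it can exploit cues, like tone and hesitation, that never make it into text.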

The Big Surprise:

  • The Text-Based system (their open-source winner) was very good. It won the competition against 24 other teams!
  • However, the Audio-Based system (using a proprietary "black box" AI) was actually the best at finding diseases.
  • Analogy: The Text-Based system is like a human taking notes and then reading them to find the answer. The Audio-Based system is like a psychic who hears the tone of voice and the hesitation and knows the answer immediately. The "psychic" (Audio) is slightly better, but the "note-taker" (Text) is incredibly close and, crucially, free and open for everyone to use.

3. Why This Matters

  • Real-World Ready: This isn't a lab experiment. It works on real, noisy recordings from rural India.
  • Open Source: The team released all their code. It's like giving the recipe for the "Traffic Cop" and the "Translator" to the whole world, so other developers can build on it.
  • Privacy: Because they built a system that works well without needing to send data to a giant tech company's server, it's safer for patient privacy.

The Takeaway

The paper proves that you don't need a magic "all-in-one" AI to solve complex medical problems. You can build a team of specialized AIs (a Traffic Cop, a Translator, and a Detective) that work together. While the "magic" AI is slightly better at the very end, this team approach is robust, transparent, and good enough to win the world championship.

In short: They taught computers to listen to messy, mixed-language doctor visits, separate the voices, write down the words perfectly, and figure out what's wrong with the patient—all while keeping the code open for everyone to see.