WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

This paper presents WhisperAlign, a solution for the DL Sprint 4.0 that combines word-boundary-aware ASR using whisper-timestamped chunking and domain-fine-tuned Pyannote diarization anchored by WhisperX to achieve high-accuracy transcription and speaker separation for long-form Bengali speech.

Aurchi Chowdhury, Rubaiyat-E-Zaman, Sk. Ashrafuzzaman Nafees

Published 2026-03-06

Imagine you are trying to transcribe a very long, chaotic conversation between several people speaking in Bengali. It's like trying to write down a 60-minute movie script where the audio is muddy, people talk over each other, and the camera keeps cutting in and out.

This paper describes how a team of researchers built a smart system to solve two big problems:

  1. Transcription (ASR): Turning the speech into text accurately.
  2. Diarization: Figuring out who said what and when.

Here is the breakdown of their solution using simple analogies.


Part 1: The Transcription Problem (The "Word-Boundary" Puzzle)

The Challenge:
Standard AI models (like the famous "Whisper") are great at short clips, but they get confused with hour-long recordings. If you just chop a long audio file into random 30-second chunks, you might cut a sentence in half right in the middle of a word. The AI gets confused, thinks it's hearing something else, and starts "hallucinating" (making up words that weren't said).

The Solution: The "Smart Scissors"
Instead of using a ruler to chop the audio into equal pieces, the team invented a pair of "smart scissors."

  • How it works: They used the AI itself to listen to the audio and find the exact moment each word starts and ends.
  • The Analogy: Imagine cutting up a long loaf of bread. A ruler-guided cut every 3 inches often slices right through the middle of a crusty piece. The "Smart Scissors" instead wait for the natural gaps between pieces (the word boundaries) before cutting.
  • The Result: They created clean, bite-sized chunks of audio that always start and end on a complete word. They then fine-tuned the AI specifically on these clean chunks, which stopped it from making up words and drastically improved accuracy.
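The "Smart Scissors" idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not the team's actual code: it assumes `words` is a list of word-level timestamps (dicts with `"start"` and `"end"` in seconds), the kind of output produced by word-alignment tools. The function name and data layout are my own.

```python
# Hypothetical sketch: chunk audio at word boundaries instead of fixed intervals.
# Assumes `words` is a list of dicts with "start"/"end" times in seconds,
# sorted by time, as produced by a word-level alignment tool.

def chunk_at_word_boundaries(words, max_len=30.0):
    """Group consecutive words into chunks no longer than max_len seconds,
    cutting only in the gaps between words, never mid-word."""
    chunks, current = [], []
    chunk_start = None
    for w in words:
        if chunk_start is None:
            chunk_start = w["start"]
        # If this word would push the chunk past the limit, close the chunk
        # at the previous word's end and start a fresh one here.
        if current and w["end"] - chunk_start > max_len:
            chunks.append((chunk_start, current[-1]["end"], current))
            current, chunk_start = [], w["start"]
        current.append(w)
    if current:
        chunks.append((chunk_start, current[-1]["end"], current))
    return chunks
```

Every chunk boundary now falls between two words, so the model never sees half a word at the edge of its input window.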

Part 2: The Speaker Problem (The "Who's Talking?" Puzzle)

The Challenge:
In a crowded room, people talk over each other. Standard systems often get lost, thinking two people are speaking at once when the rules say only one person can be "on the mic" at a time. Also, the system that identifies the speaker often uses a different "ear" (Voice Activity Detector) than the system that writes the text, causing them to disagree on exactly when a sentence started or ended.

The Solution: The "Specialized Detective"

  • Training the Detective: They took a general-purpose speaker-detecting AI and gave it a crash course specifically on Bengali conversation styles. It's like taking a detective who knows how to solve crimes in New York and training them specifically on the slang and habits of a specific neighborhood in Dhaka.
  • The "Exclusive" Rule: The competition required that at any given second, only one person is credited with speaking. The team used a special feature that forces the AI to pick the most likely speaker for that exact moment, rather than letting it guess two people are talking.
  • The "Double-Check" (VAD Intersection): This is the secret sauce. The text system and the speaker system usually have different ideas about when silence ends and speech begins. The team forced them to agree by only accepting speaker labels that matched the text system's "silence detector."
    • Analogy: Imagine two security guards checking a guest list. Guard A (Text) says, "The party starts at 8:00." Guard B (Speaker) says, "The party starts at 8:05." The team made a rule: "We only let people in if both guards agree." This eliminated false alarms where the system thought someone was talking during silence.
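The "double-check" above is interval intersection: a speaker label survives only where the diarizer's segment overlaps a speech region found by the text system's voice activity detector. Here is a minimal sketch of that rule; the function names and tuple layouts are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of the VAD-intersection rule: keep a speaker label only
# where the diarizer's segment overlaps a speech region from the ASR system's
# own voice activity detector. Segments are (start, end, speaker) tuples and
# VAD regions are (start, end) tuples, all in seconds.

def clip_to_vad(segment, vad_regions):
    """Clip one diarized segment to the regions where the VAD also hears speech."""
    start, end, speaker = segment
    clipped = []
    for v_start, v_end in vad_regions:
        lo, hi = max(start, v_start), min(end, v_end)
        if lo < hi:  # non-empty overlap: both "guards" agree speech is here
            clipped.append((lo, hi, speaker))
    return clipped

def vad_intersection(diarization, vad_regions):
    """Apply the agreement rule to every diarized segment."""
    return [clip for seg in diarization for clip in clip_to_vad(seg, vad_regions)]
```

For example, if the diarizer claims speaker A from 0.0 s to 5.0 s but the VAD only hears speech from 1.0 s to 4.0 s, only the 1.0-4.0 s span keeps the label; the rest is discarded as a false alarm during silence.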

Part 3: The Results (The Scoreboard)

Before their system, the best attempts had a high error rate (about 67% of words were wrong or the wrong speaker was identified).

  • Transcription: By using their "Smart Scissors" and specialized training, they dropped the error rate to 25%. That's a massive improvement, like going from a student who fails every test to one who gets a B+.
  • Speaker Identification: They reduced the error in identifying speakers by nearly half compared to standard tools.

The Big Picture Takeaway

The team didn't invent a brand new type of brain; they invented a better workflow.

  1. They stopped chopping audio randomly and started cutting it at natural word breaks.
  2. They taught the AI specifically on Bengali conversation quirks.
  3. They forced the different parts of the system to agree with each other before making a final decision.

In short: They turned a messy, hour-long recording into a clean, organized script where the words are correct and the speakers are clearly identified, proving that even for a language with fewer digital resources (like Bengali), smart engineering can beat brute force.