Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Imagine your voice is a beautiful song, and your vocal cords are the two strings on a guitar. For the song to sound right, both strings need to vibrate perfectly in sync. But sometimes, due to surgery, injury, or illness, one string goes "dead" and stops moving. This is called Vocal Fold Paralysis.

Doctors usually have to look inside a patient's throat using a special camera (a laryngoscope) to see if the strings are moving. But here's the problem: these video recordings are messy. They are long, full of dead air while the doctor is just getting the camera into position, and sometimes the strings aren't even visible. It's like trying to find a specific scene in a 3-hour movie where the camera is just panning around a dark room. Doctors have to manually scrub through hours of footage to find the few seconds where the vocal cords are actually singing. This is tiring, slow, and prone to human error.

This paper introduces a smart new tool called MLVAS (Multimodal Laryngoscopic Video Analyzing System) that acts like a super-smart, tireless assistant to help doctors diagnose this condition faster and more accurately.

Here is how it works, broken down into simple steps:

1. The "Ear" and the "Eye" Working Together

Most earlier systems tried to solve this using either the sound of the voice or the video of the throat, but not both. MLVAS is special because it uses both, like a detective who listens to a witness and also looks at the crime scene photos.

The Ear (Audio): The system listens to the video's audio track. It's trained to recognize a specific sound the patient makes (a long "Eeee" sound). It acts like a "Keyword Spotter" (similar to how your phone wakes up when you say "Hey Siri"). It instantly ignores all the boring parts where the doctor is just moving the camera and only keeps the parts where the patient is actually making that sound.
The Eye (Video): Once it finds the sound, it looks at the video to make sure the vocal cords are actually visible. It uses a "smart camera" (an AI object detector) to find the vocal cords. If the cords are hidden or the camera is still adjusting, it discards that part.

2. Cleaning Up the Messy Video (The "Diffusion" Magic)

Even after finding the right video clips, the image might be a bit fuzzy or the AI might think it sees a vocal cord when it's actually just a shadow. This is called a "false alarm."

To fix this, the researchers used a technique called Diffusion. Think of this like a high-tech photo editor.

Imagine you have a sketch of a vocal cord that looks a little messy.
The Diffusion model acts like a skilled artist who takes that messy sketch and gently "denoises" it, refining the edges until the picture is crystal clear.
This ensures that when the system measures the vocal cords, it's measuring the real thing, not a shadow or a glitch.

3. Measuring the "Wiggle" (The New Metric)

Once the system has clean video clips, it needs to figure out which side is paralyzed.

Old Way: Earlier methods measured the overall gap between the two cords. But if both cords are moving a little, or if the camera angle is weird, this gap measurement can be misleading, and it cannot say which side is the problem. It's like trying to judge a dance by only looking at the distance between the dancers' feet.
New Way (The Breakthrough): MLVAS measures the angle of each cord individually. It fits a center line between the two cords and then measures how much the left cord wiggles away from that line and how much the right cord wiggles.
The Analogy: Imagine two dancers. If the left dancer is frozen stiff but the right dancer is spinning wildly, the system sees that the "left wiggle" is zero and the "right wiggle" is huge. This tells the doctor immediately: "The left side is paralyzed."

4. The Final Diagnosis

The system combines the "Ear" data (how the voice sounds) with the "Eye" data (how the cords move). It then gives the doctor a report that says:

Is there a problem? (Yes/No)
Which side is it? (Left or Right)
Visual Proof: It generates charts showing the movement of each cord, so the doctor can see exactly what the AI is talking about.

Why is this a big deal?

Speed: It finds the useful moments automatically, so doctors no longer have to scrub through long recordings by hand.
Objectivity: It removes the "I think it looks like this" guesswork. It gives hard numbers.
Precision: It can tell the difference between a left-side and right-side paralysis, which is crucial for planning surgery.
Accessibility: It uses advanced "pre-trained" AI (models that have already learned from large collections of speech, music, and other sounds) to work well even when there isn't a massive amount of patient data available.

In short, MLVAS is like giving the doctor a pair of super-ears and super-eyes that never get tired, never miss a beat, and can instantly point out exactly which vocal cord is in trouble, making the path to healing much faster for the patient.

Here is a detailed technical summary of the paper "Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis."

1. Problem Statement

Vocal Fold Paralysis (VFP) is a condition where one or both vocal folds fail to move properly, causing voice changes, swallowing difficulties, and breathing issues. Accurate diagnosis typically relies on laryngeal videostroboscopy, which visualizes vocal fold vibrations in slow motion. However, current clinical workflows face significant challenges:

Data Volume & Noise: Raw endoscopic videos contain long sequences of non-phonation (e.g., scope insertion) and empty frames, requiring clinicians to manually scrub through hours of footage to find relevant phonation cycles.
Subjectivity: Diagnosis often relies on subjective visual inspection, leading to potential inconsistencies and misdiagnosis.
Modality Limitations: Existing AI solutions typically rely on a single modality (either audio or video). Audio-only models lack visual interpretability and cannot distinguish between left- and right-sided paralysis. Video-only models often struggle with false positives in segmentation and fail to utilize rich acoustic cues.
Data Scarcity: Clinical datasets are small and private, making it difficult to train deep learning models from scratch without overfitting.

2. Methodology: The MLVAS Framework

The authors propose the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a three-stage pipeline designed to automate the extraction of key segments, feature engineering, and classification.

A. Multimodal Front-End: Key Segment Extraction

The system automatically filters raw videos to isolate high-quality phonation cycles.

Audio Keyword Spotting (KWS):
- Patients are instructed to pronounce "/i:/", but due to physical constraints, the sound often resembles "/E:/".
- A custom KWS model (based on a ResNet-like architecture with 2D convolutions) analyzes STFT spectrograms to detect the "/E:/" keyword.
- This identifies time intervals containing patient vocalization.
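The last step, turning per-frame keyword scores into time intervals, can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the KWS model emits one score per spectrogram frame, and the names `scores_to_intervals`, `hop_s`, `threshold`, and `min_dur_s` are hypothetical.

```python
from typing import List, Sequence, Tuple


def scores_to_intervals(
    scores: Sequence[float],
    hop_s: float,
    threshold: float = 0.5,
    min_dur_s: float = 0.0,
) -> List[Tuple[float, float]]:
    """Group consecutive frames whose score reaches `threshold` into
    (start, end) intervals in seconds. All parameters are assumptions."""
    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open interval
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:  # close an interval that runs to the last frame
        intervals.append((start * hop_s, len(scores) * hop_s))
    # Drop intervals shorter than the minimum duration.
    return [(a, b) for a, b in intervals if b - a >= min_dur_s]
```

The resulting intervals play the role of the audio-derived time mask that the visual stage then refines.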
Visual Glottis Detection & Strobing Extraction:
- Object Detection: A YOLO-v5 model detects the presence of vocal folds and the glottis area to refine the audio-derived time masks.
- Stroboscopic Selection: The system analyzes the HSV (Hue, Saturation, Value) color space. Since stroboscopic light causes rapid color fluctuations, the system calculates HSV fluctuation frequency to select segments with the highest flicker (indicating active stroboscopy) while filtering out empty frames.
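The stroboscopic selection idea can be illustrated with a crude flicker score. The sketch below is an assumption-laden stand-in, not the paper's exact measure: it takes one channel only (the per-frame mean of the HSV value channel), counts direction reversals per second as the "fluctuation frequency", and uses hypothetical names (`fluctuation_rate`, `pick_strobing_segment`).

```python
from typing import Dict, List, Sequence


def fluctuation_rate(values: Sequence[float], fps: float) -> float:
    """Direction reversals per second in a per-frame signal.
    Used here as a rough proxy for stroboscopic flicker (an assumption)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    diffs = [d for d in diffs if d != 0]  # ignore flat steps
    reversals = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    duration_s = len(values) / fps
    return reversals / duration_s if duration_s > 0 else 0.0


def pick_strobing_segment(segments: Dict[str, List[float]], fps: float) -> str:
    """Return the name of the candidate segment with the highest flicker score."""
    return max(segments, key=lambda name: fluctuation_rate(segments[name], fps))


# Hypothetical per-frame mean brightness (V channel) for two candidate segments.
candidates = {
    "steady_light": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "strobing": [0.2, 0.8, 0.2, 0.8, 0.2, 0.8],
}
best = pick_strobing_segment(candidates, fps=30.0)
```

A fuller version would presumably combine hue, saturation, and value, and would only score frames that already passed the glottis detector.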

B. Feature Extraction

Audio Features (Pre-trained Encoder):
- To address data scarcity, the system utilizes Dasheng, a state-of-the-art, large-scale pre-trained audio encoder (trained on speech, music, and audio events).
- The model is fully fine-tuned on the clinical dataset to extract utterance-level embeddings from the Mel-spectrograms.
Visual Features (Enhanced Segmentation & Dynamics):
- Two-Stage Segmentation:
  - Stage 1: A standard U-Net performs initial glottis segmentation.
  - Stage 2: A diffusion-based refinement model corrects false positives (frames where the U-Net segments a glottis although none is visible). It uses the U-Net mask as a prior and starts the diffusion process from a customized Gaussian noise distribution rather than standard Gaussian noise, effectively "denoising" the segmentation boundaries.
- Vocal Fold Dynamics (VFDyn):
  - Instead of just measuring the total glottal area, the system computes Left (LVFDyn) and Right (RVFDyn) Vocal Fold Dynamics.
  - Algorithm: The system fits a quadratic curve to the vocal fold outlines to establish a precise midline. It then calculates the angle deviation of the left and right folds relative to this midline over time.
  - Differentiation: Paralyzed folds exhibit less movement (lower variance) than healthy ones. Comparing the variance of LVFDyn and RVFDyn allows the system to distinguish Unilateral VFP (UVFP) as Left or Right.
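The angle and variance comparison behind LVFDyn/RVFDyn can be sketched in a few lines. This is a simplified illustration under stated assumptions: the midline is taken as an already-known straight ray (the quadratic fitting step is not reproduced), each fold is reduced to a single edge point per frame, and the helper names `angle_to_midline` and `less_mobile_side` are hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence, Tuple

Point = Tuple[float, float]


def angle_to_midline(vertex: Point, midline_pt: Point, fold_pt: Point) -> float:
    """Unsigned angle in degrees between the rays vertex->midline_pt
    and vertex->fold_pt (the midline is assumed to be given)."""
    mx, my = midline_pt[0] - vertex[0], midline_pt[1] - vertex[1]
    fx, fy = fold_pt[0] - vertex[0], fold_pt[1] - vertex[1]
    cross = mx * fy - my * fx
    dot = mx * fx + my * fy
    return abs(math.degrees(math.atan2(cross, dot)))


def less_mobile_side(left_angles: Sequence[float], right_angles: Sequence[float]) -> str:
    """Compare the variance of the two angle series over time.
    The side with lower variance moves less; ties fall to 'right' in this sketch."""
    return "left" if pvariance(left_angles) < pvariance(right_angles) else "right"


# Hypothetical per-frame angles (degrees): the left fold barely moves.
left = [10.0, 10.5, 10.0, 10.2]
right = [5.0, 20.0, 6.0, 22.0]
side = less_mobile_side(left, right)  # "left"
```

In the described system this comparison is applied to the LVFDyn and RVFDyn series of a clip; a real implementation would also need a rule for cases where both sides show similar variance.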

C. Multimodal Back-End: Classification

Architecture: A lightweight ConvLSTM model processes the time-series VFDyn features (handling spatial and temporal dependencies).
Fusion: The audio embedding (from Dasheng) and the video embedding (from ConvLSTM) are concatenated.
Output: A classification layer predicts:
1. Binary VFP: Normal vs. Paralysis.
2. Unilateral VFP: Left-sided vs. Right-sided (based on variance comparison of VFDyn).
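The fusion step itself is simple concatenation followed by a classifier. The sketch below shows that shape with plain Python lists; the embedding values, weights, and the sigmoid head are all placeholders chosen for illustration, and the source does not specify the form of the classification layer.

```python
import math
from typing import List, Sequence


def fuse(audio_emb: Sequence[float], video_emb: Sequence[float]) -> List[float]:
    """Concatenate the audio and video embeddings into one feature vector."""
    return list(audio_emb) + list(video_emb)


def logistic_head(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical binary head: sigmoid of a linear layer, read as P(VFP)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder embeddings standing in for the Dasheng and ConvLSTM outputs.
audio_embedding = [0.2, -0.1]
video_embedding = [0.7]
fused = fuse(audio_embedding, video_embedding)
p_vfp = logistic_head(fused, weights=[0.5, -0.3, 1.2], bias=-0.4)
```

Concatenation keeps both modalities intact and leaves it to the classification layer to weigh them, which is one reason it is a common baseline for fusing embeddings of different sizes.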

3. Key Contributions

First Application of Pre-trained Audio Models for VFP: The paper introduces the use of the Dasheng pre-trained encoder for vocal fold paralysis detection, overcoming the limitations of small clinical datasets.
Multimodal Integration: It is the first system to effectively combine audio keyword spotting and visual glottis dynamics, demonstrating that multimodal fusion significantly outperforms unimodal approaches.
Novel Visual Metrics (LVFDyn/RVFDyn): The introduction of Left/Right Vocal Fold Dynamics allows for the specific diagnosis of Unilateral VFP (distinguishing left vs. right), a capability lacking in previous methods that only measured aggregate glottal area.
Diffusion-Based Refinement: A novel two-stage segmentation pipeline using diffusion models to drastically reduce false positives in glottis detection, ensuring reliable metric extraction even in low-quality frames.
Automated Workflow: The system eliminates manual video scrubbing by automatically extracting "highlights" (phonation cycles + stroboscopic sequences) from raw clinical footage.

4. Experimental Results

The system was evaluated on two datasets:

BAGLS: A public dataset for glottis segmentation.
SYSU-A/B: A real-world clinical dataset from Sun Yat-sen Memorial Hospital (520 videos: 106 normal, 414 VFP).

Key Findings:

VFP Detection Performance: The full multimodal system achieved an F-score of 78.49%, outperforming the second-best model (MobileNetV3) by ~7%. It achieved a high Sensitivity (Recall) of 88.63%, crucial for avoiding missed diagnoses.
Ablation Studies:
- Adding Diffusion Refinement (DR) improved sensitivity from 85.03% to 90.06%.
- Adding Quadratic Fitting (QF) improved specificity.
- Multimodal Fusion: Removing visual features caused a ~2% drop in accuracy and a ~4.5% drop in specificity, indicating that the video stream adds information the audio stream alone does not capture.
Unilateral VFP Classification: The system achieved 82.44% F-score in distinguishing Left vs. Right paralysis.
Statistical Significance: Paired t-tests confirmed that all improvements (DR, QF, and Multimodal fusion) were statistically significant ($p < 0.05$).
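For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, and an F-score are derived from a confusion matrix. It assumes VFP is the positive class and that "F-score" means F1; the counts are invented for illustration and are not the paper's results.

```python
from typing import Dict


def _ratio(num: float, den: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics with VFP as the positive class."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # recall: VFP cases correctly flagged
        "specificity": _ratio(tn, tn + fp),  # normal cases correctly cleared
        "precision": _ratio(tp, tp + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=8, fp=4, tn=6, fn=2)
```

Because the clinical dataset is imbalanced (106 normal vs. 414 VFP videos), sensitivity and specificity are more informative together than accuracy alone.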

5. Significance and Impact

Clinical Efficiency: By automating the extraction of relevant video segments, MLVAS significantly reduces the time clinicians spend reviewing raw footage.
Objective Diagnosis: Provides quantitative, objective metrics (angle deviations, variance) rather than relying solely on subjective visual assessment.
Precision Medicine: The ability to distinguish between left and right paralysis aids in targeted surgical or therapeutic interventions.
Robustness: Pre-trained audio models mitigate data scarcity and diffusion-based refinement reduces segmentation errors, which supports use in real-world clinical settings where data quality varies.

In conclusion, MLVAS offers an automated, interpretable pipeline for AI-assisted laryngology that both detects vocal fold paralysis and identifies the affected side.