Real-Time Glottis Detection Framework via Spatial-Decoupled Feature Learning for Nasotracheal Intubation

This paper proposes Mobile GlottisNet, a lightweight and efficient deep learning framework that uses spatial-decoupled feature learning and adaptive mechanisms to achieve real-time glottis detection for nasotracheal intubation on resource-constrained edge devices.

Jinyu Liu, Gaoyang Zhang, Yang Zhou, Ruoyi Hao, Yang Zhang, Hongliang Ren

Published 2026-03-10

Imagine you are trying to thread a needle, but the needle is a breathing tube, the eye of the needle is a tiny opening in your throat (the glottis), and you are doing this in the dark, while the person is moving, and your hands are shaking. This is the reality of nasotracheal intubation (NTI)—a life-saving emergency procedure where doctors must quickly find a patient's airway to help them breathe.

Currently, doctors rely heavily on their own eyes and experience. If they miss the target, it can be dangerous. Computers have tried to help by using cameras and "smart eyes" (AI) to point out the airway, but there's a big problem: existing smart eyes are too heavy.

Think of current AI systems like a giant, high-end supercomputer trying to run on a smartwatch. They are too slow, too big, and need too much power to be useful in an ambulance or a remote clinic. They take too long to "think," and in an emergency, every second counts.

The Solution: "Mobile GlottisNet"

The authors of this paper built a new AI system called Mobile GlottisNet. You can think of this as a lightweight, super-fast "smart glasses" app designed specifically to fit on small, portable medical devices.

Here is how they made it work, using some simple analogies:

1. The "Tiny Brain" (Lightweight Backbone)

Most AI models are like massive libraries with millions of books; they take forever to find the right page. Mobile GlottisNet is like a pocket-sized cheat sheet. It uses a highly efficient design (based on MobileNetV3) that strips away all the unnecessary fluff. It's so small (only 5MB—about the size of a few high-res photos) that it can run instantly on a small device without needing a massive server farm.
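Much of the efficiency of MobileNetV3-style backbones comes from replacing standard convolutions with depthwise-separable ones. The paper doesn't spell out the exact layer configuration, so the numbers below are purely illustrative, but the parameter-count arithmetic shows why the "cheat sheet" is so much smaller than the "library":

```python
# Rough parameter-count comparison between a standard 3x3 convolution
# and the depthwise-separable version used in MobileNet-style backbones.
# The channel counts are illustrative, not the paper's actual configuration.

def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Depthwise k x k conv (one filter per input channel) followed by
    a 1x1 pointwise conv that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out = 128, 256
std = standard_conv_params(c_in, c_out)   # 294,912 weights
sep = separable_conv_params(c_in, c_out)  # 33,920 weights
print(f"standard: {std:,}  separable: {sep:,}  savings: {std / sep:.1f}x")
```

At these sizes the separable version uses roughly 8.7x fewer weights per layer, and the savings compound across the whole network, which is how a useful detector fits in a few megabytes.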

2. The "Smart Filter" (Hierarchical Dynamic Thresholding)

When the AI looks at a throat, it sees thousands of potential spots that might be the airway. Most are wrong.

  • Old way: The AI tries to guess on everything, getting confused by noise (like blood, saliva, or shadows).
  • New way: The authors added a "Smart Filter." Imagine a bouncer at a club who only lets in the VIPs. This filter dynamically decides, "Okay, this spot looks promising, let's focus on it," while ignoring the junk. It constantly adjusts its standards based on what it sees, ensuring it only pays attention to the best candidates.
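The paper's exact thresholding rule isn't reproduced here, but one common way to build a "bouncer" that adjusts its standards per frame is to set the cutoff from the statistics of the current frame's confidence scores. A minimal sketch of that idea:

```python
import statistics

def dynamic_filter(scores, k=1.0, floor=0.05):
    """Keep only candidates whose confidence beats a threshold that
    adapts to the current frame: mean + k * stdev of all scores,
    never dropping below a fixed floor.
    (Illustrative rule, not the paper's actual formulation.)"""
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)
    threshold = max(mean + k * spread, floor)
    return [(i, s) for i, s in enumerate(scores) if s > threshold]

# A noisy frame: many low-confidence detections, one strong candidate.
frame = [0.02, 0.05, 0.91, 0.07, 0.03, 0.10, 0.04]
print(dynamic_filter(frame))  # only the strong candidate survives
```

Because the threshold is relative to the frame, a murky image full of weak detections still yields only the best few candidates, rather than flooding the later stages with noise.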

3. The "Stretchy Lens" (Adaptive Feature Decoupling)

Throats aren't static; they move, twist, and get covered in fluids. A normal camera lens is rigid and might miss the target if it moves slightly.

  • The Innovation: The team gave the AI a "stretchy lens" (using deformable convolutions). If the airway shifts to the left or gets blurry, the AI's "eyes" physically stretch and shift to follow the shape of the airway. It decouples the "shape" of the airway from the "mess" around it, allowing it to see clearly even when the view is foggy or blocked.
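Deformable convolutions work by letting each kernel tap shift off the rigid grid by a learned fractional offset, with the feature value read back via interpolation. Here is a stripped-down 1D sketch; in the real network the offsets are predicted by a small convolution layer, whereas here they are hard-coded for illustration:

```python
def linear_sample(row, x):
    """Read a 1D feature row at fractional position x via linear
    interpolation, clamping at the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    lo = int(x)
    hi = min(lo + 1, len(row) - 1)
    frac = x - lo
    return row[lo] * (1 - frac) + row[hi] * frac

def deformable_tap(row, center, offsets, weights):
    """One deformable-convolution output: each kernel tap samples at
    center + tap_position + learned_offset instead of a rigid grid."""
    taps = [-1, 0, 1]  # a rigid 1x3 kernel would sample exactly here
    return sum(w * linear_sample(row, center + t + o)
               for t, o, w in zip(taps, offsets, weights))

row = [0.0, 0.0, 1.0, 0.0, 0.0]  # a feature "peak" sitting off to one side
rigid   = deformable_tap(row, 1, [0.0, 0.0, 0.0], [1/3, 1/3, 1/3])
shifted = deformable_tap(row, 1, [0.5, 1.0, 0.5], [1/3, 1/3, 1/3])
print(rigid, shifted)  # the shifted taps catch more of the off-grid peak
```

The rigid kernel centered at position 1 mostly misses the peak at position 2; the same kernel with offsets "stretches" toward it and picks up a stronger response, which is exactly the follow-the-moving-airway behavior described above.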

4. The "Team Huddle" (Cross-Layer Weighting)

Deep learning models have different "layers" that see things at different scales (some see the big picture, some see tiny details).

  • The Innovation: Usually, these layers just shout their opinions at each other. Here, the authors added a "Team Huddle" mechanism. It weighs the opinions of the "Big Picture" layer and the "Tiny Detail" layer, deciding exactly how much to listen to each one depending on the situation. This ensures the AI doesn't miss the tiny opening just because it's looking at the whole throat.
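A common way to implement this kind of weighted fusion is a softmax gate over per-layer scores. The sketch below assumes the two layers' features have already been resized to the same shape, and hard-codes the gate scores that a real network would predict from the input:

```python
import math

def fuse(coarse, fine, gate_scores):
    """Blend a 'big picture' feature vector with a 'tiny detail' one.
    Softmax turns the two gate scores into weights that sum to 1,
    so the network decides how much to listen to each layer.
    (Gate scores are hard-coded here; a real network predicts them.)"""
    exps = [math.exp(s) for s in gate_scores]
    total = sum(exps)
    w_coarse, w_fine = (e / total for e in exps)
    return [w_coarse * c + w_fine * f for c, f in zip(coarse, fine)]

coarse = [0.8, 0.2, 0.1]  # global-context features (illustrative values)
fine   = [0.1, 0.9, 0.4]  # fine-detail features (illustrative values)
print(fuse(coarse, fine, [0.0, 0.0]))  # equal gate -> plain average
print(fuse(coarse, fine, [0.0, 2.0]))  # gate favours the detail layer
```

With equal gate scores the fusion is a plain average; raising the detail layer's score pulls the output toward the fine features, which is the "decide how much to listen to each one" behavior in the huddle analogy.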

The Results: Fast, Small, and Accurate

The team tested this system in three ways:

  1. Lab Simulations: Using a fake throat (phantom).
  2. Real Patients: Using data from hospitals.
  3. Public Databases: Testing on thousands of other images.

The verdict?

  • Speed: It runs at over 62 frames per second on standard devices and 33 frames per second on tiny edge devices. That means it updates the image more than 30 times a second—fast enough to track movement in real-time without lag.
  • Size: It fits in a tiny 5MB package.
  • Accuracy: It finds the airway as well as (or better than) the giant, slow supercomputers, even when the view is messy.
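Frame rates translate directly into a per-frame "thinking budget," and a quick sanity check on the numbers above shows why 33 fps counts as real-time:

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available to process each frame at a given rate."""
    return 1000.0 / fps

for device, fps in [("standard device", 62), ("edge device", 33)]:
    print(f"{device}: {fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")

# A typical live camera feed runs at 30 fps, i.e. one frame every
# ~33.3 ms. At 33 fps the edge device finishes each frame in ~30.3 ms,
# so it keeps up with the video stream instead of falling behind.
```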

Why This Matters

Imagine a paramedic in a remote area or a doctor in a crowded emergency room. They don't have a supercomputer on a cart; they have a small, portable device. Mobile GlottisNet is the first system that can live on that small device, acting like a reliable co-pilot that says, "Look here, that's the airway," instantly and accurately.

It bridges the gap between "cool AI research" and "life-saving tool," ensuring that even in the most resource-limited situations, the patient gets the fastest, safest help possible.