Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos

Imagine you are trying to teach a robot to trace the delicate, winding roads of a city (the coronary arteries) on a series of blurry, flickering black-and-white aerial photos (X-ray angiography videos).

The problem? You only have a few photos where a human expert has carefully drawn the roads. You have thousands of other photos where the roads are there, but no one has drawn them yet. Also, the photos are tricky: the roads sometimes look faint, the city moves (because the heart beats), and the edges are often fuzzy.

This paper introduces a new method called SMART to solve this problem. Here is how it works, broken down into simple concepts:

1. The "Teacher" and the "Student" (The Mentor System)

Think of the AI system as a school with a Teacher and a Student.

The Teacher: This is a super-smart AI that has already been trained on a few "perfect" examples (the labeled data). It knows what a coronary artery should look like.
The Student: This is the AI we are trying to train. It looks at the thousands of unlabeled photos and tries to guess where the arteries are.
The Trick: The Teacher doesn't just give the Student the answer; it gives a "best guess" (called a pseudo-label). The Student learns by trying to match the Teacher's guess, but with a safety net to make sure the Teacher isn't making things up.
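
For readers who want to see the mentor idea in code: the technical summary further down calls this a "mean-teacher" approach. In the usual version of that scheme, the Teacher's weights are a slow-moving average of the Student's weights. The sketch below assumes that usual version. The function name, the plain-dict "weights," and the decay value are illustrative placeholders, not details taken from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an exponential moving average (EMA).

    `teacher` and `student` are plain dicts mapping parameter names to floats,
    standing in for real model weights (an illustrative simplification).
    `decay` is an assumed value; the summary does not state one.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts slowly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

A high decay keeps the Teacher stable, so one bad Student update cannot immediately corrupt the guesses the Student is learning from.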

2. Speaking the Language of "Concepts" (The Promptable Magic)

Older AI models needed geometric hints (like "draw a box here" or "click this dot") to know what to find. That is like having to hand someone exact GPS coordinates before they can look for anything at all.

The Innovation: The paper uses a new model called SAM3. Instead of needing coordinates, you can just "speak" to it. You can tell it, "Find the coronary artery."
The Analogy: Imagine asking a local guide, "Show me the main river," instead of handing them a map with a red dot. Because the AI understands the concept of a "vessel" or "artery" through text, it doesn't get confused by the weird angles or shapes of the heart. It understands the idea of the road, not just the pixels.

3. The "Blurry Photo" Problem (Uncertainty Awareness)

Sometimes, the Teacher looks at a photo and says, "I think the artery is here, but I'm not 100% sure because the image is blurry."

The Old Way: The Student would blindly copy the Teacher, even if the Teacher was wrong, leading to bad habits.
The SMART Way: The system has a "confidence meter." If the Teacher is unsure (low confidence), the system says, "Okay, let's not trust this part too much yet." If the Teacher is very sure, the system says, "Great, let's learn from this!"
The Analogy: Imagine a student studying for a test. If the teacher says, "I'm 90% sure the answer is A," the student writes it down. If the teacher says, "I'm guessing, maybe it's B?", the student ignores that guess and waits for more proof. This prevents the student from learning the wrong answers.

4. The "Moving Target" Problem (Motion Consistency)

The heart is always beating. In a video, the arteries move, stretch, and wiggle from frame to frame.

The Problem: If you treat every frame as a separate still photo, the AI might draw the artery in one spot in frame 1, and a completely different spot in frame 2, even though it's the same artery. It looks like a glitchy, jumping line.
The SMART Solution: The system uses "optical flow" (a way to track how pixels move). It acts like a dance instructor.
- Forward & Backward: It watches the video moving forward and backward to see how the artery flows.
- The Rule: "If the artery moved to the left in the last frame, it should be slightly to the left in this frame."
- The Result: The segmentation (the drawing of the artery) flows smoothly like a river, rather than jumping around like a glitchy video game character.

Why Does This Matter?

In the real world, getting doctors to manually draw every single artery in every X-ray video is expensive and takes forever.

The Result: The authors report that by combining this "Teacher-Student" system with "Concept Prompts" and "Motion Tracking," a model trained with only 16 labeled videos (about 14% of the XCAV dataset) beat the next-best method, while using far fewer annotations than fully supervised training requires.
The Impact: This means hospitals can get high-quality, automated artery analysis without needing armies of doctors to spend hours drawing lines on screens. It makes advanced diagnosis accessible even where labeled data is scarce.

In short: SMART is a smart, self-correcting robot that learns to trace heart arteries by listening to text instructions, checking its own confidence, and watching how the heart moves, all while needing very few human examples to get the job done.

1. Problem Statement

The paper addresses the challenge of segmenting coronary arteries from X-ray coronary angiography (XCA) video sequences. This task is critical for diagnosing coronary artery disease (CAD) but faces several significant hurdles:

Data Scarcity: Obtaining pixel-level annotations for medical images is expensive and time-consuming, resulting in a large volume of unlabeled data compared to labeled samples.
Image Quality Issues: XCA images suffer from blurred boundaries, inconsistent radiation contrast, low signal-to-noise ratios, and minimal contrast between vessels and background.
Temporal Complexity: Vessels exhibit complex motion patterns due to cardiac movement and involuntary organ motion, leading to temporal discontinuities in morphology and scale.
Limitations of Existing Methods:
- Standard Semi-Supervised Learning (SSL) struggles with complex temporal dynamics and unreliable uncertainty quantification.
- Existing adaptations of the Segment Anything Model (SAM) often rely on geometric prompts (points, boxes) or learnable features, which fail to generalize across diverse clinical imaging systems.
- Direct application of SAM3 (which uses text prompts) ignores temporal dependencies, leading to inconsistent segmentation across video frames.

2. Methodology: The SMART Framework

The authors propose SMART (SAM3-based Motion-Aware Confidence Regularization for Teacher-Student Architecture), a semi-supervised learning framework designed to leverage the "promptable concept segmentation" of SAM3 while addressing the specific challenges of XCA videos.

The framework operates in two main stages:

A. Text-Driven Segmentation Fine-Tuning

Before the semi-supervised phase, the Teacher SAM3 model is fine-tuned on the limited labeled data ( $D_l$ ).

Strategy: Instead of using geometric prompts, the method employs visual instruction tuning. The image encoder, text encoder, and detector are fine-tuned using text prompts describing vessel segmentation.
Goal: To align the model's internal representations with medical domain semantics (anatomical structures) rather than generic natural image concepts, enabling the model to understand "vessels" based on textual descriptions.

B. Semi-Supervised Learning with Three Core Innovations

Once the teacher is fine-tuned, it guides a Student SAM3 model on unlabeled data ( $D_u$ ) using a mean-teacher approach. The training incorporates three specific loss functions:

Confidence-Aware Consistency Regularization ( $L_{conf}$ ):
- Problem: Teacher predictions on unlabeled data can be noisy due to low contrast and blurring. Blindly trusting these pseudo-labels leads to error accumulation.
- Solution: The teacher processes multiple noise-perturbed versions of the input frames to generate an ensemble of predictions.
- Mechanism:
  - Uncertainty Estimation: The variance among the ensemble predictions is calculated to generate an uncertainty weight map.
  - Dynamic Weighting: The consistency loss between the student and the averaged teacher prediction is weighted by this uncertainty. Regions with high uncertainty (e.g., blurred boundaries) are down-weighted or treated with specific regularization, while high-confidence regions drive learning.
  - Progressive Regularization: The framework adapts the supervision intensity as training progresses, focusing on uncertain regions as the model improves.
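
The uncertainty-weighting idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the exponential mapping from variance to weight, the `beta` scale, and the squared-error consistency term are all choices made here for clarity, and the progressive schedule is not modeled.

```python
import numpy as np


def confidence_weighted_consistency(student_prob, teacher_probs, beta=10.0):
    """Consistency loss that trusts low-variance teacher pixels more.

    student_prob:  (H, W) student foreground probabilities.
    teacher_probs: (K, H, W) teacher predictions on K noise-perturbed inputs.
    beta:          assumed scale turning variance into a weight in (0, 1].
    """
    teacher_mean = teacher_probs.mean(axis=0)   # averaged pseudo-label
    variance = teacher_probs.var(axis=0)        # per-pixel disagreement
    weight = np.exp(-beta * variance)           # high variance -> low weight
    squared_error = (student_prob - teacher_mean) ** 2
    loss = float((weight * squared_error).sum() / weight.sum())
    return loss, weight


# Toy usage: the teacher agrees on pixel 0 and disagrees on pixel 1.
teacher_probs = np.array([[[0.9, 0.1]], [[0.9, 0.9]]])
student_prob = np.array([[0.9, 1.0]])
loss, weight = confidence_weighted_consistency(student_prob, teacher_probs)
print(loss, weight)
```

In the toy example the second pixel, where the two teacher passes disagree, contributes much less to the loss than it would under a plain unweighted average.
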
Dual-Stream Temporal Consistency ( $L_{opti}$ and $L_{coh}$ ):
- Problem: Static image segmentation ignores the temporal continuity of blood flow, leading to flickering or inconsistent masks between frames.
- Solution: The method utilizes a pretrained optical flow estimator (SEA-RAFT) to compute forward and backward flows between consecutive frames.
- Mechanism:
  - Motion Consistency Loss ( $L_{opti}$ ): Uses mask warping to align the student's prediction on frame $t$ with the warped prediction from frame $t+1$ (and vice versa). This ensures pixel-level temporal alignment.
  - Flow Coherence Loss ( $L_{coh}$ ): Addresses ambiguity at vascular boundaries by penalizing deviations of boundary points from the dominant motion vector of the vessel body. This helps distinguish the moving vessel foreground from the static background.
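
As a rough illustration of the mask-warping step, the sketch below warps the frame $t+1$ prediction back into frame $t$'s coordinates using a forward flow field and measures the disagreement. It assumes the flow is already available as an array (in the paper it would come from SEA-RAFT), uses nearest-neighbour sampling for simplicity, and shows only one direction; the function names and the mean-absolute-difference penalty are assumptions made here. The flow coherence term is not sketched because the summary does not give enough detail to reproduce it.

```python
import numpy as np


def warp_mask(mask_next, flow_fwd):
    """Backward-warp the frame t+1 mask into frame t's coordinates.

    mask_next: (H, W) prediction on frame t+1.
    flow_fwd:  (H, W, 2) displacement (dx, dy) of each frame-t pixel toward t+1.
    Nearest-neighbour sampling; out-of-range coordinates are clamped to the border.
    """
    h, w = mask_next.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)
    return mask_next[src_y, src_x]


def motion_consistency_loss(pred_t, pred_next, flow_fwd):
    """Mean absolute difference between frame t and the warped frame t+1 mask."""
    return float(np.mean(np.abs(pred_t - warp_mask(pred_next, flow_fwd))))


# Toy usage: a vertical "vessel" that moves one pixel to the right.
pred_t = np.zeros((4, 4)); pred_t[:, 1] = 1.0
pred_next = np.zeros((4, 4)); pred_next[:, 2] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 0] = 1.0   # dx = +1 everywhere
print(motion_consistency_loss(pred_t, pred_next, flow))
```

When the flow correctly explains the motion, the warped mask lines up with the frame $t$ prediction and the penalty vanishes. The backward direction would repeat the same computation with the roles of the two frames swapped and the backward flow.
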
Overall Objective:
The total loss function combines the supervised loss (Dice plus binary cross-entropy, written $L_{Bce}$), confidence-aware consistency, motion consistency, and flow coherence:
$L_{all} = \lambda_{Dice}L_{Dice} + \lambda_{Bce}L_{Bce} + \lambda_{conf}L_{conf} + \lambda_{opti}L_{opti} + \lambda_{coh}L_{coh}$
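
In code, the objective is simply a weighted sum of the five terms. The sketch below makes that explicit; the loss values and the $\lambda$ weights are made-up placeholders, since the summary does not report the values used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms, mirroring the L_all formula.

    losses:  dict of term name -> scalar loss value.
    weights: dict of term name -> lambda coefficient (placeholder values here).
    """
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Placeholder numbers purely for illustration.
losses = {"dice": 0.5, "bce": 0.25, "conf": 1.0, "opti": 2.0, "coh": 4.0}
weights = {"dice": 1.0, "bce": 1.0, "conf": 0.5, "opti": 0.25, "coh": 0.25}
print(total_loss(losses, weights))
```

Keeping the weights in a dictionary makes it easy to switch individual terms off (set a weight to zero), which is how ablations like those reported in the results section are typically run.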

3. Key Contributions

SAM3-Based Teacher-Student Framework: First to adapt the text-promptable SAM3 for medical video segmentation, eliminating the need for geometric prompts (points/boxes) and leveraging semantic text descriptions for better generalization.
Uncertainty-Aware Regularization: A novel progressive confidence-aware consistency mechanism that dynamically adjusts supervision intensity based on teacher prediction reliability, mitigating the risk of learning from noisy pseudo-labels in low-contrast regions.
Motion-Aware Temporal Modeling: Introduction of a dual-stream temporal consistency strategy (mask warping + flow coherence) that explicitly models vessel dynamics, ensuring temporally consistent and topologically connected segmentation.
State-of-the-Art Performance: Demonstrated superior results on three distinct XCA datasets with significantly fewer annotations than required by fully supervised methods.

4. Experimental Results

The method was evaluated on three datasets: XCAV (public), CADICA (public), and CAVSA (private).

Quantitative Performance:
- On the XCAV dataset (using only 16 labeled videos, ~14% of data), SMART achieved a Dice Similarity Coefficient (DSC) of 84.39% and clDice of 83.01%.
- This outperformed the next best method (CPC-SAM) by 6.49% in DSC and 3.86% in clDice.
- On the CAVSA dataset (using only 1.5% labeled data), SMART achieved a DSC of 91.00%, a 13.1% improvement over baselines.
Ablation Studies:
- Removing the Text-driven Fine-tuning dropped DSC by ~2%.
- Removing Confidence-Aware Consistency caused a catastrophic drop in performance (DSC decreased by ~43%), highlighting the critical role of handling uncertainty.
- Removing Dual-Stream Temporal Consistency significantly reduced spatial connectivity (clDice dropped by ~39%).
Generalization: On the CADICA dataset, SMART was more flexible than methods relying on geometric priors, which suggests it can generalize across different clinical institutions and imaging systems.
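
For reference, the Dice Similarity Coefficient used in these results measures the overlap between a predicted mask and the ground truth, $2|A \cap B| / (|A| + |B|)$. A minimal NumPy sketch follows; the function name and the small `eps` smoothing term are choices made here. clDice, which compares centerlines (skeletons) rather than raw masks, is not shown.

```python
import numpy as np


def dice_coefficient(pred, target, eps=1e-8):
    """Dice overlap between two binary masks (1.0 means perfect agreement).

    `eps` avoids division by zero when both masks are empty (an assumed convention).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


# Toy usage: two 2-pixel masks that share one pixel.
pred = [[1, 1, 0, 0]]
target = [[0, 1, 1, 0]]
print(dice_coefficient(pred, target))
```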

5. Significance

This work represents a significant step forward in clinical AI applications where labeled data is scarce. By effectively combining the semantic understanding of large foundation models (SAM3) with robust uncertainty estimation and motion modeling, SMART provides a reliable solution for coronary artery segmentation. Its ability to achieve state-of-the-art results with minimal annotations makes it highly valuable for real-world deployment in hospitals, potentially reducing the workload on radiologists and improving the efficiency of CAD diagnosis.
