Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett's Video Segmentation

This paper proposes Learning-to-Re-Prompt (L2RP), a cost-aware framework that analyzes how annotation errors propagate in endoscopic video segmentation and learns an adaptive policy balancing expert intervention effort against segmentation accuracy for Barrett's esophagus dysplasia.

Lokesha Rasanjalee, Jin Lin Tan, Dileepa Pitawela, Rajvinder Singh, Hsiang-Ting Chen

Published 2026-02-26

Imagine you are a doctor trying to draw a map of a very tricky, shape-shifting island (a lesion in the esophagus) on a series of 100 moving photographs. This is what happens when doctors annotate Barrett's esophagus videos to train AI.

Doing this manually for every single photo is exhausting and takes forever. So, doctors use a "smart assistant" (an AI called SAM2). The doctor draws the island on just the first photo, and the AI tries to guess where the island is in all the photos that follow.

The Problem: The "Drifting" Map
Here's the catch: The island moves, the lighting changes, and the camera shakes. If the AI makes a tiny mistake on photo #5, it carries that mistake to photo #6, then #7, and so on. By photo #50, the AI's map might be completely wrong. It's like a game of "Telephone" where the message gets garbled with every turn.

Usually, the doctor has to stop and fix the map every time it gets slightly off, which is still very time-consuming. Or, they might just guess a random time to fix it, which isn't very efficient.
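The "Telephone game" effect is compounding: each frame keeps only a fraction of the previous frame's quality, so small per-frame errors multiply into large ones. A toy simulation makes this concrete (the 2% per-frame loss is an illustrative assumption, not a number from the paper):

```python
# Toy illustration of error propagation: if segmentation quality (e.g. IoU)
# degrades by a small fraction each frame, the losses compound multiplicatively.
def propagated_quality(initial_quality: float, per_frame_loss: float, frames: int) -> float:
    """Quality after `frames` propagation steps, assuming each step keeps
    (1 - per_frame_loss) of the previous frame's quality."""
    return initial_quality * (1.0 - per_frame_loss) ** frames

# A tiny 2% loss per frame seems harmless early on...
q5 = propagated_quality(1.0, 0.02, 5)    # still about 0.90 at photo #5
q50 = propagated_quality(1.0, 0.02, 50)  # but only about 0.36 by photo #50
```

The same mechanism explains why a correction late in the video cannot undo damage already baked into earlier frames, while a well-timed one resets the decay.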

The Solution: A Smart "Check-In" System (L2RP)
This paper introduces a new system called L2RP (Learning-to-Re-Prompt). Think of L2RP as a super-smart co-pilot sitting next to the doctor.

Instead of the doctor guessing when to fix the map, or the AI blindly guessing, L2RP watches the AI's work in real-time. It asks itself: "Is the AI still doing a good job, or is it starting to drift off course?"

  • If the AI is doing well: The co-pilot says, "Keep going, no need to bother the doctor yet."
  • If the AI is starting to drift: The co-pilot says, "Stop! The map is getting messy. We need the doctor to step in and correct it right now."

The system learns exactly when to ask for help so that the doctor does the minimum amount of work to get the best possible map.
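In spirit, the co-pilot's decision rule is a loop: propagate, estimate drift, and spend one of a limited number of expert corrections only when drift crosses a threshold. The sketch below is a minimal mock-up of that idea, not the paper's implementation; `drift_scores`, the threshold, and the post-fix cooldown are all illustrative assumptions standing in for whatever signal the learned policy actually uses.

```python
def re_prompt_schedule(drift_scores, threshold=0.5, budget=3):
    """Given a per-frame drift estimate (0 = on track, 1 = badly off course),
    decide on which frames to ask the expert for a correction.

    A correction resets the drift for the frames that follow; we model that
    crudely by skipping further checks for a few frames after each fix.
    """
    corrections = []
    cooldown = 0  # frames to wait after a correction before checking again
    for frame, drift in enumerate(drift_scores):
        if cooldown > 0:
            cooldown -= 1
            continue
        if drift >= threshold and len(corrections) < budget:
            corrections.append(frame)  # "Stop! We need the doctor."
            cooldown = 5               # assume the fix keeps things stable a while
    return corrections

# Drift creeps up over a short clip; only the frame that crosses the
# threshold triggers a request for expert help.
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.1, 0.8]
schedule = re_prompt_schedule(scores)
```

With these scores only frame 3 triggers a correction; the later spike at frame 7 falls inside the post-fix cooldown, which is how a budget of a few corrections can stretch across a long video.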

The "Prompt" Choices: Drawing the Map
The paper also tested three different ways the doctor can give the AI the first instruction (the "prompt"):

  1. The "Detailed Sketch" (Mask): The doctor carefully traces the exact outline of the island.
    • Pros: Starts very accurate.
    • Cons: Like a delicate sandcastle, it washes away quickly. The AI gets confused easily if the island moves slightly, leading to big errors later.
  2. The "Rough Box" (Box): The doctor draws a square around the island.
    • Pros: A bit more stable.
    • Cons: Less precise to start with.
  3. The "Finger Point" (Point): The doctor just clicks a few dots on the island.
    • Pros: Surprisingly stable! Even though it's the least detailed, the AI holds onto this instruction the longest without getting confused.
    • Cons: Starts slightly less accurate than the sketch.

The Big Discovery
The researchers found that the "Detailed Sketch" (Mask) is the most tempting because it looks perfect at the start, but it requires the doctor to fix the map constantly. The "Finger Point" (Point) is the most reliable "set it and forget it" option.

However, the real magic of L2RP is that it doesn't matter which method you choose. The co-pilot knows exactly when the AI is struggling and asks the doctor to intervene only at the most critical moments.

The Result
By using this smart co-pilot:

  • The final maps are much more accurate.
  • The doctor spends significantly less time correcting mistakes.
  • It works like a budget: You can tell the system, "I have 10 minutes to help," and it will stretch that 10 minutes to cover the whole video as effectively as possible.

In a Nutshell
This paper teaches us how to stop the AI from "drifting" off course and how to use a smart system to decide exactly when a human expert needs to step in. It turns a tedious, hour-long job into a quick, efficient collaboration between human and machine, ensuring the AI learns the right lessons without burning out the doctor.
