Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
This paper addresses label noise in action-based video object segmentation, a challenge that has not previously been explored. It introduces the ActiSeg-NL benchmark, analyzes the impact of textual and mask annotation noise, and proposes a Parallel Mask Head Mechanism to improve robustness for embodied intelligence applications.