TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models
TrajPred is a framework for surgical instrument-tissue interaction recognition in vision-language models. It encodes instrument trajectories to capture temporal motion cues and generates fine-grained visual semantic embeddings, which significantly improves recognition performance and vision-text alignment on the CholecT50 benchmark.
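
As a rough illustration of the two ideas named above (turning an instrument trajectory into motion features, and scoring an embedding against text embeddings), a minimal Python sketch follows. It is not TrajPred's implementation. The hand-crafted features, the function names `trajectory_features`, `cosine`, and `rank_labels`, and the toy label embeddings are all assumptions made for illustration. A real system would use learned encoders rather than these fixed statistics.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def trajectory_features(track: Sequence[Point]) -> List[float]:
    """Summarize a 2D instrument track as
    [mean dx, mean dy, mean speed, path length].

    Hypothetical stand-in for a learned trajectory encoder.
    """
    if len(track) < 2:
        return [0.0, 0.0, 0.0, 0.0]
    # Frame-to-frame displacements carry the temporal motion cue.
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]
    speeds = [math.hypot(dx, dy) for dx, dy in steps]
    n = len(steps)
    return [
        sum(dx for dx, _ in steps) / n,
        sum(dy for _, dy in steps) / n,
        sum(speeds) / n,
        sum(speeds),
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_labels(embedding: Sequence[float],
                text_embeddings: Dict[str, Sequence[float]]) -> List[str]:
    """Order candidate labels by similarity to the given embedding."""
    return sorted(text_embeddings,
                  key=lambda label: cosine(embedding, text_embeddings[label]),
                  reverse=True)


if __name__ == "__main__":
    # Toy track: an instrument tip moving right at constant speed.
    track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    # Toy text embeddings (assumed values, not from any real text encoder).
    labels = {"grasp": [1.0, 0.0, 1.0, 2.0], "cut": [0.0, 1.0, 0.0, 0.0]}
    print(rank_labels(trajectory_features(track), labels))
```

The sketch only shows the shape of the pipeline: motion is summarized per track, and recognition reduces to a nearest-text-embedding lookup in a shared space.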