Imagine you are watching a high-stakes cooking show, but instead of a chef, it's a robot surgeon performing a delicate operation inside a human body. To make this robot truly helpful, it needs to understand exactly what is happening: Which tool is being used? What is it doing? And what part of the body is it touching?
This is the challenge the paper TrajPred tackles. It's trying to teach an AI to "see" and "understand" surgical actions in real-time, specifically focusing on how instruments interact with tissues.
Here is the breakdown of their solution using simple analogies:
The Problem: The "Blurry Snapshot" vs. The "Movie"
Current AI models for surgery are like a photographer who takes a single photo every few seconds and tries to guess the whole story from that still image.
- The Issue: If you see a photo of a scalpel near a liver, is the surgeon cutting, just holding, or about to pull away? A single photo often can't tell you.
- The "Background Noise" Problem: These AIs are also like students who get distracted by the classroom walls. They look at the whole image (the background, the camera movement, the edges of the screen) and try to guess the action, often missing the tiny, crucial detail of the tool actually touching the tissue.
The authors say existing models are "blind" to the motion (temporal information) and get "distracted" by the background.
The Solution: TrajPred (The "Motion Detective")
The authors built a new system called TrajPred. Think of it as upgrading the AI from a photographer to a movie director with a motion tracker.
Here are the three main "superpowers" TrajPred uses:
1. The "Dance Track" (Trajectory Tokens)
Instead of just looking at the picture, TrajPred draws an invisible line (a trajectory) following the surgical tool as it moves through time.
- Analogy: Imagine watching a dancer. If you just look at a photo of their foot, you don't know if they are jumping or standing still. But if you see the path their foot took (the trajectory), you know exactly what dance move they are doing.
- How it works: The AI tracks the tool's position frame-by-frame. It creates a "motion token" that tells the system, "Hey, this tool moved here to there." This helps the AI understand actions that require movement, like "retracting" (pulling back) or "dissecting" (cutting apart), which are impossible to see in a single frozen frame.
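To make the frame-by-frame idea concrete, here is a minimal sketch of turning a tool's tracked positions into a motion signal. The function name, the (x, y) centroid representation, and the delta encoding are illustrative assumptions for this explainer, not the paper's actual implementation.

```python
# Hypothetical sketch: encode a tool's path as a flat list of
# frame-to-frame displacements (a toy "trajectory token").
from typing import List, Tuple

def trajectory_token(centroids: List[Tuple[float, float]]) -> List[float]:
    """Encode a tool's path as frame-to-frame displacements.

    `centroids` holds the tool's (x, y) position in each frame. A
    stationary tool yields all zeros, while a tool being pulled back
    ("retracting") yields a consistent directed path.
    """
    token: List[float] = []
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        token.extend([x1 - x0, y1 - y0])
    return token

# A tool pulled steadily toward the upper-left, as in a retraction:
path = [(100.0, 80.0), (95.0, 74.0), (90.0, 68.0), (85.0, 62.0)]
print(trajectory_token(path))  # every step moves by (-5, -6)
```

The point of the sketch is the contrast with a single frame: the same final centroid could belong to a still tool or a moving one, but the delta sequence distinguishes them immediately.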
2. The "Spotlight" (Joint Embedding Prediction)
Older models try to match the entire image to a sentence like "cutting tissue." This is like trying to match a whole city skyline to the word "coffee." It's too broad, and the AI gets confused by the background.
- The Fix: TrajPred uses a technique called Joint Embedding Prediction. Rather than directly matching the whole image against the text, it predicts what the text's embedding should look like from the visual clues around the interaction itself.
- Analogy: Imagine a detective looking at a crime scene. Instead of guessing the whole story, the detective focuses a spotlight only on the specific area where the action is happening (the tool and the tissue). TrajPred forces the AI to ignore the background noise and focus its "spotlight" strictly on the interaction between the tool and the body part.
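The "spotlight" idea can be sketched as a tiny toy: a predictor maps features from the tool-tissue region into the text-embedding space, and the prediction is scored against candidate verb embeddings by cosine similarity. Everything here (the 2-D features, the identity predictor, the embedding values) is made up for illustration; the paper's actual model and training objective are more involved.

```python
# Toy sketch of joint embedding prediction (assumed mechanics, not the
# paper's architecture): map region-of-interest visual features into the
# text-embedding space, then compare against candidate verb embeddings.
import math

def predict_text_embedding(roi_feat, weights):
    """Linear predictor: visual ROI feature -> text-embedding space."""
    return [sum(w * x for w, x in zip(row, roi_feat)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-D values, purely for the demo.
roi_feat = [1.0, 0.5]                 # features cropped to the tool-tissue region
weights = [[1.0, 0.0], [0.0, 1.0]]    # identity predictor keeps the demo simple
text_emb_retract = [0.9, 0.4]         # stand-in embedding for "pulling aside"
text_emb_cut = [-0.8, 0.6]            # stand-in embedding for "cutting apart"

pred = predict_text_embedding(roi_feat, weights)
print(cosine(pred, text_emb_retract) > cosine(pred, text_emb_cut))  # True
```

Because the predictor only ever sees the cropped interaction region, background pixels simply cannot influence the score; that is the whole "spotlight" trick in miniature.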
3. The "Translator" (Verb Rephrasing)
Surgical language is very specific and technical. A robot trained on general internet data might not understand that "retract" means "pulling aside."
- The Fix: The authors act as translators. They take the short, technical verb (e.g., "retract") and turn it into a descriptive sentence (e.g., "pulling aside").
- Analogy: It's like teaching a child a new word. Instead of just saying "Retract," you say, "The tool is pulling the tissue away." This helps the AI connect the visual action to the language much better, especially for rare or difficult actions it hasn't seen before.
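The rephrasing step amounts to a lookup from terse verbs to descriptive templates. The table below is an illustrative example of the idea; the specific mappings and the fallback sentence are my own stand-ins, not the authors' actual prompt set.

```python
# Illustrative verb-rephrasing table: terse surgical verbs expanded
# into descriptive sentences. Mappings are examples, not the paper's.
REPHRASE = {
    "retract": "pulling the {target} aside with the {tool}",
    "dissect": "cutting the {target} apart with the {tool}",
    "grasp":   "holding the {target} firmly with the {tool}",
}

def describe(verb: str, tool: str, target: str) -> str:
    """Expand a terse surgical verb into a descriptive sentence.

    Unknown verbs fall back to a generic description, so rare actions
    still get a usable sentence.
    """
    template = REPHRASE.get(verb, "using the {tool} on the {target}")
    return template.format(tool=tool, target=target)

print(describe("retract", "grasper", "liver"))
# -> pulling the liver aside with the grasper
```

Feeding the model "pulling the liver aside with the grasper" instead of the bare token "retract" gives a language model far more to latch onto, which is exactly the effect the authors describe for rare verbs.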
The Results: Why Does This Matter?
The team tested this on a famous dataset of laparoscopic surgery videos (CholecT50).
- Better Accuracy: TrajPred scored significantly higher at identifying the correct tool, action, and target than previous state-of-the-art models.
- Seeing the Details: When they visualized the AI's "attention," TrajPred's "spotlight" was perfectly focused on the tool and tissue. The old models' spotlights were blurry and often pointed at the background or the camera edges.
- Handling the Unknown: Even when the AI saw a rare action it had never been explicitly taught (like a specific way of packing tissue), TrajPred figured it out better than the others because it understood the motion and the descriptive language.
The Bottom Line
TrajPred is a smarter way for robots to watch surgery. By tracking the movement of tools, focusing the spotlight on the action, and translating technical words into clear descriptions, it helps AI assistants understand the "story" of the surgery, not just the pictures. This is a huge step toward robots that can truly collaborate with human surgeons, offering real-time advice and safety checks.