A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking

This paper introduces PL-Stitch, a self-supervised learning framework that combines Plackett-Luce ranking with spatio-temporal jigsaw objectives. By overcoming the temporal blindness of existing models, it achieves state-of-the-art performance on procedural video understanding tasks such as surgical phase recognition and cooking action segmentation.

Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

Published 2026-03-24

The Big Problem: The "Time-Blind" AI

Imagine you are teaching a robot to make a sandwich.

  • Step 1: Put bread on the plate.
  • Step 2: Spread peanut butter.
  • Step 3: Add jelly.
  • Step 4: Put the top slice on.

Current AI models (specifically those using "Self-Supervised Learning") are like a student who has memorized what a slice of bread looks like, what peanut butter looks like, and what jelly looks like. But they have no idea about the order.

If you showed this robot a video of the sandwich being made in reverse (taking the jelly off, then the peanut butter, then the bread), the robot would be just as confused as when watching it forward. It sees the same ingredients, so it thinks the "story" is the same. It fails to understand that time matters. In surgery or cooking, doing step 4 before step 1 is a disaster, but the AI doesn't know that.

The Solution: "PL-Stitch"

The researchers created a new AI framework called PL-Stitch. Think of it as a teacher who forces the robot to learn not just what things look like, but when they happen.

They did this using two main tricks, which they call "Branches":

1. The "Movie Sorter" (The Video Branch)

Imagine you take a movie, cut it into 8 random scenes, shuffle them up, and hand them to the robot.

  • Old AI: "I see a knife, I see a pot, I see a plate. I don't know which came first."
  • PL-Stitch: The robot is forced to play a game: "Put these scenes back in the correct chronological order."

To do this, they used a special mathematical tool called the Plackett-Luce (PL) model.

  • The Analogy: Imagine a race with 8 runners. Instead of just guessing who came in 1st, 2nd, and 3rd separately, the PL model asks the AI to rank the entire group at once. It calculates the probability of every possible finishing order.
  • Why it's better: If the AI guesses the order is almost right (e.g., it swaps the 2nd and 3rd runner), the PL model says, "Good try, but not quite." If it guesses the order is totally wrong, it says, "Way off." This gives the AI a much smarter "score" to learn from than just a simple "Right/Wrong" answer.
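The "smarter score" above can be made concrete. Here is a minimal sketch of the Plackett-Luce log-probability: the model assigns one score per clip, and the PL model treats the ordering as picking the "winner" among the remaining clips at each position, via a softmax over the clips not yet placed. (The function name and the toy scores are illustrative, not the paper's code.)

```python
import math

def plackett_luce_log_prob(scores, order):
    """Log-probability that the given chronological order is produced
    by the Plackett-Luce model, given one score per clip
    (higher score = believed to come earlier)."""
    logp = 0.0
    remaining = list(order)
    for idx in order:
        # Probability that clip `idx` is chosen next among the clips
        # still unplaced: a softmax over the remaining scores.
        denom = sum(math.exp(scores[j]) for j in remaining)
        logp += scores[idx] - math.log(denom)
        remaining.remove(idx)
    return logp

scores = [3.0, 2.0, 1.0, 0.0]                          # clip 0 scored earliest
right = plackett_luce_log_prob(scores, [0, 1, 2, 3])   # correct order
near = plackett_luce_log_prob(scores, [0, 2, 1, 3])    # 2nd and 3rd swapped
wrong = plackett_luce_log_prob(scores, [3, 2, 1, 0])   # fully reversed
assert right > near > wrong   # near-misses are penalised less than total misses
```

The key property is exactly the "good try, but not quite" behaviour: a near-correct ranking gets a noticeably better score than a reversed one, so the gradient tells the model *how* wrong it was, not just *that* it was wrong.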

2. The "Jigsaw Detective" (The Image Branch)

This part focuses on the details. Imagine you have a photo of a chef chopping an onion. You cut the photo into puzzle pieces and hide a few of them.

  • The Trick: To figure out where the missing pieces go, the AI is allowed to peek at the previous frame (the chef holding the whole onion) and the next frame (the onion already chopped).
  • The Goal: By looking at the "before" and "after," the AI learns how objects move and change over time. It learns that a whole onion must come before a chopped onion. This helps it understand the fine details of the action, not just the big picture.
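A toy version of that data setup can be sketched in a few lines: cut the middle frame into a grid of patches, hide a few of them, and keep the untouched previous and next frames around as the "before" and "after" clues. (The helper name, grid size, and frame shapes below are illustrative assumptions, not the authors' implementation.)

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jigsaw_sample(prev_frame, frame, next_frame, grid=4, n_hidden=3):
    """Cut the middle frame into a grid of patches, hide a few, and
    return (visible patches, hidden patches, their positions, context).
    A model would be trained to place the hidden patches using the
    visible ones plus the previous/next frames as temporal context."""
    h, w = frame.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [frame[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    hidden_idx = rng.choice(len(patches), size=n_hidden, replace=False)
    visible = [p for i, p in enumerate(patches) if i not in hidden_idx]
    hidden = [patches[i] for i in hidden_idx]
    context = (prev_frame, next_frame)   # the "before" and "after" clues
    return visible, hidden, sorted(hidden_idx.tolist()), context

frames = rng.random((3, 64, 64, 3))      # fake prev / current / next frames
visible, hidden, targets, ctx = make_jigsaw_sample(*frames)
assert len(visible) + len(hidden) == 16  # 4x4 grid, 3 patches hidden
```

The pretext task is then to predict, for each hidden patch, where in the grid it belongs, with the temporal context resolving ambiguities that a single frame cannot (whole onion vs. chopped onion).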

Why This Matters (The Results)

The researchers tested PL-Stitch on two very different types of videos: Surgery (like gallbladder removal) and Cooking (like making breakfast).

  • The Test: They froze the AI's brain (so it couldn't learn new things) and just asked it to look at a video and say, "What phase of the surgery/cooking is happening right now?"
  • The Result: PL-Stitch crushed the competition.
    • In surgery, it was 11.4% more accurate than the previous best AI.
    • In cooking, it was 5.7% more accurate.
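The "frozen brain" evaluation is a standard linear probe: the pre-trained encoder's weights are never updated, and only a linear classifier on top of its features is fitted. A minimal numpy sketch, with a random projection standing in for the real PL-Stitch encoder (feature size, class count, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a fixed random projection standing in for the
# pre-trained encoder. Its weights are never updated during probing.
W_frozen = rng.standard_normal((3072, 64))

def features(frames):                          # frames: (N, 3072) flattened
    return np.maximum(frames @ W_frozen, 0.0)  # fixed ReLU features

# Fake dataset: 200 frames, each labelled with one of 7 phases
# (e.g. the surgical phases of a procedure).
X = rng.standard_normal((200, 3072))
y = rng.integers(0, 7, size=200)

# Fit ONLY the linear probe, here by least squares on one-hot labels.
F = features(X)
Y = np.eye(7)[y]
W_probe, *_ = np.linalg.lstsq(F, Y, rcond=None)

preds = (features(X) @ W_probe).argmax(axis=1)
train_acc = (preds == y).mean()   # accuracy reflects feature quality only
```

Because the backbone cannot adapt, any accuracy the probe reaches must come from what the pre-training already put into the features, which is exactly why this test isolates the benefit of PL-Stitch's temporal objectives.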

The "Aha!" Moment:
The paper includes a cool experiment. They took a video of someone making coffee and played it forward and backward.

  • Old AI: The features (the internal "thoughts" of the AI) for the forward video and the backward video were almost identical. It was "time-blind."
  • PL-Stitch: The features for the forward video were totally different from the backward video. It finally understood that time flows in one direction.
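That diagnostic boils down to comparing the embedding of a clip with the embedding of its reversed copy, e.g. by cosine similarity. A toy illustration with fabricated embeddings (the vectors below are made up purely to show the two regimes, not measured features):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
clip = rng.standard_normal(128)   # pretend embedding of the forward clip

# A time-blind encoder maps the reversed clip almost onto the same point;
# a time-aware one pushes it somewhere very different.
blind_bwd = clip + 0.01 * rng.standard_normal(128)
aware_bwd = -clip + 0.01 * rng.standard_normal(128)

assert cosine(clip, blind_bwd) > 0.9   # "time-blind": nearly identical
assert cosine(clip, aware_bwd) < 0.0   # direction of time is encoded
```

High forward/backward similarity means the encoder only sees "ingredients"; low similarity means it has learned the arrow of time.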

The Takeaway

Think of PL-Stitch as a new way of teaching AI to watch videos. Instead of just memorizing a library of static pictures, it learns the story.

  • Old Way: "I know what a scalpel looks like."
  • PL-Stitch: "I know that the scalpel is used after the incision is made, and before the stitching begins."

By forcing the AI to solve "time puzzles" (ranking frames) and "context puzzles" (jigsaw with past/future clues), they created a system that truly understands the flow of human activities. This is a huge step toward AI that can assist surgeons or teach cooking by actually understanding the process, not just the pictures.
