Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation

This paper introduces a training-free, open-vocabulary zero-shot temporal action segmentation framework that leverages the capabilities of diverse Vision-Language Models to segment video actions without task-specific supervision, demonstrating their strong potential for structured temporal understanding.

Asim Unmesh, Kaki Ramesh, Mayank Patel, Rahul Jain, Karthik Ramani

Published 2026-02-26

Imagine you are watching a home video of someone making a sandwich. Your brain naturally breaks this continuous stream of images into distinct "chapters": getting the bread, spreading the peanut butter, adding the jelly, and taking a bite.

For a long time, computers have struggled to do this automatically. They are like a student who has memorized a specific textbook but fails when asked to describe a new topic they haven't studied. This is the problem of Temporal Action Segmentation (TAS): teaching a computer to chop a video into meaningful action chunks.

Here is the simple breakdown of what this paper does, using some everyday analogies.

The Problem: The "Closed Library"

Existing computer vision models are like students with a closed library. They can only recognize actions if they have seen them before in their training data.

  • If you train a model on "making tea," it knows "boil water" and "pour tea."
  • But if you show it "making coffee," it gets confused because "coffee" isn't in its closed library.
  • Furthermore, the real world is messy. One person might call a step "chopping onions," while another calls it "prepping vegetables." Collecting a dataset for every possible way to describe every possible action is impossible.

The Solution: The "Universal Translator" (VLMs)

The authors introduce OVTAS (Open-Vocabulary Zero-Shot Temporal Action Segmentation). Think of this as giving the computer a Universal Translator (specifically, a Vision-Language Model or VLM) instead of a closed library.

These models (like CLIP or SigLIP) have already "read" billions of books and "seen" billions of images. They understand that a picture of a dog and the word "dog" go together, even if they've never seen that specific dog before.

The paper asks: Can we use this universal translator to watch a video and label the actions in real-time, without ever training it on video data?

How It Works: The Two-Stage Process

The authors propose a "training-free" pipeline, meaning they don't teach the model anything new. They just use it as is. They use a two-step process they call "Segmentation-by-Classification."

Stage 1: The "Guessing Game" (FAES)

Imagine you are watching a video frame-by-frame.

  1. You have a list of possible actions (e.g., "pouring," "cutting," "stirring").
  2. For every single frame of the video, the computer asks the VLM: "Does this image look more like 'pouring' or 'cutting'?"
  3. The VLM gives a score for every action.

The Catch: The VLM is a bit scatterbrained. It looks at each frame in isolation. It might say, "Frame 10 is pouring," but then "Frame 11 is cutting," and "Frame 12 is pouring" again. In real life, you don't chop, then pour, then chop again instantly. The computer's guesses are temporally inconsistent (all over the place).
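Stage 1 can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: the random vectors below stand in for embeddings that would really come from a VLM's image encoder (for frames) and text encoder (for action prompts). The scoring itself, cosine similarity followed by a softmax over the action list, is how CLIP-style models compare an image against candidate labels.

```python
import numpy as np

# Toy stand-ins for VLM embeddings. In practice, frame embeddings would
# come from the image encoder and text embeddings from the text encoder
# of a model like CLIP or SigLIP; random vectors here keep it runnable.
rng = np.random.default_rng(0)

actions = ["pouring", "cutting", "stirring"]
text_emb = rng.normal(size=(3, 8))   # one embedding per action prompt
frames = rng.normal(size=(12, 8))    # one embedding per video frame

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every frame and every action prompt,
# turned into per-frame probabilities with a softmax.
logits = l2norm(frames) @ l2norm(text_emb).T        # shape: (12, 3)
scores = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Each frame guesses independently -- this is where the flicker
# ("pouring, cutting, pouring") comes from.
per_frame_guess = scores.argmax(axis=1)
print([actions[i] for i in per_frame_guess])
```

Because every frame is scored in isolation, nothing stops the argmax from hopping between actions from one frame to the next, which is exactly the temporal inconsistency Stage 2 has to clean up.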

Stage 2: The "Editor" (SMTS)

This is where the magic happens. The authors use a mathematical tool called Optimal Transport (think of it as a super-smart editor).

  • The editor looks at the messy list of guesses from Stage 1.
  • It knows that actions usually flow logically. You don't jump from "start cooking" to "eating dessert" in one second.
  • It rearranges the labels to make a smooth, logical story. It forces the computer to say, "Okay, for the next 5 seconds, we are definitely 'chopping,' and then we switch to 'stirring'."
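The "editor" idea can be illustrated with a minimal dynamic-programming smoother. To be clear, this is a simplified stand-in, not the paper's optimal-transport formulation: it captures the same intuition by keeping labels faithful to the Stage 1 scores while charging a penalty every time the label switches, so one-frame flickers get ironed out.

```python
import numpy as np

def smooth_labels(scores, switch_cost=0.5):
    """Viterbi-style smoothing over per-frame action probabilities.

    scores: (T frames, K actions) array of per-frame probabilities.
    A stand-in for the paper's optimal-transport step: staying in the
    same action is free, switching actions pays `switch_cost`.
    """
    T, K = scores.shape
    cost = -np.log(scores + 1e-9)        # low cost = good frame/action match
    best = cost[0].copy()                # best path cost ending in each action
    back = np.zeros((T, K), dtype=int)   # backpointers for recovery
    for t in range(1, T):
        trans = best[:, None] + switch_cost * (1 - np.eye(K))
        back[t] = trans.argmin(axis=0)
        best = trans.min(axis=0) + cost[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = best.argmin()
    for t in range(T - 1, 0, -1):        # walk the backpointers
        labels[t - 1] = back[t, labels[t]]
    return labels

# Noisy per-frame guesses: mostly action 0, one spurious flicker to 1.
noisy = np.full((8, 2), 0.1)
noisy[:, 0] = 0.9
noisy[4] = [0.4, 0.6]                    # the "scatterbrained" frame
print(smooth_labels(noisy))              # the flicker is smoothed away
```

The switching penalty plays the role of the editor's prior that actions flow logically: a single contrarian frame is cheaper to overrule than to honor with two label switches.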

The Experiments: Testing the "Universal Translator"

The authors didn't just build one tool; they tested 14 different versions of these Universal Translators (different sizes and families like CLIP, SigLIP, etc.) on three standard cooking-video datasets (Breakfast, 50Salads, and GTEA).

Key Findings:

  1. It Works Without Training: They achieved impressive results without showing the models a single labeled video. They just gave the models the list of action names and let the VLM do the rest.
  2. Bigger Isn't Always Better: Usually, in AI, bigger models are smarter. But here, they found that bigger models didn't necessarily do a better job at slicing videos. Sometimes, a smaller, snappier model was just as good.
  3. Short Videos are Hard: The system struggled the most with videos that had very fast, tiny actions (like the GTEA dataset where actions last only 2 seconds). It's like trying to edit a movie where the scene changes every 0.5 seconds; there isn't enough time for the "editor" to figure out the flow.

The Analogy Summary

  • Old Way: A robot that only knows 50 specific dance moves. If you ask it to do a new dance, it freezes.
  • New Way (OVTAS): A robot that speaks human language and understands concepts. You tell it, "This video is about 'making a sandwich'." It looks at the video, understands the concepts of "bread," "knife," and "peanut butter," and automatically writes a script saying, "First, pick up bread. Then, spread butter." It does this instantly, without needing to practice the dance first.

Why This Matters

This research is a huge step forward because it breaks the "closed vocabulary" barrier. It suggests that we can build systems that understand human activities in any language, with any level of detail, without needing to spend years and millions of dollars collecting labeled video data. It turns the computer from a rigid memorizer into a flexible, understanding observer.
