Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding

This paper introduces "category splitting," a zero-shot editing method that refines a coarse video classifier into fine-grained subcategories by exploiting latent compositional structure, enabling adaptation to new distinctions without costly retraining while preserving performance on existing categories.

Kaiting Liu, Hazel Doughty

Published 2026-02-19

Imagine you have a very smart video-watching robot. This robot has been trained to recognize thousands of different actions, like "throwing a ball" or "opening a door." But here's the problem: the robot's vocabulary is a bit too simple. It sees "opening a door" as just one single thing. It doesn't know the difference between pushing a door open, pulling it, slamming it, or creaking it open slowly.

In the real world, these tiny differences matter a lot. But if you want your robot to learn these new, specific distinctions, you usually have to feed it thousands of new videos, label them by hand, and retrain the whole robot from scratch. That's expensive, slow, and a lot of work.

This paper introduces a clever new trick called "Category Splitting." Instead of retraining the whole robot, the authors show how to perform a "surgical edit" on the robot's brain to upgrade its vocabulary on the fly.

Here is how it works, broken down into simple concepts:

1. The Problem: The "One-Size-Fits-All" Label

Think of the robot's current knowledge like a giant filing cabinet. Inside, there is one massive folder labeled "Dropping Things." Inside that folder, the robot has seen videos of dropping a cup, dropping a pen, and dropping a book. But it treats them all the same.

Now, imagine you need the robot to distinguish between "Dropping something into a box" and "Dropping something onto a table."

  • The Old Way: You'd have to build a whole new filing cabinet, label thousands of new videos, and teach the robot from zero.
  • The New Way (This Paper): You just take a pair of scissors and cut that one big "Dropping Things" folder into two smaller, specific folders. You don't need new videos; you just rearrange the existing knowledge.

2. The Secret Sauce: The "Modifier" Dictionary

How does the robot know how to cut the folder? The authors discovered that the robot's brain already contains the "ingredients" for these distinctions, even if it hasn't been taught to use them yet.

Imagine the robot's brain is like a Lego set.

  • The big folder "Dropping" is a large, plain Lego brick.
  • But hidden inside the robot's memory are smaller, specialized Lego pieces called "Modifiers."
    • One piece says "Into."
    • One piece says "Onto."
    • One piece says "Behind."

The robot has already learned what "Pushing something" looks like. It has also learned what "Pushing something into a box" looks like. The authors realized that the difference between these two is exactly the "Into" modifier piece, and that this piece can be detached and snapped onto a completely different base action, like "Dropping."
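To make the Lego intuition concrete, here is a minimal sketch of how such a "modifier" could be mined from classifier weights the model already has. All the vectors and names below are hypothetical stand-ins (the real model's classifier rows are high-dimensional); for illustration, a shared "into" direction is deliberately baked into the toy weights so the averaging step can recover it.

```python
import numpy as np

# Hypothetical classifier weight vectors for actions the model already knows.
# In a real model these would be rows of the classifier head; here they are
# random stand-ins with a shared "into" direction baked in for illustration.
rng = np.random.default_rng(0)
dim = 8
into_direction = rng.normal(size=dim)  # the latent "into" modifier

w_push = rng.normal(size=dim)                  # "pushing something"
w_push_into = w_push + into_direction          # "pushing something into X"
w_put = rng.normal(size=dim)                   # "putting something"
w_put_into = w_put + into_direction            # "putting something into X"

# Estimate the "into" modifier as the average difference between each
# fine-grained classifier and its coarse base classifier.
m_into = np.mean([w_push_into - w_push, w_put_into - w_put], axis=0)

print(np.allclose(m_into, into_direction))  # -> True (exact in this toy setup)
```

Averaging over several base/fine-grained pairs is what makes the estimate a reusable "Lego piece": action-specific quirks cancel out, and what remains is the shared "into" direction.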

3. The Zero-Shot Magic (No New Data Needed)

The paper proposes a Zero-Shot method. This means you don't need to show the robot a single new video to teach it the new words.

Here is the magic trick:

  1. Find the Ingredients: The system looks at the robot's existing brain and finds the "Modifier" pieces it already knows (like "into" or "onto") by comparing how it recognizes similar actions.
  2. Build the New Folder: When you want to split "Dropping" into "Dropping into," the system simply takes the "Dropping" brain pattern and adds the "Into" modifier piece to it.
  3. The Result: The robot now has a brand new, highly specific recognition for "Dropping into," created entirely from math and existing knowledge, without seeing a single new video.
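The three steps above can be sketched in a few lines of vector arithmetic. Everything here is a hand-picked toy example, not the paper's actual implementation: the vectors are 3-dimensional for readability, and the modifiers are assumed to have already been mined from known action pairs.

```python
import numpy as np

# Hypothetical, hand-picked weight vectors for illustration (in practice
# these are high-dimensional rows of the model's classifier head).
w_drop = np.array([1.0, 0.0, 0.0])   # coarse "dropping" classifier
m_into = np.array([0.0, 1.0, 0.0])   # "into" modifier mined from known pairs
m_onto = np.array([0.0, 0.0, 1.0])   # "onto" modifier mined from known pairs

# Zero-shot split: fine-grained classifier = coarse classifier + modifier.
w_drop_into = w_drop + m_into
w_drop_onto = w_drop + m_onto

def classify(feat, classifiers):
    """Return the name of the highest-scoring classifier for a feature."""
    return max(classifiers, key=lambda name: feat @ classifiers[name])

# A synthetic video feature with strong "dropping" content and an "into" cue.
feat = np.array([0.9, 0.8, 0.1])
pred = classify(feat, {"dropping into": w_drop_into,
                       "dropping onto": w_drop_onto})
print(pred)  # -> dropping into
```

Note that no video of "dropping something into" was ever needed: the new classifier is assembled purely from pieces the model already had.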

4. The "Low-Shot" Upgrade (Learning from One Example)

What if you do have just one or two new videos? The paper shows that combining the "Zero-Shot" trick with a tiny bit of practice is even better.

Think of it like teaching a child to ride a bike:

  • Zero-Shot: You give them a bike that is already perfectly balanced (thanks to our math trick).
  • Low-Shot: They hop on and ride it once.
  • Result: Because the bike was already balanced, they learn to ride in seconds. If you just gave them a wobbly bike and asked them to learn from one ride, they would fall over.
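The "balanced bike" intuition can be sketched as an initialization choice. This is a simplified stand-in for the paper's low-shot setup: a single logistic-regression update on one labeled clip, comparing a classifier initialized with the zero-shot edit against one started from scratch. All vectors and the update rule are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: we already built a zero-shot classifier for the new
# subcategory (coarse weight + modifier); one labeled clip then refines it.
w_drop = np.array([1.0, 0.0, 0.0])
m_into = np.array([0.0, 1.0, 0.0])
w_zero_shot = w_drop + m_into            # the "already balanced bike"

x = np.array([0.9, 0.8, 0.1])            # one labeled "dropping into" clip
y = 1.0                                  # it is a positive example

def one_step(w, x, y, lr=0.1):
    """Single logistic-regression gradient step on one example."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return w + lr * (y - p) * x

w_low_shot = one_step(w_zero_shot, x, y)       # zero-shot init + one ride
w_scratch = one_step(np.zeros_like(x), x, y)   # "wobbly bike" init + one ride

score = lambda w: 1.0 / (1.0 + np.exp(-(w @ x)))
print(score(w_low_shot) > score(w_scratch))  # -> True
```

One example is far too little data to learn a classifier from zero, but it is plenty to nudge an already-sensible classifier: the zero-shot edit does most of the work, and the labeled clip only fine-tunes it.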

5. Why This Matters

  • Speed & Cost: You don't need to hire armies of people to label videos. You can update a robot's understanding in minutes.
  • Precision: It allows robots to understand the subtle, human-like details of actions (like how something is done, not just what is done).
  • Flexibility: As new needs arise (e.g., a factory robot needs to distinguish between "tightening a screw slightly" vs. "tightening it fully"), you can just "split" the category instantly.

Summary Analogy

Imagine your robot is a chef who only knows how to cook "Soup."

  • The Old Way: To teach the chef to make "Tomato Soup" vs. "Chicken Soup," you have to send them to culinary school for a year with new ingredients.
  • The New Way: You realize the chef already knows how to make "Soup" and already knows how to make "Tomato Sauce" and "Chicken Broth" separately. You just tell them: "Hey, for this new dish, take the 'Soup' base and mix in the 'Tomato' flavor."
  • The Outcome: The chef instantly knows how to make Tomato Soup, without ever setting foot back in culinary school.

This paper shows that video AI models already contain the ingredients for fine-grained understanding; we just need the right way to find and recombine them.
