The Big Picture: Teaching a Robot to Watch Videos
Imagine you have a super-smart robot (a Multimodal Large Language Model, or MLLM) that has read the entire internet and watched millions of videos. You want to teach it to understand actions in new videos, like "adding sugar to a cake" or "pouring milk."
The researchers asked a simple question: What is the best way to ask this robot to identify an action?
They compared two different teaching styles:
- The "Storyteller" (Generative Classifier): You ask the robot, "What is happening?" and it has to write out the answer word by word, like a story.
- The "Quiz Master" (Discriminative Classifier): You show the robot a list of possible answers and ask it to point to the correct one immediately.
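For readers who like to see the mechanics, here is a minimal toy sketch of the two styles (the probabilities and logits are made up for illustration; no real model is involved):

```python
import math

LABELS = ["add sugar", "add salt"]

# Storyteller: score a label by decoding it word by word and summing the
# log-probabilities the model assigns to each word (hypothetical numbers).
TOKEN_PROBS = {"add": 0.9, "sugar": 0.6, "salt": 0.3}

def generative_score(label):
    return sum(math.log(TOKEN_PROBS[word]) for word in label.split())

# Quiz Master: one forward pass yields a single score per label
# (made-up logits standing in for a classification head's output).
DISC_LOGITS = {"add sugar": 2.1, "add salt": -0.4}

print(max(LABELS, key=generative_score))  # "add sugar"
print(max(LABELS, key=DISC_LOGITS.get))   # "add sugar"
```

Both routes can reach the same answer; the difference, as the rest of this piece explains, lies in how shared words and decoding steps affect them.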
The Problem with the "Storyteller"
For a long time, researchers used the Storyteller approach. They would prompt the robot: "Describe the action in this video." The robot would then generate text like: "The person is... adding... sugar..."
Why this is tricky:
Imagine the robot is taking a spelling test. The options are "Add Sugar" and "Add Salt."
- Both answers start with the same word, "Add."
- Only the final word ("Sugar" vs. "Salt") actually differs.
Because the robot has to write the words one by one, it gets confused by the shared parts. It might think, "Oh, I've already written 'Add,' so I'm halfway to 'Add Salt,' but wait, the video looks like sugar..." This confusion is called Semantic Overlap. It's like trying to tell apart twins wearing the same shirt: the robot gets mixed up because the "shirt" (the shared word) looks identical on both.
Also, writing a sentence takes time. If the answer is "Add strawberries to cake," the robot has to think of "Add," then "strawberries," then "to," then "cake": four separate decoding steps instead of one. It's slow.
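To make the "shared words" point concrete, here is a toy calculation with invented probabilities: in word-by-word scoring, words that appear in both labels contribute identical amounts to both scores, so only the differing word separates them, yet every word still costs a decoding step.

```python
import math

# Invented per-word probabilities, for illustration only.
P = {"add": 0.9, "strawberries": 0.5, "blueberries": 0.4, "to": 0.95, "cake": 0.9}

def score(label):
    return sum(math.log(P[w]) for w in label.split())

gap = score("add strawberries to cake") - score("add blueberries to cake")

# The shared words ("add", "to", "cake") cancel out exactly; the whole
# difference comes from the one differing word.
assert abs(gap - (math.log(0.5) - math.log(0.4))) < 1e-9

# ...yet the generative route still pays one decoding step per word,
# while a discriminative pick would be a single step.
generative_steps = len("add strawberries to cake".split())  # 4
discriminative_steps = 1
```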
The Solution: The "Quiz Master"
The researchers realized that for a closed list of actions (like a multiple-choice test), the Quiz Master approach is much better.
Instead of writing the answer, the robot looks at the video and instantly points to the correct label from a list.
- No shared words: It treats "Add Sugar" and "Add Salt" as two completely different, unique codes (like distinct barcodes). It doesn't have to worry about the word "Add" confusing it.
- Speed: It doesn't have to write a sentence. It just picks the right button. This makes it 3 times faster in their experiments.
The "Best of Both Worlds" Idea: GAD
The researchers found that while the Quiz Master was faster and more accurate, the Storyteller had a secret superpower: Context.
The Storyteller is great at understanding the flow of a story. Asking it to describe the previous step ("What did they just do?") helps it understand the current step better.
So, they created a hybrid system called GAD (Generation-Assisted Discriminative).
The Analogy: The Coach and the Player
Imagine a soccer player (the Quiz Master) who is great at scoring goals but sometimes misses the big picture.
- The Player: During the game (inference), the player just focuses on kicking the ball into the net (identifying the action). This is fast and accurate.
- The Coach: During practice (training), a coach (the Storyteller) stands next to the player. The coach says, "Hey, remember, before you kicked the ball, you were passing it to the left. That context helps you know where to run next."
The player listens to the coach to learn better strategies, but during the actual game, the player doesn't stop to listen to the coach; they just play.
How GAD works:
- Training: The model learns to be a Quiz Master (picking the right action) while the Storyteller part whispers context clues (like "what happened before?") to help it learn.
- Testing: When the model is actually watching a video, it turns off the Storyteller. It acts purely as the fast, accurate Quiz Master.
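The training/testing split above can be sketched in code. This is a hypothetical toy (the model, the auxiliary-loss form, and the weight `alpha` are our own illustrative inventions, not the paper's actual implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

class ToyGAD:
    """Stand-in for an MLLM with a classification head and a text decoder."""
    def classify(self, feats):
        # Discriminative head: one logit per action label (toy rule).
        return [sum(feats), -sum(feats)]
    def generation_loss(self, feats, context_tokens):
        # Auxiliary "Storyteller" loss on context like "what happened before?"
        return 0.1 * len(context_tokens)  # toy placeholder

def training_loss(model, feats, label_id, context_tokens, alpha=0.5):
    # Training: Quiz Master loss plus the coach's whispered context clues.
    disc = cross_entropy(model.classify(feats), label_id)
    return disc + alpha * model.generation_loss(feats, context_tokens)

def predict(model, feats):
    # Testing: the Storyteller is switched off; one discriminative pass.
    logits = model.classify(feats)
    return max(range(len(logits)), key=logits.__getitem__)
```

Note that only `predict` runs at test time, which is why inference stays fast: the generative branch exists purely as a training signal.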
The Results: Why This Matters
The researchers tested this on five different video datasets (like cooking videos, sports, and daily life). Here is what happened:
- Accuracy: The new method (GAD) was the most accurate, beating the old "Storyteller" methods by a significant margin (about 2.5% to 6.8% better).
- Speed: It was 3 times faster than the old methods.
- Efficiency: It achieved these results with a smaller model: a 1-billion-parameter version beat older, much larger 8-billion-parameter models.
The Takeaway
This paper teaches us that when we want a robot to classify things (pick the right label from a list), we shouldn't force it to write the answer like a human. Writing introduces confusion because the candidate labels share words.
Instead, we should teach the robot to recognize the answer directly. However, we can still use the "writing" skill as a training tool to help the robot understand the context, making it a smarter, faster, and more accurate observer of the world.
In short: Stop asking the robot to write an essay to tell you what it sees. Ask it to point to the right picture, but let it "think" like a writer while it's studying.