Beyond Caption-Based Queries for Video Moment Retrieval

This paper investigates the performance degradation of existing Video Moment Retrieval methods when transitioning from caption-based to search queries, identifies language and multi-moment gaps alongside a decoder-query collapse as key causes, and proposes architectural modifications to significantly improve generalization on multi-moment search queries.

David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray

Published 2026-03-04

Imagine you have a massive library of home videos, and you want to find a specific moment in them. You type a search query into a computer, hoping it finds the exact clip you're looking for. This is the job of Video Moment Retrieval (VMR).

For years, the AI models trained to do this have been like students who only studied for one very specific, oddly worded type of exam. Here is the problem the paper solves, explained simply:

1. The Problem: The "Over-Descriptive" Teacher

Currently, AI models are trained using captions written by humans who have already watched the video.

  • The Scenario: A human watches a video of a soccer game. They see a player in a yellow jersey score a goal.
  • The Caption (Training Data): "A man in a yellow jersey intercepts a loose pass from the opposing team near the box and scores a powerful volley."
  • The Reality: When a real user wants to find that moment, they don't know the player wore yellow, or that it was a volley. They just type: "When are goals being scored?"

The AI is like a student who memorized the teacher's detailed notes but fails the test when the question is asked in simple, vague language. The AI is "over-specialized" and gets confused by the real world.

2. The Experiment: Creating a "Real World" Test

The researchers realized that existing AI models were failing when tested on these simple, real-world searches. To prove this, they didn't just guess; they built a new testing ground.

They took three famous video datasets and used a smart AI (a "rewriter") to take those detailed, fancy captions and simplify them.

  • Original: "A man ties his running shoes before starting a marathon."
  • Simplified (Search Query): "A person getting ready to exercise."

They created three new benchmarks (HD-EPIC-S, YC2-S, ANC-S) where the "questions" were less detailed, phrased the way real users actually search.
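If you're curious what that rewriting step might look like in code, here is a minimal sketch. The prompt wording and the function name are illustrative assumptions, not the authors' actual pipeline; a real setup would send this prompt to a language model.

```python
# Hypothetical sketch of the caption-simplification step.
# The prompt text below is an assumption for illustration only.

PROMPT_TEMPLATE = (
    "Rewrite this video caption as a short, vague search query a real "
    "user might type, dropping details they could not know in advance:\n"
    "Caption: {caption}\n"
    "Search query:"
)

def build_rewrite_prompt(caption: str) -> str:
    """Fill the template with one ground-truth caption."""
    return PROMPT_TEMPLATE.format(caption=caption)

if __name__ == "__main__":
    caption = "A man ties his running shoes before starting a marathon."
    print(build_rewrite_prompt(caption))
```

The rewriter model would then answer with something like "A person getting ready to exercise", which becomes the new test query.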

3. The Discovery: Two Big Hurdles

When they tested the old AI models on these new, simple questions, performance crashed. They found two main reasons why:

  • The Language Gap: The AI is used to hearing "yellow jersey" and "volley." When it hears "exercise" or "goal," it doesn't know how to connect the dots. It's like teaching a dog to fetch a "red ball," and then asking it to fetch a "toy." The dog knows the object, but not the word.
  • The "One-and-Done" Trap (Multi-Moment Gap): This was the bigger surprise.
    • The Training: In the old data, every question had exactly one correct answer (one moment). The AI learned to look for just one thing.
    • The Reality: A simple search like "cooking food" might happen five different times in a video.
    • The Crash: The AI, trained to find only one thing, would find the first instance of cooking and then stop looking. It was like a security guard who sees one person enter a building and assumes no one else will, so he stops watching the door.
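To make the multi-moment gap concrete, here is a small sketch (not the paper's exact metric) of why a one-answer model scores poorly once a query has several correct intervals. The numbers are made up for illustration:

```python
def iou(a, b):
    """Temporal IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def moments_recovered(predictions, ground_truth, thresh=0.5):
    """Count ground-truth moments matched by at least one prediction."""
    return sum(
        any(iou(p, g) >= thresh for p in predictions) for g in ground_truth
    )

# "cooking food" happens three times; a one-moment model predicts once.
gt = [(10, 30), (120, 150), (400, 430)]
one_shot = [(11, 29)]
print(moments_recovered(one_shot, gt))  # → 1 (finds 1 of 3 moments)
```

A model that stops after its first confident hit caps out at one recovered moment, no matter how many times the event actually occurs.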

4. The Culprit: "Query Collapse"

The researchers discovered a specific technical glitch they call decoder-query collapse.

Imagine the AI has a team of 100 detectives (decoder queries) ready to find moments in the video.

  • In Training: Because the training data only ever had one "criminal" (moment) to find, the detectives learned to sit around and let just one or two of them do all the work. The other 98 detectives went to sleep (became inactive).
  • In Reality: When you ask for "cooking food," there are 5 moments to find. But the AI only wakes up 2 detectives. They can only find 2 moments. The other 3 moments are missed because the AI's "team" is asleep.
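One rough way to see this collapse in numbers: over a validation set, count how many of the decoder queries ever "win". A toy sketch, where the scores are made up for illustration:

```python
# Sketch of measuring decoder-query collapse: how many of the N queries
# are ever the top-scoring ("awake") query across a set of videos?

def active_queries(score_matrix):
    """score_matrix[v][q] = confidence of query q on video v.
    Returns the set of queries that win (argmax) on at least one video."""
    winners = set()
    for per_video in score_matrix:
        winners.add(max(range(len(per_video)), key=per_video.__getitem__))
    return winners

# 5 videos, 10 queries: queries 3 and 7 dominate everywhere -> collapse.
scores = [
    [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1],
]
print(len(active_queries(scores)))  # → 2: only 2 of 10 queries ever wake up
```

A collapsed model can never return more distinct moments than it has awake queries, which is exactly why multi-moment queries break it.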

5. The Solution: Waking Up the Team

The researchers fixed this by changing the AI's architecture (its brain structure) in two clever ways:

  1. Stop the "Groupthink" (Removing Self-Attention): In the old setup, the detectives talked to each other and decided, "Hey, you do the work, I'll relax." The researchers cut off this conversation, forcing every detective to work independently.
  2. The "Random Nap" Strategy (Query Dropout): They randomly put some of the detectives to sleep (ignored their queries) during training. This sounds counter-intuitive, but it forced the AI to realize, "Oh no! If some detectives might be napping, the rest of us have to stay sharp or we'll miss the crime!" This trained the AI to keep its whole team active and ready.
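Here is a toy, pure-Python sketch of both fixes. The real models are transformer-based, and every name below is an illustrative assumption, not the authors' code:

```python
import random

def attend_to_video(query, video_features):
    """Stand-in for cross-attention: score each clip feature, keep the best."""
    return max(video_features, key=lambda f: f * query)

def decoder_layer(queries, video_features):
    """Fix 1 (no self-attention): queries never see each other, so they
    cannot coordinate ("you do the work, I'll relax"); each one searches
    the video independently."""
    return [attend_to_video(q, video_features) for q in queries]

def query_dropout(queries, drop_prob=0.2, rng=random):
    """Fix 2: during training, randomly drop queries so the model cannot
    lean on one or two of them -- the whole team must stay ready."""
    kept = [q for q in queries if rng.random() >= drop_prob]
    return kept or queries[:1]  # never drop every query
```

The key design point is what is *missing*: `decoder_layer` has no step where queries exchange information, and `query_dropout` only runs at training time, so at test time all queries are available and awake.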

The Result

By making these changes, the AI became much better at handling real-world, vague searches.

  • It improved its ability to find moments by up to 14.8%.
  • For tricky searches with multiple moments, it improved by nearly 22%.

The Takeaway

This paper teaches us that to build AI that works in the real world, we can't just train it on perfect, detailed descriptions. We have to train it to handle the messy, vague, and multi-layered way humans actually speak. And sometimes, the best way to fix a lazy team of AI detectives is to stop them from talking to each other and force them to stay awake!