Beyond Caption-Based Queries for Video Moment Retrieval

This paper investigates the performance degradation of existing Video Moment Retrieval methods when transitioning from caption-based to search queries, identifies language and multi-moment gaps alongside a decoder-query collapse as key causes, and proposes architectural modifications to significantly improve generalization on multi-moment search queries.

David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray

Published 2026-03-04

Imagine you have a massive library of home videos, and you want to find a specific moment in them. You type a search query into a computer, hoping it finds the exact clip you're looking for. This is the job of Video Moment Retrieval (VMR).

For years, the AI models trained to do this have been like students who only studied for one very specific, oddly worded type of exam. Here is the problem the paper solves, explained simply:

1. The Problem: The "Over-Descriptive" Teacher

Currently, AI models are trained using captions written by humans who have already watched the video.

  • The Scenario: A human watches a video of a soccer game. They see a player in a yellow jersey score a goal.
  • The Caption (Training Data): "A man in a yellow jersey intercepts a loose pass from the opposing team near the box and scores a powerful volley."
  • The Reality: When a real user wants to find that moment, they don't know the player wore yellow, or that it was a volley. They just type: "When are goals being scored?"

The AI is like a student who memorized the teacher's detailed notes but fails the test when the question is asked in simple, vague language. The AI is "over-specialized" and gets confused by the real world.

2. The Experiment: Creating a "Real World" Test

The researchers realized that existing AI models were failing when tested on these simple, real-world searches. To prove this, they didn't just guess; they built a new testing ground.

They took three famous video datasets and used a smart AI (a "rewriter") to take those detailed, fancy captions and simplify them.

  • Original: "A man ties his running shoes before starting a marathon."
  • Simplified (Search Query): "A person getting ready to exercise."

They created three new benchmarks (HD-EPIC-S, YC2-S, ANC-S) where the "questions" were less detailed, phrased the way real users actually search.
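If you're curious what that rewriting step might look like in code, here is a minimal sketch. The prompt wording and the function name are illustrative assumptions, not the authors' actual pipeline; a real setup would send this prompt to a language model.

```python
# Hypothetical sketch of the caption-simplification step.
# The prompt text below is an assumption for illustration only.

PROMPT_TEMPLATE = (
    "Rewrite this video caption as a short, vague search query a real "
    "user might type, dropping details they could not know in advance:\n"
    "Caption: {caption}\n"
    "Search query:"
)

def build_rewrite_prompt(caption: str) -> str:
    """Fill the template with one ground-truth caption."""
    return PROMPT_TEMPLATE.format(caption=caption)

if __name__ == "__main__":
    caption = "A man ties his running shoes before starting a marathon."
    print(build_rewrite_prompt(caption))
```

The rewriter model would then answer with something like "A person getting ready to exercise", which becomes the new test query.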

3. The Discovery: Two Big Hurdles

When they tested the old AI models on these new, simple questions, performance crashed. They found two main reasons why:

  • The Language Gap: The AI is used to hearing "yellow jersey" and "volley." When it hears "exercise" or "goal," it doesn't know how to connect the dots. It's like teaching a dog to fetch a "red ball," and then asking it to fetch a "toy." The dog knows the object, but not the word.
  • The "One-and-Done" Trap (Multi-Moment Gap): This was the bigger surprise.
    • The Training: In the old data, every question had exactly one correct answer (one moment). The AI learned to look for just one thing.
    • The Reality: A simple search like "cooking food" might happen five different times in a video.
    • The Crash: The AI, trained to find only one thing, would find the first instance of cooking and then stop looking. It was like a security guard who sees one person enter a building and assumes no one else will, so he stops watching the door.
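To make the multi-moment gap concrete, here is a small sketch (not the paper's exact metric) of why a one-answer model scores poorly once a query has several correct intervals. The numbers are made up for illustration:

```python
def iou(a, b):
    """Temporal IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def moments_recovered(predictions, ground_truth, thresh=0.5):
    """Count ground-truth moments matched by at least one prediction."""
    return sum(
        any(iou(p, g) >= thresh for p in predictions) for g in ground_truth
    )

# "cooking food" happens three times; a one-moment model predicts once.
gt = [(10, 30), (120, 150), (400, 430)]
one_shot = [(11, 29)]
print(moments_recovered(one_shot, gt))  # → 1 (finds 1 of 3 moments)
```

A model that stops after its first confident hit caps out at one recovered moment, no matter how many times the event actually occurs.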

4. The Culprit: "Query Collapse"

The researchers discovered a specific technical glitch they call decoder-query collapse.

Imagine the AI has a team of 100 detectives (decoder queries) ready to find moments in the video.

  • In Training: Because the training data only ever had one "criminal" (moment) to find, the detectives learned to sit around and let just one or two of them do all the work. The other 98 detectives went to sleep (became inactive).
  • In Reality: When you ask for "cooking food," there are 5 moments to find. But the AI only wakes up 2 detectives. They can only find 2 moments. The other 3 moments are missed because the AI's "team" is asleep.
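One rough way to see this collapse in numbers: over a validation set, count how many of the decoder queries ever "win". A toy sketch, where the scores are made up for illustration:

```python
# Sketch of measuring decoder-query collapse: how many of the N queries
# are ever the top-scoring ("awake") query across a set of videos?

def active_queries(score_matrix):
    """score_matrix[v][q] = confidence of query q on video v.
    Returns the set of queries that win (argmax) on at least one video."""
    winners = set()
    for per_video in score_matrix:
        winners.add(max(range(len(per_video)), key=per_video.__getitem__))
    return winners

# 5 videos, 10 queries: queries 3 and 7 dominate everywhere -> collapse.
scores = [
    [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1],
]
print(len(active_queries(scores)))  # → 2: only 2 of 10 queries ever wake up
```

A collapsed model can never return more distinct moments than it has awake queries, which is exactly why multi-moment queries break it.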

5. The Solution: Waking Up the Team

The researchers fixed this by changing the AI's architecture (its brain structure) in two clever ways:

  1. Stop the "Groupthink" (Removing Self-Attention): In the old setup, the detectives talked to each other and decided, "Hey, you do the work, I'll relax." The researchers cut off this conversation, forcing every detective to work independently.
  2. The "Random Nap" Strategy (Query Dropout): They randomly put some of the detectives to sleep (ignored their queries) during training. This sounds counter-intuitive, but it forced the AI to realize, "Oh no! If some detectives might be napping, the rest of us have to stay sharp or we'll miss the crime!" This trained the AI to keep its whole team active and ready.
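Here is a toy, pure-Python sketch of both fixes. The real models are transformer-based, and every name below is an illustrative assumption, not the authors' code:

```python
import random

def attend_to_video(query, video_features):
    """Stand-in for cross-attention: score each clip feature, keep the best."""
    return max(video_features, key=lambda f: f * query)

def decoder_layer(queries, video_features):
    """Fix 1 (no self-attention): queries never see each other, so they
    cannot coordinate ("you do the work, I'll relax"); each one searches
    the video independently."""
    return [attend_to_video(q, video_features) for q in queries]

def query_dropout(queries, drop_prob=0.2, rng=random):
    """Fix 2: during training, randomly drop queries so the model cannot
    lean on one or two of them -- the whole team must stay ready."""
    kept = [q for q in queries if rng.random() >= drop_prob]
    return kept or queries[:1]  # never drop every query
```

The key design point is what is *missing*: `decoder_layer` has no step where queries exchange information, and `query_dropout` only runs at training time, so at test time all queries are available and awake.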

The Result

By making these changes, the AI became much better at handling real-world, vague searches.

  • It improved its ability to find moments by up to 14.8%.
  • For tricky searches with multiple moments, it improved by nearly 22%.

The Takeaway

This paper teaches us that to build AI that works in the real world, we can't just train it on perfect, detailed descriptions. We have to train it to handle the messy, vague, and multi-layered way humans actually speak. And sometimes, the best way to fix a lazy team of AI detectives is to stop them from talking to each other and force them to stay awake!