VQPP: Video Query Performance Prediction Benchmark

This paper introduces VQPP, the first benchmark for video query performance prediction, comprising 56K text queries and 51K videos. The benchmark is used to evaluate a range of difficulty predictors and to demonstrate their utility in training large language models for query reformulation.

Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu

Published 2026-02-23

Imagine you are a librarian in a massive, chaotic library filled with millions of video clips. A patron walks up and asks, "Show me a video of a happy horse."

In a perfect world, the librarian would instantly pull up the perfect video. But in the real world, sometimes the librarian finds a great match, and sometimes they pull up a video of a sad horse, or a cartoon horse, or nothing at all.

The Problem: How does the librarian know before they start searching if the patron's request is easy to fulfill or a nightmare? If they know a request is "hard," they might try a different strategy, or they might warn the patron, "Hey, this is tricky, let's try rephrasing it."

This paper introduces a new tool called VQPP (Video Query Performance Prediction) to solve exactly that problem.

The Big Idea: The "Crystal Ball" for Search

The authors built a benchmark (a standardized test) to train computers to act as "crystal balls." Instead of actually searching the library first, the computer looks at the question (the query) and predicts: "Will this search be easy or hard?"

If the computer predicts the search will be hard, the system can automatically fix the question before wasting time searching.

How They Built the Test

To train these "crystal balls," the researchers needed a huge library and a way to grade the searches. They used:

  1. Two Giant Libraries: They grabbed data from MSR-VTT (10,000 random, wild videos like news and sports) and VATEX (41,000 shorter, more specific clips).
  2. Two Search Engines: They used two different "librarians" (AI models named GRAM and VAST) to actually perform the searches.
  3. The Scorecard: They ran 56,000 different questions. For each question, they checked: "Did the librarian find the right video in the top 10 results?"

This created a massive dataset where every question has a known "difficulty score."
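The scorecard step can be sketched in a few lines. This is an assumed simplification of the labeling setup (the function name, video IDs, and the easy/hard encoding are illustrative, not the authors' code): a query counts as "easy" if its ground-truth video appears in the search engine's top-10 results.

```python
# Hypothetical sketch of how difficulty labels could be derived:
# a query is "easy" (1.0) if the correct video appears in the top-k
# results returned by the retrieval engine, "hard" (0.0) otherwise.

def difficulty_label(ranked_video_ids, ground_truth_id, k=10):
    """Return 1.0 (easy) if the correct video is in the top-k, else 0.0 (hard)."""
    return 1.0 if ground_truth_id in ranked_video_ids[:k] else 0.0

# Toy example with made-up video IDs:
results = ["vid_042", "vid_917", "vid_003", "vid_555"]
print(difficulty_label(results, "vid_003"))  # -> 1.0 (correct video ranked 3rd)
print(difficulty_label(results, "vid_999"))  # -> 0.0 (correct video missing)
```

Running every query through both search engines this way yields one labeled example per (query, engine) pair, which is the training data for the predictors below.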

The Contest: Who is the Best Predictor?

The researchers tested different types of "predictors" (AI models) to see which one could guess the difficulty best. They split them into two teams:

Team 1: The "Pre-Retrieval" Scouts (The Guessers)
These scouts look only at the question text before searching.

  • The Old School Scout: Counts words, checks for confusing grammar, or looks up synonyms. (Like a human guessing, "That's a long sentence, it might be hard.")
  • The Deep Learner (BERT): A smart AI that reads the meaning of the sentence. It understands that "a fearful animation scene" is vague, while "a movie scene with Morgan Freeman running in armor" is specific.
  • The Chatbot (Llama): A giant language model asked to guess the difficulty based on examples.
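To make the pre-retrieval idea concrete, here is a minimal, self-contained sketch. It is a toy stand-in, not the paper's model: a bag-of-words hash encoder replaces BERT, and a logistic-regression head maps query features to a difficulty score trained against easy/hard labels. All function names and example labels below are illustrative.

```python
# Toy sketch of a pre-retrieval difficulty predictor (assumed setup:
# a deterministic bag-of-words encoder stands in for BERT).
import math

def encode(query, dim=64):
    """Toy encoder: hash each word into a bucket, then L2-normalize counts."""
    vec = [0.0] * dim
    for word in query.lower().split():
        bucket = sum(ord(c) * (i + 1) for i, c in enumerate(word)) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def predict(weights, query):
    """Regression head: dot(weights, features) squashed to (0, 1)."""
    z = sum(w * x for w, x in zip(weights, encode(query, len(weights))))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, dim=64, lr=0.5, epochs=300):
    """Fit logistic regression on (query, easy/hard) pairs via SGD."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for query, label in pairs:
            x = encode(query, dim)
            err = predict(weights, query) - label  # gradient of log loss
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights

# Toy labels: specific queries are "easy" (1.0), vague ones "hard" (0.0).
data = [
    ("a movie scene with Morgan Freeman running in armor", 1.0),
    ("a fearful animation scene", 0.0),
    ("a dog catching a red frisbee on a beach", 1.0),
    ("something scary", 0.0),
]
w = train(data)
print(predict(w, "something scary"))  # low score: predicted hard
```

The real predictor swaps the toy encoder for BERT embeddings fine-tuned on the 56K labeled queries, but the training loop has the same shape: text in, difficulty score out, no search required.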

Team 2: The "Post-Retrieval" Analysts (The Reviewers)
These analysts wait for the search to finish, look at the list of videos returned, and then say, "Wow, these results look messy, so the search was hard."

  • They use complex vision models to compare the videos in the results to see if they make sense together.
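One common way to measure whether results "make sense together" is embedding coherence. The sketch below is an assumed simplification (real systems would use vision-model embeddings of the retrieved videos; the 2-D vectors here are toy data): if the top results all point in different directions in embedding space, the list is messy and the query was probably hard.

```python
# Hedged sketch of a post-retrieval coherence signal: mean pairwise
# cosine similarity of the top-k result embeddings. Low coherence
# suggests a scattered, low-quality result list.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def coherence(embeddings):
    """Mean pairwise cosine similarity across all retrieved items."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

tight = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]       # results agree -> likely easy
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.3]]  # results disagree -> likely hard
print(coherence(tight) > coherence(scattered))  # -> True
```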

The Surprise Winner

You might think the "Reviewers" (Team 2) would win because they have the actual results to look at. But they didn't.

The winner was the Deep Learner (BERT) from Team 1.

  • Why? In video search, the "results list" is often very noisy. Even when the search engine fails, the list of wrong videos may look random and confusing, making it hard for the analyst to tell, from the results alone, whether the search actually succeeded.
  • The BERT model, however, realized that the question itself was the problem. It learned that vague questions lead to bad results, even without seeing the results. It's like a chef who knows a recipe will fail just by reading the ingredients list, without even having to cook it.

The Superpower: Fixing the Questions

The coolest part of the paper isn't just predicting difficulty; it's fixing the questions.

The researchers took their winning "crystal ball" (the BERT predictor) and used it as a coach for a language model (an AI that writes text).

  1. The AI writes a new version of a user's question.
  2. The "Coach" (BERT) scores the new question: "This one is better! It's more likely to find the right video."
  3. The AI learns from this score and gets better at writing questions.
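The loop above can be sketched as follows. This is a deliberately simplified, hypothetical version: a fixed list of candidate rewrites and a toy scorer stand in for the language model and the trained BERT predictor, and it shows only the scoring-and-selection side (in the paper, the score additionally serves as a training signal so the model improves over time).

```python
# Sketch of the predictor-as-coach idea (all names and data illustrative):
# generate candidate rewrites, score each with the difficulty predictor,
# and keep whichever query is predicted to retrieve best.

def toy_predictor(query):
    """Toy stand-in for the BERT coach: longer, more specific queries score higher."""
    return len(query.split()) / 20.0

def reformulate(query, candidates, predictor):
    """Pick the highest-scoring query (the original is kept if it wins)."""
    return max([query] + candidates, key=predictor)

rewrites = [
    "a scary cartoon",
    "a movie scene starring Morgan Freeman and men in armor running",
]
best = reformulate("a fearful animation scene", rewrites, toy_predictor)
print(best)  # the specific rewrite wins under the toy scorer
```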

The Result: When they tested this, the AI successfully rewrote vague questions like "a fearful animation scene" into specific ones like "a movie scene starring Morgan Freeman and men in armor running." The search engine then found the right video much more often!

The Takeaway

This paper is a big step forward because:

  1. It's the first of its kind: No one had ever built a standardized test for predicting video search difficulty before.
  2. It's efficient: You don't need to run expensive video searches to know if a question is good; a smart text-reader can tell you.
  3. It's practical: It can automatically improve how we talk to video search engines, making them feel much smarter and more helpful.

In short, VQPP teaches computers to listen to your question and say, "I can help you find that, but let's tweak your wording first so we don't waste time!"
