ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

This paper introduces ShotFinder, a new benchmark and a three-stage retrieval pipeline for open-domain video shot retrieval via web search. It formalizes editing requirements as keyframe-oriented descriptions with five controllable constraints, and its evaluation reveals that current multimodal models still lag far behind humans on complex attributes such as color and visual style.

Tao Yu, Haopeng Jin, Hao Wang, Shenghua Chai, Yujia Yang, Junhao Gong, Jiaming Guo, Minghui Zhang, Xinlong Chen, Zhenghao Zhang, Yuxuan Zhou, Yufei Xiong, Shanbin Zhang, Jiabing Yang, Hongzhu Yi, Xinming Wang, Cheng Zhong, Xiao Ma, Zhang Zhang, Yan Huang, Liang Wang

Published 2026-02-17

🎬 The Big Idea: Finding a Needle in a Haystack of Movies

Imagine you are a video editor working on a movie. You need a very specific clip: "A long-haired woman sitting at a table, leaning forward, looking sad, with warm sunset lighting and jazz music playing."

In the old days, you'd have to scroll through hours of footage, hoping to find that exact moment. Today, we have super-smart AI (Large Language Models) that can read and understand the world. But can they find that specific 5-second clip inside a 2-hour movie just by reading your description?

The Answer: Not really. Not yet.

This paper introduces ShotFinder, a new "test" (benchmark) and a new "method" to see how good AI really is at this specific task.


🧪 Part 1: The Test (The "ShotFinder" Benchmark)

The researchers realized that while AI is great at finding text or static images, it's terrible at finding specific moments in videos. So, they built a giant challenge course for AI.

The Course:
They created 1,210 specific challenges. Each challenge is a detailed description of a video clip (a "shot") taken from YouTube.

  • The Description: "A man in a blue suit running up stairs."
  • The Constraints: They added extra rules to make it harder, like:
    • Time: "It must happen after he falls down."
    • Color: "The whole scene must look very warm and orange."
    • Style: "It must look like a 1990s cartoon."
    • Sound: "You must hear a dog barking."
    • Quality: "It must be high-definition (1080p)."
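
Concretely, one benchmark challenge can be pictured as a shot description plus optional constraint fields. Here is a minimal Python sketch; the field names are illustrative assumptions, not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of one ShotFinder challenge; field names are
# illustrative, not the benchmark's real schema.
@dataclass
class ShotQuery:
    description: str                          # the shot description itself
    time_constraint: Optional[str] = None     # e.g. "after he falls down"
    color_constraint: Optional[str] = None    # e.g. "warm, orange palette"
    style_constraint: Optional[str] = None    # e.g. "looks like a 1990s cartoon"
    audio_constraint: Optional[str] = None    # e.g. "a dog barking is audible"
    quality_constraint: Optional[str] = None  # e.g. "1080p or higher"

    def active_constraints(self) -> list[str]:
        """Names of the five controllable constraints this query actually uses."""
        fields = {
            "time": self.time_constraint,
            "color": self.color_constraint,
            "style": self.style_constraint,
            "audio": self.audio_constraint,
            "quality": self.quality_constraint,
        }
        return [name for name, value in fields.items() if value is not None]

q = ShotQuery(
    description="A man in a blue suit running up stairs",
    time_constraint="after he falls down",
    color_constraint="warm, orange palette",
)
print(q.active_constraints())  # → ['time', 'color']
```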

The Goal: The AI has to go to the internet (like YouTube), find the right video, and then pinpoint the exact second that matches the description.

The Result: The AI struggled. Even the smartest models (like GPT-5 or Gemini) only got about 25% to 27% of the answers right. Humans, on the other hand, got about 88% right. It turns out, finding a specific mood or color in a video is still very hard for computers.


🛠️ Part 2: The Solution (The "Imagination" Method)

Since the AI was failing, the authors built a new way to help it. They call the pipeline ShotFinder too, so the name covers both the benchmark and the method.

Think of this method as a three-step detective process:

Step 1: The "Imagination" Phase (The Dreamer)

If you ask an AI to search for "a sad woman at a table," it might just search for those words. But video titles are weird. A video might be titled "How to make sad coffee" or "Cinematic rain scenes."

  • The Trick: The AI is told to imagine the whole movie first. It asks itself: "If this sad woman were in a movie, what would the title of that movie be? What genre is it?"
  • The Analogy: Instead of looking for a specific brick, the AI imagines the whole house the brick belongs to, then looks for the house. This helps it find the right video library.
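
The "imagine first, then search" idea can be sketched as a single prompting helper. Everything here is an assumption for illustration: `call_llm` stands in for whatever chat model the pipeline uses, and the prompt wording is not the paper's.

```python
def imagine_search_queries(shot_description: str, call_llm) -> list[str]:
    """Ask the model to imagine the source video before generating queries."""
    prompt = (
        "You are helping locate a specific video shot on the web.\n"
        f"Shot description: {shot_description}\n"
        "First imagine the full video this shot likely comes from: its probable "
        "title, genre, and channel. Then output three web search queries, "
        "one per line, that would surface that video."
    )
    # One query per non-empty line of the model's response.
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# Stubbed model response, just to show the shape of the output:
stub = lambda _: (
    "cinematic short film sad woman cafe\n"
    "sunset jazz cafe scene short film\n"
    "melancholy table scene warm lighting"
)
queries = imagine_search_queries("a sad woman at a table", stub)
print(queries)
```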

Step 2: The "Retriever" Phase (The Scavenger)

Once the AI has its "imagined" search terms, it goes to the internet (YouTube) and downloads a bunch of candidate videos that might contain the clip.
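
That retrieval step can be sketched as fanning the imagined queries out to a search backend and deduplicating the hits. `web_search` is a placeholder for any video search API, not a real library call.

```python
def retrieve_candidates(queries, web_search, top_k=5):
    """Collect unique candidate videos across all imagined search queries."""
    seen, candidates = set(), []
    for query in queries:
        for video_id, title in web_search(query)[:top_k]:
            if video_id not in seen:  # keep each video only once
                seen.add(video_id)
                candidates.append((video_id, title))
    return candidates

# Stub backend that returns the same two hits for every query,
# so the cross-query deduplication is visible:
stub_search = lambda _: [("vid1", "Sad cafe scene"), ("vid2", "Jazz sunset")]
candidates = retrieve_candidates(["query one", "query two"], stub_search)
print(candidates)  # → [('vid1', 'Sad cafe scene'), ('vid2', 'Jazz sunset')]
```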

Step 3: The "Localizer" Phase (The Sniper)

Now the AI has a pile of videos. It needs to find the exact 5-second clip.

  • It breaks the video down into many small snapshots (frames).
  • It compares the snapshots to your description.
  • The Sniper: It picks the one frame that best matches your description and says, "This is it!"
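
If the description and each sampled frame are embedded in a shared space (a CLIP-style encoder is a natural choice, though the paper's exact models may differ), the localization step above reduces to a nearest-neighbor search over frames. A minimal NumPy sketch:

```python
import numpy as np

def locate_best_frame(query_emb: np.ndarray, frame_embs: np.ndarray) -> int:
    """Index of the sampled frame with highest cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return int(np.argmax(f @ q))

# Toy 2-D embeddings: frame 1 points almost the same way as the query.
query = np.array([1.0, 0.0])
frames = np.array([[0.0, 1.0],   # frame 0: orthogonal to the query
                   [0.9, 0.1],   # frame 1: near-duplicate direction
                   [0.5, 0.5]])  # frame 2: halfway between
best = locate_best_frame(query, frames)
print(best)  # → 1
```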

📉 What Did They Learn? (The Takeaways)

After running the test, the researchers found some interesting things:

  1. AI is getting better, but it's still a rookie: The best AI models are far behind human editors. They miss subtle details like "sadness" or "warm lighting."
  2. Some things are easier than others:
    • Easy: Finding things based on Time (e.g., "after the explosion"). AI is okay at this.
    • Hard: Finding things based on Color or Art Style. AI is terrible at telling the difference between "warm sunset" and "cool morning."
  3. Bigger isn't always better: Sometimes a smaller, smarter AI model did just as well as a massive, expensive one. It's not just about how big the brain is; it's about how it's wired.
  4. The "Imagination" helps: The method that used "Video Imagination" (Step 1) worked much better than just searching for keywords. It proved that AI needs to understand the context of a video, not just the words.

🚀 Why Does This Matter?

This isn't just a game. This technology is the future of video editing.

  • Imagine a future where you tell your computer: "Show me the clip where the hero cries in the rain," and it instantly finds it in a library of 10,000 movies.
  • Right now, that's science fiction. ShotFinder is the first step to making it a reality, showing us exactly where the AI is failing so we can fix it.

In short: We taught AI to dream up the right search terms, but it still needs a lot of practice to find the perfect video clip.
