ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

This paper introduces ForeSea, an AI forensic search system featuring a three-stage pipeline for multimodal video retrieval and reasoning, alongside ForeSeaQA, the first benchmark designed to evaluate complex image-and-text queries with precise temporal grounding in long-horizon surveillance footage.

Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli

Published 2026-03-25

Imagine you are a detective trying to solve a crime, but instead of a few hours of footage, you have weeks of video from hundreds of security cameras. You need to find one specific person, or a specific moment, like "When did the guy in the red hat join the fight?"

Doing this manually is like trying to find a specific grain of sand on a beach by looking at every single grain one by one. It's impossible.

This paper introduces ForeSea, a new AI system designed to be the ultimate "digital detective" for surveillance video. Here is how it works, explained simply:

1. The Problem: The "Needle in a Haystack"

Current AI tools are like a librarian who only reads the titles of books. If you ask, "Show me the book about the guy in the red hat," the librarian might guess based on the title, but they can't actually see the guy in the photo you show them, nor can they tell you exactly on which page (or at which second of video) the action happens.

  • Old AI: "Here is a list of videos that mention 'fighting'." (But it doesn't know who is fighting or when).
  • The Gap: Real detectives need to say, "Here is a photo of the suspect. Show me every time this specific person appears and what they were doing."

2. The Solution: ForeSea (The "Smart Filter")

The authors built a system called ForeSea that acts like a three-step assembly line to solve this problem.

Step 1: The "Human Tracker" (The Bouncer)

Imagine a bouncer at a club who only lets people in who match a specific description.

  • Instead of watching the whole video, ForeSea first uses a tracker to find only the people in the video.
  • It cuts out all the empty streets, the trees, and the cars. It only keeps the little video clips where a person is visible.
  • Analogy: If the video is a 10-hour movie, this step cuts it down to just the 30 minutes where the suspect actually appears.
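The filtering idea above can be sketched in a few lines. This is a minimal illustration, not the paper's actual tracker: it assumes a person detector (not shown) has already flagged which sampled frames contain a person, and simply collapses those flags into time spans, dropping very short spurious runs. The function and parameter names are made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds

def extract_person_clips(person_flags, fps=1.0, min_len_s=2.0):
    """Collapse per-frame person detections into clips.

    person_flags: one boolean per sampled frame, True if a person
    detector (not shown) fired on that frame. Returns only the spans
    where a person is visible, skipping runs shorter than min_len_s.
    """
    clips, start = [], None
    for i, has_person in enumerate(person_flags):
        if has_person and start is None:
            start = i                       # a person just appeared
        elif not has_person and start is not None:
            clips.append(Clip(start / fps, i / fps))  # person left frame
            start = None
    if start is not None:                   # video ended mid-clip
        clips.append(Clip(start / fps, len(person_flags) / fps))
    return [c for c in clips if c.end_s - c.start_s >= min_len_s]
```

Everything outside the returned clips (empty streets, trees, cars) never reaches the later, more expensive stages.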

Step 2: The "Multilingual Indexer" (The Smart Librarian)

Now, the system has thousands of tiny clips of people. It needs to organize them so you can find them later.

  • Most systems only understand text. If you type "red hat," it finds clips with that text.
  • ForeSea uses a Multimodal Encoder. Think of this as a librarian who understands both words and pictures.
  • You can ask: "Show me the guy in the red hat" (Text) OR you can upload a photo of the guy and ask "When did this person enter?" (Image + Text). The system understands both at the same time and creates a "search index" for them.
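The indexing step can be pictured as embedding-based retrieval. The sketch below is an assumption-heavy stand-in for the paper's multimodal encoder: real text and image embeddings are replaced with plain lists of numbers, the "fusion" of text and image is a simple average, and retrieval is cosine similarity over a tiny in-memory index. All function names here are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(text_vec, image_vec=None):
    """Combine a text embedding and an optional image embedding into one
    query vector. Averaging stands in for a real multimodal encoder."""
    if image_vec is None:
        return text_vec
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]

def search(index, query_vec, top_k=3):
    """index: list of (clip_id, embedding) pairs.
    Returns the top_k clip ids ranked by similarity to the query."""
    ranked = sorted(index, key=lambda e: cosine(e[1], query_vec), reverse=True)
    return [clip_id for clip_id, _ in ranked[:top_k]]
```

Because text-only and text-plus-photo queries both end up as a single query vector, the same index serves both kinds of question.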

Step 3: The "Detective Brain" (The Reasoner)

Once the system finds the top 3 most likely clips, it sends them to a powerful AI brain (a Video Large Language Model).

  • This brain looks at the clips, reads the question, and says: "Yes, I see him. He entered the building at 10:35 AM and was wearing a red hat."
  • Crucially, it gives you the exact timestamp (the "temporal grounding") so you don't have to scrub through the video yourself.
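As a rough sketch of this last stage, the retrieved clips and the question can be packed into a single prompt for a video LLM. The actual model call is omitted; what matters is that each clip carries its absolute start and end time, which is what lets the model answer with a real timestamp instead of a vague "somewhere in the video." The prompt wording and function name are hypothetical.

```python
def build_reasoner_prompt(question, clips):
    """Format the top-ranked clips plus the user's question into one
    prompt for a video LLM (model call not shown).

    clips: list of (clip_id, start_s, end_s) tuples; the timestamps are
    passed through so the model can ground its answer in real time.
    """
    lines = ["You are a forensic video analyst. "
             "Answer the question and cite an exact timestamp."]
    for i, (clip_id, start_s, end_s) in enumerate(clips, 1):
        lines.append(f"Clip {i} ({clip_id}): "
                     f"covers {start_s:.0f}s to {end_s:.0f}s of the source video.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Only the handful of top-ranked clips ever reach this expensive model, which is where the speedup over "watch everything" systems comes from.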

3. The New Playground: ForeSeaQA

To prove their system works, the authors realized there was no "test" for this kind of specific, image-based searching. So, they built a new benchmark called ForeSeaQA.

  • The Test: They created 1,000+ questions based on real crime videos.
  • The Twist: Some questions are just text ("Did a fight happen?"), but others are Multimodal (Here is a photo of a person + "When did this person start running?").
  • The Result: ForeSea crushed the competition. It was not only more accurate at finding the answer but also much better at pinpointing the exact time it happened.

4. Why This Matters (The "Aha!" Moment)

  • Speed: Because ForeSea filters out the boring parts of the video first, it answers questions twice as fast as other systems. It doesn't waste time watching empty hallways.
  • Accuracy: It doesn't just guess; it points to the evidence. If you ask "Did he steal the bag?", it shows you the 5-second clip where the theft happened.
  • Flexibility: It works even if you don't know the person's name, just what they look like (the photo).

Summary Analogy

Imagine you are looking for a lost toy in a giant, messy attic.

  • Old AI: Reads a list of items in the attic and guesses, "Maybe the toy is near the boxes." You have to climb through the whole attic to check.
  • ForeSea:
    1. Filters: Only looks at the piles where a child was playing.
    2. Search: You show it a picture of the toy. It instantly finds the 3 piles that look like they contain that toy.
    3. Answer: It hands you the toy and says, "It was in the red box at 2:00 PM."

This paper is a big step forward because it moves surveillance AI from "guessing based on text" to "seeing and understanding based on images and time," making it a true partner for real-world investigations.
