ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

This paper introduces ForeSea, an AI forensic search system featuring a three-stage pipeline for multimodal video retrieval and reasoning, alongside ForeSeaQA, the first benchmark designed to evaluate complex image-and-text queries with precise temporal grounding in long-horizon surveillance footage.

Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli

Published 2026-03-25

Imagine you are a detective trying to solve a crime, but instead of a few hours of footage, you have weeks of video from hundreds of security cameras. You need to find one specific person, or a specific moment, like "When did the guy in the red hat join the fight?"

Doing this manually is like trying to find a specific grain of sand on a beach by looking at every single grain one by one. It's impossible.

This paper introduces ForeSea, a new AI system designed to be the ultimate "digital detective" for surveillance video. Here is how it works, explained simply:

1. The Problem: The "Needle in a Haystack"

Current AI tools are like a librarian who only reads the titles of books. If you ask, "Show me the book about the guy in the red hat," the librarian might guess based on the title, but they can't actually see the guy in the photo you show them, nor can they tell you exactly on which page (or at which second of video) the action happens.

  • Old AI: "Here is a list of videos that mention 'fighting'." (But it doesn't know who is fighting or when).
  • The Gap: Real detectives need to say, "Here is a photo of the suspect. Show me every time this specific person appears and what they were doing."

2. The Solution: ForeSea (The "Smart Filter")

The authors built a system called ForeSea that acts like a three-step assembly line to solve this problem.

Step 1: The "Human Tracker" (The Bouncer)

Imagine a bouncer at a club who only lets people in who match a specific description.

  • Instead of watching the whole video, ForeSea first uses a tracker to find only the people in the video.
  • It cuts out all the empty streets, the trees, and the cars. It only keeps the little video clips where a person is visible.
  • Analogy: If the video is a 10-hour movie, this step cuts it down to just the 30 minutes where the suspect actually appears.
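The filtering idea above can be sketched in a few lines. This is a minimal illustration, not the paper's actual tracker: it assumes a person detector (not shown) has already flagged which sampled frames contain a person, and simply collapses those flags into time spans, dropping very short spurious runs. The function and parameter names are made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds

def extract_person_clips(person_flags, fps=1.0, min_len_s=2.0):
    """Collapse per-frame person detections into clips.

    person_flags: one boolean per sampled frame, True if a person
    detector (not shown) fired on that frame. Returns only the spans
    where a person is visible, skipping runs shorter than min_len_s.
    """
    clips, start = [], None
    for i, has_person in enumerate(person_flags):
        if has_person and start is None:
            start = i                       # a person just appeared
        elif not has_person and start is not None:
            clips.append(Clip(start / fps, i / fps))  # person left frame
            start = None
    if start is not None:                   # video ended mid-clip
        clips.append(Clip(start / fps, len(person_flags) / fps))
    return [c for c in clips if c.end_s - c.start_s >= min_len_s]
```

Everything outside the returned clips (empty streets, trees, cars) never reaches the later, more expensive stages.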

Step 2: The "Multilingual Indexer" (The Smart Librarian)

Now, the system has thousands of tiny clips of people. It needs to organize them so you can find them later.

  • Most systems only understand text. If you type "red hat," it finds clips with that text.
  • ForeSea uses a Multimodal Encoder. Think of this as a librarian who understands both words and pictures.
  • You can ask: "Show me the guy in the red hat" (Text) OR you can upload a photo of the guy and ask "When did this person enter?" (Image + Text). The system understands both at the same time and creates a "search index" for them.
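The indexing step can be pictured as embedding-based retrieval. The sketch below is an assumption-heavy stand-in for the paper's multimodal encoder: real text and image embeddings are replaced with plain lists of numbers, the "fusion" of text and image is a simple average, and retrieval is cosine similarity over a tiny in-memory index. All function names here are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(text_vec, image_vec=None):
    """Combine a text embedding and an optional image embedding into one
    query vector. Averaging stands in for a real multimodal encoder."""
    if image_vec is None:
        return text_vec
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]

def search(index, query_vec, top_k=3):
    """index: list of (clip_id, embedding) pairs.
    Returns the top_k clip ids ranked by similarity to the query."""
    ranked = sorted(index, key=lambda e: cosine(e[1], query_vec), reverse=True)
    return [clip_id for clip_id, _ in ranked[:top_k]]
```

Because text-only and text-plus-photo queries both end up as a single query vector, the same index serves both kinds of question.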

Step 3: The "Detective Brain" (The Reasoner)

Once the system finds the top 3 most likely clips, it sends them to a powerful AI brain (a Video Large Language Model).

  • This brain looks at the clips, reads the question, and says: "Yes, I see him. He entered the building at 10:35 AM and was wearing a red hat."
  • Crucially, it gives you the exact timestamp (the "temporal grounding") so you don't have to scrub through the video yourself.
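As a rough sketch of this last stage, the retrieved clips and the question can be packed into a single prompt for a video LLM. The actual model call is omitted; what matters is that each clip carries its absolute start and end time, which is what lets the model answer with a real timestamp instead of a vague "somewhere in the video." The prompt wording and function name are hypothetical.

```python
def build_reasoner_prompt(question, clips):
    """Format the top-ranked clips plus the user's question into one
    prompt for a video LLM (model call not shown).

    clips: list of (clip_id, start_s, end_s) tuples; the timestamps are
    passed through so the model can ground its answer in real time.
    """
    lines = ["You are a forensic video analyst. "
             "Answer the question and cite an exact timestamp."]
    for i, (clip_id, start_s, end_s) in enumerate(clips, 1):
        lines.append(f"Clip {i} ({clip_id}): "
                     f"covers {start_s:.0f}s to {end_s:.0f}s of the source video.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Only the handful of top-ranked clips ever reach this expensive model, which is where the speedup over "watch everything" systems comes from.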

3. The New Playground: ForeSeaQA

To prove their system works, the authors realized there was no "test" for this kind of specific, image-based searching. So, they built a new benchmark called ForeSeaQA.

  • The Test: They created 1,000+ questions based on real crime videos.
  • The Twist: Some questions are just text ("Did a fight happen?"), but others are Multimodal (Here is a photo of a person + "When did this person start running?").
  • The Result: ForeSea crushed the competition. It was not only more accurate at finding the answer but also much better at pinpointing the exact time it happened.

4. Why This Matters (The "Aha!" Moment)

  • Speed: Because ForeSea filters out the boring parts of the video first, it answers questions twice as fast as other systems. It doesn't waste time watching empty hallways.
  • Accuracy: It doesn't just guess; it points to the evidence. If you ask "Did he steal the bag?", it shows you the 5-second clip where the theft happened.
  • Flexibility: It works even if you don't know the person's name, just what they look like (the photo).

Summary Analogy

Imagine you are looking for a lost toy in a giant, messy attic.

  • Old AI: Reads a list of items in the attic and guesses, "Maybe the toy is near the boxes." You have to climb through the whole attic to check.
  • ForeSea:
    1. Filters: Only looks at the piles where a child was playing.
    2. Search: You show it a picture of the toy. It instantly finds the 3 piles that look like they contain that toy.
    3. Answer: It hands you the toy and says, "It was in the red box at 2:00 PM."

This paper is a big step forward because it moves surveillance AI from "guessing based on text" to "seeing and understanding based on images and time," making it a true partner for real-world investigations.
