LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Imagine you have a massive library containing 250 gigabytes of video footage: thousands of hours of news, documentaries, and travel shows. Now imagine you need to find one 5-second clip in which a man is being interviewed in front of a particular cathedral in Hanoi, but all you have is a vague description like "that old church with the twin towers."

Trying to find this needle in a haystack using a standard search engine is like trying to find a specific book by guessing the color of its cover. It's frustrating and often fails.

LLandMark is a new team of digital detectives designed to solve this problem. Instead of one giant brain trying to do everything, it uses a multi-agent framework, which you can think of as a specialized task force where every member has a specific job.

Here is how LLandMark works, broken down into simple analogies:

1. The Team of Detectives (The Multi-Agent Framework)

When you ask a question, LLandMark doesn't just "search." It breaks the job down among four specialized agents:

The Planner (Query Parsing Agent): This is the team leader. When you type, "Show me the video near the Turtle Tower," the Planner doesn't just take the words literally. It analyzes your intent, breaks the sentence into parts, and creates a "search map." It decides, "Okay, we need to look for visual clues of a tower, listen for the words 'Turtle Tower' in the audio, and check for text on the screen."
The Local Expert (Landmark Knowledge Agent): This is the cultural specialist. Standard search engines often fail with local landmarks because they don't "know" what a specific Vietnamese landmark looks like. If you say "St. Joseph's Cathedral," this agent knows that's not just a building; it's a "dark gray stone building with twin square bell towers and Gothic architecture." It rewrites your query into a detailed visual description so the computer can actually see what you mean.
The Scanners (Parallel Search): While the Planner and Expert are working, a team of scanners runs in parallel. One scans the video for objects (like "person" or "car"), one listens to the audio (transcribing speech), and one reads the text on the screen (OCR).
The Judge (Reranking Agent): This agent takes all the clues gathered by the scanners and the experts. It weighs the evidence: "The audio said 'cathedral,' the visual scan saw 'twin towers,' and the text said 'Hanoi.' This is a match!" It then synthesizes a clear answer for you.
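
To make the division of labor concrete, here is a minimal Python sketch of the four roles. It is a toy under stated assumptions, not the authors' implementation: the landmark table, the three sample clips, the word-overlap scoring, and the fusion weights are all hypothetical stand-ins for the real models.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical landmark knowledge: name -> visual description (Local Expert).
LANDMARK_DESCRIPTIONS = {
    "st. joseph's cathedral": (
        "dark gray stone building with twin square bell towers "
        "and Gothic architecture"
    ),
}

# Toy index: pre-extracted visual tags, speech transcript, and on-screen text.
CLIPS = {
    "clip_a": {"visual": "dark gray stone building twin square bell towers gothic",
               "audio": "we are standing outside the cathedral",
               "ocr": "st. joseph's cathedral hanoi"},
    "clip_b": {"visual": "street market stalls crowd",
               "audio": "fresh fruit for sale",
               "ocr": "ben thanh"},
    "clip_c": {"visual": "gray stone bridge river",
               "audio": "the cathedral is nearby",
               "ocr": ""},
}


@dataclass
class QueryPlan:
    visual_query: str  # text handed to the visual scanner
    keywords: list     # terms to look for in speech and on-screen text


def plan_query(user_query):
    """Planner + Local Expert: swap a known landmark name for its description."""
    visual, keywords = user_query, []
    for name, description in LANDMARK_DESCRIPTIONS.items():
        if name in user_query.lower():
            visual = description
            keywords = name.split()
    return QueryPlan(visual_query=visual, keywords=keywords)


def overlap(terms, text):
    """Fraction of query terms that appear as whole words in the text."""
    words = set(text.lower().split())
    terms = [t.lower() for t in terms]
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def scan(modality, terms):
    """One Scanner: score every clip on a single modality."""
    return {cid: overlap(terms, clip[modality]) for cid, clip in CLIPS.items()}


def search(user_query, weights=(0.5, 0.3, 0.2)):
    plan = plan_query(user_query)
    jobs = {"visual": plan.visual_query.split(),
            "audio": plan.keywords,
            "ocr": plan.keywords}
    # The Scanners run in parallel, one task per modality.
    with ThreadPoolExecutor() as pool:
        futures = {m: pool.submit(scan, m, terms) for m, terms in jobs.items()}
        scores = {m: f.result() for m, f in futures.items()}
    # The Judge: fuse the evidence with a weighted sum (weights are assumed).
    w = dict(zip(("visual", "audio", "ocr"), weights))
    fused = {cid: sum(w[m] * scores[m][cid] for m in w) for cid in CLIPS}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


ranking = search("Interview in front of St. Joseph's Cathedral")
for clip_id, score in ranking:
    print(clip_id, round(score, 3))
```

In this toy the clip that matches on all three kinds of evidence ranks first, which is the intuition behind combining the scanners rather than trusting any single one.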

2. The "Magic Translator" for Text (OCR Refinement)

One of the biggest headaches in video search is reading text that appears on screen. In Vietnam, text often has diacritics (accents like á, ờ, ỹ). Standard tools often drop these accents, turning "Hà Nội" into "Ha Noi," which changes the meaning or makes it unsearchable.

LLandMark uses a Gemini model as a "spell-checker and translator." It takes the messy, accent-less text read from the video and repairs it, restoring the correct accents and fixing typos. It's like having a native speaker sit next to the computer, correcting its reading mistakes so that it understands the text in its Vietnamese context.
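
A short Python sketch shows why dropped accents are such a problem. It simulates lossy OCR by stripping diacritics, then looks up which correctly accented terms could have produced the result. The vocabulary and the lookup table are hypothetical; the real system asks a language model to choose using context, and this sketch only lists the candidates.

```python
import unicodedata


def strip_diacritics(text):
    """Remove Vietnamese diacritics, mimicking lossy OCR output."""
    # đ/Đ are separate letters, not a base letter plus a combining mark,
    # so Unicode decomposition does not touch them; map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining accent marks left by NFD.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical vocabulary of correctly accented terms.
VOCABULARY = ["Hà Nội", "Đà Nẵng", "bán", "bạn", "bàn"]


def build_restoration_index(vocabulary):
    """Map each accent-less, lowercased form to the accented terms behind it."""
    index = {}
    for term in vocabulary:
        index.setdefault(strip_diacritics(term).lower(), []).append(term)
    return index


def restore_candidates(ocr_text, index):
    """Return every accented term that the OCR text could stand for."""
    return index.get(strip_diacritics(ocr_text).lower(), [])


index = build_restoration_index(VOCABULARY)
print(strip_diacritics("Hà Nội"))
print(restore_candidates("Ha Noi", index))
print(restore_candidates("ban", index))
```

"Ha Noi" has a single candidate, so a lookup is enough. "ban" could be "bán" (to sell), "bạn" (friend), or "bàn" (table), and only the surrounding words can settle which one was on screen. That ambiguity is the part a language model is suited to resolve.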

3. The "Show Me, Don't Just Tell Me" Feature (Image-to-Image)

Sometimes, words aren't enough. If you ask for "Ben Thanh Market," a computer might just look for the words "Ben Thanh" or a generic picture of a market.

LLandMark has a special mode where it acts like a visual detective.

You say, "Show me Ben Thanh Market."
The system automatically goes to the internet, finds a real, high-quality photo of Ben Thanh Market, and says, "Okay, now I know exactly what that looks like."
It then uses that photo as a "search key" to scan through millions of video frames, looking for visual matches. It's like handing a detective a photo of the suspect and saying, "Find anyone who looks like this," rather than just describing them.
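
The "photo as a search key" step can be sketched as a nearest-neighbor search over image embeddings. The Python below is a minimal illustration, assuming the reference photo and every video frame have already been turned into vectors by some image encoder. The three-number vectors and frame names are made up for readability; real embeddings have hundreds of dimensions and the source does not specify the similarity measure, so cosine similarity here is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_frames(reference_embedding, frame_embeddings, top_k=2):
    """Return the top_k frames most similar to the reference photo."""
    scored = [(frame_id, cosine_similarity(reference_embedding, embedding))
              for frame_id, embedding in frame_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embedding of the photo fetched for "Ben Thanh Market".
REFERENCE = [0.9, 0.1, 0.3]

# Hypothetical embeddings of indexed video frames.
FRAMES = {
    "video1_frame120": [0.88, 0.12, 0.31],
    "video2_frame045": [0.1, 0.9, 0.2],
    "video3_frame300": [0.7, 0.3, 0.4],
}

top_matches = rank_frames(REFERENCE, FRAMES)
for frame_id, score in top_matches:
    print(frame_id, round(score, 3))
```

At the scale of millions of frames, a linear scan like this would normally be replaced by an approximate nearest-neighbor index, but the comparison being made is the same.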

4. The Result: A Clear, Explainable Answer

In the end, LLandMark doesn't just give you a list of video links. It gives you a story. It shows you the specific frame, highlights the text it read, shows you the audio it heard, and explains why it thinks this is the right video.
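
One way to picture such an answer is as a small record that carries its evidence along with it. The Python sketch below is purely illustrative: the field names, the sample values, and the output layout are assumptions, not the system's actual result format.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalResult:
    video_id: str
    frame_index: int
    ocr_text: str = ""       # on-screen text that was read
    transcript: str = ""     # speech that was heard
    reasons: list = field(default_factory=list)  # why this clip matched

    def explain(self):
        """Render the result and its supporting evidence as plain text."""
        lines = [f"{self.video_id}, frame {self.frame_index}"]
        if self.ocr_text:
            lines.append(f"  on-screen text: {self.ocr_text}")
        if self.transcript:
            lines.append(f"  speech: {self.transcript}")
        lines.extend(f"  why: {reason}" for reason in self.reasons)
        return "\n".join(lines)


# Hypothetical result for the cathedral query.
result = RetrievalResult(
    video_id="news_0412",
    frame_index=1530,
    ocr_text="Hà Nội",
    transcript="we are standing outside the cathedral",
    reasons=["visual match: twin bell towers", "speech mentions 'cathedral'"],
)
print(result.explain())
```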

Why does this matter?
Previous systems were like a student trying to memorize a dictionary but failing to understand the culture. LLandMark is like a local guide who knows the city, speaks the language, understands the history, and can point you exactly where to look. It makes searching through massive video libraries as easy as asking a knowledgeable friend for directions.

In the recent HCMAIC 2025 competition, this team of digital detectives ranked in the top 56 out of 680 teams, successfully finding complex, culturally specific video clips that others missed.

