LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification

LE-NeuS is a latency-efficient neuro-symbolic framework for long-form video question answering. It cuts inference latency from 90x to ~10x that of a base VLM while preserving the accuracy gains, using CLIP-guided adaptive frame sampling and batched proposition detection.

Shawn Liang, Sahil Shah, Chengwei Zhou, SP Sharan, Harsh Goel, Arnab Sanyal, Sandeep Chinchali, Gourav Datta

Published 2026-03-02

Imagine you are trying to find a specific, complex moment in a three-hour movie to answer a question like: "After the hero finds the tree, strips the bark, and builds a shelter, what does he do next?"

The Problem: The "Slow-Motion" Detective

Current AI models (vision-language models, or VLMs) are like detectives who are very smart but incredibly slow and literal.

  • The Old Way (Uniform Sampling): The detective looks at 32 evenly spaced snapshots of the movie. They might miss the bark-stripping scene entirely because it happened between two snapshots.
  • The Previous "Smart" Way (NeuS-QA): This was a super-organized detective who wrote down a strict checklist (a "temporal logic" plan). They watched every single frame of the movie to check off items on the list: "Did he find the tree? Yes. Did he strip the bark? Yes."
    • The Catch: This was very accurate, but it took 90 times longer than just asking the AI a simple question. It was like sitting through the movie 90 times over just to find one scene. For real-world use (like on a phone or a robot), this was too slow to be practical.
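
To make the two old approaches concrete, here is a minimal sketch on a toy video. Everything in it is an assumption for illustration: the frame rate, the event positions, the label names, and the helper functions `uniform_sample` and `ordered_checklist` are made up, and the in-order scan is a simplified stand-in for the temporal-logic checking that NeuS-QA performs, not its actual implementation.

```python
def uniform_sample(num_frames, k):
    """Return k evenly spaced frame indices (the centre of each of k equal bins)."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def ordered_checklist(frame_labels, checklist):
    """Scan frames in order and tick off checklist items one after another.

    Returns the position at which each item was ticked, or None if the
    full sequence never completes.
    """
    hits = []
    step = 0
    for idx, labels in enumerate(frame_labels):
        if step < len(checklist) and checklist[step] in labels:
            hits.append(idx)
            step += 1
    return hits if step == len(checklist) else None


# Hypothetical 3-hour video at 1 frame per second, with three short events.
NUM_FRAMES = 3 * 60 * 60
events = [
    (300, 330, "find_tree"),
    (1200, 1260, "strip_bark"),
    (2700, 2760, "build_shelter"),
]
frame_labels = [set() for _ in range(NUM_FRAMES)]
for start, end, label in events:
    for i in range(start, end):
        frame_labels[i].add(label)

checklist = ["find_tree", "strip_bark", "build_shelter"]

# Old way 1: 32 evenly spaced snapshots skip over every short event.
sampled = uniform_sample(NUM_FRAMES, 32)
sparse_result = ordered_checklist([frame_labels[i] for i in sampled], checklist)

# Old way 2: checking every frame finds the sequence, at the cost of
# inspecting all NUM_FRAMES frames.
dense_result = ordered_checklist(frame_labels, checklist)

print("sparse:", sparse_result)
print("dense:", dense_result, "frames inspected:", NUM_FRAMES)
```

In this toy setup the 32 snapshots land roughly 337 frames apart, so events lasting a minute or less fall between them, while the exhaustive scan finds all three at the price of touching every frame.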

The Solution: LE-NeuS (The "Smart Skipper")

The authors created LE-NeuS, a new system that keeps the "smart checklist" accuracy but makes it roughly 9 times faster, cutting the slowdown from about 90x to about 10x. They did this using three clever tricks:

1. The "CLIP" Filter (The Bouncer)

Before the expensive AI detective starts working, a lightweight, fast "bouncer" (called CLIP) scans the movie.

  • How it works: The bouncer knows what "finding a tree" looks like. If a frame is just a shot of the sky or a tree that isn't being touched, the bouncer says, "Skip this, it's boring."
  • The Analogy: Imagine you have a 3-hour movie. Instead of watching every second, you only watch the scenes where the main character is actually doing something. You skip all the long pauses and background scenery.
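
The bouncer idea can be sketched as a similarity filter. The sketch below does not call a real CLIP model: the three-dimensional vectors are hand-made stand-ins for CLIP image and text embeddings, and the function names and the 0.8 threshold are hypothetical choices, not values from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_frames(frame_embeddings, text_embedding, threshold):
    """Keep the indices of frames whose embedding is close to the text query."""
    return [
        i
        for i, emb in enumerate(frame_embeddings)
        if cosine(emb, text_embedding) >= threshold
    ]


# Toy embeddings (assumed, NOT real CLIP outputs).
text_query = [1.0, 0.0, 0.0]      # stands in for "hero finds a tree"
frames = [
    [0.9, 0.1, 0.0],              # close to the query
    [0.0, 1.0, 0.0],              # unrelated (e.g. sky)
    [0.1, 0.0, 1.0],              # unrelated (e.g. scenery)
    [0.8, 0.3, 0.1],              # fairly close to the query
]

kept = filter_frames(frames, text_query, threshold=0.8)
print(kept)
```

Only the frames that pass the filter would be handed on to the expensive VLM, which is where the saving comes from.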

2. The "Batching" Trick (The Assembly Line)

The old system asked the AI detective one question at a time: "Is this a tree?" (Wait for answer). "Is this bark?" (Wait for answer). This is like a factory worker picking up one widget, painting it, putting it down, and then picking up the next one.

  • The Fix: LE-NeuS grabs a whole stack of questions and asks them all at once.
  • The Analogy: Instead of one worker painting one widget at a time, you have a conveyor belt. The worker paints a whole tray of widgets in a single pass, which takes far less time than painting them one by one. This uses the computer's power much more efficiently.
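
A rough way to see the effect of batching is to count how many times the model is invoked. The `CountingDetector` class below is a hypothetical stand-in for a VLM (it just looks answers up in a set), and the batch size of 8 is an arbitrary assumption; the point is only that the same questions get answered with far fewer calls.

```python
from itertools import product


class CountingDetector:
    """Hypothetical stand-in for a VLM that counts its forward passes."""

    def __init__(self, truth):
        self.truth = truth  # set of (frame, proposition) pairs that hold
        self.calls = 0

    def detect_batch(self, queries):
        """Answer a list of (frame, proposition) queries in one call."""
        self.calls += 1
        return [q in self.truth for q in queries]


def run_sequential(detector, queries):
    """One question per call, as in the old approach."""
    return [detector.detect_batch([q])[0] for q in queries]


def run_batched(detector, queries, batch_size):
    """Group the questions and send each group in a single call."""
    results = []
    for start in range(0, len(queries), batch_size):
        results.extend(detector.detect_batch(queries[start:start + batch_size]))
    return results


truth = {(2, "tree"), (5, "bark")}
queries = list(product(range(8), ["tree", "bark"]))  # 16 questions

slow = CountingDetector(truth)
slow_answers = run_sequential(slow, queries)

fast = CountingDetector(truth)
fast_answers = run_batched(fast, queries, batch_size=8)

print("sequential calls:", slow.calls, "batched calls:", fast.calls)
```

On real hardware the gain per call depends on the model and the device, so the call count here is only a proxy for latency.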

3. The "Multi-Segment" Strategy (The Highlight Reel)

Sometimes the answer isn't in one long continuous scene. Maybe the hero finds the tree at minute 5, strips bark at minute 20, and builds a shelter at minute 45.

  • The Old Way: The AI tried to watch the entire movie from minute 5 to 45 continuously, getting confused by the boring parts in between.
  • The New Way: LE-NeuS creates a "Highlight Reel." It stitches together just the relevant clips (the tree, the bark, the shelter) and ignores the long stretches of nothingness in between. It then asks the AI to solve the puzzle using only these high-quality clips.
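
The highlight-reel idea can be sketched as merging nearby hits into short clips. The function `merge_segments`, the gap tolerance, and the padding are assumptions chosen for illustration, and timestamps are in minutes to match the example above; the paper's actual retrieval procedure may differ.

```python
def merge_segments(timestamps, max_gap, pad):
    """Group nearby timestamps into segments, then pad each segment.

    Timestamps closer than max_gap are merged into one clip. Choosing
    max_gap >= 2 * pad keeps the padded clips from overlapping.
    """
    segments = []
    for t in sorted(timestamps):
        if segments and t - segments[-1][1] <= max_gap:
            segments[-1][1] = t          # extend the current clip
        else:
            segments.append([t, t])      # start a new clip
    return [(max(0, start - pad), end + pad) for start, end in segments]


# Minutes at which the relevant events were detected (toy values).
hits = [5, 6, 20, 21, 45]

clips = merge_segments(hits, max_gap=2, pad=1)
reel_minutes = sum(end - start for start, end in clips)
continuous_minutes = clips[-1][1] - clips[0][0]

print("clips:", clips)
print("highlight reel:", reel_minutes, "min vs continuous:", continuous_minutes, "min")
```

With these toy numbers the reel covers 8 minutes instead of a 42-minute continuous span, so the AI sees the three events without the filler between them.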

The Result: Fast and Accurate

By combining these tricks, LE-NeuS achieves a "sweet spot":

  • Speed: It is no longer 90 times slower than a basic AI; it's only about 10 times slower. This makes it fast enough to potentially run on powerful edge devices (like advanced cameras or robots).
  • Accuracy: Because it still uses the "strict checklist" (formal logic) to verify the sequence of events, it is actually more accurate than the basic AI, especially for tricky questions that require understanding time and order.
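
For a feel of what the two slowdown factors mean in practice, here is a small worked calculation. The 2-second base latency is a made-up number; only the 90x and 10x factors come from the summary above.

```python
def overhead_summary(base_seconds, old_factor=90, new_factor=10):
    """Convert slowdown factors into wall-clock times and a relative speedup."""
    old_seconds = base_seconds * old_factor
    new_seconds = base_seconds * new_factor
    return {
        "old_seconds": old_seconds,
        "new_seconds": new_seconds,
        "speedup": old_seconds / new_seconds,
    }


# Assumed: the base VLM answers a question in 2 seconds.
summary = overhead_summary(base_seconds=2.0)
print(summary)
```

Under that assumption the older pipeline would take 3 minutes per question and LE-NeuS about 20 seconds, a roughly 9-fold improvement that still leaves it 10 times slower than the base model.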

In a Nutshell

Think of LE-NeuS as a smart video editor who doesn't just watch the whole movie blindly. Instead, they:

  1. Scan the movie quickly to find the interesting parts (Adaptive Sampling).
  2. Group their questions to ask the AI efficiently (Batching).
  3. Cut out the boring parts to focus only on the evidence (Multi-Segment Retrieval).

This allows the AI to solve complex, time-based puzzles in long videos without taking hours to do so.
