LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification

Imagine you are trying to find a specific, complex moment in a three-hour movie to answer a question like: "After the hero finds the tree, strips the bark, and builds a shelter, what does he do next?"

The Problem: The "Slow-Motion" Detective

Current AI models (called vision-language models, or VLMs) are like detectives who are very smart but incredibly slow and literal.

The Old Way (Uniform Sampling): The detective looks at 32 evenly spaced snapshots of the movie. They might miss the bark-stripping scene entirely because it happened between two snapshots.
The Previous "Smart" Way (NeuS-QA): This was a super-organized detective who wrote down a strict checklist (a "temporal logic" plan). They watched every single frame of the movie to check off items on the list: "Did he find the tree? Yes. Did he strip the bark? Yes."
- The Catch: This was very accurate, but it took up to 90 times longer than just asking the AI the question directly. It was like spending 90 minutes on a lookup that should take one. For real-world use (like on a phone or a robot), this was too slow to be practical.

The Solution: LE-NeuS (The "Smart Skipper")

The authors created LE-NeuS, a new system that keeps the "smart checklist" accuracy while running roughly 10 times faster. They did this using three clever tricks:

1. The "CLIP" Filter (The Bouncer)

Before the expensive AI detective starts working, a lightweight, fast "bouncer" (called CLIP) scans the movie.

How it works: The bouncer knows what "finding a tree" looks like. If a frame is just a shot of the sky or a tree that isn't being touched, the bouncer says, "Skip this, it's boring."
The Analogy: Imagine you have a 3-hour movie. Instead of watching every second, you only watch the scenes where the main character is actually doing something. You skip all the long pauses and background scenery.

2. The "Batching" Trick (The Assembly Line)

The old system asked the AI detective one question at a time: "Is this a tree?" (Wait for answer). "Is this bark?" (Wait for answer). This is like a factory worker picking up one widget, painting it, putting it down, and then picking up the next one.

The Fix: LE-NeuS grabs a whole stack of questions and asks them all at once.
The Analogy: Instead of one worker painting one widget, you have a conveyor belt. The worker paints 10 widgets in the time it used to take to paint one. This uses the computer's power much more efficiently.

3. The "Multi-Segment" Strategy (The Highlight Reel)

Sometimes the answer isn't in one long continuous scene. Maybe the hero finds the tree at minute 5, strips bark at minute 20, and builds a shelter at minute 45.

The Old Way: The AI tried to watch the entire movie from minute 5 to 45 continuously, getting confused by the boring parts in between.
The New Way: LE-NeuS creates a "Highlight Reel." It stitches together just the relevant clips (the tree, the bark, the shelter) and ignores the 30 minutes of nothingness in between. It then asks the AI to solve the puzzle using only these high-quality clips.

The Result: Fast and Accurate

By combining these tricks, LE-NeuS achieves a "sweet spot":

Speed: It is no longer 90 times slower than a basic AI; it's only about 10 times slower. This makes it fast enough to potentially run on powerful edge devices (like advanced cameras or robots).
Accuracy: Because it still uses the "strict checklist" (formal logic) to verify the sequence of events, it is actually more accurate than the basic AI, especially for tricky questions that require understanding time and order.

In a Nutshell

Think of LE-NeuS as a smart video editor who doesn't just watch the whole movie blindly. Instead, they:

Scan the movie quickly to find the interesting parts (Adaptive Sampling).
Group their questions to ask the AI efficiently (Batching).
Cut out the boring parts to focus only on the evidence (Multi-Segment Retrieval).

This allows the AI to solve complex, time-based puzzles in long videos without taking hours to do so.

1. Problem Statement

Long-form Video Question Answering (LVQA) requires systems to perform semantic grounding, temporal reasoning, and compositional inference over extended video durations. While Neuro-symbolic approaches (specifically NeuS-QA) have demonstrated superior accuracy by translating natural language queries into Temporal Logic (TL) specifications and performing formal model checking, they suffer from prohibitive computational costs.

The Bottleneck: Existing neuro-symbolic pipelines rely on sequential, dense proposition detection across every frame window to construct a video automaton. This drives latency up to 90× that of standard Vision-Language Model (VLM) prompting.
The Consequence: This latency makes neuro-symbolic methods impractical for latency-sensitive edge deployments or real-time applications, despite their high accuracy and interpretability.
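To make the idea of a temporal logic specification concrete, here is a minimal sketch of checking an ordered "first A, then B, then C" query against per-window proposition detections. It assumes the query has the nested "eventually" shape F(p1 ∧ F(p2 ∧ F(p3))) and that each frame window is represented by the set of propositions detected in it. The function name, the proposition names, and the trace format are all hypothetical, and the actual NeuS-QA pipeline builds a video automaton and runs a model checker rather than a loop like this.

```python
from typing import List, Optional, Set

def find_ordered_events(trace: List[Set[str]], props: List[str]) -> Optional[List[int]]:
    """Return the earliest window index witnessing each proposition in order.

    Implements the nested "eventually" pattern F(p1 & F(p2 & F(p3 ...))).
    Returns None when the ordered sequence never occurs in the trace.
    """
    witnesses = []
    start = 0
    for prop in props:
        # Search from the previous witness onward (F is non-strict: t' >= t).
        hit = next((t for t in range(start, len(trace)) if prop in trace[t]), None)
        if hit is None:
            return None
        witnesses.append(hit)
        start = hit
    return witnesses

# Hypothetical detections: one set of true propositions per frame window.
trace = [set(), {"find_tree"}, set(), {"strip_bark"}, {"build_shelter"}, set()]

in_order = find_ordered_events(trace, ["find_tree", "strip_bark", "build_shelter"])
out_of_order = find_ordered_events(trace, ["strip_bark", "find_tree"])
print(in_order, out_of_order)
```

Note that the trace itself is the expensive part: filling in one entry per window requires a VLM call per proposition, which is the dense detection cost described above.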

2. Methodology: LE-NeuS Framework

The authors propose LE-NeuS, a framework designed to drastically reduce inference latency while preserving the formal reasoning guarantees of temporal logic. The method restructures the pipeline through three core optimizations:

A. CLIP-Guided Two-Stage Adaptive Sampling

Instead of processing every frame uniformly, LE-NeuS exploits the visual redundancy inherent in long-form videos.

Stage 1: Semantic Relevance Filtering: A lightweight CLIP encoder projects video frames and query propositions into a shared latent space. Frames with low semantic similarity to the target propositions are pruned. Only frames exceeding a similarity threshold $\tau_s$ are retained, along with a temporal window to preserve context.
Stage 2: Visual Redundancy Elimination: Within the candidate set, a second pass removes near-duplicate frames. Using a redundancy threshold $\tau_r$ , the system selects only keyframes that represent significant visual changes.
Label Propagation: For frames discarded during redundancy elimination, proposition labels are propagated from the preceding keyframe, avoiding unnecessary VLM inference.
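The three steps above can be sketched with NumPy on precomputed embeddings. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for CLIP outputs, similarity is cosine similarity, a frame counts as a near-duplicate when its similarity to the last kept keyframe is at least `tau_r`, and the names `select_keyframes`, `propagate_labels`, and `window` are hypothetical.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_keyframes(frame_emb, prop_emb, tau_s, tau_r, window=1):
    frames = normalize(np.asarray(frame_emb, dtype=float))
    props = normalize(np.asarray(prop_emb, dtype=float))
    # Stage 1: keep frames whose best proposition similarity exceeds tau_s,
    # plus `window` neighbours on each side for temporal context.
    relevance = (frames @ props.T).max(axis=1)
    candidate = np.zeros(frames.shape[0], dtype=bool)
    for t in np.flatnonzero(relevance > tau_s):
        candidate[max(0, t - window): t + window + 1] = True
    # Stage 2: drop candidates that are near-duplicates of the last keyframe.
    keyframes = []
    for t in np.flatnonzero(candidate):
        if not keyframes or frames[t] @ frames[keyframes[-1]] < tau_r:
            keyframes.append(int(t))
    return candidate, keyframes

def propagate_labels(candidate, keyframes, keyframe_labels):
    # Non-keyframe candidates inherit the label of the preceding keyframe.
    labels, last, key_set = {}, None, set(keyframes)
    for t in np.flatnonzero(candidate):
        t = int(t)
        if t in key_set:
            last = keyframe_labels[t]
        labels[t] = last
    return labels

# Toy 3-D embeddings; axis 0 plays the role of "matches the proposition".
frame_emb = [[0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0.01, 0],
             [1, 1, 0], [0, 0, 1], [0, 0, 1]]
prop_emb = [[1, 0, 0]]

candidate, keyframes = select_keyframes(frame_emb, prop_emb,
                                        tau_s=0.5, tau_r=0.95, window=0)
# Labels for keyframes would come from the VLM; these are made-up values.
labels = propagate_labels(candidate, keyframes, {2: True, 4: False})
print(keyframes, labels)
```

In this toy run only the keyframes would be sent to the VLM; frame 3 is a near-duplicate of frame 2 and simply reuses its label.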

B. Batched Proposition Detection

The baseline NeuS-QA treats each (frame window, proposition) pair as a separate inference call, underutilizing GPU resources.

Optimization: LE-NeuS employs batched inference. It stacks multiple proposition-window pairs into a single batch.
Mechanism: For a given frame window, the visual input is constant across the batch, so the visual encoder features are computed once and broadcast to every proposition in the batch. This reduces the number of forward passes per window from $|P|$ (the number of propositions) to $\lceil |P|/B \rceil$, where $B$ is the batch size, amortizing fixed inference costs.
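The call-count reduction can be illustrated with a small sketch that groups propositions into batches and counts how many scoring calls are issued for one frame window. The scorer here is a stand-in, not a real VLM: `fake_score_batch`, `detect_propositions`, and the proposition names are hypothetical. The sketch only models the number of forward passes, not the sharing of visual encoder features.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

def detect_propositions(
    window: Set[str],
    propositions: Sequence[str],
    score_batch: Callable[[Set[str], Sequence[str]], List[float]],
    batch_size: int,
) -> Tuple[List[float], int]:
    """Score all propositions for one window, batch_size at a time."""
    scores: List[float] = []
    calls = 0
    for i in range(0, len(propositions), batch_size):
        chunk = propositions[i:i + batch_size]
        scores.extend(score_batch(window, chunk))  # one forward pass per chunk
        calls += 1
    return scores, calls

def fake_score_batch(window: Set[str], props: Sequence[str]) -> List[float]:
    # Stand-in for a VLM: 1.0 if the proposition is "visible" in the window.
    return [1.0 if p in window else 0.0 for p in props]

propositions = ["find_tree", "strip_bark", "build_shelter", "light_fire",
                "collect_water", "cook_food", "sleep"]
window = {"strip_bark", "light_fire"}

scores, batched_calls = detect_propositions(window, propositions, fake_score_batch, batch_size=3)
_, sequential_calls = detect_propositions(window, propositions, fake_score_batch, batch_size=1)
print(sequential_calls, batched_calls, math.ceil(len(propositions) / 3))
```

With seven propositions and a batch size of three, the sequential variant issues seven calls and the batched variant issues three, matching $\lceil |P|/B \rceil$.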

C. Multi-Segment Frames-of-Interest (FoI) Retrieval

Unlike prior methods that extract only the single largest continuous segment satisfying the logic, LE-NeuS retrieves multiple disjoint segments where the temporal logic holds.

Benefit: This concentrates the final VLM reasoning on high-density evidence segments rather than diluting attention across long, irrelevant continuous spans. It improves the probability of sampling true evidence frames within the VLM's fixed context window.
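A minimal sketch of the difference between single-segment and multi-segment retrieval follows. It assumes the model checker's output has already been reduced to one boolean per frame window indicating whether that window belongs to a satisfying span, which is a simplification. The function names and the `budget` parameter (standing in for the VLM's fixed context window) are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) window indices

def satisfying_segments(satisfied: List[bool]) -> List[Segment]:
    """Collect every maximal run of consecutive True flags."""
    segments, start = [], None
    for t, ok in enumerate(satisfied):
        if ok and start is None:
            start = t
        elif not ok and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(satisfied) - 1))
    return segments

def largest_segment(segments: List[Segment]) -> Segment:
    """Single-segment strategy: keep only the longest run."""
    return max(segments, key=lambda s: s[1] - s[0])

def sample_frames(segments: List[Segment], budget: int) -> List[int]:
    """Spread a fixed frame budget evenly over the union of all segments."""
    frames = [t for start, end in segments for t in range(start, end + 1)]
    if len(frames) <= budget:
        return frames
    step = len(frames) / budget
    return [frames[int(i * step)] for i in range(budget)]

satisfied = [False, True, True, False, False, False,
             True, False, True, True, True, False]

segments = satisfying_segments(satisfied)
single = largest_segment(segments)
multi_sample = sample_frames(segments, budget=3)
single_sample = sample_frames([single], budget=3)
print(segments, single, multi_sample, single_sample)
```

In this toy trace the single-segment strategy spends its whole budget on windows 8 to 10, while the multi-segment strategy draws one frame from each of the three evidence spans.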

D. Theoretical Latency Bounds

The authors derive a formal latency bound:
$L_{\text{LE-NeuS}} \leq L_{\text{LQ2TL}} + T \cdot L_{\text{CLIP}} + \lceil \alpha \rho T \rceil \cdot L_{\text{VLM}} + L_{\text{MC}} + L_{\text{VQA}}$
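Here $T$ is taken to be the number of frame windows in the video (the CLIP term scales with it), $\alpha$ is the semantic filtering ratio, and $\rho$ is the keyframe retention rate. The analysis shows that the savings come from the VLM term: latency efficiency is achievable when the joint density $\alpha\rho$ of processed windows is sufficiently low.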
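To see how the bound behaves, the sketch below evaluates its right-hand side for one set of inputs. Every number is a made-up placeholder chosen for easy arithmetic, not a measurement from the paper, and the "dense" comparison simply sets $\alpha = \rho = 1$ in the same formula rather than modelling the NeuS-QA baseline exactly.

```python
import math

def latency_bound(T, alpha, rho, l_lq2tl, l_clip, l_vlm, l_mc, l_vqa):
    """Right-hand side of the latency bound, in seconds.

    T      : number of frame windows
    alpha  : fraction of windows kept by semantic filtering
    rho    : fraction of those kept as keyframes
    l_*    : per-stage latencies (query-to-TL, CLIP per window,
             VLM per processed window, model checking, final VQA)
    """
    vlm_calls = math.ceil(alpha * rho * T)
    return l_lq2tl + T * l_clip + vlm_calls * l_vlm + l_mc + l_vqa

# Hypothetical per-stage latencies in seconds.
costs = dict(l_lq2tl=2.0, l_clip=0.005, l_vlm=0.25, l_mc=0.5, l_vqa=3.0)

sparse = latency_bound(T=1200, alpha=0.5, rho=0.25, **costs)  # 150 VLM calls
dense = latency_bound(T=1200, alpha=1.0, rho=1.0, **costs)    # 1200 VLM calls
print(sparse, dense, dense / sparse)
```

With these placeholder values the CLIP term stays small even though it is linear in $T$, and the total is dominated by the VLM term, which is why shrinking $\alpha\rho$ matters most.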

3. Key Contributions

Latency-Efficient Neuro-Symbolic Framework: First systematic approach to reducing neuro-symbolic video understanding latency from ~90× to ~10× relative to base VLMs.
Principled Optimizations: Introduction of CLIP-guided adaptive sampling and batched proposition detection to address the specific bottleneck of automaton construction.
Theoretical Analysis: Derivation of latency bounds and conditions under which neuro-symbolic reasoning can operate efficiently at scale.
Multi-Segment Retrieval: A novel strategy to handle compositional queries by retrieving disjoint evidence segments, improving both accuracy and sampling efficiency.

4. Experimental Results

Experiments were conducted on LongVideoBench, Video-MME, and MLVU using NVIDIA H100 GPUs.

Accuracy:
- LE-NeuS achieves a new state-of-the-art on LongVideoBench, reaching 67.10% overall accuracy (using InternVL2.5-8B), surpassing the NeuS-QA baseline (61.89%) by 5.21 percentage points.
- It outperforms other structured reasoning frameworks (e.g., VideoTree) by over 16%.
- On Video-MME (Temporal Reasoning), it achieves a 12.07% improvement over NeuS-QA.
Efficiency (Latency):
- Speedup: Achieves a global 12.53× speedup over the NeuS-QA baseline.
- Absolute Latency: Reduces average inference time from ~554s (NeuS-QA) to ~44s (LE-NeuS) for long videos.
- Resource Usage: For 60-minute videos, it processes only 281 frames compared to 1,427 frames in the baseline, while maintaining sub-linear latency scaling.
Ablation Studies:
- Batching alone provides a ~3.2× speedup.
- Adaptive Sampling further reduces latency by ~3.8×.
- Multi-Segment Retrieval recovers accuracy lost during aggressive pruning, adding ~6.9% accuracy gain.
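As a quick consistency check on the figures above, the snippet below recomputes the derived quantities from the reported numbers. Multiplying the two ablation speedups assumes they compose independently, which the summary does not state, so the product is only a rough cross-check against the reported 12.53× global figure.

```python
# Reported values copied from the results above.
neus_qa_latency_s = 554
le_neus_latency_s = 44
baseline_frames = 1427
le_neus_frames = 281

latency_ratio = neus_qa_latency_s / le_neus_latency_s  # about 12.59
ablation_product = 3.2 * 3.8                           # 12.16, assuming multiplicative composition
frame_fraction = le_neus_frames / baseline_frames      # about 0.197
accuracy_gain_pp = round(67.10 - 61.89, 2)             # 5.21 percentage points

print(latency_ratio, ablation_product, frame_fraction, accuracy_gain_pp)
```

Both the latency ratio and the ablation product land close to the reported global speedup, and LE-NeuS processes roughly a fifth of the frames the baseline does on 60-minute videos.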

5. Significance and Impact

Bridging the Gap: LE-NeuS successfully bridges the gap between the high accuracy/interpretability of neuro-symbolic AI and the low-latency requirements of real-world deployment.
Scalability: By tying VLM inference cost to the amount of query-relevant content rather than to raw video length, adaptive sampling yields sub-linear latency scaling and enables neuro-symbolic reasoning for videos up to one hour long, a task previously considered computationally infeasible for formal verification methods.
Broader Applicability: The principles of selective grounding and parallel verification extend beyond LVQA to other latency-critical domains such as autonomous driving, embodied AI agents, and safety-critical edge monitoring, where structured temporal reasoning must operate under strict time budgets.

In conclusion, LE-NeuS demonstrates that neuro-symbolic video understanding does not need to be prohibitively slow; with principled architectural changes, it can cut latency by roughly an order of magnitude without sacrificing the rigorous logical guarantees that make it superior to heuristic retrieval methods.