Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Imagine you are watching a live cooking show on TV, and you ask the host, "Tell me exactly when the water starts boiling."

The Old Way (Current Models):
Right now, most AI video assistants work like a nervous security guard who checks the pot every single second.

Tick-tock: Is it boiling? No.
Tick-tock: Is it boiling? No.
Tick-tock: Is it boiling? No.
Tick-tock: Is it boiling? YES!

The problem is that the guard has to stop and think deeply about the water, the steam, and the heat every single second just to decide whether to speak. This is exhausting for the computer (slow) and often leads to mistakes because the guard is too tired to think clearly (inaccurate).

The New Way (Em-Garde):
The paper introduces Em-Garde, which changes the game entirely. Instead of checking the pot every second, Em-Garde works like a smart chef with a recipe card.

Here is how it works in three simple steps:

1. The "Recipe Card" (Instruction-Guided Proposal Parser)

When you first ask your question ("When does the water boil?"), Em-Garde doesn't just say "Okay." It takes a moment up front and writes a checklist of visual clues before it starts monitoring the stream.

It translates your complex question into simple, concrete things to look for: "Look for vigorous bubbles," "Look for a steady stream of steam," or "Look for the lid rattling."
The Analogy: Instead of asking the computer to "understand boiling" every second, it gives the computer a specific list of things to spot, like a detective with a "Wanted" poster.

2. The "Eagle-Eyed Scout" (Lightweight Proposal Matching)

Now, the video starts streaming. Em-Garde uses a tiny, super-fast "scout" (a lightweight model) to watch the video.

The scout doesn't need to understand the whole story or the philosophy of cooking. It just glances at the screen and asks: "Do I see bubbles? Do I see steam?"
It compares what it sees against the "Wanted Poster" (the list of clues) using a simple math trick (embedding matching).
The Analogy: This is like a bouncer at a club checking IDs. The bouncer doesn't need to know the guest's life story; they just check if the photo on the ID matches the face. If it matches, they let them in (trigger a response). If not, they keep scanning. This is incredibly fast.

3. The "Smart Alarm" (Triggering)

As soon as the scout sees a match (e.g., "Whoa, that's a huge bubble!"), it sounds the alarm.

Only then does the "Big Brain" (the main AI) wake up, look at the scene, and say, "Ah, the water is boiling! Here is your answer."
The Analogy: The heavy lifting (thinking and talking) only happens when absolutely necessary. The rest of the time, the system is just a fast, efficient scanner.

Why is this a big deal?

Speed: Because the "scout" is so simple, it can watch the video at full speed (like 15 frames per second) without getting tired.
Accuracy: Because the "Big Brain" only speaks when the clues match, it is less likely to make up answers or get confused by irrelevant scenes.
Efficiency: It tackles the "Efficiency-Accuracy Dilemma." Old models had to choose between being fast and being smart. Em-Garde gets both by splitting the job: one part is fast at scanning, the other is smart at answering.

In a nutshell:
Em-Garde stops trying to be a genius at every single moment. Instead, it prepares a clear set of instructions, sends a fast runner to look for those specific things, and only calls in the genius when the runner finds a match. This makes proactive video understanding (where you ask in advance and the AI speaks up on its own once the event happens) practical and fast.

1. Problem Definition

The paper addresses the challenge of Proactive Streaming Video Understanding. In this paradigm, a model receives a natural language query before the relevant event occurs and must continuously monitor a video stream to autonomously decide when to respond (e.g., "Notify me when the water boils").

Core Challenges:

Efficiency-Accuracy Dilemma: Existing approaches treat triggering as a per-frame decision problem. To meet real-time constraints (5–15 frames per second), models must either be heavily compressed (losing semantic depth) or run full-scale reasoning on every frame (too slow).
Granularity Mismatch: Complex semantic reasoning (understanding the query and context) is computationally expensive, yet it must be performed at high frequency to detect events instantly.
Generalizability: Current models struggle to adapt to diverse, open-ended user queries without specific fine-tuning for every scenario.
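
To make the real-time constraint in the first challenge concrete, the short calculation below converts a frame rate into the time available to process each frame. It is plain arithmetic (1000 ms divided by the frame rate); the function name `frame_budget_ms` is ours, not the paper's.

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for fps in (5, 10, 15):
    print(f"{fps:>2} fps -> {frame_budget_ms(fps):.1f} ms per frame")
# prints:
#  5 fps -> 200.0 ms per frame
# 10 fps -> 100.0 ms per frame
# 15 fps -> 66.7 ms per frame
```

Any per-frame decision procedure that takes longer than this budget falls behind the stream, which is why running full-scale reasoning on every frame is described as too slow.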

2. Methodology: Em-Garde Framework

The authors propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. Instead of performing heavy reasoning on every frame, the system shifts the complex interpretation to the query time, leaving the streaming loop to perform lightweight visual matching.

The framework consists of two primary stages:

A. Instruction-Guided Proposal Parser (IGPP)

Function: Invoked once at query time ( $t_0$ ).
Input: The user's instruction ( $I$ ) and a short history of video frames.
Process: A large Multimodal LLM (MLLM) parses the high-level instruction into a set of structured, perceptually grounded visual proposals ( $P$ ).
- Example: For "Notify when water boils," the IGPP generates proposals like "vigorous bubbling," "sustained steam emission," or "kettle whistle."
Training:
- Dataset: A custom dataset, Parse2Prop-1K, containing 668 queries and human/GPT-5 generated proposals.
- Two-Stage Training:
  1. Supervised Fine-Tuning (SFT): Teaches the model the format and basic proposal generation.
  2. Reinforcement Learning (RL): Optimizes the proposals to be temporally localized (only appearing when the event happens) and perceptually grounded (detectable by simple vision models). The reward function balances event recall against false positives.
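
As a rough illustration of what "balancing event recall against false positives" could look like, here is a minimal sketch of a scalar reward. This is not the paper's reward function, whose exact form is not given in this summary. The function name `proposal_reward`, the `fp_penalty` weight, and the representation of events as `(start, end)` time windows are all assumptions made for the example.

```python
def proposal_reward(trigger_times, event_windows, fp_penalty=0.5):
    """Hypothetical reward: event recall minus a penalty on false triggers.

    trigger_times: timestamps (seconds) at which the proposals fired.
    event_windows: list of (start, end) ground-truth event intervals.
    fp_penalty:    assumed weight on the fraction of triggers outside any event.
    """
    # Recall: fraction of ground-truth events hit by at least one trigger.
    hits = sum(
        any(start <= t <= end for t in trigger_times)
        for (start, end) in event_windows
    )
    recall = hits / len(event_windows) if event_windows else 0.0

    # False positives: fraction of triggers that fall outside every event.
    false_triggers = sum(
        not any(start <= t <= end for (start, end) in event_windows)
        for t in trigger_times
    )
    fp_rate = false_triggers / len(trigger_times) if trigger_times else 0.0

    return recall - fp_penalty * fp_rate


# One event at 10-15 s; a proposal that fires once inside it scores 1.0.
print(proposal_reward([12.0], [(10.0, 15.0)]))        # 1.0
# A second, spurious trigger at 40 s lowers the reward to 0.75.
print(proposal_reward([12.0, 40.0], [(10.0, 15.0)]))  # 0.75
```

Under a reward of this shape, a proposal that also matches scenes outside the event is penalized, which pushes the parser toward temporally localized cues.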

B. Lightweight Proposal Matching Module (LPMM)

Function: Runs continuously in the streaming loop.
Process:
1. Receives a short sliding window of recent video frames.
2. Encodes the video segment and the pre-generated proposals ( $P$ ) into a shared multimodal embedding space using a lightweight embedding model (e.g., Ops-MM-V1).
3. Computes cosine similarity between the video embedding and proposal embeddings.
Triggering Logic: A response is triggered when the similarity score for any proposal exhibits a sharp surge exceeding a predefined threshold ( $\theta$ ).
Efficiency: Since LPMM only performs embedding matching (no complex reasoning), it can run at high frame rates (10–15 fps) on standard hardware.
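
The matching loop described above can be sketched in a few lines. This is a toy version under stated assumptions: embeddings are plain Python lists rather than outputs of a real embedding model, and the "sharp surge" is read as the similarity rising from below $\theta$ to at or above it between consecutive windows. The paper may define the surge differently. The names `cosine` and `stream_triggers` are ours.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stream_triggers(window_embeddings, proposal_embeddings, theta=0.8):
    """Yield (step, proposal_index) when a proposal's similarity rises past theta.

    Assumption: a trigger is a rising edge, i.e. the previous window scored
    below theta and the current window scores at or above it.
    """
    prev = [0.0] * len(proposal_embeddings)
    for step, window in enumerate(window_embeddings):
        sims = [cosine(window, p) for p in proposal_embeddings]
        for i, (s_prev, s_now) in enumerate(zip(prev, sims)):
            if s_prev < theta <= s_now:
                yield step, i
        prev = sims


# Toy 3-d embeddings: axis 0 = "bubbles", axis 1 = "steam", axis 2 = background.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
windows = [
    [0.0, 0.0, 1.0],  # background only
    [0.2, 0.0, 1.0],  # faint hint of bubbles
    [1.0, 0.0, 0.1],  # vigorous bubbles
    [1.0, 0.0, 0.1],  # still boiling, no new trigger
]
print(list(stream_triggers(windows, proposals)))  # [(2, 0)]
```

Using a rising edge rather than a plain threshold check means the system fires once when the event begins instead of on every subsequent window in which the match persists.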

3. Key Contributions

Paradigm Shift: Moves away from "per-frame decision making" to a "Propose-Match" architecture. This separates expensive semantic reasoning (done once) from efficient visual perception (done continuously).
Instruction-Guided Proposal Parser (IGPP): Introduces a mechanism to translate abstract user queries into concrete, visual cues that a lightweight model can detect, solving the generalizability issue.
Reinforcement Learning for Proposals: Demonstrates that RL significantly improves the quality of proposals, making them more temporally precise and less prone to distracting semantic noise compared to SFT-only models.
Visual Encoding Cache: Implements a caching mechanism for overlapping video frames in the sliding window, reducing redundant encoding and boosting inference speed by 2–3 $\times$.
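
To show why caching helps when consecutive windows overlap, the sketch below keeps per-frame encodings in a small bounded cache and counts how often the encoder actually runs. It is an illustration of the general idea only. The class `FrameEncodingCache`, its eviction policy (drop the oldest frame), and the stand-in encoder are assumptions, not the paper's implementation.

```python
from collections import OrderedDict


class FrameEncodingCache:
    """Caches per-frame encodings so overlapping windows reuse earlier work."""

    def __init__(self, encode_frame, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.encode_frame = encode_frame  # hypothetical per-frame encoder
        self.capacity = capacity
        self._cache = OrderedDict()       # frame_id -> encoding, oldest first
        self.encoder_calls = 0

    def encode_window(self, frames):
        """frames: list of (frame_id, frame) pairs in temporal order."""
        encodings = []
        for frame_id, frame in frames:
            if frame_id not in self._cache:
                self._cache[frame_id] = self.encode_frame(frame)
                self.encoder_calls += 1
            encodings.append(self._cache[frame_id])
            # Evict the oldest entries once the cache exceeds its capacity.
            while len(self._cache) > self.capacity:
                self._cache.popitem(last=False)
        return encodings


# 10 frames, window of 4 frames, stride 1 -> 7 overlapping windows.
frames = [(i, float(i)) for i in range(10)]
sliding_windows = [frames[i:i + 4] for i in range(7)]

cache = FrameEncodingCache(encode_frame=lambda f: [f], capacity=4)
for w in sliding_windows:
    cache.encode_window(w)

print(cache.encoder_calls)                    # 10
print(sum(len(w) for w in sliding_windows))   # 28 encoder calls without a cache
```

In this toy setup each frame is encoded once instead of up to four times. The numbers depend entirely on the chosen window size and stride and are not meant to reproduce the speedup reported in the paper.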

4. Experimental Results

The framework was evaluated on StreamingBench, OVO-Bench, and ProactiveVideoQA.

Proactive Response Accuracy:
- StreamingBench: Outperformed existing models by >3% in accuracy.
- OVO-Bench: Achieved a 10% improvement in F1 score compared to prior state-of-the-art (SOTA) models.
- ProactiveVideoQA: Achieved competitive PAUC scores against specialized models.
Efficiency:
- Achieved 10–15 FPS on A100 GPUs for arbitrarily long videos.
- Unlike models like VideoLLM-Online or MMDuet-2, Em-Garde's latency does not degrade as video context lengthens, maintaining real-time performance.
Online Video Understanding:
- Maintained strong performance on real-time perception and backward tracing tasks (e.g., 76.7% on StreamingBench Real-time VU), indicating that decoupling does not sacrifice the model's ability to understand the video content.

5. Significance and Impact

Solving the Efficiency-Accuracy Trade-off: Em-Garde demonstrates that high-accuracy proactive responses are possible under strict computational constraints by restructuring the problem, rather than just compressing models.
Scalability: The architecture is highly scalable for long-duration streams (e.g., surveillance, live sports, household assistance) because the computational cost per frame remains constant regardless of video length.
Practical Deployment: By using off-the-shelf embedding models for the matching stage and shifting heavy lifting to an asynchronous query-time step, the system is more practical for real-world deployment than previous RL-based or heavy-context approaches.
Future Direction: The paper highlights that while the triggering mechanism is robust, future work could focus on improving the discriminative power of embedding models to handle subtle scene changes and integrating the decision and generation stages for joint optimization.

In conclusion, Em-Garde provides a robust, efficient, and generalizable solution for proactive video understanding, effectively bridging the gap between complex semantic reasoning and real-time streaming constraints.