EVA: Efficient Reinforcement Learning for End-to-End Video Agent
EVA is an efficient reinforcement learning framework for end-to-end video agents that achieves query-driven, adaptive video understanding through a "planning-before-perception" reasoning loop and a novel three-stage training pipeline, outperforming existing baselines by 6–12% on six benchmarks.
Imagine you are trying to solve a mystery in a massive, 3-hour-long movie. You have a question: "Who stole the diamond, and exactly when did they do it?"
The Old Way: The "Blind Fumble"
Most current AI video agents are like a detective who is forced to watch the entire movie at once, but can only look at it through a tiny, blurry keyhole.
The Problem: If they try to watch the whole thing in high definition, their brain (computer memory) explodes. So, they usually just take 10 random snapshots from the beginning, middle, and end.
The Result: They miss the crucial scene where the theft happened because it wasn't in their random snapshots. Or, they waste hours watching scenes of people eating lunch that have nothing to do with the crime. They are passive; they just wait for the video to be fed to them.
The New Way: EVA (The "Smart Detective")
The paper introduces EVA (Efficient Video Agent). Think of EVA not as a camera, but as a smart, strategic detective who knows how to use a remote control.
EVA follows a simple philosophy: "Plan before you look."
Instead of staring at the screen immediately, EVA does this:
Reads the Clue: It looks at your question first.
Makes a Plan: It thinks, "Okay, the question is about a theft. I don't need to watch the whole movie. I should probably look at the scene where the party starts, then zoom in on the jewelry box."
Takes Action: It uses a tool to grab only the specific 5 seconds of video it needs, at high quality.
Reflects: It looks at those 5 seconds. "Hmm, I see a hand, but I can't see the face. I need to zoom in closer on the next 10 seconds."
Repeats: It keeps doing this—Plan, Watch, Reflect, Zoom—until it has the answer.
The Secret Sauce: The Three-Stage Training
How did they teach a computer to be this smart? They didn't just give it a textbook; they trained it like a human apprentice through three stages:
Stage 1: The "Copycat" Phase (SFT)
Analogy: Like a student copying the teacher's homework.
What happened: They showed the AI thousands of examples of good detectives solving cases. The AI learned the format: "First I think, then I ask for a video clip, then I look, then I answer." It learned the rules of the game.
Stage 2: The "Correction" Phase (KTO)
Analogy: Like a coach saying, "Stop guessing! Look at the evidence!"
What happened: The AI started making mistakes. Sometimes it guessed the answer without looking, or it looked at the wrong part of the video. The researchers showed it examples of these failures and said, "No, that's a bad strategy." This taught the AI what not to do.
Stage 3: The "Trial by Fire" Phase (GRPO)
Analogy: Like a video game where you get points for winning and lose points for wasting time.
What happened: The AI was put in a simulation where it had to solve video puzzles. If it found the answer quickly and accurately, it got a "reward." If it wasted time watching irrelevant scenes or guessed wrong, it got a "penalty." Over time, it learned to be incredibly efficient, only watching what was necessary.
Why is this a Big Deal?
It Saves Energy: Instead of downloading and processing 10,000 frames of a video (which is slow and expensive), EVA might only look at 50 frames. It's like reading a book by skimming the chapters you need instead of reading every word of a 1,000-page novel.
It's Smarter: Because it plans first, it doesn't get distracted by irrelevant scenes. It knows exactly where to look.
It Adapts: If a question is simple, it takes a quick glance. If a question is hard, it knows to zoom in and look closely. It's not a "one-size-fits-all" robot.
The Bottom Line
EVA turns video understanding from a brute-force task (watching everything and hoping to see something) into a strategic task (thinking about what you need, then looking only there). It's the difference between a person frantically flipping through a magazine and a detective calmly examining the evidence with a magnifying glass.
1. Problem Statement
Video understanding using Multimodal Large Language Models (MLLMs) faces significant challenges due to the long token sequences inherent in videos, which contain extensive temporal dependencies and redundant frames.
Current Limitations: Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. This leads to inefficiency, especially on long videos.
Agent Limitations: Recent "agent-based" methods introduce external tools (e.g., frame selection) but often rely on manually designed, rigid workflows and "perception-first" strategies. They typically ingest a set of uniformly sampled frames before reasoning, leading to redundant visual processing and limited reasoning efficiency.
Core Question: How can an MLLM-based agent autonomously decide what to watch, when to watch, and how to watch (resolution/fps) to answer queries efficiently?
2. Methodology: The EVA Framework
The authors propose EVA, an Efficient Reinforcement Learning framework for an End-to-End Video Agent that adopts a "Planning-Before-Perception" paradigm.
A. Core Paradigm: Planning-Before-Perception
Unlike traditional methods that feed visual data first, EVA forces the agent to reason solely from the textual query initially. The agent operates in an iterative loop:
Summary: Analyze the query and current context.
Planning: Decide what visual information is needed (time range, resolution, frame count).
Action: Call a flexible tool to extract specific frames.
Reflection: Evaluate if the gathered evidence is sufficient; if not, repeat the loop.
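As a rough sketch, this loop can be written in Python. Everything below (the `ToyVideo`, the coarse-to-fine sampling heuristic) is an illustrative stand-in for the trained MLLM policy and the real frame tool, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FramePlan:
    """One planning step: which window to watch, how densely, at what scale."""
    start_time: int
    end_time: int
    nframes: int
    resize: float = 1.0  # spatial downsampling ratio (unused in this toy)

class ToyVideo:
    """A stand-in 'video' where each second carries a text label."""
    def __init__(self, duration: int, event_second: int):
        self.labels = ["nothing"] * duration
        self.labels[event_second] = "theft"

    def fetch_frames(self, plan: FramePlan) -> List[Tuple[int, str]]:
        # Sample nframes roughly evenly across the requested window.
        step = max(1, (plan.end_time - plan.start_time) // plan.nframes)
        return [(t, self.labels[t])
                for t in range(plan.start_time, plan.end_time, step)]

def answer_query(video: ToyVideo, duration: int,
                 max_steps: int = 10) -> Optional[int]:
    """Plan -> act -> reflect: start with a coarse scan, and only sample
    more densely when reflection finds the evidence insufficient."""
    nframes = 4
    for _ in range(max_steps):
        plan = FramePlan(0, duration, nframes)      # Planning
        frames = video.fetch_frames(plan)           # Action (tool call)
        hits = [t for t, label in frames if label == "theft"]
        if hits:                                    # Reflection: enough evidence?
            return hits[0]
        nframes *= 2  # not enough: sample this window more densely next round
    return None
```

In this toy, a 3,600-second video with the event at second 1350 is solved after watching only 12 frames (4 coarse, then 8 denser), which is the spirit of the "watch only what you need" loop.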
The Tool: EVA uses a flexible frame-selection tool allowing control over:
start_time / end_time: Temporal window.
nframes: Number of frames to sample.
resize: Spatial downsampling ratio (zoom level). This allows the agent to trade off temporal coverage vs. spatial detail dynamically.
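A back-of-the-envelope token count makes the trade-off concrete. The formula below (tokens proportional to frame count times the resized patch grid, with a 28-pixel patch as in Qwen-style vision encoders) is an assumption for illustration, not the paper's exact accounting:

```python
def visual_tokens(nframes: int, height: int, width: int,
                  resize: float, patch: int = 28) -> int:
    """Rough visual-token count for one tool call: frames x patch grid.
    The 28-pixel patch size is an assumed, Qwen-style value."""
    h, w = int(height * resize), int(width * resize)
    return nframes * (h // patch) * (w // patch)

# Two strategies on a 1280x720 video:
overview = visual_tokens(nframes=32, height=720, width=1280, resize=0.25)
zoom     = visual_tokens(nframes=4,  height=720, width=1280, resize=1.0)
# A wide low-res scan (32 frames) can cost fewer tokens than
# a narrow full-res zoom (4 frames), so the agent can afford both.
```

The point is that `nframes` and `resize` pull on the same token budget, which is exactly the temporal-coverage vs. spatial-detail dial the agent learns to turn.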
B. Three-Stage Training Pipeline
To train such an autonomous agent, the authors designed a simple yet effective three-stage pipeline:
Stage 1: Supervised Fine-Tuning (SFT)
Goal: Teach the agent the structured agentic format through imitation of teacher trajectories.
Data: EVA-SFT (10k samples) generated by a teacher model (Qwen2.5-VL-72B) using prompts for "Past Success," "Workflow Hints," and "Reflective Thinking."
Format: Summary + Planning + Action + Reflection.
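One EVA-SFT sample might look roughly like the following. All field names and values here are hypothetical; the source specifies only the four-part Summary + Planning + Action + Reflection format:

```python
# Hypothetical shape of one EVA-SFT training sample (illustrative only).
sft_sample = {
    "question": "Who takes the jewelry box, and at what time?",
    "trajectory": [
        {
            "summary": "The question asks who removes the box and when; "
                       "no frames have been seen yet.",
            "planning": "Scan the whole video coarsely to locate the box.",
            "action": {"tool": "select_frames",
                       "start_time": 0, "end_time": 7200,
                       "nframes": 16, "resize": 0.25},
            "reflection": "The box appears on a table around 52:00; "
                          "zoom into that window at full resolution.",
        },
        {
            "summary": "The box was last seen intact near 52:00.",
            "planning": "Sample densely from 51:30 to 53:00 to catch the grab.",
            "action": {"tool": "select_frames",
                       "start_time": 3090, "end_time": 3180,
                       "nframes": 12, "resize": 1.0},
            "reflection": "A person in a grey coat takes the box at 52:14; "
                          "the evidence is sufficient to answer.",
        },
    ],
    "answer": "The person in the grey coat, at 52:14.",
}
```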
Stage 2: Kahneman–Tversky Optimization (KTO)
Goal: Guide the agent to prefer effective strategies and avoid common failure modes (e.g., guessing without evidence, over-sampling) before online RL.
Mechanism: Uses single-sample preference labels ("chosen" vs. "rejected") rather than pairwise comparisons (unlike DPO).
Data: EVA-KTO (11k samples) containing successful trajectories and specific failure cases (e.g., answering without enough visual tokens).
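A stripped-down version of the KTO objective shows why single-sample labels suffice. This sketch omits the full objective's KL reference-point estimation and asymmetric desirable/undesirable weights; the `beta` value and zero reference point are illustrative:

```python
import math

def kto_value(logp_policy: float, logp_ref: float, label: str,
              beta: float = 0.1, ref_point: float = 0.0) -> float:
    """Per-sample KTO value (simplified): a 'chosen' trajectory is scored
    for its implied reward rising above the reference point, a 'rejected'
    one for staying below it. No pairwise comparison is needed, unlike DPO."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    reward = beta * (logp_policy - logp_ref)  # implied reward vs. reference model
    if label == "chosen":
        return sigmoid(reward - ref_point)
    return sigmoid(ref_point - reward)

def kto_loss(samples) -> float:
    """Mean (1 - value): minimized when chosen trajectories become more
    likely under the policy and rejected ones less likely."""
    return sum(1.0 - kto_value(*s) for s in samples) / len(samples)
```

Because each trajectory only needs a "chosen" or "rejected" tag, failure cases like evidence-free guessing can be fed in directly without constructing a matched preferred counterpart.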
Stage 3: Group Relative Policy Optimization (GRPO)
Goal: Online reinforcement learning to optimize the policy for both open-ended and multiple-choice questions.
Mechanism: The model generates multiple rollouts per question. A data-enhanced approach is used: failure cases from the current policy prompt a teacher model to generate new, harder QA pairs, creating a dynamic training set.
Reward design:
Accuracy: ROUGE score for open-ended questions; Completeness Self-Verification (CSV) for multiple-choice.
Format: A small penalty/reward to prevent "reward hacking" (guessing answers without proper tool usage).
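The group-relative part of GRPO, and the shape of a reward combining accuracy and format terms, can be sketched as follows. The 0.1 format bonus is an assumed value for illustration, not the paper's coefficient:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages: each rollout's reward is normalized by the
    mean and std of its own group, so no learned value network is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def trajectory_reward(accuracy: float, used_tool: bool,
                      format_bonus: float = 0.1) -> float:
    """Illustrative reward: accuracy term plus a small format term that
    penalizes answering without any tool call (anti reward-hacking)."""
    return accuracy + (format_bonus if used_tool else -format_bonus)

# Four rollouts for one question: two correct with evidence,
# one correct-but-guessed (no tool call), one wrong.
rewards = [trajectory_reward(1.0, True), trajectory_reward(1.0, True),
           trajectory_reward(1.0, False), trajectory_reward(0.0, True)]
advs = grpo_advantages(rewards)
```

Note that the guessed-correct rollout earns a strictly lower advantage than the evidence-backed correct ones, which is how the format term steers the policy away from answering blind.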
3. Key Contributions
Novel Planning-Before-Perception Framework: EVA shifts the paradigm from passive recognition to active, adaptive agents that plan their visual observation strategy before consuming tokens.
Three-Stage End-to-End Training: A scalable pipeline combining SFT, KTO (for strategy correction), and GRPO (for policy optimization) that bridges supervised imitation and reinforcement learning.
High-Quality Datasets: Construction of EVA-SFT, EVA-KTO, and EVA-RL datasets specifically designed to support stable agentic training.
Efficiency & Performance: Demonstrates that reasoning-driven visual planning can achieve higher accuracy with significantly fewer visual tokens compared to brute-force sampling.
4. Experimental Results
EVA was evaluated on six benchmarks: LSDBench, LongVideoBench, MLVU, VideoMME, LVBench, and Video-Holmes.
Sampling Dilemma (LSDBench):
EVA achieved 51.8% accuracy using only 6.2K visual tokens.
This surpasses the Qwen2.5-VL baseline (50.1%) and approaches closed-source models like Gemini-2.0-Flash (56.2%), which uses 696.6K tokens.
Significance: Proves EVA solves the sampling dilemma by selecting only relevant frames.
Long-Form Video Understanding:
EVA outperformed existing open-source and adaptive agent baselines (e.g., VideoAgent, FrameThinker) by 1–3% on LongVideoBench, MLVU, VideoMME, and LVBench.
It achieved this while processing only 20–30 frames per video on average, compared to hundreds or thousands for static sampling methods.
Zero-Shot Reasoning (Video-Holmes):
EVA achieved competitive performance (37.2%) in a zero-shot setting, demonstrating strong transferability to complex reasoning tasks (social reasoning, causal inference) without task-specific supervision.
Efficiency Analysis:
The inference runtime is dominated by the compact set of adaptively selected visual tokens, not the number of reasoning steps.
EVA reduces visual token usage by ~90% compared to dense sampling baselines while maintaining or improving accuracy.
5. Significance and Impact
Paradigm Shift: EVA moves video understanding from "perception-first" (passive) to "planning-first" (active), allowing models to act as autonomous "watchers" that decide how to observe.
Scalability: By drastically reducing the number of visual tokens required, EVA makes long-video understanding feasible on standard hardware without relying on massive context windows.
Generalization: The framework shows that agents can learn to balance computational cost and reasoning depth, adapting their strategy (e.g., low-res overview vs. high-res zoom) based on the specific query.
Future Direction: The work highlights the potential of RL-based training pipelines for developing truly autonomous multimodal agents capable of complex, multi-step interactions with dynamic environments.