EVA: Efficient Reinforcement Learning for End-to-End Video Agent

EVA is an efficient reinforcement learning framework for end-to-end video agents. Through a "planning-before-perception" reasoning loop and a novel three-stage training pipeline, it achieves query-driven, adaptive video understanding and outperforms existing baselines by 6–12% across six benchmarks.

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

Published 2026-03-25

Imagine you are trying to solve a mystery in a massive, 3-hour-long movie. You have a question: "Who stole the diamond, and exactly when did they do it?"

The Old Way: The "Blind Fumble"

Most current AI video agents are like a detective forced to watch the entire movie at once, but only through a tiny, blurry keyhole.

  • The Problem: If they try to watch the whole thing in high definition, their brain (computer memory) explodes. So, they usually just take 10 random snapshots from the beginning, middle, and end.
  • The Result: They miss the crucial scene where the theft happened because it wasn't in their random snapshots. Or, they waste hours watching scenes of people eating lunch that have nothing to do with the crime. They are passive; they just wait for the video to be fed to them.

The New Way: EVA (The "Smart Detective")

The paper introduces EVA (Efficient Video Agent). Think of EVA not as a camera, but as a smart, strategic detective who knows how to use a remote control.

EVA follows a simple philosophy: "Plan before you look."

Instead of staring at the screen immediately, EVA does this:

  1. Reads the Clue: It looks at your question first.
  2. Makes a Plan: It thinks, "Okay, the question is about a theft. I don't need to watch the whole movie. I should probably look at the scene where the party starts, then zoom in on the jewelry box."
  3. Takes Action: It uses a tool to grab only the specific 5 seconds of video it needs, at high quality.
  4. Reflects: It looks at those 5 seconds. "Hmm, I see a hand, but I can't see the face. I need to zoom in closer on the next 10 seconds."
  5. Repeats: It keeps doing this—Plan, Watch, Reflect, Zoom—until it has the answer.
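The plan-watch-reflect loop above can be sketched as a toy Python program. Everything here (`plan_next_query`, `fetch_clip`, `reflect`, the step budget) is a hypothetical stand-in for the agent's actual tools, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    confident: bool
    answer: str

def plan_next_query(question, evidence):
    # Hypothetical planner: probe a later 5-second window each round.
    return 60 * len(evidence), 5  # (start_sec, duration_sec)

def fetch_clip(video, start, duration):
    # Hypothetical tool call: slice out only the requested seconds.
    return video[start:start + duration]

def reflect(question, evidence):
    # Hypothetical reflection: answer once any clip shows the theft.
    if any("theft" in clip for clip in evidence):
        return Verdict(True, "the butler, at the 60-second mark")
    return Verdict(False, "unknown")

def answer_question(question, video, max_steps=8):
    """Plan, fetch a small clip, reflect; repeat until confident."""
    evidence = []
    for _ in range(max_steps):
        start, duration = plan_next_query(question, evidence)  # make a plan
        evidence.append(fetch_clip(video, start, duration))    # take action
        verdict = reflect(question, evidence)                  # reflect
        if verdict.confident:
            return verdict.answer
    return reflect(question, evidence).answer  # best guess at budget end
```

The key design point is that the video never enters the loop whole: each iteration touches only the few seconds the plan asked for.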

The Secret Sauce: The Three-Stage Training

How did they teach a computer to be this smart? They didn't just give it a textbook; they trained it like a human apprentice through three stages:

  1. Stage 1: The "Copycat" Phase (SFT)

    • Analogy: Like a student copying the teacher's homework.
    • What happened: They showed the AI thousands of examples of good detectives solving cases. The AI learned the format: "First I think, then I ask for a video clip, then I look, then I answer." It learned the rules of the game.
  2. Stage 2: The "Correction" Phase (KTO)

    • Analogy: Like a coach saying, "Stop guessing! Look at the evidence!"
    • What happened: The AI started making mistakes. Sometimes it guessed the answer without looking, or it looked at the wrong part of the video. The researchers showed it examples of these failures and said, "No, that's a bad strategy." This taught the AI what not to do.
  3. Stage 3: The "Trial by Fire" Phase (GRPO)

    • Analogy: Like a video game where you get points for winning and lose points for wasting time.
    • What happened: The AI was put in a simulation where it had to solve video puzzles. If it found the answer quickly and accurately, it got a "reward." If it wasted time watching irrelevant scenes or guessed wrong, it got a "penalty." Over time, it learned to be incredibly efficient, only watching what was necessary.
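A minimal sketch of what the stage-3 reward might look like: reward a correct answer, penalize wasted watching, then compare each attempt against its group's average (the "group-relative" idea behind GRPO). The weights and the linear form are assumptions for illustration, not the paper's actual reward, and real GRPO also normalizes by the group's standard deviation:

```python
def trajectory_reward(correct, frames_viewed, frame_budget=50,
                      accuracy_weight=1.0, efficiency_weight=0.5):
    """Score one episode: accuracy minus a bounded cost for over-watching."""
    accuracy = accuracy_weight if correct else 0.0
    # Fraction of the frame budget spent, capped at 1 so the penalty is bounded.
    cost = min(frames_viewed / frame_budget, 1.0)
    return accuracy - efficiency_weight * cost

def group_advantages(rewards):
    """GRPO-style (simplified): each rollout scored relative to its group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

For example, a correct answer found after 25 frames scores higher than a correct answer that burned the whole budget, so the policy is pushed toward watching less, not just answering well.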

Why is this a Big Deal?

  • It Saves Energy: Instead of downloading and processing 10,000 frames of a video (which is slow and expensive), EVA might only look at 50 frames. It's like reading a book by skimming the chapters you need instead of reading every word of a 1,000-page novel.
  • It's Smarter: Because it plans first, it doesn't get distracted by irrelevant scenes. It knows exactly where to look.
  • It Adapts: If a question is simple, it takes a quick glance. If a question is hard, it knows to zoom in and look closely. It's not a "one-size-fits-all" robot.
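A back-of-envelope check on the savings described above, using the rough figures from the analogy (not measured numbers from the paper):

```python
full_frames = 10_000   # brute-force: decode and process everything
eva_frames = 50        # query-driven: only the clips the plan asked for

reduction = full_frames / eva_frames
print(f"{reduction:.0f}x fewer frames processed")  # 200x fewer frames processed
```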

The Bottom Line

EVA turns video understanding from a brute-force task (watching everything and hoping to see something) into a strategic task (thinking about what you need, then looking only there). It's the difference between a person frantically flipping through a magazine and a detective calmly examining the evidence with a magnifying glass.
