APPO: Attention-guided Perception Policy Optimization for Video Reasoning

Imagine you are trying to solve a complex mystery based on a video clip. You have two main skills to rely on: Observation (seeing exactly what is happening in the video) and Reasoning (using logic to figure out the answer).

For a long time, researchers thought the key to getting better at these video puzzles was to make the "Reasoning" brain smarter—like hiring a more experienced detective. But this paper, APPO, argues that we've been looking at the wrong part of the problem.

Here is the simple breakdown of what the authors discovered and how they fixed it:

1. The Big Discovery: "The Eyes Matter More Than the Brain"

The authors ran a revealing experiment. They split the job into two parts, a "brain" (a reasoning model) and "eyes" (a vision model), and tried different combinations of the two.

Scenario A: They kept the "eyes" fixed and made the "brain" much smarter (upgrading from a standard model to a super-smart one). Result: The score barely went up (only 0.7%).
Scenario B: They kept the "brain" the same but gave it better "eyes" (upgrading the vision model). Result: The score rose by 1.4%, twice the gain from the smarter "brain."

The Analogy: Imagine trying to solve a jigsaw puzzle in a dark room.

Improving Reasoning is like hiring a genius who knows exactly how the puzzle should fit together, but they still can't see the pieces because the lights are off.
Improving Perception is like turning on a bright light. Suddenly, the same genius can see the pieces clearly and solve the puzzle much faster.

The paper concludes: In video tasks, seeing clearly is more important than thinking harder.

2. The Problem: The "Lazy Detective"

Current AI training methods (like GRPO and DAPO) are a bit like a teacher who only gives a grade at the very end of a test.

The Old Way: The AI watches a video, guesses an answer, and gets a "Pass" or "Fail."
The Flaw: If the AI fails, it doesn't know why. Did it miss the cat jumping? Did it confuse the order of events? The teacher just says "Wrong," and the AI has to guess what to fix. That is inefficient, and the usual remedy is expensive: humans have to write detailed notes on every frame to show the AI what it missed.

3. The Solution: APPO (The "Spotlight" Teacher)

The authors created a new training method called APPO (Attention-guided Perception Policy Optimization). Instead of just grading the final answer, APPO acts like a spotlight teacher who watches the AI's thought process in real-time.

Here is how it works, step-by-step:

Step 1: The Group Debate. The AI generates several different answers (a group of "detectives"). Some get a high score (correct), and some get a low score (wrong).
Step 2: The Spotlight. The system looks at the "winners" (high-scoring answers) and asks: "What specific frames in the video were you looking at when you got this right?" It uses the AI's internal "attention" (where it was looking) to find these crucial moments.
Step 3: The Comparison. It then looks at the "losers" (low-scoring answers) and sees what they were looking at.
- Example: The winner noticed the "blue cat turned its head." The loser missed that and thought the cat was sleeping.
Step 4: The Reward. Instead of just rewarding the whole answer, APPO gives a bonus specifically to the tiny words (tokens) in the winners' thought process that were looking at the "blue cat." It down-weights the tokens in the losing answers that were looking at the wrong things.

The Analogy:
Imagine a classroom where students are solving a math problem.

Old Method: The teacher says, "You got it wrong. Try again." (The student doesn't know if they messed up the addition or the multiplication).
APPO Method: The teacher walks around and says, "Hey, Student A, you looked at the correct numbers and added them right! Good job! Student B, you were looking at the wrong numbers. Stop looking there!"
This teaches the AI to pay attention to the right details without needing a human to write a 10-page essay explaining every mistake.

4. Why This is a Game-Changer

Cheaper: You don't need expensive human annotators to label every single frame of a video. The AI teaches itself what to look for by comparing its own good and bad guesses.
Smarter: It forces the AI to become a better observer. By focusing on the "intra-group perception tokens" (the words, across different answers, that attend to the same key frame), the AI learns to spot tiny details like a kitten yawning or a cat turning its head.
Better Results: In tests, this method beat the previous best methods (GRPO and DAPO) consistently, especially on tricky videos where missing a small detail leads to a wrong answer.

Summary

The paper says: Stop trying to make the AI think harder; start teaching it to see better.

APPO is a clever training trick that uses the AI's own "correct" guesses to shine a spotlight on the important parts of a video, teaching the model to notice the tiny details that make the difference between a right and a wrong answer. It's like turning on the lights in that dark puzzle room.

1. Problem Statement

The paper addresses a critical bottleneck in Complex Video Reasoning for Multimodal Large Language Models (MLLMs). While Reinforcement Learning with Verifiable Rewards (RLVR) has successfully enhanced reasoning capabilities in text-based tasks (e.g., via GRPO and DAPO), its application to video reasoning has been limited.

The authors identify two fundamental issues:

Perception vs. Reasoning Imbalance: Complex video reasoning relies heavily on fine-grained perception (e.g., noticing specific actions, object states, or temporal sequences) rather than just high-level logical reasoning. Existing RL methods often fail to improve perception because they rely on sparse outcome rewards (e.g., final answer correctness), which provide insufficient guidance for specific perceptual errors.
Cost of Annotation: Enhancing perception typically requires expensive, fine-grained annotations (e.g., frame-level bounding boxes or timestamps) or additional reward models, which are computationally costly and difficult to scale.
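To make the sparse-reward issue concrete, here is a minimal Python sketch of a GRPO-style group-relative advantage. It is an illustration, not code from the paper: the function names, the toy rewards, and the choice of the population standard deviation are assumptions. It shows that each response receives a single scalar, which is then copied to every token in that response.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, response_lengths):
    """Every token in a response inherits that response's single advantage."""
    return [[a] * n for a, n in zip(advantages, response_lengths)]


# Toy group of four responses to one video question (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]  # 1.0 = final answer correct
lengths = [3, 2, 4, 3]          # response lengths in tokens
adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)
```

Because `token_adv[i]` holds one repeated value, a token that describes the right frame and a token that describes the wrong one receive the same learning signal. That is the gap a token-level weighting scheme is meant to close.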

Core Research Questions:

Is enhancing perception or reasoning more critical for video reasoning performance?
How can we optimize fine-grained perception during reasoning without relying on expensive annotations?

2. Key Empirical Observation

Before proposing a solution, the authors conducted a "divide-and-conquer" analysis by decoupling perception and reasoning modules. They cross-combined four perception models (varying in scale) with four reasoning models.

Finding: Enhancing perception yields larger performance gains than enhancing reasoning (twice as large in the comparison below).
- Example: Upgrading the reasoning model from Qwen3-8B to OpenAI-o3 (with fixed perception) improved performance by only 0.7%.
- Example: Upgrading the perception model from 7B to 32B (with fixed reasoning) improved performance by 1.4%.
Conclusion: The primary bottleneck in video reasoning is perception, not reasoning. Therefore, optimization strategies must target fine-grained perception.
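One way to summarize a cross-combination grid like this is to compare the gain along each axis. The sketch below is a hypothetical helper, not the authors' analysis code, and the toy grid contains made-up numbers rather than results from the paper. It assumes `acc[p][r]` is the accuracy obtained with perception model `p` and reasoning model `r`, both ordered from weakest to strongest.

```python
def marginal_gains(acc):
    """Return (perception_gain, reasoning_gain) for an accuracy grid.

    acc[p][r]: accuracy with perception model p and reasoning model r,
    each axis ordered weakest to strongest (an assumed layout).
    """
    n_p, n_r = len(acc), len(acc[0])
    # Strongest vs. weakest perception model, averaged over reasoning models.
    perception_gain = sum(acc[-1][r] - acc[0][r] for r in range(n_r)) / n_r
    # Strongest vs. weakest reasoning model, averaged over perception models.
    reasoning_gain = sum(acc[p][-1] - acc[p][0] for p in range(n_p)) / n_p
    return perception_gain, reasoning_gain


# Toy 2x2 grid with made-up accuracies (NOT results from the paper).
toy = [[50.0, 51.0],
       [53.0, 54.0]]
p_gain, r_gain = marginal_gains(toy)
```

If `p_gain` exceeds `r_gain`, as it does in this toy grid, the grid points to perception as the larger lever, which is the pattern the authors report.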

3. Methodology: APPO Algorithm

The authors propose APPO (Attention-guided Perception Policy Optimization), an algorithm designed to enhance fine-grained perception through reasoning using token-level dense rewards, without extra annotation.

Core Mechanism

APPO transforms sparse outcome rewards into dense, frame-level guidance signals by leveraging the attention mechanisms of the model. It operates in two main steps:

Step 1: Attention-Guided Frame Selection

Grouping: A group of $G$ responses is generated for a single video question. These are split into two sets based on reward scores: high-reward ($S_1$) and low-reward ($S_2$).
Attention Tracking: The algorithm tracks attention weights from response tokens to visual tokens (video frames).
Frame Identification:
- High-reward responses are assumed to focus on the "correct" or "crucial" frames.
- Low-reward responses are assumed to miss or misinterpret these frames.
- The algorithm identifies the set of frames ($\psi'$) that high-reward responses focus on but low-reward responses miss (using Hard, Soft, or All selection strategies).
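Step 1 can be sketched in Python under several assumptions that are not taken from the paper: attention is already aggregated per frame (over heads, layers, and each frame's visual tokens), the group is split at its mean reward, and a frame counts as crucial when it is among the `top_k` frames with a positive high-minus-low attention gap. This rule is a stand-in; the Hard, Soft, and All strategies are not specified here, and all function names are hypothetical.

```python
def frame_focus(attn):
    """attn[t][f]: attention from response token t to frame f.
    Returns the mean attention each frame receives from one response."""
    n_tokens, n_frames = len(attn), len(attn[0])
    return [sum(attn[t][f] for t in range(n_tokens)) / n_tokens
            for f in range(n_frames)]


def group_focus(attns):
    """Average the per-frame focus over all responses in one set."""
    per_response = [frame_focus(a) for a in attns]
    n_frames = len(per_response[0])
    return [sum(fr[f] for fr in per_response) / len(per_response)
            for f in range(n_frames)]


def select_crucial_frames(attns, rewards, top_k=1):
    """Frames the high-reward set attends to more than the low-reward set."""
    threshold = sum(rewards) / len(rewards)  # assumption: split at group mean
    high = [a for a, r in zip(attns, rewards) if r > threshold]
    low = [a for a, r in zip(attns, rewards) if r <= threshold]
    if not high or not low:
        return []  # all rewards equal: no contrast to learn from
    gaps = [h - l for h, l in zip(group_focus(high), group_focus(low))]
    ranked = sorted(range(len(gaps)), key=lambda f: gaps[f], reverse=True)
    return sorted(f for f in ranked[:top_k] if gaps[f] > 0)


# Toy group: two responses, two tokens each, three frames (made-up values).
attn_correct = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]  # mostly on frame 1
attn_wrong = [[0.6, 0.2, 0.2], [0.5, 0.1, 0.4]]    # mostly on frame 0
crucial = select_crucial_frames([attn_correct, attn_wrong], [1.0, 0.0])
```

In this toy group the correct response concentrates on frame 1 while the wrong one concentrates on frame 0, so frame 1 is the only frame selected.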

Step 2: Intra-Group Perception Tokens Re-weighting

Definition: Tokens from different responses that focus on the same crucial frame are grouped together as "Intra-group Perception Tokens."
Discrepancy Measurement: The algorithm calculates the Kullback-Leibler (KL) divergence between the probability distributions of tokens within a group. This measures the discrepancy in how different responses perceive the same frame.
Re-weighting:
- Tokens from high-reward paths (which correctly attended to the frame) are assigned higher weights.
- Tokens from low-reward paths are suppressed.
- A weight $W$ is computed for each token based on this divergence, modulated by a hyperparameter $\alpha$.
Optimization: The policy model is optimized using a modified loss function where the standard advantage $A_i$ is multiplied by the token-level weight $W$. This forces the model to prioritize learning the specific tokens that correspond to crucial visual frames.
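The re-weighting in Step 2 can be sketched as follows. The exact formula for $W$ is not given above, so this block makes explicit assumptions: the reference distribution is the mean of the group's token distributions, the KL divergence is squashed into $[0, 1)$, and the weight is $1 + \alpha s$ for tokens from high-reward responses and $1 - \alpha s$ for tokens from low-reward responses. All function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[k] > 0 wherever p[k] > 0."""
    total = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)
    return max(0.0, total)  # guard against tiny negative rounding error


def mean_distribution(dists):
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]


def perception_weights(dists, is_high, alpha=0.5):
    """One weight per token in an intra-group perception token set.

    dists[j]: next-token distribution at token j (all tied to the same frame).
    is_high[j]: True if token j comes from a high-reward response.
    """
    ref = mean_distribution(dists)  # assumption: compare against the group mean
    weights = []
    for p, high in zip(dists, is_high):
        d = kl_divergence(p, ref)
        s = d / (1.0 + d)           # assumption: squash divergence into [0, 1)
        weights.append(1.0 + alpha * s if high else 1.0 - alpha * s)
    return weights


def reweight_advantages(advantages, weights):
    """Token-level advantage: the response advantage A_i times the weight W."""
    return [a * w for a, w in zip(advantages, weights)]


# Toy set: two tokens over a two-word vocabulary (made-up values).
dists = [[0.9, 0.1], [0.2, 0.8]]
weights = perception_weights(dists, is_high=[True, False], alpha=0.5)
weighted = reweight_advantages([1.0, -1.0], weights)
```

With $0 \le \alpha \le 1$ the weights stay positive, so re-weighting changes the magnitude of each token's advantage but never its sign. When every response perceives the frame identically, the divergence is zero and all weights reduce to 1, which recovers the unweighted advantage.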

4. Key Contributions

Quantitative Insight: Provided empirical evidence that enhancing perception is more impactful than enhancing reasoning for video tasks, challenging the prevailing focus on pure reasoning optimization.
APPO Algorithm: Proposed a novel RLVR algorithm that generates token-level fine-grained rewards directly from sparse outcome rewards using attention mechanisms, eliminating the need for expensive frame-level annotations or external reward models.
Low-Cost Enhancement: Demonstrated a method to jointly improve perception and reasoning capabilities in a cost-effective manner.

5. Experimental Results

The authors evaluated APPO on diverse benchmarks (SEED-Bench-R1, Perception Test, NExT-GQA, VSI-Bench, MVBench, NExT-QA) using Qwen2.5-VL models (3B and 7B).

Performance Gains: APPO consistently outperformed strong baselines (GRPO, DAPO, and SFT).
- On SEED-Bench-R1, APPO improved performance by 0.5% to 4% over DAPO.
- On NExT-GQA (a fine-grained spatiotemporal task), APPO improved mIoU (a 1.0% gain on the 3B model), directly supporting the claim of improved perception.
Generalization: APPO showed superior performance on Out-of-Distribution (OOD) data (Level-2 and Level-3), suggesting the model learned more robust perceptual features rather than overfitting to specific patterns.
Efficiency: APPO achieved these results with a smaller training dataset (34K samples) compared to other video reasoning models trained on 260K+ samples.
Training Dynamics: Analysis showed APPO maintains higher generation entropy and gradient norms during training, indicating a larger exploration space and more stable learning of perceptual signals.
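For readers unfamiliar with the entropy measurement mentioned under Training Dynamics, generation entropy is commonly computed as the Shannon entropy of the model's next-token distribution, averaged over generated tokens. The sketch below shows that standard calculation; it is not the authors' evaluation code, and the function names and toy logits are assumptions.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one generation step."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_generation_entropy(step_logits):
    """Average per-token entropy over all steps of one generated response."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


# Toy logits over a four-word vocabulary (made-up values).
uniform_step = [0.0, 0.0, 0.0, 0.0]  # model is maximally uncertain
peaked_step = [10.0, 0.0, 0.0, 0.0]  # model is nearly certain
avg = mean_generation_entropy([uniform_step, peaked_step])
```

A higher average means the policy still spreads probability across several candidate tokens, which is how the summary reads "a larger exploration space."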

6. Significance

This work argues for a shift in how video reasoning is optimized. Instead of merely scaling up reasoning models or designing complex reward functions for final answers, APPO provides evidence that optimizing the "senses" (perception) of the model via reasoning feedback is the more effective path.

By leveraging attention mechanisms to create dense rewards from sparse signals, APPO offers a scalable, low-cost solution for training MLLMs to understand complex video content, making it highly applicable to scenarios requiring precise temporal and spatial understanding (e.g., surveillance, autonomous driving, and detailed video analysis).
