Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
The paper introduces SpecTemp, a reinforcement learning-based framework for efficient long video understanding. SpecTemp decouples temporal perception from reasoning through a cooperative dual-model design: a lightweight draft MLLM proposes salient frames, and a more powerful target MLLM verifies them. This design significantly accelerates inference while maintaining competitive accuracy.
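
The draft-then-verify cooperation described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function names (`speculative_answer`, `toy_draft`, `toy_target`), the round limit, the per-round frame budget `k`, and the use of text captions in place of real video frames are all hypothetical, and the reinforcement learning training is not modeled at all.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def speculative_answer(
    frames: Sequence[str],
    draft_propose: Callable[[Sequence[str], List[int], int], List[int]],
    target_verify: Callable[[List[str]], Optional[str]],
    max_rounds: int = 3,
    k: int = 2,
) -> Tuple[Optional[str], List[int]]:
    """Toy draft-then-verify loop (hypothetical; not the paper's algorithm)."""
    evidence: List[int] = []  # indices of frames shown to the target so far
    for _ in range(max_rounds):
        # The cheap draft model scans the frames not yet used and proposes up to k.
        remaining = [i for i in range(len(frames)) if i not in evidence]
        proposed = draft_propose(frames, remaining, k)
        if not proposed:
            break
        evidence.extend(proposed)
        # The expensive target model reasons only over the proposed frames.
        answer = target_verify([frames[i] for i in evidence])
        if answer is not None:
            return answer, evidence
    return None, evidence


# Stand-in "frames": captions instead of images, purely for illustration.
frames = ["crowd", "kickoff", "pass", "goal by #10", "replay", "credits"]


def toy_draft(frames: Sequence[str], remaining: List[int], k: int) -> List[int]:
    # Hypothetical saliency score: frames mentioning "goal" rank first.
    ranked = sorted(remaining, key=lambda i: "goal" in frames[i], reverse=True)
    return ranked[:k]


def toy_target(selected: List[str]) -> Optional[str]:
    # Hypothetical verifier: answers only once it sees a frame showing the goal.
    for caption in selected:
        if "goal" in caption:
            return caption
    return None  # not enough evidence; ask the draft for more frames


answer, used = speculative_answer(frames, toy_draft, toy_target)
print(answer, used)
```

In this toy setup the target model inspects only two of the six frames before answering, which is the kind of saving the dual-model design is meant to provide. How the real draft and target MLLMs score, exchange, and verify frames is not specified in the summary above.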