Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
The paper presents "Think While Watching", a streaming video reasoning framework for multimodal large language models. By maintaining segment-level memory and overlapping watching with thinking, it enables efficient multi-turn interaction while improving accuracy and reducing the number of generated tokens.
Lu Wang (1), Zhuoran Jin (1), Yupu Hao (1), Yubo Chen (1), Kang Liu (1), Yulong Ao (2), Jun Zhao (1)
(1) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
(2) Beijing Academy of Artificial Intelligence
2026-03-13, cs.CL