MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
This paper introduces MSVBench, the first comprehensive benchmark for evaluating multi-shot video generation, built on hierarchical scripts and a hybrid LMM-expert evaluation framework. The benchmark's scores align near-perfectly with human judgments, and its results reveal that current models lack true world-modeling capabilities.