v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
This paper introduces v-HUB, a benchmark for video humor understanding built from richly annotated, non-verbal short videos. It reveals that current multimodal large language models struggle to understand humor from visual cues alone, but perform better when environmental audio is incorporated.