Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
This paper introduces the first formal framework for Audio-Visual World Models (AVWM). It presents the AVW-4k dataset and the AV-CDiT model, which together enable high-fidelity, synchronized simulation of binaural audio and visual dynamics and significantly improve agent performance in continuous navigation tasks.