Multi-View Based Audio Visual Target Speaker Extraction
This paper proposes Multi-View Tensor Fusion (MVTF), a novel framework that uses synchronized lip videos captured from multiple viewpoints during training to learn cross-view correlations. The learned correlations improve the performance and robustness of target speaker extraction in both single-view and multi-view inference scenarios.
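The summary above does not say how MVTF combines the views, so the following is only an illustrative sketch of one common form of tensor fusion, not the paper's implementation. It assumes each view has already been encoded into a small embedding vector, and it fuses two views by taking the outer product of the embeddings after appending a constant 1 to each. The names `augment`, `tensor_fuse`, `front`, and `side` are hypothetical.

```python
# Illustrative sketch only: a generic outer-product tensor fusion of two
# per-view lip embeddings. The actual MVTF fusion is not described in the
# summary above, so every name and shape here is an assumption.

def augment(embedding):
    """Append a constant 1 so the fused tensor keeps single-view terms."""
    return list(embedding) + [1.0]


def tensor_fuse(view_a, view_b):
    """Return the outer product of the two augmented embeddings.

    Entry [i][j] is the pairwise product a[i] * b[j], which exposes
    cross-view interactions. The last row and last column reproduce the
    augmented single-view embeddings unchanged.
    """
    a = augment(view_a)
    b = augment(view_b)
    return [[x * y for y in b] for x in a]


# Hypothetical embeddings for two synchronized camera views of one frame.
front = [0.5, -1.0]
side = [2.0, 0.0, 1.0]

fused = tensor_fuse(front, side)  # 3 x 4 nested list
```

Because of the appended constant, the last row of `fused` equals the augmented `side` embedding and the last column equals the augmented `front` embedding. A fused representation of this kind therefore carries both unimodal and cross-view terms, which is one plausible way a model trained on multiple views could still be used with a single view at inference.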