Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
This paper identifies "visual shortcuts" in existing Multimodal Knowledge-Based Visual Question Answering benchmarks, where models can answer correctly without genuine knowledge-based reasoning. To address this, it introduces the RETINA dataset, which requires reasoning about entities related to those depicted, and proposes the MIMIR model, which leverages multi-image retrieval to overcome these shortcuts.