Beyond Accuracy: Evaluating Visual Grounding in Multimodal Medical Reasoning
This paper introduces a counterfactual evaluation framework for medical visual question answering (VQA). It shows that reinforcement learning with verifiable rewards (RLVR) improves benchmark accuracy but often degrades genuine visual grounding: models learn to rely on text shortcuts and to hallucinate visual reasoning. These findings call for new evaluation metrics and training objectives that explicitly enforce visual dependence.
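To make the idea of a counterfactual check concrete, the sketch below compares a model's accuracy on the original images with its accuracy when every image is replaced by a blank placeholder. It is an illustration only and not the paper's actual metric: the names `VQAItem`, `accuracy`, and `visual_reliance_gap`, the string image identifiers, and the two toy models are all hypothetical and introduced here for the example. A gap near zero suggests the answers do not depend on the image, which is the text-shortcut behavior described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class VQAItem:
    """One evaluation example (hypothetical schema, not from the paper)."""
    image: Optional[str]  # stand-in image identifier; None means "no image"
    question: str
    answer: str


# A model is any callable mapping (image, question) to an answer string.
Model = Callable[[Optional[str], str], str]


def accuracy(model: Model, items: Sequence[VQAItem]) -> float:
    """Fraction of items the model answers exactly right."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(model(it.image, it.question) == it.answer for it in items)
    return correct / len(items)


def visual_reliance_gap(
    model: Model,
    items: Sequence[VQAItem],
    blank_image: Optional[str] = None,
) -> float:
    """Accuracy on original images minus accuracy with images blanked out.

    This is an assumed, simplified counterfactual: the question and gold
    answer are kept fixed and only the visual input is removed.
    """
    blanked = [VQAItem(blank_image, it.question, it.answer) for it in items]
    return accuracy(model, items) - accuracy(model, blanked)


# Toy data and toy models, purely illustrative.
ITEMS = [
    VQAItem("img_a", "Is there a fracture?", "yes"),
    VQAItem("img_b", "Is there a fracture?", "no"),
]


def text_shortcut_model(image: Optional[str], question: str) -> str:
    """Ignores the image and always gives the same answer."""
    return "yes"


def grounded_model(image: Optional[str], question: str) -> str:
    """Answers from the image alone; cannot answer without one."""
    return {"img_a": "yes", "img_b": "no"}.get(image, "unknown")
```

On this toy data the shortcut model scores the same with and without images, so its gap is zero, while the grounded model loses all of its accuracy once the images are blanked. In a real evaluation the blanking step would be replaced by whatever image perturbation the evaluation protocol specifies.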