VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
This paper introduces VirtueBench, a benchmark for evaluating the trustworthiness of Vision-Language Models (VLMs) in long video understanding. It distinguishes answerable from unanswerable cases, so that a model that guesses under uncertainty cannot earn a misleadingly high accuracy score.
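To make the answerable/unanswerable distinction concrete, here is a minimal sketch of scoring the two kinds of cases separately. It is not the paper's metric or data format: the `Item` fields, the `ABSTAIN` label, and the per-subset accuracy rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical label a model should output when the video does not
# contain enough evidence to answer (an assumption, not from the paper).
ABSTAIN = "unanswerable"


@dataclass
class Item:
    answerable: bool  # whether the question can be answered from the video
    gold: str         # correct option, or ABSTAIN for unanswerable items
    pred: str         # the model's output


def split_accuracy(items):
    """Return accuracy separately for answerable and unanswerable items."""
    def acc(subset):
        if not subset:
            return float("nan")
        return sum(i.pred == i.gold for i in subset) / len(subset)

    answerable = [i for i in items if i.answerable]
    unanswerable = [i for i in items if not i.answerable]
    return {"answerable": acc(answerable), "unanswerable": acc(unanswerable)}


# Toy data: a model that always answers "A" and never abstains.
items = [
    Item(True, "A", "A"),
    Item(True, "A", "A"),
    Item(True, "B", "A"),
    Item(True, "A", "A"),
    Item(False, ABSTAIN, "A"),
    Item(False, ABSTAIN, "A"),
    Item(False, ABSTAIN, "A"),
    Item(False, ABSTAIN, "A"),
]
scores = split_accuracy(items)
```

In this toy example the always-guessing model gets 3 of 4 answerable items right but 0 of 4 unanswerable items, so reporting the two subsets separately exposes behaviour that an answerable-only score would hide.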