GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
GroundCount is a framework that augments Vision-Language Models (VLMs) with explicit spatial grounding from object detection models to mitigate counting hallucinations. It demonstrates that structured prompt-based integration outperforms feature-level fusion and yields consistent accuracy improvements across most architectures.
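To make the idea of structured prompt-based integration concrete, the following is a minimal sketch of how detector output might be serialized into text and placed ahead of a counting question. The `Detection` schema, the `build_grounded_prompt` helper, the prompt wording, and the 0.5 confidence threshold are all illustrative assumptions, not the format used by GroundCount.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One object-detector output (hypothetical schema for illustration)."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def build_grounded_prompt(
    detections: List[Detection],
    question: str,
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Serialize detections into a text block placed before the question."""
    # Drop low-confidence detections so they do not inflate the counts.
    kept = [d for d in detections if d.confidence >= min_confidence]
    counts = Counter(d.label for d in kept)

    lines = ["Detected objects (from an object detector):"]
    for label, n in sorted(counts.items()):
        lines.append(f"- {label}: {n}")

    lines.append("Bounding boxes (x_min, y_min, x_max, y_max):")
    for d in kept:
        lines.append(f"- {d.label} {d.box} conf={d.confidence:.2f}")

    lines.append(f"Question: {question}")
    return "\n".join(lines)


detections = [
    Detection("dog", 0.91, (10, 20, 110, 220)),
    Detection("dog", 0.84, (130, 25, 240, 230)),
    Detection("cat", 0.32, (300, 40, 360, 120)),  # below threshold, dropped
]
prompt = build_grounded_prompt(detections, "How many dogs are in the image?")
print(prompt)
```

The resulting string would be passed to the VLM together with the image. Because the grounding lives entirely in the prompt, this style of integration needs no change to the model's weights or visual encoder, in contrast to feature-level fusion.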