Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting
The paper proposes Graph-of-Mark (GoM), a pixel-level visual prompting technique that overlays a scene graph directly on the input image so that relationships between objects are drawn explicitly, improving the spatial reasoning and zero-shot performance of multimodal language models.
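To make the idea of overlaying a scene graph at the pixel level concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy grayscale image stored as a list of rows, objects given as labeled bounding boxes, and relations given as (subject, predicate, object) triples. The names `Obj`, `center`, `line_pixels`, and `overlay_scene_graph` are hypothetical and chosen only for illustration. Edges are drawn between box centers with an integer line algorithm, and a text legend records the labeled relations.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical scene-graph node: a label and an inclusive pixel box."""
    label: str
    box: tuple  # (x0, y0, x1, y1)


def center(box):
    """Integer center of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def line_pixels(p, q):
    """Pixels on the Bresenham line from p to q, endpoints included."""
    (x0, y0), (x1, y1) = p, q
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pts = []
    while True:
        pts.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pts


def overlay_scene_graph(image, objects, relations, edge_value=255):
    """Draw one edge per relation on a copy of the image.

    image:     list of rows of grayscale values (assumed toy format)
    objects:   dict mapping a numeric mark id to an Obj
    relations: list of (subject_id, predicate, object_id) triples
    Returns the marked copy and a list of legend strings.
    """
    out = [row[:] for row in image]  # leave the input untouched
    legend = []
    for subj, pred, obj in relations:
        a = center(objects[subj].box)
        b = center(objects[obj].box)
        for x, y in line_pixels(a, b):
            out[y][x] = edge_value
        legend.append(
            f"[{subj}] {objects[subj].label} --{pred}--> "
            f"[{obj}] {objects[obj].label}"
        )
    return out, legend


# Toy example: an 8x8 black image with two objects and one relation.
image = [[0] * 8 for _ in range(8)]
objects = {1: Obj("cup", (0, 0, 2, 2)), 2: Obj("table", (4, 4, 6, 6))}
relations = [(1, "on", 2)]
marked, legend = overlay_scene_graph(image, objects, relations)
```

In a real pipeline the edges and labels would be rendered onto an RGB image with an imaging library, and the objects and relations would come from a detector and a scene-graph generator; both are outside the scope of this sketch.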