InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
This paper presents InterCoG, a text-vision interleaved chain-of-grounding reasoning framework for fine-grained image editing in complex multi-entity scenes. InterCoG first deduces the target's location explicitly through text-based spatial reasoning, then performs visual grounding and specifies the editing outcome. A new dataset and auxiliary training modules support the framework's spatial precision.
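To make the ordering of the reasoning steps concrete, the following is a minimal illustrative sketch of a three-step chain: text-based spatial reasoning, then visual grounding, then outcome specification. It is not the paper's implementation. The names `GroundingChain` and `build_chain`, the bounding-box scene format, and the left-to-right ranking heuristic are all assumptions introduced here for illustration; the actual framework uses a learned model for each step.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class GroundingChain:
    """Hypothetical container for one interleaved reasoning trace."""
    spatial_reasoning: str  # step 1: text-only deduction of where the target is
    target_box: Box         # step 2: visual grounding of the deduced target
    outcome: str            # step 3: description of the intended edited result


def build_chain(instruction: str, scene: Dict[str, Box],
                target: str, outcome: str) -> GroundingChain:
    """Assemble the three steps in order for a toy, pre-annotated scene.

    The spatial reasoning here is a stand-in: it ranks entities from left
    to right by x_min, which a real system would replace with model output.
    """
    ordered = sorted(scene, key=lambda name: scene[name][0])
    rank = ordered.index(target) + 1
    reasoning = (
        f"Instruction: {instruction!r}. "
        f"'{target}' is entity {rank} of {len(ordered)} from the left."
    )
    return GroundingChain(reasoning, scene[target], outcome)
```

In this sketch the textual deduction is produced before the box is attached, mirroring the order described above: the location is reasoned about in text first, and only then grounded visually and paired with an outcome description.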