TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning
TikArt is a reinforcement learning-trained multimodal agent that stabilizes fine-grained visual reasoning by employing a Think-Aperture-Observe loop to sequentially acquire and linguistically encode evidence through zooming and segmentation, thereby overcoming the limitations of single-pass global image encoding.