OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

This paper introduces OddGridBench, a benchmark revealing that current multimodal large language models significantly underperform humans in detecting fine-grained visual discrepancies, and proposes OddGrid-GRPO, a reinforcement learning framework that effectively enhances this sensitivity through curriculum learning and distance-aware rewards.

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming | Wed, 11 Ma | cs

NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors

The paper presents NLiPsCalib, an efficient and physics-consistent calibration framework that utilizes Near-Light Photometric Stereo and controllable light sources to enable high-fidelity 3D reconstruction of curved visuotactile sensors through simple contacts with everyday objects, thereby overcoming the cost and complexity of existing methods.

Xuhao Qin, Feiyu Zhao, Yatao Leng, Runze Hu, Chenxi Xiao | Wed, 11 Ma | cs

IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework

The paper introduces IntroSVG, an introspective framework that enhances text-to-SVG generation by employing a unified Visual Language Model in a closed-loop "generate-review-refine" cycle, where the model acts as both generator and critic to iteratively improve outputs based on visual rendering feedback.

Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao | Wed, 11 Ma | cs

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

The paper introduces See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that enhances robotic manipulation robustness by dynamically grounding instructions into spatial subgoals and enabling closed-loop error recovery through state rewinding, achieving state-of-the-art performance on challenging benchmarks without additional training.

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang | Wed, 11 Ma | cs

From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

The paper introduces Stable Video Object Removal (SVOR), a robust framework that achieves state-of-the-art, flicker-free video object removal under real-world imperfections by employing a Mask Union strategy for stable erasure, a Denoising-Aware Segmentation head for precise localization, and a Curriculum Two-Stage training approach to handle shadows, abrupt motion, and defective masks.

Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan | Wed, 11 Ma | cs

ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

ForgeDreamer is a text-to-3D generation framework for industrial applications. It addresses domain adaptation and geometric reasoning limitations by combining a Multi-Expert LoRA Ensemble for interference-free cross-category generalization with a Cross-View Hypergraph approach that captures high-order structural dependencies, with the aim of manufacturing-level precision.

Junhao Cai, Deyu Zeng, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong | Wed, 11 Ma | cs

Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

This paper introduces a large-scale framework for Vision-and-Language Navigation that leverages web-based room tour videos and implicit geometry representations to overcome simulator limitations, enabling robust zero-shot navigation agents with state-of-the-art performance across multiple benchmarks.

Mingfei Han, Haihong Hao, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev | Wed, 11 Ma | cs

Towards Instance Segmentation with Polygon Detection Transformers

This paper introduces Poly-DETR, a lightweight instance segmentation framework that reformulates the task as sparse vertex regression using polar representation and specialized attention mechanisms, achieving superior performance and reduced memory consumption compared to traditional mask-based methods, particularly in high-resolution and domain-specific scenarios.

Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li | Wed, 11 Ma | cs

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

This paper introduces Geometric Semantic Decoupling (GSD), a parameter-free module that improves the generalizability of AI-generated image detectors by explicitly removing dominant semantic priors from learned representations. This forces models to rely on robust forensic artifacts instead of resorting to "semantic fallback" when they encounter unseen generation pipelines.

Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren | Wed, 11 Ma | cs