cs.CV papers | Gist.Science

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

This paper introduces OddGridBench, a benchmark revealing that current multimodal large language models significantly underperform humans in detecting fine-grained visual discrepancies, and proposes OddGrid-GRPO, a reinforcement learning framework that effectively enhances this sensitivity through curriculum learning and distance-aware rewards.

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming | Wed, 11 Ma | cs

NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors

The paper presents NLiPsCalib, an efficient and physics-consistent calibration framework that utilizes Near-Light Photometric Stereo and controllable light sources to enable high-fidelity 3D reconstruction of curved visuotactile sensors through simple contacts with everyday objects, thereby overcoming the cost and complexity of existing methods.

Xuhao Qin, Feiyu Zhao, Yatao Leng, Runze Hu, Chenxi Xiao | Wed, 11 Ma | cs

IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework

The paper introduces IntroSVG, an introspective framework that enhances text-to-SVG generation by employing a unified Visual Language Model in a closed-loop "generate-review-refine" cycle, where the model acts as both generator and critic to iteratively improve outputs based on visual rendering feedback.

Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao | Wed, 11 Ma | cs

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

The paper introduces See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that enhances robotic manipulation robustness by dynamically grounding instructions into spatial subgoals and enabling closed-loop error recovery through state rewinding, achieving state-of-the-art performance on challenging benchmarks without additional training.

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang | Wed, 11 Ma | cs

Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking

The paper introduces MDTrack, a novel multimodal object tracking framework that achieves state-of-the-art performance by employing a Mixture of Experts for adaptive modality-aware fusion and utilizing decoupled State Space Models with cross-attention mechanisms for independent yet synergistic temporal propagation.

Shilei Wang, Pujian Lai, Dong Gao, Jifeng Ning, Gong Cheng | Wed, 11 Ma | cs

CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation

CogBlender is a novel framework that enables continuous, multi-dimensional control over the cognitive properties of text-to-image generation by mapping cognitive space to visual semantics and dynamically steering the flow-matching process through interpolated velocity fields guided by cognitive anchors.

Shengqi Dang, Jiaying Lei, Yi He, Ziqing Qian, Nan Cao | Wed, 11 Ma | cs

Learning Convex Decomposition via Feature Fields

This paper introduces a novel, self-supervised feature field learning approach that enables the first feed-forward model for open-world 3D convex decomposition, producing high-quality, generalizable results across diverse representations like meshes, CAD models, and Gaussian splats to accelerate applications such as collision detection.

Yuezhi Yang, Qixing Huang, Mikaela Angelina Uy, Nicholas Sharp | Wed, 11 Ma | cs

From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

The paper introduces Stable Video Object Removal (SVOR), a robust framework that achieves state-of-the-art, flicker-free video object removal under real-world imperfections by employing a Mask Union strategy for stable erasure, a Denoising-Aware Segmentation head for precise localization, and a Curriculum Two-Stage training approach to handle shadows, abrupt motion, and defective masks.

Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan | Wed, 11 Ma | cs

Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists

This paper proposes novel training strategies and losses, including Gaussian scale resetting and an entropy constraint on alpha blending, to shorten the Gaussian lists used in 3D Gaussian splatting, thereby significantly accelerating the learning process without compromising rendering quality.

Jiaqi Liu, Zhizhong Han | Wed, 11 Ma | cs
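
The summary above does not spell out the paper's loss, so the following is only a minimal sketch of what an entropy term on alpha blending could look like. It assumes standard front-to-back compositing weights and a Shannon entropy over the normalized weights along one ray; the function names `blending_weights` and `blending_entropy` are hypothetical, not taken from the paper.

```python
import math


def blending_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def blending_entropy(alphas, eps=1e-12):
    """Shannon entropy of the normalized blending weights (assumed form of the constraint)."""
    weights = blending_weights(alphas)
    total = sum(weights)
    if total <= eps:
        return 0.0  # nothing contributes to this pixel
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > eps)


# One dominant Gaussian versus four equally weak ones along the same ray.
peaked = [0.95, 0.5, 0.5, 0.5]
spread = [0.2, 0.2, 0.2, 0.2]
print(blending_entropy(peaked), blending_entropy(spread))
```

Under this reading, penalizing the entropy pushes each pixel toward being explained by a few Gaussians, so the per-pixel list that must be traversed becomes shorter.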

ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

ForgeDreamer is a novel text-to-3D generation framework designed for industrial applications that overcomes domain adaptation and geometric reasoning limitations by integrating a Multi-Expert LoRA Ensemble for interference-free cross-category generalization and a Cross-View Hypergraph approach for capturing high-order structural dependencies to ensure manufacturing-level precision.

Junhao Cai, Deyu Zeng, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong | Wed, 11 Ma | cs

Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

This paper introduces a large-scale framework for Vision-and-Language Navigation that leverages web-based room tour videos and implicit geometry representations to overcome simulator limitations, enabling robust zero-shot navigation agents with state-of-the-art performance across multiple benchmarks.

Mingfei Han, Haihong Hao, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev | Wed, 11 Ma | cs

Multimodal Graph Representation Learning with Dynamic Information Pathways

This paper proposes DiP, a novel multimodal graph representation learning framework that utilizes modality-specific pseudo nodes and dynamic information pathways to achieve adaptive, sparse, and linear-complexity message propagation, consistently outperforming existing baselines in link prediction and node classification tasks.

Xiaobin Hong, Mingkai Lin, Xiaoli Wang, Chaoqun Wang, Wenzhong Li | Wed, 11 Ma | cs

Towards Instance Segmentation with Polygon Detection Transformers

This paper introduces Poly-DETR, a lightweight instance segmentation framework that reformulates the task as sparse vertex regression using polar representation and specialized attention mechanisms, achieving superior performance and reduced memory consumption compared to traditional mask-based methods, particularly in high-resolution and domain-specific scenarios.

Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li | Wed, 11 Ma | cs
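
To make the idea of a polar polygon representation concrete, here is a small sketch that turns a center and a list of radii into polygon vertices. It assumes one radius per ray at evenly spaced angles, which is a common convention but is not confirmed by the summary; `polar_to_polygon` and `polygon_area` are illustrative names, not Poly-DETR's API.

```python
import math


def polar_to_polygon(center, radii, start_angle=0.0):
    """Convert radii sampled at evenly spaced angles into (x, y) vertices."""
    cx, cy = center
    n = len(radii)
    vertices = []
    for k, r in enumerate(radii):
        theta = start_angle + 2.0 * math.pi * k / n
        vertices.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return vertices


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Four rays of length sqrt(2), starting at 45 degrees, describe a 2 x 2 square.
square = polar_to_polygon((0.0, 0.0), [math.sqrt(2.0)] * 4, start_angle=math.pi / 4)
print(polygon_area(square))
```

With this encoding an instance is described by a handful of numbers (a center plus one radius per ray) rather than a dense pixel mask, which is consistent with the reduced memory use the summary mentions.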

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

This paper introduces Geometric Semantic Decoupling (GSD), a parameter-free module that enhances the generalizability of AI-generated image detectors by explicitly removing dominant semantic priors from learned representations, thereby forcing models to rely on robust forensic artifacts rather than failing via "semantic fallback" when encountering unseen generation pipelines.

Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren | Wed, 11 Ma | cs

RAE-NWM: Navigation World Model in Dense Visual Representation Space

The paper proposes RAE-NWM, a navigation world model that operates in a dense DINOv2 feature space using a Conditional Diffusion Transformer with a decoupled head and time-driven gating to achieve superior structural stability and action accuracy compared to traditional latent-space approaches.

Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng | Wed, 11 Ma | cs

HelixTrack: Event-Based Tracking and RPM Estimation of Propeller-like Objects

HelixTrack is a fully event-driven method that jointly tracks propeller-like objects and estimates their RPM with microsecond latency by back-warping events to a rotor plane and refining pose through phase-coupled geometry, validated on a newly introduced TQE dataset where it outperforms existing baselines.

Radim Spetlik, Michal Pliska, Vojtech Vrba, Jiri Matas | Wed, 11 Ma | cs

UniField: A Unified Field-Aware MRI Enhancement Framework

The paper introduces UniField, a unified framework that leverages pre-trained 3D foundation models and a novel Field-Aware Spectral Rectification Mechanism to overcome data scarcity and spectral bias in MRI field-strength enhancement. It is accompanied by the release of a large-scale multi-field dataset, and the framework is reported to significantly outperform state-of-the-art methods.

Yiyang Lin, Chenhui Wang, Zhihao Peng, Yixuan Yuan | Wed, 11 Ma | cs

Distributed Convolutional Neural Networks for Object Recognition

This paper proposes a lightweight distributed convolutional neural network (DisCNN) trained with a novel loss function that maps positive samples to a compact high-dimensional space while pushing negative samples to the origin, thereby disentangling positive-class features to achieve robust object recognition and detection even in complex backgrounds and for unseen classes.

Liang Sun | Wed, 11 Ma | cs

TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

TubeMLLM is a unified foundation model that integrates topological priors via natural language prompting and a shared-attention architecture to achieve state-of-the-art, robust, and zero-shot generalizable topology-aware perception and generation for vessel-like anatomy across diverse medical imaging modalities.

Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu | Wed, 11 Ma | cs

Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints

This paper proposes a geometry-aware metric learning framework using rotation- and scale-invariant inter-joint angle descriptors derived from static hand keypoints to achieve robust cross-lingual few-shot sign language recognition, significantly outperforming conventional coordinate-based methods across diverse sign languages.

Chayanin Chamachot, Kanokphan Lertniponphan | Wed, 11 Ma | cs
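
As a rough illustration of why inter-joint angles are rotation- and scale-invariant, the sketch below computes the angle at each interior joint of a keypoint chain and checks that it survives a rotation, uniform scaling, and translation. The chain layout, the 2D toy coordinates, and the helper names `joint_angles` and `similarity_transform` are assumptions made for this example, not the paper's descriptor.

```python
import math


def joint_angles(points, chains):
    """Angle (radians) at every interior joint of every chain of keypoint indices.

    Assumes consecutive keypoints in a chain are distinct, so bone lengths are nonzero.
    """
    angles = []
    for chain in chains:
        for a, b, c in zip(chain, chain[1:], chain[2:]):
            u = [pa - pb for pa, pb in zip(points[a], points[b])]  # bone b -> a
            v = [pc - pb for pc, pb in zip(points[c], points[b])]  # bone b -> c
            norm_u = math.sqrt(sum(x * x for x in u))
            norm_v = math.sqrt(sum(x * x for x in v))
            cosine = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
            angles.append(math.acos(max(-1.0, min(1.0, cosine))))
    return angles


def similarity_transform(points, angle, scale, shift):
    """Rotate 2D points by `angle`, scale uniformly, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(scale * (c * x - s * y) + shift[0], scale * (s * x + c * y) + shift[1])
            for x, y in points]


# A toy four-keypoint "finger" in 2D; real hand keypoints would typically be 3D.
finger = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
chains = [[0, 1, 2, 3]]
original = joint_angles(finger, chains)
moved = joint_angles(similarity_transform(finger, 0.7, 3.0, (5.0, -2.0)), chains)
print(original, moved)
```

Raw coordinates change under every one of these transformations, whereas the angles do not, which is the property the summary credits for the improvement over coordinate-based methods.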
