GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

This paper introduces GroundedSurg, the first multi-procedure benchmark for language-conditioned, instance-level surgical tool segmentation. By pairing surgical images with natural language descriptions and precise spatial annotations, it addresses the limitations of existing category-level evaluation paradigms in clinical AI.

Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak + 4 more · 2026-03-03 · cs

Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers

The paper proposes TCD-Net, a Vision Transformer-based image denoising framework built on teacher-guided causal interventions. Through environmental bias adjustment and orthogonal content-noise disentanglement, it eliminates spurious correlations and achieves state-of-the-art fidelity with real-time performance.

Kuai Jiang, Zhaoyan Ding, Guijuan Zhang + 2 more · 2026-03-03 · cs

TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning

This paper proposes TC-SSA, a learnable token compression framework that uses gated semantic slot aggregation to efficiently process gigapixel whole slide images. It reduces visual tokens to 1.7% of the original sequence while preserving diagnostically critical information, outperforming existing sampling-based methods on both reasoning and classification tasks.

Zhuo Chen, Shawn Young, Lijian Xu · 2026-03-03 · cs.AI

GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection

GRAD-Former is a parameter-efficient framework for remote sensing change detection built on a gated robust attention mechanism with Adaptive Feature Relevance and Refinement. It overcomes the limitations of existing models in handling high-resolution imagery and limited training data, achieving state-of-the-art performance across multiple datasets.

Durgesh Ameta, Ujjwal Mishra, Praful Hambarde + 1 more · 2026-03-03 · cs.AI

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

This paper presents AgilePruner, an adaptive visual token pruning framework for Large Vision-Language Models. Drawing on empirical insights into the complementary strengths of attention-based and diversity-based pruning, it reduces computational overhead while mitigating hallucinations across varying image complexities.

Changwoo Baek, Jouwon Song, Sohyeon Kim + 1 more · 2026-03-03 · cs.LG