GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection

GRAD-Former is a parameter-efficient framework for remote-sensing change detection that uses a gated robust attention mechanism with Adaptive Feature Relevance and Refinement to address the limitations of existing models on high-resolution imagery and limited training data, achieving state-of-the-art performance across multiple datasets.

Durgesh Ameta, Ujjwal Mishra, Praful Hambarde + 1 more · 2026-03-03 · cs.AI

AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

This paper presents AgilePruner, an adaptive visual token pruning framework for Large Vision-Language Models that leverages empirical insights into the complementary strengths of attention-based and diversity-based methods to reduce computational overhead while mitigating hallucinations across varying image complexities.

Changwoo Baek, Jouwon Song, Sohyeon Kim + 1 more · 2026-03-03 · cs.LG

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

The MAMA-MIA Challenge establishes a large-scale, multi-institutional benchmark using US training and European test data to evaluate and improve the generalizability and fairness of AI models for breast MRI tumor segmentation and treatment response prediction across diverse demographic subgroups.

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar + 43 more · 2026-03-03 · cs.AI

Certifiable Estimation with Factor Graphs

This paper presents a unified framework that combines modular factor graph modeling with certifiable convex relaxation techniques, showing that the key mathematical transformations preserve factor graph structure and thereby enabling existing high-performance robotics libraries to perform globally optimal estimation without specialized solver expertise.

Zhexin Xu, Nikolas R. Sanderson, Hanna Jiamei Zhang + 1 more · 2026-03-03 · cs

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

This paper presents a controlled study demonstrating that reinforcement learning primarily sharpens output distributions and improves sampling efficiency in medical Vision-Language Models only after supervised fine-tuning has established non-trivial reasoning support, leading to a boundary-aware training recipe that achieves strong performance across medical benchmarks.

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh + 4 more · 2026-03-03 · cs

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

This paper presents AG-VAS, a zero-shot visual anomaly segmentation framework that augments Large Multimodal Models with learnable semantic anchor tokens, a Semantic-Pixel Alignment Module, and a specialized instruction dataset to overcome limitations in abstract concept representation, achieving state-of-the-art localization performance across industrial and medical benchmarks.

Zhen Qu, Xian Tao, Xiaoyi Bao + 4 more · 2026-03-03 · cs.AI