AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

This paper presents AgilePruner, an adaptive visual token pruning framework for Large Vision-Language Models that leverages empirical insights into the complementary strengths of attention-based and diversity-based methods to reduce computational overhead while mitigating hallucinations across varying image complexities.

Changwoo Baek, Jouwon Song, Sohyeon Kim + 1 more · 2026-03-03 · cs.LG
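The summary above names the two signals AgilePruner reportedly combines: attention (how much the language model looks at each visual token) and diversity (how redundant the kept tokens are). The paper's actual algorithm is not given here; the sketch below is only a generic illustration of blending the two signals, with greedy farthest-point-style selection as the diversity term. The function name, the linear blend, and `alpha` are all assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, attn_scores, keep, alpha=0.5):
    """Illustrative sketch (not the paper's method): greedily keep `keep`
    visual tokens, scoring each candidate by a blend of its attention
    weight and its diversity (distance to the nearest already-kept token)."""
    kept = [int(np.argmax(attn_scores))]  # seed with the most-attended token
    while len(kept) < keep:
        # diversity term: distance from each candidate to its nearest kept token
        dists = np.linalg.norm(tokens[:, None, :] - tokens[kept][None, :, :], axis=-1)
        diversity = dists.min(axis=1)
        score = alpha * attn_scores + (1 - alpha) * diversity / (diversity.max() + 1e-8)
        score[kept] = -np.inf  # never re-select a kept token
        kept.append(int(np.argmax(score)))
    return sorted(kept)
```

With `alpha=1.0` this degenerates to pure attention-based top-k; with `alpha=0.0` to pure diversity sampling, which matches the "complementary strengths" framing in the abstract.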

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

The MAMA-MIA Challenge establishes a large-scale, multi-institutional benchmark, trained on US data and tested on European data, to evaluate and improve the generalizability and fairness of AI models for breast MRI tumor segmentation and treatment response prediction across diverse demographic subgroups.

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar + 43 more · 2026-03-03 · cs.AI

Certifiable Estimation with Factor Graphs

This paper presents a unified framework that synthesizes modular factor graph modeling with certifiable convex relaxation techniques by demonstrating that key mathematical transformations preserve factor graph structure, thereby enabling the use of existing, high-performance robotics libraries to implement globally optimal estimation without requiring specialized solver expertise.

Zhexin Xu, Nikolas R. Sanderson, Hanna Jiamei Zhang + 1 more · 2026-03-03 · cs

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

This paper presents a controlled study demonstrating that reinforcement learning primarily sharpens output distributions and improves sampling efficiency in medical Vision-Language Models only after supervised fine-tuning has established non-trivial reasoning support, leading to a boundary-aware training recipe that achieves strong performance across medical benchmarks.

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh + 4 more · 2026-03-03 · cs
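The claim that RL "primarily sharpens output distributions and improves sampling efficiency" can be illustrated with a toy logit-scaling example: concentrating probability mass on answers the model already supports raises the chance of sampling them without adding new knowledge. This is only an illustration of the distributional effect described in the abstract, not the paper's training procedure; `softmax`, `entropy`, and the scaling factor `beta` are assumed names.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy illustration: scaling logits by beta > 1 sharpens the distribution,
# boosting the probability of the already-preferred answer.
logits = np.array([2.0, 1.0, 0.5, 0.0])
base = softmax(logits)          # distribution after SFT (assumed)
sharp = softmax(4.0 * logits)   # sharpened distribution, beta = 4
```

The key point mirrored here is that sharpening only helps if the correct answer already carries non-trivial mass, matching the abstract's "non-trivial reasoning support" precondition.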

AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

This paper presents AG-VAS, a novel zero-shot visual anomaly segmentation framework that leverages Large Multimodal Models enhanced with learnable semantic anchor tokens, a Semantic-Pixel Alignment Module, and a specialized instruction dataset to overcome limitations in abstract concept representation and achieve state-of-the-art localization performance across industrial and medical benchmarks.

Zhen Qu, Xian Tao, Xiaoyi Bao + 4 more · 2026-03-03 · cs.AI

Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding

This paper presents a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding across multiple datasets, concluding that while foundation models offer flexibility, supervised training remains the most reliable approach for accurately detecting small objects and delineating boundaries in cluttered disaster scenes when annotations are available.

Anna Michailidou, Georgios Angelidis, Vasileios Argyriou + 2 more · 2026-03-03 · cs

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

MixerCSeg is an efficient, state-of-the-art architecture for crack segmentation that integrates CNN-like local texture analysis, Transformer-style global dependency modeling, and Mamba-inspired sequential context processing within a unified encoder, enhanced by specialized edge-aware and multi-scale fusion modules to achieve high performance with minimal computational cost.

Zilong Zhao, Zhengming Ding, Pei Niu + 2 more · 2026-03-03 · cs.AI

TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

The paper proposes TIMI, a training-free framework that achieves high spatial fidelity in image-to-3D multi-instance generation by leveraging pre-trained model priors through an Instance-aware Separation Guidance module for disentanglement and a Spatial-stabilized Geometry-adaptive Update module for geometric preservation, outperforming existing methods without additional training overhead.

Xiao Cai, Lianli Gao, Pengpeng Zeng + 3 more · 2026-03-03 · cs

Unifying Language-Action Understanding and Generation for Autonomous Driving

This paper introduces LinkVLA, a novel Vision-Language-Action model for autonomous driving that unifies language and action tokens in a shared codebook, incorporates an auxiliary action understanding objective for bidirectional semantic alignment, and employs a coarse-to-fine generation strategy to significantly improve instruction following, driving performance, and inference efficiency.

Xinyang Wang, Qian Liu, Wenjie Ding + 7 more · 2026-03-03 · cs

Revisiting Global Token Mixing in Task-Dependent MRI Restoration: Insights from Minimal Gated CNN Baselines

This paper demonstrates that the effectiveness of global token mixing in MRI restoration is highly task-dependent, showing that while local gated CNNs suffice for reconstruction and super-resolution tasks constrained by physics or preserved low-frequency data, global models are superior for denoising tasks involving spatially heteroscedastic noise.

Xiangjian Hou, Chao Qin, Chang Ni + 3 more · 2026-03-03 · eess
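The local-vs-global token-mixing distinction at the heart of the abstract above can be shown in miniature: a local mixer lets each token see only a small neighbourhood, while a global mixer injects sequence-wide context everywhere. Both functions below are deliberate simplifications (a moving average with a sigmoid content gate, and a mean-pooled residual), assumed for illustration; they are not the paper's gated-CNN baseline or its global model.

```python
import numpy as np

def gated_local_mix(x, k=3):
    """Local mixing sketch: a k-wide moving average (CNN-like receptive
    field) modulated by a sigmoid content gate on the input."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    local = np.stack([xp[i:i + x.shape[0]] for i in range(k)]).mean(axis=0)
    gate = 1.0 / (1.0 + np.exp(-x))  # sigmoid gate
    return local * gate

def global_mix(x):
    """Global mixing sketch: every token receives sequence-wide context,
    here via a mean-pooled residual."""
    return x + x.mean(axis=0, keepdims=True)
```

The abstract's finding maps onto this contrast: tasks whose structure is locally constrained are served by the first style of operator, while spatially heteroscedastic noise benefits from the second's whole-image statistics.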