cs.CV papers | Gist.Science

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

This paper introduces Mesh-Pro, an asynchronous online reinforcement learning framework featuring Advantage-guided Ranking Preference Optimization (ARPO) and novel mesh tokenization techniques, which significantly improves training efficiency and achieves state-of-the-art performance in artist-style quadrilateral mesh generation.

Zhen Zhou, Jian Liu, Biwen Lei + 10 more | 2026-03-03 | 💻 cs

TP-Spikformer: Token Pruned Spiking Transformer

The paper proposes TP-Spikformer, a training-free token pruning framework for spiking transformers that utilizes a heuristic spatiotemporal criterion and block-level early stopping to significantly reduce computational and storage overhead while maintaining competitive performance across diverse architectures and tasks.

Wenjie Wei, Xiaolong Zhou, Malu Zhang + 8 more | 2026-03-03 | 💻 cs

CaptionFool: Universal Image Captioning Model Attacks

The paper introduces CaptionFool, a universal adversarial attack capable of manipulating state-of-the-art image captioning models to generate arbitrary, potentially offensive target captions by altering only a tiny fraction of image patches, thereby exposing critical vulnerabilities in vision-language systems.

Swapnil Parekh | 2026-03-03 | 🤖 cs.AI

RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation

This paper introduces Retrieval-Augmented Flow Matching (RAFM), a novel method that enhances unpaired CBCT-to-CT translation by leveraging a frozen DINOv3 encoder and a global memory bank to construct high-quality pseudo pairs, thereby stabilizing rectified flow training and outperforming existing approaches on the SynthRAD2023 benchmark.

Xianhao Zhou, Jianghao Wu, Lanfeng Zhong + 4 more | 2026-03-03 | 💻 cs

Multiple Inputs and Mixed data for Alzheimer's Disease Classification Based on 3D Vision Transformer

This paper proposes MIMD-3DVT, a novel 3D Vision Transformer model that integrates multiple brain regions and mixed data sources (imaging, demographics, and cognitive assessments) to achieve a state-of-the-art 97.14% accuracy in classifying Alzheimer's Disease.

Juan A. Castro-Silva, Maria N. Moreno Garcia, Diego H. Peluffo-Ordoñez | 2026-03-03 | 💻 cs

Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation

This paper introduces M-JudgeBench, a ten-dimensional capability-oriented benchmark for diagnosing weaknesses in Multimodal Large Language Model (MLLM) judges, and proposes Judge-MCTS, a data generation framework used to train the M-Judger models, which address the identified limitations.

Zeyu Chen, Huanjin Yao, Ziwang Zhao + 1 more | 2026-03-03 | 🤖 cs.AI

Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning

This paper proposes LAS-VAD, a novel weakly supervised video anomaly detection framework that integrates anomaly-connected components and intention reasoning with attribute guidance to effectively learn anomaly semantics and outperform state-of-the-art methods on benchmark datasets.

Yu Wang, Shengjie Zhao | 2026-03-03 | 💻 cs

Geometry OR Tracker: Universal Geometric Operating Room Tracking

The paper introduces Geometry OR Tracker, a two-stage pipeline that rectifies unreliable camera calibration to establish a globally consistent metric frame, thereby enabling robust multi-view 3D tracking in operating rooms where traditional methods fail due to geometric inconsistencies.

Yihua Shao, Kang Chen, Feng Xue + 6 more | 2026-03-03 | 🤖 cs.AI

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

This paper proposes MIDAS, a multimodal jailbreak framework that bypasses safety mechanisms in advanced MLLMs by decomposing harmful semantics into risk-bearing subunits dispersed across multiple images and leveraging cross-image reasoning to reconstruct malicious intent, achieving an average attack success rate of 81.46% against closed-source models.

Yilian Liu, Xiaojun Jia, Guoshun Nan + 6 more | 2026-03-03 | 🤖 cs.AI

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

This paper proposes Decoupling Adaptation for Stability and Plasticity (DASP), a novel framework that addresses negative transfer and catastrophic forgetting in multi-modal test-time adaptation by leveraging interdimensional redundancy to identify biased modalities and applying an asymmetric strategy that updates plastic components for biased data while preserving stable components for unbiased data.

Yongbo He, Zirun Guo, Tao Jin | 2026-03-03 | 🤖 cs.AI

MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation

This paper introduces MicroVerse, a specialized video generation model for simulating microscopic biological phenomena, supported by the MicroWorldBench evaluation framework and the expert-verified MicroSim-10K dataset, to address the limitations of current models in scientific fidelity and enable applications in drug discovery, education, and visualization.

Rongsheng Wang, Minghao Wu, Hongru Zhou + 4 more | 2026-03-03 | 🤖 cs.AI

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

This paper introduces the LangGap benchmark to expose the critical language understanding deficits in state-of-the-art Vision-Language-Action models, demonstrating that while targeted data augmentation offers partial improvements, current models fundamentally struggle to generalize to linguistically diverse instructions.

Yuchen Hou, Lin Zhao | 2026-03-03 | 💬 cs.CL

UNICBench: UNIfied Counting Benchmark for MLLM

This paper introduces UNICBench, a unified multimodal benchmark and toolkit comprising over 14,000 annotated QA pairs across images, documents, and audio, designed to rigorously evaluate and reveal significant reasoning gaps in the counting capabilities of 45 state-of-the-art multimodal large language models.

Chenggang Rong, Tao Han, Zhiyuan Zhao + 5 more | 2026-03-03 | 💻 cs

Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation

This paper introduces a novel data-centric benchmark, a new public dataset, and two advanced techniques that leverage model uncertainty, prediction consistency, and representation analysis to effectively identify, quantify, and rank label noise in remote sensing image segmentation, outperforming existing baselines.

Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes + 9 more | 2026-03-03 | 💻 cs

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow is a mask-free, two-stage Flow Matching framework that resolves the stability-plasticity dilemma in multi-subject image generation by combining task-adaptive timestep scheduling, VLM-driven prompt synthesis, and group-level Direct Preference Optimization to achieve superior identity fidelity and aesthetic harmony in complex scenarios like age transformation.

Honghao Cai, Xiangyuan Wang, Yunhao Bai + 10 more | 2026-03-03 | 🤖 cs.AI

Linking Modality Isolation in Heterogeneous Collaborative Perception

To address the challenge of modality isolation in heterogeneous collaborative perception where agents lack co-occurring training data, the paper proposes CodeAlign, an efficient, co-occurrence-free framework that achieves state-of-the-art performance by aligning modalities through cross-modal feature-code-feature translation using codebooks.

Changxing Liu, Zichen Chao, Siheng Chen | 2026-03-03 | 💻 cs

Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark

This paper addresses the limitations of existing image-based spectral reconstruction methods by introducing the first high-quality dynamic hyperspectral dataset (DynaSpec), a novel Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT) model that leverages spatiotemporal feature propagation for superior video-level reconstruction, and a comprehensive benchmark for both simulation and real-world evaluation.

Lijing Cai, Zhan Shi, Chenglong Huang + 6 more | 2026-03-03 | 💻 cs

Exploring 3D Dataset Pruning

This paper addresses the challenges of 3D dataset pruning caused by long-tail class distributions by formulating the problem as expected risk approximation and proposing a method that combines representation-aware subset selection with per-class retention quotas and prior-invariant teacher supervision to simultaneously improve Overall Accuracy and Mean Accuracy while enabling flexible trade-off control.

Xiaohan Zhao, Xinyi Shang, Jiacheng Liu + 1 more | 2026-03-03 | 🤖 cs.LG

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

This paper introduces RC-GeoCP, a pioneering framework for radar-camera collaborative perception that establishes a radar-anchored geometric consensus through structure rectification, uncertainty-aware communication, and consensus-driven aggregation to achieve state-of-the-art performance with reduced communication overhead.

Xiaokai Bai, Lianqing Zheng, Runwei Guan + 2 more | 2026-03-03 | 💻 cs

Stateful Cross-layer Vision Modulation

This paper proposes SCVM, a cross-layer memory-modulated vision framework that dynamically regulates representation evolution through recursive memory states and layer-wise feedback modulation, enabling multimodal large language models to achieve improved performance on visual tasks without requiring additional encoders, token expansion, or language model fine-tuning.

Ying Liu, Yudong Han, Kean Shi + 1 more | 2026-03-03 | 💻 cs
