cs.CV papers | Gist.Science

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

This paper proposes MIDAS, a multimodal jailbreak framework that bypasses safety mechanisms in advanced MLLMs by decomposing harmful semantics into risk-bearing subunits dispersed across multiple images and leveraging cross-image reasoning to reconstruct malicious intent, achieving an average attack success rate of 81.46% against closed-source models.

Yilian Liu, Xiaojun Jia, Guoshun Nan + 6 more | 2026-03-03 | 🤖 cs.AI

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

This paper proposes Decoupling Adaptation for Stability and Plasticity (DASP), a novel framework that addresses negative transfer and catastrophic forgetting in multi-modal test-time adaptation by leveraging interdimensional redundancy to identify biased modalities and applying an asymmetric strategy that updates plastic components for biased data while preserving stable components for unbiased data.

Yongbo He, Zirun Guo, Tao Jin | 2026-03-03 | 🤖 cs.AI

MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation

This paper introduces MicroVerse, a specialized video generation model for simulating microscopic biological phenomena, supported by the MicroWorldBench evaluation framework and the expert-verified MicroSim-10K dataset, to address the limitations of current models in scientific fidelity and enable applications in drug discovery, education, and visualization.

Rongsheng Wang, Minghao Wu, Hongru Zhou + 4 more | 2026-03-03 | 🤖 cs.AI

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

This paper introduces the LangGap benchmark to expose the critical language understanding deficits in state-of-the-art Vision-Language-Action models, demonstrating that while targeted data augmentation offers partial improvements, current models fundamentally struggle to generalize to linguistically diverse instructions.

Yuchen Hou, Lin Zhao | 2026-03-03 | 💬 cs.CL

UNICBench: UNIfied Counting Benchmark for MLLM

This paper introduces UNICBench, a unified multimodal benchmark and toolkit comprising over 14,000 annotated QA pairs across images, documents, and audio, designed to rigorously evaluate the counting capabilities of 45 state-of-the-art multimodal large language models and to reveal significant reasoning gaps in them.

Chenggang Rong, Tao Han, Zhiyuan Zhao + 5 more | 2026-03-03 | 💻 cs

Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation

This paper introduces a novel data-centric benchmark, a new public dataset, and two advanced techniques that leverage model uncertainty, prediction consistency, and representation analysis to effectively identify, quantify, and rank label noise in remote sensing image segmentation, outperforming existing baselines.

Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes + 9 more | 2026-03-03 | 💻 cs

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow is a mask-free, two-stage Flow Matching framework that resolves the stability-plasticity dilemma in multi-subject image generation by combining task-adaptive timestep scheduling, VLM-driven prompt synthesis, and group-level Direct Preference Optimization to achieve superior identity fidelity and aesthetic harmony in complex scenarios like age transformation.

Honghao Cai, Xiangyuan Wang, Yunhao Bai + 10 more | 2026-03-03 | 🤖 cs.AI

Linking Modality Isolation in Heterogeneous Collaborative Perception

To address the challenge of modality isolation in heterogeneous collaborative perception where agents lack co-occurring training data, the paper proposes CodeAlign, an efficient, co-occurrence-free framework that achieves state-of-the-art performance by aligning modalities through cross-modal feature-code-feature translation using codebooks.

Changxing Liu, Zichen Chao, Siheng Chen | 2026-03-03 | 💻 cs

Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark

This paper addresses the limitations of existing image-based spectral reconstruction methods by introducing the first high-quality dynamic hyperspectral dataset (DynaSpec), a novel Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT) model that leverages spatiotemporal feature propagation for superior video-level reconstruction, and a comprehensive benchmark for both simulation and real-world evaluation.

Lijing Cai, Zhan Shi, Chenglong Huang + 6 more | 2026-03-03 | 💻 cs

Exploring 3D Dataset Pruning

This paper addresses the challenges of 3D dataset pruning caused by long-tail class distributions by formulating the problem as expected risk approximation and proposing a method that combines representation-aware subset selection with per-class retention quotas and prior-invariant teacher supervision to simultaneously improve Overall Accuracy and Mean Accuracy while enabling flexible trade-off control.

Xiaohan Zhao, Xinyi Shang, Jiacheng Liu + 1 more | 2026-03-03 | 🤖 cs.LG

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

This paper introduces RC-GeoCP, a pioneering framework for radar-camera collaborative perception that establishes a radar-anchored geometric consensus through structure rectification, uncertainty-aware communication, and consensus-driven aggregation to achieve state-of-the-art performance with reduced communication overhead.

Xiaokai Bai, Lianqing Zheng, Runwei Guan + 2 more | 2026-03-03 | 💻 cs

Stateful Cross-layer Vision Modulation

This paper proposes SCVM, a cross-layer memory-modulated vision framework that dynamically regulates representation evolution through recursive memory states and layer-wise feedback modulation, enabling multimodal large language models to achieve improved performance on visual tasks without requiring additional encoders, token expansion, or language model fine-tuning.

Ying Liu, Yudong Han, Kean Shi + 1 more | 2026-03-03 | 💻 cs

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

This paper introduces HistoSelect, a question-guided, coarse-to-fine retrieval framework that mimics pathologists' scanning behavior to efficiently identify relevant tissue regions and informative patches in gigapixel whole slide images, thereby significantly reducing computational costs while improving accuracy and interpretability in pathology visual question answering.

Wentao Huang, Weimin Lyu, Peiliang Lou + 8 more | 2026-03-03 | 💻 cs

Direct low-field MRI super-resolution using undersampled k-space

This paper proposes a novel k-space dual channel U-Net framework that directly reconstructs high-quality, high-field-like MRI images from undersampled low-field k-space data, outperforming traditional spatial-domain methods and achieving quality comparable to full k-space acquisitions.

Daniel Tweneboah Anyimadu, Mohammed M. Abdelsamea, Ahmed Karam Eldaly | 2026-03-03 | 💻 cs

Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis

This paper introduces the Mixture of Low-Rank Experts (MoLRE) framework, a parameter-efficient fine-tuning method that significantly enhances the performance of diverse foundation models on comprehensive multi-label head CT diagnosis by employing specialized low-rank adapters and unsupervised soft routing without requiring explicit pathology supervision.

Youngjin Yoo, Han Liu, Bogdan Georgescu + 14 more | 2026-03-03 | 💻 cs

CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion

The paper proposes CoLC, a communication-efficient collaborative perception framework that leverages LiDAR completion techniques—specifically Foreground-Aware Point Sampling, Completion-Enhanced Early Fusion, and Dense-Guided Dual Alignment—to restore scene completeness from sparse transmissions and achieve superior perception-communication trade-offs while remaining robust to model heterogeneity.

Yushan Han, Hui Zhang, Qiming Xia + 2 more | 2026-03-03 | 💻 cs

SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion

SCOUT is a fast, self-supervised spectral CT reconstruction method that leverages spatial nonlocal similarity and projection domain conjugate properties to generate pseudo-3D data, enabling high-fidelity imaging with detail recovery and artifact mitigation under ultra-low data regimes without requiring external datasets or pre-training.

Guoquan Wei, Liu Shi, Shaoyu Wang + 3 more | 2026-03-03 | 💻 cs

STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification

This paper proposes STMI, a novel multi-modal object Re-Identification framework that integrates segmentation-guided feature modulation, semantic token reallocation, and cross-modal hypergraph interaction to enhance foreground representation, preserve discriminative cues, and capture high-order semantic relationships while mitigating background noise.

Xingguo Xu, Zhanyu Liu, Weixiang Zhou + 5 more | 2026-03-03 | 💻 cs

TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction

TokenSplat is a feed-forward framework that achieves joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images by introducing a token-aligned prediction module and an asymmetric dual-flow decoder to enable robust, iteration-free 3D scene modeling.

Yihui Li, Chengxin Lv, Zichen Tang + 2 more | 2026-03-03 | 💻 cs

Towards Universal Khmer Text Recognition

This paper proposes a Universal Khmer Text Recognition (UKTR) framework featuring a novel modality-aware adaptive feature selection (MAFS) technique to overcome data scarcity and modality-specific limitations, achieving state-of-the-art performance while introducing the first comprehensive benchmark for the task.

Marry Kong, Rina Buoy, Sovisal Chenda + 3 more | 2026-03-03 | 💻 cs
