cs.CV papers | Gist.Science

What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

This paper introduces EmbedLens to reveal that multimodal large language models exhibit significant visual token sparsity and redundancy, demonstrating that only a subset of "alive" tokens carry essential semantic information which can be efficiently processed via mid-layer injection rather than full internal computation.

Yingqi Fan, Junlong Tong, Anhao Zhao + 1 more2026-03-03🤖 cs.AI

Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning

The paper proposes Multimodal Adaptive RAG (MMA-RAG), a framework that dynamically decides whether to incorporate retrieved external knowledge by analyzing the model's internal visual and textual representations, thereby effectively reducing hallucinations and improving performance in Visual Question Answering tasks.

Ruoshuang Du, Xin Sun, Qiang Liu + 4 more2026-03-03🤖 cs.LG

MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

The paper introduces MLLM-4D, a framework that enhances multimodal large language models' 4D spatial-temporal reasoning from 2D RGB inputs by curating specialized datasets and employing a post-training strategy combining supervised fine-tuning with GRPO-based reinforcement learning.

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang + 2 more2026-03-03💻 cs

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Vision-TTT introduces a novel, efficient visual backbone that adapts Test-Time Training with bidirectional scanning and Conv2d modules to achieve linear-time complexity and global receptive fields, significantly outperforming existing models in both accuracy and computational efficiency on ImageNet and downstream tasks.

Quan Kong, Yanru Xiao, Yuhao Shen + 1 more2026-03-03💻 cs

Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness

Jano is a training-free framework that accelerates Diffusion Transformers by identifying heterogeneous convergence patterns in early denoising stages and applying adaptive token scheduling to achieve up to 2.4x speedup while preserving generation quality.

Yuyang Chen, Linqian Zeng, Yijin ZHou + 2 more2026-03-03💻 cs

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

This paper introduces Mesh-Pro, an asynchronous online reinforcement learning framework featuring Advantage-guided Ranking Preference Optimization (ARPO) and novel mesh tokenization techniques, which significantly improves training efficiency and achieves state-of-the-art performance in artist-style quadrilateral mesh generation.

Zhen Zhou, Jian Liu, Biwen Lei + 10 more2026-03-03💻 cs

TP-Spikformer: Token Pruned Spiking Transformer

The paper proposes TP-Spikformer, a training-free token pruning framework for spiking transformers that utilizes a heuristic spatiotemporal criterion and block-level early stopping to significantly reduce computational and storage overhead while maintaining competitive performance across diverse architectures and tasks.

Wenjie Wei, Xiaolong Zhou, Malu Zhang + 8 more2026-03-03💻 cs

CaptionFool: Universal Image Captioning Model Attacks

The paper introduces CaptionFool, a universal adversarial attack capable of manipulating state-of-the-art image captioning models to generate arbitrary, potentially offensive target captions by altering only a tiny fraction of image patches, thereby exposing critical vulnerabilities in vision-language systems.

Swapnil Parekh2026-03-03🤖 cs.AI

RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation

This paper introduces Retrieval-Augmented Flow Matching (RAFM), a novel method that enhances unpaired CBCT-to-CT translation by leveraging a frozen DINOv3 encoder and a global memory bank to construct high-quality pseudo pairs, thereby stabilizing rectified flow training and outperforming existing approaches on the SynthRAD2023 benchmark.

Xianhao Zhou, Jianghao Wu, Lanfeng Zhong + 4 more2026-03-03💻 cs

Multiple Inputs and Mixwd data for Alzheimer's Disease Classification Based on 3D Vision Transformer

This paper proposes the MIMD-3DVT, a novel 3D Vision Transformer model that integrates multiple brain regions and mixed data sources (imaging, demographics, and cognitive assessments) to achieve a state-of-the-art 97.14% accuracy in classifying Alzheimer's Disease.

Juan A. Castro-Silva, Maria N. Moreno Garcia, Diego H. Peluffo-Ordoñez2026-03-03💻 cs

Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation

This paper introduces M-JudgeBench, a ten-dimensional capability-oriented benchmark for diagnosing weaknesses in Multimodal Large Language Model (MLLM) judges, and proposes Judge-MCTS, a data generation framework that trains the superior M-Judger models to address these identified limitations.

Zeyu Chen, Huanjin Yao, Ziwang Zhao + 1 more2026-03-03🤖 cs.AI

Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning

This paper proposes LAS-VAD, a novel weakly supervised video anomaly detection framework that integrates anomaly-connected components and intention reasoning with attribute guidance to effectively learn anomaly semantics and outperform state-of-the-art methods on benchmark datasets.

Yu Wang, Shengjie Zhao2026-03-03💻 cs

Geometry OR Tracker: Universal Geometric Operating Room Tracking

The paper introduces Geometry OR Tracker, a two-stage pipeline that rectifies unreliable camera calibration to establish a globally consistent metric frame, thereby enabling robust multi-view 3D tracking in operating rooms where traditional methods fail due to geometric inconsistencies.

Yihua Shao, Kang Chen, Feng Xue + 6 more2026-03-03🤖 cs.AI

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

This paper proposes MIDAS, a multimodal jailbreak framework that bypasses safety mechanisms in advanced MLLMs by decomposing harmful semantics into risk-bearing subunits dispersed across multiple images and leveraging cross-image reasoning to reconstruct malicious intent, achieving an average attack success rate of 81.46% against closed-source models.

Yilian Liu, Xiaojun Jia, Guoshun Nan + 6 more2026-03-03🤖 cs.AI

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

This paper proposes Decoupling Adaptation for Stability and Plasticity (DASP), a novel framework that addresses negative transfer and catastrophic forgetting in multi-modal test-time adaptation by leveraging interdimensional redundancy to identify biased modalities and applying an asymmetric strategy that updates plastic components for biased data while preserving stable components for unbiased data.

Yongbo He, Zirun Guo, Tao Jin2026-03-03🤖 cs.AI

MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation

This paper introduces MicroVerse, a specialized video generation model for simulating microscopic biological phenomena, supported by the MicroWorldBench evaluation framework and the expert-verified MicroSim-10K dataset, to address the limitations of current models in scientific fidelity and enable applications in drug discovery, education, and visualization.

Rongsheng Wang, Minghao Wu, Hongru Zhou + 4 more2026-03-03🤖 cs.AI

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

This paper introduces the LangGap benchmark to expose the critical language understanding deficits in state-of-the-art Vision-Language-Action models, demonstrating that while targeted data augmentation offers partial improvements, current models fundamentally struggle to generalize to linguistically diverse instructions.

Yuchen Hou, Lin Zhao2026-03-03💬 cs.CL

UNICBench: UNIfied Counting Benchmark for MLLM

This paper introduces UNICBench, a unified multimodal benchmark and toolkit comprising over 14,000 annotated QA pairs across images, documents, and audio, designed to rigorously evaluate and reveal significant reasoning gaps in the counting capabilities of 45 state-of-the-art multimodal large language models.

Chenggang Rong, Tao Han, Zhiyuan Zhao + 5 more2026-03-03💻 cs

Data-Centric Benchmark for Label Noise Estimation and Ranking in Remote Sensing Image Segmentation

This paper introduces a novel data-centric benchmark, a new public dataset, and two advanced techniques that leverage model uncertainty, prediction consistency, and representation analysis to effectively identify, quantify, and rank label noise in remote sensing image segmentation, outperforming existing baselines.

Keiller Nogueira, Codrut-Andrei Diaconu, Dávid Kerekes + 9 more2026-03-03💻 cs

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow is a mask-free, two-stage Flow Matching framework that resolves the stability-plasticity dilemma in multi-subject image generation by combining task-adaptive timestep scheduling, VLM-driven prompt synthesis, and group-level Direct Preference Optimization to achieve superior identity fidelity and aesthetic harmony in complex scenarios like age transformation.

Honghao Cai, Xiangyuan Wang, Yunhao Bai + 10 more2026-03-03🤖 cs.AI

← Previous Next →