TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
The paper proposes TaiChi, a Vision-Language Model (VLM) framework for multimodal token communications. It combines a dual-visual tokenizer, a Bilateral Attention Network that fuses tokens into a compact representation, and a Kolmogorov-Arnold Network (KAN)-based projector for cross-modal alignment, and it demonstrates superior performance in a joint VLM-channel coding system.
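The summary only names the three components; the toy NumPy sketch below illustrates one plausible tokenize-fuse-project pipeline. All shapes, weight names, and the RBF parameterization of the KAN edges are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hypothetical token dimension
W_sem = rng.normal(size=(d, d)) * 0.1    # illustrative tokenizer weights
W_det = rng.normal(size=(d, d)) * 0.1

def dual_tokenize(image_feats):
    # Hypothetical dual-visual tokenizer: one branch keeps semantic
    # tokens, the other keeps detail tokens (both linear here).
    return image_feats @ W_sem, image_feats @ W_det

def bilateral_attention(a, b):
    # Cross-attention from token set a to token set b, with a residual
    # connection, as a stand-in for the Bilateral Attention fusion.
    scores = a @ b.T / np.sqrt(a.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return a + attn @ b                  # fused, compact token set

def kan_project(x, coeffs, grid):
    # Minimal KAN-style projector: each output is a sum of learnable
    # univariate functions of each input, parameterized here by RBF
    # bases on a fixed grid.
    # x: (n, d_in), coeffs: (d_in, d_out, k), grid: (k,)
    basis = np.exp(-((x[..., None] - grid) ** 2))   # (n, d_in, k)
    return np.einsum('nik,iok->no', basis, coeffs)

feats = rng.normal(size=(8, d))          # 8 dummy visual feature vectors
sem, det = dual_tokenize(feats)
fused = bilateral_attention(sem, det)    # (8, 16) fused tokens
grid = np.linspace(-2.0, 2.0, 5)
coeffs = rng.normal(size=(d, 32, 5)) * 0.1
aligned = kan_project(fused, coeffs, grid)  # (8, 32) language-space tokens
```

The sketch shows only data flow and shapes; the real system would train these components end to end with the channel code.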