cs.CV papers | Gist.Science

ReMoT: Reinforcement Learning with Motion Contrast Triplets

This paper introduces ReMoT, a unified training paradigm that combines a rule-based framework for generating a large-scale motion-contrast dataset with Group Relative Policy Optimization to significantly enhance VLMs' spatio-temporal consistency and reasoning capabilities, achieving state-of-the-art performance on both new and standard benchmarks.

Cong Wan, Zeyu Guo, Jiangyang Li + 5 more | 2026-03-03 | 💻 cs

OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation

This paper introduces OPGAgent, a multi-tool agentic system that improves the accuracy and auditability of dental panoramic X-ray interpretation by coordinating specialized perception modules through a hierarchical evidence-gathering process and a consensus mechanism. It also proposes the OPG-Bench benchmark for comprehensive evaluation beyond standard VQA metrics.

Zhaolin Yu, Litao Yang, Ben Babicka + 7 more | 2026-03-03 | 🤖 cs.AI

DreamWorld: Unified World Modeling in Video Generation

DreamWorld introduces a unified framework that integrates complementary world knowledge into video generation through a Joint World Modeling Paradigm, employing Consistent Constraint Annealing and Multi-Source Inner-Guidance to overcome visual instability and achieve superior temporal, spatial, and semantic consistency compared to existing models.

Boming Tan, Xiangdong Zhang, Ning Liao + 5 more | 2026-03-03 | 💻 cs

High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System

This paper presents a hardware-algorithm co-designed HDR imaging system that integrates an asymmetric event-SVE camera with a novel two-stage alignment framework and a cross-modal reconstruction network to achieve superior highlight recovery and edge fidelity in extreme illumination conditions.

Pengju Sun, Banglei Guan, Jing Tao + 4 more | 2026-03-03 | 💻 cs

Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols

This paper introduces FEWTRANS, a comprehensive benchmark and the Hyperparameter Ensemble (HPE) evaluation protocol to rigorously assess few-shot transfer learning, revealing that pre-trained model selection and full-parameter fine-tuning often outperform sophisticated adaptation methods due to their ability to make distributed micro-adjustments without overfitting.

Xu Luo, Ji Zhang, Lianli Gao + 2 more | 2026-03-03 | 🤖 cs.LG

U-VLM: Hierarchical Vision Language Modeling for Report Generation

The paper introduces U-VLM, a hierarchical vision-language model that combines progressive multi-stage training with multi-layer visual feature injection to achieve state-of-the-art radiology report generation on 3D medical imaging, demonstrating that specialized encoder pretraining can outperform massive language models.

Pengcheng Shi, Minghui Zhang, Kehan Song + 3 more | 2026-03-03 | 💻 cs

Analyzing Physical Adversarial Example Threats to Machine Learning in Election Systems

This paper presents a probabilistic framework to quantify the number of adversarial ballots required to flip a U.S. election and empirically demonstrates through 144,000 physical print-and-scan experiments that the most effective adversarial attacks in the physical voting domain differ significantly from those in the digital domain.

Khaleque Md Aashiq Kamal, Surya Eada, Aayushi Verma + 4 more | 2026-03-03 | 🤖 cs.LG

TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications

The paper proposes TaiChi, a novel Vision-Language Model framework that enhances multimodal token communications through a dual-visual tokenizer, a Bilateral Attention Network for compact token fusion, and a KAN-based projector for precise cross-modal alignment, ultimately demonstrating superior performance in a joint VLM-channel coding system.

Feibo Jiang, Siwei Tu, Li Dong + 5 more | 2026-03-03 | 🔢 math

RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

RAISE is a training-free, requirement-driven evolutionary framework that achieves state-of-the-art text-to-image alignment by dynamically adapting computational resources to prompt complexity through iterative refinement and verification, significantly reducing the need for excessive samples and external model calls compared to existing methods.

Liyao Jiang, Ruichen Chen, Chao Gao + 1 more | 2026-03-03 | 🤖 cs.AI

Random Wins All: Rethinking Grouping Strategies for Vision Tokens

This paper challenges the need for complex, carefully designed token grouping strategies in Vision Transformers by showing that simple random grouping matches or outperforms existing methods across various visual tasks and modalities. It also finds that four conditions are sufficient for effective token grouping: positional information, head feature diversity, a global receptive field, and avoiding fixed grouping patterns.

Qihang Fan, Yuang Ai, Huaibo Huang + 1 more | 2026-03-03 | 💻 cs

ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models

The paper proposes ArtiFixer, a two-stage pipeline that combines a bidirectional generative model with a causal auto-regressive diffusion model to efficiently generate hundreds of consistent novel views and enhance 3D reconstruction in under-observed areas, significantly outperforming existing state-of-the-art methods.

Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang + 7 more | 2026-03-03 | 🤖 cs.LG

COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

This paper introduces COG, an unsupervised framework for single-reference novel object pose estimation that formulates cross-view correspondence as a confidence-aware optimal transport problem to generate robust soft matches and achieve performance comparable to or exceeding supervised methods.

Yuchen Che, Jingtu Wu, Hao Zheng + 1 more | 2026-03-03 | 💻 cs

M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval

The paper proposes M$^2$, a training-free, dual-memory framework that enhances long-horizon web agents by combining dynamic trajectory summarization for internal state compression with offline insight retrieval for external guidance, achieving significant improvements in success rates and token efficiency across multiple benchmarks.

Dawei Yan, Haokui Zhang, Guangda Huzhang + 8 more | 2026-03-03 | 💻 cs

Hierarchical Classification for Improved Histopathology Image Analysis

This paper proposes HiClass, a hierarchical classification framework based on multiple instance learning that utilizes bidirectional feature integration and tailored loss functions to enhance both coarse-grained and fine-grained whole-slide image classification by effectively capturing hierarchical relationships among histopathological labels.

Keunho Byeon, Jinsol Song, Seong Min Hong + 2 more | 2026-03-03 | 💻 cs

What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

This paper introduces EmbedLens to reveal that multimodal large language models exhibit significant visual token sparsity and redundancy, demonstrating that only a subset of "alive" tokens carries essential semantic information, which can be efficiently processed via mid-layer injection rather than full internal computation.

Yingqi Fan, Junlong Tong, Anhao Zhao + 1 more | 2026-03-03 | 🤖 cs.AI

Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning

The paper proposes Multimodal Adaptive RAG (MMA-RAG), a framework that dynamically decides whether to incorporate retrieved external knowledge by analyzing the model's internal visual and textual representations, thereby effectively reducing hallucinations and improving performance in Visual Question Answering tasks.

Ruoshuang Du, Xin Sun, Qiang Liu + 4 more | 2026-03-03 | 🤖 cs.LG

MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

The paper introduces MLLM-4D, a framework that enhances multimodal large language models' 4D spatial-temporal reasoning from 2D RGB inputs by curating specialized datasets and employing a post-training strategy combining supervised fine-tuning with GRPO-based reinforcement learning.

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang + 2 more | 2026-03-03 | 💻 cs

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Vision-TTT introduces a novel, efficient visual backbone that adapts Test-Time Training with bidirectional scanning and Conv2d modules to achieve linear-time complexity and global receptive fields, significantly outperforming existing models in both accuracy and computational efficiency on ImageNet and downstream tasks.

Quan Kong, Yanru Xiao, Yuhao Shen + 1 more | 2026-03-03 | 💻 cs

Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness

Jano is a training-free framework that accelerates Diffusion Transformers by identifying heterogeneous convergence patterns in early denoising stages and applying adaptive token scheduling to achieve up to 2.4x speedup while preserving generation quality.

Yuyang Chen, Linqian Zeng, Yijin Zhou + 2 more | 2026-03-03 | 💻 cs

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

This paper introduces Mesh-Pro, an asynchronous online reinforcement learning framework featuring Advantage-guided Ranking Preference Optimization (ARPO) and novel mesh tokenization techniques, which significantly improves training efficiency and achieves state-of-the-art performance in artist-style quadrilateral mesh generation.

Zhen Zhou, Jian Liu, Biwen Lei + 10 more | 2026-03-03 | 💻 cs

