cs.CV papers | Gist.Science

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

EvoPrune is an early-stage visual token pruning method that performs layer-wise pruning guided by token similarity, diversity, and attention importance during visual encoding, achieving a $2\times$ inference speedup with minimal performance degradation on high-resolution images and videos.

Yuhao Chen, Bin Shan, Xin Ye + 1 more | 2026-03-05 | cs.AI

Polyp Segmentation Using Wavelet-Based Cross-Band Integration for Enhanced Boundary Representation

This paper proposes a wavelet-based polyp segmentation model that integrates grayscale and RGB representations through complementary frequency-consistent interaction to overcome low-contrast challenges and achieve superior boundary precision, as validated by extensive experiments on four benchmark datasets.

Haesung Oh, Jaesung Lee | 2026-03-05 | cs

Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance

This paper proposes Embedded Runge-Kutta Guidance (ERK-Guid), a novel sampling method that leverages solver-induced local truncation errors as a guidance signal to detect stiffness and stabilize diffusion model generation, thereby outperforming state-of-the-art methods on benchmarks like ImageNet.

Inho Kong, Sojin Lee, Youngjoon Hong + 1 more | 2026-03-05 | cs.AI
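
The summary above refers to embedded Runge-Kutta pairs, where two solutions of different order share the same function evaluations and their difference estimates the local truncation error. The following is a minimal sketch of that general idea using the textbook Heun-Euler pair on a scalar ODE; it is not the ERK-Guid method itself, and the function name `heun_euler_step` and the test problem are illustrative assumptions.

```python
def heun_euler_step(f, t, y, h):
    """One step of the embedded Heun-Euler pair for dy/dt = f(t, y).

    Returns the second-order (Heun) update and an estimate of the local
    error, taken as the gap between the Heun and Euler solutions.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_low = y + h * k1                # Euler, first order
    y_high = y + 0.5 * h * (k1 + k2)  # Heun, second order, reuses k1
    return y_high, abs(y_high - y_low)


# Illustrative comparison: the same step size on a mild and a stiff decay.
_, err_mild = heun_euler_step(lambda t, y: -1.0 * y, 0.0, 1.0, 0.1)
_, err_stiff = heun_euler_step(lambda t, y: -50.0 * y, 0.0, 1.0, 0.1)
```

With the same step size, the error estimate for the stiff problem is far larger than for the mild one, which is the kind of signal a stiffness-aware sampler could react to.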

MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction

MPFlow is a zero-shot multi-modal MRI reconstruction framework that leverages a self-supervised pretraining strategy (PAMRI) to guide rectified flow sampling with auxiliary structural scans, thereby significantly reducing hallucinations and improving anatomical fidelity compared to single-modality baselines while requiring fewer sampling steps.

Seunghoi Kim, Chen Jin, Henry F. J. Tregidgo + 2 more | 2026-03-05 | cs.AI

Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models where the textual order of entities incorrectly dictates their spatial layout, and demonstrates that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang + 3 more | 2026-03-05 | cs.AI

Glass Segmentation with Fusion of Learned and General Visual Features

This paper introduces a novel dual-backbone architecture that fuses general visual features from a frozen DINOv3 model with task-specific features from a supervised Swin model to achieve state-of-the-art glass segmentation performance across multiple datasets while maintaining competitive inference speed.

Risto Ojala, Tristan Ellison, Mo Chen | 2026-03-05 | cs

QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment

To address the generalization challenges in No-Reference Point Cloud Quality Assessment caused by data scarcity, this paper proposes QD-PCQA, a novel unsupervised domain adaptation framework that transfers quality priors from images to point clouds through a Rank-weighted Conditional Alignment strategy and a Quality-guided Feature Augmentation module to enhance perceptual quality ranking and feature alignment.

Guohua Zhang, Jian Jin, Meiqin Liu + 2 more | 2026-03-05 | cs

PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation

The paper proposes PROSPECT, a unified streaming vision-language navigation agent that integrates CUT3R-based spatial encoding with SigLIP semantic features and employs latent predictive representation learning to achieve state-of-the-art performance and robustness in long-horizon navigation tasks.

Zehua Fan, Wenqi Lyu, Wenxuan Song + 12 more | 2026-03-05 | cs.AI

DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

DAGE introduces a dual-stream transformer architecture that efficiently estimates accurate, view-consistent geometry and camera poses from uncalibrated multi-view inputs by disentangling global coherence in a low-resolution stream from fine details in a high-resolution stream, achieving state-of-the-art performance while supporting high resolutions and long sequences.

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh + 4 more | 2026-03-05 | cs

WSI-INR: Implicit Neural Representations for Lesion Segmentation in Whole-Slide Images

This paper proposes WSI-INR, a novel patch-free framework utilizing Implicit Neural Representations and multi-resolution hash grid encoding to model whole-slide images as continuous functions, thereby overcoming the spatial fragmentation and resolution sensitivity of existing methods to achieve robust and accurate lesion segmentation across varying scales.

Yunheng Wu, Wenqi Huang, Liangyi Wang + 4 more | 2026-03-05 | cs

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

The paper introduces KFRA, a knowledge-augmented agent that emulates expert analysis through a three-stage closed reasoning loop to achieve superior open-set fine-grained visual understanding and interpretable, evidence-driven reasoning, validated by the newly constructed FGExpertBench.

Junhan Chen, Zilu Zhou, Yujun Tong + 3 more | 2026-03-05 | cs

LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

DriveMVS is a novel multi-view stereo framework for autonomous driving that leverages sparse LiDAR observations as geometric prompts and employs a spatio-temporal decoder to achieve state-of-the-art metric accuracy, temporal consistency, and cross-domain generalization.

Qihao Sun, Jiarun Liu, Ziqian Ni + 5 more | 2026-03-05 | cs

Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling

This paper proposes a novel framework for small object detection in complex backgrounds that integrates Residual Haar Wavelet Downsampling, Global Relation Modeling, Cross-Scale Hybrid Attention, and a Center-Assisted Loss to preserve fine-grained details, suppress noise, and enhance localization accuracy, achieving state-of-the-art performance on the RGBT-Tiny benchmark.

Wenguang Tao, Xiaotian Wang, Tian Yan + 2 more | 2026-03-05 | cs
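
Haar wavelet downsampling, mentioned in the summary above, halves the spatial resolution while keeping the discarded detail in separate sub-bands instead of throwing it away. The sketch below shows only the plain single-level 2D Haar step on a nested list; the residual connection and the rest of the paper's module are not reproduced, and the function name `haar_downsample` and the sub-band names are illustrative assumptions.

```python
def haar_downsample(img):
    """Single-level orthonormal 2D Haar transform of an even-sized image.

    Returns four half-resolution grids: the approximation band and the
    differences across columns, across rows, and along the diagonal.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("expects even height and width")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        bands = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = img[r][c], img[r][c + 1]
            p, q = img[r + 1][c], img[r + 1][c + 1]
            bands[0].append((a + b + p + q) / 2)  # local average (scaled)
            bands[1].append((a - b + p - q) / 2)  # change across columns
            bands[2].append((a + b - p - q) / 2)  # change across rows
            bands[3].append((a - b - p + q) / 2)  # diagonal change
        approx.append(bands[0])
        d_cols.append(bands[1])
        d_rows.append(bands[2])
        d_diag.append(bands[3])
    return approx, d_cols, d_rows, d_diag
```

Because the transform is orthonormal, the sum of squared values across the four bands equals that of the input block, so no information is lost by the downsampling itself.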

TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration

This paper introduces TAP, a training-free framework that accelerates diffusion model inference by adaptively selecting the most accurate predictor for each token at every step based on a low-cost probe, achieving significant speedups with minimal quality loss.

Haowei Zhu, Tingxuan Huang, Xing Wang + 7 more | 2026-03-05 | cs.LG

When and Where to Reset Matters for Long-Term Test-Time Adaptation

To address model collapse and knowledge loss in long-term test-time adaptation, this paper proposes an Adaptive and Selective Reset (ASR) framework that dynamically determines optimal reset timing and scope while employing an importance-aware regularizer to recover essential knowledge and an on-the-fly adjustment scheme to enhance adaptability.

Taejun Lim, Joong-Won Hwang, Kibok Lee | 2026-03-05 | cs.AI

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

The paper introduces STAR, a novel autoregressive pretraining method that utilizes image separators to quadruple the input sequence length of Vision Mamba, achieving a competitive 83.5% accuracy on ImageNet-1k by effectively leveraging long-range dependencies.

Hanpeng Liu, Zidan Wang, Shuoxi Zhang + 2 more | 2026-03-05 | cs.AI

Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10

This paper proposes a lightweight underwater object detection framework based on YOLOv10 that integrates a Multi-Stage Adaptive Enhancement module, a Dual-Pooling Sequential Attention mechanism, and a Focal Generalized IoU loss to significantly improve accuracy and robustness on benchmark datasets while maintaining a compact model size suitable for resource-constrained environments.

Md. Mushibur Rahman, Umme Fawzia Rahim, Enam Ahmed Taufik | 2026-03-05 | cs

Vector-Quantized Soft Label Compression for Dataset Distillation

This paper addresses the significant storage overhead of soft labels in dataset distillation by introducing a vector-quantized autoencoder (VQAE) that achieves 30–40x additional compression on benchmarks like ImageNet-1K while preserving over 90% of the original model performance.

Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash + 1 more | 2026-03-05 | cs
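
The storage saving from vector quantization comes from replacing each stored vector with the index of its nearest codebook entry. The sketch below illustrates that basic lookup on toy soft labels; it is plain nearest-neighbour quantization rather than the paper's VQAE, and the codebook, the vectors, and the function names `quantize` and `dequantize` are illustrative assumptions.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in vectors
    ]


def dequantize(indices, codebook):
    """Recover approximate vectors by looking the indices up again."""
    return [codebook[i] for i in indices]


# Toy example: two soft labels over three classes, two codebook entries.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
soft_labels = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]
indices = quantize(soft_labels, codebook)
restored = dequantize(indices, codebook)
```

Only the integer indices and the shared codebook need to be stored, so the cost per label no longer grows with the number of classes.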

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

This paper proposes Structure-aware Prompt Adaptation (SPA), a plug-and-play method that leverages the consistent local structures of semantically related concepts in the embedding space to effectively generalize from seen to unseen attributes and objects in Open-Vocabulary Compositional Zero-Shot Learning.

Yihang Duan, Jiong Wang, Pengpeng Zeng + 5 more | 2026-03-05 | cs

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang + 10 more | 2026-03-05 | cs.AI
