cs.CV papers | Gist.Science

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

This paper introduces RAGTrack, a novel Retrieval-Augmented Generation framework that enhances RGB-Thermal tracking by integrating textual descriptions via Multi-modal Large Language Models and employing adaptive token fusion with context-aware reasoning to overcome appearance variations and modality gaps.

Hao Li, Yuhao Wang, Wenning Hao + 3 more | 2026-03-05 | cs

CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing

The paper introduces CoRe-BT, a multimodal benchmark comprising 310 patients with MRI, histopathology, and pathology reports, designed to evaluate robust brain tumor typing under realistic conditions of missing clinical data.

Juampablo E. Heras Rivera, Daniel K. Low, Xavier Xiong + 5 more | 2026-03-05 | cs

Extending Neural Operators: Robust Handling of Functions Beyond the Training Set

This paper presents a rigorous framework that extends neural operators to robustly handle out-of-distribution input functions by leveraging kernel approximations and Reproducing Kernel Hilbert Space theory to ensure accurate prediction of both function values and derivatives, validated through solutions of elliptic partial differential equations on manifolds.

Blaine Quackenbush, Paul J. Atzberger | 2026-03-05 | cs.LG

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

This paper introduces Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial instructions into natural images to hijack Multimodal Large Language Models, demonstrating a 64% success rate in manipulating model outputs while remaining visually imperceptible to humans.

Neha Nagaraja, Lan Zhang, Zhilong Wang + 2 more | 2026-03-05 | cs.AI

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

The paper presents InfinityStory, a novel framework, dataset, and model that overcome key limitations in long-form video generation by ensuring background and character consistency across shots while enabling seamless multi-subject transitions for hour-long narratives.

Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen + 27 more | 2026-03-05 | cs

One-Step Face Restoration via Shortcut-Enhanced Coupling Flow

The paper proposes SCFlowFR, a one-step face restoration method that leverages data-dependent coupling, conditional mean estimation, and a shortcut constraint to model low-to-high quality dependencies, thereby eliminating path crossovers and enabling high-quality, single-step inference.

Xiaohui Sun, Hanlin Wu | 2026-03-05 | cs

Field imaging framework for morphological characterization of aggregates with computer vision: Algorithms and applications

This dissertation presents a comprehensive field imaging framework that leverages advanced computer vision algorithms, including 2D instance segmentation and an integrated 3D reconstruction-segmentation-completion approach, to overcome the limitations of traditional methods and enable accurate morphological characterization of construction aggregates across diverse field scenarios.

Haohang Huang | 2026-03-05 | cs.AI

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

This paper introduces InEdit-Bench, the first benchmark designed to evaluate the ability of multimodal generative models to reason over intermediate logical pathways in complex image editing tasks, revealing significant shortcomings in current models' capacity for dynamic reasoning and causal understanding.

Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang + 6 more | 2026-03-05 | cs.AI

Machine Pareidolia: Protecting Facial Image with Emotional Editing

This paper introduces MAP, a novel facial privacy protection method that employs human emotion editing to disguise original identities as target identities, effectively overcoming the limitations of traditional countermeasures in black-box settings and across diverse demographics while maintaining high perceptual quality.

Binh M. Le, Simon S. Woo | 2026-03-05 | cs.LG

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

EvoPrune is an early-stage visual token pruning method that performs layer-wise pruning guided by token similarity, diversity, and attention importance during visual encoding, achieving a 2× inference speedup with minimal performance degradation on high-resolution images and videos.

Yuhao Chen, Bin Shan, Xin Ye + 1 more | 2026-03-05 | cs.AI

Polyp Segmentation Using Wavelet-Based Cross-Band Integration for Enhanced Boundary Representation

This paper proposes a wavelet-based polyp segmentation model that integrates grayscale and RGB representations through complementary frequency-consistent interaction to overcome low-contrast challenges and achieve superior boundary precision, as validated by extensive experiments on four benchmark datasets.

Haesung Oh, Jaesung Lee | 2026-03-05 | cs

Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance

This paper proposes Embedded Runge-Kutta Guidance (ERK-Guid), a novel sampling method that leverages solver-induced local truncation errors as a guidance signal to detect stiffness and stabilize diffusion model generation, thereby outperforming state-of-the-art methods on benchmarks like ImageNet.

Inho Kong, Sojin Lee, Youngjoon Hong + 1 more | 2026-03-05 | cs.AI

MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction

MPFlow is a zero-shot multi-modal MRI reconstruction framework that leverages a self-supervised pretraining strategy (PAMRI) to guide rectified flow sampling with auxiliary structural scans, thereby significantly reducing hallucinations and improving anatomical fidelity compared to single-modality baselines while requiring fewer sampling steps.

Seunghoi Kim, Chen Jin, Henry F. J. Tregidgo + 2 more | 2026-03-05 | cs.AI

Order Is Not Layout: Order-to-Space Bias in Image Generation

This paper identifies and quantifies "Order-to-Space Bias" (OTS), a systematic flaw in modern image generation models where the textual order of entities incorrectly dictates their spatial layout, and demonstrates that this data-driven issue can be effectively mitigated through targeted fine-tuning and early-stage interventions without compromising generation quality.

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang + 3 more | 2026-03-05 | cs.AI

Glass Segmentation with Fusion of Learned and General Visual Features

This paper introduces a novel dual-backbone architecture that fuses general visual features from a frozen DINOv3 model with task-specific features from a supervised Swin model to achieve state-of-the-art glass segmentation performance across multiple datasets while maintaining competitive inference speed.

Risto Ojala, Tristan Ellison, Mo Chen | 2026-03-05 | cs

QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment

To address the generalization challenges in No-Reference Point Cloud Quality Assessment caused by data scarcity, this paper proposes QD-PCQA, a novel unsupervised domain adaptation framework that transfers quality priors from images to point clouds through a Rank-weighted Conditional Alignment strategy and a Quality-guided Feature Augmentation module to enhance perceptual quality ranking and feature alignment.

Guohua Zhang, Jian Jin, Meiqin Liu + 2 more | 2026-03-05 | cs

PROSPECT: Unified Streaming Vision-Language Navigation via Semantic-Spatial Fusion and Latent Predictive Representation

The paper proposes PROSPECT, a unified streaming vision-language navigation agent that integrates CUT3R-based spatial encoding with SigLIP semantic features and employs latent predictive representation learning to achieve state-of-the-art performance and robustness in long-horizon navigation tasks.

Zehua Fan, Wenqi Lyu, Wenxuan Song + 12 more | 2026-03-05 | cs.AI

DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

DAGE introduces a dual-stream transformer architecture that efficiently estimates accurate, view-consistent geometry and camera poses from uncalibrated multi-view inputs by disentangling global coherence in a low-resolution stream from fine details in a high-resolution stream, achieving state-of-the-art performance while supporting high resolutions and long sequences.

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh + 4 more | 2026-03-05 | cs

WSI-INR: Implicit Neural Representations for Lesion Segmentation in Whole-Slide Images

This paper proposes WSI-INR, a novel patch-free framework utilizing Implicit Neural Representations and multi-resolution hash grid encoding to model whole-slide images as continuous functions, thereby overcoming the spatial fragmentation and resolution sensitivity of existing methods to achieve robust and accurate lesion segmentation across varying scales.

Yunheng Wu, Wenqi Huang, Liangyi Wang + 4 more | 2026-03-05 | cs

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

The paper introduces KFRA, a knowledge-augmented agent that emulates expert analysis through a three-stage closed reasoning loop to achieve superior open-set fine-grained visual understanding and interpretable, evidence-driven reasoning, validated by the newly constructed FGExpertBench.

Junhan Chen, Zilu Zhou, Yujun Tong + 3 more | 2026-03-05 | cs
