cs.CV papers | Gist.Science

Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

This study demonstrates that for pasture biomass regression on scarce agricultural data, prioritizing high-quality backbone pretraining and utilizing simple local fusion modules significantly outperforms complex global architectures like SSMs and cross-view attention transformers, a phenomenon termed "fusion complexity inversion."

Mridankan Mandal | 2026-03-10 | cs.LG

Transferable Optimization Network for Cross-Domain Image Reconstruction

This paper proposes a novel two-step transfer learning framework utilizing bi-level optimization to train a universal feature-extractor and a task-specific domain-adapter, enabling high-quality image reconstruction in data-scarce scenarios by effectively leveraging diverse cross-domain data.

Yunmei Chen, Chi Ding, Xiaojing Ye | 2026-03-10 | cs.LG

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

This paper introduces VRGaze, the first large-scale off-axis gaze estimation dataset for VR, and GazeShift, an unsupervised, attention-guided framework that achieves real-time, label-efficient gaze tracking with high accuracy on both the new dataset and standard benchmarks.

Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut | 2026-03-10 | cs

Training-free Temporal Object Tracking in Surgical Videos

This paper proposes a novel, training-free framework for temporal object tracking in laparoscopic cholecystectomy videos that leverages pre-trained text-to-image diffusion models and cross-frame affinity mechanisms to achieve high-accuracy localization and tracking without requiring costly pixel-level annotations.

Subhadeep Koley, Abdolrahim Kadkhodamohammadi, Santiago Barbarisi, Danail Stoyanov, Imanol Luengo | 2026-03-10 | cs

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

SoundWeaver is a training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting generation from semantically similar cached audio, achieving a 1.8–3.0× latency reduction while preserving perceptual quality.

Ayush Barik, Sofia Stoica, Nikhil Sarda, Arnav Kethana, Abhinav Khanduja, Muchen Xu, Fan Lai | 2026-03-10 | cs

Toward Unified Multimodal Representation Learning for Autonomous Driving

This paper proposes a Contrastive Tensor Pre-training (CTP) framework that replaces traditional pairwise similarity alignment with a joint tensor-based approach to unify multiple modalities in a single embedding space, thereby enhancing scene understanding and end-to-end performance in autonomous driving.

Ximeng Tao, Dimitar Filev, Gaurav Pandey | 2026-03-10 | cs.LG

VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

This paper introduces VLM-SubtleBench, a comprehensive benchmark spanning ten fine-grained difference types across diverse domains like industrial, medical, and aerial imagery, to evaluate and reveal the significant performance gaps between current vision-language models and humans in subtle comparative reasoning tasks.

Minkyu Kim, Sangheon Lee, Dongmin Park | 2026-03-10 | cs.LG

Structure and Progress Aware Diffusion for Medical Image Segmentation

This paper proposes Structure and Progress Aware Diffusion (SPAD), a novel framework for medical image segmentation that employs a progress-aware scheduler to guide a coarse-to-fine learning paradigm, utilizing semantic-concentrated and boundary-centralized diffusion modules to effectively balance stable anatomical structure understanding with the refinement of ambiguous target boundaries.

Siyuan Song, Guyue Hu, Chenglong Li, Dengdi Sun, Zhe Jin, Jin Tang | 2026-03-10 | cs

Visualizing Coalition Formation: From Hedonic Games to Image Segmentation

This paper proposes using image segmentation as a visual diagnostic framework for hedonic games, demonstrating how granularization parameters influence coalition equilibrium structures and their ability to recover foreground ground-truth on the Weizmann benchmark.

Pedro Henrique de Paula França, Lucas Lopes Felipe, Daniel Sadoc Menasché | 2026-03-10 | cs

MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models

The paper introduces MINT, a fine-tuning framework that enhances pathology foundation models by integrating spatial transcriptomics supervision to bridge the gap between tissue morphology and molecular states, achieving superior performance in both gene expression prediction and general pathology tasks.

Minsoo Lee, Jonghyun Kim, Juseung Yun, Sunwoo Yu, Jongseong Jang | 2026-03-10 | cs

Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning

This paper introduces E$^2$OAL, a unified and detector-free framework for open-set active learning that leverages labeled unknowns through label-guided clustering and a Dirichlet-calibrated auxiliary head to achieve superior accuracy, efficiency, and query precision compared to existing state-of-the-art methods.

Chen-Chen Zong, Yu-Qi Chi, Xie-Yang Wang, Yan Cui, Sheng-Jun Huang | 2026-03-10 | cs.LG

Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition

This paper proposes a Concept-Guided Bayesian Framework for zero-shot image recognition that enhances Vision-Language Models by treating class-specific concepts as latent variables, utilizing an LLM-driven synthesis pipeline with diversity enforcement and a training-free adaptive soft-trim likelihood to achieve superior performance over heuristic prompting methods.

Hui Liu, Kecheng Chen, Jialiang Wang, Xianming Liu, Wenya Wang, Haoliang Li | 2026-03-10 | cs

Geometric Transformation-Embedded Mamba for Learned Video Compression

This paper proposes a streamlined learned video compression framework that replaces traditional motion estimation with a direct transform strategy, utilizing a cascaded Mamba module with embedded geometric transformations and a locality refinement network to achieve superior perceptual quality and temporal consistency at low bitrates.

Hao Wei, Yanhui Zhou, Chenyang Ge | 2026-03-10 | cs

Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

This paper proposes an unmixing-based fusion framework that decouples spatial-spectral information and employs a coarse-to-fine deformable aggregation module to effectively mitigate registration errors and achieve state-of-the-art performance in unregistered hyperspectral image super-resolution.

Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu | 2026-03-10 | cs

RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving

This paper presents RLPR, a robust framework for radar-to-LiDAR place recognition that employs a dual-stream network and a two-stage asymmetric cross-modal alignment strategy to achieve state-of-the-art accuracy and zero-shot generalization across diverse radar types and adverse weather conditions.

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Guangming Xiong | 2026-03-10 | cs

IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation

The paper proposes IMSE, a test-time adaptation method that fine-tunes only the singular values of Vision Transformer linear layers via a spectral mixture of experts and a diversity maximization loss to prevent feature collapse, achieving state-of-the-art performance with significantly fewer trainable parameters.

Sunghyun Baek (Korea Advanced Institute of Science and Technology), Jaemyung Yu (Korea Advanced Institute of Science and Technology), Seunghee Koh (Korea Advanced Institute of Science and Technology), Minsu Kim (LG Energy Solution), Hyeonseong Jeon (LG Energy Solution), Junmo Kim (Korea Advanced Institute of Science and Technology) | 2026-03-10 | cs

A Hybrid Vision Transformer Approach for Mathematical Expression Recognition

This paper proposes a Hybrid Vision Transformer approach with 2D positional encoding and a coverage attention decoder to address the complexities of mathematical expression recognition, achieving a state-of-the-art BLEU score of 89.94 on the IM2LATEX-100K dataset.

Anh Duy Le, Van Linh Pham, Vinh Loi Ly, Nam Quan Nguyen, Huu Thang Nguyen, Tuan Anh Tran | 2026-03-10 | cs

Text to Automata Diagrams: Comparing TikZ Code Generation with Direct Image Synthesis

This study evaluates the effectiveness of vision-language and large language models in converting scanned student-drawn automata diagrams into TikZ code, finding that while direct image-to-text generation often yields errors, human-corrected descriptions significantly improve the accuracy of the resulting digital diagrams for educational applications like automated grading.

Ethan Young, Zichun Wang, Aiden Taylor, Chance Jewell, Julian Myers, Satya Sri Rajiteswari Nimmagadda, Anthony White, Aniruddha Maiti, Ananya Jana | 2026-03-10 | cs

$L^3$: Scene-agnostic Visual Localization in the Wild

The paper introduces $L^3$, a novel map-free visual localization framework that achieves high accuracy and robustness in sparse, wild scenes by leveraging online feed-forward 3D reconstruction and two-stage pose refinement, thereby eliminating the need for offline scene preprocessing and storage.

Yu Zhang, Muhua Zhu, Yifei Xue, Tie Ji, Yizhen Lao | 2026-03-10 | cs

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

VisualAD is a language-free, zero-shot anomaly detection framework that leverages a frozen Vision Transformer backbone with learnable normality and abnormality tokens, along with spatial-aware cross-attention and self-alignment modules, to achieve state-of-the-art performance across industrial and medical domains without relying on text encoders or cross-modal alignment.

Yanning Hou, Peiyuan Li, Zirui Liu, Yitong Wang, Yanran Ruan, Jianfeng Qiu, Ke Xu | 2026-03-10 | cs

