cs.CV papers | Gist.Science

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

This paper proposes a CLIP-guided multi-task regression framework that leverages level-aware vision-language embeddings to robustly predict plant age and leaf count from multi-view imagery, achieving significant accuracy improvements on the GroMo25 benchmark while simplifying the pipeline and handling incomplete inputs.

Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo + 2 more | 2026-03-05 | cs

Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

This paper introduces a training-free, capture-time frame curation method for always-on egocentric cameras that leverages gaze stability and pupil-derived novelty as complementary criteria to efficiently select high-quality, informative frames, achieving full-stream classification performance with only 10% of the data while respecting wearable device constraints.

Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish | 2026-03-05 | cs

Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

This paper introduces HPENets, an efficient suite of MLP-based point cloud networks that leverage a two-stage abstraction-refinement paradigm, high-dimensional positional encoding, and non-local MLPs to achieve superior performance with significantly reduced computational costs compared to state-of-the-art models.

Yanmei Zou, Hongshan Yu, Yaonan Wang + 4 more | 2026-03-05 | cs.AI
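
The summary does not spell out HPENets' positional encoding, so the sketch below only illustrates the general idea of lifting raw xyz coordinates into a higher-dimensional space. It uses a standard sinusoidal (Fourier-feature) encoding as an assumed stand-in; the function name `sinusoidal_encoding` and the power-of-two frequency schedule are illustrative choices, not the paper's.

```python
import math


def sinusoidal_encoding(point, n_freqs):
    """Map a 3D point to a 3 * 2 * n_freqs feature vector.

    Assumed stand-in for a high-dimensional positional encoding: each
    coordinate c is expanded into sin(2^k * pi * c) and cos(2^k * pi * c)
    for k = 0 .. n_freqs - 1. This is not HPENets' actual formulation.
    """
    features = []
    for c in point:
        for k in range(n_freqs):
            omega = (2.0 ** k) * math.pi  # frequency doubles at each level
            features.append(math.sin(omega * c))
            features.append(math.cos(omega * c))
    return features


# Two nearby points differ most in the higher-frequency channels.
enc_a = sinusoidal_encoding((0.10, 0.20, 0.30), n_freqs=4)
enc_b = sinusoidal_encoding((0.11, 0.20, 0.30), n_freqs=4)
print(len(enc_a))  # 24
```

Per-point features like these would then be passed to the MLP layers; the usual motivation for such encodings is that plain MLPs struggle to separate small coordinate differences from raw xyz alone.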

Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

This paper proposes a disentangled representation learning framework for brain MRI to demonstrate that demographic predictability primarily stems from anatomical variation rather than acquisition-dependent contrast, highlighting the need for targeted mitigation strategies that address these distinct sources to ensure robust bias reduction.

Mehmet Yigit Avci, Akshit Achara, Andrew King + 1 more | 2026-03-05 | cs.AI

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

This paper introduces Any2Any, a unified latent diffusion framework that enables efficient and generalizable arbitrary modality translation in remote sensing by projecting heterogeneous inputs into a shared geometrically aligned latent space, supported by the newly proposed million-scale RST-1M dataset.

Haoyang Chen, Jing Zhang, Hebaixu Wang + 7 more | 2026-03-05 | cs

TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

TextBoost addresses the challenge of preserving small-font scene text in ultra-low bitrate image compression. It transmits OCR-derived semantic guidance to the decoder at negligible bitrate cost, fuses this guidance with image features, and enforces it through a regularizing loss, significantly improving text recognition fidelity without compromising global image quality.

Bingxin Wang, Yuan Lan, Zhaoyi Sun + 2 more | 2026-03-05 | cs

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

This paper addresses the underexplored challenge of Few-Shot Open-Set Action Recognition in video data by proposing a Feature-Residual Discriminator (FR-Disc) that significantly improves unknown-action rejection without sacrificing closed-set accuracy, and establishes a benchmark across five datasets on which it achieves state-of-the-art results.

Stefano Berti, Giulia Pasquale, Lorenzo Natale | 2026-03-05 | cs

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Crab$^{+}$ is a scalable and unified audio-visual scene understanding model that overcomes the negative transfer issues of conventional multi-task methods by introducing the AV-UIE v2 dataset with explicit reasoning and an Interaction-aware LoRA mechanism, which together enable effective explicit cooperation across heterogeneous tasks.

Dongnuan Cai, Henghui Du, Chang Zhou + 5 more | 2026-03-05 | cs.AI

Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

This paper proposes an inference-time attention regulation framework that utilizes anatomy-aware gating and pathology-guided latent corrections to achieve anatomically consistent and precisely localized counterfactual chest X-ray synthesis, effectively overcoming the structural drift and unstable pathology expression issues common in standard diffusion-based editing methods.

Zichun Zhang, Weizhi Nie, Honglin Guo + 1 more | 2026-03-05 | cs

HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans

This paper proposes HBRB-BoW, a refined hierarchical training algorithm that integrates global real-valued flows to preserve high-fidelity descriptor information before final binarization, thereby overcoming the precision loss of traditional binary clustering and significantly enhancing the discriminative power and performance of ORB-SLAM in loop closing and relocalization tasks.

Minjae Lee, Sang-Min Choi, Gun-Woo Kim + 1 more | 2026-03-05 | cs

LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

This paper proposes a LISTA-Transformer model that integrates sparse coding based on the Learned Iterative Shrinkage-Thresholding Algorithm (LISTA) with the Transformer architecture to overcome the limitations of CNNs and standard Transformers in modeling local and global features, achieving a 98.5% fault recognition rate on the CWRU bearing dataset through time-frequency signal analysis.

Shuang Liu, Lina Zhao, Tian Wang + 1 more | 2026-03-05 | cs
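
For readers unfamiliar with LISTA, the sketch below shows the unrolled iteration it builds on, $z \leftarrow h_\theta(W_e x + S z)$, where $h_\theta$ is soft thresholding. The weights are initialized from their classical ISTA values ($W_e = D^\top / L$, $S = I - D^\top D / L$, $\theta = \lambda / L$) and are not trained here; in LISTA they would be learned by backpropagation. The dictionary `D`, step constant `L`, penalty `lam`, and helper names are illustrative assumptions, and how the paper couples this module with the Transformer is not shown.

```python
import math


def soft_threshold(v, theta):
    # Shrinkage operator h_theta: move each entry toward zero by theta.
    return [math.copysign(max(abs(a) - theta, 0.0), a) for a in v]


def matvec(m, v):
    # Plain matrix-vector product for lists of rows.
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def ista_weights(D, lam, L):
    """ISTA-equivalent LISTA parameters for dictionary D (n_features x n_atoms)."""
    n_feat, n_atoms = len(D), len(D[0])
    We = [[D[r][c] / L for r in range(n_feat)] for c in range(n_atoms)]
    S = [[(1.0 if i == j else 0.0) - sum(D[r][i] * D[r][j] for r in range(n_feat)) / L
          for j in range(n_atoms)] for i in range(n_atoms)]
    return We, S, lam / L


def lista_forward(x, We, S, theta, n_steps):
    # Unrolled iterations: z <- h_theta(We x + S z), starting from h_theta(We x).
    b = matvec(We, x)
    z = soft_threshold(b, theta)
    for _ in range(n_steps):
        z = soft_threshold([bi + si for bi, si in zip(b, matvec(S, z))], theta)
    return z


We, S, theta = ista_weights([[1.0, 0.0], [0.0, 1.0]], lam=1.0, L=1.0)
print(lista_forward([3.0, 0.5], We, S, theta, n_steps=5))  # [2.0, 0.0]
```

With an identity dictionary, the code is just the soft-thresholded input, which shows how the shrinkage step zeroes out small coefficients to produce a sparse code.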

Degradation-based augmented training for robust individual animal re-identification

This paper introduces a degradation-based augmented training framework that artificially diversifies image degradations during training to significantly improve the robustness and accuracy of deep learning models for individual animal re-identification across various species and real-world conditions.

Thanos Polychronou, Lukáš Adam, Viktor Penchev + 1 more | 2026-03-05 | cs
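
A minimal sketch of degradation-based augmentation, assuming images are grayscale arrays with values in [0, 1]. The three degradations (Gaussian noise, 3×3 box blur, contrast reduction) and the rule of applying up to two of them at random are illustrative choices; the summary does not give the paper's actual degradation set or sampling scheme.

```python
import random


def add_noise(img, rng, sigma=0.05):
    # Additive Gaussian noise, clipped back into [0, 1].
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def box_blur(img, rng=None):
    # 3x3 mean filter; border pixels average over the neighbours that exist.
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [img[y][x]
                    for y in range(max(0, i - 1), min(h, i + 2))
                    for x in range(max(0, j - 1), min(w, j + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def reduce_contrast(img, rng, low=0.4, high=0.8):
    # Shrink intensities toward mid-grey by a random factor.
    a = rng.uniform(low, high)
    return [[0.5 + a * (p - 0.5) for p in row] for row in img]


DEGRADATIONS = [add_noise, box_blur, reduce_contrast]


def degrade(img, rng, max_ops=2):
    # Apply a random subset (0..max_ops) of distinct degradations in random order.
    n_ops = rng.randint(0, max_ops)
    for op in rng.sample(DEGRADATIONS, n_ops):
        img = op(img, rng)
    return img
```

Calling `degrade` on the fly for each training image gives the re-identification model a different corruption mix every epoch, which is the general mechanism by which this kind of augmentation targets robustness to real-world image quality.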

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

The paper introduces PlaneCycle, a training-free and adapter-free method that lifts pretrained 2D foundation models to 3D by cyclically distributing spatial aggregation across orthogonal planes, enabling strong 3D performance without architectural modifications or additional parameters.

Yinghong Yu, Guangyuan Li, Jiancheng Yang | 2026-03-05 | cs.AI
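
The summary gives no implementation details, so the sketch below is only one way to read "cyclically distributing spatial aggregation across orthogonal planes." A fixed 2D operation (here a 3×3 mean, standing in for a pretrained 2D block) is applied slice-wise in a different orthogonal plane at each layer of a D × H × W volume. After three layers, information has propagated along all three axes without any 3D kernel or new parameters. `plane_mean`, `plane_cycle`, and the plane order are hypothetical.

```python
import itertools

# Axis pairs of a D x H x W volume: (H, W), then (D, W), then (D, H).
PLANES = [(1, 2), (0, 2), (0, 1)]


def plane_mean(vol, axes):
    """3x3 mean filter restricted to the plane spanned by `axes` (a 2D op applied slice-wise)."""
    shape = (len(vol), len(vol[0]), len(vol[0][0]))
    out = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for idx in itertools.product(*(range(n) for n in shape)):
        vals = []
        for da, db in itertools.product((-1, 0, 1), repeat=2):
            nb = list(idx)
            nb[axes[0]] += da
            nb[axes[1]] += db
            if all(0 <= nb[a] < shape[a] for a in range(3)):
                vals.append(vol[nb[0]][nb[1]][nb[2]])
        out[idx[0]][idx[1]][idx[2]] = sum(vals) / len(vals)
    return out


def plane_cycle(vol, n_layers):
    # Layer l aggregates in plane l mod 3, so consecutive layers cover all axes.
    for layer in range(n_layers):
        vol = plane_mean(vol, PLANES[layer % 3])
    return vol
```

A single impulse at the centre of a 3 × 3 × 3 volume stays inside its own depth slice after one layer but reaches every voxel after three, which illustrates why cycling planes can stand in for 3D aggregation.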

Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

This paper establishes a principled theoretical framework for density aggregation by demonstrating that normalized generalized means with order $r \in [0,1]$ are the only rules guaranteeing systematic improvements in log-likelihood over individual distributions, thereby providing a unified justification for the widespread use of linear and geometric pooling in Deep Ensembles.

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso + 2 more | 2026-03-05 | cs.LG
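
The aggregation rule in the summary can be sketched for discrete predictive distributions, such as class probabilities from ensemble members. The normalized generalized mean of order $r$ is $q_r(y) \propto \left(\sum_i w_i\, p_i(y)^r\right)^{1/r}$, which reduces to the linear mixture at $r = 1$ and to the normalized geometric mean in the limit $r \to 0$. The function name, the uniform default weights, and the example probabilities are assumptions for illustration; the paper's notation and proofs are not reproduced here.

```python
import math


def generalized_mean_pool(dists, r, weights=None):
    """Pool discrete distributions with a normalized power mean of order r.

    r = 1 gives linear pooling (the mixture); r = 0 is treated as the limit,
    i.e. geometric (log-linear) pooling. Probabilities must be positive.
    """
    if weights is None:
        weights = [1.0 / len(dists)] * len(dists)
    pooled = []
    for y in range(len(dists[0])):
        if r == 0:
            # Weighted geometric mean, computed in log space for stability.
            value = math.exp(sum(w * math.log(p[y]) for w, p in zip(weights, dists)))
        else:
            value = sum(w * p[y] ** r for w, p in zip(weights, dists)) ** (1.0 / r)
        pooled.append(value)
    total = sum(pooled)  # renormalize so the pooled scores sum to 1
    return [v / total for v in pooled]


members = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]]
for r in (0.0, 0.5, 1.0):
    print(r, [round(q, 3) for q in generalized_mean_pool(members, r)])
```

One property is easy to check numerically: for any $r \in [0,1]$, the pooled log-probability of every outcome is at least the weighted average of the members' log-probabilities. This holds because the power mean of order $r \ge 0$ dominates the geometric mean, while the normalizer never exceeds that of the mixture, which is 1.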

Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild

The paper introduces Real5-OmniDocBench, the first benchmark that physically reconstructs the entire OmniDocBench v1.5 dataset across five real-world scenarios to rigorously evaluate and diagnose the performance gap of Vision-Language Models in physical document parsing.

Changda Zhou, Ziyue Gao, Xueqing Wang + 4 more | 2026-03-05 | cs

Nearest-Neighbor Density Estimation for Dependency Suppression

This paper proposes a novel encoder-based approach that combines a specialized variational autoencoder with non-parametric nearest-neighbor density estimation to explicitly optimize for independence from sensitive variables, effectively removing unwanted dependencies while preserving essential data utility.

Kathleen Anderson, Thomas Martinetz | 2026-03-05 | cs.LG
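
The summary pairs a VAE with non-parametric nearest-neighbor density estimation, but it does not specify the estimator or how it enters the training objective. As background only, the sketch below implements the textbook $k$-nearest-neighbor density estimate $\hat p(x) = k / (n\, V_d\, R_k(x)^d)$. Here $R_k(x)$ is the distance to the $k$-th nearest sample and $V_d$ is the volume of the unit ball in $d$ dimensions. The function name and the evenly spaced test data are illustrative assumptions.

```python
import math


def knn_density(query, samples, k):
    """Textbook k-NN density estimate at `query` from a list of samples.

    If `query` is itself one of the samples, it counts as its own nearest
    neighbour (distance 0); drop it first for a leave-one-out estimate.
    """
    d = len(query)
    distances = sorted(math.dist(query, s) for s in samples)
    r_k = distances[k - 1]  # radius of the smallest ball holding k samples
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return k / (len(samples) * unit_ball_volume * r_k ** d)


# 1,000 evenly spaced points on [0, 1) approximate a uniform density of 1.
grid = [(i / 1000,) for i in range(1000)]
print(knn_density((0.5,), grid, k=10))
```

Because the estimate needs only pairwise distances, it can be computed on learned embeddings without assuming a parametric distribution.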

DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

This paper introduces DiverseDiT, a novel framework that enhances Diffusion Transformers by systematically analyzing and explicitly promoting representation diversity across blocks through long residual connections and a diversity loss, resulting in consistent performance gains and faster convergence across various model sizes and generation settings.

Mengping Yang, Zhiyu Tan, Binglei Li + 3 more | 2026-03-05 | cs

DeNuC: Decoupling Nuclei Detection and Classification in Histopathology

The paper proposes DeNuC, a method that decouples nuclei detection and classification by using a lightweight model for localization and a Pathology Foundation Model for feature-based classification, thereby overcoming representation degradation and computational inefficiency to achieve state-of-the-art performance with significantly fewer trainable parameters.

Zijiang Yang, Chen Kuang, Dongmei Fu | 2026-03-05 | cs

EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

EmbodiedSplat is an online, feed-forward 3D Gaussian Splatting framework that enables simultaneous, near real-time 3D reconstruction and open-vocabulary semantic understanding of streaming scenes by integrating a memory-efficient CLIP-based coefficient field with 3D geometric-aware feature aggregation.

Seungjun Lee, Zihan Wang, Yunsong Wang + 1 more | 2026-03-05 | cs

A Hypertoroidal Covering for Perfect Color Equivariance

This paper introduces a novel color-equivariant architecture that eliminates approximation artifacts in handling saturation and luminance by lifting interval-valued quantities to a circular double cover, thereby achieving superior robustness, interpretability, and performance in tasks such as fine-grained classification and medical imaging.

Yulong Yang, Zhikun Xu, Yaojun Li + 1 more | 2026-03-05 | cs

