cs.CV papers | Gist.Science

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Crab$^{+}$ is a scalable and unified audio-visual scene understanding model that overcomes the negative transfer issues of conventional multi-task methods by introducing the AV-UIE v2 dataset with explicit reasoning and an Interaction-aware LoRA mechanism to enable effective explicit cooperation across heterogeneous tasks.

Dongnuan Cai, Henghui Du, Chang Zhou + 5 more | 2026-03-05 | 🤖 cs.AI

Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

This paper proposes an inference-time attention regulation framework that utilizes anatomy-aware gating and pathology-guided latent corrections to achieve anatomically consistent and precisely localized counterfactual chest X-ray synthesis, effectively overcoming the structural drift and unstable pathology expression issues common in standard diffusion-based editing methods.

Zichun Zhang, Weizhi Nie, Honglin Guo + 1 more | 2026-03-05 | 💻 cs

HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans

This paper proposes HBRB-BoW, a refined hierarchical training algorithm that integrates global real-valued flows to preserve high-fidelity descriptor information before final binarization, thereby overcoming the precision loss of traditional binary clustering and significantly enhancing the discriminative power and performance of ORB-SLAM in loop closing and relocalization tasks.

Minjae Lee, Sang-Min Choi, Gun-Woo Kim + 1 more | 2026-03-05 | 💻 cs

LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

This paper proposes a LISTA-Transformer model that integrates Learnable Iterative Shrinkage Threshold Algorithm-based sparse coding with the Transformer architecture to overcome the limitations of CNNs and standard Transformers in local and global feature modeling, achieving a 98.5% fault recognition rate on the CWRU dataset through time-frequency signal analysis.

Shuang Liu, Lina Zhao, Tian Wang + 1 more | 2026-03-05 | 💻 cs

Degradation-based augmented training for robust individual animal re-identification

This paper introduces a degradation-based augmented training framework that artificially diversifies image degradations during training to significantly improve the robustness and accuracy of deep learning models for individual animal re-identification across various species and real-world conditions.

Thanos Polychronou, Lukáš Adam, Viktor Penchev + 1 more | 2026-03-05 | 💻 cs

PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

The paper introduces PlaneCycle, a training-free and adapter-free method that lifts pretrained 2D foundation models to 3D by cyclically distributing spatial aggregation across orthogonal planes, enabling strong 3D performance without architectural modifications or additional parameters.

Yinghong Yu, Guangyuan Li, Jiancheng Yang | 2026-03-05 | 🤖 cs.AI

Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

This paper establishes a principled theoretical framework for density aggregation by demonstrating that normalized generalized means with order $r \in [0,1]$ are the only rules guaranteeing systematic improvements in log-likelihood over individual distributions, thereby providing a unified justification for the widespread use of linear and geometric pooling in Deep Ensembles.

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso + 2 more | 2026-03-05 | 🤖 cs.LG
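
The pooling rule named in the summary above can be illustrated on a toy discrete example. The sketch below is not the authors' code: it assumes equal member weights and strictly positive probabilities, and the function name `generalized_mean_pool` and the three-member array are made up for illustration. It pools member distributions with a normalized generalized mean of order $r$ (with $r = 1$ giving linear pooling and $r = 0$ geometric pooling) and checks that, for $r \in [0,1]$, the pooled log-likelihood of every outcome is at least the average of the members' log-likelihoods.

```python
import numpy as np

def generalized_mean_pool(probs, r):
    """Pool distributions with a normalized generalized (power) mean.

    probs: array of shape (n_members, n_outcomes); each row is a strictly
           positive probability distribution (assumption of this sketch).
    r:     order of the mean; r = 1 is linear pooling, r = 0 is geometric.
    """
    probs = np.asarray(probs, dtype=float)
    if r == 0:
        # Limit of the power mean as r -> 0: the geometric mean.
        pooled = np.exp(np.log(probs).mean(axis=0))
    else:
        pooled = np.mean(probs ** r, axis=0) ** (1.0 / r)
    # Renormalize so the pooled values sum to one.
    return pooled / pooled.sum()

# Hypothetical three-member ensemble over three outcomes.
members = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2]])

# Average log-likelihood of the individual members, per outcome.
mean_log_lik = np.log(members).mean(axis=0)

for r in (0.0, 0.5, 1.0):
    pooled = generalized_mean_pool(members, r)
    gain = np.log(pooled) - mean_log_lik
    print(f"r={r}: pooled={pooled.round(3)}, min gain={gain.min():.4f}")
```

In this sketch the gain is non-negative for every outcome because the power mean of order $r \ge 0$ is pointwise at least the geometric mean, and for $r \le 1$ its normalizing constant is at most one.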

Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild

The paper introduces Real5-OmniDocBench, the first benchmark that physically reconstructs the entire OmniDocBench v1.5 dataset across five real-world scenarios to rigorously evaluate and diagnose the performance gap of Vision-Language Models in physical document parsing.

Changda Zhou, Ziyue Gao, Xueqing Wang + 4 more | 2026-03-05 | 💻 cs

Nearest-Neighbor Density Estimation for Dependency Suppression

This paper proposes a novel encoder-based approach that combines a specialized variational autoencoder with non-parametric nearest-neighbor density estimation to explicitly optimize for independence from sensitive variables, effectively removing unwanted dependencies while preserving essential data utility.

Kathleen Anderson, Thomas Martinetz | 2026-03-05 | 🤖 cs.LG

DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

This paper introduces DiverseDiT, a novel framework that enhances Diffusion Transformers by systematically analyzing and explicitly promoting representation diversity across blocks through long residual connections and a diversity loss, resulting in consistent performance gains and faster convergence across various model sizes and generation settings.

Mengping Yang, Zhiyu Tan, Binglei Li + 3 more | 2026-03-05 | 💻 cs

DeNuC: Decoupling Nuclei Detection and Classification in Histopathology

The paper proposes DeNuC, a method that decouples nuclei detection and classification by using a lightweight model for localization and a Pathology Foundation Model for feature-based classification, thereby overcoming representation degradation and computational inefficiency to achieve state-of-the-art performance with significantly fewer trainable parameters.

Zijiang Yang, Chen Kuang, Dongmei Fu | 2026-03-05 | 💻 cs

EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

EmbodiedSplat is an online, feed-forward 3D Gaussian Splatting framework that enables simultaneous, near real-time 3D reconstruction and open-vocabulary semantic understanding of streaming scenes by integrating a memory-efficient CLIP-based coefficient field with 3D geometric-aware feature aggregation.

Seungjun Lee, Zihan Wang, Yunsong Wang + 1 more | 2026-03-05 | 💻 cs

A Hypertoroidal Covering for Perfect Color Equivariance

This paper introduces a novel color equivariant architecture that eliminates approximation artifacts in handling saturation and luminance by lifting interval-valued quantities to a circular double-cover, thereby achieving superior robustness, interpretability, and performance in tasks like fine-grained classification and medical imaging.

Yulong Yang, Zhikun Xu, Yaojun Li + 1 more | 2026-03-05 | 💻 cs

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

ViterbiPlanNet introduces a principled framework that injects procedural knowledge into instructional video planning via a Differentiable Viterbi Layer, achieving state-of-the-art performance with significantly fewer parameters and improved sample efficiency compared to existing large-scale models.

Luigi Seminara, Davide Moltisanti, Antonino Furnari | 2026-03-05 | 💻 cs

SSR: A Generic Framework for Text-Aided Map Compression for Localization

This paper proposes SSR, a novel text-aided map compression framework that leverages lightweight text descriptions and complementary image feature vectors to achieve superior memory and bandwidth efficiency while maintaining high-fidelity localization performance across diverse indoor and outdoor environments.

Mohammad Omama, Po-han Li, Harsh Goel + 6 more | 2026-03-05 | 💻 cs

A multi-center analysis of deep learning methods for video polyp detection and segmentation

This multi-center study evaluates deep learning methods for real-time video polyp detection and segmentation, demonstrating that integrating sequence data and temporal information significantly enhances diagnostic precision by addressing the challenges of variable polyp appearance and reducing missed detection rates in colonoscopy.

Noha Ghatwary, Pedro Chavarias Solano, Mohamed Ramzy Ibrahim + 24 more | 2026-03-05 | 💻 cs

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

CubeComposer is a novel spatio-temporal autoregressive diffusion model that overcomes the computational limitations of existing methods to natively generate high-quality, seam-free 4K-resolution 360° videos from perspective inputs by decomposing them into cubemap representations and employing efficient context management and continuity-aware techniques.

Lingen Li, Guangzhi Wang, Xiaoyu Li + 5 more | 2026-03-05 | 🤖 cs.AI

Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation

The paper proposes MMFA, a novel unsupervised method that decouples identity from motion information through self-supervised representation learning and a new keypoint computation strategy, enabling controllable and interpolatable face animation with realistic results.

Hong Li, Boyu Liu, Xuhui Liu + 1 more | 2026-03-05 | 💻 cs

Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

The paper introduces PromptAvatar, a framework utilizing dual diffusion models trained on a novel large-scale multi-modal dataset to generate high-fidelity, shading-free 3D avatars from text or image prompts in under 10 seconds, overcoming the slow inference and limited generalization of existing methods.

Hong Li, Yutang Feng, Minqi Meng + 3 more | 2026-03-05 | 💻 cs

CRESTomics: Analyzing Carotid Plaques in the CREST-2 Trial with a New Additive Classification Model

This paper introduces CRESTomics, a novel kernel-based additive model with coherence loss and group-sparse regularization, to accurately and interpretably identify radiomics-based markers in carotid plaques from the CREST-2 trial that link B-mode ultrasound texture features to high clinical stroke risk.

Pranav Kulkarni, Brajesh K. Lal, Georges Jreij + 11 more | 2026-03-05 | 🤖 cs.AI
