Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models

This paper addresses the lack of specialized dental datasets by proposing a framework that uses Vision-Language Models with guided prompts to generate high-quality, holistic captions for single-tooth RGB images, thereby enabling more comprehensive dental image analysis.

Anastasiia Sukhanova, Aiden Taylor, Julian Myers, Zichun Wang, Kartha Veerya Jammuladinne, Satya Sri Rajiteswari Nimmagadda, Aniruddha Maiti, Ananya Jana (Tue, 10 Ma, cs)
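The summary above describes guiding a VLM with structured prompts to elicit holistic captions. A minimal sketch of what such prompt construction could look like; the attribute list and wording here are hypothetical, not the paper's actual prompts:

```python
# Illustrative guided-prompt builder for a VLM captioner.
# ATTRIBUTES is a made-up example list, not taken from the paper.
ATTRIBUTES = ["tooth type", "surface condition", "color and shade",
              "visible wear or damage", "restorations if any"]

def build_guided_prompt(attributes=ATTRIBUTES):
    """Compose one instruction asking the VLM for a single holistic caption."""
    bullet_list = "\n".join(f"- {a}" for a in attributes)
    return ("Describe this single-tooth RGB image in one holistic caption, "
            "covering:\n" + bullet_list)

print(build_guided_prompt())
```

The point of a fixed attribute checklist is that every generated caption covers the same clinically relevant axes, which keeps the resulting dataset consistent.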

UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration

The paper introduces UnSCAR, a scalable and controllable universal image restoration framework that utilizes a multi-branch mixture-of-experts architecture to overcome the limitations of catastrophic forgetting and performance degradation in existing all-in-one models when handling multiple real-world degradations.

Debabrata Mandal, Soumitri Chattopadhyay, Yujie Wang, Marc Niethammer, Praneeth Chakravarthula (Tue, 10 Ma, cs)
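The core idea of a multi-branch mixture-of-experts restorer is that per-degradation expert branches are blended by a learned gate, so no single network has to absorb all degradations. A toy sketch of the gating math, assuming scalar "pixels" and made-up expert functions (nothing here is UnSCAR's actual architecture):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_restore(pixel, expert_fns, gate_logits):
    """Blend per-degradation expert outputs by softmax gate weights."""
    weights = softmax(gate_logits)
    outs = [f(pixel) for f in expert_fns]
    return sum(w * o for w, o in zip(weights, outs))

# Toy experts: denoise (scale down), deblur (identity), low-light (scale up).
experts = [lambda x: 0.9 * x, lambda x: x, lambda x: 1.2 * x]
print(moe_restore(1.0, experts, [2.0, 0.0, 0.0]))
```

Because each expert keeps its own parameters, adding a branch for a new degradation does not overwrite what the others learned, which is how MoE designs sidestep catastrophic forgetting.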

Generalization in Online Reinforcement Learning for Mobile Agents

This paper addresses the underexplored challenge of generalization in online reinforcement learning for mobile GUI agents by introducing the AndroidWorld-Generalization benchmark and a scalable GRPO-based training system, demonstrating that while RL significantly improves zero-shot performance on unseen task instances, generalization to new templates and applications remains difficult and benefits from test-time few-shot adaptation.

Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang (Tue, 10 Ma, cs.LG)
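GRPO, mentioned in the summary above, scores each rollout relative to its own group of rollouts rather than against a learned value function. A minimal sketch of the group-relative advantage computation (the standard GRPO normalization, not this paper's full training system):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Normalize each reward by the group's mean and (population) std,
    yielding group-relative advantages as in GRPO-style training."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: two succeed (reward 1), two fail (reward 0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Normalizing within the group means successful rollouts are pushed up exactly as hard as failed ones are pushed down, regardless of the task's absolute difficulty.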

DogWeave: High-Fidelity 3D Canine Reconstruction from a Single Image via Normal Fusion and Conditional Inpainting

DogWeave is a novel framework that reconstructs high-fidelity 3D canine models from a single RGB image by refining parametric meshes into detailed SDF representations via diffusion-enhanced normal optimization and generating view-consistent textures through conditional inpainting, thereby overcoming challenges like self-occlusion and fur detail to outperform existing state-of-the-art methods.

Shufan Sun, Chenchen Wang, Zongfu Yu (Tue, 10 Ma, cs)

Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models

Med-Evo is a novel self-evolution framework for medical multimodal large language models that leverages label-free reinforcement learning, featuring Feature-driven Pseudo Labeling and Hard-Soft Reward mechanisms, to significantly enhance model performance on unlabeled test data without requiring additional annotated medical datasets.

Dunyuan Xu, Xikai Yang, Juzheng Miao, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng (Tue, 10 Ma, cs)
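Feature-driven pseudo labeling, named in the summary above, generally means assigning labels to unlabeled samples from their position in feature space. A simplified nearest-centroid stand-in for that idea (the paper's actual mechanism and reward design are not reproduced here):

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def pseudo_label(feature, centroids):
    """Assign the class of the most similar centroid; the similarity
    score can then gate how strongly the pseudo label is trusted."""
    sims = [cosine(feature, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

centroids = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-class feature centroids
print(pseudo_label([0.9, 0.1], centroids))
```

Returning the similarity alongside the label is what makes a hard/soft split possible: confident assignments can be used as hard labels, ambiguous ones only as soft signals.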

SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition

The paper introduces SLNet, a super-lightweight 3D point cloud recognition network utilizing Nonparametric Adaptive Point Embedding (NAPE) and Geometric Modulation Units (GMU) to achieve state-of-the-art accuracy on benchmarks like ModelNet40 and ScanObjectNN with significantly fewer parameters and computational costs compared to existing models.

Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari, Mert D. Pesé (Tue, 10 Ma, cs.LG)
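A nonparametric point embedding, as NAPE is described above, maps raw coordinates to features without any learned weights. A sketch in the spirit of fixed sinusoidal encodings over each coordinate; this is an illustrative guess at the flavor of such an embedding, not SLNet's exact formulation:

```python
import math

def trig_embed(point, num_freqs=4):
    """Embed a 3D point with fixed sin/cos features at doubling
    frequencies per coordinate; no trainable parameters involved."""
    feats = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats

emb = trig_embed((0.1, -0.2, 0.3))
print(len(emb))  # 3 coords * 4 freqs * 2 functions = 24 features
```

Because the embedding has zero parameters, it contributes nothing to model size, which is exactly the property a super-lightweight network wants from its input stage.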

Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

This paper introduces MonoSTL, a selective transfer learning framework that addresses the negative transfer caused by modality gaps in cross-modality distillation for monocular 3D object detection by employing similar architectures and novel depth-aware selective distillation modules to effectively transfer LiDAR depth information to image-based networks, achieving state-of-the-art performance on KITTI and NuScenes benchmarks.

Rui Ding, Meng Yang, Nanning Zheng (Tue, 10 Ma, cs)
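Selective distillation, as described above, means the student only imitates the LiDAR teacher where the teacher's signal is trustworthy, rather than everywhere. A minimal sketch with a confidence-gated squared-error loss (the masking criterion and threshold here are hypothetical, not the paper's modules):

```python
def selective_distill_loss(student_depth, teacher_depth, conf, thresh=0.5):
    """Average squared student-teacher depth error, computed only at
    positions where teacher confidence clears the threshold."""
    total, count = 0.0, 0
    for s, t, c in zip(student_depth, teacher_depth, conf):
        if c >= thresh:  # select reliable teacher positions only
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0

loss = selective_distill_loss([1.0, 2.0, 3.0], [1.5, 2.0, 9.0],
                              conf=[0.9, 0.8, 0.1])
print(loss)
```

The third position, where the teacher is unreliable, is excluded; forcing the student to match it anyway is exactly the negative transfer the selective scheme avoids.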

Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing

This paper introduces the ThingiPrint dataset and a contrastive fine-tuning approach that enables the classification of novel 3D-printed objects using their CAD models without requiring model retraining, thereby addressing a critical bottleneck in automating industrial post-production workflows.

Fanis Mathioulakis, Gorjan Radevski, Silke GC Cleuren, Michel Janssens, Brecht Das, Koen Schauwaert, Tinne Tuytelaars (Tue, 10 Ma, cs)
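Classifying novel objects without retraining typically reduces to retrieval: embed the photographed part and the candidate CAD models in a shared space, then pick the closest match. A toy sketch of that inference step, assuming embeddings already exist (the class names and vectors below are invented):

```python
def classify_by_cad(query_emb, cad_embs):
    """Match a photo embedding against per-CAD-model embeddings by
    cosine similarity; new classes only require new CAD embeddings."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    scores = {name: cos(query_emb, e) for name, e in cad_embs.items()}
    return max(scores, key=scores.get)

cad = {"bracket": [1.0, 0.0], "gear": [0.0, 1.0]}  # hypothetical classes
print(classify_by_cad([0.8, 0.2], cad))
```

Supporting a new print job is then a matter of embedding its CAD model and adding one dictionary entry, with no gradient update anywhere.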

FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation

FedEU is a novel federated learning framework that enhances remote sensing image segmentation by integrating evidential uncertainty quantification and client-specific feature embeddings to guide adaptive global aggregation, thereby improving model robustness and reliability across heterogeneous distributed datasets.

Xiaokang Zhang, Xuran Xiong, Jianzhong Huang, Lefei Zhang (Tue, 10 Ma, cs)
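Evidential uncertainty quantification, referenced above, is commonly formulated with a Dirichlet distribution: per-class evidence maps to Dirichlet parameters, and total uncertainty is the number of classes over the total evidence mass. A sketch of that standard formulation (not FedEU-specific code):

```python
def evidential_uncertainty(evidence):
    """Standard evidential formulation: alpha_k = evidence_k + 1,
    probs = alpha / S, uncertainty u = K / S where S = sum(alpha)."""
    k = len(evidence)
    alphas = [e + 1.0 for e in evidence]
    s = sum(alphas)
    probs = [a / s for a in alphas]
    return probs, k / s

# Strong evidence for class 0 out of 3 classes -> low uncertainty.
probs, u = evidential_uncertainty([8.0, 0.0, 0.0])
print(round(u, 3))
```

In a federated setting, such a per-client uncertainty score is a natural weight for aggregation: clients whose predictions carry little evidence contribute less to the global model.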

RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations

This paper introduces RobustSCI, a pioneering framework that shifts snapshot compressive imaging from simple reconstruction to robust restoration by proposing a novel network architecture and a large-scale benchmark to effectively recover pristine scenes from real-world degraded measurements caused by motion blur and low light.

Hao Wang, Yuanfan Li, Qi Zhou, Zhankuo Xu, Jiong Ni, Xin Yuan (Tue, 10 Ma, cs)
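Snapshot compressive imaging, the setting above, compresses a video clip into one coded measurement: each frame is modulated by a binary mask and the modulated frames are summed. A flattened 1-D toy version of that forward model, to make the reconstruction-vs-restoration distinction concrete (real-world degradations corrupt this measurement before any solver sees it):

```python
def sci_measurement(frames, masks):
    """SCI forward model: a single coded snapshot y = sum_t mask_t * x_t,
    with frames and masks flattened to 1-D lists for simplicity."""
    n = len(frames[0])
    y = [0.0] * n
    for frame, mask in zip(frames, masks):
        for i in range(n):
            y[i] += mask[i] * frame[i]
    return y

frames = [[1.0, 2.0], [3.0, 4.0]]  # two toy 2-pixel frames
masks = [[1.0, 0.0], [0.0, 1.0]]   # complementary binary masks
print(sci_measurement(frames, masks))
```

Classic SCI methods invert this model assuming y is clean; RobustSCI's premise is that motion blur and low light corrupt y itself, so the inverse problem becomes restoration rather than pure reconstruction.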