cs.CV papers | Gist.Science

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

SODA introduces a sensitivity-oriented dynamic acceleration framework for Diffusion Transformers that adaptively optimizes caching and pruning strategies through fine-grained sensitivity modeling and dynamic programming, achieving state-of-the-art generation fidelity under controllable acceleration ratios.

Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su | Tue, 10 Ma | cs

MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

MedSteer is a training-free activation-steering framework that generates structurally preserved counterfactual endoscopic images by manipulating cross-attention activations in diffusion transformers, outperforming existing methods in concept editing and downstream medical detection tasks.

Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le | Tue, 10 Ma | cs

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

This paper introduces VirtueBench, a new benchmark designed to evaluate the trustworthiness of Vision-Language Models in long video understanding by distinguishing between answerable and unanswerable cases to prevent misleading accuracy scores caused by guessing under uncertainty.

Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang | Tue, 10 Ma | cs

Physics-Guided VLM Priors for All-Cloud Removal

This paper introduces PhyVLM-CR, a novel unified framework that integrates Vision-Language Model semantic priors with physical scattering parameters to seamlessly remove both thin and thick clouds from optical remote sensing imagery without explicit cloud-type segmentation, thereby achieving high-fidelity, hallucination-free surface reconstruction.

Liying Xu, Huifang Li, Huanfeng Shen | Tue, 10 Ma | cs

Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network

This paper proposes PSG-UIENet, a novel underwater image enhancement network that integrates Retinex-based illumination correction with CLIP-derived textual semantics to overcome the limitations of existing methods, supported by the introduction of a new large-scale image-text dataset (LUIQD-TD) and a specialized semantic similarity loss function.

Shixuan Xu, Yabo Liu, Junyu Dong, Xinghui Dong | Tue, 10 Ma | cs

Aligning What EEG Can See: Structural Representations for Brain-Vision Matching

This paper introduces a novel framework for EEG-based visual decoding that aligns brain signals with intermediate visual layers via a proposed "Neural Visibility" concept and a Hierarchically Complementary Fusion mechanism, achieving state-of-the-art performance by significantly reducing cross-modal information mismatch.

Jingyi Tang, Shuai Jiang, Fei Su, Zhicheng Zhao | Tue, 10 Ma | cs

mAVE: A Watermark for Joint Audio-Visual Generation Models

The paper introduces mAVE, a novel watermarking framework that cryptographically binds audio and video latents in joint generation models to eliminate the "Binding Vulnerability" of existing methods and robustly defend against adversarial Swap Attacks without requiring model fine-tuning.

Luyang Si, Leyi Pan, Lijie Wen | Tue, 10 Ma | cs

Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

This paper proposes a facial expression generation method for natural dyadic interaction that leverages human feedback within a vision-language-action framework and reinforcement learning strategy to produce contextually appropriate, identity-independent expressions aligned with human preferences.

Xu Chen, Rui Gao, Xinjie Zhang, Haoyu Zhang, Che Sun, Zhi Gao, Yuwei Wu, Yunde Jia | Tue, 10 Ma | cs

NuNext: Reframing Nucleus Detection as Next-Point Detection

NuNext reframes nucleus detection in histopathology as a next-point prediction task using a multimodal large language model trained with spatial-aware soft supervision and reinforcement fine-tuning to achieve superior performance across nine benchmarks.

Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, Cheng Tan | Tue, 10 Ma | cs

Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning

This paper introduces Semantic-Partitioned Contrastive Learning (S-PCL), a streamlined self-supervised pre-training framework for Chest X-rays that achieves superior accuracy and computational efficiency by enforcing agreement between randomly partitioned semantic subsets, thereby eliminating the need for heavy augmentations, auxiliary decoders, or momentum encoders.

Wangyu Feng, Shawn Young, Lijian Xu | Tue, 10 Ma | cs

TIQA: Human-Aligned Text Quality Assessment in Generated Images

This paper introduces TIQA, a human-aligned text quality assessment task and dataset for generated images, along with the ANTIQA method that significantly outperforms existing OCR and VLM-based metrics in predicting text rendering fidelity and improving downstream generation selection.

Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova | Tue, 10 Ma | cs

Inter-Image Pixel Shuffling for Multi-focus Image Fusion

This paper proposes Inter-image Pixel Shuffling (IPS), a novel multi-focus image fusion method that synthesizes training data by shuffling pixels between clear and low-pass filtered images to enable deep learning models to learn fusion without real multi-focus datasets, while utilizing a hybrid cross-image network combining CNNs and state space models to achieve superior fusion quality.

Huangxing Lin, Rongrong Ma, Cheng Wang | Tue, 10 Ma | cs

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

This paper introduces EyExIn, a data-efficient framework that enhances retinal Vision Language Models by employing a dual-stream encoding strategy and a deep expert injection mechanism to bridge perception and reasoning gaps, thereby achieving state-of-the-art precision in ophthalmic diagnosis while preventing hallucinations.

Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li | Tue, 10 Ma | cs

The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating

The paper introduces AutoSelect, a training-free token pruning method for vision-language models that reformulates token selection as capacity-constrained communication using a noise-gating mechanism to identify and retain only the most informative visual tokens, thereby significantly accelerating inference while preserving nearly all model accuracy.

Landi He, Xiaoyu Yang, Lijian Xu | Tue, 10 Ma | cs

PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection

The paper proposes PDD, a novel framework that unifies global contextual and local structural priors from dual frozen encoders into a shared manifold to distill diverse knowledge into complementary student networks, achieving state-of-the-art performance in medical image anomaly detection across multiple datasets.

Xijun Lu, Hongying Liu, Fanhua Shang, Yanming Hui, Liang Wan | Tue, 10 Ma | cs

CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose

The paper introduces CanoVerse, a massive dataset of 320K canonicalized 3D objects and a high-throughput framework that resolves directional ambiguity to significantly improve 3D generation stability, cross-modal retrieval, and zero-shot orientation estimation.

Li Jin, Yuchen Yang, Weikai Chen, Yujie Wang, Dehao Hao, Tanghui Jia, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Li Yuan, Long Quan, Xin Wang, Xueying Qin | Tue, 10 Ma | cs

LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

This paper introduces LiveWorld, a novel framework that addresses the "out-of-sight dynamics" limitation in generative video world models by maintaining a persistent global state where unobserved entities continue to evolve, thereby enabling truly continuous 4D world simulation and long-term scene consistency.

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu | Tue, 10 Ma | cs

PromptGate: Client-Adaptive Vision-Language Gating for Open-Set Federated Active Learning

PromptGate is a dynamic, federated vision-language framework that utilizes adaptive, learnable prompts to effectively filter out-of-distribution noise from unlabeled medical data pools, thereby significantly improving the efficiency and privacy of open-set active learning across resource-constrained institutions.

Adea Nesturi, David Dueñas Gaviria, Jiajun Zeng, Shadi Albarqouni | Tue, 10 Ma | cs

ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels

The paper proposes ACD-U, an asymmetric co-teaching framework that combines a CLIP-pretrained Vision Transformer with a CNN and incorporates machine unlearning to actively correct selection errors and achieve state-of-the-art robustness against noisy labels.

Reo Fukunaga, Soh Yoshida, Mitsuji Muneyasu | Tue, 10 Ma | cs

Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology

This paper introduces a framework to evaluate class visualizations and activation atlases for transformer-based pathology models, revealing that while these feature visualization methods effectively capture coarse tissue-level concepts, their ability to represent fine-grained cancer subclasses is limited by intrinsic pathological complexity and reduced inter-observer agreement.

Marco Gustav, Fabian Wolf, Christina Glasner, Nic G. Reitsam, Stefan Schulz, Kira Aschenbroich, Bruno Märkl, Sebastian Foersch, Jakob Nikolas Kather | Tue, 10 Ma | cs
