LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
This paper introduces LEMON, a large-scale endoscopic monocular dataset comprising 938 hours of high-resolution surgical footage, together with LemonFM, a foundation model pretrained on this data through self-supervised augmented knowledge distillation. LemonFM significantly outperforms existing models across multiple surgical perception tasks.
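To make the pretraining objective concrete, the sketch below illustrates one common form of self-supervised knowledge distillation (a DINO-style teacher-student setup with an exponential-moving-average teacher). This is a generic, minimal NumPy illustration, not the actual LemonFM architecture or training recipe; the tiny linear "encoder", the noise-based augmentation, the temperatures, and the momentum value are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, t):
    # temperature-scaled softmax, numerically stabilized
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical tiny encoder: a single linear projection (real models use deep networks)
D_IN, D_OUT = 16, 8
w_student = rng.normal(size=(D_IN, D_OUT)) * 0.1
w_teacher = w_student.copy()  # teacher starts as a copy of the student

def augment(x):
    # stand-in augmentation: additive noise (real pipelines use crops, jitter, ...)
    return x + rng.normal(scale=0.05, size=x.shape)

x = rng.normal(size=(4, D_IN))            # a mini-batch of 4 "frames"
view_s, view_t = augment(x), augment(x)   # two augmented views of the same data

p_teacher = softmax(view_t @ w_teacher, t=0.04)  # sharper teacher targets
p_student = softmax(view_s @ w_student, t=0.1)   # softer student predictions

# distillation loss: cross-entropy between teacher targets and student predictions
loss = -(p_teacher * np.log(p_student + 1e-9)).sum(axis=-1).mean()

# teacher follows the student via an exponential moving average (no gradient updates)
momentum = 0.996
w_teacher = momentum * w_teacher + (1 - momentum) * w_student

print(float(loss) > 0.0)
```

In a full training loop the loss would be backpropagated through the student only, while the teacher is updated solely by the moving average, which is what makes the objective self-supervised: no labels are needed beyond the two augmented views.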