cs.CV papers | Gist.Science

NuNext: Reframing Nucleus Detection as Next-Point Detection

NuNext reframes nucleus detection in histopathology as a next-point prediction task using a multimodal large language model trained with spatial-aware soft supervision and reinforcement fine-tuning to achieve superior performance across nine benchmarks.

Zhongyi Shui, Honglin Li, Xiaozhong Ji, Ye Zhang, Zijiang Yang, Chenglu Zhu, Yuxuan Sun, Kai Yao, Conghui He, Cheng Tan | 2026-03-10 | cs

Efficient Chest X-ray Representation Learning via Semantic-Partitioned Contrastive Learning

This paper introduces Semantic-Partitioned Contrastive Learning (S-PCL), a streamlined self-supervised pre-training framework for Chest X-rays that achieves superior accuracy and computational efficiency by enforcing agreement between randomly partitioned semantic subsets, thereby eliminating the need for heavy augmentations, auxiliary decoders, or momentum encoders.

Wangyu Feng, Shawn Young, Lijian Xu | 2026-03-10 | cs

TIQA: Human-Aligned Text Quality Assessment in Generated Images

This paper introduces TIQA, a human-aligned text quality assessment task and dataset for generated images, along with the ANTIQA method, which significantly outperforms existing OCR- and VLM-based metrics in predicting text rendering fidelity and improves downstream generation selection.

Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova | 2026-03-10 | cs

Inter-Image Pixel Shuffling for Multi-focus Image Fusion

This paper proposes Inter-image Pixel Shuffling (IPS), a multi-focus image fusion method that synthesizes training data by shuffling pixels between clear and low-pass filtered images, so deep learning models can learn fusion without real multi-focus datasets. It pairs this with a hybrid cross-image network combining CNNs and state space models to achieve superior fusion quality.

Huangxing Lin, Rongrong Ma, Cheng Wang | 2026-03-10 | cs

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

This paper introduces EyExIn, a data-efficient framework that enhances retinal Vision Language Models by employing a dual-stream encoding strategy and a deep expert injection mechanism to bridge perception and reasoning gaps, thereby achieving state-of-the-art precision in ophthalmic diagnosis while preventing hallucinations.

Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li | 2026-03-10 | cs

The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating

The paper introduces AutoSelect, a training-free token pruning method for vision-language models that reformulates token selection as capacity-constrained communication using a noise-gating mechanism to identify and retain only the most informative visual tokens, thereby significantly accelerating inference while preserving nearly all model accuracy.

Landi He, Xiaoyu Yang, Lijian Xu | 2026-03-10 | cs

PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection

The paper proposes PDD, a novel framework that unifies global contextual and local structural priors from dual frozen encoders into a shared manifold to distill diverse knowledge into complementary student networks, achieving state-of-the-art performance in medical image anomaly detection across multiple datasets.

Xijun Lu, Hongying Liu, Fanhua Shang, Yanming Hui, Liang Wan | 2026-03-10 | cs

CanoVerse: 3D Object Scalable Canonicalization and Dataset for Generation and Pose

The paper introduces CanoVerse, a massive dataset of 320K canonicalized 3D objects and a high-throughput framework that resolves directional ambiguity to significantly improve 3D generation stability, cross-modal retrieval, and zero-shot orientation estimation.

Li Jin, Yuchen Yang, Weikai Chen, Yujie Wang, Dehao Hao, Tanghui Jia, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Li Yuan, Long Quan, Xin Wang, Xueying Qin | 2026-03-10 | cs

LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

This paper introduces LiveWorld, a novel framework that addresses the "out-of-sight dynamics" limitation in generative video world models by maintaining a persistent global state where unobserved entities continue to evolve, thereby enabling truly continuous 4D world simulation and long-term scene consistency.

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu | 2026-03-10 | cs

PromptGate: Client Adaptive Vision Language Gating for Open Set Federated Active Learning

PromptGate is a dynamic, federated vision-language framework that utilizes adaptive, learnable prompts to effectively filter out-of-distribution noise from unlabeled medical data pools, thereby significantly improving the efficiency and privacy of open-set active learning across resource-constrained institutions.

Adea Nesturi, David Dueñas Gaviria, Jiajun Zeng, Shadi Albarqouni | 2026-03-10 | cs

ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels

The paper proposes ACD-U, an asymmetric co-teaching framework that combines a CLIP-pretrained Vision Transformer with a CNN and incorporates machine unlearning to actively correct selection errors and achieve state-of-the-art robustness against noisy labels.

Reo Fukunaga, Soh Yoshida, Mitsuji Muneyasu | 2026-03-10 | cs

Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology

This paper introduces a framework to evaluate class visualizations and activation atlases for transformer-based pathology models, revealing that while these feature visualization methods effectively capture coarse tissue-level concepts, their ability to represent fine-grained cancer subclasses is limited by intrinsic pathological complexity and reduced inter-observer agreement.

Marco Gustav, Fabian Wolf, Christina Glasner, Nic G. Reitsam, Stefan Schulz, Kira Aschenbroich, Bruno Märkl, Sebastian Foersch, Jakob Nikolas Kather | 2026-03-10 | cs

FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation

The paper introduces FreeFly-Thinking, an end-to-end Vision-Language Navigation framework for UAVs that leverages a two-stage training strategy and explicit chain-of-thought reasoning to achieve robust and efficient navigation in complex outdoor urban environments.

Jiaxu Zhou, Shaobo Wang, Zhiyuan Yang, Zhenjun Yu, Tao Li | 2026-03-10 | cs

FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis

FastSTAR is a training-free acceleration framework for Spacetime Autoregressive (STAR) video generation that utilizes a novel Spatiotemporal Token Pruning strategy to identify and skip redundant computations, achieving up to a 2.01x speedup with minimal quality degradation.

Sungwoong Yune, Suheon Jeong, Joo-Young Kim | 2026-03-10 | cs

Shaping Parameter Contribution Patterns for Out-of-Distribution Detection

This paper proposes Shaping Parameter Contribution Patterns (SPCP), a training-time method that enhances out-of-distribution detection by encouraging classifiers to adopt dense, boundary-oriented parameter contribution patterns instead of relying on sparse, brittle ones that lead to overconfident predictions on anomalous inputs.

Haonan Xu, Yang Yang | 2026-03-10 | cs.LG

VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

VINO is a self-supervised learning framework that overcomes the "co-occurrence trap" in dense video by using a teacher-student distillation approach with structural priors to force representations to focus on foreground objects rather than background context, achieving state-of-the-art unsupervised object discovery performance.

Seul-Ki Yeom, Marcel Simon, Eunbin Lee, Tae-Ho Kim | 2026-03-10 | cs

LightMedSeg: Lightweight 3D Medical Image Segmentation with Learned Spatial Anchors

LightMedSeg is a lightweight, modular 3D medical image segmentation architecture that leverages anatomical priors, adaptive context modeling, and computational efficiency techniques to achieve high accuracy with minimal parameters and FLOPs, making it a deployable solution for resource-constrained clinical environments.

Kavyansh Tyagi, Vishwas Rathi, Puneet Goyal | 2026-03-10 | cs.LG

Single Image Super-Resolution via Bivariate À Trous Wavelet Diffusion

This paper introduces BATDiff, an unsupervised single-image super-resolution model that leverages bivariate à trous wavelet transforms and cross-scale parent-child dependencies to generate sharper, more structurally consistent high-frequency details while minimizing artifacts and dataset-driven hallucinations.

Heidari Maryam, Anantrasirichai Nantheera, Achim Alin | 2026-03-10 | cs

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

The paper proposes HY-WU, a memory-first adaptation framework that replaces static weight overwriting with a functional neural memory module to synthesize instance-specific weight updates on-the-fly, thereby enabling robust continual learning and personalization without degrading previously learned behaviors.

Tencent HY Team | 2026-03-10 | cs

FabricGen: Microstructure-Aware Woven Fabric Generation

FabricGen is an end-to-end framework that generates realistic woven fabric materials from text by decoupling macro-scale texture synthesis via fine-tuned diffusion models from micro-scale weaving pattern generation using a specialized LLM-driven procedural geometric model.

Yingjie Tang, Di Luo, Zixiong Wang, Xiaoli Ling, Jian Yang, Beibei Wang | 2026-03-10 | cs
