Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

This paper proposes a weakly supervised teacher-student framework with progressive pseudo-mask refinement that leverages sparse annotations and a teacher network stabilized by an Exponential Moving Average (EMA) to achieve accurate and generalizable gland segmentation in colorectal histopathology, addressing the scarcity of pixel-level labels.

Hikmat Khan, Wei Chen, Muhammad Khalid Khan Niazi | 2026-03-10 | 💻 cs

Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization

The paper introduces RAF (Retrieval-Augmented Faces), a training-time augmentation method that enhances the expression generalization and robustness of template-free animatable head avatars by dynamically replacing subject features with nearest-neighbor expressions from a large unlabeled bank, thereby improving fidelity in both self-driving and cross-driving scenarios without requiring additional data or architectural changes.

Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski | 2026-03-10 | 🤖 cs.LG

Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information

This paper proposes a pose-aware in-context visual learning (PA-ICVL) framework that enhances Vision-Language Models' ability to detect semantic structural visual hallucinations in non-photorealistic cartoon images by integrating pose information alongside RGB data, achieving significant performance improvements over RGB-only baselines.

Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun Seo | 2026-03-09 | 🤖 cs.AI

Fuse4Seg: Image Fusion for Multi-Modal Medical Segmentation via Bi-level Optimization

Fuse4Seg introduces a novel bi-level optimization framework for multi-modal medical image fusion that dynamically aligns feature extraction with downstream segmentation tasks through semantic gradients, thereby overcoming the limitations of traditional visual-centric methods to achieve superior tumor boundary preservation and clinical interpretability.

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su | 2026-03-09 | 💻 cs

FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition

FALCON is a unified self-supervised pretraining framework for UAV action recognition that overcomes spatial imbalance in aerial footage by combining object-aware masked autoencoding with object-centric dual-horizon future reconstruction, achieving superior accuracy and faster inference without requiring additional preprocessing at test time.

Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha | 2026-03-09 | 🤖 cs.AI

AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior

AuthFace is a novel blind face restoration framework that achieves highly authentic results by fine-tuning a text-to-image diffusion model on a curated 1.5K high-resolution professional photography dataset with photography-guided annotations, while employing a time-aware latent facial feature loss to minimize artifacts in critical facial areas.

Guoqiang Liang, Qingnan Fan, Bingtao Fu, Jinwei Chen, Hong Gu, Lin Wang | 2026-03-09 | 💻 cs

Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

This paper introduces LEO, a streamlined multimodal large language model architecture that employs a lightweight fusion strategy of post-adaptation projectors, tile-level sequence interleaving, and dynamic tiling to significantly enhance visual understanding across diverse benchmarks and specialized domains like autonomous driving.

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki | 2026-03-09 | 💬 cs.CL

Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

This survey provides a comprehensive overview of the emerging ecosystem of large language models and tools that support researchers across the scientific lifecycle, covering key tasks from literature search and idea generation to content creation, experimentation, and evaluation, while addressing associated datasets, methods, limitations, and ethical concerns.

Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller | 2026-03-09 | 🤖 cs.AI

Escaping The Big Data Paradigm in Self-Supervised Representation Learning

This paper introduces SCOTT, a sparse convolutional tokenizer combined with a MIM-JEPA training framework, which enables Vision Transformers to learn robust self-supervised representations from scratch on small-scale, fine-grained datasets, thereby challenging the necessity of big data and massive computational resources for effective vision representation learning.

Carlos Vélez García, Miguel Cazorla, Jorge Pomares | 2026-03-09 | 💻 cs

NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

The paper introduces NAMI, a Bridged Progressive Rectified Flow Transformer framework that reduces image-generation inference time by 64% through a multi-resolution, spatially cascaded architecture with a BridgeFlow module while maintaining state-of-the-art quality; it also introduces the NAMI-1K benchmark for evaluation.

Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin | 2026-03-09 | 💻 cs

ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement

ECLARE is an open-source, self-supervised super-resolution method that enhances anisotropic 2D MR volumes by estimating slice profiles and learning in-plane mappings without external data, thereby overcoming domain shift and outperforming existing techniques in both signal recovery and downstream tasks.

Samuel W. Remedios, Shuwen Wei, Shuo Han, Jinwei Zhang, Aaron Carass, Kurt G. Schilling, Dzung L. Pham, Jerry L. Prince, Blake E. Dewey | 2026-03-09 | 💻 cs

EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

The paper introduces EarthScape, a multimodal dataset and reproducible pipeline designed to automate surficial geologic mapping by integrating diverse geospatial data sources, demonstrating that terrain features provide the most robust predictive signal while highlighting the dataset's utility for benchmarking multimodal fusion and domain adaptation.

Matthew Massey, Nusrat Munia, Abdullah-Al-Zubaer Imran | 2026-03-09 | 💻 cs

Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision

This paper introduces a new framework of psychophysical tests based on low-level vision principles—specifically contrast sensitivity, masking, and matching—to evaluate and reveal the perceptual strengths and weaknesses of 34 existing image and video quality metrics, demonstrating that standard evaluation protocols often fail to capture these fundamental human visual properties.

Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk | 2026-03-09 | 💻 cs