cs.CV papers | Gist.Science

Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

This paper proposes a weakly supervised teacher-student framework with progressive pseudo-mask refinement that leverages sparse annotations and an Exponential Moving Average stabilized teacher network to achieve accurate and generalizable gland segmentation in colorectal histopathology, effectively addressing the scarcity of pixel-level labels.

Hikmat Khan, Wei Chen, Muhammad Khalid Khan Niazi | 2026-03-10 | 💻 cs
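
The summary above mentions an Exponential Moving Average (EMA) stabilized teacher. As a minimal sketch of the generic EMA update only (not the authors' implementation), the teacher weights can be moved a small step toward the student weights after each training step. The function name `ema_update`, the flat list of weights, and the decay values are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights after one EMA step toward the student.

    teacher, student: equal-length lists of floats (hypothetical flat weights).
    decay: fraction of the old teacher kept at each step (assumed value).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy example: the teacher drifts toward a fixed student over three steps.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)
```

A decay close to 1 makes the teacher change slowly, which is what makes its pseudo-masks more stable than the student's raw predictions.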

Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization

The paper introduces RAF (Retrieval-Augmented Faces), a training-time augmentation method that enhances the expression generalization and robustness of template-free animatable head avatars by dynamically replacing subject features with nearest-neighbor expressions from a large unlabeled bank, thereby improving fidelity in both self-driving and cross-driving scenarios without requiring additional data or architectural changes.

Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski | 2026-03-10 | 🤖 cs.LG

RBF Weighted Hyper-Involution for RGB-D Object Detection

This paper proposes a real-time two-stream RGB-D object detection model featuring a dynamic RBF-weighted depth-based hyper-involution and a trainable fusion layer to effectively overcome challenges in extracting and combining photometric and depth features, achieving state-of-the-art performance on the NYU Depth V2 benchmark.

Mehfuz A Rahman, Khushal Das, Jiju Poovvancheri, Neil London, Dong Chen | 2026-03-09 | 💻 cs

Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information

This paper proposes a pose-aware in-context visual learning (PA-ICVL) framework that enhances Vision-Language Models' ability to detect semantic structural visual hallucinations in non-photorealistic cartoon images by integrating pose information alongside RGB data, achieving significant performance improvements over RGB-only baselines.

Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun Seo | 2026-03-09 | 🤖 cs.AI

Fuse4Seg: Image Fusion for Multi-Modal Medical Segmentation via Bi-level Optimization

Fuse4Seg introduces a novel bi-level optimization framework for multi-modal medical image fusion that dynamically aligns feature extraction with downstream segmentation tasks through semantic gradients, thereby overcoming the limitations of traditional visual-centric methods to achieve superior tumor boundary preservation and clinical interpretability.

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su | 2026-03-09 | 💻 cs

PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization

The paper proposes PACE, a novel parameter-efficient fine-tuning method that enhances model generalization and preserves pre-trained knowledge by employing consistency regularization with multiplicative noise to implicitly reduce gradient norms and align fine-tuned models with their pre-trained counterparts.

Yao Ni, Shan Zhang, Piotr Koniusz | 2026-03-09 | 🤖 cs.LG

FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition

FALCON is a unified self-supervised pretraining framework for UAV action recognition that overcomes spatial imbalance in aerial footage by combining object-aware masked autoencoding with object-centric dual-horizon future reconstruction, achieving superior accuracy and faster inference without requiring additional preprocessing at test time.

Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha | 2026-03-09 | 🤖 cs.AI

AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior

AuthFace is a novel blind face restoration framework that achieves highly authentic results by fine-tuning a text-to-image diffusion model on a curated 1.5K high-resolution professional photography dataset with photography-guided annotations, while employing a time-aware latent facial feature loss to minimize artifacts in critical facial areas.

Guoqiang Liang, Qingnan Fan, Bingtao Fu, Jinwei Chen, Hong Gu, Lin Wang | 2026-03-09 | 💻 cs

An Efficient Self-supervised Seismic Data Reconstruction Method Based on Self-Consistency Learning

This paper proposes a self-supervised, lightweight deep learning method that leverages self-consistency learning and inter-component correlations to achieve high-quality reconstruction of irregularly acquired seismic data without requiring external training datasets.

Mingwei Wang, Junheng Peng, Yingtian Liu, Yong Li | 2026-03-09 | 🤖 cs.LG

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

PPLLaVA addresses the computational inefficiency of long-video understanding by introducing a prompt-guided pooling strategy that aggressively compresses visual tokens while preserving instruction-relevant semantics, achieving state-of-the-art performance with up to 18x token reduction.

Shangkun Sun, Ruyang Liu, Haoran Tang, Yixiao Ge, Haibo Lu, Jiankun Yang, Chen Li | 2026-03-09 | 💻 cs

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

The paper proposes Ditto, a real-time diffusion-based framework for controllable talking head synthesis that achieves fine-grained motion control and low-latency streaming inference by optimizing a motion-space diffusion transformer to resolve issues of motion-identity disentanglement and internal representation discrepancies.

Tianqi Li, Ruobing Zheng, Minghui Yang + 2 more | 2026-03-09 | ⚡ eess

Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

This paper introduces LEO, a streamlined multimodal large language model architecture that employs a lightweight fusion strategy of post-adaptation projectors, tile-level sequence interleaving, and dynamic tiling to significantly enhance visual understanding across diverse benchmarks and specialized domains like autonomous driving.

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki | 2026-03-09 | 💬 cs.CL

FeatureGS: Eigenvalue-Feature Optimization in 3D Gaussian Splatting for Geometrically Accurate and Artifact-Reduced Reconstruction

FeatureGS enhances 3D Gaussian Splatting by introducing an eigenvalue-based geometric loss term that significantly improves geometric accuracy, reduces floater artifacts and storage requirements by 90%, and enables direct mesh reconstruction while maintaining high photometric quality.

Miriam Jäger, Markus Hillemann, Boris Jutzi | 2026-03-09 | 💻 cs

PoI: A Filter to Extract Pixel of Interest from Novel Views for Scene Coordinate Regression

This paper introduces PoI, a framework that enhances Scene Coordinate Regression for visual localization by combining 3D Gaussian Splatting with diffusion-based refinement and a progressive pixel-level filtering strategy to generate and selectively utilize reliable novel views for robust training.

Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang | 2026-03-09 | 💻 cs

Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

This survey provides a comprehensive overview of the emerging ecosystem of large language models and tools that support researchers across the scientific lifecycle, covering key tasks from literature search and idea generation to content creation, experimentation, and evaluation, while addressing associated datasets, methods, limitations, and ethical concerns.

Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller | 2026-03-09 | 🤖 cs.AI

Escaping The Big Data Paradigm in Self-Supervised Representation Learning

This paper introduces SCOTT, a sparse convolutional tokenizer combined with a MIM-JEPA training framework, which enables Vision Transformers to learn robust self-supervised representations from scratch on small-scale, fine-grained datasets, thereby challenging the necessity of big data and massive computational resources for effective vision representation learning.

Carlos Vélez García, Miguel Cazorla, Jorge Pomares | 2026-03-09 | 💻 cs

NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

The paper introduces NAMI, a Bridged Progressive Rectified Flow Transformer framework that reduces image-generation inference time by 64% through a multi-resolution, spatially cascaded architecture with a BridgeFlow module, while maintaining state-of-the-art quality; it also introduces the NAMI-1K benchmark for evaluation.

Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin | 2026-03-09 | 💻 cs

ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement

ECLARE is an open-source, self-supervised super-resolution method that enhances anisotropic 2D MR volumes by estimating slice profiles and learning in-plane mappings without external data, thereby overcoming domain shift and outperforming existing techniques in both signal recovery and downstream tasks.

Samuel W. Remedios, Shuwen Wei, Shuo Han, Jinwei Zhang, Aaron Carass, Kurt G. Schilling, Dzung L. Pham, Jerry L. Prince, Blake E. Dewey | 2026-03-09 | 💻 cs

EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

The paper introduces EarthScape, a multimodal dataset and reproducible pipeline designed to automate surficial geologic mapping by integrating diverse geospatial data sources, demonstrating that terrain features provide the most robust predictive signal while highlighting the dataset's utility for benchmarking multimodal fusion and domain adaptation.

Matthew Massey, Nusrat Munia, Abdullah-Al-Zubaer Imran | 2026-03-09 | 💻 cs

Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision

This paper introduces a new framework of psychophysical tests based on low-level vision principles—specifically contrast sensitivity, masking, and matching—to evaluate and reveal the perceptual strengths and weaknesses of 34 existing image and video quality metrics, demonstrating that standard evaluation protocols often fail to capture these fundamental human visual properties.

Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk | 2026-03-09 | 💻 cs
