cs.CV papers | Gist.Science

Shape vs. Context: Examining Human--AI Gaps in Ambiguous Japanese Character Recognition

This paper investigates behavioral gaps between humans and Vision-Language Models (VLMs) in recognizing ambiguous Japanese characters, showing that their decision boundaries differ in shape-only tasks, though contextual information can partially improve the alignment of VLMs with human judgments.

Daichi Haraguchi | 2026-03-02 | 💻 cs

Unsupervised Causal Prototypical Networks for De-biased Interpretable Dermoscopy Diagnosis

This paper proposes CausalProto, an unsupervised causal prototypical network that leverages structural causal modeling and information bottleneck constraints to disentangle pathological features from environmental confounders, thereby achieving de-biased, interpretable dermoscopy diagnosis without compromising accuracy.

Junhao Jia, Yueyi Wu, Huangwei Chen + 4 more | 2026-03-02 | ⚡ eess

Neural Image Space Tessellation

Neural Image-Space Tessellation (NIST) is a lightweight, screen-space post-processing technique that uses multi-scale neural operators to deform image contours and reassign appearance information, effectively simulating the visual fidelity of geometric tessellation on low-polygon meshes with constant computational cost independent of scene complexity.

Youyang Du, Junqiu Zhu, Zheng Zeng + 2 more | 2026-03-02 | 💻 cs

Learning Accurate Segmentation Purely from Self-Supervision

The paper introduces Selfment, a fully self-supervised framework that achieves state-of-the-art object segmentation and zero-shot generalization to camouflaged objects by iteratively refining self-supervised patch features to generate high-quality pseudo-labels without any manual annotations.

Zuyao You, Zuxuan Wu, Yu-Gang Jiang | 2026-03-02 | 💻 cs

OPTIAGENT: A Physics-Driven Agentic Framework for Automated Optical Design

This paper introduces OPTIAGENT, a physics-driven agentic framework that leverages Large Language Models enhanced with a specialized dataset, hybrid training objectives, and a physics-guided reward system to automate the design of functional lens systems, effectively bridging the gap between human expertise and automated optical engineering.

Yuyu Geng, Lei Sun, Yao Gao + 6 more | 2026-03-02 | 🤖 cs.LG

VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact free video

The paper introduces VideoPulse, a comprehensive dataset and end-to-end deep learning pipeline that enables accurate, contact-free estimation of neonatal heart rate and SpO2 from facial video, offering a low-cost, non-invasive alternative to traditional adhesive monitoring methods in intensive care settings.

Deependra Dewagiri, Kamesh Anuradha, Pabadhi Liyanage + 6 more | 2026-03-02 | ⚡ eess

Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models

This paper proposes a novel few-shot 3D vessel segmentation framework that adapts the pre-trained DINOv3 foundation model with specialized 3D components to achieve superior performance and robustness in data-scarce and out-of-distribution clinical scenarios, significantly outperforming state-of-the-art methods like nnU-Net with only five training samples.

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka + 1 more | 2026-03-02 | ⚡ eess

FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy

This paper introduces FluoCLIP, a stain-aware focus quality assessment framework supported by the new FluoMix dataset, which addresses the limitations of existing stain-agnostic models by leveraging a vision-language approach to accurately evaluate focus quality across diverse fluorescent stains and tissue types.

Hyejin Park, Jiwon Yoon, Sumin Park + 5 more | 2026-03-02 | ⚡ eess

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

The paper proposes EMO-R3, a reflective reinforcement learning framework that enhances the emotional reasoning capabilities of Multimodal Large Language Models by introducing Structured Emotional Thinking and a Reflective Emotional Reward to improve interpretability and alignment with human emotional cognition.

Yiyang Fang, Wenke Huang, Pei Fu + 5 more | 2026-03-02 | 🤖 cs.AI

BiM-GeoAttn-Net: Linear-Time Depth Modeling with Geometry-Aware Attention for 3D Aortic Dissection CTA Segmentation

The paper proposes BiM-GeoAttn-Net, a lightweight framework that combines Bidirectional Depth Mamba for efficient cross-slice modeling and a Geometry-Aware Vessel Attention module to achieve robust, high-accuracy 3D segmentation of aortic dissection lumens in CTA scans.

Yuan Zhang, Lei Liu, Jialin Zhang + 3 more | 2026-03-02 | ⚡ eess

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

The paper proposes Sea$^2$, an unsupervised cross-domain adaptation framework that employs a VLM-guided agent to actively navigate and select optimal viewpoints for frozen perception models, thereby significantly improving performance on tasks like visual grounding, segmentation, and 3D box estimation without requiring downstream labels or model retraining.

Tianci Tang, Tielong Cai, Hongwei Wang + 1 more | 2026-03-02 | 🤖 cs.AI

Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

This paper proposes a bimanual manipulation framework that leverages a pre-trained 3D geometric foundation model to fuse RGB-based 3D latents, 2D semantics, and proprioception within a diffusion policy, enabling the joint prediction of actions and future 3D scene evolution to achieve state-of-the-art performance without relying on explicit point clouds.

Chongyang Xu, Haipeng Li, Shen Cheng + 4 more | 2026-03-02 | 💻 cs

Footprint-Guided Exemplar-Free Continual Histopathology Report Generation

This paper introduces an exemplar-free continual learning framework for histopathology report generation that prevents catastrophic forgetting by using compact domain footprints to synthesize pseudo-WSI representations and distill linguistic styles, enabling effective adaptation to evolving clinical data without storing raw slides.

Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour + 3 more | 2026-03-02 | 💻 cs

Denoising-Enhanced YOLO for Robust SAR Ship Detection

This paper proposes CPN-YOLO, a robust SAR ship detection framework that enhances YOLOv8 through a learnable large-kernel denoising module, a PPA-based feature extraction strategy, and a Gaussian similarity loss, achieving superior precision and recall on HRSID and SSDD datasets.

Xiaojing Zhao, Shiyang Li, Zena Chu + 5 more | 2026-03-02 | 💻 cs

Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

This paper proposes a robust end-to-end multimodal framework for DICOM series classification that leverages bi-directional cross-attention and a sparse, missingness-aware dictionary learning encoder to effectively handle heterogeneous image content, variable series lengths, and incomplete metadata without requiring imputation, thereby outperforming existing baselines in both in-domain and out-of-domain settings.

Tuan Truong, Melanie Dohmen, Sara Lorio + 1 more | 2026-03-02 | ⚡ eess

Polarization Uncertainty-Guided Diffusion Model for Color Polarization Image Demosaicking

This paper proposes a Polarization Uncertainty-Guided Diffusion Model that leverages image diffusion priors and explicitly models polarization uncertainty to reconstruct high-fidelity color polarization images, addressing the difficulty that existing network-based methods have in recovering polarization characteristics when training data is scarce.

Chenggong Li, Yidong Luo, Junchao Zhang + 1 more | 2026-03-02 | ⚡ eess

NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection

The NAU-QMUL team proposed a multi-modal multi-task model leveraging pre-trained BERT and CLIP encoders with cross-modal fusion and pseudo-labeling data augmentation to achieve fifth place in both detection and source identification tasks of the CT2 AI-Generated Image Detection competition.

Xiaoyu Guo, Arkaitz Zubiaga | 2026-03-02 | 💬 cs.CL

Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

This paper introduces ReSeg-CLIP, a training-free open-vocabulary semantic segmentation method for remote sensing that achieves state-of-the-art performance by combining hierarchical attention masking with SAM-generated masks and a novel model composition strategy that averages multiple RS-specific CLIP variants.

Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga + 2 more | 2026-03-02 | 💻 cs

Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles

This paper proposes a bandwidth-adaptive, cloud-assisted framework for autonomous vehicles that dynamically splits transformer-based 360-degree 3D perception tasks between the vehicle and the cloud using feature compression and quantization, achieving a 72% latency reduction and up to 20% accuracy improvement over static methods under fluctuating network conditions.

Faisal Hawladera, Rui Meireles, Gamal Elghazaly + 2 more | 2026-03-02 | 🤖 cs.LG

Altitude-Aware Visual Place Recognition in Top-Down View

This paper proposes a hardware-free, vision-only approach for aerial visual place recognition that estimates relative altitude through ground feature density analysis to generate canonical images, significantly improving localization accuracy and robustness across diverse terrains and large altitude variations compared to traditional sensor-dependent or depth estimation methods.

Xingyu Shao, Mengfan He, Chunyu Li + 2 more | 2026-03-02 | 💻 cs
