Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

This paper introduces VGUBench to demonstrate that while Unified Multimodal Large Language Models exhibit strong textual reasoning and visual rendering capabilities individually, they fail to maintain semantic equivalence when required to generate visual answers, revealing a critical breakdown in cross-modal semantic alignment rather than a lack of generation fidelity.

Hongbo Jiang, Jie Li, Yunhang Shen + 4 more · 2026-03-02 · 💻 cs

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

StemVLA is an open-source Vision-Language-Action model that improves robot manipulation on long-horizon tasks by explicitly integrating predicted future 3D spatial geometry with aggregated 4D historical spatiotemporal representations, strengthening spatial reasoning and decision-making in dynamic environments.
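The summary names the architecture only at a high level; as a loose illustration of conditioning an action decoder on auxiliary future-geometry and history token streams (not StemVLA's actual design; every class name and shape below is an assumption), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class ActionHeadWithAuxTokens(nn.Module):
    """Illustrative only: fuse vision-language tokens with predicted
    future-geometry tokens and historical tokens via shared
    self-attention, then decode an action chunk."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.to_actions = nn.Linear(dim, action_dim * horizon)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vl_tokens, future_geom_tokens, history_tokens):
        # One shared sequence lets attention mix the current observation,
        # predicted future 3D structure, and past spatiotemporal context.
        fused = self.fusion(torch.cat(
            [vl_tokens, future_geom_tokens, history_tokens], dim=1))
        pooled = fused.mean(dim=1)  # (B, dim)
        return self.to_actions(pooled).view(-1, self.horizon, self.action_dim)
```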

Jiasong Xiao, Yutao She, Kai Li + 3 more · 2026-03-02 · 💻 cs

VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact-free video

The paper introduces VideoPulse, a comprehensive dataset and end-to-end deep learning pipeline that together enable accurate, contact-free estimation of neonatal heart rate and SpO2 from facial video, offering a low-cost, non-invasive alternative to traditional adhesive monitoring methods in intensive care settings.
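The paper's pipeline is learned end to end; for intuition only, a classical remote-photoplethysmography (rPPG) baseline for the heart-rate half of the task band-passes the mean green-channel intensity of a face region and reads off the dominant spectral peak. A minimal sketch, not the paper's method (`green_means` is assumed to be a per-frame mean already extracted from the video):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr_bpm(green_means: np.ndarray, fps: float) -> float:
    """Classical rPPG baseline: band-pass the per-frame mean green-channel
    signal and take the dominant FFT frequency as heart rate."""
    # Detrend, then keep a plausible neonatal HR band (1.5-4 Hz, i.e. 90-240 bpm).
    x = green_means - green_means.mean()
    b, a = butter(3, [1.5 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    x = filtfilt(b, a, x)
    # Dominant spectral peak within the band, converted to beats per minute.
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= 1.5) & (freqs <= 4.0)
    return float(freqs[band][np.argmax(power[band])] * 60.0)
```

SpO2 has no comparably simple single-channel baseline (classical pulse oximetry needs absorption ratios at two wavelengths), which is part of what motivates a learned pipeline.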

Deependra Dewagiri, Kamesh Anuradha, Pabadhi Liyanage + 6 more · 2026-03-02 · ⚡ eess

Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation Using Foundation Models

This paper proposes a novel few-shot 3D vessel segmentation framework that adapts the pre-trained DINOv3 foundation model with specialized 3D components to achieve superior performance and robustness in data-scarce and out-of-distribution clinical scenarios, significantly outperforming state-of-the-art methods such as nnU-Net while using only five training samples.
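The paper's specialized 3D components are not reproduced here; a minimal sketch of the general adaptation pattern (freeze a 2D foundation encoder standing in for DINOv3, extract per-slice features, train only a small 3D decoder), with all names and shapes assumed:

```python
import torch
import torch.nn as nn

class SliceWise3DSegHead(nn.Module):
    """Hypothetical sketch: reuse a frozen 2D foundation encoder for 3D
    volumes by stacking per-slice features along depth and training only
    a small 3D convolutional decoder. `encoder2d` stands in for a frozen
    DINOv3-style backbone mapping (B, 3, H, W) to (B, C, H', W')."""
    def __init__(self, encoder2d: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder2d = encoder2d.eval()
        for p in self.encoder2d.parameters():
            p.requires_grad = False            # only the 3D head is trained
        self.head3d = nn.Sequential(
            nn.Conv3d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv3d(64, 1, 1),               # binary vessel logit per voxel
        )

    def forward(self, volume):                 # volume: (B, D, 3, H, W)
        B, D = volume.shape[:2]
        with torch.no_grad():
            feats = self.encoder2d(volume.flatten(0, 1))       # (B*D, C, H', W')
        feats = feats.unflatten(0, (B, D)).permute(0, 2, 1, 3, 4)  # (B, C, D, H', W')
        return self.head3d(feats)
```

Freezing the backbone is what makes five training samples plausible: only the small 3D head's parameters are fit.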

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka + 1 more · 2026-03-02 · ⚡ eess

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

The paper proposes Sea², an unsupervised cross-domain adaptation framework that employs a VLM-guided agent to actively navigate and select optimal viewpoints for frozen perception models, significantly improving performance on tasks such as visual grounding, segmentation, and 3D box estimation without requiring downstream labels or model retraining.
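Conceptually, the active-perception loop amounts to scoring candidate viewpoints with a VLM and passing the best view to a frozen perception model; the sketch below is a hypothetical outline, with `vlm_score` standing in for any image-plus-text scoring call (not Sea²'s actual interface):

```python
def perceive_actively(candidate_views, vlm_score, perception_model, query):
    """Hypothetical sketch: ask a VLM how well each candidate viewpoint
    exposes the queried object, then run the frozen perception model on
    the best one. No downstream labels or retraining are involved."""
    best_view = max(candidate_views, key=lambda v: vlm_score(v.image, query))
    return perception_model(best_view.image)
```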

Tianci Tang, Tielong Cai, Hongwei Wang + 1 more · 2026-03-02 · 🤖 cs.AI

Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

This paper proposes a robust end-to-end multimodal framework for DICOM series classification that combines bi-directional cross-attention with a sparse, missingness-aware dictionary-learning encoder. The design handles heterogeneous image content, variable series lengths, and incomplete metadata without requiring imputation, and outperforms existing baselines in both in-domain and out-of-domain settings.
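As an illustrative sketch of the bi-directional cross-attention idea (not the paper's exact encoder; dimensions and names are assumptions), image tokens can attend over metadata tokens and vice versa, with a key-padding mask letting missing metadata entries be ignored rather than imputed:

```python
import torch
import torch.nn as nn

class BiDirectionalCrossAttention(nn.Module):
    """Minimal sketch: image tokens query metadata tokens and vice
    versa; the two fused streams are pooled and concatenated."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.img_to_meta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.meta_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, meta_tokens, meta_missing_mask=None):
        # key_padding_mask (True = ignore) lets missing metadata fields be
        # skipped by attention instead of being imputed beforehand.
        img_fused, _ = self.img_to_meta(img_tokens, meta_tokens, meta_tokens,
                                        key_padding_mask=meta_missing_mask)
        meta_fused, _ = self.meta_to_img(meta_tokens, img_tokens, img_tokens)
        # Mean-pool each stream and concatenate into one joint embedding.
        return torch.cat([img_fused.mean(1), meta_fused.mean(1)], dim=-1)
```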

Tuan Truong, Melanie Dohmen, Sara Lorio + 1 more · 2026-03-02 · ⚡ eess