cs.CV papers | Gist.Science

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

GraspLDP enhances the precision and generalization of imitation-learned robotic grasping by integrating grasp pose priors and a self-supervised reconstruction objective into a latent diffusion policy framework.

Enda Xiang, Haoxiang Ma, Xinzhu Ma + 2 more | 2026-02-27 | cs

SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

SO3UFormer addresses the failure of standard panoramic segmentation models under 3D rotations by introducing a rotation-robust architecture that learns intrinsic spherical features through gravity-independent representations, quadrature-consistent attention, and gauge-aware positional encoding, achieving superior stability on the proposed Pose35 benchmark compared to existing state-of-the-art methods.

Qinfeng Zhu, Yunxi Jiang, Lei Fan | 2026-02-27 | cs

Towards Multimodal Domain Generalization with Few Labels

This paper introduces the Semi-Supervised Multimodal Domain Generalization (SSMDG) problem and proposes a unified framework with consensus-driven regularization, disagreement-aware learning, and cross-modal prototype alignment to achieve robust generalization from multi-source data with few labels, alongside establishing the first benchmarks for this task.

Hongzhao Li, Hao Dong, Hualei Wan + 3 more | 2026-02-27 | cs

Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

This paper introduces Chain of Flow (COF), a foundational generative framework that reconstructs patient-specific 4D cardiac anatomy and motion from single-cycle 12-lead ECGs by integrating cine-CMR data, thereby transforming cardiac digital twins from task-specific predictors into fully manipulable virtual hearts for diverse clinical simulations.

Haofan Wu, Nay Aung, Theodoros N. Arvanitis + 3 more | 2026-02-27 | cs

OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality

This paper introduces OSDaR-AR, a public dataset and multi-modal augmented reality framework that utilizes Unreal Engine 5, LiDAR, and refined INS/GNSS data to bridge the sim-to-real gap by generating photorealistic, spatio-temporally coherent augmented railway sequences for training safety-critical perception systems.

Federico Nesti, Gianluca D'Amico, Mauro Marinoni + 1 more | 2026-02-27 | cs

WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

This paper introduces WaterVideoQA, a large-scale video question answering benchmark for all-waterway environments, and NaviMind, a multi-agent neuro-symbolic system that enables Autonomous Surface Vessels to transition from passive perception to regulation-compliant, interpretable cognitive reasoning through adaptive semantic routing and self-reflective verification.

Runwei Guan, Shaofeng Liang, Ningwei Ouyang + 9 more | 2026-02-27 | cs

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

This paper introduces MSJoE, a novel framework that jointly evolves a multimodal large language model and a lightweight key-frame sampler via reinforcement learning to efficiently select informative frames for long-form video understanding, achieving state-of-the-art performance on multiple benchmarks.

Wenhui Tan, Xiaoyi Yu, Jiaze Li + 5 more | 2026-02-27 | cs

pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

The paper proposes pMoE, a novel parameter-efficient fine-tuning method that integrates diverse domain knowledge through expert-specialized prompt tokens and a dynamic dispatcher, significantly outperforming existing approaches across 47 visual adaptation tasks while maintaining computational efficiency.

Shentong Mo, Xufang Luo, Dongsheng Li | 2026-02-27 | cs.AI

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

This paper presents an automated video-based framework that uses YOLOv8, U-Net calibration, and optical flow to reconstruct the velocity and stroke rate of canoe sprint team boats from panned and zoomed recordings, achieving high agreement with GPS data without requiring on-boat sensors.

Julian Ziegler, Daniel Matthes, Finn Gerdts + 5 more | 2026-02-27 | cs

Cross-Task Benchmarking of CNN Architectures

This project demonstrates that dynamic CNN architectures, particularly omni-dimensional dynamic convolution (ODConv), outperform conventional models in accuracy and efficiency across image classification, segmentation, and time series tasks by using adaptive kernel modulation and attention mechanisms to enhance feature representation and cross-task generalization.

Kamal Sherawat, Vikrant Bhati | 2026-02-27 | cs

MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

The paper introduces MM-NeuroOnco, a large-scale multimodal instruction dataset and benchmark for MRI-based brain tumor diagnosis that addresses annotation scarcity through an automated pipeline, demonstrating significant improvements in clinically grounded diagnostic reasoning via the proposed NeuroOnco-GPT model.

Feng Guo, Jiaxiang Liu, Yang Li + 2 more | 2026-02-27 | cs.AI

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

This pilot study evaluates the zero-shot capabilities of multimodal large language model agents in distinguishing visually confounded diseases like melanoma versus atypical nevus and pulmonary edema versus pneumonia, finding that a proposed multi-agent framework with contrastive adjudication improves accuracy and reduces unsupported claims, though performance remains insufficient for immediate clinical deployment.

Zihao Zhao, Frederik Hauke, Juliana De Castilhos + 2 more | 2026-02-27 | cs

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

UCM is a novel framework that unifies long-term memory and precise camera control in world models through a time-aware positional encoding warping mechanism and an efficient dual-stream diffusion transformer, achieving superior scene consistency and controllability in high-fidelity video generation.

Tianxing Xu, Zixuan Wang, Guangyuan Wang + 5 more | 2026-02-27 | cs

An automatic counting algorithm for the quantification and uncertainty analysis of the number of microglial cells trainable in small and heterogeneous datasets

This paper proposes a flexible, non-parametric automatic kernel counter that enables accurate microglial cell counting and uncertainty estimation in small, heterogeneous datasets by bypassing traditional cell detection in favor of a tailored feature extraction and single hyper-parameter training approach.

L. Martino, M. M. Garcia, P. S. Paradas + 1 more | 2026-02-27 | eess

Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images

This paper proposes an enhanced small object detection model for aerial images that integrates a Spatial Laplacian Pyramid Attention module to highlight local regions, a Multi-Scale Feature Enhancement Module to improve semantic representation, and deformable convolutions to align features within the Feature Pyramid Network, demonstrating superior performance on the VisDrone and DOTA datasets.

Zhangjian Ji, Huijia Yan, Shaotong Qiao + 2 more | 2026-02-27 | cs

D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

This paper introduces D-FINE-seg, an open-source framework that extends the D-FINE detector with a lightweight mask head and specialized training strategies to achieve state-of-the-art real-time instance segmentation performance while providing a unified, multi-backend deployment pipeline for ONNX, TensorRT, and OpenVINO.

Argo Saakyan, Dmitry Solntsev | 2026-02-27 | cs

GeoWorld: Geometric World Models

GeoWorld introduces a geometric world model that leverages Hyperbolic JEPA and Geometric Reinforcement Learning to preserve latent structural hierarchies and enable stable long-horizon visual planning, achieving state-of-the-art performance on multi-step tasks.

Zeyu Zhang, Danning Li, Ian Reid + 1 more | 2026-02-27 | cs

Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

To address the scarcity of 4D datasets and the limitations of transferring 3D models, this paper proposes PointATA, a parameter-efficient "Align then Adapt" paradigm that first bridges the 3D-4D modality gap via optimal transport and then enhances temporal modeling to achieve state-of-the-art performance in 4D perception tasks.

Yiding Sun, Jihua Zhu, Haozhe Cheng + 4 more | 2026-02-27 | cs

Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy

This paper proposes a weakly supervised vision-language framework that generates natural language descriptions of human brain cytoarchitecture by linking microscopy images to synthetic text captions derived from literature via area labels, thereby enabling interactive analysis without requiring scarce paired image-text data.

Matthew Sutton, Katrin Amunts, Timo Dickscheid + 1 more | 2026-02-27 | cs

Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

This paper introduces Locally Adaptive Decay Surfaces (LADS), a novel event representation that dynamically modulates temporal decay based on local signal dynamics to overcome the limitations of fixed-parameter methods, thereby achieving state-of-the-art face detection and landmark localization accuracy at high frequencies while enabling the use of lighter network architectures.

Paul Kielty, Timothy Hanley, Peter Corcoran | 2026-02-27 | cs
