cs.CV papers | Gist.Science

Specificity-aware reinforcement learning for fine-grained open-world classification

This paper proposes SpeciaRL, a specificity-aware reinforcement learning framework that fine-tunes reasoning Large Multimodal Models to achieve an optimal balance between correctness and specificity in open-world fine-grained image classification by employing a dynamic, verifier-based reward signal.

Samuele Angheben, Davide Berasi, Alessandro Conti + 2 more | 2026-03-05 | cs

Deep Sketch-Based 3D Modeling: A Survey

This paper presents a comprehensive survey of Deep Sketch-Based 3D Modeling (DS-3DM) by introducing the novel MORPHEUS design space, which categorizes recent advancements within an Input-Model-Output framework to highlight current limitations and identify future interdisciplinary opportunities for enhancing user-centered, controllable, and information-rich 3D creation.

Alberto Tono, Jiajun Wu, Gordon Wetzstein + 4 more | 2026-03-05 | cs

The Influence of Iconicity in Transfer Learning for Sign Language Recognition

This study demonstrates that leveraging the iconicity of signs in transfer learning, from Chinese to Arabic sign language and from Greek to Flemish sign language, significantly improves sign language recognition performance, with a 7.02% gain for Arabic. The approach uses MediaPipe-extracted spatial and temporal features processed through MLP and GRU architectures.

Keren Artiaga, Conor Lynch, Haithem Afli + 1 more | 2026-03-05 | cs.AI

mHC-HSI: Clustering-Guided Hyper-Connection Mamba for Hyperspectral Image Classification

This paper introduces mHC-HSI, a clustering-guided Hyper-Connection Mamba model that enhances hyperspectral image classification accuracy and interpretability by integrating spatial-spectral feature learning, soft cluster-based residual matrices, and physically-meaningful spectral band grouping.

Yimin Zhu, Zack Dewis, Quinn Ledingham + 6 more | 2026-03-05 | cs

Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

This paper introduces a counterfactual evaluation framework revealing that while reinforcement learning with verifiable rewards improves accuracy on medical VQA benchmarks, it often degrades genuine visual grounding by enabling models to rely on text shortcuts and hallucinate visual reasoning, necessitating new evaluation metrics and training objectives that explicitly enforce visual dependence.

Anas Zafar, Leema Krishna Murali, Ashish Vashist | 2026-03-05 | cs

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

This paper introduces Proact-VL, a general framework designed to transform multimodal language models into proactive, real-time AI companions that overcome latency and decision-making challenges, validated through the new Live Gaming Benchmark across commentary and guidance scenarios.

Weicai Yan, Yuhong Dai, Qi Ran + 6 more | 2026-03-05 | cs

Impact of Localization Errors on Label Quality for Online HD Map Construction

This paper investigates how various localization errors degrade label quality in online HD map construction, revealing that heading angle errors have a more significant impact than position errors and that model performance decreases non-linearly with increasing noise, while also proposing a distance-based metric to better evaluate these effects.

Alexander Blumberg, Jonas Merkert, Richard Fehler + 4 more | 2026-03-05 | cs

Beyond Pixel Histories: World Models with Persistent 3D State

The paper introduces PERSIST, a novel world model paradigm that simulates the evolution of a latent 3D scene to overcome the spatial memory and consistency limitations of existing video generation methods, thereby enabling coherent, long-horizon interactive experiences with persistent 3D state and geometry-aware control.

Samuel Garcin, Thomas Walker, Steven McDonagh + 5 more | 2026-03-05 | cs.AI

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

This paper introduces Phys4D, a three-stage training pipeline that transforms appearance-driven video diffusion models into physics-consistent 4D world representations by combining pseudo-supervised pretraining, simulation-grounded fine-tuning, and reinforcement learning to achieve fine-grained spatiotemporal and physical consistency.

Haoran Lu, Shang Wu, Jianshu Zhang + 9 more | 2026-03-05 | cs.AI

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

This study proposes a novel Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer that fuses Sentinel-1, RCM, and AMSR2 data to generate 200m resolution pan-Arctic sea ice concentration maps with reliable uncertainty estimates, effectively overcoming challenges related to subtle feature extraction, inexact labels, and data heterogeneity.

Mabel Heffring, Lincoln Linlin Xu | 2026-03-05 | cs.LG

PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

PhyPrompt introduces a two-stage reinforcement learning framework that automatically refines text-to-video prompts through physics-focused fine-tuning and a dynamic reward curriculum, significantly enhancing physical plausibility and semantic adherence across diverse models while outperforming much larger general-purpose LLMs.

Shang Wu, Chenwei Xu, Zhuofan Xia + 6 more | 2026-03-05 | cs.AI

PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

This paper introduces PinCLIP, a large-scale foundational multimodal representation model for Pinterest that employs a novel hybrid Vision Transformer architecture and neighbor alignment objectives to overcome VLM integration challenges, resulting in significant improvements in multi-modal retrieval accuracy, cold-start content distribution, and overall user engagement.

Josh Beal, Eric Kim, Jinfeng Rao + 3 more | 2026-03-05 | cs

Modeling Cross-vision Synergy for Unified Large Vision Model

This paper introduces PolyV, a unified large vision model that achieves cross-vision synergy across images, videos, and 3D data through a sparse Mixture-of-Experts architecture with dynamic routing and a synergy-aware training paradigm, resulting in significant performance improvements over existing models.

Shengqiong Wu, Lanhu Wu, Mingyang Bao + 5 more | 2026-03-05 | cs

Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

This paper proposes a novel confidence-aware monocular depth estimation framework for minimally invasive surgery that leverages calibrated confidence targets and a specialized loss function to improve depth accuracy and provide reliable per-pixel confidence maps, thereby addressing challenges posed by endoscopic image artifacts like smoke and blur.

Muhammad Asad, Emanuele Colleoni, Pritesh Mehta + 7 more | 2026-03-05 | cs

From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

This paper introduces L2G-Det, a novel framework that detects and segments specific object instances in open-world scenes by leveraging dense local patch matching to generate candidate points, which are then refined and used to prompt an augmented Segment Anything Model for robust mask reconstruction without relying on traditional object proposals.

Qifan Zhang, Sai Haneesh Allu, Jikai Wang + 2 more | 2026-03-05 | cs

Spectrum Shortage for Radio Sensing? Leveraging Ambient 5G Signals for Human Activity Detection

This paper introduces Ambient Radio Sensing (ARS), a novel ISAC approach that repurposes ambient 5G signals for human activity detection via a passive self-mixing hardware architecture and a cross-modal learning framework, effectively overcoming spectrum scarcity while preserving privacy.

Kunzhe Song, Maxime Zingraff, Huacheng Zeng | 2026-03-05 | cs

An Effective Data Augmentation Method by Asking Questions about Scene Text Images

This paper proposes a VQA-inspired data augmentation framework that generates natural-language questions about character-level attributes to enhance scene and handwritten text recognition models, resulting in significant improvements in transcription accuracy on benchmark datasets.

Xu Yao, Lei Kang | 2026-03-05 | cs

Hazard-Aware Traffic Scene Graph Generation

This paper introduces a novel Traffic Scene Graph Generation framework that leverages accident data and depth cues to model safety-relevant relations between hazards and the ego vehicle, thereby enhancing situational awareness in complex driving scenarios.

Yaoqi Huang, Julie Stephany Berrio, Mao Shan + 1 more | 2026-03-05 | cs

DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization

This paper proposes DM-CFO, a diffusion model-based framework that integrates text and graph constraints for layout generation with collision-free optimization via 3D Gaussian updates and distance regularization to produce realistic, intersection-free compositional 3D tooth designs.

Yan Tian, Pengcheng Xue, Weiping Ding + 5 more | 2026-03-05 | cs

Detection and Identification of Penguins Using Appearance and Motion Features

This paper proposes a framework that enhances penguin detection and identification in animal facilities by integrating motion cues into a modified YOLO11 detector for improved temporal consistency and employing tracklet-based contrastive learning to generate coherent feature embeddings for individual recognition.

Kasumi Seko, Hiroki Kinoshita, Raj Rajeshwar Malinda + 1 more | 2026-03-05 | cs
