cs.CV papers | Gist.Science

Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

This paper introduces a counterfactual evaluation framework showing that reinforcement learning with verifiable rewards improves accuracy on medical VQA benchmarks but often degrades genuine visual grounding, because models learn to rely on text shortcuts and hallucinate visual reasoning. The findings call for new evaluation metrics and training objectives that explicitly enforce visual dependence.

Anas Zafar, Leema Krishna Murali, Ashish Vashist | 2026-03-05 | cs

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

This paper introduces Proact-VL, a general framework designed to transform multimodal language models into proactive, real-time AI companions that overcome latency and decision-making challenges, validated through the new Live Gaming Benchmark across commentary and guidance scenarios.

Weicai Yan, Yuhong Dai, Qi Ran + 6 more | 2026-03-05 | cs

Impact of Localization Errors on Label Quality for Online HD Map Construction

This paper investigates how various localization errors degrade label quality in online HD map construction, revealing that heading angle errors have a more significant impact than position errors and that model performance decreases non-linearly with increasing noise, while also proposing a distance-based metric to better evaluate these effects.

Alexander Blumberg, Jonas Merkert, Richard Fehler + 4 more | 2026-03-05 | cs
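
One geometric intuition for why heading errors can outweigh position errors is that a heading error rotates every label about the vehicle, so the resulting displacement grows with distance, while a position error shifts all labels by the same amount. The sketch below illustrates only that intuition. It is not the paper's distance-based metric, and the function names are hypothetical.

```python
import math


def label_shift_from_heading_error(distance_m: float, heading_error_deg: float) -> float:
    """Displacement (m) of a label point at distance_m from the vehicle
    when the whole map is rotated by heading_error_deg about the vehicle."""
    theta = math.radians(abs(heading_error_deg))
    # Chord length for a point at radius r rotated by theta: 2 * r * sin(theta / 2)
    return 2.0 * distance_m * math.sin(theta / 2.0)


def label_shift_from_position_error(position_error_m: float) -> float:
    """A pure translation error displaces every label by the same amount."""
    return abs(position_error_m)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    for distance in (10.0, 30.0, 60.0):
        shift = label_shift_from_heading_error(distance, heading_error_deg=1.0)
        print(f"{distance:5.1f} m ahead: {shift:.3f} m shift from a 1 degree heading error")
```

Under this simplified model, a 1 degree heading error displaces a label 60 m away by roughly a metre, whereas a position error of a few centimetres stays a few centimetres at every range.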

Beyond Pixel Histories: World Models with Persistent 3D State

The paper introduces PERSIST, a novel world model paradigm that simulates the evolution of a latent 3D scene to overcome the spatial memory and consistency limitations of existing video generation methods, thereby enabling coherent, long-horizon interactive experiences with persistent 3D state and geometry-aware control.

Samuel Garcin, Thomas Walker, Steven McDonagh + 5 more | 2026-03-05 | cs.AI

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

This paper introduces Phys4D, a three-stage training pipeline that transforms appearance-driven video diffusion models into physics-consistent 4D world representations by combining pseudo-supervised pretraining, simulation-grounded fine-tuning, and reinforcement learning to achieve fine-grained spatiotemporal and physical consistency.

Haoran Lu, Shang Wu, Jianshu Zhang + 9 more | 2026-03-05 | cs.AI

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

This study proposes a novel Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer that fuses Sentinel-1, RCM, and AMSR2 data to generate 200m resolution pan-Arctic sea ice concentration maps with reliable uncertainty estimates, effectively overcoming challenges related to subtle feature extraction, inexact labels, and data heterogeneity.

Mabel Heffring, Lincoln Linlin Xu | 2026-03-05 | cs.LG

PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

PhyPrompt introduces a two-stage reinforcement learning framework that automatically refines text-to-video prompts through physics-focused fine-tuning and a dynamic reward curriculum, significantly enhancing physical plausibility and semantic adherence across diverse models while outperforming much larger general-purpose LLMs.

Shang Wu, Chenwei Xu, Zhuofan Xia + 6 more | 2026-03-05 | cs.AI

PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

This paper introduces PinCLIP, a large-scale foundational multimodal representation model for Pinterest that employs a novel hybrid Vision Transformer architecture and neighbor alignment objectives to overcome VLM integration challenges, resulting in significant improvements in multi-modal retrieval accuracy, cold-start content distribution, and overall user engagement.

Josh Beal, Eric Kim, Jinfeng Rao + 3 more | 2026-03-05 | cs

Modeling Cross-vision Synergy for Unified Large Vision Model

This paper introduces PolyV, a unified large vision model that achieves cross-vision synergy across images, videos, and 3D data through a sparse Mixture-of-Experts architecture with dynamic routing and a synergy-aware training paradigm, resulting in significant performance improvements over existing models.

Shengqiong Wu, Lanhu Wu, Mingyang Bao + 5 more | 2026-03-05 | cs

Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

This paper proposes a novel confidence-aware monocular depth estimation framework for minimally invasive surgery that leverages calibrated confidence targets and a specialized loss function to improve depth accuracy and provide reliable per-pixel confidence maps, thereby addressing challenges posed by endoscopic image artifacts like smoke and blur.

Muhammad Asad, Emanuele Colleoni, Pritesh Mehta + 7 more | 2026-03-05 | cs

From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

This paper introduces L2G-Det, a novel framework that detects and segments specific object instances in open-world scenes by leveraging dense local patch matching to generate candidate points, which are then refined and used to prompt an augmented Segment Anything Model for robust mask reconstruction without relying on traditional object proposals.

Qifan Zhang, Sai Haneesh Allu, Jikai Wang + 2 more | 2026-03-05 | cs

Spectrum Shortage for Radio Sensing? Leveraging Ambient 5G Signals for Human Activity Detection

This paper introduces Ambient Radio Sensing (ARS), a novel ISAC approach that repurposes ambient 5G signals for human activity detection via a passive self-mixing hardware architecture and a cross-modal learning framework, effectively overcoming spectrum scarcity while preserving privacy.

Kunzhe Song, Maxime Zingraff, Huacheng Zeng | 2026-03-05 | cs

An Effective Data Augmentation Method by Asking Questions about Scene Text Images

This paper proposes a VQA-inspired data augmentation framework that generates natural-language questions about character-level attributes to enhance scene and handwritten text recognition models, resulting in significant improvements in transcription accuracy on benchmark datasets.

Xu Yao, Lei Kang | 2026-03-05 | cs

Hazard-Aware Traffic Scene Graph Generation

This paper introduces a novel Traffic Scene Graph Generation framework that leverages accident data and depth cues to model safety-relevant relations between hazards and the ego vehicle, thereby enhancing situational awareness in complex driving scenarios.

Yaoqi Huang, Julie Stephany Berrio, Mao Shan + 1 more | 2026-03-05 | cs

DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization

This paper proposes DM-CFO, a diffusion model-based framework that integrates text and graph constraints for layout generation with collision-free optimization via 3D Gaussian updates and distance regularization to produce realistic, intersection-free compositional 3D tooth designs.

Yan Tian, Pengcheng Xue, Weiping Ding + 5 more | 2026-03-05 | cs

Detection and Identification of Penguins Using Appearance and Motion Features

This paper proposes a framework that enhances penguin detection and identification in animal facilities by integrating motion cues into a modified YOLO11 detector for improved temporal consistency and employing tracklet-based contrastive learning to generate coherent feature embeddings for individual recognition.

Kasumi Seko, Hiroki Kinoshita, Raj Rajeshwar Malinda + 1 more | 2026-03-05 | cs

Tracking Feral Horses in Aerial Video Using Oriented Bounding Boxes

This paper proposes a robust method for tracking feral horses in aerial video by employing oriented bounding boxes and a novel head-orientation estimation technique using multi-detector voting to resolve 180° flipping ambiguities, thereby achieving 99.3% accuracy in distinguishing head from tail for continuous trajectory analysis.

Saeko Takizawa, Tamao Maeda, Shinya Yamamoto + 1 more | 2026-03-05 | cs
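
An oriented bounding box fixes the body axis only up to a 180° flip, so some extra cue has to pick between the two opposite directions. The sketch below shows one plausible way a majority vote could do that, assuming each detector returns its own head-direction estimate in degrees. The summary does not describe the paper's actual voting scheme, and all names here are hypothetical.

```python
from typing import Sequence


def angular_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def resolve_head_direction(axis_deg: float, detector_votes_deg: Sequence[float]) -> float:
    """Choose between axis_deg and axis_deg + 180 by majority vote.

    axis_deg: body-axis angle from the oriented box (ambiguous up to 180 degrees).
    detector_votes_deg: head-direction estimates from independent detectors (assumed input).
    """
    candidates = (axis_deg % 360.0, (axis_deg + 180.0) % 360.0)
    tally = [0, 0]
    for vote in detector_votes_deg:
        # Each detector votes for whichever candidate direction it is closer to.
        d0 = angular_diff_deg(vote, candidates[0])
        d1 = angular_diff_deg(vote, candidates[1])
        tally[0 if d0 <= d1 else 1] += 1
    # Ties fall back to the unflipped axis direction.
    return candidates[0] if tally[0] >= tally[1] else candidates[1]


if __name__ == "__main__":
    # Two of three detectors point near 210 degrees, so the axis is flipped.
    print(resolve_head_direction(30.0, [200.0, 215.0, 40.0]))
```

Voting across several detectors means one bad head estimate in a frame does not flip the trajectory, which is the failure the 180° ambiguity would otherwise cause.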

Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression

The paper proposes ParaHydra, a novel distributed multi-view image compression framework featuring an OmniParallax Attention Mechanism and a Parallax Multi Information Fusion Module that adaptively aligns and integrates inter-view correlations, enabling it to significantly outperform state-of-the-art multi-view codecs in both bitrate efficiency and computational speed.

Haotian Zhang, Feiyue Long, Yixin Yu + 7 more | 2026-03-05 | cs

LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV based Benchmark

This paper introduces LeafInst, a novel instance segmentation network designed for fine-grained forestry leaf analysis in open-field UAV imagery, and validates its superior performance on the newly constructed Poplar-leaf benchmark and the public PhenoBench dataset.

Taige Luo, Junru Xie, Chenyang Fan + 5 more | 2026-03-05 | cs

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

This paper introduces RAGTrack, a novel Retrieval-Augmented Generation framework that enhances RGB-Thermal tracking by integrating textual descriptions via Multi-modal Large Language Models and employing adaptive token fusion with context-aware reasoning to overcome appearance variations and modality gaps.

Hao Li, Yuhao Wang, Wenning Hao + 3 more | 2026-03-05 | cs
