cs.CV papers | Gist.Science

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

This paper introduces Phys4D, a three-stage training pipeline that transforms appearance-driven video diffusion models into physics-consistent 4D world representations by combining pseudo-supervised pretraining, simulation-grounded fine-tuning, and reinforcement learning to achieve fine-grained spatiotemporal and physical consistency.

Haoran Lu, Shang Wu, Jianshu Zhang + 9 more | 2026-03-05 | cs.AI

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

This study proposes a novel Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer that fuses Sentinel-1, RCM, and AMSR2 data to generate 200m resolution pan-Arctic sea ice concentration maps with reliable uncertainty estimates, effectively overcoming challenges related to subtle feature extraction, inexact labels, and data heterogeneity.

Mabel Heffring, Lincoln Linlin Xu | 2026-03-05 | cs.LG

PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

PhyPrompt introduces a two-stage reinforcement learning framework that automatically refines text-to-video prompts through physics-focused fine-tuning and a dynamic reward curriculum, significantly enhancing physical plausibility and semantic adherence across diverse models while outperforming much larger general-purpose LLMs.

Shang Wu, Chenwei Xu, Zhuofan Xia + 6 more | 2026-03-05 | cs.AI

PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

This paper introduces PinCLIP, a large-scale foundational multimodal representation model for Pinterest that employs a novel hybrid Vision Transformer architecture and neighbor alignment objectives to overcome VLM integration challenges, resulting in significant improvements in multi-modal retrieval accuracy, cold-start content distribution, and overall user engagement.

Josh Beal, Eric Kim, Jinfeng Rao + 3 more | 2026-03-05 | cs

Modeling Cross-vision Synergy for Unified Large Vision Model

This paper introduces PolyV, a unified large vision model that achieves cross-vision synergy across images, videos, and 3D data through a sparse Mixture-of-Experts architecture with dynamic routing and a synergy-aware training paradigm, resulting in significant performance improvements over existing models.

Shengqiong Wu, Lanhu Wu, Mingyang Bao + 5 more | 2026-03-05 | cs

Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

This paper proposes a novel confidence-aware monocular depth estimation framework for minimally invasive surgery that leverages calibrated confidence targets and a specialized loss function to improve depth accuracy and provide reliable per-pixel confidence maps, thereby addressing challenges posed by endoscopic image artifacts like smoke and blur.

Muhammad Asad, Emanuele Colleoni, Pritesh Mehta + 7 more | 2026-03-05 | cs

From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

This paper introduces L2G-Det, a novel framework that detects and segments specific object instances in open-world scenes by leveraging dense local patch matching to generate candidate points, which are then refined and used to prompt an augmented Segment Anything Model for robust mask reconstruction without relying on traditional object proposals.

Qifan Zhang, Sai Haneesh Allu, Jikai Wang + 2 more | 2026-03-05 | cs

Spectrum Shortage for Radio Sensing? Leveraging Ambient 5G Signals for Human Activity Detection

This paper introduces Ambient Radio Sensing (ARS), a novel ISAC approach that repurposes ambient 5G signals for human activity detection via a passive self-mixing hardware architecture and a cross-modal learning framework, effectively overcoming spectrum scarcity while preserving privacy.

Kunzhe Song, Maxime Zingraff, Huacheng Zeng | 2026-03-05 | cs

An Effective Data Augmentation Method by Asking Questions about Scene Text Images

This paper proposes a VQA-inspired data augmentation framework that generates natural-language questions about character-level attributes to enhance scene and handwritten text recognition models, resulting in significant improvements in transcription accuracy on benchmark datasets.

Xu Yao, Lei Kang | 2026-03-05 | cs

Hazard-Aware Traffic Scene Graph Generation

This paper introduces a novel Traffic Scene Graph Generation framework that leverages accident data and depth cues to model safety-relevant relations between hazards and the ego vehicle, thereby enhancing situational awareness in complex driving scenarios.

Yaoqi Huang, Julie Stephany Berrio, Mao Shan + 1 more | 2026-03-05 | cs

DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization

This paper proposes DM-CFO, a diffusion model-based framework that integrates text and graph constraints for layout generation with collision-free optimization via 3D Gaussian updates and distance regularization to produce realistic, intersection-free compositional 3D tooth designs.

Yan Tian, Pengcheng Xue, Weiping Ding + 5 more | 2026-03-05 | cs

Detection and Identification of Penguins Using Appearance and Motion Features

This paper proposes a framework that enhances penguin detection and identification in animal facilities by integrating motion cues into a modified YOLO11 detector for improved temporal consistency and employing tracklet-based contrastive learning to generate coherent feature embeddings for individual recognition.

Kasumi Seko, Hiroki Kinoshita, Raj Rajeshwar Malinda + 1 more | 2026-03-05 | cs

Tracking Feral Horses in Aerial Video Using Oriented Bounding Boxes

This paper proposes a robust method for tracking feral horses in aerial video by employing oriented bounding boxes and a novel head-orientation estimation technique using multi-detector voting to resolve 180° flipping ambiguities, thereby achieving 99.3% accuracy in distinguishing head from tail for continuous trajectory analysis.

Saeko Takizawa, Tamao Maeda, Shinya Yamamoto + 1 more | 2026-03-05 | cs

Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression

The paper proposes ParaHydra, a novel distributed multi-view image compression framework featuring an OmniParallax Attention Mechanism and a Parallax Multi Information Fusion Module that adaptively aligns and integrates inter-view correlations, enabling it to significantly outperform state-of-the-art multi-view codecs in both bitrate efficiency and computational speed.

Haotian Zhang, Feiyue Long, Yixin Yu + 7 more | 2026-03-05 | cs

LeafInst: Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV-based Benchmark

This paper introduces LeafInst, a novel instance segmentation network designed for fine-grained forestry leaf analysis in open-field UAV imagery, and validates its superior performance on the newly constructed Poplar-leaf benchmark and the public PhenoBench dataset.

Taige Luo, Junru Xie, Chenyang Fan + 5 more | 2026-03-05 | cs

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

This paper introduces RAGTrack, a novel Retrieval-Augmented Generation framework that enhances RGB-Thermal tracking by integrating textual descriptions via Multi-modal Large Language Models and employing adaptive token fusion with context-aware reasoning to overcome appearance variations and modality gaps.

Hao Li, Yuhao Wang, Wenning Hao + 3 more | 2026-03-05 | cs

CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing

The paper introduces CoRe-BT, a multimodal benchmark comprising 310 patients with MRI, histopathology, and pathology reports, designed to evaluate robust brain tumor typing under realistic conditions of missing clinical data.

Juampablo E. Heras Rivera, Daniel K. Low, Xavier Xiong + 5 more | 2026-03-05 | cs

Extending Neural Operators: Robust Handling of Functions Beyond the Training Set

This paper presents a rigorous framework that extends neural operators to robustly handle out-of-distribution input functions by leveraging kernel approximations and Reproducing Kernel Hilbert Space theory to ensure accurate prediction of both function values and derivatives, validated through solutions of elliptic partial differential equations on manifolds.

Blaine Quackenbush, Paul J. Atzberger | 2026-03-05 | cs.LG

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

This paper introduces Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial instructions into natural images to hijack Multimodal Large Language Models, demonstrating a 64% success rate in manipulating model outputs while remaining visually imperceptible to humans.

Neha Nagaraja, Lan Zhang, Zhilong Wang + 2 more | 2026-03-05 | cs.AI

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

The paper presents InfinityStory, a novel framework, dataset, and model that overcome key limitations in long-form video generation by ensuring background and character consistency across shots while enabling seamless multi-subject transitions for hour-long narratives.

Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen + 27 more | 2026-03-05 | cs
