Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

This paper proposes STE-VLN, a novel approach that enhances Vision-Language Navigation in unseen environments by constructing the YE-KG, a large-scale multimodal spatiotemporal knowledge graph derived from real-world indoor videos, and integrating it via a Coarse-to-Fine Hierarchical Retrieval mechanism to improve long-horizon reasoning and handle coarse-grained instructions.

Haoxuan Xu, Tianfu Li, Wenbo Chen + 4 more · 2026-03-02 · cs

GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting

This paper introduces GDA-YOLO11, a novel amodal instance segmentation framework that significantly enhances occlusion-robust robotic fruit harvesting by inferring complete fruit shapes and accurately estimating picking points, achieving superior performance metrics and higher success rates under varying occlusion levels compared to existing models.

Caner Beldek, Emre Sariyildiz, Son Lam Phung + 1 more · 2026-03-02 · cs

Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

This paper proposes Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables multimodal large language models to perform precise region-grounded reasoning by generating continuous numerical coordinates as actions, thereby overcoming the limitations of discrete text-based or fixed-patch approaches while improving localization accuracy and training efficiency.

Kesen Zhao, Beier Zhu, Junbao Zhou + 3 more · 2026-03-02 · cs

SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting

This paper proposes SR3R, a feed-forward framework that reformulates 3D super-resolution as a direct mapping from sparse low-resolution views to high-resolution 3D Gaussian Splatting representations, learning 3D-specific high-frequency details from large-scale multi-scene data to enable robust zero-shot generalization and superior reconstruction fidelity.

Xiang Feng, Xiangbo Wang, Tieshi Zhong + 7 more · 2026-03-02 · cs

Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection

This paper proposes SteerVAD, a novel tuning-free framework that enhances video anomaly detection in frozen multi-modal LLMs by identifying latent anomaly experts and employing a hierarchical meta-controller to dynamically steer and rectify their internal representations, thereby achieving state-of-the-art performance with minimal training data.

Zhaolin Cai, Fan Li, Huiyu Duan + 2 more · 2026-03-02 · cs