GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

The paper introduces GeoSolver, a framework that enhances remote sensing reasoning by leveraging a large-scale process supervision dataset (Geo-PRM-2M) and a novel Process-Aware Tree-GRPO algorithm to train a token-level reward model (GeoPRM), thereby enabling verifiable, step-by-step reasoning and robust test-time scaling for both specialized and general-purpose Vision-Language Models.

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang

Wed, 11 Ma 💻 cs

A comprehensive study of time-of-flight non-line-of-sight imaging

This paper presents a comprehensive study of time-of-flight non-line-of-sight imaging methods, unifying their theoretical formulations and hardware implementations into a common framework for analysis, and demonstrates that, under equal constraints, existing techniques share similar performance limitations despite method-specific differences.

Julio Marco, Adrian Jarabo, Ji Hyun Nam, Alberto Tosi, Diego Gutierrez, Andreas Velten

Wed, 11 Ma 💻 cs

Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective

This paper proposes a large language model-driven method for generating dynamic, semantically aligned speech and gestures for pedagogical agents in virtual reality, demonstrating through user experience experiments that such multimodal expressions significantly enhance learning effectiveness, engagement, and social presence while reducing fatigue and boredom.

Ninghao Wan, Jiarun Song, Fuzheng Yang

Wed, 11 Ma 💻 cs

DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation

This paper proposes DCAU-Net, a novel medical image segmentation framework that combines Differential Cross Attention to efficiently model long-range dependencies while reducing computational complexity, and a Channel-Spatial Feature Fusion strategy to adaptively integrate semantic and spatial details, thereby achieving enhanced segmentation accuracy and robustness.

Yanxin Li, Hui Wan, Libin Lan

Wed, 11 Ma 💻 cs

Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks

This paper introduces RuleSafe, a new long-horizon articulated manipulation benchmark featuring non-Markovian safe-unlocking tasks, and proposes VQ-Memory, a vector-quantized temporal representation that significantly enhances the planning, generalization, and efficiency of Vision-Language-Action models in complex robotic simulations.

Wang Honghui, Jing Zhi, Ao Jicong, Song Shiji, Li Xuelong, Huang Gao, Bai Chenjia

Wed, 11 Ma 💻 cs

Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

This paper investigates the reliability of Vision-Language Models (VLMs) in autonomous driving by exposing their tendencies toward response inconsistency and weak temporal reasoning, and subsequently proposes the FutureVQA benchmark and a self-supervised chain-of-thought tuning method to enhance grounded future scene reasoning without requiring temporal labels.

Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani

Wed, 11 Ma 💻 cs

SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

The paper proposes SurgFed, a language-guided multi-task federated learning framework that utilizes Language-guided Channel Selection and Language-guided Hyper Aggregation to overcome tissue and task diversity challenges, thereby improving surgical video segmentation and depth estimation across heterogeneous clinical environments.

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming Jin

Wed, 11 Ma 💻 cs

Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

This paper proposes a novel component-aware, self-refining framework that combines a Self-Attention-based Autoencoder, a Coordinate-Preserving Gated Fusion module, and a Spatially Adaptive Refinement Revisor to generate high-fidelity, semantically accurate photorealistic images from freehand sketches, significantly outperforming existing GAN and diffusion models across diverse facial and non-facial datasets.

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi

Wed, 11 Ma 💻 cs