VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

The paper introduces VocSegMRI, a multimodal framework that leverages cross-attention fusion and contrastive learning to integrate video, audio, and phonological signals, achieving state-of-the-art vocal tract segmentation in real-time MRI with a Dice score of 0.95 and robust performance even when audio is unavailable.

Daiqi Liu, Tomás Arias-Vergara, Johannes Enk, Fangxu Xing, Maureen Stone, Jerry L. Prince, Jana Hutter, Andreas Maier, Jonghye Woo, Paula Andrea Pérez-Toro
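For context on the 0.95 figure reported above: the Dice score is the standard region-overlap metric for segmentation, 2|A∩B| / (|A|+|B|). The paper's fusion architecture is not reproduced here; this is only a minimal NumPy sketch of the metric itself.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps guards against division by zero when both masks are empty
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Two 4x4 blocks with half their area in common score 0.5
a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[2:6, 4:8] = True
print(dice_score(a, a))  # ~1.0
print(dice_score(a, b))  # ~0.5
```

A score of 0.95 therefore means predicted and ground-truth vocal-tract masks overlap almost completely.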

CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus

CoRe-GS is a coarse-to-refined Gaussian Splatting framework that accelerates 3D reconstruction for robotic applications by selectively optimizing only task-relevant points of interest, thereby significantly reducing training time and mitigating artifacts while maintaining high-quality semantic segmentation.

Hannah Schieber, Dominik Frischmann, Victor Schaack, Simon Boche, Angela Schoellig, Stefan Leutenegger, Daniel Roth

Improving Large Vision-Language Models' Understanding for Flow Field Data

This paper introduces FieldLVLM, a novel framework that enhances Large Vision-Language Models' ability to interpret complex scientific field data by combining a specialized pipeline for extracting physical features into structured text with a data-compressed tuning strategy, resulting in superior performance on scientific benchmarks.

Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang

SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding

This paper proposes SpikeSMOKE, a low-power monocular 3D object detection framework based on Spiking Neural Networks that introduces a Cross-Scale Gated Coding mechanism and a lightweight residual block to overcome information loss and computational inefficiency, achieving superior performance on KITTI and other datasets while significantly reducing energy consumption and model complexity compared to traditional ANN-based approaches.

Xuemei Chen, Huamin Wang, Jing Peng, Hangchi Shen, Shukai Duan, Shiping Wen, Tingwen Huang

M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for optical-SAR Object Detection

This paper introduces M4-SAR, a large-scale, multi-resolution, multi-polarization, and multi-source dataset with nearly one million labeled instances, alongside a unified benchmarking toolkit and a novel end-to-end fusion framework (E2E-OSDet) that collectively advance optical-SAR object detection by demonstrating significant performance gains over single-source methods in complex environments.

Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo

Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

This paper introduces BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, and proposes NFA-ViT, a noise-guided Vision Transformer that amplifies subtle forgery traces to significantly improve the detection and generalization of localized AI-generated image forgeries.

Lvpan Cai, Haowei Wang, Jiayi Ji, Yanshu Zhoumen, Shen Chen, Taiping Yao, Xiaoshuai Sun

A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects

This survey provides a comprehensive review of over 200 papers on Wi-Fi sensing generalizability, offering a structured taxonomy of techniques to address domain shifts, summarizing key datasets, and outlining future research directions and community resources.

Fei Wang, Tingting Zhang, Wei Xi, Han Ding, Ge Wang, Di Zhang, Yuanhao Cui, Fan Liu, Jinsong Han, Jie Xu, Tony Xiao Han

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

This paper introduces iMarkers, a novel class of invisible fiducial markers detectable only by robots and AR devices, which overcome the visual aesthetic limitations of traditional markers while offering customizable production, robust detection algorithms, and proven effectiveness across diverse robotics scenarios.

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

ARSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis

This paper introduces ARSGaussian, a novel view synthesis method for aerial remote sensing that integrates LiDAR constraints, distortion-aware coordinate transformations, and geometric consistency losses to mitigate floaters and overgrowth while achieving high-precision geo-alignment, supported by the newly released AIR-LONGYAN dataset.

Yiling Yao, Bing Zhang, Wenjuan Zhang, Lianru Gao, Dailiang Peng, Bocheng Li, Yaning Wang, Bowen Wang

TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation

TIMotion is an efficient framework for human-human motion generation that improves upon existing methods by introducing Causal Interactive Injection, Role-Evolving Scanning, and Localized Pattern Amplification to better model temporal dynamics and interactive roles, thereby achieving superior performance on benchmark datasets.

Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhucun Xue, Yong Liu

Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading

This paper introduces the concept of Whole Slide Difficulty (WSD), derived from diagnostic disagreements between expert and non-expert pathologists, and demonstrates that leveraging this metric through multi-task learning or weighted loss functions significantly improves the accuracy of prostate cancer Gleason grading in Multiple Instance Learning models, particularly for higher-grade cases.

Marie Arrivat, Rémy Peyret, Elsa Angelini, Pietro Gori
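One of the two strategies mentioned above, a WSD-weighted loss, can be illustrated generically: scale each slide's loss by its difficulty so hard slides contribute more gradient. The paper's exact formulation is not given here; the function and `alpha` parameter below are illustrative assumptions, shown as a minimal NumPy sketch of difficulty-weighted cross-entropy.

```python
import numpy as np

def wsd_weighted_ce(logits: np.ndarray, labels: np.ndarray,
                    difficulty: np.ndarray, alpha: float = 1.0) -> float:
    """Cross-entropy per slide, scaled by a per-slide difficulty score in [0, 1]."""
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]
    # harder slides (larger WSD) get proportionally larger weight
    weights = 1.0 + alpha * difficulty
    return float((weights * ce).mean())

# With uniform logits, per-slide CE is ln(2); zero difficulty leaves it unweighted
logits = np.zeros((2, 2))
labels = np.array([0, 1])
print(wsd_weighted_ce(logits, labels, np.zeros(2)))  # ~0.6931 (= ln 2)
```

The alternative the summary mentions, multi-task learning, would instead predict WSD as an auxiliary output alongside the Gleason grade.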

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

This paper proposes an interpretable text-motion retrieval framework that represents 3D human motion as joint-angle pseudo-images processed by Vision Transformers and aligns them with text via a token-wise late interaction mechanism, thereby overcoming the limitations of global-embedding methods by capturing fine-grained correspondences and improving retrieval accuracy.

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
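Token-wise late interaction, as described in the entry above, is commonly instantiated as ColBERT-style MaxSim: each text token is matched against its most similar motion patch, and the per-token maxima are summed. The paper's exact mechanism may differ; this is a minimal NumPy sketch of the generic technique.

```python
import numpy as np

def late_interaction_score(text_tokens: np.ndarray, motion_patches: np.ndarray) -> float:
    """MaxSim late interaction: sum over text tokens of the max cosine
    similarity to any motion-patch embedding."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    m = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)
    sim = t @ m.T                     # (num_text_tokens, num_patches)
    return float(sim.max(axis=1).sum())  # best-matching patch per token, summed

# Each text token finds a perfectly aligned patch, so the score is 2.0
text = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(late_interaction_score(text, patches))  # 2.0
```

Unlike a single global embedding, this keeps the token-patch similarity matrix, which is what makes the retrieval interpretable at the level of individual correspondences.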