Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

This paper proposes a novel plug-and-play ranking architecture that leverages Large Vision-Language Models (LVLMs) and a relational-aware loss function to explicitly model cross-view interactions, thereby significantly enhancing the accuracy and stability of UAV-to-satellite image geolocalization.

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao (2026-03-10, cs)
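To make the relation-modeling idea above concrete, here is a minimal, hypothetical sketch of a cross-view ranking loss with hard-negative mining over UAV/satellite embedding pairs; the pairing convention (row i of `uav` matches row i of `sat`), the margin, and the hard-negative rule are illustrative assumptions, not the paper's actual loss or its LVLM component.

```python
# Hypothetical sketch of a cross-view ranking loss with hard-negative mining.
import torch
import torch.nn.functional as F

def cross_view_ranking_loss(uav: torch.Tensor, sat: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Triplet-style loss: each UAV embedding should rank its paired satellite
    embedding above the hardest non-matching satellite embedding."""
    uav = F.normalize(uav, dim=-1)
    sat = F.normalize(sat, dim=-1)
    sim = uav @ sat.t()                          # (B, B) cross-view cosine similarities
    pos = sim.diag()                             # similarity to the true match
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(~neg_mask, float("-inf")).max(dim=1).values
    return F.relu(margin - pos + hardest_neg).mean()

# toy usage with random embeddings standing in for UAV and satellite features
print(cross_view_ranking_loss(torch.randn(8, 256), torch.randn(8, 256)))
```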

TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery

The paper proposes TALON, a test-time adaptive learning framework for on-the-fly category discovery that overcomes the limitations of static hash-based methods by dynamically updating semantic prototypes and the feature encoder to continuously integrate new knowledge, while employing margin-aware logit calibration to prevent category explosion and significantly improve novel-class accuracy.

Yanan Wu, Yuhan Yan, Tailai Chen, Zhixiang Chi, ZiZhang Wu, Yi Jin, Yang Wang, Zhenbo Li (2026-03-10, cs)
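As a rough illustration of on-the-fly discovery with dynamically updated prototypes and a margin-based new-class decision, the sketch below maintains a prototype bank and only opens a new category when the best match is too far away; the threshold, momentum, and margin rule are assumptions for illustration, not TALON's procedure.

```python
# Hypothetical sketch of prototype-based on-the-fly category discovery.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, dim: int, momentum: float = 0.9, new_class_margin: float = 0.15):
        self.protos = torch.empty(0, dim)        # one row per discovered category
        self.momentum = momentum
        self.new_class_margin = new_class_margin

    def assign(self, feat: torch.Tensor) -> int:
        """Assign a feature to an existing prototype or open a new category."""
        feat = F.normalize(feat, dim=-1)
        if self.protos.numel() > 0:
            sims = F.normalize(self.protos, dim=-1) @ feat      # cosine "logits"
            best = int(sims.argmax())
            # margin-aware decision: reuse the closest prototype unless it is
            # clearly too far from the sample, which curbs category explosion
            if sims[best] >= 1.0 - self.new_class_margin:
                self.protos[best] = self.momentum * self.protos[best] + (1 - self.momentum) * feat
                return best
        self.protos = torch.cat([self.protos, feat.unsqueeze(0)], dim=0)
        return self.protos.size(0) - 1

# toy usage: random features will mostly open new categories
bank = PrototypeBank(dim=128)
print([bank.assign(torch.randn(128)) for _ in range(5)])
```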

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

This paper introduces DSH-Bench, a comprehensive benchmark featuring a hierarchical subject taxonomy, granular difficulty and scenario classification, and a novel Subject Identity Consistency Score (SICS) metric to systematically evaluate and diagnose subject-driven text-to-image generation models.

Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, Chao Deng, Peng Shu, Huan Yu, Jie Jiang (2026-03-10, cs)
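One plausible way an identity-consistency score like SICS could be computed (the benchmark's actual definition may differ) is the mean cosine similarity between an embedding of the reference subject image and embeddings of the generated images; the embedding model is left abstract in the sketch below.

```python
# Hypothetical identity-consistency score: average cosine similarity between a
# reference subject embedding and embeddings of generated images.
import numpy as np

def identity_consistency_score(ref_embedding: np.ndarray, gen_embeddings: np.ndarray) -> float:
    """ref_embedding: (D,), gen_embeddings: (N, D); returns mean cosine similarity."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    gen = gen_embeddings / np.linalg.norm(gen_embeddings, axis=1, keepdims=True)
    return float((gen @ ref).mean())

# toy usage with random vectors standing in for image embeddings
rng = np.random.default_rng(0)
print(identity_consistency_score(rng.normal(size=512), rng.normal(size=(4, 512))))
```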

SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

The paper proposes SAMoE-VLA, a novel Vision-Language-Action framework for autonomous driving that replaces unstable token-level Mixture-of-Experts with a scene-adaptive mechanism driven by bird's-eye-view features and a conditional cross-modal causal attention module, achieving state-of-the-art performance with fewer parameters on both open-loop and closed-loop benchmarks.

Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang (2026-03-10, cs)
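A minimal sketch of scene-level (rather than token-level) expert routing: a pooled bird's-eye-view descriptor produces one gate per scene, and every token in that scene is processed by the same expert mixture. Shapes and module choices are illustrative assumptions, not SAMoE-VLA's architecture.

```python
# Hypothetical scene-adaptive mixture-of-experts gated by BEV features.
import torch
import torch.nn as nn

class SceneAdaptiveMoE(nn.Module):
    def __init__(self, bev_dim: int, token_dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(bev_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(token_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim))
            for _ in range(num_experts)
        ])

    def forward(self, bev: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) bird's-eye-view features; tokens: (B, T, D)
        scene = bev.mean(dim=(2, 3))                        # (B, C) scene descriptor
        weights = self.gate(scene).softmax(dim=-1)          # (B, E) one gate per scene
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (B, E, T, D)
        return (weights[:, :, None, None] * expert_out).sum(dim=1)

moe = SceneAdaptiveMoE(bev_dim=64, token_dim=128)
out = moe(torch.randn(2, 64, 16, 16), torch.randn(2, 10, 128))
print(out.shape)  # torch.Size([2, 10, 128])
```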

UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

UniGround introduces a novel, training-free framework for universal 3D visual grounding that leverages global candidate filtering and local precision reasoning to achieve state-of-the-art zero-shot performance in localizing arbitrary objects within complex 3D environments without relying on pre-trained models or 3D supervision.

Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu (2026-03-10, cs)
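A toy two-stage sketch of the global-filter-then-local-rerank idea, assuming precomputed candidate embeddings and 3D centers; the scoring functions are simple stand-ins, not UniGround's scene-parsing or reasoning steps.

```python
# Hypothetical two-stage grounding: global semantic filtering, then local reranking.
import numpy as np

def ground_query(query_emb, cand_embs, cand_centers, anchor_center, keep_top=5):
    """query_emb: (D,), cand_embs: (N, D), cand_centers: (N, 3), anchor_center: (3,)."""
    # global stage: keep the candidates most similar to the query text
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    kept = np.argsort(sims)[::-1][:keep_top]
    # local stage: among survivors, prefer the one closest to a referenced anchor
    dists = np.linalg.norm(cand_centers[kept] - anchor_center, axis=1)
    return int(kept[np.argmin(dists)])

# toy usage with random embeddings and object centers
rng = np.random.default_rng(1)
idx = ground_query(rng.normal(size=32), rng.normal(size=(20, 32)),
                   rng.uniform(size=(20, 3)), np.array([0.5, 0.5, 0.0]))
print(idx)
```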

Fast Low-light Enhancement and Deblurring for 3D Dark Scenes

FLED-GS is a fast framework for novel view synthesis in 3D dark scenes that addresses compound low-light, noise, and motion blur degradations by reformulating restoration as an alternating cycle of 2D deblurring and noise-aware 3D Gaussian Splatting reconstruction, achieving superior performance with significantly faster training and rendering speeds compared to state-of-the-art methods.

Feng Zhang, Jinglong Wang, Ze Li, Yanghong Zhou, Yang Chen, Lei Chen, Xiatian Zhu (2026-03-10, cs)
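A toy sketch of the alternating restoration cycle described above, using unsharp-mask deblurring and variance-weighted fusion as stand-ins for the paper's 2D deblurring and noise-aware 3D Gaussian Splatting steps; only the alternation structure is illustrated.

```python
# Toy alternating loop: deblur each view, fuse with noise-aware weights, feed back.
import numpy as np
from scipy.ndimage import gaussian_filter

def deblur(view, amount=1.0, sigma=1.5):
    # unsharp masking as a crude stand-in for a learned 2D deblurring step
    return view + amount * (view - gaussian_filter(view, sigma))

def noise_aware_fuse(views):
    # weight each view inversely to its estimated noise (high-frequency variance proxy)
    noise = np.array([np.var(v - gaussian_filter(v, 1.0)) for v in views]) + 1e-8
    w = (1.0 / noise) / (1.0 / noise).sum()
    return np.tensordot(w, np.stack(views), axes=1)

views = [np.random.rand(64, 64) for _ in range(4)]
for _ in range(3):
    views = [deblur(v) for v in views]                 # restoration pass
    fused = noise_aware_fuse(views)                    # stand-in for 3D reconstruction
    views = [0.5 * v + 0.5 * fused for v in views]     # feed the reconstruction back
print(fused.shape)
```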

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

This paper introduces MV-Fashion, a large-scale multi-view video dataset featuring 3,273 sequences with pixel-level annotations, ground-truth material properties, and paired flat/worn garment images, designed to bridge the realism and annotation gaps in existing datasets for virtual try-on and size estimation tasks.

Hunor Laczkó, Libang Jia, Loc-Phat Truong, Diego Hernández, Sergio Escalera, Jordi Gonzalez, Meysam Madadi (2026-03-10, cs)

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

The paper introduces MERLIN, a novel training framework accompanied by the EM-100k dataset and EM-Bench benchmark, to overcome data scarcity, evaluation gaps, and low-SNR fragility in building robust Multimodal Large Language Models for electromagnetic signals.

Junyu Shen, Zhendong She, Chenghanyu Zhang, Yuchuang Sun, Luqing Luo, Dingwei Tan, Zonghao Guo, Bo Guo, Zehua Han, Wupeng Xie, Yaxin Mu, Peng Zhang, Peipei Li, Fengxiang Wang, Yangang Sun, Maosong Sun (2026-03-10, cs)

MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

This paper proposes MM-TS, a novel framework for multi-modal contrastive learning that dynamically adjusts temperature and margin schedules based on local data distribution to address long-tail imbalances, unifying InfoNCE and max-margin objectives to achieve state-of-the-art performance across multiple image- and video-language datasets.

Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva (2026-03-10, cs)
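A minimal sketch of an InfoNCE-style loss with a per-sample temperature and an additive margin on the positive pair, using batch-local similarity as a toy density signal; the density heuristic and schedules below are illustrative assumptions, not the actual MM-TS schedules.

```python
# Hypothetical contrastive loss with per-sample temperature and margin schedules.
import torch
import torch.nn.functional as F

def scheduled_contrastive_loss(img, txt, base_tau=0.07, base_margin=0.1):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    sim = img @ txt.t()                                   # (B, B) image-text similarities
    # toy local-distribution signal: mean similarity of each sample to the batch
    density = sim.mean(dim=1, keepdim=True).detach().clamp(min=0.0)
    tau = base_tau * (1.0 + density)                      # denser regions -> softer logits
    margin = base_margin * (1.0 - density)                # sparser (tail) samples -> larger margin
    eye = torch.eye(sim.size(0), device=sim.device)
    logits = (sim - margin * eye) / tau                   # margin tightens the positive pair
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, labels)

# toy usage with random image and text embeddings
print(scheduled_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```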