Detection and Measurement of Hailstones with Multimodal Large Language Models

This study shows that pre-trained multimodal large language models can detect hailstones and measure their diameters in crowdsourced social media images with an average error of 1.12 cm, particularly when enhanced with two-stage prompting strategies that use reference objects. The approach offers a promising complement to traditional hail sensors for rapid severe weather assessment.

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | cs.AI

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

The paper proposes FlowRVS, a novel one-stage generative framework that reformulates Referring Video Object Segmentation as a language-guided continuous flow deformation problem, leveraging pretrained text-to-video models to achieve state-of-the-art performance by directly mapping video representations to target masks while overcoming the limitations of traditional cascaded approaches.

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | cs

Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

This study demonstrates that self-supervised deep learning, specifically the "Bootstrap Your Own Latent" strategy, enables highly accurate statewide 1-meter land cover classification using only 1,000 annotated patches, effectively overcoming the data scarcity barrier for large-scale, high-resolution mapping.

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira + 1 more | 2026-02-27 | cs

Q²: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

This paper introduces Q², a training-only framework that addresses performance degradation in low-bit quantization for complex visual tasks by mitigating gradient imbalance at feature fusion stages through dynamic gradient balancing and attention distribution alignment, thereby significantly improving object detection and image segmentation accuracy without inference-time overhead.

Zhaoyang Wang, Dong Wang | 2026-02-27 | cs.AI

Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

The paper proposes FiNDR, a novel framework that leverages reasoning-augmented large multi-modal models to achieve state-of-the-art, vocabulary-free fine-grained image recognition by automatically generating, filtering, and utilizing descriptive candidate labels, thereby surpassing traditional methods that rely on fixed human-defined vocabularies.

Dmitry Demidov, Zaigham Zaheer, Zongyan Han + 2 more | 2026-02-27 | cs

Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

UniPath is a novel framework that overcomes limitations in computational pathology image generation by leveraging mature diagnostic understanding to produce controllable, semantics-driven images via multi-stream control (raw text, diagnostic semantic tokens, and morphological prototypes) and a curated large-scale dataset, achieving state-of-the-art performance and fine-grained semantic fidelity.

Minghao Han, Yichen Liu, Yizhou Liu + 5 more | 2026-02-27 | cs

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

This paper introduces WebGym, a large-scale open-source environment with nearly 300,000 realistic web tasks and a high-throughput asynchronous rollout system, which enables reinforcement learning to significantly improve the performance of visual web agents on out-of-distribution websites, surpassing both proprietary models and prior open-source approaches.

Hao Bai, Alexey Taymanov, Tong Zhang + 2 more | 2026-02-27 | cs.LG

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

This paper introduces ThinkRL-Edit, a reasoning-centric reinforcement learning framework that enhances instruction-driven image editing by decoupling visual reasoning from synthesis through Chain-of-Thought sampling, unbiased reward grouping, and binary checklist-based VLM evaluation to overcome limitations in exploration, reward fusion, and reward stability.

Hengjia Li, Liming Jiang, Qing Yan + 6 more | 2026-02-27 | cs

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

This paper proposes a generic Visible Light Positioning (VLP) algorithm called LC-VLP that utilizes Lamé curves as a unified representation for diverse LED shapes, enabling accurate camera pose estimation through a correspondence-free initialization and nonlinear optimization, which achieves superior performance over state-of-the-art methods with sub-4 cm average position accuracy.

Wenxuan Pan, Yang Yang, Dong Wei + 4 more | 2026-02-27 | eess