cs.CV papers | Gist.Science

Detection and Measurement of Hailstones with Multimodal Large Language Models

This study demonstrates that pre-trained multimodal large language models, particularly when enhanced with two-stage prompting strategies that leverage reference objects, can effectively detect and measure hailstone diameters from crowdsourced social media images with an average error of 1.12 cm, offering a promising complement to traditional hail sensors for rapid severe weather assessment.

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | cs.AI

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

The paper proposes FlowRVS, a novel one-stage generative framework that reformulates Referring Video Object Segmentation as a language-guided continuous flow deformation problem, leveraging pretrained text-to-video models to achieve state-of-the-art performance by directly mapping video representations to target masks while overcoming the limitations of traditional cascaded approaches.

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | cs

G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

G4Splat is a novel 3D reconstruction method that leverages accurate metric-scale geometry derived from planar structures to guide a generative prior, effectively resolving multi-view inconsistencies and enabling high-quality scene completion in both observed and unobserved regions.

Junfeng Ni, Yixin Chen, Zhifei Yang + 4 more | 2026-02-27 | cs

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

This paper introduces PoSh, a scene graph-guided LLM-as-a-Judge metric for evaluating detailed image descriptions, and validates it through the new DOCENT benchmark, demonstrating superior correlation with human judgments and robustness across diverse image types compared to existing metrics.

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford + 7 more | 2026-02-27 | cs.CL

Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

This study demonstrates that self-supervised deep learning, specifically the "Bootstrap Your Own Latent" strategy, enables highly accurate statewide 1-meter land cover classification using only 1,000 annotated patches, effectively overcoming the data scarcity barrier for large-scale, high-resolution mapping.

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira + 1 more | 2026-02-27 | cs

Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

This paper introduces Q$^2$, a training-only framework that addresses performance degradation in low-bit quantization for complex visual tasks by mitigating gradient imbalance at feature fusion stages through dynamic gradient balancing and attention distribution alignment, thereby significantly improving object detection and image segmentation accuracy without inference-time overhead.

Zhaoyang Wang, Dong Wang | 2026-02-27 | cs.AI

USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

This paper proposes USF-Net, a unified spatiotemporal fusion network that integrates adaptive large-kernel convolutions and low-complexity attention mechanisms to overcome limitations in existing cloud image extrapolation methods, achieving superior accuracy and efficiency while introducing the new ASI-CIS dataset.

Penghui Niu, Taotao Cai, Suqi Zhang + 4 more | 2026-02-27 | cs

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

This paper identifies and addresses the "visual shortcuts" plaguing existing Multimodal Knowledge-Based Visual Question Answering benchmarks by introducing the RETINA dataset, which forces models to reason about related entities, and proposing the MIMIR model that leverages multi-image retrieval to overcome these limitations.

Dosung Lee, Sangwon Jung, Boyoung Kim + 4 more | 2026-02-27 | cs

Diffusion Model in Latent Space for Medical Image Segmentation Task

The paper proposes MedSegLatDiff, an efficient latent-space diffusion framework that combines a VAE with a weighted cross-entropy loss to generate diverse, uncertainty-aware medical image segmentation hypotheses while achieving state-of-the-art performance on multiple clinical datasets.

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son + 1 more | 2026-02-27 | cs.AI

ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

This paper introduces ClimaDrive, a framework for generating physically realistic and weather-diverse synthetic anomaly data, and leverages it to build the ClimaOoD benchmark, which significantly enhances the generalization and robustness of anomaly segmentation models in open-world autonomous driving scenarios.

Yuxing Liu, Zheng Li, Huanhuan Liang + 3 more | 2026-02-27 | cs

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

VLM-Pruner is a training-free token pruning algorithm that enhances efficient Vision-Language Model inference by introducing a centrifugal selection paradigm and a Buffering for Spatial Sparsity criterion to balance redundancy reduction with spatial coverage, while selectively fusing discarded token information to maintain performance.

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni + 4 more | 2026-02-27 | cs.LG

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

The paper introduces TIMAR, a causal turn-level framework that models interleaved audio-visual contexts to generate expressive and temporally coherent 3D conversational head dynamics, significantly outperforming existing methods on the DualTalk benchmark.

Junjie Chen, Fei Wang, Zhihao Huang + 5 more | 2026-02-27 | cs

Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

The paper proposes FiNDR, a novel framework that leverages reasoning-augmented large multi-modal models to achieve state-of-the-art, vocabulary-free fine-grained image recognition by automatically generating, filtering, and utilizing descriptive candidate labels, thereby surpassing traditional methods that rely on fixed human-defined vocabularies.

Dmitry Demidov, Zaigham Zaheer, Zongyan Han + 2 more | 2026-02-27 | cs

Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

UniPath is a novel framework that overcomes limitations in computational pathology image generation by leveraging mature diagnostic understanding to produce controllable, semantics-driven images via multi-stream control (raw text, diagnostic semantic tokens, and morphological prototypes) and a curated large-scale dataset, achieving state-of-the-art performance and fine-grained semantic fidelity.

Minghao Han, Yichen Liu, Yizhou Liu + 5 more | 2026-02-27 | cs

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

This paper introduces WebGym, a large-scale open-source environment with nearly 300,000 realistic web tasks and a high-throughput asynchronous rollout system, which enables reinforcement learning to significantly improve the performance of visual web agents on out-of-distribution websites, surpassing both proprietary models and prior open-source approaches.

Hao Bai, Alexey Taymanov, Tong Zhang + 2 more | 2026-02-27 | cs.LG

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

This paper introduces ThinkRL-Edit, a reasoning-centric reinforcement learning framework that enhances instruction-driven image editing by decoupling visual reasoning from synthesis through Chain-of-Thought sampling, unbiased reward grouping, and binary checklist-based VLM evaluation to overcome limitations in exploration, reward fusion, and reward stability.

Hengjia Li, Liming Jiang, Qing Yan + 6 more | 2026-02-27 | cs

MERGETUNE: Continued Fine-Tuning of Vision-Language Models

This paper introduces MERGETUNE, a model-agnostic continued fine-tuning strategy that leverages linear mode connectivity and a second-order surrogate to recover pretrained knowledge in vision-language models after adaptation, thereby mitigating catastrophic forgetting and achieving state-of-the-art performance without additional parameters or data replay.

Wenqing Wang, Da Li, Xiatian Zhu + 1 more | 2026-02-27 | cs

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new family of open-weight vision-language models that achieves state-of-the-art performance in video understanding and pixel-level grounding by leveraging seven newly collected video datasets and a novel training recipe, all developed without relying on proprietary models.

Christopher Clark, Jieyu Zhang, Zixian Ma + 18 more | 2026-02-27 | cs.AI

A Pragmatic VLA Foundation Model

This paper introduces LingBot-VLA, a pragmatic Vision-Language-Action foundation model trained on 20,000 hours of real-world dual-arm robot data that demonstrates superior generalization and training efficiency across multiple platforms while releasing its code, model, and benchmarks to advance the field of robot learning.

Wei Wu, Fan Lu, Yunnan Wang + 22 more | 2026-02-27 | cs

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

This paper proposes a generic Visible Light Positioning (VLP) algorithm called LC-VLP that utilizes Lamé curves as a unified representation for diverse LED shapes, enabling accurate camera pose estimation through a correspondence-free initialization and nonlinear optimization, which achieves superior performance over state-of-the-art methods with sub-4 cm average position accuracy.

Wenxuan Pan, Yang Yang, Dong Wei + 4 more | 2026-02-27 | eess
