Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
This paper identifies "visual shortcuts" in existing Multimodal Knowledge-Based Visual Question Answering benchmarks, where models can answer correctly without genuine knowledge-based reasoning. To address this, it introduces the RETINA dataset, which requires reasoning about entities related to those depicted, and proposes the MIMIR model, which leverages multi-image retrieval to overcome these shortcuts.