cs.CV papers | Gist.Science

MERGETUNE: Continued Fine-Tuning of Vision-Language Models

This paper introduces MERGETUNE, a model-agnostic continued fine-tuning strategy that leverages linear mode connectivity and a second-order surrogate to recover pretrained knowledge in vision-language models after adaptation, thereby mitigating catastrophic forgetting and achieving state-of-the-art performance without additional parameters or data replay.

Wenqing Wang, Da Li, Xiatian Zhu + 1 more | 2026-02-27 | 💻 cs

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new family of open-weight vision-language models that achieves state-of-the-art performance in video understanding and pixel-level grounding by leveraging seven newly collected video datasets and a novel training recipe, all developed without relying on proprietary models.

Christopher Clark, Jieyu Zhang, Zixian Ma + 18 more | 2026-02-27 | 🤖 cs.AI

A Pragmatic VLA Foundation Model

This paper introduces LingBot-VLA, a pragmatic Vision-Language-Action foundation model trained on 20,000 hours of real-world dual-arm robot data that demonstrates superior generalization and training efficiency across multiple platforms while releasing its code, model, and benchmarks to advance the field of robot learning.

Wei Wu, Fan Lu, Yunnan Wang + 22 more | 2026-02-27 | 💻 cs

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

This paper proposes a generic Visible Light Positioning (VLP) algorithm called LC-VLP that utilizes Lamé curves as a unified representation for diverse LED shapes, enabling accurate camera pose estimation through a correspondence-free initialization and nonlinear optimization, which achieves superior performance over state-of-the-art methods with sub-4 cm average position accuracy.

Wenxuan Pan, Yang Yang, Dong Wei + 4 more | 2026-02-27 | ⚡ eess

VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

This paper proposes VQ-Style, a novel framework that leverages Residual Vector Quantized Variational Autoencoders combined with contrastive learning and an information leakage loss to effectively disentangle human motion into coarse content and fine style representations, enabling zero-shot style transfer and other applications through a simple Quantized Code Swapping technique.

Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann + 3 more | 2026-02-27 | 🤖 cs.AI

OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

The paper introduces OneVision-Encoder, a multimodal architecture that aligns with video codec principles by focusing computation on sparse, high-entropy regions rather than uniform pixel grids, thereby achieving superior efficiency and accuracy across image, video, and document understanding benchmarks.

Feilong Tang, Xiang An, Yunyao Yan + 16 more | 2026-02-27 | 💻 cs

HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

The paper proposes HLGFA, an unsupervised industrial anomaly detection framework that identifies defects by modeling cross-resolution feature consistency between high and low-resolution representations of normal samples, achieving state-of-the-art performance on the MVTec AD dataset without relying on pixel-level reconstruction.

Han Zhou, Yuxuan Gao, Yinchao Du + 1 more | 2026-02-27 | 💻 cs

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

The paper introduces GigaBrain-0.5M*, a Vision-Language-Action model that leverages world model-based reinforcement learning via the RAMP framework to overcome limitations in scene understanding and future anticipation, achieving significant performance gains and reliable long-horizon execution on complex robotic manipulation tasks.

GigaBrain Team, Boyuan Wang, Bohan Li + 23 more | 2026-02-27 | 💻 cs

PCReg-Net: Progressive Contrast-Guided Registration for Cross-Domain Image Alignment

PCReg-Net is a lightweight, progressive contrast-guided deep learning framework that achieves real-time, high-fidelity deformable image registration across heterogeneous domains by employing a coarse-to-fine strategy with multi-scale contrast analysis to overcome appearance variations and geometric misalignments.

Jiahao Qin | 2026-02-27 | 🤖 cs.AI

Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

This paper presents a large-scale systematic benchmark of seven video foundation models on a novel dataset of 32,847 videos from 1,888 participants, revealing that model performance for remote Parkinson's disease screening is highly task-dependent and establishing a rigorous baseline with AUCs up to 85.3% while highlighting the need for task-aware calibration to improve sensitivity.

Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader + 11 more | 2026-02-27 | 💻 cs

Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

This paper proposes the Deferred Visual Ingestion (DVI) framework, which replaces the lossy pre-embedding of visual content with a structure-based hierarchical indexing and deferred VLM analysis strategy, achieving significantly higher accuracy on visual-dense engineering document QA by overcoming the retrieval and detail-loss limitations of existing Pre-Ingestion methods.

Tao Xu | 2026-02-27 | 💬 cs.CL

Depth from Defocus via Direct Optimization

This paper presents a feasible global optimization approach for depth from defocus that utilizes alternating minimization between convex optimization and parallel grid search to achieve high-resolution depth recovery on both synthetic and real datasets, outperforming current deep learning methods.

Holly Jackson, Caleb Adams, Ignacio Lopez-Francos + 1 more | 2026-02-27 | 💻 cs

Compact Hadamard Latent Codes for Efficient Spectral Rendering

This paper introduces Hadamard spectral codes, a compact latent representation that enables efficient spectral rendering by approximating complex wavelength-dependent interactions through a small number of standard RGB rendering passes while preserving linear operations and approximating multiplicative spectral relationships.

Jiaqi Yu, Dar'ya Guarnera, Giuseppe Claudio Guarnera | 2026-02-27 | 💻 cs

Automated Disentangling Analysis of Skin Colour for Lesion Images

This paper proposes a novel skin-colour disentangling framework that utilizes randomized decolourization and geometry-aligned post-processing to learn a structured latent space for generating diverse, equitable skin lesion datasets, thereby improving machine learning model performance across varying skin tones and imaging conditions.

Wenbo Yang, Eman Rezk, Walaa M. Moursi + 1 more | 2026-02-27 | ⚡ eess

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

The paper introduces FUSAR-GPT, a specialized Visual Language Model for SAR imagery that overcomes existing limitations by leveraging an inaugural SAR Image-Text-AlphaEarth dataset, embedding multi-source spatiotemporal features via "spatiotemporal anchors," and employing a two-stage decoupled training strategy to achieve state-of-the-art performance in remote sensing interpretation.

Xiaokun Zhang, Yi Yang, Ziqi Ye + 6 more | 2026-02-27 | 🤖 cs.AI

DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

This paper introduces DICArt, a novel framework that advances category-level articulated object pose estimation by formulating the task as a conditional discrete diffusion process enhanced with a flexible flow decider and hierarchical kinematic coupling to overcome the limitations of existing continuous-space methods.

Li Zhang, Mingyu Mei, Ailing Wang + 7 more | 2026-02-27 | 🤖 cs.AI

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

TextPecker addresses the critical bottleneck of structural anomaly perception in Visual Text Rendering by introducing a plug-and-play RL strategy supported by a specialized recognition dataset and stroke-editing synthesis engine, which significantly enhances the structural fidelity and semantic alignment of text-to-image models.

Hanshen Zhu, Yuliang Liu, Xuecheng Wu + 7 more | 2026-02-27 | 💻 cs

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD is a data-efficient Vision-Language-Action model for autonomous driving that achieves competitive performance on Waymo and NAVSIM benchmarks using less than 60% of the training data and no reasoning annotations by addressing the difficulty bias in standard Group Relative Policy Optimization through the Dr. GRPO algorithm.

Ishaan Rawal, Shubh Gupta, Yihan Hu + 1 more | 2026-02-27 | 🤖 cs.AI

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

The paper proposes Durian, a difficulty-aware group normalization method that re-groups multimodal samples by perceptual complexity and reasoning uncertainty to stabilize reward normalization and enhance reasoning performance in multimodal large language models.

Jinghan Li, Junfeng Fang, Jinda Lu + 5 more | 2026-02-27 | 💻 cs

EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion

The paper proposes EndoDDC, a novel diffusion-based framework that integrates image features with sparse depth and gradient information to achieve robust and accurate dense depth reconstruction for endoscopic robotic navigation, effectively overcoming challenges like weak textures and light reflections.

Yinheng Lin, Yiming Huang, Beilei Cui + 4 more | 2026-02-27 | 💻 cs
