cs.CV papers | Gist.Science

VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

This paper proposes VQ-Style, a novel framework that leverages Residual Vector Quantized Variational Autoencoders combined with contrastive learning and an information leakage loss to effectively disentangle human motion into coarse content and fine style representations, enabling zero-shot style transfer and other applications through a simple Quantized Code Swapping technique.

Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann + 3 more2026-02-27🤖 cs.AI

OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

The paper introduces OneVision-Encoder, a multimodal architecture that aligns with video codec principles by focusing computation on sparse, high-entropy regions rather than uniform pixel grids, thereby achieving superior efficiency and accuracy across image, video, and document understanding benchmarks.

Feilong Tang, Xiang An, Yunyao Yan + 16 more2026-02-27💻 cs

HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

The paper proposes HLGFA, an unsupervised industrial anomaly detection framework that identifies defects by modeling cross-resolution feature consistency between high and low-resolution representations of normal samples, achieving state-of-the-art performance on the MVTec AD dataset without relying on pixel-level reconstruction.

Han Zhou, Yuxuan Gao, Yinchao Du + 1 more2026-02-27💻 cs

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

The paper introduces GigaBrain-0.5M*, a Vision-Language-Action model that leverages world model-based reinforcement learning via the RAMP framework to overcome limitations in scene understanding and future anticipation, achieving significant performance gains and reliable long-horizon execution on complex robotic manipulation tasks.

GigaBrain Team, Boyuan Wang, Bohan Li + 23 more2026-02-27💻 cs

PCReg-Net: Progressive Contrast-Guided Registration for Cross-Domain Image Alignment

PCReg-Net is a lightweight, progressive contrast-guided deep learning framework that achieves real-time, high-fidelity deformable image registration across heterogeneous domains by employing a coarse-to-fine strategy with multi-scale contrast analysis to overcome appearance variations and geometric misalignments.

Jiahao Qin2026-02-27🤖 cs.AI

Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

This paper presents a large-scale systematic benchmark of seven video foundation models on a novel dataset of 32,847 videos from 1,888 participants, revealing that model performance for remote Parkinson's disease screening is highly task-dependent and establishing a rigorous baseline with AUCs up to 85.3% while highlighting the need for task-aware calibration to improve sensitivity.

Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader + 11 more2026-02-27💻 cs

Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

This paper proposes the Deferred Visual Ingestion (DVI) framework, which replaces the lossy pre-embedding of visual content with a structure-based hierarchical indexing and deferred VLM analysis strategy, achieving significantly higher accuracy on visual-dense engineering document QA by overcoming the retrieval and detail-loss limitations of existing Pre-Ingestion methods.

Tao Xu2026-02-27💬 cs.CL

Depth from Defocus via Direct Optimization

This paper presents a feasible global optimization approach for depth from defocus that utilizes alternating minimization between convex optimization and parallel grid search to achieve high-resolution depth recovery on both synthetic and real datasets, outperforming current deep learning methods.

Holly Jackson, Caleb Adams, Ignacio Lopez-Francos + 1 more2026-02-27💻 cs

Compact Hadamard Latent Codes for Efficient Spectral Rendering

This paper introduces Hadamard spectral codes, a compact latent representation that enables efficient spectral rendering by approximating complex wavelength-dependent interactions through a small number of standard RGB rendering passes while preserving linear operations and approximating multiplicative spectral relationships.

Jiaqi Yu, Dar'ya Guarnera, Giuseppe Claudio Guarnera2026-02-27💻 cs

Automated Disentangling Analysis of Skin Colour for Lesion Images

This paper proposes a novel skin-colour disentangling framework that utilizes randomized decolourization and geometry-aligned post-processing to learn a structured latent space for generating diverse, equitable skin lesion datasets, thereby improving machine learning model performance across varying skin tones and imaging conditions.

Wenbo Yang, Eman Rezk, Walaa M. Moursi + 1 more2026-02-27⚡ eess

FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

The paper introduces FUSAR-GPT, a specialized Visual Language Model for SAR imagery that overcomes existing limitations by leveraging an inaugural SAR Image-Text-AlphaEarth dataset, embedding multi-source spatiotemporal features via "spatiotemporal anchors," and employing a two-stage decoupled training strategy to achieve state-of-the-art performance in remote sensing interpretation.

Xiaokun Zhang, Yi Yang, Ziqi Ye + 6 more2026-02-27🤖 cs.AI

DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

This paper introduces DICArt, a novel framework that advances category-level articulated object pose estimation by formulating the task as a conditional discrete diffusion process enhanced with a flexible flow decider and hierarchical kinematic coupling to overcome the limitations of existing continuous-space methods.

Li Zhang, Mingyu Mei, Ailing Wang + 7 more2026-02-27🤖 cs.AI

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

TextPecker addresses the critical bottleneck of structural anomaly perception in Visual Text Rendering by introducing a plug-and-play RL strategy supported by a specialized recognition dataset and stroke-editing synthesis engine, which significantly enhances the structural fidelity and semantic alignment of text-to-image models.

Hanshen Zhu, Yuliang Liu, Xuecheng Wu + 7 more2026-02-27💻 cs

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NORD is a data-efficient Vision-Language-Action model for autonomous driving that achieves competitive performance on Waymo and NAVSIM benchmarks using less than 60% of the training data and no reasoning annotations by addressing the difficulty bias in standard Group Relative Policy Optimization through the Dr. GRPO algorithm.

Ishaan Rawal, Shubh Gupta, Yihan Hu + 1 more2026-02-27🤖 cs.AI

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

The paper proposes Durian, a difficulty-aware group normalization method that re-groups multimodal samples by perceptual complexity and reasoning uncertainty to stabilize reward normalization and enhance reasoning performance in multimodal large language models.

Jinghan Li, Junfeng Fang, Jinda Lu + 5 more2026-02-27💻 cs

EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion

The paper proposes EndoDDC, a novel diffusion-based framework that integrates image features with sparse depth and gradient information to achieve robust and accurate dense depth reconstruction for endoscopic robotic navigation, effectively overcoming challenges like weak textures and light reflections.

Yinheng Lin, Yiming Huang, Beilei Cui + 4 more2026-02-27💻 cs

CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

CoLoGen is a unified diffusion framework that resolves the representational conflict between conceptual understanding and spatial localization in conditional image generation by employing a progressive learning curriculum and a novel Progressive Representation Weaving module to dynamically integrate specialized expert features.

YuXin Song, Yu Lu, Haoyuan Sun + 6 more2026-02-27💻 cs

Solaris: Building a Multiplayer Video World Model in Minecraft

The paper introduces Solaris, a multiplayer video world model for Minecraft that leverages a novel automated data collection system and a staged training pipeline to overcome the limitations of single-agent models by simulating consistent multi-view observations and complex multi-agent interactions.

Georgy Savva, Oscar Michel, Daohan Lu + 6 more2026-02-27💻 cs

Adaptive Prefiltering for High-Dimensional Similarity Search: A Frequency-Aware Approach

This paper proposes an adaptive prefiltering framework for high-dimensional similarity search that dynamically allocates computational budgets based on query frequency patterns and cluster coherence, achieving equivalent recall with 20.4% fewer distance computations than static methods while maintaining sub-millisecond latency.

Teodor-Ioan Calin2026-02-27💻 cs

CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction

CrossLLM-Mamba is a novel, scalable framework that leverages bidirectional Mamba encoders to model RNA interaction prediction as a dynamic state-space alignment problem, achieving state-of-the-art performance across RNA-protein, RNA-small molecule, and RNA-RNA tasks by capturing context-dependent molecular binding more effectively than static fusion methods.

Rabeya Tus Sadia, Qiang Ye, Qiang Cheng2026-02-27🧬 q-bio

← Previous Next →