cs.CV papers | Gist.Science

Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing

This paper addresses the generalization limitations of traditional Face Anti-Spoofing by introducing FaceCoT, the first large-scale Visual Question Answering dataset enriched with Chain-of-Thought reasoning and generated via reinforcement learning, and a CEPL training strategy; together, these enable Multimodal Large Language Models to achieve superior robustness and interpretability across diverse spoofing attacks.

Honglu Zhang, Zhiqin Fang, Ningning Zhao + 4 more | 2026-03-03 | 💻 cs

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

This paper introduces OmniSpatial, a comprehensive benchmark grounded in cognitive psychology with over 8.4K annotated samples across four major categories, which reveals significant limitations in current vision-language models' spatial reasoning capabilities and explores strategies like PointGraph and SpatialCoT to address them.

Mengdi Jia, Zekun Qi, Shaochen Zhang + 5 more | 2026-03-03 | 💬 cs.CL

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

The paper proposes UniCUE, a unified framework that directly generates speech from Chinese Cued Speech videos by integrating recognition and generation tasks to overcome the limitations of text-intermediate pipelines, supported by the newly constructed large-scale UniCUE-HI dataset.

Jinting Wang, Shan Yang, Chenxing Li + 2 more | 2026-03-03 | ⚡ eess

Improving Wildlife Out-of-Distribution Detection: Africas Big Five

This study addresses the challenge of overconfident predictions in closed-world animal classification by demonstrating that feature-based out-of-distribution detection methods, particularly Nearest Class Mean with ImageNet pre-trained features, significantly outperform existing techniques in identifying unknown wildlife species within the context of Africa's Big Five.

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson + 1 more | 2026-03-03 | 🤖 cs.AI

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

The paper proposes Meta-Adaptive Prompt Distillation, a meta-learning framework that distills task-relevant visual features into adaptable soft prompts via an attention-mapper module, significantly outperforming standard in-context learning and parameter-efficient fine-tuning for few-shot Visual Question Answering in Large Multimodal Models.

Akash Gupta, Amos Storkey, Mirella Lapata | 2026-03-03 | 💬 cs.CL

BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

BitVLA introduces a fully native 1-bit Vision-Language-Action model for robotic manipulation that achieves performance comparable to full-precision baselines while significantly reducing memory footprint and latency through native ternary parameter design and a novel Quantize-then-Distill strategy for the vision backbone.

Hongyu Wang, Chuyan Xiong, Ruiping Wang + 1 more | 2026-03-03 | 💻 cs

PD$^{2}$GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting

The paper introduces PD$^{2}$GS, a self-supervised framework that leverages Gaussian Splatting to achieve accurate part-level decoupling and continuous deformation modeling of articulated objects by learning a shared canonical field, while also releasing the RS-Art dataset for real-world evaluation.

Haowen Wang, Xiaoping Yuan, Zhao Jin + 6 more | 2026-03-03 | 💻 cs

VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

VITA introduces a zero-shot value function learning method that enhances the generalization and temporal reasoning of frozen Vision-Language Models through test-time adaptation and dissimilarity-based sampling, enabling robust performance in diverse robotic tasks and improving offline reinforcement learning policies.

Christos Ziakas, Alessandra Russo | 2026-03-03 | 🤖 cs.AI

VINCIE: Unlocking In-context Image Editing from Video

The paper introduces VINCIE, a block-causal diffusion transformer trained exclusively on video-derived multimodal sequences through three proxy tasks, which achieves state-of-the-art performance in in-context image editing and demonstrates strong capabilities in multi-concept composition and story generation.

Leigang Qu, Feng Cheng, Ziyan Yang + 7 more | 2026-03-03 | 💬 cs.CL

NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis

This paper introduces NIC-RobustBench, an open-source framework designed to evaluate the adversarial robustness of neural image compression models and their impact on downstream tasks, addressing a gap left by current benchmarks, which focus primarily on rate-distortion performance.

Georgii Bychkov, Khaled Abud, Egor Kovalev + 4 more | 2026-03-03 | ⚡ eess

Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning

This paper proposes Consistency-driven Calibration and Matching (ConCM), a novel framework for Few-Shot Class-Incremental Learning that mitigates knowledge conflicts by integrating memory-aware prototype calibration and dynamic structure matching to achieve state-of-the-art performance on large-scale benchmarks.

Qinzhe Wang, Zixuan Chen, Keke Huang + 3 more | 2026-03-03 | 🤖 cs.LG

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

This paper introduces VisionDrop, a training-free, visual-only token pruning framework that addresses cross-modal misalignment by selecting informative visual tokens through intra-modal attention and progressive merging, achieving significant inference efficiency gains while maintaining high performance in Large Vision-Language Models.

Rui Xu, Yunke Wang, Yong Luo + 1 more | 2026-03-03 | 💻 cs

EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

EchoMimicV3 is an efficient, unified framework that leverages a 1.3B parameter model and novel multi-task and multi-modal strategies to overcome the computational and speed limitations of existing large-scale human animation methods while delivering competitive performance across diverse tasks.

Rang Meng, Yan Wang, Weipeng Wu + 3 more | 2026-03-03 | 💻 cs

CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

This paper introduces CLiFT, a neural rendering framework that compresses multi-view scene information into adaptive tokens, enabling compute-efficient novel view synthesis with controllable trade-offs between data size, rendering quality, and speed.

Zhengqing Wang, Yuefan Wu, Jiacheng Chen + 2 more | 2026-03-03 | 💻 cs

Advancing Complex Video Object Segmentation via Progressive Concept Construction

The paper introduces Segment Concept (SeC), a novel video object segmentation framework that leverages Large Vision-Language Models to progressively construct high-level object-centric representations, achieving state-of-the-art performance on a new Semantic Complex Scenarios benchmark (SeCVOS) by significantly outperforming existing methods like SAM 2.

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong + 7 more | 2026-03-03 | 🤖 cs.AI

Digital and Robotic Twinning for Validation of Proximity Operations and Formation Flying

This paper presents a unified, closed-loop digital and robotic twinning framework that integrates faster-than-real-time simulations with Stanford's robotic testbeds to validate and verify the performance of spacecraft guidance, navigation, and control systems for rendezvous and formation flying missions.

Z. Ahmed, E. Bates, P. Francesch Huc + 5 more | 2026-03-03 | 💻 cs

MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

MonoFusion presents a method for reconstructing dynamic scenes from sparse-view videos by aligning independent monocular reconstructions to achieve high-quality, time- and view-consistent results without the need for expensive dense multi-camera setups.

Zihan Wang, Jeff Tan, Tarasha Khurana + 2 more | 2026-03-03 | 💻 cs

HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis

This paper proposes HGTS-Former, a novel hierarchical hypergraph Transformer that effectively models complex multivariate time series couplings through patch embedding and hypergraph-based aggregation, achieving state-of-the-art performance on various tasks including a new large-scale dataset for nuclear fusion Edge-Localized Mode recognition.

Hao Si, Xiao Wang, Fan Zhang + 5 more | 2026-03-03 | 🤖 cs.AI

Fast Magnetic Resonance Simulation Using Combined Update with Grouped Isochromats

This paper proposes a fast magnetic resonance simulation method that groups isochromats with identical properties to enable shared computations, achieving a 3- to 72-fold reduction in processing time compared to conventional individual isochromat simulations.

Hidenori Takeshima | 2026-03-03 | ⚡ eess

Learning Robust Intervention Representations with Delta Embeddings

This paper proposes a method for improving out-of-distribution robustness by learning invariant and sparse "Causal Delta Embeddings" to represent interventions in the latent space, enabling the unsupervised learning of causal representations from image pairs that significantly outperform baselines in both synthetic and real-world benchmarks.

Panagiotis Alimisis, Christos Diou | 2026-03-03 | 🤖 cs.AI
