BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
BitVLA introduces a natively 1-bit Vision-Language-Action (VLA) model for robotic manipulation. By constraining weights to ternary values and applying a Quantize-then-Distill strategy to the vision backbone, it achieves performance comparable to full-precision baselines while significantly reducing memory footprint and latency.
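A "native ternary parameter design" means each weight is stored as one of {-1, 0, +1} plus a shared scale. As a rough illustration, here is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58; the function names and details are illustrative assumptions, not BitVLA's actual implementation:

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Uses absmean scaling (BitNet b1.58-style): divide by the mean
    absolute value, round to the nearest integer, clip to [-1, 1].
    Hypothetical sketch, not BitVLA's exact scheme.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate full-precision tensor from (q, scale)."""
    return q * scale

# Example: a small weight vector quantized to ternary values.
w = np.array([0.8, -0.05, -1.2, 0.3])
q, s = ternary_quantize(w)
print(q)  # ternary codes in {-1, 0, +1}
```

Storing only the ternary codes and one scale per tensor is what drives the memory savings: each weight needs about 1.58 bits (log2(3)) instead of 16 or 32.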