cs.CV papers | Gist.Science

Local-Global Prompt Learning via Sparse Optimal Transport

The paper proposes SOT-GLP, a novel few-shot adaptation method for vision-language models that employs shared sparse optimal transport to partition visual regions among class-specific local prompts while maintaining global alignment, thereby achieving state-of-the-art performance in both classification accuracy and out-of-distribution detection by preserving the native feature geometry.

Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel | 2026-03-10 | 💻 cs

$\Delta$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation

This paper introduces $\Delta$VLA, a prior-guided framework that enhances robotic manipulation by modeling discrete world-knowledge variations relative to an explicit current state prior, rather than predicting absolute future states, thereby achieving state-of-the-art performance and efficiency through its novel components: the Prior-Guided World Knowledge Extractor, Latent World Variation Quantization, and Conditional Variation Attention.

Yijie Zhu, Jie He, Rui Shao, Kaishen Yuan, Tao Tan, Xiaochen Yuan, Zitong Yu | 2026-03-10 | 💻 cs

Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation

This paper introduces UniDiffDA, a unified analytical framework that decomposes diffusion-based data augmentation into three core components to enable a systematic, fair benchmarking of diverse methods across low-data classification tasks, ultimately offering practical design insights and reproducible code.

Zekun Li, Yinghuan Shi, Yang Gao, Dong Xu | 2026-03-10 | 💻 cs

This Looks Distinctly Like That: Grounding Interpretable Recognition in Stiefel Geometry against Neural Collapse

This paper introduces Adaptive Manifold Prototypes (AMP), a framework that leverages Stiefel manifold optimization to represent class prototypes as orthonormal bases, thereby preventing prototype collapse caused by Neural Collapse while achieving state-of-the-art accuracy and improved causal faithfulness in fine-grained recognition.

Junhao Jia, Jiaqi Wang, Yunyou Liu, Haodong Jing, Yueyi Wu, Xian Wu, Yefeng Zheng | 2026-03-10 | 💻 cs

Rectified flow-based prediction of post-treatment brain MRI from pre-radiotherapy priors for patients with glioma

This study presents a rectified flow-based AI model that generates realistic post-treatment brain MRIs from pre-radiotherapy priors and dose maps for glioma patients, achieving high structural fidelity and significantly faster inference than diffusion models to support adaptive treatment planning.

Selena Huisman, Nordin Belkacemi, Vera Keil, Joost Verhoeff, Szabolcs David | 2026-03-10 | 💻 cs

Real-Time Drone Detection in Event Cameras via Per-Pixel Frequency Analysis

This paper proposes DDHF, a novel real-time drone detection framework for event cameras that uses the Non-uniform Discrete Fourier Transform (NDFT) to analyze per-pixel temporal frequency signatures, achieving superior accuracy and significantly lower latency than traditional deep learning methods such as YOLO.

Michael Bezick, Majid Sahin | 2026-03-10 | 💻 cs

AULLM++: Structural Reasoning with Large Language Models for Micro-Expression Recognition

AULLM++ is a structural reasoning framework that leverages Large Language Models to enhance micro-expression Action Unit detection by fusing multi-granularity visual features with learned AU correlations through a three-stage process of evidence construction, structure modeling, and deduction-based prediction, achieving state-of-the-art performance and superior cross-domain generalization.

Zhishu Liu, Kaishen Yuan, Bo Zhao, Hui Ma, Zitong Yu | 2026-03-10 | 💻 cs

StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation

The paper proposes StructBiHOI, a hierarchical framework that combines a jointVAE for long-term planning, a maniVAE for frame-level refinement, and a Mamba-based diffusion denoiser to achieve stable, physically plausible, and semantically aligned long-horizon bimanual hand-object interaction generation.

Zhi Wang, Liu Liu, Ruonan Liu, Dan Guo, Meng Wang | 2026-03-10 | 💻 cs

SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

SPIRAL is a closed-loop framework that enhances controllable long-horizon video generation by integrating a reflective planning process with iterative action world modeling, enabling self-improvement through explicit planning, object-centric decomposition, and feedback-driven refinement.

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee | 2026-03-10 | 💻 cs

Grow, Assess, Compress: Adaptive Backbone Scaling for Memory-Efficient Class Incremental Learning

This paper introduces GRACE, a novel dynamic scaling framework for Class Incremental Learning that adaptively balances model capacity through a cyclic "Grow, Assess, Compress" strategy to achieve state-of-the-art performance while significantly reducing memory overhead compared to purely expansion-based methods.

Adrian Garcia-Castañeda, Jon Irureta, Jon Imaz, Aizea Lojo | 2026-03-10 | 🤖 cs.LG

Information Maximization for Long-Tailed Semi-Supervised Domain Generalization

This paper proposes IMaX, a simple yet effective objective based on the InfoMax principle that maximizes mutual information between learned features and latent labels while mitigating class-balance bias through an $\alpha$-entropic term, thereby significantly improving the performance of state-of-the-art semi-supervised domain generalization methods in long-tailed distribution scenarios.

Leo Fillioux, Omprakash Chakraborty, Quentin Gopée, Pierre Marza, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Ismail Ben Ayed, Jose Dolz | 2026-03-10 | 💻 cs

Can Vision-Language Models Solve the Shell Game?

This paper introduces VET-Bench, a diagnostic benchmark revealing that current Vision-Language Models fail at tracking visually identical objects due to an over-reliance on static features, and proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT) to achieve over 90% accuracy by explicitly generating object trajectories as intermediate reasoning steps.

Tiedong Liu, Wee Sun Lee | 2026-03-10 | 💬 cs.CL

Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation

The paper proposes Alfa, an attentive low-rank filter adaptation method that reweights pre-trained semantic features via singular value decomposition and attention mechanisms to achieve sample-efficient test-time personalization for cross-domain gaze estimation, outperforming existing methods while demonstrating applicability beyond computer vision.

He-Yen Hsieh, Wei-Te Mark Ting, H. T. Kung | 2026-03-10 | 💻 cs

X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

This paper proposes X-AVDT, a robust deepfake detector that leverages internal audio-visual cross-attention cues accessed via DDIM inversion to achieve superior generalization across diverse and evolving synthesis paradigms, supported by the introduction of the new MMDF dataset.

Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh | 2026-03-10 | 🤖 cs.LG

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

This paper proposes Visual Self-Fulfilling Alignment (VSFA), a label-free fine-tuning method that shapes safety-oriented personas in multimodal large language models by exposing them to threat-related images during neutral VQA tasks, thereby reducing attack success rates and mitigating over-refusal without compromising general capabilities.

Qishun Yang, Shu Yang, Lijie Hu, Di Wang | 2026-03-10 | 💻 cs

Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction

Spherical-GOF is a novel geometry-aware panoramic rendering framework that extends Gaussian Opacity Fields to spherical ray space, achieving superior geometric consistency and photometric quality in 3D scene reconstruction by introducing efficient spherical culling and adaptive filtering to overcome the limitations of existing perspective-based adaptations.

Zhe Yang, Guoqiang Zhao, Sheng Wu, Kai Luo, Kailun Yang | 2026-03-10 | 💻 cs

OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras

This paper introduces OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras featuring long, diverse sequences and principled voxel visibility annotations, alongside the proposed Focus on Sphere Occ (FoSOcc) framework that effectively addresses fisheye distortion and localization challenges to establish a strong baseline for future research.

Yongzhi Lin, Kai Luo, Yuanfan Zheng, Hao Shi, Mengfei Duan, Yang Liu, Kailun Yang | 2026-03-10 | 💻 cs

Interactive World Simulator for Robot Policy Training and Evaluation

This paper presents the Interactive World Simulator, a fast and physically consistent framework leveraging consistency models to generate high-fidelity long-horizon video predictions that enable scalable robot policy training and reliable real-world evaluation using solely simulated data.

Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, Yunzhu Li | 2026-03-10 | 🤖 cs.LG

DualFlexKAN: Dual-stage Kolmogorov-Arnold Networks with Independent Function Control

The paper introduces DualFlexKAN, a flexible dual-stage Kolmogorov-Arnold Network architecture that decouples input transformations and output activations to support diverse basis functions and regularization, achieving superior accuracy and convergence with significantly fewer parameters than standard KANs while mitigating their scalability limitations.

Andrés Ortiz, Nicolás J. Gallego-Molina, Carmen Jiménez-Mesa, Juan M. Górriz, Javier Ramírez | 2026-03-10 | 🤖 cs.LG

PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

PRISM introduces a streaming human motion generation framework that employs a joint-factorized latent space and noise-free condition injection within a single foundation model to overcome representation entanglement and error accumulation, thereby unifying text-to-motion, pose-conditioned, and long-horizon sequential synthesis with state-of-the-art performance.

Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou | 2026-03-10 | 💻 cs
