MOSIV: Multi-Object System Identification from Videos

The paper introduces MOSIV, a novel framework that leverages differentiable simulation and geometry-aligned objectives to identify continuous, per-object material parameters from videos of complex multi-object interactions, outperforming existing methods on a new synthetic benchmark.

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao · 2026-03-09 · cs
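MOSIV's own pipeline is not shown here, but the core idea of system identification through a differentiable simulation can be illustrated generically: roll out a simulator, compare to an observed trajectory, and update a material parameter by gradient descent. The sketch below is an assumption-laden toy (a damped spring with stiffness `k`, gradients via finite differences standing in for a simulator's analytic gradients), not the paper's method.

```python
import numpy as np

def simulate(k, steps=200, dt=0.01, x0=1.0, v0=0.0, c=0.1):
    """Explicit-Euler rollout of a damped spring: x'' = -k*x - c*v."""
    xs = np.empty(steps)
    x, v = x0, v0
    for t in range(steps):
        a = -k * x - c * v
        v += a * dt
        x += v * dt
        xs[t] = x
    return xs

def identify_stiffness(observed, k_init=3.0, lr=2.0, iters=500, eps=1e-4):
    """Fit k by gradient descent on the trajectory MSE; the gradient comes
    from central finite differences, mimicking a differentiable simulator."""
    k = k_init
    for _ in range(iters):
        loss_plus = np.mean((simulate(k + eps) - observed) ** 2)
        loss_minus = np.mean((simulate(k - eps) - observed) ** 2)
        grad = (loss_plus - loss_minus) / (2 * eps)
        k -= lr * grad
    return k

true_k = 4.0
observed = simulate(true_k)   # stands in for a trajectory extracted from video
k_est = identify_stiffness(observed)
```

In a video setting, `observed` would come from tracked object states rather than a ground-truth rollout, and the loss would be a geometry-aligned objective instead of raw MSE.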

StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

StruVis is a generator-agnostic framework that enhances reasoning-based text-to-image generation by letting MLLMs use text-based structured visual representations as intermediate reasoning states, overcoming the limitations of existing text-only and text-image-interleaved approaches and achieving significant gains on benchmarks.

Yuanhuiyi Lyu, Kaiyu Lei, Ziqiao Weng, Xu Zheng, Lutao Jiang, Teng Li, Yangfu Li, Ziyuan Huang, Linfeng Zhang, Xuming Hu · 2026-03-09 · cs

Ensemble Learning with Sparse Hypercolumns

This paper addresses the computational challenges of using dense hypercolumns for image segmentation by introducing stratified subsampling and ensemble learning, demonstrating that these methods significantly outperform standard UNet baselines, particularly in low-shot scenarios where a simple Logistic Regression classifier achieves the best results.

Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O'Connor · 2026-03-09 · cs
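The stratified subsampling step can be sketched independently of the paper's code: instead of training a per-pixel classifier on every dense hypercolumn, sample a fixed budget of pixels from each class so rare foreground classes are not swamped. The function below is a minimal numpy sketch under that assumption; names are illustrative, not from the paper.

```python
import numpy as np

def stratified_pixel_sample(labels, n_per_class, rng=None):
    """Sample up to n_per_class pixel indices from each class in a label map,
    balancing the training set for a lightweight per-pixel classifier."""
    rng = np.random.default_rng(rng)
    flat = labels.ravel()
    picks = []
    for c in np.unique(flat):
        idx = np.flatnonzero(flat == c)
        take = min(n_per_class, idx.size)
        picks.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(picks)

labels = np.zeros((32, 32), dtype=int)
labels[12:20, 12:20] = 1                 # small foreground region
sel = stratified_pixel_sample(labels, n_per_class=50, rng=0)
```

`hypercolumns[:, sel]` would then feed a simple classifier such as logistic regression; the summary's low-shot result suggests such shallow models benefit most from this balanced sampling.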

FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

This paper introduces FontUse, a data-centric approach that leverages a large-scale, automatically annotated dataset of 70K images with style- and use-case-conditioned prompts to fine-tune text-to-image models, significantly improving their ability to generate typography that accurately reflects requested visual attributes without requiring architectural changes.

Xia Xin, Yuki Endo, Yoshihiro Kanamori · 2026-03-09 · cs

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

This paper proposes GvU, a self-supervised reinforcement learning framework that leverages a unified multimodal model's own understanding branch as an intrinsic reward signal to iteratively improve its text-to-image generation quality, thereby narrowing the capability gap between visual understanding and generation.

Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang · 2026-03-09 · cs

GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

GenHOI is a lightweight augmentation for pretrained video generation models that enhances object-consistent hand-object interaction in in-the-wild scenarios by employing Head-Sliding RoPE for temporally balanced reference injection and a two-level spatial attention gate for selective focus on interaction regions.

Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang · 2026-03-09 · cs

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

The paper introduces Curious-VLA, a two-stage framework that overcomes the exploration limitations of standard driving VLA models by employing Feasible Trajectory Expansion during imitation learning and Adaptive Diversity-Aware Sampling with a Spanning Driving Reward during reinforcement learning, achieving state-of-the-art performance on the Navsim benchmark.

Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, Bingchuan Sun, Yan Wang, Baochang Zhang · 2026-03-09 · cs

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

This paper investigates failure modes in lightweight Vision-Language Models for automated driving by analyzing intermediate activations to reveal that while some visual concepts like object presence are linearly encoded, others like orientation are not, leading to either perceptual or cognitive failures that are further exacerbated by object distance.

Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy · 2026-03-09 · cs.AI
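The probing methodology behind this analysis is standard: fit a linear classifier on intermediate activations and use held-out accuracy as evidence that a concept is (or is not) linearly encoded. The sketch below is a generic version with a ridge-regularised least-squares probe on synthetic activations; the data setup and names are illustrative assumptions, not the paper's.

```python
import numpy as np

def linear_probe_accuracy(acts, labels, rng=None):
    """Fit a ridge-regularised linear probe on half the data and report
    held-out accuracy; high accuracy suggests the concept is linearly encoded."""
    rng = np.random.default_rng(rng)
    n = len(labels)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2 :]
    X = np.hstack([acts, np.ones((n, 1))])      # append a bias column
    y = 2.0 * labels - 1.0                      # map {0,1} -> {-1,+1}
    A = X[tr].T @ X[tr] + 1e-3 * np.eye(X.shape[1])
    w = np.linalg.solve(A, X[tr].T @ y[tr])
    pred = (X[te] @ w > 0).astype(int)
    return (pred == labels[te]).mean()

# Synthetic activations where a binary concept (e.g. object presence) is
# linearly encoded along one direction:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 400)
direction = rng.normal(size=64)
acts = rng.normal(size=(400, 64)) + np.outer(2.0 * labels - 1.0, direction)
acc = linear_probe_accuracy(acts, labels, rng=1)
```

A concept like orientation failing such a probe (near-chance accuracy) is what the summary describes as not being linearly encoded, pointing to a perceptual rather than cognitive failure.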

Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting

This paper presents a novel pipeline that transforms archived omnidirectional RGB-LiDAR logs into robust 3D Gaussian Splatting initialization assets by addressing sensor distortion and data density challenges through ERP-to-cubemap conversion, color-stratified downsampling, and multi-modal registration, thereby enabling the creation of high-fidelity digital twins from standard, underutilized sensor data.

Semin Bae, Hansol Lim, Jongseong Brad Choi · 2026-03-09 · cs
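The ERP-to-cubemap step rests on a standard spherical mapping: every cubemap-face ray direction is converted to longitude/latitude and looked up in the equirectangular image. Below is a minimal numpy sketch of that lookup under an assumed axis convention (+z forward, +x right, +y up); the pipeline's actual convention may differ.

```python
import numpy as np

def dir_to_erp_pixel(d, width, height):
    """Map a 3D ray direction to fractional (col, row) coordinates in an
    equirectangular (ERP) image, with longitude 0 at the image centre."""
    x, y, z = d / np.linalg.norm(d)
    lon = np.arctan2(x, z)                   # [-pi, pi], 0 = straight ahead
    lat = np.arcsin(np.clip(y, -1.0, 1.0))   # [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * width
    row = (0.5 - lat / np.pi) * height
    return col, row

# The centre of the "front" cubemap face looks straight down +z,
# which should land at the centre of the ERP panorama:
col, row = dir_to_erp_pixel(np.array([0.0, 0.0, 1.0]), 2048, 1024)
```

Iterating this over a grid of per-face directions (with interpolation at the fractional coordinates) yields distortion-reduced cube faces suitable for downstream feature extraction and registration.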

FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

FedARKS is a novel federated learning framework for person re-identification that overcomes the limitations of global feature reliance and uniform averaging by introducing Robust Knowledge and Knowledge Selection mechanisms to capture subtle domain-invariant details and prioritize high-quality client contributions for improved domain generalization.

Xin Xu, Binchang Ma, Zhixi Yu, Wei Liu · 2026-03-09 · cs
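The "prioritize high-quality client contributions" idea contrasts with uniform FedAvg, and the aggregation itself can be sketched generically: weight each client's parameters by a softmax over per-client quality scores. This is a toy illustration of score-weighted aggregation, not FedARKS's actual knowledge-selection mechanism, and all names are assumptions.

```python
import numpy as np

def weighted_aggregate(client_params, scores):
    """Aggregate client parameter vectors with softmax weights over quality
    scores, so higher-quality clients contribute more than under uniform
    averaging."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp(scores - scores.max())        # shift for numerical stability
    w /= w.sum()
    stacked = np.stack(client_params)        # (num_clients, num_params)
    return (w[:, None] * stacked).sum(axis=0), w

params = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
agg, w = weighted_aggregate(params, scores=[2.0, 0.0])
```

With uniform scores this reduces exactly to FedAvg; the interesting design question, which the paper's Knowledge Selection mechanism addresses, is how the quality scores themselves are estimated without labels on the server.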