SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

The paper introduces SciTune, a framework that aligns large language models with human-curated scientific multimodal instructions, resulting in a model (LLaMA-SciTune) that significantly outperforms state-of-the-art systems on scientific visual and language benchmarks, even surpassing human performance in certain categories.

Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge, Karl Pazdernik · 2026-04-14 · 💬 cs.CL

HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

This paper proposes HFI, a training-free and efficient framework that detects latent diffusion model-generated images and performs implicit watermarking by measuring aliasing artifacts in reconstructed images, thereby overcoming the limitations of existing methods that rely on reconstruction distance overfitted to background information.

Sungik Choi, Hankook Lee, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee · 2026-04-14 · 🤖 cs.LG
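To illustrate the general idea of scoring an image by how much high-frequency content a reconstruction round trip destroys, the sketch below uses a down/up-sampling round trip as a stand-in for an LDM autoencoder. Both the stand-in reconstruction and the spectral measure are illustrative assumptions, not HFI's actual aliasing metric:

```python
import numpy as np

def highfreq_energy(img, cutoff=0.25):
    """Fraction of spectral energy above a normalized frequency cutoff."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    energy = np.abs(F) ** 2
    return energy[radius > cutoff].sum() / energy.sum()

def toy_reconstruct(img, factor=2):
    """Stand-in for an LDM autoencoder round trip: block-average down, then
    repeat back up, which attenuates high frequencies."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def aliasing_score(img):
    """Higher when reconstruction loses more high-frequency content."""
    return highfreq_energy(img) - highfreq_energy(toy_reconstruct(img))

rng = np.random.default_rng(0)
noisy = rng.standard_normal((64, 64))   # rich in high frequencies
smooth = toy_reconstruct(noisy)         # already band-limited
print(aliasing_score(noisy) > aliasing_score(smooth))
```

The intuition being tested: a "real" high-frequency-rich image loses spectral content under reconstruction, while an already band-limited one does not.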

RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

The paper introduces RobustSpring, a comprehensive benchmark and dataset that evaluates the robustness of optical flow, scene flow, and stereo vision models against 20 types of image corruptions, addressing the gap in existing benchmarks that primarily focus on accuracy rather than resilience to real-world perturbations.

Victor Oei, Jenny Schmalfuss, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andrés Bruhn · 2026-04-14 · 🤖 cs.LG
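The general recipe of corruption benchmarks is to apply a named corruption at a severity level before evaluating a model. A minimal sketch of that harness, where the two corruption functions and the severity scaling are placeholders rather than RobustSpring's actual 20 corruptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder corruptions at severities 1-5, in the style of robustness
# benchmarks; RobustSpring's corruption types and scaling differ.
def gaussian_noise(img, severity):
    return np.clip(img + rng.normal(0.0, 0.05 * severity, img.shape), 0.0, 1.0)

def brightness(img, severity):
    return np.clip(img + 0.1 * severity, 0.0, 1.0)

CORRUPTIONS = {"gaussian_noise": gaussian_noise, "brightness": brightness}

def corrupt(img, name, severity):
    """Apply a named corruption at an integer severity in [1, 5]."""
    assert 1 <= severity <= 5, "severity must be 1..5"
    return CORRUPTIONS[name](img, severity)

img = rng.random((32, 32))  # toy grayscale frame in [0, 1]
for name in CORRUPTIONS:
    out = corrupt(img, name, severity=3)
    print(name, round(float(np.abs(out - img).mean()), 3))
```

A robustness benchmark then reports each model's accuracy drop per corruption and severity relative to the clean input.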

IMPLICITSTAINER: Resolution Agnostic Data-Efficient Virtual Staining Using Neural Implicit Functions

The paper introduces IMPLICITSTAINER, a deterministic, resolution-agnostic deep learning framework that utilizes neural implicit functions to efficiently generate high-fidelity, reproducible virtual immunohistochemical stains from H&E images, overcoming the limitations of existing patch-based and stochastic methods in clinical applications.

Tushar Kataria, Beatrice Knudsen, Shireen Y. Elhabian · 2026-04-14 · ⚡ eess
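A neural implicit function represents an image as a network queried at continuous coordinates, which is what makes the approach resolution-agnostic: the same function can be sampled on any output grid. A minimal untrained sketch (the tiny architecture and the three-channel "stain" output are illustrative assumptions; a real model would be trained on paired H&E/IHC patches):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny coordinate MLP: (x, y) in [0, 1]^2 -> 3 stain channels.
# Random weights only, for illustration.
W1, b1 = rng.standard_normal((2, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 3)), np.zeros(3)

def implicit_stain(coords):
    """Evaluate the implicit function at arbitrary continuous coordinates."""
    h = np.tanh(coords @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def render(height, width):
    """Resolution-agnostic rendering: sample the same function on any grid."""
    ys, xs = np.linspace(0, 1, height), np.linspace(0, 1, width)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    return implicit_stain(grid).reshape(height, width, 3)

low, high = render(16, 16), render(128, 128)
print(low.shape, high.shape)  # one function, two output resolutions
```

Because the function is continuous, no patch stitching is needed and outputs at different resolutions are mutually consistent.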

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

This paper introduces DeceptionDecoded, a large-scale benchmark and intent-guided simulation framework designed to evaluate and improve vision-language models' ability to detect misleading creator intent in multimodal news, addressing their current reliance on superficial cues and enhancing their robustness in misinformation governance.

Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi · 2026-04-14 · 💬 cs.CL

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

GoT-R1 is a novel framework that leverages reinforcement learning with a dual-stage multi-dimensional reward system to enhance the semantic-spatial reasoning capabilities of multimodal large language models, significantly improving their ability to generate images from complex prompts involving precise object relationships and attributes.

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu · 2026-04-14 · 💬 cs.CL

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit introduces a unified discrete diffusion transformer that leverages pretrained visual priors and a lightweight text decoder to achieve fast, parallel, and high-quality generation across both text and image modalities, outperforming larger autoregressive models in efficiency and quality.

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan · 2026-04-14 · 🤖 cs.LG
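Masked discrete diffusion models of this kind typically generate by iteratively unmasking the most confident token positions in parallel, rather than decoding left to right, which is where the speed advantage over autoregressive models comes from. A toy sketch of that loop, with a random stand-in for the transformer's per-position predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1

def predict_tokens(tokens):
    """Hypothetical stand-in for the diffusion transformer: propose a token
    and a confidence score for every position."""
    proposals = rng.integers(0, 100, size=tokens.shape)
    confidence = rng.random(tokens.shape)
    return proposals, confidence

def parallel_unmask(length=16, steps=4):
    """Iterative parallel decoding: each round commits the most confident
    fraction of the still-masked positions, not one token at a time."""
    tokens = np.full(length, MASK)
    for step in range(steps, 0, -1):
        proposals, conf = predict_tokens(tokens)
        masked = tokens == MASK
        keep = max(1, masked.sum() // step)          # ~1/steps per round
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:keep]] = proposals[order[:keep]]
    return tokens

out = parallel_unmask()
print((out != MASK).all())
```

After `steps` rounds every position is filled; a real model would recompute predictions conditioned on the tokens committed so far.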

VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites

This paper introduces VisText-Mosquito, a unified multimodal dataset and framework that integrates visual detection, segmentation, and textual explanation to enable AI-driven proactive identification and analysis of mosquito breeding sites for disease prevention.

Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda · 2026-04-14 · 💬 cs.CL

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

PRIX is a lightweight, camera-only, end-to-end autonomous driving framework that uses a Context-aware Recalibration Transformer and a generative planning head to predict safe trajectories directly from raw pixels. It achieves state-of-the-art performance on the NavSim and nuScenes benchmarks while substantially reducing model size and inference cost compared to LiDAR-dependent or BEV-based approaches.

Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt · 2026-04-14 · 🤖 cs.LG

DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation

This paper introduces DoSReMC, a domain-shift-resilient framework that improves cross-domain generalization in mammography classification by fine-tuning only the batch normalization and fully connected layers alongside adversarial training. This addresses the performance degradation caused by variation in data distributions without requiring full model retraining.

Uğurcan Akyüz, Deniz Katircioglu-Öztürk, Emre K. Süslü, Burhan Keles, Mete C. Kaya, Gamze Durhan, Meltem G. Akpınar, Figen B. Demirkazık, Gözde B. Akar · 2026-04-14 · ⚡ eess
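The core selection step (updating only batch-normalization and fully connected parameters while freezing everything else) can be sketched generically. The layer names below are hypothetical, and the adversarial-training component is not shown:

```python
# Partition a model's named parameters so that only batch-normalization and
# fully connected (classifier) parameters receive gradient updates, as in
# BN/FC-only fine-tuning. Names are hypothetical examples.
def split_trainable(named_params, trainable_keys=("bn", "fc")):
    trainable, frozen = {}, {}
    for name, param in named_params.items():
        bucket = trainable if any(k in name for k in trainable_keys) else frozen
        bucket[name] = param
    return trainable, frozen

params = {
    "conv1.weight": "...", "bn1.weight": "...", "bn1.bias": "...",
    "layer1.conv.weight": "...", "layer1.bn.weight": "...",
    "fc.weight": "...", "fc.bias": "...",
}
trainable, frozen = split_trainable(params)
print(sorted(trainable))  # only the bn* and fc* entries
```

In a deep-learning framework the `trainable` group would be handed to the optimizer while the `frozen` group has gradients disabled, so only normalization statistics/affine terms and the classifier head adapt to the new domain.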

Delta Rectified Flow Sampling for Text-to-Image Editing

The paper proposes Delta Rectified Flow Sampling (DRFS), an inversion-free framework for text-to-image editing that mitigates over-smoothing artifacts by explicitly modeling velocity-field discrepancies and introducing a time-dependent shift term. The approach unifies optimization-based and ODE-based editing while achieving superior quality and controllability on the PIE Benchmark.

Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang · 2026-04-14 · 🤖 cs.LG
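Rectified-flow samplers integrate an ODE dx/dt = v(x, t) from noise toward an image, and a time-dependent shift term amounts to adding a correction s(t) to the velocity at each step. A toy Euler-integration sketch; the linear velocity field and the `shift` callable are illustrative assumptions, not the paper's actual editing formulation:

```python
import numpy as np

def sample_rectified_flow(x0, velocity, shift=None, steps=50):
    """Euler integration of dx/dt = v(x, t) [+ s(t)] from t=0 to t=1.
    `shift` is a hypothetical time-dependent term standing in for a
    correction between two velocity fields."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        if shift is not None:
            v = v + shift(t)
        x = x + dt * v
    return x

# Toy linear velocity field transporting the state toward a fixed target.
target = np.array([1.0, -2.0])
velocity = lambda x, t: target - x
x0 = np.zeros(2)
out = sample_rectified_flow(x0, velocity)
print(np.round(out, 3))
```

With `shift=None` this is plain Euler sampling; an editing method would choose `shift(t)` so the trajectory lands on the edited target without a separate inversion pass.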

CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

CARINOX is a unified framework that enhances the compositional alignment of text-to-image diffusion models by synergizing initial noise optimization and exploration with a principled, human-judgment-correlated reward selection strategy, achieving significant performance gains over state-of-the-art methods without requiring model fine-tuning.

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban · 2026-04-14 · 💬 cs.CL
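Initial-noise optimization and exploration can be sketched as best-of-N selection over candidate noises followed by local search under a reward. The reward below is a toy stand-in: in the paper it would be a category-aware, human-judgment-correlated score of the image generated from the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(noise):
    """Toy stand-in for an alignment score of the image generated from
    `noise`; here just a smooth function with a known optimum."""
    return -float(np.sum((noise - 0.5) ** 2))

def optimize_initial_noise(dim=8, candidates=32, steps=20, sigma=0.1):
    # Exploration: draw a pool of candidate initial noises, keep the best.
    pool = rng.standard_normal((candidates, dim))
    best = pool[np.argmax([reward(z) for z in pool])]
    start_score = reward(best)
    # Optimization: greedy local search around the selected noise.
    for _ in range(steps):
        cand = best + sigma * rng.standard_normal(dim)
        if reward(cand) > reward(best):
            best = cand
    return best, start_score

z_star, start_score = optimize_initial_noise()
print(reward(z_star) >= start_score)  # True by construction
```

This is inference-time scaling: the generator itself is never fine-tuned; only the starting noise is searched over.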

Inferring Dynamic Physical Properties from Video Foundation Models

This paper introduces new synthetic and real-world video datasets for predicting dynamic physical properties such as elasticity, viscosity, and friction, and evaluates several inference strategies: classical computer-vision baselines, prompt-based adaptation of video foundation models, and multimodal large language models. Pre-trained generative and self-supervised video models perform comparably to one another, approach an oracle baseline, and currently outperform the MLLMs.

Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman · 2026-04-14 · 🤖 cs.LG