GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

The paper introduces GOT-JEPA, a model-predictive pretraining framework that enhances generic object tracking generalization by predicting tracking models from corrupted frames, and further proposes OccuSolver to refine occlusion handling through iterative visibility estimation and point-centric tracking.

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin · Thu, 12 Ma · cs.AI

PRoADS: Provably Secure and Robust Audio Diffusion Steganography with Latent Optimization and Backward Euler Inversion

The paper introduces PRoADS, a provably secure and robust audio steganography framework that embeds secret messages into diffusion model noise via orthogonal projection and employs Latent Optimization with Backward Euler Inversion to minimize reconstruction errors, achieving a remarkably low bit error rate of 0.15% under 64 kbps MP3 compression.

YongPeng Yan, Yanan Li, Qiyang Xiao, Yanzhen Ren · Thu, 12 Ma · cs
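The paper's actual embedding pipeline is not reproduced here, but the "backward Euler inversion" it names is, in generic diffusion terms, an implicit inversion of a deterministic sampling step, typically solved by fixed-point iteration. Below is a minimal self-contained sketch of that idea with a toy stand-in denoiser (`eps_theta`) in place of a real network; all names and the alpha schedule are illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy "denoiser": a real diffusion model would use a neural network
# predicting noise eps_theta(x, t). A fixed linear map keeps the sketch
# self-contained (illustrative assumption, not the paper's model).
def eps_theta(x, t):
    return 0.1 * x  # stand-in noise prediction

def ddim_step(x_t, t, alphas):
    """One deterministic DDIM-style update from step t to t-1."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    eps = eps_theta(x_t, t)
    x0 = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)   # predicted clean signal
    return np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps

def backward_euler_invert(x_prev, t, alphas, iters=50):
    """Recover x_t such that ddim_step(x_t, t) == x_prev by fixed-point
    iteration on the implicit (backward-Euler-style) equation: the noise
    term is evaluated at the unknown x_t rather than at x_prev."""
    x_t = x_prev.copy()  # initial guess
    for _ in range(iters):
        a_t, a_prev = alphas[t], alphas[t - 1]
        eps = eps_theta(x_t, t)  # evaluated at the current estimate of x_t
        x0 = (x_prev - np.sqrt(1 - a_prev) * eps) / np.sqrt(a_prev)
        x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps
    return x_t

alphas = np.linspace(0.99, 0.5, 10)                 # toy schedule
x_t_true = np.random.default_rng(0).normal(size=4)
x_prev = ddim_step(x_t_true, 5, alphas)
x_t_rec = backward_euler_invert(x_prev, 5, alphas)
err = np.abs(ddim_step(x_t_rec, 5, alphas) - x_prev).max()
```

The implicit formulation is what drives the low reconstruction error the abstract highlights: at the fixed point, re-running the forward step reproduces `x_prev` exactly, so no approximation error accumulates across inversion steps.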

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

The paper proposes AMB-DSGDN, a novel network for multimodal emotion recognition that utilizes modality-specific semantic graphs with a differential attention mechanism to filter noise and an adaptive balancing strategy to prevent dominant modalities from suppressing complementary cues, thereby enhancing the accuracy of dynamic emotional state modeling.

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li · Thu, 12 Ma · cs.AI

Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision

This paper introduces a new framework of psychophysical tests based on low-level vision principles—specifically contrast sensitivity, masking, and matching—to evaluate and reveal the perceptual strengths and weaknesses of 34 existing image and video quality metrics, demonstrating that standard evaluation protocols often fail to capture these fundamental human visual properties.

Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk · Mon, 09 Ma · cs

Alkaid: Resilience to Edit Errors in Provably Secure Steganography via Distance-Constrained Encoding

The paper proposes Alkaid, a provably secure steganographic scheme that achieves deterministic robustness against edit errors by integrating minimum distance decoding into the encoding process, thereby significantly outperforming state-of-the-art methods in decoding success rates, payload capacity, and encoding speed.

Zhihan Cao, Gaolei Li, Jun Wu, Jianhua Li, Hang Zhang, Mingzhe Chen · Mon, 09 Ma · math
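Alkaid's specific encoding is not shown here, but "minimum distance decoding" against edit errors has a standard generic form: map a received string to the codeword at minimum Levenshtein (insert/delete/substitute) distance. A minimal sketch, with a hypothetical toy codebook chosen only for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a single rolling row."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                        # deletion
                       d[j - 1] + 1,                    # insertion
                       prev + (a[i - 1] != b[j - 1]))   # substitution / match
            prev = cur
    return d[n]

def min_distance_decode(received, codebook):
    """Map a (possibly edited) received string to the nearest codeword."""
    return min(codebook, key=lambda c: edit_distance(received, c))

# Codewords with large pairwise edit distance, so a few insertions,
# deletions, or flips still decode correctly (illustrative only).
codebook = ["000000", "111000", "000111", "111111"]
word = min_distance_decode("10100", codebook)  # deletion plus a flip survive
```

The deterministic robustness claimed in the abstract comes from this geometry: as long as the channel introduces fewer edits than half the minimum pairwise distance of the code, the nearest codeword is guaranteed to be the transmitted one.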

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

This paper proposes a novel end-to-end audio-visual speech recognition framework that integrates speech enhancement via a Conformer-based bottleneck fusion module to implicitly refine noisy audio features without explicit mask generation, thereby preserving semantic integrity and outperforming existing mask-based methods on the LRS3 benchmark under noisy conditions.

Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei Yin · Mon, 09 Ma · cs.AI

Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information

This paper proposes a pose-aware in-context visual learning (PA-ICVL) framework that enhances Vision-Language Models' ability to detect semantic structural visual hallucinations in non-photorealistic cartoon images by integrating pose information alongside RGB data, achieving significant performance improvements over RGB-only baselines.

Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun Seo · Mon, 09 Ma · cs.AI

Human-Data Interaction, Exploration, and Visualization in the AI Era: Challenges and Opportunities

This paper examines how the rapid advancement of AI, particularly with foundation models and unstructured data, introduces new challenges in latency, scalability, and interpretability for human-data interaction, arguing for a paradigm shift that redefines human-machine roles and integrates cognitive and perceptual principles to build more effective, human-centered analytical systems.

Jean-Daniel Fekete, Yifan Hu, Dominik Moritz, Arnab Nandi, Senjuti Basu Roy, Eugene Wu, Nikos Bikakis, George Papastefanatos, Panos K. Chrysanthis, Guoliang Li, Lingyun Yu · Mon, 09 Ma · cs.AI

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Expert architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão · Mon, 09 Ma · cs.AI
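Omni-C's architecture is not reproduced here, but the "unimodal contrastive pretraining" it relies on typically means an InfoNCE-style objective that pulls paired embeddings (e.g. the audio and text views of the same item) together in the shared space. A minimal NumPy sketch of that generic loss, with batch size, dimensions, and temperature as illustrative assumptions:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings from
    different modalities; row i of z_a is paired with row i of z_b."""
    # L2-normalize so similarities are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau            # pairwise similarity matrix
    labels = np.arange(len(z_a))          # matching pairs sit on the diagonal

    def xent(l):
        # Cross-entropy of each row against its diagonal (positive) entry.
        l = l - l.max(axis=1, keepdims=True)              # numeric stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))          # both directions

rng = np.random.default_rng(0)
z_audio = rng.normal(size=(8, 16))                        # toy embeddings
z_text = z_audio + 0.01 * rng.normal(size=(8, 16))        # near-aligned pairs
loss_matched = info_nce(z_audio, z_text)
loss_shuffled = info_nce(z_audio, z_text[::-1])           # misaligned pairs
```

Because one encoder produces every modality's embedding into this single space, the contrastive objective does the alignment work that a Mixture-of-Experts router would otherwise handle, which is where the parameter and memory savings in the abstract come from.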