cs.MM papers | Gist.Science

ModalImmune: Immunity Driven Unlearning via Self Destructive Training

ModalImmune is a training framework that enhances the robustness of multimodal systems against input channel loss by intentionally collapsing selected modality information during training through a combination of adaptive regularization, targeted intervention, and certified meta-parameter adaptation.

Rong Fu, Jia Yee Tan, Zijian Zhang, Ziming Wang, Zhaolu Kang, Muge Qi, Shuning Zhang, Simon FongTue, 10 Ma🤖 cs.LG

Q-BAR: Blogger Anomaly Recognition via Quantum-enhanced Manifold Learning

The paper proposes Q-BAR, a hybrid quantum-classical framework that utilizes parameter-efficient variational quantum circuits to detect semantic mutations in online media by modeling creator-specific manifolds, effectively overcoming data scarcity challenges where traditional deep learning models typically fail.

Maida Wang, Panyun JiangTue, 10 Ma⚛️ quant-ph

Chasing RATs: Tracing Reading for and as Creative Activity

This paper proposes "Reading Activity Traces" (RATs) to reframe reading as a creative activity by making the interpretive labor of navigating and connecting sources visible, thereby countering the automation of human interpretation through tools like the speculative WikiRAT system.

Sophia Liu, Shm Garanganao AlmedaThu, 12 Ma💻 cs

P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

This paper introduces P-GSVC, the first layered progressive 2D Gaussian splatting framework that unifies scalable image and video reconstruction through a base-plus-enhancement layer structure and a novel joint training strategy, achieving significant PSNR improvements over sequential layer-wise methods.

Longan Wang, Yuang Shi, Wei Tsang OoiThu, 12 Ma💻 cs

Data relativistic uncertainty framework for low-illumination anime scenery image enhancement

This paper addresses the scarcity of low-light anime scenery data by introducing a curated unpaired dataset and a novel Data Relativistic Uncertainty (DRU) framework that leverages uncertainty quantification to dynamically adjust learning objectives, thereby achieving superior enhancement results compared to state-of-the-art methods.

Yiquan Gao, John SeeThu, 12 Ma🤖 cs.LG

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

The paper introduces GOT-JEPA, a model-predictive pretraining framework that enhances generic object tracking generalization by predicting tracking models from corrupted frames, and further proposes OccuSolver to refine occlusion handling through iterative visibility estimation and point-centric tracking.

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu LinThu, 12 Ma🤖 cs.AI

PRoADS: Provably Secure and Robust Audio Diffusion Steganography with latent optimization and backward Euler Inversion

The paper introduces PRoADS, a provably secure and robust audio steganography framework that embeds secret messages into diffusion model noise via orthogonal projection and employs Latent Optimization with Backward Euler Inversion to minimize reconstruction errors, achieving a remarkably low bit error rate of 0.15% under 64 kbps MP3 compression.

YongPeng Yan, Yanan Li, Qiyang Xiao, Yanzhen RenThu, 12 Ma💻 cs

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

V2M-Zero introduces a zero-pair video-to-music generation method that achieves superior temporal synchronization and semantic alignment by leveraging shared temporal structures captured through intra-modal event curves, eliminating the need for paired training data or cross-modal supervision.

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. BryanThu, 12 Ma🤖 cs.AI

Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

The paper introduces V-Skip, a dual-path token pruning mechanism that prevents "Visual Amnesia" in Multimodal Large Language Models by anchoring compression to visual salience rather than linguistic redundancy, achieving a 2.9× speedup with negligible accuracy loss.

Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, Haijun ZhangThu, 12 Ma💬 cs.CL

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

The paper proposes AMB-DSGDN, a novel network for multimodal emotion recognition that utilizes modality-specific semantic graphs with a differential attention mechanism to filter noise and an adaptive balancing strategy to prevent dominant modalities from suppressing complementary cues, thereby enhancing the accuracy of dynamic emotional state modeling.

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin LiThu, 12 Ma🤖 cs.AI

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

The paper proposes G-STAR, an end-to-end system that integrates a time-aware speaker-tracking module with a Speech-LLM backbone to achieve robust, timestamped speaker-attributed recognition for long-form, overlapping multi-party speech while maintaining global identity consistency.

Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai WangThu, 12 Ma⚡ eess

Evaluating quality metrics through the lenses of psychophysical measurements of low-level vision

This paper introduces a new framework of psychophysical tests based on low-level vision principles—specifically contrast sensitivity, masking, and matching—to evaluate and reveal the perceptual strengths and weaknesses of 34 existing image and video quality metrics, demonstrating that standard evaluation protocols often fail to capture these fundamental human visual properties.

Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. MantiukMon, 09 Ma💻 cs

Alkaid: Resilience to Edit Errors in Provably Secure Steganography via Distance-Constrained Encoding

The paper proposes Alkaid, a provably secure steganographic scheme that achieves deterministic robustness against edit errors by integrating minimum distance decoding into the encoding process, thereby significantly outperforming state-of-the-art methods in decoding success rates, payload capacity, and encoding speed.

Zhihan Cao, Gaolei Li, Jun Wu, Jianhua Li, Hang Zhang, Mingzhe ChenMon, 09 Ma🔢 math

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

This paper proposes a novel end-to-end audio-visual speech recognition framework that integrates speech enhancement via a Conformer-based bottleneck fusion module to implicitly refine noisy audio features without explicit mask generation, thereby preserving semantic integrity and outperforming existing mask-based methods on the LRS3 benchmark under noisy conditions.

Linzhi Wu, Xingyu Zhang, Hao Yuan, Yakun Zhang, Changyan Zheng, Liang Xie, Tiejun Liu, Erwei YinMon, 09 Ma🤖 cs.AI

Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information

This paper proposes a pose-aware in-context visual learning (PA-ICVL) framework that enhances Vision-Language Models' ability to detect semantic structural visual hallucinations in non-photorealistic cartoon images by integrating pose information alongside RGB data, achieving significant performance improvements over RGB-only baselines.

Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun SeoMon, 09 Ma🤖 cs.AI

Human-Data Interaction, Exploration, and Visualization in the AI Era: Challenges and Opportunities

This paper examines how the rapid advancement of AI, particularly with foundation models and unstructured data, introduces new challenges in latency, scalability, and interpretability for human-data interaction, arguing for a paradigm shift that redefines human-machine roles and integrates cognitive and perceptual principles to build more effective, human-centered analytical systems.

Jean-Daniel Fekete, Yifan Hu, Dominik Moritz, Arnab Nandi, Senjuti Basu Roy, Eugene Wu, Nikos Bikakis, George Papastefanatos, Panos K. Chrysanthis, Guoliang Li, Lingyun YuMon, 09 Ma🤖 cs.AI

VDCook:DIY video data cook your MLLMs

VDCook is a self-evolving, configurable video data operating system that enables researchers and domain teams to automatically generate, update, and manage specialized video training datasets for MLLMs through natural language queries, integrated retrieval-synthesis modules, and automated metadata annotation.

Chengwei WuMon, 09 Ma🤖 cs.AI

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Expert architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de GusmãoMon, 09 Ma🤖 cs.AI

altiro3D: Scene representation from single image and novel view synthesis

The paper introduces altiro3D, a free library that synthesizes realistic 3D experiences and novel views from a single RGB image or video by combining monocular depth estimation, inpainting, and optimized projection algorithms to generate multi-viewpoint light fields for free-view displays.

E. Canessa, L. Tenze2026-03-10💻 cs

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

SarcasmMiner is a reinforcement learning-based post-training framework that employs a dual-track distillation strategy with a generative reward model and group relative policy optimization to significantly enhance robust audio-visual sarcasm reasoning and reduce hallucinations in foundation models.

Zhu Li, Yongjian Chen, Huiyuan Lai + 3 more2026-03-06💬 cs.CL

← Previous Next →