cs.CV papers | Gist.Science

Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors

This paper challenges the assumption that neutralizing known triggers eliminates backdoors. It demonstrates that perceptually distinct "alternative triggers" can reliably activate latent backdoor directions in feature space, and it advocates defenses that target these underlying representation patterns rather than specific input triggers.

Gorka Abad, Ermes Franch, Stefanos Koffas, Stjepan Picek | 2026-03-11 | cs

What is Missing? Explaining Neurons Activated by Absent Concepts

This paper identifies that deep neural networks frequently encode the absence of concepts to drive neuron activation—a phenomenon largely overlooked by standard explainable AI methods—and proposes simple extensions to attribution and feature visualization techniques to effectively reveal and leverage these "missing" concepts for better model interpretation and debiasing.

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth | 2026-03-11 | cs.LG

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This paper introduces Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$), a novel task addressed by the Dual-Clue enhanced Prototype Growing Network (DCPGN), which utilizes a Multi-Label Prototype Growing Module and a Dual-Clue Consistency Module to effectively bridge the inter-view gap and adapt models online without target-view training data.

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li | 2026-03-11 | cs

RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

This paper introduces a new fine-grained Audio-Visual Learning task called Region-Aware Sound Source Understanding (RA-SSU), supported by two novel datasets (f-Music and f-Lifescene) and a state-of-the-art model named SSUFormer, which utilizes specialized modules to achieve precise sound source segmentation and detailed frame-level textual descriptions.

Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun | 2026-03-11 | cs

ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

ConfCtrl is a confidence-aware video interpolation framework that enables precise camera control in video diffusion for novel view synthesis by combining confidence-weighted point cloud projections with a Kalman-inspired predict-update mechanism to balance pose guidance and geometric consistency while reconstructing unseen regions.

Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada | 2026-03-11 | cs

BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling

BrainSTR is a spatio-temporal contrastive learning framework that enhances the interpretability of dynamic brain network modeling for neuropsychiatric diagnosis by adaptively partitioning brain states, identifying critical phases, and extracting sparse, disease-specific connectivity patterns to construct a discriminative semantic space validated across ASD, BD, and MDD datasets.

Guiliang Guo, Guangqi Wen, Lingwen Liu, Ruoxian Song, Peng Cao, Jinzhu Yang, Fei Wang, Xiaoli Liu, Osmar R. Zaiane | 2026-03-11 | cs

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

This paper introduces VLM-Loc, a framework that leverages large vision-language models to achieve precise text-to-point-cloud localization by transforming 3D maps into bird's-eye-view images and scene graphs for enhanced spatial reasoning, alongside the release of the CityLoc benchmark for systematic evaluation.

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu | 2026-03-11 | cs

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

This paper introduces MA-EgoQA, a novel benchmark and dataset featuring 1,700 questions across five categories designed to evaluate the ability of AI models to understand and reason over multiple long-horizon egocentric videos from embodied agents, alongside a proposed baseline model named EgoMAS that highlights current limitations in system-level multi-agent understanding.

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang | 2026-03-11 | cs.AI

CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy

CycleULM is a novel, label-free deep learning framework that leverages CycleGAN to bridge the simulation-to-reality gap in ultrasound localisation microscopy, significantly enhancing microbubble localisation accuracy, image resolution, and processing speed for real-time clinical application without requiring paired ground truth data.

Su Yan, Clara Rodrigo Gonzalez, Vincent C. H. Leung, Herman Verinaz-Jadan, Jiakang Chen, Matthieu Toulemonde, Kai Riemer, Jipeng Yan, Clotilde Vié, Qingyuan Tan, Peter D. Weinberg, Pier Luigi Dragotti, Kevin G. Murphy, Meng-Xing Tang | 2026-03-11 | eess

MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities

This paper introduces MissBench, a benchmark and framework for multimodal affective computing that addresses the gap in evaluating models under realistic, imbalanced missing modality conditions by standardizing protocols and proposing new diagnostic metrics (MEI and MLI) to reveal hidden modality inequities and optimization imbalances.

Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen | 2026-03-11 | cs

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

The paper introduces InternVL-U, a lightweight 4B-parameter unified multimodal model that democratizes advanced understanding, reasoning, generation, and editing capabilities by employing a modular architecture and a reasoning-centric data synthesis pipeline. It achieves a superior performance-efficiency balance and outperforms significantly larger baselines such as BAGEL.

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang | 2026-03-11 | cs

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

The paper introduces DISPLAY, a framework for generating controllable and physically consistent human-object interaction videos by utilizing sparse motion guidance (wrist coordinates and object bounding boxes), an object-stressed attention mechanism, and a multi-task auxiliary training strategy to overcome limitations in flexibility, generalization, and data scarcity.

Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang | 2026-03-11 | cs

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

This paper introduces CourtSI, a large-scale dataset and benchmark for evaluating spatial intelligence in vision-language models within sports scenarios, revealing significant performance gaps in existing models while demonstrating that fine-tuning on this data substantially improves accuracy and generalization.

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong | 2026-03-11 | cs

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

WikiCLIP is an efficient contrastive framework for open-domain visual entity recognition that leverages large language model embeddings enhanced by a Vision-Guided Knowledge Adaptor and Hard Negative Synthesis to significantly outperform generative baselines while reducing inference latency by nearly 100 times.

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He | 2026-03-11 | cs

On the Structural Failure of Chamfer Distance in 3D Shape Optimization

This paper identifies a structural gradient failure in Chamfer distance optimization that causes point cloud collapse, demonstrating that introducing non-local coupling mechanisms is a necessary condition to suppress this collapse and achieve successful 3D shape optimization.

Chang-Yong Song, David Hyde | 2026-03-11 | cs

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

This paper proposes an interpretable text-motion retrieval framework that represents 3D human motion as joint-angle pseudo-images processed by Vision Transformers and aligns them with text via a token-wise late interaction mechanism, thereby overcoming the limitations of global-embedding methods by capturing fine-grained correspondences and improving retrieval accuracy.

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao | 2026-03-11 | cs

Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation

The paper introduces ACADiff, an adaptive clinical-aware latent diffusion framework that synthesizes missing multimodal brain imaging data (sMRI, FDG-PET, and AV45-PET) by integrating imaging observations with GPT-4o-encoded clinical metadata, achieving superior generation quality and robust diagnostic performance even when up to 80% of modalities are missing.

Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative | 2026-03-11 | cs.AI

Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy

This paper proposes a novel unsupervised domain adaptation framework based on a reformulated Margin Disparity Discrepancy to bridge the modality gap between annotated CT and unannotated interventional CBCT scans, achieving state-of-the-art performance in liver segmentation for both unsupervised and few-shot settings.

Gauthier Miralles, Loïc Le Folgoc, Vincent Jugnon, Pietro Gori | 2026-03-11 | cs

No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space

The paper proposes k-MTR, a novel framework that bypasses the traditional image reconstruction step by directly learning multi-task cardiac diagnostic features from undersampled k-space data through a shared semantic manifold, thereby eliminating reconstruction artifacts and achieving competitive performance across regression, classification, and segmentation tasks.

Yundi Zhang, Sevgi Gokce Kafali, Niklas Bubeck, Daniel Rueckert, Jiazhen Pan | 2026-03-11 | cs.AI

Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading

This paper introduces the concept of Whole Slide Difficulty (WSD), derived from diagnostic disagreements between expert and non-expert pathologists, and demonstrates that leveraging this metric through multi-task learning or weighted loss functions significantly improves the accuracy of prostate cancer Gleason grading in Multiple Instance Learning models, particularly for higher-grade cases.

Marie Arrivat, Rémy Peyret, Elsa Angelini, Pietro Gori | 2026-03-11 | cs
