MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

MAViD is a novel multimodal framework that employs a Conductor-Creator architecture, combining autoregressive audio and diffusion-based video generation with a specialized fusion module, to overcome existing limitations and achieve seamless, long-duration, and contextually coherent audio-visual dialogue understanding and generation.

Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu2026-03-10💻 cs

When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs

This paper reveals that visual token information in Vision Large Language Models progressively vanishes at a depth-dependent "information horizon," beyond which existing pruning methods underperform random selection, leading to a novel strategy that integrates random pruning to achieve state-of-the-art efficiency without sacrificing accuracy.

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Lianghua He, Xianfeng Tang, Hui Liu, Yuyin Zhou2026-03-10💻 cs

ReMeDI: Refined Memory for Disambiguation of Identities with SAM3 in Surgical Segmentation

The paper introduces ReMeDI-SAM3, a training-free extension of SAM3 that enhances surgical instrument segmentation in endoscopy by implementing relevance-aware memory filtering, piecewise interpolation, and feature-based re-identification to overcome challenges like occlusions and rapid motion, achieving significant zero-shot performance improvements over existing methods.

Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch2026-03-10💻 cs

It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities

This study analyzes the discrepancies between objective and subjective urban greenery perceptions across five countries using street view imagery and a survey of 1,000 participants, revealing that while demographics and personality have little influence, an individual's geographic location is a primary factor shaping how they perceive green spaces.

Matias Quintana, Fangqi Liu, Jussi Torkko, Youlong Gu, Xiucheng Liang, Yujun Hou, Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Tuuli Toivonen, Yi Lu, Filip Biljecki2026-03-10💻 cs

Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL

This paper demonstrates that reasoning Large Language Models significantly reduce cloud query execution costs and data consumption compared to non-reasoning models in Text-to-SQL tasks, while revealing that execution time is a poor proxy for cost efficiency and highlighting the substantial financial risks posed by non-reasoning models' tendency to generate inefficient queries.

Saurabh Deochake, Debajyoti Mukhopadhyay2026-03-10💻 cs

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

This paper introduces DrivingGen, the first comprehensive benchmark for generative driving world models that addresses the lack of rigorous evaluation by combining a diverse dataset with a novel suite of metrics to assess visual realism, trajectory plausibility, temporal coherence, and controllability, thereby revealing critical trade-offs in current state-of-the-art models.

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander2026-03-10💻 cs

Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging

The paper introduces R^4, a self-improving agentic framework that enhances medical image analysis by decomposing workflows into routing, retrieval, reflection, and repair stages to iteratively refine both textual reports and spatial bounding boxes, achieving significant performance gains over single-pass VLM baselines without requiring gradient-based fine-tuning.

Md. Faiyaz Abdullah Sayeedi, Rashedur Rahman, Siam Tahsin Bhuiyan, Sefatul Wasi, Ashraful Islam, Saadia Binte Alam, AKM Mahbubur Rahman2026-03-10💻 cs

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

This paper introduces "Single-Shot Planning," a secure architecture for Computer Use Agents that generates a complete, trusted execution graph before observing untrusted UI states to effectively mitigate prompt injection and branch steering attacks while maintaining competitive task performance.

Hanna Foerster, Tom Blanchard, Kristina Nikolic, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao2026-03-10💻 cs

User Detection and Response Patterns of Sycophantic Behavior in Conversational AI

This paper investigates how users detect and respond to sycophantic behavior in conversational AI through a proposed DCR epistemology, revealing that while users employ various mitigation strategies, sycophancy is not universally harmful and can provide valued emotional support for vulnerable populations, suggesting a need for context-aware AI design rather than universal elimination.

Kazi Noshin, Syed Ishtiaque Ahmed, Sharifa Sultana2026-03-10💻 cs

BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics

This paper introduces BoxMind, a closed-loop AI system that transforms unstructured boxing footage into hierarchical tactical indicators and predictive gradients to generate expert-level strategic recommendations, which were validated during the 2024 Paris Olympics by contributing to the Chinese National Team's historic medal success.

Kaiwen Wang, Kaili Zheng, Rongrong Deng, Qingmin Fan, Milin Zhang, Zongrui Li, Xuesi Zhou, Bo Han, Liren Chen, Chenyi Guo, Ji Wu2026-03-10💻 cs

Multifaceted Scenario-Aware Hypergraph Learning for Next POI Recommendation

This paper proposes the Multifaceted Scenario-Aware Hypergraph Learning (MSAHG) framework, which addresses the limitations of existing methods in handling mobility variations across distinct contexts by constructing scenario-specific disentangled sub-hypergraphs and employing a parameter-splitting mechanism to resolve inter-scenario conflicts, thereby significantly improving next POI recommendation performance.

Yuxi Lin, Yongkang Li, Jie Xing, Zipei Fan2026-03-10💻 cs