cs.AI papers | Gist.Science

EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

This paper introduces EgoCross, a comprehensive benchmark comprising 1,000 QA pairs across four challenging domains (surgery, industry, extreme sports, and animal perspective) to evaluate and expose the poor cross-domain generalization capabilities of current Multimodal Large Language Models in egocentric video question answering.

Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang2026-03-11🤖 cs.AI

Singing Syllabi with Virtual Avatars: Enhancing Student Engagement Through AI-Generated Music and Digital Embodiment

This paper proposes and evaluates a novel educational approach that uses AI-generated singing and virtual avatars to transform traditional text-based syllabi into engaging audiovisual performances, demonstrating that this method significantly improves student awareness and recall of critical course information.

Xinxing Wu2026-03-11🤖 cs.AI

TaoSR1: The Thinking Model for E-commerce Relevance Search

TaoSR1 is a novel framework that enables the direct deployment of Large Language Models for e-commerce relevance search by employing a three-stage training pipeline—incorporating Chain-of-Thought fine-tuning, DPO, and GRPO—to overcome reasoning errors and hallucinations while achieving superior performance in both offline benchmarks and online human evaluations.

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng2026-03-11🤖 cs.AI

Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents

This paper introduces CMASE, a novel framework that integrates generative agent-based modeling with virtual ethnography to transform researchers into embedded participants, enabling real-time social intervention, causal reconstruction of social phenomena, and predictive modeling with high empirical accuracy.

Hanzhong Zhang, Muhua Huang, Jindong Wang2026-03-11🤖 cs.AI

VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

VistaWise is a cost-effective agent framework for Minecraft that leverages a cross-modal knowledge graph and a finetuned object detection model to achieve state-of-the-art performance in open-world tasks while drastically reducing the need for large-scale domain-specific training data.

Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang2026-03-11🤖 cs.AI

Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework

This paper introduces SEER, a self-optimizing framework that adaptively compresses Chain-of-Thought reasoning to significantly reduce computational costs and latency while improving accuracy and robustness in software engineering and mathematical tasks.

Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, Xin Xia2026-03-11🤖 cs.AI

Reinforced Generation of Combinatorial Structures: Hardness of Approximation

This paper demonstrates that AI-driven code mutation agents, specifically AlphaEvolve, can significantly advance complexity theory by discovering new gadget reductions and extremal graph constructions that yield improved hardness of approximation results for MAX-CUT, MAX-Independent Set, MAX-k-CUT, and the metric Traveling Salesman Problem.

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta2026-03-11🤖 cs.AI

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

VSSFlow introduces a unified flow-matching framework that seamlessly integrates Video-to-Sound and Visual Text-to-Speech generation through a disentangled condition aggregation mechanism, demonstrating that joint learning can surpass specialized state-of-the-art baselines without performance degradation.

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song2026-03-11🤖 cs.AI

VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

VoiceBridge is a novel one-step latent bridge model that leverages an energy-preserving variational autoencoder and a joint neural prior to efficiently reconstruct high-quality 48 kHz fullband speech from diverse distortions across various in-domain and out-of-domain general speech restoration tasks without requiring distillation.

Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu2026-03-11🤖 cs.AI

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

This paper introduces v-HUB, a novel benchmark for video humor understanding based on non-verbal short videos with rich annotations, which reveals that current multimodal large language models struggle with visual humor alone but show improved performance when environmental audio is incorporated.

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng2026-03-11🤖 cs.AI

Latent Speech-Text Transformer

The Latent Speech-Text Transformer (LST) improves the efficiency and performance of auto-regressive speech-text models by aggregating speech tokens into latent patches, which aligns sequence granularity with text, reduces computational costs, and achieves significant accuracy gains across speech and text benchmarks.

Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le2026-03-11🤖 cs.AI

AlphaApollo: A System for Deep Agentic Reasoning

AlphaApollo is an agentic reasoning system that enhances foundation models' performance on complex, long-horizon tasks by orchestrating multi-turn agentic reasoning, turn-level reinforcement learning for tool-use optimization, and a propose-judge-update evolution loop with verification.

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo Han2026-03-11🤖 cs.AI

NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions

This paper introduces the NavSpace benchmark to systematically evaluate the spatial intelligence of navigation agents through six task categories and 1,228 trajectory-instruction pairs, revealing limitations in current models and proposing SNav, a new spatially intelligent navigation model that outperforms existing agents on both the benchmark and real robot tests.

Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang, Jiapeng Xu, Yihan Wang, Ziyan Yu, Wenzhe Cai, Lei Kang, Hao Dong2026-03-11🤖 cs.AI

RECODE: Reasoning Through Code Generation for Visual Question Answering

The paper introduces RECODE, an agentic framework that enhances visual question answering by reverse-engineering structured visuals into executable code through iterative generation and selection, thereby transforming ambiguous perceptual tasks into verifiable symbolic reasoning problems that significantly outperform existing methods.

Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi2026-03-11🤖 cs.AI

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

This paper introduces REAP, a novel expert pruning method that outperforms existing merging techniques for compressing Mixture-of-Experts models in generative tasks by leveraging router gate-values and activation norms to minimize reconstruction error, achieving near-lossless compression even at 50% expert reduction.

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa2026-03-11🤖 cs.AI

RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

RL-100 is a unified real-world reinforcement learning framework that combines diffusion visuomotor policies with a clipped PPO objective and consistency distillation to achieve 100% success across 1,000 diverse robotic manipulation trials, matching or surpassing human experts while demonstrating robust zero-shot generalization and continuous deployment in dynamic environments.

Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, Huazhe Xu2026-03-11🤖 cs.AI

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

FALCON addresses the spatial reasoning limitations of existing 2D-based vision-language-action models by leveraging spatial foundation models to inject rich 3D geometric priors directly into the action head, achieving state-of-the-art performance across diverse simulation and real-world tasks without requiring architectural changes or specialized sensors.

Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou2026-03-11🤖 cs.AI

SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation

This paper introduces SynHLMA, a novel framework that synthesizes hand manipulation sequences for articulated objects by aligning natural language instructions with a discrete human-object interaction representation, thereby enabling robust grasp generation, prediction, and interpolation for applications in embodied AI and robotics.

Wang zhi, Yuyan Liu, Liu Liu, Li Zhang, Ruixuan Lu, Dan Guo2026-03-11🤖 cs.AI

GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation

The paper proposes GraphKeeper, a novel framework for Graph Domain-Incremental Learning that addresses catastrophic forgetting through knowledge disentanglement and deviation-free preservation, achieving state-of-the-art performance across multiple graph domains while remaining compatible with various graph foundation models.

Zihao Guo, Qingyun Sun, Ziwei Zhang, Haonan Yuan, Huiping Zhuang, Xingcheng Fu, Jianxin Li2026-03-11🤖 cs.AI

Structured Matrix Scaling for Multi-Class Calibration

This paper proposes a structured matrix scaling approach for multi-class calibration that leverages theoretical insights from logistic regression, combined with structured regularization and robust optimization, to effectively manage the bias-variance tradeoff and achieve substantial performance gains over existing methods while providing an open-source implementation.

Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach2026-03-11🤖 cs.AI

← Previous Next →