AlphaApollo: A System for Deep Agentic Reasoning

AlphaApollo is an agentic reasoning system that enhances foundation models' performance on complex, long-horizon tasks by orchestrating multi-turn agentic reasoning, turn-level reinforcement learning for tool-use optimization, and a propose-judge-update evolution loop with verification.

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo HanWed, 11 Ma🤖 cs.AI

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

This paper introduces EXPLORE-Bench, a benchmark derived from real first-person videos to evaluate the ability of multimodal large language models to perform long-horizon egocentric scene prediction, revealing significant performance gaps compared to humans and demonstrating that stepwise reasoning offers partial improvements at a computational cost.

Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun ZhaWed, 11 Ma🤖 cs.AI

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

This paper introduces MUGEN, a comprehensive benchmark revealing that Large Audio-Language Models struggle with multi-audio understanding as input scaling increases, and demonstrates that combining training-free strategies like Audio-Permutational Self-Consistency with Chain-of-Thought can significantly improve performance.

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi LeeWed, 11 Ma🤖 cs.AI

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

The paper introduces ActiveUltraFeedback, an efficient active learning pipeline that leverages uncertainty estimates and novel selection strategies like Double Reverse Thompson Sampling to generate high-quality preference data, enabling Large Language Models to achieve superior alignment performance with as little as one-sixth of the annotated data required by static baselines.

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas KrauseWed, 11 Ma🤖 cs.AI

ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling

The paper proposes ESAinsTOD, a unified end-to-end schema-aware instruction-tuning framework that leverages full-parameter LLM fine-tuning with instruction and schema alignment mechanisms to achieve superior performance, generalization in low-resource settings, and robustness against noise across diverse task-oriented dialog benchmarks.

Dechuan Teng, Chunlin Lu, Libo Qin, Wanxiang CheWed, 11 Ma🤖 cs.AI

Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records

This study demonstrates that a custom Transformer architecture outperforms both traditional machine learning models and zero-shot generative LLMs in automatically classifying cardiac risk from large-context, unstructured Dutch electronic health records, offering a robust alternative to manual administrative coding for geriatric cardiovascular risk management.

Jacopo Vitale, David Della Morte, Luca Bacco, Mario Merone, Mark de Groot, Saskia Haitjema, Leandro Pecchia, Bram van EsWed, 11 Ma🤖 cs.AI

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health

This study investigates gender bias in Large Language Models within French healthcare contexts, demonstrating that these models rely on embedded stereotypes when processing interactions between gender and other social determinants of health, thereby highlighting the need for context-specific assessments that go beyond evaluating individual factors in isolation.

Trung Hieu Ngo, Adrien Bazoge, Solen Quiniou, Pierre-Antoine Gourraud, Emmanuel MorinWed, 11 Ma🤖 cs.AI

TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation

TaSR-RAG is a taxonomy-guided framework that enhances Retrieval-Augmented Generation for multi-hop reasoning by decomposing complex queries into structured triple sub-queries and performing step-wise evidence selection through hybrid matching, thereby achieving superior accuracy and clearer reasoning traces without relying on costly graph construction.

Jiashuo Sun, Yixuan Xie, Jimeng Shi, Shaowen Wang, Jiawei HanWed, 11 Ma🤖 cs.AI

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

The paper introduces VoxEmo, a comprehensive benchmark and toolkit for evaluating Speech Large Language Models on speech emotion recognition across 35 corpora and 15 languages, featuring a distribution-aware soft-label protocol that reveals how these models uniquely align with human subjective emotion distributions despite trailing supervised baselines in hard-label accuracy.

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas HainWed, 11 Ma🤖 cs.AI

PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration

PathoScribe is a unified retrieval-augmented large language model framework that transforms static pathology archives into an active, reasoning-enabled clinical intelligence platform, enabling natural language case retrieval, automated cohort construction, and real-time diagnostic support with high accuracy and efficiency.

Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan NiaziWed, 11 Ma🤖 cs.AI

Fish Audio S2 Technical Report

This paper introduces Fish Audio S2, an open-source text-to-speech system that leverages a multi-stage training pipeline to enable multi-speaker, multi-turn generation with natural-language instruction following, while providing production-ready weights and an efficient SGLang-based inference engine.

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei HanWed, 11 Ma🤖 cs.AI

Let's Verify Math Questions Step by Step

This paper introduces MathQ-Verify, a novel five-stage pipeline that rigorously filters ill-posed or under-specified mathematical questions through format validation, formalization, contradiction detection, and completeness checks, achieving state-of-the-art performance in curating reliable datasets for training large language models.

Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao ZhangWed, 11 Ma🤖 cs.AI