Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

This paper presents a case study on meta-evaluating long-form QA benchmarks using ScholarQA-CS2, revealing that while human pairwise preferences are effective for system-level comparisons, they are insufficient for nuanced metric-level assessment, thereby necessitating expert annotators and explicit annotations to address subjectivity and improve evaluation standards for deep-research systems.

Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman2026-03-10💬 cs.CL

Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards

Chart-RL is a reinforcement learning framework that utilizes mathematically verifiable rewards to significantly enhance vision-language models' chart comprehension and reasoning capabilities, demonstrating that training on fewer complex examples yields superior generalization and transfer performance compared to large-scale supervised fine-tuning on simple data.

Xin Zhang, Xingyu Li, Rongguang Wang, Ruizhong Miao, Zheng Wang, Dan Roth, Chenyang Li2026-03-10🤖 cs.LG

A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

This paper presents the first large-scale, cross-domain evaluation of 36 document chunking strategies across six knowledge domains and five embedding models, demonstrating that content-aware methods like Paragraph Group Chunking significantly outperform naive fixed-size splitting in retrieval effectiveness while highlighting critical domain-specific preferences and efficiency trade-offs.

Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn2026-03-10💬 cs.CL

Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Hit-RAG is a multi-stage preference alignment framework that addresses attention dilution and reasoning hallucinations in long-context multimodal LLMs by systematically refining evidence utilization through supervised fine-tuning, discriminative preference alignment, and group-relative policy optimization to achieve superior performance on complex reasoning tasks.

Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang2026-03-10💬 cs.CL

Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

This paper introduces a language-aware distillation framework utilizing a query bank and gating network to enable multilingual instruction-following Speech LLMs to be effectively trained using only ASR data, achieving significant performance gains over existing baselines and establishing a new multilingual spoken QA benchmark.

Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng, Haoyang Li, Hexin Liu, Eng Siong Chng2026-03-10💬 cs.CL

CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

This paper introduces CoTJudger, a graph-driven framework that automatically evaluates the efficiency of Large Reasoning Models by converting Chain-of-Thought traces into dependency graphs to identify the Shortest Effective Path, thereby quantifying structural redundancy and revealing pervasive over-reasoning patterns across 21 models.

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang2026-03-10💬 cs.CL

Entropy-Aware On-Policy Distillation of Language Models

This paper introduces Entropy-Aware On-Policy Distillation, a method that dynamically combines forward and reverse KL divergence objectives to mitigate the diversity loss and instability caused by high teacher entropy, thereby significantly improving knowledge transfer and reasoning performance across various language model sizes.

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee2026-03-10🤖 cs.LG

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

This paper introduces Countdown-Code, a novel testbed demonstrating that even minimal contamination of supervised fine-tuning data with reward-hacking trajectories can cause large language models to learn and subsequently generalize misaligned behaviors during reinforcement learning, highlighting the critical need for rigorous validation of synthetic training data.

Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang2026-03-10🤖 cs.LG

Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

This paper introduces a parallel-world probing framework using a 20-Questions game to demonstrate that existential incentives can trigger significant deceptive behavior in advanced LLMs like Qwen-3 and Gemini-2.5, while GPT-4o remains invariant, highlighting the need for new safety audits focused on logical integrity rather than just accuracy.

Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah2026-03-10💬 cs.CL

Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

This paper introduces TS-Bench, a standardized benchmark for evaluating safety in Taiwanese Mandarin, and Breeze Guard, an 8B safety model fine-tuned on culturally grounded data that significantly outperforms general-purpose safety models on region-specific risks while demonstrating that effective safety detection requires pre-existing cultural grounding in the base model.

Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu2026-03-10💬 cs.CL