Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
This paper presents a case study on meta-evaluating long-form QA benchmarks using ScholarQA-CS2. It finds that while human pairwise preferences work well for system-level comparisons, they fall short for nuanced metric-level assessment; addressing this gap requires expert annotators and explicit annotations that mitigate subjectivity and raise evaluation standards for deep-research systems.
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman · 2026-03-10 · cs.CL