HEARTS: Benchmarking LLM Reasoning on Health Time Series
The paper introduces HEARTS, a benchmark comprising 16 real-world health datasets and 110 tasks across four reasoning capabilities. The benchmark reveals that current large language models significantly underperform specialized models in health time series analysis because they struggle with multi-step temporal reasoning and rely on simple heuristics.