CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

This paper introduces CyberThreat-Eval, an expert-annotated benchmark derived from real-world Cyber Threat Intelligence workflows that addresses the limitations of existing evaluations by assessing Large Language Models across the full triage-to-reporting pipeline using analyst-centric metrics, revealing significant gaps in current models' ability to handle nuanced, actionable security insights.

Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang · Wed, 11 Ma · cs.CL

TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA

This paper introduces TA-Mem, a novel framework that enhances long-term conversational QA by employing tool-augmented autonomous agents to adaptively extract structured memory and dynamically select retrieval strategies, thereby overcoming the limitations of static similarity-based methods and achieving superior performance on the LoCoMo dataset.
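A minimal sketch of the adaptive-retrieval idea, assuming hypothetical tools and a toy routing rule rather than TA-Mem's actual agent policy:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    turn: int       # position in the conversation history
    text: str       # raw utterance
    keywords: set   # structured facts extracted when the memory was written

def keyword_retrieve(memory, terms, k=3):
    """Exact-match lookup over structured keywords (names, dates, entities)."""
    scored = sorted(((len(terms & m.keywords), m) for m in memory),
                    key=lambda x: -x[0])
    return [m for s, m in scored[:k] if s > 0]

def recency_retrieve(memory, k=3):
    """Most recent turns, for questions about the latest exchanges."""
    return sorted(memory, key=lambda m: -m.turn)[:k]

def similarity_retrieve(memory, query, k=3):
    """Stand-in for embedding search; token overlap keeps the sketch runnable."""
    q = set(query.lower().split())
    scored = sorted(((len(q & set(m.text.lower().split())), m) for m in memory),
                    key=lambda x: -x[0])
    return [m for _, m in scored[:k]]

def retrieve(memory, query):
    """Toy agent policy: route by query shape, fall back to similarity search."""
    if any(w in query.lower() for w in ("just", "last", "recently")):
        return recency_retrieve(memory)
    terms = {t for t in query.lower().split() if len(t) > 3}
    return keyword_retrieve(memory, terms) or similarity_retrieve(memory, query)
```

The contrast with static similarity-based methods is the routing: the strategy is picked per query rather than applying one retriever unconditionally.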

Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang · Wed, 11 Ma · cs.CL

How Contrastive Decoding Enhances Large Audio Language Models?

This paper systematically evaluates four Contrastive Decoding strategies across diverse Large Audio Language Models, identifying Audio-Aware and Audio Contrastive Decoding as most effective while introducing a Transition Matrix framework to demonstrate that these methods successfully rectify specific error patterns like false audio absence claims but fail to correct flawed reasoning or confident misassertions.
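For readers unfamiliar with the family of methods being compared, here is one generic contrastive-decoding step in the audio-aware style, sketched with NumPy; the exact penalty form and plausibility constraint vary by method and are assumptions here:

```python
import numpy as np

def contrastive_logits(logits_with_audio, logits_no_audio, alpha=1.0, tau=0.1):
    """One generic contrastive-decoding step (a sketch, not the paper's recipe).

    Tokens that are likely even WITHOUT the audio get penalized, steering the
    next token toward audio-grounded evidence. `alpha` scales the penalty and
    `tau` is a plausibility cutoff relative to the best audio-conditioned token.
    """
    log_p_audio = logits_with_audio - np.logaddexp.reduce(logits_with_audio)
    log_p_blank = logits_no_audio - np.logaddexp.reduce(logits_no_audio)
    scores = (1 + alpha) * log_p_audio - alpha * log_p_blank
    # Plausibility constraint: keep only tokens reasonably likely given audio.
    mask = log_p_audio >= np.log(tau) + log_p_audio.max()
    return np.where(mask, scores, -np.inf)

# Greedy step: argmax (or sample) over the adjusted scores.
next_token = int(np.argmax(contrastive_logits(np.array([2.0, 0.5, -1.0]),
                                              np.array([2.2, -1.0, -1.5]))))
```

This mechanism explains the paper's finding: it can suppress tokens the model would emit with no audio at all (false absence claims), but it cannot repair reasoning that is wrong even when audio-conditioned.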

Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee · Wed, 11 Ma · cs.CL

Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

This paper reveals that enabling reasoning in large language models significantly enhances the recall of simple factual knowledge through two mechanisms—computational buffering and factual priming—while also demonstrating that hallucinating intermediate facts during this process increases final answer errors, a finding that can be leveraged to improve model accuracy by prioritizing hallucination-free reasoning trajectories.
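The trajectory-selection takeaway could look roughly like the following best-of-n sketch, where `verify_fact` is a hypothetical checker (e.g., a lookup against a trusted source) standing in for whatever hallucination detector is available:

```python
from collections import Counter

def answer_from_trajectories(trajectories, verify_fact):
    """Prefer answers whose reasoning has no hallucinated intermediate facts.

    `trajectories` is a non-empty list of (intermediate_facts, final_answer)
    pairs sampled from the model; `verify_fact` is a hypothetical checker
    returning True for facts it can confirm.
    """
    clean = [ans for facts, ans in trajectories
             if all(verify_fact(f) for f in facts)]
    pool = clean or [ans for _, ans in trajectories]  # fall back if none pass
    return Counter(pool).most_common(1)[0][0]         # majority vote
```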

Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig · Wed, 11 Ma · cs.CL

Do What I Say: A Spoken Prompt Dataset for Instruction-Following

This paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to evaluate Speech Large Language Models under realistic spoken instruction conditions, revealing that text prompts generally outperform spoken ones except in tasks requiring speech output.

Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues · Wed, 11 Ma · cs.CL

Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents

This paper proposes using Chow-Liu trees to optimize chunk ordering in Chain-of-Agents frameworks, demonstrating that a breadth-first traversal of the learned dependency structure significantly reduces information loss and improves reasoning accuracy on long-context benchmarks compared to standard ordering methods.
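A sketch of the ordering step: pairwise cosine similarity of chunk embeddings stands in for the mutual-information estimates a true Chow-Liu tree is built from, the maximum spanning tree is grown with Prim's algorithm, and a breadth-first read-out gives the chunk order:

```python
import numpy as np
from collections import deque

def chow_liu_order(chunk_embeddings):
    """Order chunks by BFS over a maximum spanning tree of pairwise affinity.

    Sketch only: cosine similarity approximates the pairwise mutual
    information of the Chow-Liu construction, and chunk 0 is taken as root.
    """
    X = np.asarray(chunk_embeddings, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                     # pairwise affinity matrix
    n = len(X)

    # Prim's algorithm for the MAXIMUM spanning tree over `sim`.
    parent = {0: None}
    best = {j: (sim[0, j], 0) for j in range(1, n)}
    while best:
        j = max(best, key=lambda k: best[k][0])
        parent[j] = best.pop(j)[1]
        for k in best:
            if sim[j, k] > best[k][0]:
                best[k] = (sim[j, k], j)

    # Breadth-first traversal of the tree yields the chunk ordering.
    children = {i: [] for i in range(n)}
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    order, queue = [], deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(sorted(children[node]))
    return order

print(chow_liu_order([[1, 0], [0.9, 0.1], [0, 1]]))  # e.g. [0, 1, 2]
```

The intuition is that BFS keeps strongly dependent chunks close together in the agent chain, so less information is lost between hand-offs.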

Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover, Ramakrishna B. Bairi, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta, Rishikesh Maurya, Vageesh D. C · Wed, 11 Ma · cs.CL

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

One-Eval is an agentic system that automates the end-to-end evaluation of large language models by converting natural-language requests into traceable, customizable workflows through integrated components for benchmark planning, dataset resolution, and decision-oriented reporting, thereby reducing manual effort and enhancing reproducibility.
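In spirit, such a system reduces to a planned, logged pipeline; the sketch below uses hypothetical callables for the planning, dataset-resolution, and execution components, and records a trace at each step for reproducibility:

```python
import json
import time

def run_eval_workflow(request, plan_benchmarks, resolve_dataset, run_model):
    """Sketch of a traceable evaluation workflow (components are stand-ins).

    `plan_benchmarks`, `resolve_dataset`, and `run_model` are hypothetical
    callables; every step is appended to a trace so the final report can be
    audited and reproduced.
    """
    trace = [{"step": "request", "detail": request, "t": time.time()}]
    benchmarks = plan_benchmarks(request)       # NL request -> benchmark list
    trace.append({"step": "plan", "detail": benchmarks, "t": time.time()})
    results = {}
    for b in benchmarks:
        data = resolve_dataset(b)
        trace.append({"step": "resolve", "detail": b, "t": time.time()})
        results[b] = run_model(data)
        trace.append({"step": "run", "detail": {b: results[b]}, "t": time.time()})
    # Decision-oriented report: results plus the full provenance trace.
    return json.dumps({"results": results, "trace": trace}, default=str, indent=2)
```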

Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang · Wed, 11 Ma · cs.CL

Evaluation of LLMs in retrieving food and nutritional context for RAG systems

This paper evaluates four Large Language Models within a Retrieval-Augmented Generation system for food and nutrition data, finding that while they effectively translate natural language queries into structured metadata filters to reduce manual effort, their reliability diminishes when handling complex queries involving constraints that exceed the representational scope of the underlying metadata.
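The query-to-filter step might be sketched as follows, with an illustrative schema and a hypothetical `llm` callable; the fallback to an empty filter reflects the reliability issue the paper reports on complex queries:

```python
import json

# Illustrative metadata fields; the paper's actual schema is an assumption here.
FILTER_SCHEMA = {
    "food_group": "string or null, e.g. 'legumes'",
    "max_kcal_per_100g": "number or null",
    "min_protein_g_per_100g": "number or null",
}

def query_to_filter(query, llm):
    """Translate a natural-language food query into a structured filter.

    `llm` is a hypothetical text-completion callable returning a string.
    """
    prompt = ("Translate the user's food query into a JSON object matching "
              "this schema; use null for unconstrained fields.\nSchema: "
              + json.dumps(FILTER_SCHEMA) + "\nQuery: " + query + "\nJSON:")
    try:
        return json.loads(llm(prompt))
    except json.JSONDecodeError:
        return {}  # unreliable output on complex queries: retrieve unfiltered

def apply_filter(records, f):
    """Apply the structured filter to (illustrative) food-composition records."""
    kept = []
    for r in records:
        if f.get("food_group") and r["food_group"] != f["food_group"]:
            continue
        if f.get("max_kcal_per_100g") and r["kcal"] > f["max_kcal_per_100g"]:
            continue
        if f.get("min_protein_g_per_100g") and r["protein_g"] < f["min_protein_g_per_100g"]:
            continue
        kept.append(r)
    return kept
```

Constraints that fall outside the schema (the paper's failure mode) simply cannot be expressed as a filter, no matter how well the translation step works.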

Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušić Seljak · Wed, 11 Ma · cs.CL

Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation

This paper proposes a multi-perspective framework for estimating recipe similarity that integrates semantic, lexical, and domain-specific nutritional signals; validation with domain experts identifies the factors that most influence human similarity judgments, supporting applications in personalized nutrition and automated recipe generation.
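A toy fusion of the three perspectives, with illustrative weights and features (the paper derives the influential factors from expert validation, not from these defaults):

```python
import math

def recipe_similarity(a, b, weights=(0.4, 0.3, 0.3)):
    """Fuse semantic, lexical, and nutritional views into one score in [0, 1].

    Sketch only: each recipe is assumed to be a dict with 'embedding' (a
    text-embedding vector), 'ingredients' (a set of names), and 'macros'
    (per-serving values sharing the same keys across recipes).
    """
    # Semantic: cosine similarity of recipe-text embeddings, mapped to [0, 1].
    dot = sum(x * y for x, y in zip(a["embedding"], b["embedding"]))
    norm = (math.sqrt(sum(x * x for x in a["embedding"]))
            * math.sqrt(sum(y * y for y in b["embedding"])))
    semantic = (dot / norm + 1) / 2 if norm else 0.0

    # Lexical: Jaccard overlap of ingredient sets.
    union = a["ingredients"] | b["ingredients"]
    lexical = len(a["ingredients"] & b["ingredients"]) / len(union) if union else 0.0

    # Domain: 1 minus the mean relative gap between nutritional values.
    gaps = [abs(a["macros"][k] - b["macros"][k])
            / max(a["macros"][k], b["macros"][k], 1) for k in a["macros"]]
    nutritional = 1 - sum(gaps) / len(gaps)

    w_sem, w_lex, w_dom = weights
    return w_sem * semantic + w_lex * lexical + w_dom * nutritional
```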

Denica Kjorvezir, Danilo Najkov, Eva Valenčič, Erika Jesenko, Barbara Koroušić Seljak, Tome Eftimov, Riste Stojanov · Wed, 11 Ma · cs.CL