NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

NC-Bench introduces a theory-grounded benchmark that evaluates the conversational competence of large language models by assessing their ability to manage the form and structure of natural interactions across basic, retrieval-augmented, and complex multi-turn scenarios, revealing that while models excel at basic answering, they struggle significantly with repair and complex sequence management tasks.

Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala · Tue, 10 Ma · 💬 cs.CL

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a realistic, telemetry-driven benchmark comprising 1,800 instances across six languages that evaluates LLMs on code completion tasks with a focus on ecological validity, contamination-free assessment, and detailed diagnostic insights to guide practical model selection and development.
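
DevBench's exact scoring protocol is not given in this summary; code-generation benchmarks commonly report pass@k, so here is a minimal sketch of the standard unbiased estimator (Chen et al., 2021) as an illustrative stand-in, not DevBench's confirmed metric:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples generated per task, c of them passed the tests;
    returns the chance a random size-k subset contains >=1 passing sample."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: guaranteed pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```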

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu · Tue, 10 Ma · 🤖 cs.LG

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

This paper introduces MAS-Orchestra, a training-time framework that optimizes multi-agent system orchestration via function-calling reinforcement learning, alongside the MASBENCH benchmark, to demonstrate that multi-agent benefits are task-dependent and to achieve significant performance gains with over 10x efficiency on complex reasoning tasks.
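
The summary describes orchestration via function calling; a hypothetical sketch of that pattern follows, with sub-agents exposed to an orchestrator LLM as tools. The `orchestrator_llm` callable, agent names, and schemas are assumptions for illustration, not MAS-Orchestra's actual API:

```python
# Sub-agents registered as plain callables (stand-ins for real agents).
AGENTS = {
    "solver": lambda task: f"[solver output for: {task}]",
    "critic": lambda task: f"[critique of: {task}]",
}
TOOLS = [{"name": n, "parameters": {"task": "string"}} for n in AGENTS]
TOOLS.append({"name": "final_answer", "parameters": {"answer": "string"}})

def orchestrate(orchestrator_llm, question, max_calls=8):
    """orchestrator_llm(transcript, tools=...) is assumed to return one tool call."""
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_calls):
        call = orchestrator_llm(transcript, tools=TOOLS)
        if call["name"] == "final_answer":
            return call["arguments"]["answer"]
        result = AGENTS[call["name"]](call["arguments"]["task"])
        transcript.append({"role": "tool", "name": call["name"], "content": result})
    return None  # budget exhausted without a final answer
```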

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty · Tue, 10 Ma · 💬 cs.CL

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

This paper proposes a novel data-rate-aware continuous-flow architecture for CNN inference on FPGAs that mitigates hardware underutilization caused by data reduction in pooling and strided convolution layers by interleaving signals and sharing resources, thereby enabling the high-throughput implementation of complex models like MobileNet on a single device.
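
The underutilization the paper targets comes from simple rate arithmetic; the back-of-envelope sketch below (illustrative numbers, not the paper's) shows why interleaving restores full utilization after a pooling layer:

```python
# A 2x2 stride-2 pooling layer cuts the downstream sample rate by 4x, so a
# fully pipelined multiplier fed by it would idle 75% of the time.
pool_factor = 4                       # 2x2 pooling, stride 2
rate_after_pool = 1.0 / pool_factor   # valid samples per clock cycle

util_dedicated = rate_after_pool                             # 0.25
util_interleaved = min(1.0, pool_factor * rate_after_pool)   # 1.00: four
# independent channel streams time-multiplexed onto one shared multiplier
print(f"dedicated: {util_dedicated:.2f}, interleaved: {util_interleaved:.2f}")
```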

Tobias Habermann, Michael Mecik, Zhenyu Wang, César David Vera, Martin Kumm, Mario Garrido · Tue, 10 Ma · 🤖 cs.LG

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache is a training-free framework that accelerates Flow Matching inference by replacing instantaneous velocity caching with an average-velocity approach using cached Jacobian-vector products and a trajectory-stability scheduling strategy, achieving significant speedups (up to 4.56X) while maintaining high generation quality across models like FLUX.1 and HunyuanVideo.
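
A minimal sketch of the average-velocity idea under stated assumptions: a first-order expansion gives the interval-average velocity as u plus half the step times du/dt, with the total derivative along the trajectory obtained from a single Jacobian-vector product. MeanCache's exact estimator, caching, and scheduling may differ; this shows only the underlying mechanics:

```python
import torch
from torch.func import jvp

def mean_velocity_step(u_fn, x, t, dt):
    """One sampler step using an average velocity over [t, t+dt].
    u_fn(x, t) is the model's instantaneous velocity field; t is a scalar tensor."""
    u0 = u_fn(x, t)
    # Total derivative du/dt = (du/dx)·u + du/dt via one JVP; this is the
    # quantity a cache could reuse across nearby steps.
    _, du_dt = jvp(u_fn, (x, t), (u0, torch.ones_like(t)))
    u_bar = u0 + 0.5 * dt * du_dt   # average velocity over the interval
    return x + dt * u_bar           # second-order-accurate update
```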

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian · Tue, 10 Ma · 🤖 cs.LG

Impact of LLM News Sentiment Analysis on Stock Price Movement Prediction

This paper evaluates the impact of LLM-based news sentiment analysis on stock price prediction, demonstrating that DeBERTa outperforms other models and that an ensemble approach achieves 80% accuracy, while sentiment features provide modest improvements to various time-series forecasting architectures.
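
An illustrative sketch (not the paper's pipeline) of how a sentiment feature typically enters such a setup: a daily news-sentiment score appended to lagged-return features before fitting any next-day direction classifier:

```python
import numpy as np

def build_features(returns, sentiment, n_lags=5):
    """returns, sentiment: 1-D arrays aligned by trading day;
    sentiment[t] is an aggregate LLM-derived score in [-1, 1] (assumed)."""
    X, y = [], []
    for t in range(n_lags, len(returns) - 1):
        X.append(np.concatenate([returns[t - n_lags:t], [sentiment[t]]]))
        y.append(int(returns[t + 1] > 0))   # next-day up/down label
    return np.array(X), np.array(y)
```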

Walid Siala (SnT, University of Luxembourg, Luxembourg), Ahmed Khanfir (RIADI, ENSI, University of Manouba, Tunisia; SnT, University of Luxembourg, Luxembourg), Mike Papadakis (SnT, University of Luxembourg, Luxembourg) · Tue, 10 Ma · 💻 cs

Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration

This paper investigates whether Schwartz higher-order values improve sentence-level human value detection, finding that while hierarchical gating offers limited benefits, calibration techniques and hybrid ensembles significantly boost performance, suggesting the value hierarchy is more effective as an inductive bias than a rigid routing mechanism.
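
The summary credits "calibration techniques" without naming them; temperature scaling is one common post-hoc choice, sketched below as an illustrative stand-in rather than the paper's confirmed method:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def fit_temperature(logits, labels):
    """Fit a single temperature T on held-out (logits, labels) by minimizing NLL;
    divide test-time logits by T before softmax."""
    def nll(T):
        z = logits / T
        logp = z - logsumexp(z, axis=1, keepdims=True)
        return -logp[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```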

Víctor Yeste, Paolo Rosso · Tue, 10 Ma · 🤖 cs.LG

Diffusion-Guided Pretraining for Brain Graph Foundation Models

This paper proposes a unified diffusion-guided pretraining framework for brain graph foundation models that overcomes the limitations of existing methods by using diffusion to preserve semantic connectivity patterns during augmentation and to enable topology-aware global reconstruction, thereby achieving robust and transferable representations across diverse neuroimaging datasets.
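
A hypothetical sketch of diffusion used as augmentation on a connectivity matrix, via the standard DDPM forward kernel with symmetrized noise so the graph stays undirected; the paper's actual augmentation and reconstruction objectives may differ:

```python
import torch

def diffuse_adjacency(A, alpha_bar_t):
    """A: symmetric connectivity matrix; alpha_bar_t: cumulative noise
    schedule value in (0, 1) at diffusion step t."""
    noise = torch.randn_like(A)
    # Symmetrize; dividing by sqrt(2) keeps off-diagonal variance at 1.
    noise = (noise + noise.transpose(-1, -2)) / 2**0.5
    return alpha_bar_t**0.5 * A + (1 - alpha_bar_t)**0.5 * noise
```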

Xinxu Wei, Rong Zhou, Lifang He, Yu Zhang · Tue, 10 Ma · 🤖 cs.LG

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

This paper introduces M2RL, a comprehensive study comparing mixed multi-task training versus separate training with model merging for multi-domain Reinforcement Learning with Verifiable Rewards (RLVR), revealing that reasoning-intensive domains exhibit synergistic effects with minimal interference and providing mechanistic insights through extensive experiments.
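
For the merging arm of the comparison, a minimal sketch of parameter-space model merging (a weighted "model soup" over per-domain checkpoints); whether M2RL merges in exactly this way is not stated in the summary:

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average a list of compatible state dicts; defaults to uniform weights."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
        for k in state_dicts[0]
    }
```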

Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang · Tue, 10 Ma · 💻 cs

A Geometric Taxonomy of Hallucinations in LLMs

This paper proposes a geometric taxonomy of LLM hallucinations into three distinct types (unfaithfulness, confabulation, and factual error) and introduces corresponding detection metrics, the Semantic Grounding Index and Directional Grounding Index, which effectively identify unfaithful and confabulated outputs while revealing that apparent signals for factual errors in existing benchmarks often stem from stylistic annotation confounds rather than genuine geometric distinctions.
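
A hypothetical grounding-style score, not the paper's actual SGI or DGI definition: measure how much of a response embedding lies in the subspace spanned by the source-context sentence embeddings, with low values suggesting the response drifts from its context:

```python
import numpy as np

def grounding_score(response_vec, context_vecs):
    """response_vec: (d,) embedding; context_vecs: (m, d) sentence embeddings."""
    Q, _ = np.linalg.qr(context_vecs.T)   # orthonormal basis of the context span
    proj = Q @ (Q.T @ response_vec)       # projection of response onto that span
    return float(np.linalg.norm(proj) / np.linalg.norm(response_vec))
```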

Javier Marín · Tue, 10 Ma · 💬 cs.CL

Can a Lightweight Automated AI Pipeline Solve Research-Level Mathematical Problems?

This paper demonstrates that a lightweight, automated AI pipeline integrating next-generation large language models with citation-based verification can successfully generate and solve sophisticated, research-grade mathematical problems, including previously unpublished questions, with verified results and open-sourced tools.

Lve Meng (University of Science and Technology of China; Zhongguancun Academy), Weilong Zhao (Université Paris Cité), Yanzhi Zhang (Zhongguancun Academy), Haoxiang Guan (Zhongguancun Academy), Jiyan He (Zhongguancun Academy) · Tue, 10 Ma · 🔢 math