SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

The paper introduces SkillCraft, a benchmark and evaluation protocol designed to test and enhance LLM agents' ability to abstract, compose, and reuse higher-level tool combinations as "skills," demonstrating that such compositional learning significantly improves task success rates and reduces token usage by up to 80%.

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Wed, 11 Ma · cs.CL
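The abstract describes agents abstracting tool combinations into reusable "skills." As a rough illustration (the tool functions and skill format below are invented, not from the paper), a skill can be a named composition of primitive tool calls, so the whole pipeline runs as one call instead of the agent spending tokens deciding each step:

```python
# Sketch: abstracting a sequence of primitive tool calls into a reusable
# "skill". All tool names and the skill format are illustrative.

def tool_tokenize(text):
    """Primitive tool: split text into words."""
    return text.split()

def tool_count(items):
    """Primitive tool: count items."""
    return len(items)

def make_skill(name, steps):
    """Compose primitive tools into one named, reusable skill.

    Instead of choosing each tool call turn by turn, the agent invokes
    the learned composition as a single higher-level action.
    """
    def skill(arg):
        for step in steps:
            arg = step(arg)
        return arg
    skill.__name__ = name
    return skill

# A learned skill: "how many words does this text have?"
word_count = make_skill("word_count", [tool_tokenize, tool_count])

print(word_count("agents can reuse composed tools"))  # prints 5
```

Once a composition like this is stored in a skill library, later tasks can reuse it directly, which is one plausible source of the token savings the summary reports.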

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

KernelCraft introduces the first benchmark evaluating agentic LLM systems that use feedback-driven workflows to automatically generate and optimize low-level kernels for emerging hardware with novel ISAs, demonstrating their ability to produce valid, high-performance code that rivals or exceeds traditional compiler baselines.

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao
Wed, 11 Ma · cs.LG
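The summary mentions feedback-driven workflows that generate and optimize kernels. A minimal sketch of that loop, with toy Python functions standing in for candidate kernels (the candidates and harness are invented for illustration): each proposal is first checked for validity against a reference, then timed, and the fastest valid one is kept.

```python
# Sketch of a feedback-driven optimization loop: propose candidate
# "kernels", verify them against a reference, time them, keep the fastest.
# The candidates below are toy Python stand-ins, not real hardware kernels.
import timeit

def reference_sum(xs):
    """Golden reference implementation of the target operation."""
    total = 0
    for x in xs:
        total += x
    return total

# Candidate implementations an agent might propose for the same operation.
candidates = {
    "loop": reference_sum,
    "builtin": sum,
}

def pick_best(candidates, data, reference):
    expected = reference(data)
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        if fn(data) != expected:   # validity feedback: must match reference
            continue
        t = timeit.timeit(lambda: fn(data), number=200)  # performance feedback
        if t < best_time:
            best_name, best_time = name, t
    return best_name

data = list(range(10_000))
print(pick_best(candidates, data, reference_sum))
```

In the real setting the feedback would come from a compiler and profiler on the target ISA rather than `timeit`, but the select-by-measured-feedback structure is the same.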

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

GateLens is a reasoning-enhanced LLM agent that utilizes Relational Algebra as a formal intermediate representation to bridge the gap between natural language and executable code, enabling fast, transparent, and highly accurate analysis of complex tabular data in automotive software release analytics without requiring few-shot examples or complex agent orchestration.

Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu
Wed, 11 Ma · cs.AI
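To make the intermediate-representation idea concrete, here is a generic sketch (the table, query, and lowering are invented examples, not GateLens itself): a natural-language question is lowered to a relational-algebra expression built from standard operators, which then executes directly over tabular data.

```python
# Sketch: relational algebra as an intermediate representation between a
# natural-language question and executable code. Standard RA primitives
# over rows-as-dicts; the release table and query are invented examples.

def select(rows, predicate):          # sigma: filter rows by a condition
    return [r for r in rows if predicate(r)]

def project(rows, columns):           # pi: keep only the named columns
    return [{c: r[c] for c in columns} for r in rows]

releases = [
    {"component": "brakes", "tests_passed": 98, "gate": "release"},
    {"component": "infotainment", "tests_passed": 71, "gate": "hold"},
    {"component": "battery", "tests_passed": 95, "gate": "release"},
]

# "Which components passed their release gate?" could lower to:
#   pi_component( sigma_{gate = 'release'}(releases) )
answer = project(select(releases, lambda r: r["gate"] == "release"),
                 ["component"])
print(answer)  # [{'component': 'brakes'}, {'component': 'battery'}]
```

Because the RA expression is explicit, the analysis is inspectable and deterministic once the lowering is done, which is presumably what makes the approach transparent compared with free-form generated code.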

SiliconMind-V1: Multi-Agent Distillation and Debug-Reasoning Workflows for Verilog Code Generation

The paper introduces SiliconMind-V1, a unified multi-agent framework that leverages testbench-driven verification and iterative debug-reasoning workflows to train locally fine-tuned LLMs for generating functionally correct Verilog RTL designs, outperforming state-of-the-art models with greater efficiency and privacy.

Mu-Chi Chen, Yu-Hung Kao, Po-Hsuan Huang, Shao-Chun Ho, Hsiang-Yu Tsou, I-Ting Wu, En-Ming Huang, Yu-Kai Hung, Wei-Po Hsin, Cheng Liang, Chia-Heng Tu, Shih-Hao Hung, Hsiang-Tsung Kung
Wed, 11 Ma · cs.AI
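The core of testbench-driven verification is exercising a candidate design against a golden model and feeding the failing input vectors back as debug evidence. A minimal sketch in Python (the 2-to-1 mux example and function names are invented; real RTL testbenches would run in a Verilog simulator):

```python
# Sketch: testbench-driven verification as feedback. A candidate design is
# driven against a golden model; failing input vectors become the "debug
# feedback" an agent reasons over. The mux example is invented.

def golden_mux(a, b, sel):
    """Golden model of a 2-to-1 multiplexer."""
    return b if sel else a

def buggy_mux(a, b, sel):
    """Candidate design with the branches accidentally swapped."""
    return a if sel else b

def run_testbench(dut, golden):
    """Drive all input vectors; return the list of failing vectors."""
    failures = []
    for a in (0, 1):
        for b in (0, 1):
            for sel in (0, 1):
                if dut(a, b, sel) != golden(a, b, sel):
                    failures.append((a, b, sel))
    return failures

print(run_testbench(buggy_mux, golden_mux))   # vectors where a != b fail
print(run_testbench(golden_mux, golden_mux))  # empty once the bug is fixed
```

The list of failing vectors is exactly the kind of structured signal a debug-reasoning workflow can iterate on until the testbench passes.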

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

The paper introduces EsoLang-Bench, a benchmark that uses esoteric programming languages to probe whether large language models genuinely reason. Models that score highly on standard coding benchmarks drop to near-zero accuracy when a task requires acquiring a new language through documentation and experimentation rather than memorization, exposing a dramatic gap between benchmark performance and genuine reasoning.

Aman Sharma, Paras Chopra
Wed, 11 Ma · cs.AI
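To illustrate the kind of task this describes (the language below is invented for this sketch, not one of the benchmark's languages): given only a few documentation rules for an unfamiliar stack language, writing or predicting programs requires applying the spec, not recalling memorized code.

```python
# Sketch: a tiny invented stack language with a three-rule "documentation"
# and its interpreter. Solving tasks in such a language requires reading
# the spec rather than pattern-matching against training data.
#
# Spec: a program is a string of instructions, executed left to right.
#   digit -> push that digit onto the stack
#   '+'   -> pop two values, push their sum
#   '*'   -> pop two values, push their product
# The result is the value left on top of the stack.

def run(program):
    stack = []
    for ch in program:
        if ch.isdigit():
            stack.append(int(ch))
        elif ch == "+":
            stack.append(stack.pop() + stack.pop())
        elif ch == "*":
            stack.append(stack.pop() * stack.pop())
    return stack[-1]

print(run("34+2*"))  # (3 + 4) * 2 = 14
```

A model that truly acquires the language from this spec can evaluate or write arbitrary programs in it; one relying on memorization has nothing to retrieve.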

Engineering Systems for Data Analysis Using Interactive Structured Inductive Programming

The paper introduces iProg, an interactive tool that leverages a structured communication protocol between humans and large language models to decompose scientific data analysis tasks into declarative Data Flow Diagrams and generate corresponding code, thereby achieving significantly faster development, higher code quality, and better performance than traditional Low Code/No Code alternatives.

Shraddha Surana, Ashwin Srinivasan, Michael Bain
Tue, 10 Ma · cs
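The declarative Data Flow Diagram idea can be sketched as follows (the DFD encoding and node names here are invented for illustration, not iProg's actual format): each node declares its input nodes and a function, and an executor resolves the graph from the requested output backwards.

```python
# Sketch: a declarative Data Flow Diagram driving execution. Each node
# maps to (input_nodes, function); the executor resolves dependencies
# recursively and caches results. Format and node names are invented.

dfd = {
    "load":  ([], lambda: [1.0, 2.0, 3.0, 4.0]),
    "clean": (["load"], lambda xs: [x for x in xs if x > 1.0]),
    "mean":  (["clean"], lambda xs: sum(xs) / len(xs)),
}

def run_dfd(dfd, target):
    """Evaluate node `target`, resolving its inputs recursively."""
    cache = {}
    def eval_node(name):
        if name not in cache:
            inputs, fn = dfd[name]
            cache[name] = fn(*[eval_node(i) for i in inputs])
        return cache[name]
    return eval_node(target)

print(run_dfd(dfd, "mean"))  # (2 + 3 + 4) / 3 = 3.0
```

Keeping the analysis as an explicit diagram like this is what lets a human inspect and steer the decomposition while an LLM fills in the per-node code.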