SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

The paper introduces SkillCraft, a benchmark and evaluation protocol designed to test and enhance LLM agents' ability to abstract, compose, and reuse higher-level tool combinations as "skills," demonstrating that such compositional learning significantly improves task success rates and reduces token usage by up to 80%.

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Wed, 11 Ma · cs.CL
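The abstract describes agents abstracting tool combinations into reusable "skills." As a rough illustration (the tool functions and skill format below are invented, not from the paper), a skill can be a named composition of primitive tool calls, so the whole pipeline runs as one call instead of the agent spending tokens deciding each step:

```python
# Sketch: abstracting a sequence of primitive tool calls into a reusable
# "skill". All tool names and the skill format are illustrative.

def tool_tokenize(text):
    """Primitive tool: split text into words."""
    return text.split()

def tool_count(items):
    """Primitive tool: count items."""
    return len(items)

def make_skill(name, steps):
    """Compose primitive tools into one named, reusable skill.

    Instead of choosing each tool call turn by turn, the agent invokes
    the learned composition as a single higher-level action.
    """
    def skill(arg):
        for step in steps:
            arg = step(arg)
        return arg
    skill.__name__ = name
    return skill

# A learned skill: "how many words does this text have?"
word_count = make_skill("word_count", [tool_tokenize, tool_count])

print(word_count("agents can reuse composed tools"))  # prints 5
```

Once a composition like this is stored in a skill library, later tasks can reuse it directly, which is one plausible source of the token savings the summary reports.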

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

KernelCraft introduces the first benchmark evaluating agentic LLM systems that use feedback-driven workflows to automatically generate and optimize low-level kernels for emerging hardware with novel ISAs, demonstrating their ability to produce valid, high-performance code that rivals or exceeds traditional compiler baselines.

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao
Wed, 11 Ma · cs.LG
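The summary mentions feedback-driven workflows that generate and optimize kernels. A minimal sketch of that loop, with toy Python functions standing in for candidate kernels (the candidates and harness are invented for illustration): each proposal is first checked for validity against a reference, then timed, and the fastest valid one is kept.

```python
# Sketch of a feedback-driven optimization loop: propose candidate
# "kernels", verify them against a reference, time them, keep the fastest.
# The candidates below are toy Python stand-ins, not real hardware kernels.
import timeit

def reference_sum(xs):
    """Golden reference implementation of the target operation."""
    total = 0
    for x in xs:
        total += x
    return total

# Candidate implementations an agent might propose for the same operation.
candidates = {
    "loop": reference_sum,
    "builtin": sum,
}

def pick_best(candidates, data, reference):
    expected = reference(data)
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        if fn(data) != expected:   # validity feedback: must match reference
            continue
        t = timeit.timeit(lambda: fn(data), number=200)  # performance feedback
        if t < best_time:
            best_name, best_time = name, t
    return best_name

data = list(range(10_000))
print(pick_best(candidates, data, reference_sum))
```

In the real setting the feedback would come from a compiler and profiler on the target ISA rather than `timeit`, but the select-by-measured-feedback structure is the same.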

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

GateLens is a reasoning-enhanced LLM agent that utilizes Relational Algebra as a formal intermediate representation to bridge the gap between natural language and executable code, enabling fast, transparent, and highly accurate analysis of complex tabular data in automotive software release analytics without requiring few-shot examples or complex agent orchestration.

Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu
Wed, 11 Ma · cs.AI
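To make the intermediate-representation idea concrete, here is a generic sketch (the table, query, and lowering are invented examples, not GateLens itself): a natural-language question is lowered to a relational-algebra expression built from standard operators, which then executes directly over tabular data.

```python
# Sketch: relational algebra as an intermediate representation between a
# natural-language question and executable code. Standard RA primitives
# over rows-as-dicts; the release table and query are invented examples.

def select(rows, predicate):          # sigma: filter rows by a condition
    return [r for r in rows if predicate(r)]

def project(rows, columns):           # pi: keep only the named columns
    return [{c: r[c] for c in columns} for r in rows]

releases = [
    {"component": "brakes", "tests_passed": 98, "gate": "release"},
    {"component": "infotainment", "tests_passed": 71, "gate": "hold"},
    {"component": "battery", "tests_passed": 95, "gate": "release"},
]

# "Which components passed their release gate?" could lower to:
#   pi_component( sigma_{gate = 'release'}(releases) )
answer = project(select(releases, lambda r: r["gate"] == "release"),
                 ["component"])
print(answer)  # [{'component': 'brakes'}, {'component': 'battery'}]
```

Because the RA expression is explicit, the analysis is inspectable and deterministic once the lowering is done, which is presumably what makes the approach transparent compared with free-form generated code.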

SiliconMind-V1: Multi-Agent Distillation and Debug-Reasoning Workflows for Verilog Code Generation

The paper introduces SiliconMind-V1, a unified multi-agent framework that leverages testbench-driven verification and iterative debug-reasoning workflows to train locally fine-tuned LLMs for generating functionally correct Verilog RTL designs, outperforming state-of-the-art models with greater efficiency and privacy.

Mu-Chi Chen, Yu-Hung Kao, Po-Hsuan Huang, Shao-Chun Ho, Hsiang-Yu Tsou, I-Ting Wu, En-Ming Huang, Yu-Kai Hung, Wei-Po Hsin, Cheng Liang, Chia-Heng Tu, Shih-Hao Hung, Hsiang-Tsung Kung
Wed, 11 Ma · cs.AI
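The core of testbench-driven verification is exercising a candidate design against a golden model and feeding the failing input vectors back as debug evidence. A minimal sketch in Python (the 2-to-1 mux example and function names are invented; real RTL testbenches would run in a Verilog simulator):

```python
# Sketch: testbench-driven verification as feedback. A candidate design is
# driven against a golden model; failing input vectors become the "debug
# feedback" an agent reasons over. The mux example is invented.

def golden_mux(a, b, sel):
    """Golden model of a 2-to-1 multiplexer."""
    return b if sel else a

def buggy_mux(a, b, sel):
    """Candidate design with the branches accidentally swapped."""
    return a if sel else b

def run_testbench(dut, golden):
    """Drive all input vectors; return the list of failing vectors."""
    failures = []
    for a in (0, 1):
        for b in (0, 1):
            for sel in (0, 1):
                if dut(a, b, sel) != golden(a, b, sel):
                    failures.append((a, b, sel))
    return failures

print(run_testbench(buggy_mux, golden_mux))   # vectors where a != b fail
print(run_testbench(golden_mux, golden_mux))  # empty once the bug is fixed
```

The list of failing vectors is exactly the kind of structured signal a debug-reasoning workflow can iterate on until the testbench passes.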

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

The paper introduces EsoLang-Bench, a benchmark that uses esoteric programming languages to probe whether large language models genuinely reason. Models that score highly on standard coding benchmarks drop to near-zero accuracy when a task requires acquiring a new language through documentation and experimentation rather than memorization, exposing a dramatic gap between benchmark performance and genuine reasoning.

Aman Sharma, Paras Chopra
Wed, 11 Ma · cs.AI
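To illustrate the kind of task this describes (the language below is invented for this sketch, not one of the benchmark's languages): given only a few documentation rules for an unfamiliar stack language, writing or predicting programs requires applying the spec, not recalling memorized code.

```python
# Sketch: a tiny invented stack language with a three-rule "documentation"
# and its interpreter. Solving tasks in such a language requires reading
# the spec rather than pattern-matching against training data.
#
# Spec: a program is a string of instructions, executed left to right.
#   digit -> push that digit onto the stack
#   '+'   -> pop two values, push their sum
#   '*'   -> pop two values, push their product
# The result is the value left on top of the stack.

def run(program):
    stack = []
    for ch in program:
        if ch.isdigit():
            stack.append(int(ch))
        elif ch == "+":
            stack.append(stack.pop() + stack.pop())
        elif ch == "*":
            stack.append(stack.pop() * stack.pop())
    return stack[-1]

print(run("34+2*"))  # (3 + 4) * 2 = 14
```

A model that truly acquires the language from this spec can evaluate or write arbitrary programs in it; one relying on memorization has nothing to retrieve.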

Engineering Systems for Data Analysis Using Interactive Structured Inductive Programming

The paper introduces iProg, an interactive tool that leverages a structured communication protocol between humans and large language models to decompose scientific data analysis tasks into declarative Data Flow Diagrams and generate corresponding code, thereby achieving significantly faster development, higher code quality, and better performance than traditional Low Code/No Code alternatives.

Shraddha Surana, Ashwin Srinivasan, Michael Bain
Tue, 10 Ma · cs
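The declarative Data Flow Diagram idea can be sketched as follows (the DFD encoding and node names here are invented for illustration, not iProg's actual format): each node declares its input nodes and a function, and an executor resolves the graph from the requested output backwards.

```python
# Sketch: a declarative Data Flow Diagram driving execution. Each node
# maps to (input_nodes, function); the executor resolves dependencies
# recursively and caches results. Format and node names are invented.

dfd = {
    "load":  ([], lambda: [1.0, 2.0, 3.0, 4.0]),
    "clean": (["load"], lambda xs: [x for x in xs if x > 1.0]),
    "mean":  (["clean"], lambda xs: sum(xs) / len(xs)),
}

def run_dfd(dfd, target):
    """Evaluate node `target`, resolving its inputs recursively."""
    cache = {}
    def eval_node(name):
        if name not in cache:
            inputs, fn = dfd[name]
            cache[name] = fn(*[eval_node(i) for i in inputs])
        return cache[name]
    return eval_node(target)

print(run_dfd(dfd, "mean"))  # (2 + 3 + 4) / 3 = 3.0
```

Keeping the analysis as an explicit diagram like this is what lets a human inspect and steer the decomposition while an LLM fills in the per-node code.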