Can Adjusting Hyperparameters Lead to Green Deep Learning: An Empirical Study on Correlations between Hyperparameters and Energy Consumption of Deep Learning Models

This empirical study demonstrates that strategically adjusting hyperparameters in deep learning models can significantly reduce energy consumption without compromising performance, promoting "green" AI, particularly when multiple models are trained in parallel.

Taoran Wang, Yanhui Li, Mingliang Ma, Lin Chen, Yuming Zhou | Mon, 09 Ma | cs
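
As a rough illustration of the kind of experiment such a study involves (not the authors' actual setup), the sketch below sweeps one hyperparameter, batch size, over a stubbed training routine and estimates energy as wall-clock time multiplied by an assumed average device power draw; the train_one_model stub, the 250 W figure, and the candidate values are all placeholders.

```python
import time

ASSUMED_AVG_POWER_WATTS = 250  # placeholder; a real study would measure device power

def train_one_model(batch_size: int) -> float:
    """Stub standing in for a real training run; returns a validation metric."""
    time.sleep(16 / batch_size)  # pretend larger batches finish the epoch sooner
    return 0.90  # placeholder accuracy

def accuracy_and_energy(batch_size: int) -> tuple[float, float]:
    start = time.monotonic()
    accuracy = train_one_model(batch_size)
    elapsed_s = time.monotonic() - start
    energy_wh = ASSUMED_AVG_POWER_WATTS * elapsed_s / 3600  # watt-hours = W * s / 3600
    return accuracy, energy_wh

if __name__ == "__main__":
    for bs in (32, 128, 512):  # illustrative hyperparameter candidates
        acc, wh = accuracy_and_energy(bs)
        print(f"batch_size={bs:4d}  accuracy={acc:.3f}  est. energy={wh:.5f} Wh")
```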

Real-World Fault Detection for C-Extended Python Projects with Automated Unit Test Generation

This paper proposes adapting the Pynguin tool to use subprocess execution for isolating C-extension crashes during automated test generation, a method that successfully increased module coverage by up to 56.5% and uncovered 32 previously unknown faults in popular Python libraries.

Lucas Berg, Lukas Krodinger, Stephan Lukasczyk, Annibale Panichella, Gordon Fraser, Wim Vanhoof, Xavier Devroey | Mon, 09 Ma | cs
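
The core isolation idea can be sketched independently of Pynguin itself: execute each candidate test in a child Python process so that a segfault inside a C extension kills only that process, and detect the crash from the negative return code (the signal number). Everything below, including the snippet being executed, is illustrative rather than Pynguin's actual implementation.

```python
import subprocess
import sys

def run_isolated(test_code: str, timeout_s: float = 10.0) -> str:
    """Run candidate test code in a subprocess and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", test_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # The child was killed by a signal (e.g., SIGSEGV raised in a C extension).
        return f"native crash (signal {-proc.returncode})"
    if proc.returncode != 0:
        return "python-level failure"
    return "passed"

if __name__ == "__main__":
    # Illustrative candidate: crashes the interpreter via ctypes, not a real generated test.
    crashing = "import ctypes; ctypes.string_at(0)"
    print(run_isolated(crashing))
```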

A LINDDUN-based Privacy Threat Modeling Framework for GenAI

This paper introduces a novel, LINDDUN-based privacy threat modeling framework specifically designed for Generative AI systems, which expands the existing threat taxonomy with new categories and examples derived from a systematic literature review and validated through a case study on an AI Agent system.

Qianying Liao, Jonah Bellemans, Laurens Sion, Xue Jiang, Dmitrii Usynin, Xuebing Zhou, Dimitri Van Landuyt, Lieven Desmet, Wouter Joosen | Mon, 09 Ma | cs
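
For context, the seven baseline LINDDUN threat categories that such a framework extends can be written down as a small lookup table; the GenAI-specific categories and examples introduced by the paper are not reproduced here, and the short keys are just illustrative labels.

```python
# The seven classic LINDDUN privacy threat categories (the baseline being extended).
LINDDUN_CATEGORIES = {
    "L": "Linkability",
    "I": "Identifiability",
    "Nr": "Non-repudiation",
    "D": "Detectability",
    "Di": "Disclosure of information",
    "U": "Unawareness",
    "Nc": "Non-compliance",
}

if __name__ == "__main__":
    for key, name in LINDDUN_CATEGORIES.items():
        print(f"{key:>2}: {name}")
```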

When Specifications Meet Reality: Uncovering API Inconsistencies in Ethereum Infrastructure

This paper introduces APIDiffer, a specification-guided differential testing framework that automatically detects API inconsistencies across Ethereum clients by generating real-world test cases and using large language models to filter false positives, successfully uncovering 72 confirmed bugs and significantly outperforming existing tools in coverage and accuracy.

Jie Ma, Ningyu He, Jinwen Xi, Mingzhe Xing, Liangxin Liu, Jiushenzi Luo, Xiaopeng Fu, Chiachih Wu, Haoyu Wang, Ying Gao, Yinliang Yue | Mon, 09 Ma | cs
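
The underlying differential-testing idea can be sketched as follows: send the same JSON-RPC request to two Ethereum client endpoints and flag any divergence in their responses as a candidate inconsistency. The endpoints, the chosen method, and the naive whole-response comparison are illustrative assumptions, not APIDiffer's actual test generation or LLM-based filtering.

```python
import json
import requests  # third-party HTTP client (pip install requests)

# Hypothetical local JSON-RPC endpoints for two different Ethereum execution clients.
CLIENT_A = "http://localhost:8545"
CLIENT_B = "http://localhost:18545"

def rpc_call(endpoint: str, method: str, params: list) -> dict:
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return requests.post(endpoint, json=payload, timeout=10).json()

def compare_clients(method: str, params: list) -> None:
    a = rpc_call(CLIENT_A, method, params)
    b = rpc_call(CLIENT_B, method, params)
    if a.get("result") != b.get("result") or ("error" in a) != ("error" in b):
        print(f"Candidate inconsistency for {method}{params}:")
        print("  A:", json.dumps(a, sort_keys=True)[:200])
        print("  B:", json.dumps(b, sort_keys=True)[:200])

if __name__ == "__main__":
    # eth_getBlockByNumber is a standard Ethereum JSON-RPC method.
    compare_clients("eth_getBlockByNumber", ["0x1", False])
```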

CodeScout: Contextual Problem Statement Enhancement for Software Agents

The paper introduces CodeScout, a framework that enhances software agent performance by performing lightweight pre-exploration of codebases to convert underspecified user requests into comprehensive, actionable problem statements, resulting in a 20% improvement in resolution rates on the SWEBench-Verified benchmark.

Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar | Mon, 09 Ma | cs.CL
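
A minimal sketch of the pre-exploration idea (not CodeScout's actual pipeline): scan the repository for files whose contents mention terms from the user's request and fold the hits into an enriched problem statement an agent can act on. The keyword heuristic and the output format are illustrative assumptions.

```python
import pathlib
import re

def explore_repo(repo_root: str, request: str, max_hits: int = 5) -> list[str]:
    """Return paths of source files that mention terms from the user request."""
    terms = {t.lower() for t in re.findall(r"[A-Za-z_]{4,}", request)}
    hits = []
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(t) for t in terms)
        if score:
            hits.append((score, str(path)))
    return [p for _, p in sorted(hits, reverse=True)[:max_hits]]

def enrich(request: str, repo_root: str) -> str:
    """Build an enriched problem statement from the lightweight exploration."""
    files = explore_repo(repo_root, request)
    context = "\n".join(f"- {f}" for f in files) or "- (no obvious candidates found)"
    return f"Task: {request}\n\nLikely relevant files:\n{context}"

if __name__ == "__main__":
    print(enrich("Fix the crash in the config parser when the file is empty", "."))
```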

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

ReflexiCoder is a novel reinforcement learning framework that internalizes structured self-reflection and self-correction capabilities into an LLM's weights, enabling it to autonomously generate, debug, and optimize code without external feedback while achieving state-of-the-art performance and improved token efficiency across multiple benchmarks.

Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim | Mon, 09 Ma | cs.LG
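
One plausible ingredient of such a training setup, sketched here purely as a guess rather than the paper's recipe, is an outcome-based reward: the model's full trajectory (draft code, self-reflection, corrected code) is scored by running only the final program against unit tests, so reflection and correction pay off only when they actually fix the code. The trajectory format and test harness below are hypothetical.

```python
import subprocess
import sys
import tempfile

def reward_from_trajectory(final_code: str, test_code: str, timeout_s: float = 15.0) -> float:
    """Hypothetical reward: 1.0 if the model's final, self-corrected code passes the tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Earlier drafts and reflection text in the trajectory are not executed;
        # only the final program plus the tests is scored.
        f.write(final_code + "\n\n" + test_code)
        script = f.name
    try:
        proc = subprocess.run([sys.executable, script], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(reward_from_trajectory(code, tests))
```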

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox is a lightweight, container-free framework that leverages kernel-level isolation and environment pre-caching to significantly reduce storage and setup overhead while maintaining performance comparable to traditional container-based pipelines for scaling reinforcement learning in software engineering agents.

Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao | Mon, 09 Ma | cs.AI
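
The pre-caching half of the idea can be sketched without any container tooling: key a virtual environment on a hash of the project's dependency list and reuse it whenever the same dependencies recur, so repeated rollouts skip the expensive setup step. Paths, the hashing choice, and the venv-based mechanism are illustrative assumptions, not the framework's implementation.

```python
import hashlib
import pathlib
import subprocess
import sys

CACHE_ROOT = pathlib.Path.home() / ".swe_env_cache"  # hypothetical cache location

def cached_env(requirements: str) -> pathlib.Path:
    """Return a virtualenv keyed by the dependency list, creating it only on a cache miss."""
    key = hashlib.sha256(requirements.encode()).hexdigest()[:16]
    env_dir = CACHE_ROOT / key
    if not env_dir.exists():
        subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
        pip = env_dir / "bin" / "pip"  # POSIX layout; Windows puts pip under Scripts\
        subprocess.run([str(pip), "install", *requirements.split()], check=True)
    return env_dir

if __name__ == "__main__":
    env = cached_env("pytest")  # a second call with the same deps is a cache hit
    print("environment ready at", env)
```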

Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents

This paper presents a comprehensive survey of 178 benchmarks for Code Large Language Models and Agents through a tiered Software Development Life Cycle (SDLC) framework, revealing a significant imbalance that heavily favors the implementation phase while neglecting requirements and design, as well as critical gaps in anti-contamination strategies, and calls for future research to bridge the gap between theoretical capabilities and practical effectiveness.

Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Aishan Liu, Xianglong Liu, Chao Shen, Bin Shi | Mon, 09 Ma | cs.AI

LTLGuard: Formalizing LTL Specifications with Compact Language Models and Lightweight Symbolic Reasoning

LTLGuard is a modular framework that enables resource-efficient open-weight language models (4B–14B parameters) to generate correct and conflict-free Linear Temporal Logic (LTL) specifications from informal requirements by combining constrained generation with lightweight symbolic reasoning for iterative consistency checking and refinement.

Medina Andresel, Cristinel Mateis, Dejan Nickovic, Spyridon Kounoupidis, Panagiotis Katsaros, Stavros Tripakis | Mon, 09 Ma | cs.AI
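
To make the consistency-checking goal concrete, here is a small generic example of the kind of conflict such a check must catch (not taken from the paper): three individually reasonable requirements whose LTL formalizations are jointly unsatisfiable.

```latex
% R1: every request is eventually granted.
\varphi_1 = \mathbf{G}\,(\mathit{request} \rightarrow \mathbf{F}\,\mathit{grant})
% R2: grants are never issued.
\varphi_2 = \mathbf{G}\,\neg\mathit{grant}
% R3: some request eventually occurs.
\varphi_3 = \mathbf{F}\,\mathit{request}
% Each formula is satisfiable on its own, but
% \varphi_1 \wedge \varphi_2 \wedge \varphi_3 is unsatisfiable:
% R3 forces a request, R1 then forces a grant, and R2 forbids any grant.
```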

Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

This paper introduces Tool-Genesis, a diagnostic benchmark designed to evaluate and quantify the capabilities of self-evolving language agents in autonomously creating and utilizing tools from abstract requirements, revealing that even state-of-the-art models struggle with interface precision and logic execution, which leads to significant downstream performance degradation.

Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo | Mon, 09 Ma | cs.AI
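
The "interface precision" failure mode can be illustrated with a tiny check (an assumption about the general shape of such an evaluation, not the benchmark's actual harness): compare the signature of an agent-created tool against the parameter names the task specification requires.

```python
import inspect

def interface_matches(tool, required_params: list[str]) -> bool:
    """Return True if the generated tool exposes exactly the required parameters, in order."""
    actual = list(inspect.signature(tool).parameters)
    return actual == required_params

# Hypothetical task spec: a currency converter taking (amount, source, target).
REQUIRED = ["amount", "source", "target"]

def generated_tool(amount: float, src: str, target: str) -> float:
    """An agent-created tool whose second parameter name drifts from the spec."""
    return amount  # placeholder logic

if __name__ == "__main__":
    print(interface_matches(generated_tool, REQUIRED))  # False: 'src' != 'source'
```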

EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

The paper introduces EigenData, a self-evolving multi-agent platform that automates the synthesis, auditing, and repair of high-quality function-calling training data, demonstrating its effectiveness by systematically correcting the Berkeley Function-Calling Leaderboard (BFCL-V3) to achieve model rankings that better correlate with human judgments of functional correctness.

Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei-Chen Wang, Hanrui Wang, Di Jin | Mon, 09 Ma | cs.AI
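
A minimal sketch of what auditing a function-calling example can look like (the schema format and rules here are assumptions, not EigenData's agents): check that the recorded call names a known tool, supplies every required argument, and passes no arguments outside the schema.

```python
def audit_call(call: dict, tools: dict[str, dict]) -> list[str]:
    """Return a list of problems found in one function-calling training example."""
    problems = []
    schema = tools.get(call.get("name", ""))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    for p in schema["required"]:
        if p not in args:
            problems.append(f"missing required argument: {p}")
    for p in args:
        if p not in schema["parameters"]:
            problems.append(f"unexpected argument: {p}")
    return problems

if __name__ == "__main__":
    tools = {"get_weather": {"parameters": ["city", "unit"], "required": ["city"]}}
    bad_example = {"name": "get_weather", "arguments": {"location": "Paris"}}
    print(audit_call(bad_example, tools))  # flags missing 'city' and unexpected 'location'
```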

Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents

This paper proposes "Traversal-as-Policy," a framework that distills sandboxed execution logs into verifiable Gated Behavior Trees to replace implicit LLM policies with explicit, state-conditioned macro traversals, thereby significantly improving success rates, eliminating safety violations, and reducing computational costs across diverse autonomous agent benchmarks.

Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, Zhengzhong Tu | Mon, 09 Ma | cs.AI
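
A minimal gated behavior tree can be sketched as below to show what "traversal as policy" means structurally: each step of the agent is an explicit node, and gates (state-conditioned checks) decide which branch runs, so no LLM call is needed at decision time. The node types and the example tree are illustrative, not the paper's distilled trees.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict

@dataclass
class Action:
    name: str
    run: Callable[[State], bool]  # returns success (True) or failure (False)

    def tick(self, state: State) -> bool:
        return self.run(state)

@dataclass
class Gate:
    """State-conditioned guard: the child runs only when the condition holds."""
    condition: Callable[[State], bool]
    child: "Action | Sequence"

    def tick(self, state: State) -> bool:
        return self.condition(state) and self.child.tick(state)

@dataclass
class Sequence:
    """Runs children in order; fails fast on the first failure."""
    children: list = field(default_factory=list)

    def tick(self, state: State) -> bool:
        return all(child.tick(state) for child in self.children)

if __name__ == "__main__":
    state: State = {"logged_in": False}
    policy = Sequence([
        Gate(lambda s: not s["logged_in"],
             Action("log_in", lambda s: s.update(logged_in=True) or True)),
        Gate(lambda s: s["logged_in"],
             Action("fetch_report", lambda s: True)),
    ])
    print("success:", policy.tick(state))
```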