Can Adjusting Hyperparameters Lead to Green Deep Learning: An Empirical Study on Correlations between Hyperparameters and Energy Consumption of Deep Learning Models

This empirical study demonstrates that strategically adjusting hyperparameters in deep learning models can significantly reduce energy consumption without compromising performance, promoting "green" AI, particularly when multiple models are trained in parallel.

Taoran Wang, Yanhui Li, Mingliang Ma, Lin Chen, Yuming Zhou | Mon, 09 Ma | cs
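
As a rough illustration of the kind of experiment such a study involves (not the authors' actual setup), the sketch below sweeps one hyperparameter, batch size, over a stubbed training routine and estimates energy as wall-clock time multiplied by an assumed average device power draw; the train_one_model stub, the 250 W figure, and the candidate values are all placeholders.

```python
import time

ASSUMED_AVG_POWER_WATTS = 250  # placeholder; a real study would measure device power

def train_one_model(batch_size: int) -> float:
    """Stub standing in for a real training run; returns a validation metric."""
    time.sleep(16 / batch_size)  # pretend larger batches finish the epoch sooner
    return 0.90  # placeholder accuracy

def accuracy_and_energy(batch_size: int) -> tuple[float, float]:
    start = time.monotonic()
    accuracy = train_one_model(batch_size)
    elapsed_s = time.monotonic() - start
    energy_wh = ASSUMED_AVG_POWER_WATTS * elapsed_s / 3600  # watt-hours = W * s / 3600
    return accuracy, energy_wh

if __name__ == "__main__":
    for bs in (32, 128, 512):  # illustrative hyperparameter candidates
        acc, wh = accuracy_and_energy(bs)
        print(f"batch_size={bs:4d}  accuracy={acc:.3f}  est. energy={wh:.5f} Wh")
```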

Real-World Fault Detection for C-Extended Python Projects with Automated Unit Test Generation

This paper proposes adapting the Pynguin tool to use subprocess execution for isolating C-extension crashes during automated test generation, a method that successfully increased module coverage by up to 56.5% and uncovered 32 previously unknown faults in popular Python libraries.

Lucas Berg, Lukas Krodinger, Stephan Lukasczyk, Annibale Panichella, Gordon Fraser, Wim Vanhoof, Xavier Devroey | Mon, 09 Ma | cs
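
The core isolation idea can be sketched independently of Pynguin itself: execute each candidate test in a child Python process so that a segfault inside a C extension kills only that process, and detect the crash from the negative return code (the signal number). Everything below, including the snippet being executed, is illustrative rather than Pynguin's actual implementation.

```python
import subprocess
import sys

def run_isolated(test_code: str, timeout_s: float = 10.0) -> str:
    """Run candidate test code in a subprocess and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", test_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # The child was killed by a signal (e.g., SIGSEGV raised in a C extension).
        return f"native crash (signal {-proc.returncode})"
    if proc.returncode != 0:
        return "python-level failure"
    return "passed"

if __name__ == "__main__":
    # Illustrative candidate: crashes the interpreter via ctypes, not a real generated test.
    crashing = "import ctypes; ctypes.string_at(0)"
    print(run_isolated(crashing))
```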

A LINDDUN-based Privacy Threat Modeling Framework for GenAI

This paper introduces a novel, LINDDUN-based privacy threat modeling framework specifically designed for Generative AI systems, which expands the existing threat taxonomy with new categories and examples derived from a systematic literature review and validated through a case study on an AI Agent system.

Qianying Liao, Jonah Bellemans, Laurens Sion, Xue Jiang, Dmitrii Usynin, Xuebing Zhou, Dimitri Van Landuyt, Lieven Desmet, Wouter Joosen | Mon, 09 Ma | cs
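
For context, the seven baseline LINDDUN threat categories that such a framework extends can be written down as a small lookup table; the GenAI-specific categories and examples introduced by the paper are not reproduced here, and the short keys are just illustrative labels.

```python
# The seven classic LINDDUN privacy threat categories (the baseline being extended).
LINDDUN_CATEGORIES = {
    "L": "Linkability",
    "I": "Identifiability",
    "Nr": "Non-repudiation",
    "D": "Detectability",
    "Di": "Disclosure of information",
    "U": "Unawareness",
    "Nc": "Non-compliance",
}

if __name__ == "__main__":
    for key, name in LINDDUN_CATEGORIES.items():
        print(f"{key:>2}: {name}")
```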

When Specifications Meet Reality: Uncovering API Inconsistencies in Ethereum Infrastructure

This paper introduces APIDiffer, a specification-guided differential testing framework that automatically detects API inconsistencies across Ethereum clients by generating real-world test cases and using large language models to filter false positives, successfully uncovering 72 confirmed bugs and significantly outperforming existing tools in coverage and accuracy.

Jie Ma, Ningyu He, Jinwen Xi, Mingzhe Xing, Liangxin Liu, Jiushenzi Luo, Xiaopeng Fu, Chiachih Wu, Haoyu Wang, Ying Gao, Yinliang Yue | Mon, 09 Ma | cs
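
The underlying differential-testing idea can be sketched as follows: send the same JSON-RPC request to two Ethereum client endpoints and flag any divergence in their responses as a candidate inconsistency. The endpoints, the chosen method, and the naive whole-response comparison are illustrative assumptions, not APIDiffer's actual test generation or LLM-based filtering.

```python
import json
import requests  # third-party HTTP client (pip install requests)

# Hypothetical local JSON-RPC endpoints for two different Ethereum execution clients.
CLIENT_A = "http://localhost:8545"
CLIENT_B = "http://localhost:18545"

def rpc_call(endpoint: str, method: str, params: list) -> dict:
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return requests.post(endpoint, json=payload, timeout=10).json()

def compare_clients(method: str, params: list) -> None:
    a = rpc_call(CLIENT_A, method, params)
    b = rpc_call(CLIENT_B, method, params)
    if a.get("result") != b.get("result") or ("error" in a) != ("error" in b):
        print(f"Candidate inconsistency for {method}{params}:")
        print("  A:", json.dumps(a, sort_keys=True)[:200])
        print("  B:", json.dumps(b, sort_keys=True)[:200])

if __name__ == "__main__":
    # eth_getBlockByNumber is a standard Ethereum JSON-RPC method.
    compare_clients("eth_getBlockByNumber", ["0x1", False])
```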

CodeScout: Contextual Problem Statement Enhancement for Software Agents

The paper introduces CodeScout, a framework that enhances software agent performance by performing lightweight pre-exploration of codebases to convert underspecified user requests into comprehensive, actionable problem statements, resulting in a 20% improvement in resolution rates on the SWEBench-Verified benchmark.

Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, Varun Kumar | Mon, 09 Ma | cs.CL
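
A minimal sketch of the pre-exploration idea (not CodeScout's actual pipeline): scan the repository for files whose contents mention terms from the user's request and fold the hits into an enriched problem statement an agent can act on. The keyword heuristic and the output format are illustrative assumptions.

```python
import pathlib
import re

def explore_repo(repo_root: str, request: str, max_hits: int = 5) -> list[str]:
    """Return paths of source files that mention terms from the user request."""
    terms = {t.lower() for t in re.findall(r"[A-Za-z_]{4,}", request)}
    hits = []
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(t) for t in terms)
        if score:
            hits.append((score, str(path)))
    return [p for _, p in sorted(hits, reverse=True)[:max_hits]]

def enrich(request: str, repo_root: str) -> str:
    """Build an enriched problem statement from the lightweight exploration."""
    files = explore_repo(repo_root, request)
    context = "\n".join(f"- {f}" for f in files) or "- (no obvious candidates found)"
    return f"Task: {request}\n\nLikely relevant files:\n{context}"

if __name__ == "__main__":
    print(enrich("Fix the crash in the config parser when the file is empty", "."))
```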

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

ReflexiCoder is a novel reinforcement learning framework that internalizes structured self-reflection and self-correction capabilities into an LLM's weights, enabling it to autonomously generate, debug, and optimize code without external feedback while achieving state-of-the-art performance and improved token efficiency across multiple benchmarks.

Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim | Mon, 09 Ma | cs.LG
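
One plausible ingredient of such a training setup, sketched here purely as a guess rather than the paper's recipe, is an outcome-based reward: the model's full trajectory (draft code, self-reflection, corrected code) is scored by running only the final program against unit tests, so reflection and correction pay off only when they actually fix the code. The trajectory format and test harness below are hypothetical.

```python
import subprocess
import sys
import tempfile

def reward_from_trajectory(final_code: str, test_code: str, timeout_s: float = 15.0) -> float:
    """Hypothetical reward: 1.0 if the model's final, self-corrected code passes the tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Earlier drafts and reflection text in the trajectory are not executed;
        # only the final program plus the tests is scored.
        f.write(final_code + "\n\n" + test_code)
        script = f.name
    try:
        proc = subprocess.run([sys.executable, script], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(reward_from_trajectory(code, tests))
```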

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox is a lightweight, container-free framework that leverages kernel-level isolation and environment pre-caching to significantly reduce storage and setup overhead while maintaining performance comparable to traditional container-based pipelines for scaling reinforcement learning in software engineering agents.

Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao | Mon, 09 Ma | cs.AI
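
The pre-caching half of the idea can be sketched without any container tooling: key a virtual environment on a hash of the project's dependency list and reuse it whenever the same dependencies recur, so repeated rollouts skip the expensive setup step. Paths, the hashing choice, and the venv-based mechanism are illustrative assumptions, not the framework's implementation.

```python
import hashlib
import pathlib
import subprocess
import sys

CACHE_ROOT = pathlib.Path.home() / ".swe_env_cache"  # hypothetical cache location

def cached_env(requirements: str) -> pathlib.Path:
    """Return a virtualenv keyed by the dependency list, creating it only on a cache miss."""
    key = hashlib.sha256(requirements.encode()).hexdigest()[:16]
    env_dir = CACHE_ROOT / key
    if not env_dir.exists():
        subprocess.run([sys.executable, "-m", "venv", str(env_dir)], check=True)
        pip = env_dir / "bin" / "pip"  # POSIX layout; Windows puts pip under Scripts\
        subprocess.run([str(pip), "install", *requirements.split()], check=True)
    return env_dir

if __name__ == "__main__":
    env = cached_env("pytest")  # a second call with the same deps is a cache hit
    print("environment ready at", env)
```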

Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents

This paper presents a comprehensive survey of 178 benchmarks for Code Large Language Models and Agents through a tiered Software Development Life Cycle (SDLC) framework, revealing a significant imbalance that heavily favors the implementation phase while neglecting requirements and design, as well as critical gaps in anti-contamination strategies, and calls for future research to bridge the gap between theoretical capabilities and practical effectiveness.

Kaixin Wang, Tianlin Li, Xiaoyu Zhang, Chong Wang, Weisong Sun, Yang Liu, Aishan Liu, Xianglong Liu, Chao Shen, Bin Shi | Mon, 09 Ma | cs.AI

LTLGuard: Formalizing LTL Specifications with Compact Language Models and Lightweight Symbolic Reasoning

LTLGuard is a modular framework that enables resource-efficient open-weight language models (4B–14B parameters) to generate correct and conflict-free Linear Temporal Logic (LTL) specifications from informal requirements by combining constrained generation with lightweight symbolic reasoning for iterative consistency checking and refinement.

Medina Andresel, Cristinel Mateis, Dejan Nickovic, Spyridon Kounoupidis, Panagiotis Katsaros, Stavros Tripakis | Mon, 09 Ma | cs.AI
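
To make the consistency-checking goal concrete, here is a small generic example of the kind of conflict such a check must catch (not taken from the paper): three individually reasonable requirements whose LTL formalizations are jointly unsatisfiable.

```latex
% R1: every request is eventually granted.
\varphi_1 = \mathbf{G}\,(\mathit{request} \rightarrow \mathbf{F}\,\mathit{grant})
% R2: grants are never issued.
\varphi_2 = \mathbf{G}\,\neg\mathit{grant}
% R3: some request eventually occurs.
\varphi_3 = \mathbf{F}\,\mathit{request}
% Each formula is satisfiable on its own, but
% \varphi_1 \wedge \varphi_2 \wedge \varphi_3 is unsatisfiable:
% R3 forces a request, R1 then forces a grant, and R2 forbids any grant.
```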

Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent

This paper introduces Tool-Genesis, a diagnostic benchmark designed to evaluate and quantify the capabilities of self-evolving language agents in autonomously creating and utilizing tools from abstract requirements, revealing that even state-of-the-art models struggle with interface precision and logic execution, which leads to significant downstream performance degradation.

Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo | Mon, 09 Ma | cs.AI
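
The "interface precision" failure mode can be illustrated with a tiny check (an assumption about the general shape of such an evaluation, not the benchmark's actual harness): compare the signature of an agent-created tool against the parameter names the task specification requires.

```python
import inspect

def interface_matches(tool, required_params: list[str]) -> bool:
    """Return True if the generated tool exposes exactly the required parameters, in order."""
    actual = list(inspect.signature(tool).parameters)
    return actual == required_params

# Hypothetical task spec: a currency converter taking (amount, source, target).
REQUIRED = ["amount", "source", "target"]

def generated_tool(amount: float, src: str, target: str) -> float:
    """An agent-created tool whose second parameter name drifts from the spec."""
    return amount  # placeholder logic

if __name__ == "__main__":
    print(interface_matches(generated_tool, REQUIRED))  # False: 'src' != 'source'
```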

EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

The paper introduces EigenData, a self-evolving multi-agent platform that automates the synthesis, auditing, and repair of high-quality function-calling training data, demonstrating its effectiveness by systematically correcting the Berkeley Function-Calling Leaderboard (BFCL-V3) to achieve model rankings that better correlate with human judgments of functional correctness.

Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei-Chen Wang, Hanrui Wang, Di Jin | Mon, 09 Ma | cs.AI
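
A minimal sketch of what auditing a function-calling example can look like (the schema format and rules here are assumptions, not EigenData's agents): check that the recorded call names a known tool, supplies every required argument, and passes no arguments outside the schema.

```python
def audit_call(call: dict, tools: dict[str, dict]) -> list[str]:
    """Return a list of problems found in one function-calling training example."""
    problems = []
    schema = tools.get(call.get("name", ""))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    for p in schema["required"]:
        if p not in args:
            problems.append(f"missing required argument: {p}")
    for p in args:
        if p not in schema["parameters"]:
            problems.append(f"unexpected argument: {p}")
    return problems

if __name__ == "__main__":
    tools = {"get_weather": {"parameters": ["city", "unit"], "required": ["city"]}}
    bad_example = {"name": "get_weather", "arguments": {"location": "Paris"}}
    print(audit_call(bad_example, tools))  # flags missing 'city' and unexpected 'location'
```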

Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents

This paper proposes "Traversal-as-Policy," a framework that distills sandboxed execution logs into verifiable Gated Behavior Trees to replace implicit LLM policies with explicit, state-conditioned macro traversals, thereby significantly improving success rates, eliminating safety violations, and reducing computational costs across diverse autonomous agent benchmarks.

Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, Zhengzhong Tu | Mon, 09 Ma | cs.AI
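
A minimal gated behavior tree can be sketched as below to show what "traversal as policy" means structurally: each step of the agent is an explicit node, and gates (state-conditioned checks) decide which branch runs, so no LLM call is needed at decision time. The node types and the example tree are illustrative, not the paper's distilled trees.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict

@dataclass
class Action:
    name: str
    run: Callable[[State], bool]  # returns success (True) or failure (False)

    def tick(self, state: State) -> bool:
        return self.run(state)

@dataclass
class Gate:
    """State-conditioned guard: the child runs only when the condition holds."""
    condition: Callable[[State], bool]
    child: "Action | Sequence"

    def tick(self, state: State) -> bool:
        return self.condition(state) and self.child.tick(state)

@dataclass
class Sequence:
    """Runs children in order; fails fast on the first failure."""
    children: list = field(default_factory=list)

    def tick(self, state: State) -> bool:
        return all(child.tick(state) for child in self.children)

if __name__ == "__main__":
    state: State = {"logged_in": False}
    policy = Sequence([
        Gate(lambda s: not s["logged_in"],
             Action("log_in", lambda s: s.update(logged_in=True) or True)),
        Gate(lambda s: s["logged_in"],
             Action("fetch_report", lambda s: True)),
    ])
    print("success:", policy.tick(state))
```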