Human-AI Collaboration for Scaling Agile Regression Testing: An Agentic-AI Teammate from Manual to Automated Testing

This paper presents a retrieval-augmented, multi-agent AI system, developed in partnership with Hacon, that accelerates agile regression testing by automatically generating executable scripts from validated specifications. The approach significantly increases throughput and reduces manual effort, while underscoring the continued need for human oversight and clear requirements in quality assurance.

Moustapha El Outmani, Manthan Venkataramana Shenoy, Ahmad Hatahet, Andreas Rausch, Tim Niklas Kniep, Thomas Raddatz, Benjamin King · Tue, 10 Ma · cs

An explainable hybrid deep learning-enabled intelligent fault detection and diagnosis approach for automotive software systems validation

This paper proposes a novel explainable hybrid deep learning framework that combines 1D-CNN and GRU architectures with interpretability techniques such as Integrated Gradients (IG) and SHAP to enhance fault detection, diagnosis, and root cause analysis in automotive software system validation, overcoming the limitations of traditional black-box models.

Mohammad Abboush, Ehab Ghannoum, Andreas Rausch · Tue, 10 Ma · cs

The Effect of Code Obfuscation on Human Program Comprehension

This study investigates how varying levels of code obfuscation affect human program comprehension in Python and JavaScript, revealing that while obfuscation generally increases reasoning time and reduces accuracy, its impact is non-monotonic and language-specific. Moderate deliberation improves performance, and experience within a specific language proves more critical than experience across languages.

Anh H. N. Nguyen, Jack Le, Ilse Lahnstein Coronado, Tien N. Nguyen · Tue, 10 Ma · cs

AgentRaft: Automated Detection of Data Over-Exposure in LLM Agents

This paper introduces AgentRaft, an automated framework that combines program analysis and semantic reasoning to detect and quantify the systemic risk of Data Over-Exposure in LLM agents, demonstrating high accuracy and efficiency across thousands of real-world tools.

Yixi Lin (Sun Yat-sen University, Zhuhai, Guangdong, China), Jiangrong Wu (Sun Yat-sen University, Zhuhai, Guangdong, China), Yuhong Nan (Sun Yat-sen University, Zhuhai, Guangdong, China), Xueqiang Wang (University of Central Florida, Orlando, Florida, USA), Xinyuan Zhang (Sun Yat-sen University, Zhuhai, Guangdong, China), Zibin Zheng (Sun Yat-sen University, Zhuhai, Guangdong, China) · Tue, 10 Ma · cs

On the Effectiveness of Code Representation in Deep Learning-Based Automated Patch Correctness Assessment

This paper presents the first extensive study evaluating over 500 models to demonstrate that graph-based code representations consistently outperform other methods in predicting patch correctness, thereby significantly improving the effectiveness of automated program repair tools.

Quanjun Zhang, Chunrong Fang, Haichuan Hu, Yuan Zhao, Weisong Sun, Yun Yang, Tao Zheng, Zhenyu Chen · Tue, 10 Ma · cs

Empathy in Software Engineering Education: Evidence, Practices, and Opportunities

This systematic review of 43 studies reveals that while empathy is increasingly recognized as a vital capability for software engineers, its integration into education remains fragmented, prompting a call to evolve empathy from a peripheral soft skill into a structured, measurable pedagogical component to enhance collaboration, ethics, and inclusive design.

Matheus de Morais Leca, Kim Johnston, Ronnie de Souza Santos · Tue, 10 Ma · cs

Regression Testing in Remote and Hybrid Software Teams: An Exploratory Study of Processes, Tools, and Practices

This study investigates how remote and hybrid work environments reshape regression testing practices, revealing that while core phases remain stable, successful execution increasingly relies on automation, documentation, and standardized tooling to overcome communication challenges and support asynchronous collaboration.

Juliane Pascoal, Cleytton Magalhaes, Ronnie de Souza Santos · Tue, 10 Ma · cs

Do Deployment Constraints Make LLMs Hallucinate Citations? An Empirical Study across Four Models and Five Prompting Regimes

This empirical study demonstrates that deployment-motivated prompting constraints significantly exacerbate citation hallucinations across four large language models: no model achieves a citation existence rate above 47.5%, and a substantial share of unverifiable outputs are outright fabrications, underscoring the critical need for post-hoc verification in academic and software engineering contexts.

Chen Zhao, Yuan Tang, Yitian Qian · Tue, 10 Ma · cs

Exploring the Reasoning Depth of Small Language Models in Software Architecture: A Multidimensional Evaluation Framework Towards Software Engineering 2.0

This study benchmarks ten small language models on architectural decision record generation to establish a multidimensional evaluation framework, revealing that models exceeding 3 billion parameters excel at zero-shot reasoning while sub-2-billion-parameter models benefit most from fine-tuning. Few-shot prompting effectively calibrates mid-sized models, though high semantic diversity often correlates with hallucinations.

Ha Vo, Nhut Tran, Khang Vo, Phat T. Tran-Truong, Son Ha · Tue, 10 Ma · cs

Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

This paper presents a comprehensive taxonomy of faults in agentic AI systems, derived from a large-scale empirical study of 13,602 issues and validated by 145 practitioners. It categorizes 37 distinct fault types along with their symptoms and root causes, revealing critical propagation patterns and mismatches between probabilistic LLM outputs and deterministic system constraints.

Mehil B Shah, Mohammad Mehdi Morovati, Mohammad Masudur Rahman, Foutse Khomh · Tue, 10 Ma · cs

Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors

This paper demonstrates that spectral metrics, derived via natural language processing of requirements specifications, can predict integration effort with high accuracy (correlations above 0.95), offering a validated method to quantify structural complexity at the requirements stage and bridging the gap between architectural analysis and requirements engineering.

Maximilian Vierlboeck, Antonio Pugliese, Roshanak Nilchian, Paul Grogan, Rashika Sugganahalli Natesh Babu · Tue, 10 Ma · cs.CL