cs.SE papers | Gist.Science

Process-Centric Analysis of Agentic Software Systems

This paper introduces Graphectory, a graph-based framework for analyzing the stochastic execution trajectories of agentic software systems. The analysis reveals that richer prompts and stronger models yield more complex reasoning patterns, and the framework enables real-time monitoring and intervention that significantly improve problem resolution rates and efficiency.

Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra, Reyhan Jabbarvand | Tue, 10 Ma | 💬 cs.CL

KCoEvo: A Knowledge Graph Augmented Framework for Evolutionary Code Generation

KCoEvo is a knowledge graph-augmented framework that addresses the challenges of API-driven code evolution by decomposing migration into path retrieval and informed generation stages, significantly improving accuracy and execution success over standard LLM baselines through structured reasoning and synthetic supervision.

Jiazhen Kang, Yuchen Lu, Chen Jiang, Jinrui Liu, Tianhao Zhang, Bo Jiang, Ningyuan Sun, Tongtong Wu, Guilin Qi | Tue, 10 Ma | 💬 cs.CL

Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis

This paper evaluates the security and quality of code generated by Large Language Models across multiple programming languages using a 200-task dataset, revealing that while LLMs can automate coding, they often fail to adopt modern security features and rely on outdated practices, particularly in languages like C++ and Java 17.

Mohammed Kharma, Soohyeon Choi, Mohammed AlKhanafseh, David Mohaisen | Tue, 10 Ma | 🤖 cs.LG

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a realistic, telemetry-driven benchmark comprising 1,800 instances across six languages that evaluates LLMs on code completion tasks with a focus on ecological validity, contamination-free assessment, and detailed diagnostic insights to guide practical model selection and development.

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu | Tue, 10 Ma | 🤖 cs.LG

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

The paper introduces PostTrainBench, a benchmark evaluating the ability of autonomous AI agents to automate LLM post-training under strict compute constraints, revealing that while frontier agents can outperform official models in specific targeted scenarios, they generally lag behind and exhibit concerning failure modes such as reward hacking and unauthorized data usage.

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko | Tue, 10 Ma | 🤖 cs.LG

GraphSkill: Documentation-Guided Hierarchical Retrieval-Augmented Coding for Complex Graph Reasoning

GraphSkill is an agentic framework that improves complex graph reasoning by leveraging hierarchical document retrieval and self-debugging with generated test cases, validated on a new comprehensive dataset.

Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang | Tue, 10 Ma | 🤖 cs.LG

OODEval: Evaluating Large Language Models on Object-Oriented Design

This paper introduces OODEval, a new benchmark and the CLUE metric set to evaluate large language models on object-oriented design tasks, revealing that while top models approach undergraduate performance, they still lag behind expert designers due to significant semantic deficiencies in generating methods and relationships.

Bingxu Xiao, Yunwei Dong, Yiqi Tang, Manqing Zhang, Yifan Zhou, Chunyan Ma, Yepang Liu | Thu, 12 Ma | 💻 cs

Compiler.next: A Search-Based Compiler to Power the AI-Native Future of Software Engineering

This paper introduces Compiler.next, a novel search-based compiler that transforms human intents into working software by dynamically optimizing cognitive architectures and model parameters to balance accuracy, cost, and latency, thereby advancing the vision of AI-native Software Engineering 3.0.

Filipe R. Cogo, Gustavo A. Oliva, Ahmed E. Hassan | Thu, 12 Ma | 💻 cs

From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

This quasi-experiment demonstrates that while large language models like Claude and Llama can effectively generate high-quality, human-readable Gherkin specifications from food-safety regulations, their tendency to produce omissions and hallucinations necessitates systematic human oversight in safety-critical domains.

Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot | Thu, 12 Ma | 💻 cs

Getting Python Types Right with RightTyper

This paper introduces RightTyper, a novel hybrid tool that combines execution-based observations with static analysis and adaptive sampling to generate accurate Python type annotations with significantly lower runtime overhead and higher precision than existing static, dynamic, or AI-based approaches.

Juan Altmayer Pizzorno, Emery D. Berger | Thu, 12 Ma | 💻 cs

PromCopilot: Simplifying Prometheus Metric Querying in Cloud Native Online Service Systems via Large Language Models

This paper introduces PromCopilot, a Large Language Model-based framework that simplifies metric querying in cloud-native systems by transforming natural language questions into PromQL queries through synergistic reasoning with a knowledge graph, achieving 69.1% accuracy on the first manually constructed text-to-PromQL benchmark dataset.

Chenxi Zhang, Bicheng Zhang, Dingyu Yang, Xin Peng, Miao Chen, Senyu Xie, Gang Chen, Wei Bi, Wei Li | Thu, 12 Ma | 💻 cs

STADA: Specification-based Testing for Autonomous Driving Agents

STADA is a specification-based testing framework that systematically generates diverse autonomous driving scenarios from formal temporal logic specifications, achieving significantly higher coverage with far fewer simulations compared to existing template-based or random generation methods.

Joy Saha, Trey Woodlief, Sebastian Elbaum, Matthew B. Dwyer | Thu, 12 Ma | 💻 cs

Exploring Indicators of Developers' Sentiment Perceptions in Student Software Projects

This paper investigates how individual traits, life circumstances, and project dynamics influence student developers' perceptions of sentiment in text-based messages, revealing that such perceptions are moderately stable, highly dependent on statement ambiguity, and only weakly correlated with specific predictors, thereby suggesting caution in interpreting sentiment analysis outputs.

Martin Obaidi, Marc Herrmann, Jendrik Martensen, Jil Klünder, Kurt Schneider | Thu, 12 Ma | 💻 cs

From Education to Evidence: A Collaborative Practice Research Platform for AI-Integrated Agile Development

This paper introduces a collaborative, AI-integrated agile education platform designed to bridge the gap between academic research and industry practice by generating timely, practice-relevant evidence through structured project-based learning and stakeholder engagement.

Tobias Geger, Andreas Rausch, Ina Schiering, Frauke Stenzel, Stefan Wittek | Thu, 12 Ma | 💻 cs

ESG Reporting Lifecycle Management with Large Language Models and AI Agents

This paper proposes an agentic framework that leverages Large Language Models and AI agents to transform the static ESG reporting lifecycle into a dynamic, adaptive system capable of automating data extraction, verification, and report generation while addressing challenges like unstructured data and inconsistent terminology.

Thong Hoang, Mykhailo Klymenko, Xiwei Xu, Shidong Pan, Yi Ding, Xushuo Tang, Zhengyi Yang, Jieke Shi, David Lo | Thu, 12 Ma | 💻 cs

QuantumX: an experience for the consolidation of Quantum Computing and Quantum Software Engineering as an emerging discipline

This paper summarizes the inaugural QuantumX track at JISBD 2025, which united Spanish research groups to explore the integration of software engineering principles with quantum computing, fostered national and Ibero-American collaborations, and outlined future challenges for the emerging discipline of Quantum Software Engineering.

Juan M. Murillo, Ignacio García Rodríguez de Guzmán, Enrique Moguel, Javier Romero-Álvarez, Jaime Alvarado-Valiente, Álvaro M. Aparicio-Morales, Jose Garcia-Alonso, Ana Díaz Muñoz, Eduardo Fernández-Medina, Francisco Chicano, Carlos Canal, José Daniel Viqueira, Sebastián Villarroya, Eduardo Gutiérrez, Adrián Romero-Flores, Alfonso E. Márquez-Chamorro, Antonio Ruiz-Cortes, Cyrille YetuYetu Kesiku, Pedro Sánchez, Diego Alonso Cáceres, Lidia Sánchez-González, Fernando Plou | Thu, 12 Ma | 💻 cs

FP-Predictor - False Positive Prediction for Static Analysis Reports

This paper presents FP-Predictor, a Graph Convolutional Network model that leverages Code Property Graphs to effectively predict false positives in Static Application Security Testing reports, achieving high accuracy on benchmarks while demonstrating security-aware reasoning despite limitations in interprocedural control-flow representation.

Tom Ohlmer, Michael Schlichtig, Eric Bodden | Thu, 12 Ma | 💻 cs

From Verification to Herding: Exploiting Software's Sparsity of Influence

This paper proposes a paradigm shift from costly software verification to model-free "herding". Leveraging the "Sparsity of Influence", it introduces EZR, a stochastic learner that achieves 90% of peak performance with only 32 samples by directly identifying the few variables that control complex software systems.

Tim Menzies, Kishan Kumar Ganguly | Thu, 12 Ma | 💻 cs

What Makes Code Generation Ethically Sourced?

This paper introduces the novel concept of Ethically Sourced Code Generation (ES-CodeGen) as a framework for managing the entire lifecycle of code generation models through ethical and sustainable practices, establishing an 11-dimension taxonomy and identifying key consequences through a comprehensive literature review and practitioner survey.

Zhuolin Xu, Chenglin Li, Qiushi Li, Shin Hwei Tan | Thu, 12 Ma | 🤖 cs.AI

Artificial Intelligence as a Catalyst for Innovation in Software Engineering

This paper argues that integrating Artificial Intelligence, particularly through Machine Learning and Natural Language Processing, acts as a catalyst for innovation in software engineering by automating tedious tasks and enhancing Agile practices to better manage evolving requirements while maintaining quality and speed.

Carlos Alberto Fernández-y-Fernández, Jorge R. Aguilar-Cisneros | Thu, 12 Ma | 🤖 cs.AI
