DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice

This paper introduces DUCTILE, an agentic LLM orchestration framework that separates adaptive decision-making from deterministic tool execution to automate engineering analysis in product development, successfully handling input deviations in an aerospace case study while highlighting the emerging tension between task automation and the creation of exhausting supervisory roles.
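
The core split the summary describes, adaptive decision-making on top of deterministic tool execution, can be sketched as a tiny orchestration loop. This is an illustrative sketch only, not DUCTILE's actual code; the tool names and state fields are invented for the example, and the `decide_next_tool` function stands in for the LLM's adaptive choice.

```python
from typing import Callable, Dict, Optional

# Deterministic tools: pure functions over the analysis state.
# "mesh" and "solve" are hypothetical stand-ins for CAE steps.
TOOLS: Dict[str, Callable[[dict], dict]] = {
    "mesh": lambda state: {**state, "meshed": True},
    "solve": lambda state: {**state, "stress": 42.0},
}

def decide_next_tool(state: dict) -> Optional[str]:
    # Stand-in for the adaptive (LLM) decision layer:
    # pick the next unmet step, or None when done.
    if not state.get("meshed"):
        return "mesh"
    if "stress" not in state:
        return "solve"
    return None

def run_analysis(state: dict) -> dict:
    # Orchestration loop: adaptive choice, deterministic execution.
    while (tool := decide_next_tool(state)) is not None:
        state = TOOLS[tool](state)
    return state
```

Keeping the tools deterministic means deviations are absorbed only in the decision layer, which is what makes the execution side auditable.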

Alejandro Pradas-Gomez, Arindam Brahma, Ola Isaksson · Thu, 12 Ma · cs.AI

Building Privacy-and-Security-Focused Federated Learning Infrastructure for Global Multi-Centre Healthcare Research

This paper introduces FLA³, a governance-aware federated learning platform that integrates authentication, authorization, and accounting mechanisms to enable secure, privacy-preserving, and regulatory-compliant multi-center healthcare research, demonstrating its operational feasibility and clinical utility across international institutions while achieving predictive performance comparable to centralized training.
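
The "comparable to centralized training" claim rests on standard federated aggregation: each centre trains locally and shares only model parameters. A minimal federated-averaging (FedAvg) sketch, which is the common baseline and not necessarily FLA³'s exact aggregation rule, looks like this:

```python
# Minimal FedAvg sketch: weighted average of per-centre parameter
# vectors, weighted by local dataset size. Patient data never leaves
# the centre; only the weight vectors are communicated.
def fed_avg(client_weights, client_sizes):
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            global_w[i] += (n / total) * w[i]
    return global_w
```

The governance layer the paper adds (authentication, authorization, accounting) wraps around exactly this exchange, deciding which centres may contribute updates and logging who did.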

Fan Zhang, Daniel Kreuter, Javier Fernandez-Marques, BloodCounts Consortium, Gregory Verghese, Bernard Butler, Nicholas Lane, Suthesh Sivapalaratnam, Joseph Taylor, Norbert C. J. de Wit, Nicholas S. Gleadall, Carola-Bibiane Schönlieb, Michael Roberts · Thu, 12 Ma · cs

SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation

This paper introduces Agentic AIBOMs, a multi-agent framework that extends static Software Bills of Materials (SBOMs) with autonomous, policy-constrained reasoning to dynamically capture runtime behavior and environmental drift, thereby enhancing supply-chain security through reproducible, context-aware vulnerability assessment and minimal schema extensions to existing standards.
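
The "minimal schema extensions" idea can be illustrated by decorating a conventional SBOM component record with a runtime block. The field names below are purely illustrative assumptions, not the paper's actual schema:

```python
# A static SBOM component, as a plain record.
sbom_component = {
    "name": "requests",
    "version": "2.32.0",
}

# Hypothetical agentic extension: a monitoring agent appends what it
# observed at runtime, so drift becomes part of the bill of materials.
aibom_component = {
    **sbom_component,
    "agentic": {                      # illustrative extension block
        "observed_at_runtime": True,
        "environment_drift": ["openssl 3.0 -> 3.2"],
    },
}
```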

Petar Radanliev, Carsten Maple, Omar Santos, Kayvan Atefi · Thu, 12 Ma · cs.AI

Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

This paper presents and evaluates five prompt engineering strategies for reducing LLM hallucinations in industrial settings without modifying model weights, finding that an Enhanced Data Registry (M4) achieved perfect consistency in initial trials while a revised Decomposed Model-Agnostic Prompting (M2) showed the most significant improvement in subsequent verification.
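
A data-registry strategy in this family grounds the model in a verified fact store and instructs it to refuse beyond it. The sketch below is a generic illustration of that pattern, with an invented registry key; it is not the paper's M4 implementation:

```python
# Verified fact store; the key/value here is a made-up example.
REGISTRY = {"pump_a_max_rpm": "3600"}

def build_prompt(question: str) -> str:
    # Ground the model in registry facts and constrain it to refuse
    # rather than hallucinate when the facts are insufficient.
    facts = "\n".join(f"- {k}: {v}" for k, v in REGISTRY.items())
    return (
        "Answer using ONLY the facts below; "
        "say 'unknown' if they are insufficient.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}"
    )
```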

Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon · Thu, 12 Ma · cs.AI

One Model, Many Skills: Parameter-Efficient Fine-Tuning for Multitask Code Analysis

This paper presents the first comprehensive evaluation of parameter-efficient fine-tuning (PEFT) for multitask code analysis, demonstrating that a single shared PEFT module can match or surpass full fine-tuning performance while significantly reducing computational and storage costs, provided that tasks are strategically grouped based on factors like complementarity and stability.
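
The cost argument behind PEFT is easy to see with a back-of-envelope count: a LoRA-style adapter reparameterizes a weight update as `B @ A` with small rank `r`, so it trains `d*r + r*k` parameters instead of `d*k`. This is a generic LoRA arithmetic sketch, not the paper's specific configuration:

```python
def lora_param_counts(d: int, k: int, r: int):
    """Trainable-parameter count: full fine-tuning vs a rank-r adapter."""
    full = d * k            # update the whole d x k weight matrix
    adapter = d * r + r * k # low-rank factors B (d x r) and A (r x k)
    return full, adapter

# For a 4096 x 4096 layer at rank 8:
# 16,777,216 full parameters vs 65,536 adapter parameters (256x fewer).
full, adapter = lora_param_counts(4096, 4096, 8)
```

This is also why a single shared module across grouped tasks is attractive: the storage cost per task collapses from a full model copy to one small adapter.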

Amal Akli, Maxime Cordy, Mike Papadakis, Yves Le Traon · Thu, 12 Ma · cs

UniCoR: Modality Collaboration for Robust Cross-Language Hybrid Code Retrieval

UniCoR is a novel self-supervised framework that addresses the challenges of insufficient semantic understanding, inefficient modality fusion, and weak cross-language generalization in hybrid code retrieval by employing multi-perspective supervised contrastive learning and representation distribution consistency, thereby achieving state-of-the-art performance on both empirical and large-scale benchmarks.
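
The contrastive-learning core can be illustrated with a toy InfoNCE-style loss: an anchor embedding (say, a query) is pulled toward its matching code embedding and pushed away from mismatched ones. This is an illustration of the general technique, not UniCoR's actual multi-perspective loss:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    # InfoNCE: -log( exp(sim+/tau) / (exp(sim+/tau) + sum exp(sim-/tau)) )
    pos = math.exp(dot(anchor, positive) / tau)
    negs = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + negs))
```

A well-aligned pair yields a near-zero loss, while a mismatched pair is penalized heavily, which is the pressure that fuses the text and code modalities into one retrieval space.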

Yang Yang, Li Kuang, Jiakun Liu, Zhongxin Liu, Yingjie Xia, David Lo · Mon, 09 Ma · cs

ROS-related Robotic Systems Development with V-model-based Application of MeROS Metamodel

This paper proposes a structured methodology that integrates the Robot Operating System (ROS) with Model-Based Systems Engineering (MBSE) through a specialized SysML metamodel called MeROS and an adapted V-model, aiming to enhance the semantic coherence, structural traceability, and reliable coordination of complex heterogeneous robotic systems.

Tomasz Winiarski, Jan Kaniuka, Daniel Giełdowski, Jakub Ostrysz, Krystian Radlak, Dmytro Kushnir · Mon, 09 Ma · cs

Story Point Estimation Using Large Language Models

This study demonstrates that large language models can effectively predict software story points without training data or with only a few examples, outperforming traditional supervised deep learning models, while also finding that comparative judgments, though not inherently easier to predict, can serve as effective few-shot examples to further enhance estimation accuracy.
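
Few-shot story point estimation amounts to showing the model previously estimated issues before the new one. The template below is an assumption for illustration, not the paper's exact prompt format:

```python
def story_point_prompt(examples, new_issue):
    """Build a few-shot prompt from (issue_text, points) pairs.

    The trailing 'Story points:' leaves the estimate for the model
    to complete.
    """
    shots = "\n".join(
        f"Issue: {text}\nStory points: {points}"
        for text, points in examples
    )
    return f"{shots}\nIssue: {new_issue}\nStory points:"
```

The finding that comparative judgments make good few-shot examples would correspond here to choosing `examples` that bracket the new issue's difficulty rather than picking them at random.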

Pranam Prakash Shetty, Adarsh Balakrishnan, Mengqiao Xu, Xiaoyin Xi, Zhe Yu · Mon, 09 Ma · cs