cs.SE papers | Gist.Science

DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

The paper introduces DIVE, an evidence-driven framework that prioritizes executing diverse real-world tools before reverse-deriving tasks to ensure grounding and structural variety, which significantly enhances the out-of-distribution generalization of tool-using LLMs compared to traditional quantity-focused scaling.

Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao | Fri, 13 Ma | 🤖 cs.AI

Quantum Computing for All: Online Courses Built Around Interactive Visual Quantum Circuit Simulator

This paper presents an online course that utilizes an interactive quantum circuit simulator with immediate feedback and automated evaluation to lower the entry barrier and make quantum computing accessible to students from diverse backgrounds without prior physics knowledge.

Juha Reinikainen, Vlad Stirbu, Teiko Heinosaari + 2 more | 2026-03-11 | ⚛️ quant-ph

Dance of the ADS: Orchestrating Failures through Historically-Informed Scenario Fuzzing

This paper introduces ScenarioFuzz, a novel scenario-based fuzzing methodology that leverages historical test data, map networks, and graph neural networks to autonomously generate and optimize high-risk scenarios, significantly reducing testing time while uncovering numerous safety-critical bugs in autonomous driving systems.

Tong Wang, Taotao Gu, Huan Deng + 3 more | 2026-03-11 | 🤖 cs.AI

Exploration of Evolving Quantum Key Distribution Network Architecture Using Model-Based Systems Engineering

This paper proposes a variability-driven systems engineering framework using Orthogonal Variability Modelling and Systems Modelling Language to systematically model, trace, and evolve Quantum Key Distribution network architectures, thereby addressing the challenges of integrating complex quantum devices into existing classical infrastructure to meet future security needs.

Hayato Ishida, Amal Elsokary, Maria Aslam + 3 more | 2026-03-10 | ⚛️ quant-ph

LAMBDA: A Large Model Based Data Agent

LAMBDA is a novel, open-source, code-free multi-agent system that leverages large language models with collaborative programmer and inspector roles, along with a knowledge integration mechanism and user intervention capabilities, to enhance the accessibility and efficiency of data analysis for diverse users.

Maojun Sun, Ruijian Han, Binyan Jiang + 4 more | 2026-03-10 | 🤖 cs.AI

A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models

This paper proposes a generic evaluation framework to assess large language models' ability to generate code for constraint-based domain-specific languages like OCL and Alloy, revealing that while performance is generally lower than for general-purpose languages like Python, strategies such as code repair and multiple generation attempts can significantly improve output quality.

David Delgado, Lola Burgueño, Robert Clarisó | 2026-03-06 | 💻 cs

A Benchmarking Framework for Model Datasets

This paper proposes a Benchmark Platform for Model-Driven Engineering that provides a unified framework to systematically assess and compare the quality, representativeness, and suitability of software model datasets, thereby addressing issues of reproducibility, bias, and result comparability in current research.

Philipp-Lorenz Glaser, Lola Burgueño, Dominik Bork | 2026-03-06 | 💻 cs

Why Do You Contribute to Stack Overflow? Understanding Cross-Cultural Motivations and Usage Patterns before the Age of LLMs

This study investigates cross-cultural differences in Stack Overflow contributor motivations across the US, China, and Russia by combining qualitative profile analysis with quantitative linguistic data, revealing distinct regional patterns, such as stronger self-promotion among Americans and learning-oriented engagement among Chinese users, that can inform strategies for sustaining the knowledge-sharing ecosystem in the age of LLMs.

Sherlock A. Licorish, Elijah Zolduoarrati, Tony Savarimuthu + 3 more | 2026-03-06 | 💻 cs

Auto-Generating Personas from User Reviews in VR App Stores

This study presents an auto-generated persona system derived from VR app store reviews that effectively facilitates accessibility requirements elicitation and enhances student empathy in VR design courses.

Yi Wang, Kexin Cheng, Xiao Liu + 4 more | 2026-03-06 | 💻 cs

Public Sector Open Source Program Offices - Archetypes for how to Grow (Common) Institutional Capabilities

This study identifies six distinct archetypes of Open Source Programme Offices (OSPOs) within European public sector organizations through a qualitative analysis of 16 cases, providing strategic guidance and policy recommendations for designing institutional capabilities that foster OSS adoption, digital sovereignty, and improved service interoperability.

Johan Linåker, Astor Nummelin Carlberg, Ciaran O'Riordan | 2026-03-06 | 💻 cs

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

This paper introduces FireBench, a new open-source benchmark comprising over 2,400 real-world enterprise and API-driven samples across six capability dimensions, designed to evaluate and improve instruction following in LLMs beyond traditional chat-based constraints.

Yunfan Zhang, Yijie Bei, Jetashree Ravi + 1 more | 2026-03-06 | 💬 cs.CL

RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

The paper introduces RepoLaunch, an agent that automates the build and test pipeline for code repositories across any language and platform, thereby enabling fully automated creation of software engineering datasets for benchmarking and training coding agents.

Kenan Li, Rongzhi Li, Linghao Zhang + 17 more | 2026-03-06 | 🤖 cs.LG

MOOSEnger -- a Domain-Specific AI Agent for the MOOSE Ecosystem

MOOSEnger is a domain-specific AI agent that combines retrieval-augmented generation with deterministic, MOOSE-aware parsing and execution tools to automatically convert natural language into validated simulation inputs, achieving a 93% execution success rate on a diverse benchmark compared to just 8% for an LLM-only baseline.

Mengnan Li, Jason Miller, Zachary Prince + 2 more | 2026-03-06 | 💻 cs

Behaviour Driven Development Scenario Generation with Large Language Models

This paper evaluates GPT-4, Claude 3, and Gemini on a proprietary dataset of 500 user stories to generate Behaviour-Driven Development scenarios, finding that while GPT-4 excels in text similarity, Claude 3 produces the highest quality results according to human and LLM-based experts, with optimal performance dependent on model-specific prompting strategies, high-quality input descriptions, and specific generation parameters.

Amila Rathnayake, Mojtaba Shahin, Golnoush Abaei | 2026-03-06 | 💻 cs

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

This paper introduces Vibe Code Bench, a novel benchmark featuring 100 web application specifications evaluated by autonomous browser agents, which reveals that even the best frontier models achieve only 58.0% accuracy on end-to-end development tasks and highlights self-testing and evaluator alignment as critical factors for success.

Hung Tran, Langston Nashold, Rayan Krishnan + 2 more | 2026-03-06 | 💻 cs

Industrial Survey on Robustness Testing In Cyber Physical Systems

This paper presents findings from an industrial survey conducted in Wallonia that assesses current practices, challenges, and gaps in Cyber-Physical Systems robustness testing across various sectors, comparing industry realities with state-of-the-art methodologies.

Christophe Ponsard, Abiola Paterne Chokki, Jean-François Daune | 2026-03-06 | 💻 cs

CLARC: C/C++ Benchmark for Robust Code Search

The paper introduces CLARC, a robust C/C++ code search benchmark featuring over 6,700 query-code pairs and challenging evaluation settings like identifier anonymization and compilation to low-level languages, which reveals that current state-of-the-art models rely heavily on lexical features rather than true semantic understanding.

Kaicheng Wang, Liyan Huang, Weike Fang + 1 more | 2026-03-06 | 💻 cs

iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation

This paper introduces iScript, a domain-adapted Qwen3-8B model and its corresponding benchmark for generating reliable Innovus Tcl scripts, which leverages a novel multi-stage data synthesis pipeline and a two-step verification framework to overcome data scarcity and outperform state-of-the-art LLMs in physical design automation.

Ning Xu, Zhaoyang Zhang, Senlin Shu + 10 more | 2026-03-06 | 💻 cs

MPBMC: Multi-Property Bounded Model Checking with GNN-guided Clustering

This paper proposes MPBMC, a hybrid approach that leverages Graph Neural Network embeddings and runtime design statistics to functionally cluster properties, thereby significantly accelerating multi-property Bounded Model Checking verification on HWMCC benchmarks compared to state-of-the-art methods.

Soumik Guha Roy, Sumana Ghosh, Ansuman Banerjee + 2 more | 2026-03-06 | 💻 cs

LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

LoRA-MME is a parameter-efficient multi-model ensemble that combines LoRA-tuned UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa encoders to achieve strong code comment classification performance, though its high computational cost ultimately limited its final competition score.

Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid + 2 more | 2026-03-06 | 💻 cs
