Imagine you are a judge at a massive cooking competition. Every year, thousands of chefs submit their recipes and claim, "My dish is delicious, and anyone can make it!"
In the world of cybersecurity research, these "chefs" are scientists, the "recipes" are their computer code and data, and the "judges" are volunteer reviewers. The goal is Artifact Evaluation (AE): checking if the science is real, if the code actually works, and if the results can be repeated by others.
The Problem:
Right now, this process is a nightmare.
- Too many submissions: Thousands of papers arrive, but there are only a few tired human judges.
- The "Kitchen" is messy: Every recipe requires different ingredients (software versions, special hardware, obscure libraries). Setting up a kitchen to test one recipe can take a human judge hours or days.
- The result: Many fake or broken recipes slip through, or good ones get rejected just because the judge was too exhausted to set up the kitchen.
The Solution: The AI Sous-Chef
This paper introduces a new toolkit powered by Large Language Models (LLMs)—think of them as super-smart, tireless AI assistants. Instead of replacing the human judges, this AI acts as a "Sous-Chef" that does the heavy lifting before the judge ever sees the dish.
The toolkit has three main jobs, which the authors call RATE, PREPARE, and ASSESS.
1. RATE: The "Smell Test"
- What it does: Before trying to cook anything, the AI reads the recipe (the paper) and the instruction manual (the "README" file). It asks: "Does this look like a recipe that can actually be followed?"
- How it works: The AI looks for "concept vectors"—essentially, it learns what a "good, reproducible recipe" sounds like versus a "vague, impossible one."
- The Analogy: It's like a food critic sniffing a dish through the window. If the smell is off, they tell the judge, "Don't bother going in there; this recipe is a mess."
- Success: It correctly identifies about 95% of the recipes that can be cooked, filtering out the junk early so judges don't waste time.
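For the technically curious, the "smell test" can be sketched in a few lines of Python. Everything here is illustrative: the bag-of-words "embedding," the exemplar phrases, and the zero threshold are stand-ins for the learned concept vectors the paper describes, not the actual method.

```python
# A minimal sketch of concept-vector scoring. The toy bag-of-words
# "embedding" and the exemplar phrases below are assumptions, standing
# in for the real learned representations.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical exemplars: what reproducible vs. vague READMEs "sound like".
REPRODUCIBLE = embed("install dependencies run script docker pinned "
                     "versions dataset included exact commands")
VAGUE = embed("code available on request results may vary "
              "contact authors for details")

def rate(readme: str) -> str:
    """Flag a README as promising or problematic before any setup work."""
    score = cosine(embed(readme), REPRODUCIBLE) - cosine(embed(readme), VAGUE)
    return "promising" if score > 0 else "problematic"
```

A README closer to the "reproducible" exemplar than the "vague" one passes the smell test; everything else gets flagged for the judge before anyone spends time in the kitchen.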
2. PREPARE: The "Auto-Kitchen"
- What it does: For the recipes that passed the smell test, the AI tries to actually build the kitchen and cook the dish. It downloads the code, installs the software, and runs the program inside a safe, isolated box (a "sandbox").
- How it works: The AI acts like a robot butler. It reads the instructions, types commands into the computer, and if it hits an error (like "missing ingredient"), it tries to fix it automatically.
- The Analogy: Imagine a robot that not only reads your IKEA furniture instructions but also builds the entire shelf for you. If a screw is missing, the robot tries to find a workaround.
- Success: It successfully set up and ran about 28% of the recipes that were supposed to work. For the rest, it left a detailed note saying, "I tried, but I got stuck here. Here's the error log." Even these partial notes save the human judge hours of trial-and-error.
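The robot-butler loop above can be sketched as a retry loop in Python. This is a simplified stand-in: `run_in_sandbox` here is just a plain subprocess call rather than a real isolated sandbox, and the `fixes` lookup table plays the role of the LLM proposing workarounds.

```python
# A minimal sketch of the PREPARE loop: run each setup step, and on
# failure try a known fix or record an error log for the human judge.
# run_in_sandbox and the fixes table are illustrative stand-ins.
import subprocess

def run_in_sandbox(cmd: list[str]) -> subprocess.CompletedProcess:
    """Stand-in for an isolated sandbox; here just a subprocess call."""
    return subprocess.run(cmd, capture_output=True, text=True)

def prepare(steps, fixes, max_retries=2):
    """Run setup steps; on failure, try a fix, else return the error log."""
    log = []
    for cmd in steps:
        for attempt in range(max_retries + 1):
            result = run_in_sandbox(cmd)
            if result.returncode == 0:
                log.append(f"ok: {' '.join(cmd)}")
                break
            # Ask the "fixer" (the LLM in the real system) for a workaround,
            # keyed here on the error output for simplicity.
            fix = fixes.get(result.stderr.strip())
            if fix is None or attempt == max_retries:
                log.append(f"stuck: {' '.join(cmd)}: {result.stderr.strip()}")
                return False, log
            cmd = fix  # retry with the proposed workaround
    return True, log
```

Whether it finishes or gets stuck, the loop hands the judge a log of exactly what was attempted, which is the "detailed note" described above.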
3. ASSESS: The "Logic Check"
- What it does: Even if the code runs, the science might still be flawed. Maybe the chef only tested the dish on a tiny sample of people, or they cheated by looking at the answers beforehand. This stage checks for these "methodological pitfalls."
- How it works: The AI scans the paper for common tricks used in bad science, like "Sampling Bias" (testing on only one type of person) or "Base Rate Fallacy" (misinterpreting how common a threat is).
- The Analogy: This is like a food safety inspector checking if the chef used expired ingredients or if the "delicious" taste was just because they added too much sugar to hide the bad meat.
- Success: It detected these hidden flaws with over 92% accuracy.
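As a rough sketch of the logic-check idea, here is a rule-based screener in Python. The signal phrases are invented for illustration; in the real system an LLM, not keyword matching, judges whether a pitfall is present.

```python
# A minimal sketch of pitfall screening. The substring heuristics below
# are assumptions standing in for the LLM's judgment; only the pitfall
# names (sampling bias, base rate fallacy) come from the description above.
def assess(paper_text: str) -> list[str]:
    """Return a list of suspected methodological pitfalls."""
    text = paper_text.lower()
    flags = []
    # Sampling bias: evaluation restricted to one narrow population.
    if "only" in text and ("students" in text or "one dataset" in text):
        flags.append("sampling bias")
    # Base rate fallacy: accuracy reported with no mention of how
    # common the threat actually is.
    if "accuracy" in text and "base rate" not in text \
            and "class imbalance" not in text:
        flags.append("base rate fallacy")
    return flags
```

The output is a list of warnings attached to the paper, so the judge walks in already knowing where the "expired ingredients" might be hiding.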
The Big Picture
By combining these three steps, the toolkit creates a pipeline that:
- Filters out the trash (RATE).
- Prepares the working dishes for the judges (PREPARE).
- Warns the judges about hidden tricks (ASSESS).
Why does this matter?
Currently, the process of verifying science is slow and relies on overworked volunteers. This AI toolkit acts as a force multiplier. It doesn't replace the human expert's judgment, but it clears the path so humans can focus on the quality of the science rather than the tedious setup of the code.
The Bottom Line:
Just as a smart kitchen appliance helps a chef cook faster and more consistently, this AI toolkit helps the scientific community verify cybersecurity research faster, more reliably, and with less burnout. It ensures that when we say a security tool works, it actually works in the real world, not just in a broken simulation.