Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Imagine the world of Artificial Intelligence safety is like a high-stakes game of Cat and Mouse.

The Cats are the AI models (like ChatGPT), trained to be helpful but harmless.
The Mice are the "Jailbreak" researchers, constantly inventing new, clever ways to trick the cats into breaking their rules (like writing a guide on how to build a bomb).

The Problem: The "Snapshot" Trap

Until now, trying to measure how safe these AI cats are has been like taking a photograph of a moving target.

The Lag: By the time a researcher publishes a new "mouse trick" (a jailbreak paper) and another team manually figures out how to build it to test it, the mouse has already invented a new trick. The test is outdated before it even starts.
The Translation Error: Every time a new team tries to rebuild a "mouse trick" from a research paper, they might misunderstand a detail. It's like trying to bake a cake from a recipe written in a foreign language; one wrong translation, and the cake falls flat. You can't tell if the cake failed because the recipe was bad or because you baked it wrong.
The Messy Kitchen: Every research paper has its own unique kitchen setup (different code, different tools). Comparing results across papers is like trying to compare the speed of two cars when one is driving on a track and the other is driving through a muddy field.

The Solution: Jailbreak Foundry (JBF)

The authors of this paper built Jailbreak Foundry (JBF). Think of this not as a single tool, but as a high-tech, automated factory that turns "paper ideas" into "working prototypes" instantly and consistently.

Here is how the factory works, using a simple analogy:

1. The Blueprint (JBF-LIB)

Imagine a massive, standardized LEGO baseplate.

In the past, every researcher built their own custom baseplate with different holes and shapes.
JBF-LIB is a single, perfect baseplate that everyone agrees to use. It holds all the common parts (how to talk to the AI, how to log results, how to count successes) so researchers don't have to reinvent the wheel every time.

2. The Robot Builders (JBF-FORGE)

This is the magic part. Instead of a human spending weeks reading a paper and writing code, JBF uses a team of AI robots to do the heavy lifting:

The Planner Robot: Reads the research paper (the blueprint) and breaks it down into a step-by-step instruction manual.
The Coder Robot: Builds the actual "mouse trap" (the attack code) using the instruction manual and the standard LEGO baseplate.
The Inspector Robot: Checks the new trap against the original paper to make sure it works exactly as described. If it's slightly off, it sends it back to the Coder to fix.

The Result: They can turn a new research paper into a working, testable attack in about 28 minutes (instead of weeks), with almost zero human error.

3. The Standardized Test Track (JBF-EVAL)

Once the robots build the traps, they are all tested on a single, perfectly flat race track.

Every "mouse trick" is tested against the same 10 different "AI Cats" (models).
The same referee (a judge AI) watches every test to decide if the cat broke the rules.
This allows for a fair, "apples-to-apples" comparison. You can finally see which model is truly the safest, regardless of which "mouse trick" is being used.

Why This Matters

The paper tested this system on 30 different jailbreak attacks. Here is what they found:

It Works: The system reproduced the original results with incredible accuracy (almost 100% match).
It Saves Time: It cut the amount of code researchers needed to write by 42%.
It Reveals Truths: Because they could test everything fairly, they discovered that some AI models are safe against some tricks but terrible against others. For example, a model might be super strong against "word puzzles" but weak against "role-playing stories."

The Big Picture

Jailbreak Foundry turns AI safety testing from a slow, messy, manual process into a living, breathing system.

Instead of looking at a static photo of safety, we now have a live video feed. As soon as a new "mouse" appears, the factory can build a trap for it and test it immediately. This helps us keep our AI cats safe in a world where the mice are constantly evolving.

In short: It's the difference between trying to catch a speeding bullet with a net you built yesterday, versus having a robot factory that instantly builds a perfect net the moment the bullet is fired.

Here is a detailed technical summary of the paper "Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking".

1. Problem Statement

The landscape of Large Language Model (LLM) safety is characterized by a rapid evolution of jailbreak attacks that outpaces the development of evaluation benchmarks. Current research suffers from three critical issues:

Stale Benchmarks: Reported robustness metrics quickly become outdated as new attacks emerge and older ones are retuned.
Evaluation Drift: Different papers use varying datasets, victim models, decoding settings, and judging protocols, making cross-paper comparisons unreliable ("apples-to-oranges").
Manual Integration Bottleneck: Existing frameworks rely on manual engineering to translate paper descriptions into executable code. This process is slow (weeks/months), error-prone, and lacks fidelity, as it depends on individual engineers' interpretations of the text.

2. Methodology: Jailbreak Foundry (JBF)

The authors propose Jailbreak Foundry (JBF), an automated system designed to translate jailbreak research papers into executable, standardized attack modules. JBF operates through a multi-agent workflow consisting of three core components:

A. JBF-LIB (Unified Framework Core)

A shared Python library that defines stable contracts and reusable utilities.

Contracts: Defines base-class interfaces (ModernBaseAttack), I/O schemas, and typed parameter hooks to ensure all attacks conform to a standard structure.
Utilities: Provides common runtime features such as prompt formatting, request/response normalization, caching, logging, and provider-agnostic LLM adapters.
Goal: To abstract away the "glue code" (harness, dataset loading, judging) so that new attacks only need to implement the specific algorithmic logic.

B. JBF-FORGE (Paper-to-Module Translation)

An automated multi-agent pipeline that converts paper descriptions into runnable JBF-LIB modules. It employs three specialized agents in an iterative loop:

Planner ( $\pi$ ): Analyzes the paper (and official reference repositories if available) to generate a structured implementation plan. It extracts algorithms, prompts, templates, and control flows, mapping them to the JBF-LIB contract.
Coder ( $\kappa$ ): Synthesizes the Python code based on the planner's specification. It implements the attack logic, exposes typed parameters, and avoids embedding evaluation logic within the attack module itself.
Auditor ( $\alpha$ ): Performs a static, line-referenced audit of the generated code against the plan and the contract. It checks for missing components, semantic deviations, and parameter mismatches.
- Bounded Loop: The system runs a check-revise loop (up to $T$ iterations). If the auditor rejects the code, it provides actionable feedback for the coder to patch.
- Enhanced Refinement: For cases with significant performance gaps (undershoots), a secondary "Enhanced Refinement Pass" uses a long-running agent (Claude Code) to perform deep gap analysis and apply targeted patches.

C. JBF-EVAL (Standardized Benchmarking)

A standardized evaluation harness that executes the generated modules.

Uniformity: It fixes the dataset (e.g., AdvBench), execution protocols, and judging criteria (using a consistent GPT-4o judge) across all attacks.
Scalability: It supports batch sweeps across multiple victim models, generating comparable metrics (Attack Success Rate - ASR) and artifacts (logs, heatmaps) for cross-model analysis.

3. Key Contributions

Automated Paper-to-Code Translation: JBF-FORGE successfully translates jailbreak papers into runnable modules with high fidelity, achieving a mean ASR deviation of only +0.26 percentage points from reported results across 30 reproduced attacks.
Significant Code Compression & Reuse: By leveraging shared infrastructure (JBF-LIB), the system reduces attack-specific implementation code by 42% compared to original repositories. The integrated codebase achieves an 82.5% reuse ratio of shared framework code, with only 17.5% being attack-specific logic.
Living Benchmarks: JBF enables the creation of "living benchmarks" that can be updated in minutes (mean synthesis time: 28.2 minutes) rather than months, allowing for continuous, longitudinal evaluation of LLM safety.
Comprehensive Multi-Model Analysis: The system evaluated 30 attacks across 10 diverse victim models (including GPT-4o, Claude-3.5, LLaMA-3, and Qwen3), revealing that model robustness is highly attack-dependent and that "broad" robustness scores often hide specific, severe blind spots.

4. Experimental Results

Fidelity: Across 30 attacks (22 with official code, 8 from text only), the mean deviation between reproduced and reported ASR was +0.26%. Large negative deviations were rare (only 2 attacks had $\Delta < -10\%$ ).
Efficiency: The system reduced the engineering burden significantly. For attacks with official code, the integrated modules were often much smaller (e.g., DeepInception reduced from 536 to 101 LOC) because JBF-LIB handled the scaffolding.
Impact of Reference Repositories: Access to official runnable repositories significantly improved fidelity for complex, scaffold-heavy attacks (e.g., SATA-MLM saw a +34.8% ASR improvement with code vs. text-only), primarily by resolving implicit defaults and control flow details.
Cross-Model Insights:
- Model Vulnerability: GPT-3.5-Turbo and GPT-4o showed consistently high vulnerability across diverse attacks, while GPT-OSS-120B showed high average robustness but specific, severe blind spots to certain carrier formats (e.g., Mousetrap).
- Carrier Sensitivity: Formal wrappers (code/structured formats) were generally the most effective carrier class, but their success varied wildly by model (e.g., highly effective on GPT-5.1, nearly ineffective on GPT-OSS-120B).

5. Significance and Future Work

Paradigm Shift: JBF moves LLM safety evaluation from static, manual snapshots to dynamic, automated, and reproducible systems. It solves the "reproducibility crisis" in jailbreak research by ensuring that new attacks can be immediately integrated and compared against a consistent baseline.
Dual-Use Consideration: The authors acknowledge the dual-use nature of the tool: while it accelerates safety research, it also lowers the barrier for operationalizing known attacks. They advocate for responsible deployment.
Future Directions: The authors propose extending JBF into a continuously updating pipeline that automatically ingests new papers, maintains versioned releases with regression tests, and integrates defense synthesis to create two-dimensional attack-defense heatmaps.

In summary, Jailbreak Foundry provides a scalable, automated infrastructure that bridges the gap between theoretical jailbreak research and practical, reproducible security benchmarking, enabling the community to keep pace with the rapidly shifting LLM security landscape.