Here is an explanation of the paper "JULI: Jailbreaking Large Language Models by Self-Introspection," translated into simple language with creative analogies.
The Big Picture: The "Unbreakable" Vault That Isn't
Imagine you have a very smart, very polite librarian (the AI Model). This librarian has been trained with a strict rulebook: "Never tell anyone how to build a bomb, hack a bank, or write hate speech." If you ask, "How do I make a bomb?" the librarian immediately says, "I'm sorry, I can't do that."
For a long time, security experts thought this librarian was unbreakable if you couldn't see their internal brain (the model weights). Most "hacks" required either:
- Stealing the brain: Having access to the librarian's private notes (which you can't do with commercial AI like Gemini or ChatGPT).
- Rewriting the rulebook: Fine-tuning the librarian to ignore the rules (which requires deep access).
- Confusing the librarian: Using long, weird riddles to trick them into forgetting the rules.
The Problem: These old tricks are slow, clunky, or require access you don't have.
The New Hack: JULI (The "Whispering Assistant")
The authors of this paper, Jesson Wang and Zhanhao Hu, discovered a new way to break the librarian's rules. They call their method JULI.
Think of JULI not as a hammer smashing the door, but as a tiny, invisible whispering assistant standing right next to the librarian while they speak.
How It Works (The Analogy)
The "Top 5" Secret: When the librarian is about to speak, they don't just pick one word. In their mind, they have a shortlist of the next few words they could say.
- Normal: They think: "Sorry" (90% chance), "I" (5%), "Can't" (5%).
- The Hack: The commercial API (the interface you use) lets you peek at this shortlist. It returns the top 5 candidate words and how likely each one is to be chosen (technically, their log-probabilities).
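The shortlist idea can be sketched in a few lines of Python. This is a toy simulation, not a real API call: the scores are invented to match the librarian example, and the softmax-then-truncate step mimics what a top-5 log-probability response exposes.

```python
import math

def top_k_shortlist(logits, vocab, k=5):
    """Turn raw next-word scores (logits) into the kind of top-k
    shortlist a commercial API exposes: the k most likely tokens
    and their probabilities."""
    # Softmax: convert raw scores into a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most likely tokens, like a top-5 API response.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented scores matching the librarian example above.
vocab = ["Sorry", "I", "Can't", "Sure"]
logits = [4.0, 1.1, 1.1, -2.0]
for token, prob in top_k_shortlist(logits, vocab, k=3):
    print(f"{token}: {prob:.2f}")
```

Note that the full vocabulary has four words here, but the caller only ever sees the top three, exactly the restriction JULI works under.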
The "BiasNet" (The Whispering Assistant):
- The authors built a small, cheap neural network called BiasNet. It's like a robot with a very small brain (less than 1% of the size of the big librarian).
- This robot watches the librarian's shortlist.
- When the librarian is about to say "Sorry," the robot whispers, "Psst! Actually, 'Sure' is a much better word right now. Let's boost its score!"
- The robot doesn't know how to make a bomb. It just knows which words to push up and which words to push down based on patterns it learned from 100 examples of bad behavior.
The Result:
- The librarian looks at the shortlist again. Because the robot tweaked the scores, "Sure" is now the #1 choice.
- The librarian says, "Sure, here is how you make a bomb..."
- The robot then whispers again for the next word, and the next, guiding the entire sentence down a dangerous path.
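The whisper-and-re-rank step above can be sketched directly. In the paper the nudges come from BiasNet, a small learned network; in this toy version a hand-written dictionary stands in for its output, and all scores are invented.

```python
def whisper(top_k, bias):
    """Re-rank the model's shortlist after adding a per-token nudge.
    `top_k` is a list of (token, score) pairs from the API; `bias`
    maps token -> nudge. A hand-written dict stands in here for the
    learned BiasNet, which would produce these nudges at every step."""
    rescored = [(tok, score + bias.get(tok, 0.0)) for tok, score in top_k]
    return max(rescored, key=lambda pair: pair[1])[0]

# The librarian's own shortlist: the refusal word leads...
shortlist = [("Sorry", -0.1), ("I", -3.0), ("Sure", -6.0)]
# ...until a small push down on "Sorry" and up on "Sure" flips it.
nudges = {"Sorry": -10.0, "Sure": +10.0}
print(whisper(shortlist, nudges))   # prints "Sure"
print(whisper(shortlist, {}))       # with no whisper, still "Sorry"
```

In the real attack this re-ranking runs once per generated word, so the whole response gets steered token by token.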
Why Is This Scary? (The "Self-Introspection" Part)
The most chilling part of this paper is the realization that the librarian already knows the answer.
The authors found that even when the librarian refuses to answer, the "thoughts" behind the refusal (the next-word probabilities) still contain the dangerous information.
- Analogy: Imagine a person who refuses to tell you a secret. But if you look at their face, you can see their eyes darting to the left (where the secret is) before they look away. The secret is still in their brain; they are just choosing not to speak it.
- JULI is the tool that forces the librarian to look at the secret and speak it, even though they were trained not to.
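The "eyes darting" idea can be phrased as a simple check: even on a step where the refusal word wins, the compliant word may still sit in the shortlist with nonzero probability, which is the signal JULI exploits. All numbers below are invented for illustration.

```python
def secret_still_there(top_k, compliant="Sure"):
    """Check whether a compliant token still appears in the shortlist
    even though the top-ranked token is a refusal word, i.e. the model
    'knows' the other continuation but chooses not to say it."""
    refusing = top_k[0][0] in {"Sorry", "I", "Can't"}
    lurking = any(tok == compliant and p > 0 for tok, p in top_k)
    return refusing and lurking

# An invented refusal step: "Sorry" dominates, yet "Sure" is on the list.
step = [("Sorry", 0.90), ("I", 0.05), ("Can't", 0.04), ("Sure", 0.01)]
print(secret_still_there(step))  # prints True
```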
The Results: Beating the Best
The paper tested this on some of the smartest, most secure AI models in the world, including Gemini 2.5 Pro (a very powerful model).
- Old Methods: Tried to trick the AI with riddles or long prompts. They failed or scored very low (around 1.3 out of 5 on a harmfulness scale).
- JULI: Successfully tricked the AI into giving detailed, harmful instructions. It scored a 4.19 out of 5 on harmfulness.
- Efficiency: While other methods took minutes or hours to generate one bad answer, JULI did it in less than a second.
The Takeaway for Everyone
This paper proves that safety training isn't a magic shield.
Even if an AI is trained to be "safe," its internal brain still contains all the dangerous knowledge it learned during its training. If you can peek at its "thought process" (the top few words it's considering) and nudge it just slightly, you can bypass the safety filters.
The Lesson: You can't just patch the AI's "mouth" (the refusal). You have to fix the "brain" (the underlying knowledge distribution) because the dangerous information is still there, waiting to be nudged out.
Summary in One Sentence
JULI is a tiny, cheap tool that tricks secure AI models into revealing their hidden, dangerous knowledge by subtly nudging their "next word" choices, proving that even the most guarded AI can be coaxed into breaking its own rules if you know how to whisper to its thoughts.