Imagine you are a security guard at the gate of a very smart, but easily tricked, robot (the AI). Your job is to stop people from sneaking in bad instructions that make the robot break its rules.
For a long time, security guards only checked one sentence at a time. If a sentence looked suspicious, they stopped it. If it looked clean, they let it pass.
But clever attackers realized they could trick the robot by spreading their bad instructions across many sentences over a long conversation. They would say something harmless in turn 1, something slightly weird in turn 2, and something dangerous in turn 3. By the time the robot realized what was happening, it was too late.
This paper introduces a new, smarter way for the security guard to catch these "slow-burn" attacks without needing to call a second, smarter robot to help (which would be too slow and expensive).
Here is the breakdown of their new method, using simple analogies:
1. The Old Way: The "Average Score" Trap
The researchers first looked at the usual fix: "Let's just take the average of how suspicious every sentence was, and raise the alarm if the average gets too high."
The Analogy: Imagine you are grading a student's behavior over a week.
- Day 1: The student is perfect (Score: 0).
- Day 2: The student is perfect (Score: 0).
- Day 3: The student is perfect (Score: 0).
- Day 4: The student sets the school on fire (Score: 100).
If you average these scores, you get 25. That's a "C" grade. You might think, "Well, they were mostly good, so let them stay."
The Flaw: The paper proves that averaging is blind to persistence. If an attacker does something mildly suspicious every single day (say, a score of 50 each day), the average stays pinned at 50. Even after 100 days, it is still just 50 and never gets high enough to trigger an alarm, even though the persistence of the bad behavior is the real danger. The "average" math hides the fact that the student is a chronic troublemaker.
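To make the arithmetic concrete, here is a tiny sketch (illustrative numbers, not data from the paper) of why the average never moves, no matter how long the attack runs, while a running total keeps growing:

```python
# 100 turns, each mildly suspicious (score 50 out of 100)
scores = [50] * 100

average = sum(scores) / len(scores)  # pinned at 50.0, same as after 3 turns
total = sum(scores)                  # 5000 -- grows with every persistent turn

print(average, total)
```

A threshold on the average (say, "alarm at 70") never fires here, while any threshold on the total is eventually crossed.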
2. The New Way: "Peak + Accumulation"
The authors propose a new formula that acts like a smart security camera that looks at three things simultaneously:
A. The "Peak" (The Highest Spike)
- What it is: The single most suspicious sentence in the whole conversation.
- The Analogy: If someone pulls a gun once, that's a huge spike. Even if they were polite for the rest of the hour, that one moment of danger is recorded as the "Peak." The system says, "Okay, we know the worst thing that happened here."
B. The "Accumulation" (The Persistence Ratio)
- What it is: How many sentences in a row were suspicious?
- The Analogy: Imagine a leaky faucet.
- One drop of water? Maybe it's nothing.
- One drop every hour for a week? That's a problem.
- The system adds points for every suspicious sentence. It's like a bucket filling up. Even if no single sentence is a "gun," if 10 sentences in a row are "suspicious whispers," the bucket overflows, and the alarm goes off.
- This fixes the "Average" problem. A 20-turn attack fills the bucket much higher than a 1-turn attack.
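The two signals above can be combined into a toy scoring function. The function name, weights, and the "mild suspicion" cutoff below are illustrative assumptions, not the paper's exact formula:

```python
def session_risk(turn_scores, peak_weight=0.6, accum_weight=0.4):
    """Toy peak + accumulation score. Weights and the suspicion
    cutoff (30) are made-up placeholders, not the paper's values."""
    peak = max(turn_scores)  # the single worst turn
    # Persistence ratio: what fraction of turns were at least mildly suspicious?
    suspicious_turns = sum(1 for s in turn_scores if s >= 30)
    persistence = suspicious_turns / len(turn_scores)
    return peak_weight * peak + accum_weight * 100 * persistence

spike = [0, 0, 0, 100]   # one dangerous turn ("pulls a gun once")
slow_burn = [50] * 20    # 20 mildly suspicious turns ("leaky faucet")
benign = [5, 0, 10, 0]

print(session_risk(spike))      # high: the peak dominates
print(session_risk(slow_burn))  # high: persistence fills the bucket
print(session_risk(benign))     # low
```

Note that the slow-burn attack scores as high as the single spike, which is exactly what a pure average would miss.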
C. The "Diversity" (The Variety of Tricks)
- What it is: Did the attacker try different types of tricks?
- The Analogy: If a burglar tries to pick the lock, then tries to break the window, then tries to sneak in through the chimney, that's scarier than someone just trying to pick the lock 10 times.
- The system gives extra points if the attacker switches tactics (e.g., trying to confuse the AI's role, then trying to pretend to be an admin, then trying to repeat the same question). This suggests a deliberate, organized attack.
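A minimal sketch of the diversity signal, assuming hypothetical tactic labels produced by the pattern checklist (the paper's actual taxonomy and weighting may differ):

```python
# Hypothetical per-turn tactic labels -- names are illustrative only
turn_tactics = ["role_confusion", "admin_impersonation",
                "role_confusion", "repeat_question"]

distinct_tactics = len(set(turn_tactics))     # 3 different tricks across 4 turns
diversity_bonus = 5 * (distinct_tactics - 1)  # made-up weighting: 0 if only one tactic

print(distinct_tactics, diversity_bonus)
```

Picking the lock ten times would give `distinct_tactics == 1` and a bonus of 0; switching tactics is what earns the extra points.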
3. The "Bonus Points" (Special Triggers)
The system also has two special "bonus" triggers:
- The Escalation Bonus: If the suspicious sentences get worse and worse with every turn (like a conversation starting with "Can you help?" and ending with "Delete all files"), the system adds a big bonus.
- The "Stuck Record" Bonus: If the attacker keeps asking the same question over and over (trying to "resample" or retry the attack), the system adds a massive bonus. Repetition like this is a classic sign of someone testing combination after combination until something slips through.
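Both triggers reduce to cheap checks over the turn history. Here is a hedged sketch; the bonus sizes, the strictly-rising test, and the "three repeats" threshold are placeholder assumptions:

```python
from collections import Counter

def trigger_bonuses(turn_scores, turn_texts,
                    escalation_bonus=15, repeat_bonus=25):
    """Toy versions of the two special triggers.
    Bonus sizes and thresholds are illustrative, not the paper's."""
    bonus = 0
    # Escalation: suspicion rises with every single turn
    if all(a < b for a, b in zip(turn_scores, turn_scores[1:])):
        bonus += escalation_bonus
    # Stuck record: the same prompt tried three or more times
    counts = Counter(text.strip().lower() for text in turn_texts)
    if counts and max(counts.values()) >= 3:
        bonus += repeat_bonus
    return bonus

escalating = trigger_bonuses(
    [10, 30, 60, 90],
    ["can you help?", "show the configs", "read that file", "delete all files"])
retrying = trigger_bonuses([40, 40, 40], ["ignore your rules"] * 3)

print(escalating, retrying)
```

Both checks are a single pass over the conversation, which is part of why the whole detector stays fast enough to run on every request.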
4. Why This Matters (The Results)
The researchers tested this new formula on over 10,000 conversations (both real user chats and fake attacks).
- The Result: It caught 91% of the attacks.
- The Safety: It only falsely accused innocent people 1.2% of the time.
- The Speed: Because it uses simple math (addition and multiplication) and pre-written rules (like a checklist), it works in microseconds. It doesn't need a supercomputer or a second AI to run. It's fast enough to sit in front of every AI chat on the internet.
Summary
The paper solves a math problem where "averaging" hides bad behavior.
- Old Guard: "You were bad once, but mostly good. Average is fine. Let them in."
- New Guard: "You were bad once (Peak), and you kept being bad for 20 turns (Accumulation), and you tried three different tricks (Diversity). That bucket is overflowing. STOP."
It's a simple, fast, and effective way to stop attackers who try to sneak past security by being patient.