Online LLM watermark detection via e-processes

This paper introduces a unified framework for online LLM watermark detection based on e-processes. The framework delivers anytime-valid statistical guarantees, boosts detection power through empirically adaptive methods, and extends to a range of sequential testing problems.

Weijie Su, Ruodu Wang, Zinan Zhao

Published Thu, 12 Ma

Imagine you are a librarian in a massive, chaotic library where millions of books are being written every second. Some are written by humans, but a new, powerful robot (the AI) has started writing books that look and sound exactly like human work. The problem? You can't tell them apart.

To solve this, the robot's creators decided to embed a secret watermark in every sentence the robot writes. It's like a hidden, invisible ink that only the robot knows how to use.

However, there's a catch: The robot writes these books live, one word at a time, in a continuous stream. Traditional methods of checking for the watermark are like waiting until the entire book is finished, then running a slow, complex lab test. If the robot is writing a novel, you might have to wait 10 hours to know if it's fake. By then, the damage (like spreading fake news) is already done.

This paper introduces a new, super-fast way to catch the robot while it's still writing.

The Core Idea: The "Magic Scorecard"

The authors, Su, Wang, and Zhao, propose a new statistical tool called an e-process. Think of this as a Magic Scorecard that updates itself with every single word the robot writes.

Here is how it works, using a simple analogy:

1. The Old Way: The "Fixed Exam"

Imagine a teacher giving a student a 100-question test. The teacher waits until the student finishes all 100 questions, then grades the paper.

  • The Problem: If the teacher checks the paper after question 10, then again after 20, then 30, they might accidentally find a "fake" pattern just by luck. This is called inflating the error rate.
  • The Risk: In the real world, if you keep checking the stream of text, you might falsely accuse a human writer of being a robot just because you looked too many times.

2. The New Way: The "Live Scorecard" (E-Process)

The authors' method is like a live scoreboard in a sports game.

  • Every time the robot writes a word, the scoreboard updates.
  • The scoreboard starts at 1.
  • If the word looks "normal" (like a human wrote it), the score stays low or goes down.
  • If the word has the "secret watermark" (the robot's signature), the score multiplies and goes up.
  • The Magic Rule: The authors prove mathematically that if the text is truly human-written, this score is very unlikely ever to climb high: the chance it ever exceeds, say, 100 is at most 1 in 100, no matter how long you watch or how many times you check. It's like a rigged game where the house wins whenever the player is honest.
  • The Trigger: As soon as the score hits a specific high number (say, 100), you can immediately shout, "Stop! This is a robot!" You don't have to wait for the whole book.
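The scorecard logic above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the function name, the `alpha` parameter, and the per-token e-values are all assumptions for demonstration purposes.

```python
def e_process(per_token_evalues, alpha=0.01):
    """Multiply per-token e-values into a running score; stop the first
    time the product crosses 1/alpha.  Returns (detected, tokens_seen)."""
    threshold = 1 / alpha  # e.g. alpha = 0.01 -> sound the alarm at 100
    score = 1.0            # the scoreboard starts at 1
    for t, e in enumerate(per_token_evalues, start=1):
        score *= e                 # the "scorecard" update, one word at a time
        if score >= threshold:
            return True, t         # safe to stop early and declare "robot"
    return False, len(per_token_evalues)
```

A stream of watermark-flavored e-values (each above 1) trips the alarm quickly, while neutral e-values never do, no matter how long the stream runs.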

Why is this better?

1. It's "Anytime" Valid (The "Stop-Anytime" Superpower)
Imagine you are watching a magic show. A traditional test says, "Wait until the magician finishes the whole show, then we check the cards." The new method says, "We can check the cards right now, or in 5 minutes, or after the next trick. No matter when we stop, the math guarantees we won't be fooled."
This is crucial for autonomous agents (AI robots that do tasks for us). If an AI agent is about to send a malicious email, you need to catch it before it hits "send," not after.
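The "check whenever you like" guarantee can be sanity-checked with a toy simulation. The `2 * U` e-value below is an invented stand-in (nonnegative, mean exactly 1 under the human-text null), not the paper's actual statistic; the point is that even peeking at every single word keeps false alarms below the budget.

```python
import random

def crosses(threshold, n_tokens, rng):
    """One simulated human-written stream: each per-token e-value is
    2*U with U ~ Uniform(0,1) -- nonnegative, with mean exactly 1."""
    score = 1.0
    for _ in range(n_tokens):
        score *= 2 * rng.random()     # continuous monitoring: check every word
        if score >= threshold:
            return True               # a false alarm on a human writer
    return False

rng = random.Random(0)
alpha = 0.1                           # allowed false-alarm rate
false_alarm_rate = sum(
    crosses(1 / alpha, 200, rng) for _ in range(2000)
) / 2000
# Ville's inequality: the chance the score EVER reaches 1/alpha is
# at most alpha, no matter how many times we peek along the way.
```

Contrast this with re-running a fixed-sample test at every word, which would blow well past the nominal error rate.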

2. It Handles the "Boring" Moments
Sometimes, an AI writes very simple, repetitive sentences (like "The cat sat on the mat"). In these moments, the "secret ink" is hard to see. Old methods get confused and might fail to detect the robot.
The new method is like a smart detective that adjusts its magnifying glass. If the text is boring, it waits. If the text gets interesting, it zooms in. It adapts to the situation, making it much harder for the robot to hide.
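One way to picture that adaptivity is a "betting" scorecard that only stakes money once past words have shown evidence. This is a toy sketch; the betting schedule `lam` is invented for illustration and is not the paper's rule. Under human text each pivotal statistic `s` is Uniform(0,1) with mean 0.5, so every factor has expectation 1 as long as the bet depends only on the past.

```python
def adaptive_score(stats):
    """Betting-style e-process: one factor 1 + lam*(s - 0.5) per token.
    lam is chosen from PAST tokens only, which preserves the guarantee."""
    score, seen = 1.0, []
    for s in stats:
        past_mean = sum(seen) / len(seen) if seen else 0.5
        # bet more when past tokens already looked watermarked;
        # cap at 1.9 so the factor stays positive for any s in [0, 1]
        lam = max(0.0, min(1.9, 4 * (past_mean - 0.5)))
        score *= 1 + lam * (s - 0.5)
        seen.append(s)
    return score
```

On "boring" stretches the statistics hover near 0.5, the factors stay near 1, and the score simply waits; once the text gets informative, the bets ramp up and the score compounds.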

3. It's Unbeatable (The "Admissible" Claim)
The paper proves a fascinating mathematical fact: among all possible ways to build this "Magic Scorecard," their construction is admissible, meaning no competing scorecard can beat it on both counts at once:

  • Fair: It keeps the chance of falsely accusing a human below the chosen limit (controls errors).
  • Sharp: It catches the robot as fast as possible (maximizes power).

It's like finding a key that fits the lock perfectly without breaking it.

The "Gumbel-Max" Secret Sauce

The paper focuses on a specific type of watermark called Gumbel-Max.

  • Analogy: Imagine the robot has a bag of marbles (words). Some marbles are "green" (good words) and some are "red" (bad words).
  • The Trick: The robot doesn't just pick a marble randomly. It adds a little bit of "chaos" (random noise) to the bag before picking.
  • The Watermark: The way the robot mixes the chaos is the secret. The "Magic Scorecard" knows exactly how to look for that specific pattern of chaos. If the pattern is there, the score goes up.
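In code, the Gumbel-max trick and a matching detector statistic look roughly like this. It is a toy sketch: `seeded_uniforms` stands in for the keyed hash of recent context that real schemes use, and `VOCAB`, `key`, and `step` are illustrative names, not the paper's notation.

```python
import math
import random

VOCAB = 20  # toy vocabulary size

def seeded_uniforms(key, step):
    """Shared pseudorandomness: generator and detector derive the SAME
    uniforms from the secret key (toy stand-in for a keyed hash)."""
    rng = random.Random(f"{key}:{step}")
    return [rng.random() for _ in range(VOCAB)]

def watermarked_next_token(probs, key, step):
    """Gumbel-max trick: pick argmax_i log p_i + g_i, with Gumbel noise
    g_i = -log(-log U_i) built from the shared uniforms."""
    u = seeded_uniforms(key, step)
    scores = [math.log(p) - math.log(-math.log(ui)) for p, ui in zip(probs, u)]
    return max(range(VOCAB), key=scores.__getitem__)

def pivotal_stat(token, key, step):
    """Detector side: re-derive the uniforms and read off U_token.
    Human text gives Uniform(0,1); watermarked text skews toward 1."""
    return seeded_uniforms(key, step)[token]
```

Crucially, the Gumbel noise leaves the robot's output distribution unchanged, yet a detector holding the key sees watermarked tokens land on suspiciously large uniforms word after word.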

Real-World Results

The authors tested this on real AI models (like OPT-1.3B).

  • Old Methods: When the classical tests were re-run word by word over the stream, they cried "Wolf!" (raised false alarms) far too often.
  • New Method: It stayed calm. It only raised the alarm when it was actually a robot, and it did so faster than the old methods, even when the robot was trying to write very simple, repetitive text.

Summary

This paper gives us a real-time radar for AI text. Instead of waiting for the whole story to finish to see if it's fake, we can check the story as it's being written. The math guarantees that we won't get tricked by false alarms, and the system is smart enough to catch the robot even when it's trying to be subtle.

It's the difference between waiting for a criminal to finish a heist and catching them while they are still picking the lock.