Log Probability Tracking of LLM APIs

Imagine you hire a master chef to cook a specific dish for your restaurant every day. You pay them through an API (a digital menu), and you expect the "Gourmet Burger" to taste exactly the same every time you order it. This consistency is crucial: if the recipe changes without you knowing, your regular customers might get sick, or your food critics might give you bad reviews for a dish you didn't actually serve.

However, in the world of Large Language Models (LLMs)—the AI chefs of today—providers often change the recipe. They might tweak the ingredients (fine-tuning), swap the stove for a cheaper one (hardware changes), or even secretly add a new spice (backdoors). The problem is, nobody knows when these changes happen because checking the taste is too expensive and slow.

This paper introduces a clever, cheap, and super-sensitive way to catch these changes. Here is the breakdown:

1. The Problem: The "Taste Test" is Too Expensive

Previously, to check if an AI changed, researchers had to ask it thousands of complex questions (like "Write a poem about a cat") and compare the answers.

The Analogy: Imagine trying to detect if a chef changed the salt in their soup by ordering a full 5-course meal every hour. It costs a fortune in food, takes forever to eat, and you still might miss a tiny pinch of salt.
The Result: Because it's so expensive, most people just assume the AI stays the same, even when it doesn't.

2. The Solution: Listening to the "Whispers" (Log Probabilities)

The authors realized that when an AI generates a word, it doesn't just pick one word; it calculates a "confidence score" (called a log probability) for every word in its dictionary before picking one.

The Analogy: Imagine the chef is about to shout "Salt!" but before they do, you can hear them whispering a list of all the ingredients they were considering: "Salt... Pepper... Sugar... Salt... Salt..."
The Catch: These whispers aren't perfectly consistent. Sometimes the chef is a little tired, or the kitchen is noisy, so the whisper fluctuates slightly. This is called non-determinism.

3. The Trick: The "Single Token" Whisper

The authors developed a method called Log Prob Tracking (LT). Instead of asking the AI for a whole essay, they ask it for just one single word (or even just a single letter like "x").

How it works: They ask the AI for that one word, listen to the "whispers" (the log probabilities) of what it almost said, and record the average. They do this thousands of times.
The Magic: Even though the whispers fluctuate slightly due to noise, the average pattern of whispers is unique to that specific version of the AI. If the chef changes the recipe (even just a tiny bit, like one step of training), the pattern of whispers shifts.
The Result: They can detect a change as small as one single step of fine-tuning (a microscopic recipe tweak) by asking just one question.

4. The "TinyChange" Benchmark

To prove this works, they created a new test called TinyChange.

The Analogy: Imagine they took a perfect cake and made 58 slightly different versions: one with a tiny pinch less sugar, one with a slightly different oven temperature, one with a tiny bit of flour removed.
The Test: They challenged their method against other expensive methods.
The Winner: The "Single Token Whisper" method was 1,000 times cheaper and 100 to 1,000 times more sensitive than the old methods. It could spot the "pinch of sugar" change that the others missed entirely.

5. Real-World Detective Work

The team didn't just test this in a lab; they used it to spy on 189 real AI APIs for four months.

The Discovery: They found 37 hidden changes.
The Shock: Even with "Open Weight" models (where the code is public and people expect stability), providers were quietly changing the models. It's like a restaurant claiming they use a "standard, open recipe" but secretly swapping the brand of flour every Tuesday night.

Why This Matters

For Developers: You can now know if your AI app suddenly started acting weird because the provider changed the model.
For Researchers: You can trust that your experiments are reproducible.
For Security: It helps catch "backdoors" or malicious changes before they cause harm.

In a nutshell: This paper teaches us that we don't need to eat the whole meal to know if the chef changed the recipe. We just need to listen to the chef's nervous whispers before they speak, and we can do it for the price of a single crumb.

1. Problem Statement

Large Language Model (LLM) API providers promise version-pinned endpoints to ensure consistency for downstream applications, research reproducibility, and regulatory compliance. However, in practice, providers frequently modify models (via fine-tuning, quantization, or prompt changes) or infrastructure (hardware, inference kernels) without transparent notification.

The Gap: Existing audit methods for detecting these changes are prohibitively expensive. They typically require processing thousands of tokens across many queries to achieve statistical significance, making continuous, real-time monitoring infeasible for most users.
The Challenge: While LLMs return log probabilities (logprobs) alongside tokens, these values are non-deterministic in production environments due to temperature sampling, batch-level variations, and hardware routing. This non-determinism makes simple equality checks impossible.

2. Methodology: Log Probability Tracking (LT)

The authors propose Log Probability Tracking (LT), a method that exploits the continuous nature of logprobs to detect minute model changes with minimal cost.

Core Intuition

Instead of analyzing the discrete output tokens (which are sparse and noisy), LT analyzes the log probabilities of the first generated token. Even though the specific token sampled may vary, the underlying probability distribution (encoded in logprobs) shifts systematically when the model weights or inference stack change.

The Algorithm

Input: A fixed, short prompt (as short as a single token, e.g., "x").
Sampling: The auditor sends $N$ requests to the target API, requesting the logprobs for the top- $k$ tokens of the response.
Handling Non-Determinism:
- Since logprobs fluctuate between identical requests, the method treats each logprob as a sample from a distribution.
- Imputation: If a token appears in some samples but not others (due to top- $k$ truncation), its missing logprob is conservatively imputed using the minimum logprob value observed in that specific sample.
Statistical Test:
- The method computes the mean logprob for every observed token across $N$ samples for two APIs (or the same API at different times).
- It calculates a test statistic $S$ : the average absolute distance between the mean logprobs of the two distributions.
- A Permutation Test is used to calculate a p-value. By randomly shuffling the pooled samples and recalculating $S$ , the method determines if the observed difference is statistically significant (rejecting the null hypothesis that the distributions are identical).

3. Key Contributions

Log Probability Tracking (LT) Method: A novel, low-cost approach using a single-token prompt and a single-token response to detect model changes. It is significantly more sensitive than token-based methods.
TinyChange Benchmark: A new benchmark designed to evaluate change detection sensitivity on small, realistic model modifications. It generates 58 variants per model across five difficulty levels:
- Fine-tuning (1 to 512 steps).
- LoRA fine-tuning.
- Unstructured weight pruning (random and magnitude-based).
- Parameter noise addition (Gaussian noise).
Extensive Evaluation: The paper evaluates LT against two state-of-the-art baselines (MET and MMLU-ALG) across 5 open-weight models (0.5B to 8B parameters), totaling 290 variants.

4. Results

The authors conducted extensive in-vitro experiments and real-world monitoring:

Sensitivity: LT can detect changes as small as one step of fine-tuning.
- In weight pruning experiments, LT detected changes at a magnitude of $2^{-10}$ (removing ~0.1% of weights), whereas MET required $2^{-1}$ and MMLU-ALG required $2^{-4}$ .
- LT is 2–3 orders of magnitude (OOM) more sensitive than MET and 1–2 OOM more sensitive than MMLU-ALG.
Cost Efficiency:
- Token Cost: LT requires only 28 input and 20 output tokens per test.
- Financial Cost: At GPT-4.1 pricing, monitoring an endpoint hourly for a year costs $0.14 with LT, compared to $146 for MET and $332 for MMLU-ALG. This represents a ~1,000x cost reduction.
Prompt Length: Experiments showed that very short prompts (1–2 tokens) are nearly as effective as longer prompts (33 tokens), with only a ~1% difference in ROC AUC.
Real-World Deployment:
- The authors monitored 189 API endpoints from 10 providers over 4 months, collecting 1.7M responses.
- They detected 37 suspected changes across 29 endpoints.
- Notably, almost all detected changes (34/37) affected open-weight models, challenging the assumption that open models are more stable or transparent.

5. Significance and Implications

Transparency & Accountability: LT provides a practical, low-cost tool for researchers, developers, and regulators to verify the consistency of LLM APIs. It exposes the prevalence of undocumented changes, even in "open" models.
Feasibility of Continuous Monitoring: By reducing the cost by three orders of magnitude, LT makes continuous, automated auditing of LLM APIs feasible for the first time.
Security: The method can detect malicious injections (e.g., backdoors) or unauthorized model swaps that might otherwise go unnoticed until a downstream application fails.
Limitations: The method requires API support for logprobs (currently ~23% of endpoints). It detects that a change occurred but does not identify the nature of the change (e.g., hardware vs. model update). Providers could theoretically evade detection by caching responses for specific monitoring prompts, though this introduces other risks.

Conclusion

The paper demonstrates that log probabilities, despite being non-deterministic, contain sufficient statistical signal to detect minute changes in LLMs. By leveraging simple hypothesis testing on single-token outputs, Log Probability Tracking (LT) offers a highly sensitive, cost-effective solution for ensuring the integrity and reproducibility of LLM APIs, addressing a critical gap in the current deployment landscape.

Log Probability Tracking of LLM APIs

1. The Problem: The "Taste Test" is Too Expensive

2. The Solution: Listening to the "Whispers" (Log Probabilities)

3. The Trick: The "Single Token" Whisper

4. The "TinyChange" Benchmark

5. Real-World Detective Work

Why This Matters

1. Problem Statement

2. Methodology: Log Probability Tracking (LT)

Core Intuition

The Algorithm

3. Key Contributions

4. Results

5. Significance and Implications

Conclusion

More like this

Complexity of Classical Acceleration for ℓ1\ell_1ℓ1​-Regularized PageRank

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Language Guided Adversarial Purification

Graph-based Active Learning for Entity Cluster Repair

Neural Green's Operators for Parametric Partial Differential Equations

Complexity of Classical Acceleration for $\ell_1$ -Regularized PageRank