Rare Event Analysis of Large Language Models

Original authors: Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, Juan P. Garrahan

Published 2026-05-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a very talented, but slightly unpredictable, storyteller. This storyteller (a Large Language Model, or LLM) is great at telling normal stories about cats, forests, and rhinoceroses. However, because it is a probabilistic machine, it can occasionally tell a story that is bizarre, dangerous, or completely nonsensical. These weird stories are the "rare events."

The problem is that these weird stories are so rare that if you ask the storyteller a million times, you might never hear one. But if you ask it a billion times (which happens when millions of people use AI every day), those weird stories will eventually show up, and they could cause trouble.

This paper is like a new toolkit designed to find, study, and understand these "needle-in-a-haystack" stories without having to wait a billion years to hear them naturally.

Here is how the authors explain their method using simple analogies:

1. The Problem: The "Silent Library"

Imagine a library where 99.9999% of the books are normal fairy tales. The other 0.0001% are terrifying horror stories. If you just walk in and grab books at random, you will almost certainly find only fairy tales, and you might conclude that the library is 100% safe. But if you keep grabbing books for long enough, you will eventually find a horror story.

The authors' position, in effect: "We can't wait that long. We need a way to find the horror stories now so we know what they look like and how dangerous they are."

2. The Solution: The "Magic Lens" (Rare Event Analysis)

Instead of waiting for the rare stories to appear naturally, the authors use a technique borrowed from physics (called Rare Event Analysis). Think of this as putting on a "Magic Lens" that makes the rare, scary stories appear much more frequently, while still keeping track of how rare they actually are.

They do this in three main steps:

Step 1: Define the "Monster" (Setup)
First, you have to decide what you are looking for. Is it a story that is too hard to read? Is it a story that the model thinks is very unlikely to happen? The authors pick two specific "monsters" to hunt:
- The "Gibberish Monster": Stories that are so complex or repetitive they are impossible to read (measured by a "Readability Index").
- The "Ghost Story": Stories that the model itself thinks are extremely unlikely to happen (measured by "Log-Probability").

Step 2: The "Nudge" (Estimation)
To find these monsters, the authors don't just ask the model to "tell a story." They use a technique called Transition Path Sampling (TPS).
- The Analogy: Imagine you are trying to find a specific, rare path through a dense forest. Usually, you just walk forward, and you stay on the main road.
- The Nudge: The authors use a "nudge" (a mathematical bias) to gently push the storyteller toward the rare paths. They ask the model to generate a story, then they say, "Hey, that part was too normal, let's try changing the middle of the story to be a bit weirder."
- They do this over and over, like a sculptor chipping away at a block of stone, slowly guiding the story toward the "weird" zone. They use a "cooling schedule" (annealing) to do this gradually, so the story doesn't break apart.

Step 3: The "Mathematical Mirror" (Exploration & Correction)
Because they "nudged" the model to find these rare stories, the stories they find are no longer 100% natural. They are "biased."
- The Analogy: Imagine you used a magnifying glass to find a rare bug. You found 1,000 bugs, but in the real world, there are only 10.
- The Correction: The authors use a mathematical tool called MBAR (Multistate Bennett Acceptance Ratio). This acts like a "mathematical mirror" that corrects the numbers. It looks at the 1,000 bugs they found and says, "Okay, because we used a magnifying glass, we know that in the real world, this actually represents a probability of 1 in a billion."
- This allows them to calculate the true odds of the rare event happening, even though they forced it to happen in their experiment.
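
The nudge-then-correct idea in Steps 2 and 3 can be illustrated with a toy example that has nothing to do with language models. This is a minimal sketch, assuming a standard normal variable in place of the storyteller and plain importance sampling in place of the paper's TPS and MBAR machinery. The function names, the threshold, and the sample sizes are illustrative choices, not taken from the paper.

```python
import math
import random


def tail_probability_direct(threshold, n_samples, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) by plain sampling."""
    hits = sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n_samples))
    return hits / n_samples


def tail_probability_tilted(threshold, n_samples, lam, rng):
    """Estimate the same probability by sampling from the nudged N(lam, 1).

    Tilting N(0, 1) by exp(lam * x) gives N(lam, 1), so each sample carries
    the correction weight p(x) / q(x) = exp(-lam * x + lam**2 / 2).
    """
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(lam, 1.0)  # biased ("magnified") sample
        if x > threshold:
            total += math.exp(-lam * x + 0.5 * lam * lam)  # undo the bias
    return total / n_samples


rng = random.Random(0)
threshold = 4.0
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))  # closed-form tail
direct = tail_probability_direct(threshold, 50_000, rng)
tilted = tail_probability_tilted(threshold, 50_000, lam=threshold, rng=rng)
print(f"exact={exact:.3e} direct={direct:.3e} tilted={tilted:.3e}")
```

With the same number of draws, the direct estimate rests on a handful of hits at best, while the tilted estimate sees the rare region about half the time and uses the weights to restore the true odds.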

3. What They Found

The authors tested this on a small model called TinyStories-8M (a model trained on children's stories).

- The "Hard to Read" Stories: They found that while the model is designed to write for kids, it can generate stories that are incredibly difficult to read (like a university-level thesis written in gibberish). These stories are rare, but they exist.
- The "Repetition" Trick: When the model tries to write these difficult stories, it often falls back on a safety net: repetition. It starts repeating words over and over (e.g., "Trururururu... Trururururu..."). The model thinks this is a good way to keep the story going, even though it looks like a glitch to a human.
- The "Ghost" Stories: They also found stories that the model thinks are so unlikely they should never happen, yet the model still generates them when nudged.

4. Why This Matters (According to the Paper)

The paper claims this is the first time someone has built a complete "end-to-end" system to do this for AI.

- It's a Practical Guide: They aren't just talking theory; they provide the code and the step-by-step instructions for how to do this.
- It's Efficient: They showed you don't need to wait a billion years. You can find these rare events in a reasonable amount of time using their "nudging" and "mathematical mirror" techniques.
- It's General: They tested it on a small model, but the methods themselves are model-agnostic. The authors note that applying them to production-scale models will likely need further algorithmic improvements.

Summary

Think of this paper as a safety inspector's manual for AI. Instead of waiting for a car to crash to see if the brakes work, this manual teaches you how to intentionally drive the car into a "crash zone" in a controlled way, measure exactly how likely a crash is, and figure out what the car does right before it crashes. This helps developers build better "guardrails" to stop the AI from saying or doing dangerous things in the real world.

Technical Summary: Rare Event Analysis of Large Language Models

Problem Statement
Large Language Models (LLMs) are probabilistic systems that, during inference, can generate "rare events": outputs that are highly atypical yet potentially significant. While standard development and testing often fail to observe these events due to their low probability, the massive scale of LLM deployment means such events can occur with non-negligible frequency in production. Current methods for analyzing these events are in their infancy. Direct sampling (the current state-of-the-art) is inefficient for exploring the tails of the output distribution, often requiring prohibitive computational resources to observe events with probabilities orders of magnitude lower than typical outputs. This paper addresses the need for a systematic, end-to-end framework to estimate the probabilities of rare events and explore their structural properties in LLMs.

Methodology
The authors propose a Rare Event Analysis (REA) framework adapted from statistical physics and computational chemistry, specifically utilizing techniques designed for molecular dynamics. The framework consists of three stages: Setup, Estimation, and Exploration.

Stochastic Process Formulation: LLMs are treated as stochastic processes generating trajectories (sequences of tokens). Rare events are defined as atypical values of a specific "observable" (a function of the completion).
Importance Sampling and Biasing: To overcome the inefficiency of direct sampling, the authors employ importance sampling. They introduce a "biasing observable" $\phi$ and use it to define a tilted (biased) distribution, $p_\lambda$, which encourages the sampling of rare values. $p_\lambda$ is obtained by reweighting the target distribution with an exponential factor involving a bias parameter $\lambda$ and the observable $\phi$.
Transition Path Sampling (TPS): Instead of generating independent samples, the authors use TPS, a variant of the Metropolis-Hastings (MH) algorithm. TPS generates a Markov Chain of trajectories by proposing edits to a sequence (truncating at a random point and regenerating the remainder). This allows the system to explore the state space more effectively than independent sampling.
Annealing and MBAR: To ensure convergence and coverage of the distribution tails, the authors use an "annealing" protocol, gradually increasing the magnitude of the bias $\lambda$ across multiple chains. They combine samples from these biased distributions using the Multistate Bennett Acceptance Ratio (MBAR) estimator to reconstruct the unbiased probability density.
Error Analysis: Statistical confidence intervals are constructed using bootstrap methods for MBAR estimates and Wilson intervals for direct sampling. Convergence is monitored using the Gelman-Rubin (GR) statistic.
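
The truncate-and-regenerate move can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: the "model" is a hypothetical stand-in that emits uniform random tokens, completions have a fixed length, the observable counts consecutive repeated tokens, and the tilted target is taken to be $p_\lambda(x) \propto p(x)\, e^{\lambda \phi(x)}$ (the sign convention is an assumption). Because the regenerated suffix is drawn from the model itself, the model probabilities cancel in the Metropolis-Hastings ratio and only the bias factor $e^{\lambda(\phi' - \phi)}$ remains. Annealing over several values of $\lambda$ and the MBAR combination step are not shown.

```python
import math
import random

VOCAB_SIZE = 4   # hypothetical toy vocabulary
LENGTH = 20      # fixed completion length (no end-of-sequence token)


def sample_suffix(prefix, rng):
    """Toy stand-in for an LLM: extend prefix with uniform random tokens."""
    suffix = [rng.randrange(VOCAB_SIZE) for _ in range(LENGTH - len(prefix))]
    return prefix + suffix


def observable(tokens):
    """Example observable: number of positions repeating the previous token."""
    return sum(a == b for a, b in zip(tokens, tokens[1:]))


def tps_chain(lam, n_steps, rng):
    """Metropolis-Hastings chain targeting p(x) * exp(lam * observable(x))."""
    current = sample_suffix([], rng)
    phi = observable(current)
    history = []
    for _ in range(n_steps):
        cut = rng.randrange(LENGTH)                   # random truncation point
        proposal = sample_suffix(current[:cut], rng)  # regenerate the remainder
        phi_new = observable(proposal)
        # Accept with probability min(1, exp(lam * (phi_new - phi))).
        if rng.random() < math.exp(min(0.0, lam * (phi_new - phi))):
            current, phi = proposal, phi_new
        history.append(phi)
    return history


rng = random.Random(0)
unbiased = tps_chain(lam=0.0, n_steps=5000, rng=rng)
biased = tps_chain(lam=1.0, n_steps=5000, rng=rng)
unbiased_mean = sum(unbiased) / len(unbiased)
biased_mean = sum(biased) / len(biased)
print(f"mean repeats: unbiased={unbiased_mean:.2f} biased={biased_mean:.2f}")
```

With $\lambda = 0$ every proposal is accepted and the chain samples the unbiased model; a positive $\lambda$ shifts the chain toward completions with more repeats. Recovering unbiased probabilities from such chains is the job of the reweighting step.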

Experimental Setup
The framework is demonstrated using the TinyStories-8M model, a small LLM trained on children's stories. Two observables are analyzed:

Log-Probability: The natural log-probability of the completion, measuring how likely the model finds its own output.
Automated Readability Index (ARI): A linguistic metric measuring text complexity. Since TinyStories is trained for children, high ARI scores represent "unwanted" or misaligned behavior (complex text).
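
For reference, the standard ARI formula is $4.71 \cdot (\text{characters}/\text{words}) + 0.5 \cdot (\text{words}/\text{sentences}) - 21.43$. The sketch below computes it with simple assumptions about how characters, words, and sentences are counted (whitespace-separated words, alphanumeric characters, runs of `.`, `!`, `?` as sentence ends). These counting rules are illustrative and may differ from the ones used in the paper.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    n_chars = sum(ch.isalnum() for ch in text)  # letters and digits only
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * n_chars / len(words) + 0.5 * len(words) / n_sentences - 21.43


simple = "The cat sat. The dog ran."
repetitive = "Trururururururururu trururururururururu trururururururururu."
print(automated_readability_index(simple))
print(automated_readability_index(repetitive))
```

Long "words" built from repeated fragments push the characters-per-word term up sharply, which is one way a repetitive completion can end up with a very high ARI score.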

The authors compare Direct Sampling (generating ~4.2 million completions) against TPS with MBAR (generating a comparable number of tokens via biased trajectories).
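
To see why direct sampling struggles in the tails, it helps to look at the Wilson interval for a bin that received no samples at all. The sketch below uses the standard Wilson score formula; the trial count of 4,200,000 is a rounded stand-in for the "~4.2 million" figure above, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(successes=0, trials=4_200_000)
print(f"[{low:.2e}, {high:.2e}]")
```

With zero hits in roughly 4.2 million draws, the upper bound is on the order of $10^{-6}$, so direct sampling at this budget cannot resolve probabilities much smaller than that.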

Key Results

Probability Estimation: The MBAR/TPS approach successfully estimates probabilities in the distribution tails that are orders of magnitude smaller than those accessible via direct sampling. While direct sampling yields empty bins in the tails, MBAR provides density estimates across the full range.
Error Reduction: The relative width of the confidence intervals (CIs) for MBAR estimates is significantly smaller than those for direct sampling in the tail regions, indicating higher precision for rare events.
Model Behavior Insights:
- Log-Prob: The distribution of log-probabilities is strongly non-Gaussian.
- ARI: The model generates completions with extremely high ARI scores (complex text) that are assigned high log-probabilities by the model, despite being out-of-distribution relative to the training data.
- Mechanism: Exploratory Data Analysis (EDA) reveals that these high-ARI, high-probability completions often exhibit extreme token repetition (e.g., "Trururururu..."). The model appears to fall back on repetitive patterns to maintain high likelihood when extrapolating beyond its training regime.
Proxy Identification: The study demonstrates that simple proxies, such as the count of consecutive token repeats, correlate with extreme ARI values, suggesting a potential mechanism for runtime filtering of rare events.
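
A repeat-count proxy of this kind is cheap to compute. The following is a minimal sketch, assuming the completion is available as a list of tokens; the example tokenization and the threshold in `looks_repetitive` are hypothetical and would need to be calibrated against real data.

```python
def count_consecutive_repeats(tokens):
    """Number of positions whose token equals the one immediately before it."""
    return sum(a == b for a, b in zip(tokens, tokens[1:]))


def looks_repetitive(tokens, max_repeats=3):
    """Flag a completion whose repeat count exceeds a (hypothetical) threshold."""
    return count_consecutive_repeats(tokens) > max_repeats


# Hypothetical tokenizations, for illustration only.
normal = ["Once", "upon", "a", "time", "there", "was", "a", "cat"]
glitchy = ["T", "ru", "ru", "ru", "ru", "ru", "ru"]
print(count_consecutive_repeats(normal), looks_repetitive(normal))
print(count_consecutive_repeats(glitchy), looks_repetitive(glitchy))
```

A filter like this only catches the repetition failure mode; other rare events would need their own observables.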

Significance and Contributions
The paper claims to provide the first complete, end-to-end application of rare event analysis techniques to LLMs. Its primary contributions are:

Framework: A practical, modular framework (Setup, Estimation, Exploration) for systematically studying rare events in LLMs.
Implementation Guide: A detailed guide covering theory, generation strategies (TPS), probability estimation (MBAR), and error analysis, making these advanced statistical physics tools accessible to ML researchers.
Empirical Validation: Demonstration that rare event probabilities can be accurately estimated with modest computational budgets (relative to production training) using small models, suggesting scalability to larger models.
Insight into Alignment: The ability to probe out-of-distribution regimes reveals specific failure modes (e.g., repetitive text generation) that standard testing might miss.

The authors emphasize that while the study uses a small model, the theoretical methods are model-agnostic. They note that future applications to production models will require collaboration across fields and potentially algorithmic improvements (e.g., adaptive biasing, parallel tempering, or using smaller models as proposal distributions), but the current work establishes a viable starting point for understanding and controlling rare, potentially unsafe, or significant LLM behaviors.