Lyapunov Probes for Hallucination Detection in Large Foundation Models

This paper proposes "Lyapunov Probes," a novel hallucination detection method for Large Language Models that frames the problem using dynamical systems stability theory to identify unstable knowledge-transition regions where hallucinations occur.

Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou, Faguo Wu, Hongwei Zheng, Wenjun Wu, Zhaoxin Fan

Published 2026-03-09

Imagine you have a very smart, well-read friend (let's call them "The Model") who can answer almost any question. But sometimes, when they don't actually know the answer, they confidently invent a story that sounds plausible but is entirely fabricated. This is called a hallucination.

Current ways to catch these lies are like asking the friend, "Are you sure?" (which they might lie about) or checking a massive encyclopedia to see if the fact exists (which is slow and expensive).

This paper proposes a brand new way to catch these lies by treating the AI not just as a chatbot, but as a physical system, like a ball rolling on a hilly landscape.

The Big Idea: The "Hill and Valley" Analogy

Imagine the AI's knowledge is a giant, 3D landscape:

  1. The Deep Valleys (Stable Knowledge): When the AI knows a fact (e.g., "The sky is blue"), its internal "ball" sits deep in a valley. If you give the ball a little nudge (a small change in the question), it wobbles but rolls right back to the bottom. It's stable.
  2. The Flat Unknown Plains (Stable Unknown): Sometimes the AI doesn't know something, but it's honest. It sits on a flat plain. If you nudge it, it doesn't move much, and it just says, "I don't know." This is also stable.
  3. The Rugged Cliff Edges (The Hallucination Zone): This is the dangerous part. It's the edge of the cliff where the known world meets the unknown. If the ball is here, even a tiny nudge sends it tumbling off the edge into chaos. This is where the AI starts making things up. It's unstable.

The Problem: Current AI detectors don't know where the cliff edge is. They just guess if the answer is true or false.

The Solution: The authors built a tool called a Lyapunov Probe.

What is a Lyapunov Probe?

Think of the Lyapunov Probe as a super-sensitive seismometer or a stability tester.

Instead of just asking, "Is this answer true?", the Probe asks: "If I shake this answer slightly, does it stay the same, or does it fall apart?"

Here is how it works in three simple steps:

  1. The Nudge: The Probe takes the AI's answer and gives it a tiny "nudge." This could be changing a word slightly, adding a bit of noise, or rephrasing the question.
  2. The Reaction:
    • If the AI is in a Stable Valley (it knows the fact), the answer stays solid. The Probe says, "Confidence: High."
    • If the AI is on a Cliff Edge (it's hallucinating), the tiny nudge makes the answer collapse or change wildly. The Probe sees this instability and says, "Confidence: Low! Danger!"
  3. The Rule of Decay: The Probe is trained with a special rule: As the nudge gets bigger, the confidence must go down. If the AI is lying, a big nudge should make it panic. If the AI is telling the truth, a big nudge shouldn't change its mind much.
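The three steps above can be sketched in miniature. The toy below stands in for a model: it answers "A" or "B" depending on the sign of a numeric query, so queries near zero sit on the "cliff edge." The probe nudges the query many times and reports how often the answer survives; running it at growing noise levels shows the Rule of Decay. This is a hypothetical sketch of the idea, not the paper's actual probe, which works on the model's internal representations:

```python
import random

def toy_model(x):
    """Toy stand-in for a model's answer. Queries near x = 0 sit on the
    'cliff edge' between the two answers."""
    return "A" if x >= 0 else "B"

def lyapunov_probe(model, query, noise=0.1, n_nudges=200, seed=0):
    """Nudge the query repeatedly and return a stability score in [0, 1].

    Near 1.0: a 'deep valley' (the answer survives every nudge).
    Near 0.5: a 'cliff edge' (the answer flips under tiny perturbations).
    """
    rng = random.Random(seed)
    base = model(query)
    same = sum(model(query + rng.gauss(0, noise)) == base
               for _ in range(n_nudges))
    return same / n_nudges

print(lyapunov_probe(toy_model, 2.0))   # deep in the 'A' valley: ~1.0
print(lyapunov_probe(toy_model, 0.01))  # on the cliff edge: ~0.5

# The Rule of Decay: for a borderline query, bigger nudges mean lower scores
for noise in (0.05, 0.2, 0.8):
    print(noise, lyapunov_probe(toy_model, 0.3, noise=noise))
```

A real model's "query" lives in a high-dimensional representation space rather than on a number line, but the logic is the same: stable answers shrug off the nudge, hallucinations fall apart.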

Why is this better?

  • It's like a lie detector for stability: Instead of checking facts against a database, it checks if the AI's brain is "shaky" when you poke it.
  • It works everywhere: Because it looks at the structure of the AI's thinking (the hills and valleys), it works on different types of questions, different languages, and even images, without needing a new encyclopedia for every topic.
  • It catches the "Maybe" moments: It's really good at spotting when the AI is in that dangerous "I think I know, but I'm not sure" zone, which is exactly where hallucinations happen.

The Results

The authors tested this on many different AI models (like Llama, Qwen, and Falcon). They found that:

  • The Probe is much better at catching lies than previous methods.
  • It works even on questions the AI was never explicitly trained on (it generalizes well).
  • It can tell the difference between an AI that is confidently wrong and one that is honestly unsure.

In a Nutshell

This paper teaches us that hallucinations happen when the AI is standing on shaky ground. By building a tool that gently shakes the AI to see if it wobbles, we can catch it before it starts making things up. It turns the problem of "Is this true?" into "Is this stable?"—a much smarter way to keep AI honest.