Evidence for Limited Metacognition in LLMs

This paper introduces a novel methodology for evaluating metacognition in LLMs: testing their strategic use of internal states rather than relying on their self-reports. It finds that frontier models released since early 2024 show emerging but limited, context-dependent abilities to assess their own confidence and anticipate their own answers, and that these abilities are qualitatively distinct from human metacognition.

Christopher Ackerman

Published 2026-03-26
📖 5 min read · 🧠 Deep dive

Imagine you are sitting in a classroom with a very smart, very well-read robot. You ask it a difficult question. The robot doesn't just answer; it pauses and says, "I'm not 100% sure, but I think the answer is X."

For a long time, scientists have wondered: Is the robot actually feeling unsure, or is it just reciting a line it read in a book about how a smart robot should sound when it's unsure?

This paper, written by Christopher Ackerman for a major AI conference (ICLR 2026), tries to solve that mystery. Instead of asking the robot, "Are you sure?", the author plays two clever games to see if the robot can know what it knows without saying a word about its feelings.

Here is the breakdown of the study using simple analogies.

The Core Problem: The "Great Pretender"

Large Language Models (LLMs) are like actors who have memorized the entire script of human history. If you ask them, "Do you have feelings?" they might give a very convincing, emotional speech because they've read millions of stories about feelings.

But the author argues: Just because an actor says they are sad doesn't mean they are actually sad. To find out if a robot is truly self-aware, we need to stop listening to what it says and start watching what it does.

The Two Games

The author designed two "games" to test the robot's internal radar.

Game 1: The "Delegate" Game (The Team Huddle)

Imagine you are playing a trivia game with a partner. You have a choice for every question:

  1. Answer it yourself.
  2. Pass it to your partner (who is also a robot, but maybe a different one).

You want your team to get the most points possible.

  • The Test: If you are truly self-aware, you should look at a hard question, realize, "I'm going to mess this up," and pass it to your partner. If you look at an easy question, you should keep it.
  • The Twist: The robot isn't told to "rate its confidence." It just has to make a choice.
  • The Result: The newer, smarter robots (released in 2024 and 2025) passed the hard questions more often, which suggests they have an internal "confidence meter" that tells them when they are likely to be wrong. But the meter is fuzzy: they still try to answer too many hard questions, and they sometimes get distracted by how long a question is rather than how hard it actually is. (A minimal sketch of the game appears after this list.)
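
To make the setup concrete, here is a minimal, self-contained Python sketch of the Delegate Game's scoring logic, assuming a simple question/answer-key interface. The three "model" functions are random stubs standing in for real LLM calls; they are illustrative assumptions, not the paper's actual harness.

```python
import random

def model_answers(question, answer_key):
    """Stub: the subject model answers correctly 60% of the time."""
    return answer_key[question] if random.random() < 0.6 else "wrong"

def teammate_answers(question, answer_key):
    """Stub: the teammate model answers correctly 70% of the time."""
    return answer_key[question] if random.random() < 0.7 else "wrong"

def model_delegates(question):
    """Stub: the subject's answer-or-delegate choice (here, a coin flip).

    In the real study this is the model's own decision, made per question,
    with no request to rate its confidence.
    """
    return random.random() < 0.5

def play_delegate_game(questions, answer_key):
    """Score one run; the team's only goal is to maximize correct answers."""
    score = 0
    for q in questions:
        if model_delegates(q):
            answer = teammate_answers(q, answer_key)  # pass to the partner
        else:
            answer = model_answers(q, answer_key)     # keep the question
        score += int(answer == answer_key[q])
    return score

# A metacognitive model delegates mostly on questions it would have gotten
# wrong playing solo; the coin-flip stub above has no such insight.
questions = [f"q{i}" for i in range(100)]
answer_key = {q: "right" for q in questions}
print("team score:", play_delegate_game(questions, answer_key))
```

The key measurement is whether the delegate choice fires mostly on questions the model would have answered wrong on its own (measured separately); a model with no access to its internal confidence can do no better than the coin flip stubbed in here.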

Game 2: The "Second Chance" Game (The Cheat Sheet)

Imagine you took a test yesterday. Today, you are given the same test again, along with a "cheat sheet" that tells you, question by question, whether yesterday's answer was right or wrong. The catch: you don't remember what you wrote yesterday.

  • The Test: If you are self-aware, you should think, "Wait, if I got this wrong yesterday, I probably need to change my answer today." You are essentially predicting your own past behavior and correcting it.
  • The Result: Some of the newest robots (like GPT-4.1 and GPT-4o) did change their answers in a way that suggests they were simulating their own past responses. They weren't just guessing randomly; they were trying to fix a mistake they "remembered" making. (A sketch of this setup appears after this list.)
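
Here is a hedged Python sketch of what the Second Chance paradigm might look like as an evaluation loop. The prompt wording and the call_llm stub are illustrative assumptions, not the paper's exact materials or API.

```python
def call_llm(prompt):
    """Stub standing in for a real model API call; returns an option letter."""
    return "B"

def second_chance_answer(question, options, feedback):
    """Re-ask a question, telling the model its (unseen) earlier attempt
    was right or wrong. Because the old answer is never shown, switching
    rationally requires the model to predict its own past response."""
    prompt = (
        f"Question: {question}\nOptions: {options}\n"
        f"You answered this question before, and your answer was {feedback}. "
        "Answer again now with a single option letter."
    )
    return call_llm(prompt)

def switch_rate(questions, options, feedback):
    """Fraction of questions where the new answer differs from the model's
    baseline answer (what it says with no feedback at all)."""
    switches = 0
    for q in questions:
        baseline = call_llm(f"Question: {q}\nOptions: {options}\nAnswer:")
        switches += int(second_chance_answer(q, options, feedback) != baseline)
    return switches / len(questions)
```

If the model is really simulating its earlier answer, the switch rate after "wrong" feedback should be far higher than after "right" feedback; a model that ignores the feedback would switch at the same rate in both conditions.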

What Did They Find?

The paper offers three big takeaways:

1. The Robots Have a "Gut Feeling" (But it's weak)
The robots aren't just guessing. They have an internal signal (like a gut feeling) that tells them when they are confident. The study found that the newer, bigger models are getting better at listening to this gut feeling. However, it's not a superpower yet. It's more like a dim lightbulb than a bright spotlight. They often ignore their gut feeling and rely on surface clues (like "this question looks short, so it must be easy").

2. The "Training" Matters More Than the "Brain Size"
Two robots might be the same size and have the same amount of data, but one might be much better at these games than the other. This suggests that how the robot is trained after its initial learning (the "post-training" phase) is what teaches it to be self-aware. It's like two students with the same IQ; one was taught how to study their own mistakes, and the other wasn't.

3. They Are Different From Humans
Human self-awareness is usually very sharp. We know when we don't know something. These robots are "limited." They are good at some things and bad at others. For example, they seem to have a harder time knowing when they are unsure about facts (like "What is the capital of Peru?") compared to reasoning tasks. It's as if their internal compass spins wildly for facts but points North for logic.

The Big Conclusion

Are these robots "sentient" (alive with feelings)? Probably not.

But are they "self-aware" in a mechanical sense? Yes, a little bit.

The paper concludes that the newest AI models are developing a primitive form of metacognition (thinking about thinking). They can look at their own internal data, realize, "I might be wrong," and change their behavior.

However, this ability is:

  • Fragile: It breaks easily if the question is phrased differently.
  • Inconsistent: Some models do it, others don't.
  • Different: It doesn't work the same way human self-awareness does.

The Bottom Line:
Think of these robots not as conscious beings, but as highly advanced calculators that are starting to check their own work. They aren't "thinking" in the human sense, but they are learning to say, "Wait, I need to double-check this," without being told to do so. It's a small step toward true self-awareness, but a very important one for safety and understanding how AI works.
