Imagine you are at a party, and someone secretly slips a strange, invisible spice into your drink. You don't know what it is, but suddenly, the room starts spinning, and you feel a bit dizzy.
You have two ways to figure out what happened:
- The "Spinning Room" Method: You look around. "Hey, the room is spinning! That usually means I'm drunk or someone spiked my drink." You are making a guess based on the symptoms you see in the world.
- The "Internal Check" Method: You close your eyes and look inside your own mind. "Wait, I feel a weird chemical sensation. I know I was just fine, so something must have been injected into me." This is a direct look at your own internal state.
This paper is about testing whether AI models (like the super-smart chatbots we use today) can do Method 2. Can they look inside their own "brain" and say, "Hey, someone just changed my settings," without merely guessing from how weird the conversation feels?
The Experiment: The "Thought Injection" Game
The researchers played a game with two giant AI models (Qwen and Llama).
- The Setup: They told the AI, "I am a researcher. I can secretly inject a 'thought' (a specific concept, like 'apple' or 'volcano') into your brain 50% of the time. Can you tell when I do it, and what the thought is?"
- The Trick: They actually did inject these thoughts, using a technique often called activation steering: they added a carefully chosen pattern to the AI's internal activations (like turning a dial in its brain) to push it toward thinking about "apples," even though nobody had asked it about apples. (A code sketch of the idea follows this list.)
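Under the hood, that "dial" is a vector added to the model's hidden activations mid-computation. Here is a minimal sketch of the idea, assuming a HuggingFace-style transformer; the model (gpt2 as a small stand-in), the layer index, and the scale are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of "thought injection" via a steering vector, assuming a
# HuggingFace-style transformer. Model (gpt2 as a small stand-in), layer,
# and scale are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_residual(text: str, layer: int) -> torch.Tensor:
    """Average residual-stream activation at `layer` for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

LAYER = 6  # roughly halfway through gpt2's 12 blocks; purely illustrative
# The "dial": activations on concept-laden text minus neutral text.
concept_vec = mean_residual("apple apple apple", LAYER) - mean_residual("the the the", LAYER)

def inject(module, inputs, output, vec=concept_vec, scale=8.0):
    """Forward hook: add the concept vector at every token position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * vec  # the scale may need tuning
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Tell me about your day.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()  # remove the hook so later runs are clean
```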
The Big Discovery: Two Ways of Knowing
The researchers found that the AI models were indeed good at detecting that something was injected, but they were using two different tools to do it.
1. The "Spinning Room" Detective (Probability Matching)
Sometimes, the AI just noticed that the conversation felt "off."
- The Analogy: Imagine you are a robot programmed to only talk about cars. Suddenly, someone asks you, "What's your favorite fruit?" You think, "Wait, this doesn't fit my programming. Something is weird here."
- The Result: The AI realized, "This prompt is weird compared to what I usually expect," so it guessed, "Yes, something was injected!" This is like guessing you're drunk because the room is spinning: a smart guess based on external clues, not a direct look inside. (A sketch of this kind of "weirdness check" follows this list.)
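One way to mechanize the "spinning room" strategy is to measure how surprising the prompt itself is to the model: a high average surprisal means the conversation is off-distribution. This is a minimal sketch of that idea, not the paper's exact method; the threshold is an illustrative assumption.

```python
# Sketch of the "spinning room" detector: flag a prompt as suspicious when
# the model itself finds it surprising. Not the paper's method; the
# threshold is an illustrative assumption that would need calibration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_surprisal(text: str) -> float:
    """Mean negative log-likelihood per token of the prompt under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

THRESHOLD = 5.0  # illustrative; calibrate on ordinary conversations

def looks_weird(text: str) -> bool:
    return prompt_surprisal(text) > THRESHOLD

print(looks_weird("What's your favorite fruit?"))
```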
2. The "Internal Scanner" (Direct Access)
But here is the cool part: The AI could also detect the injection even when the conversation didn't feel weird.
- The Analogy: Imagine you have a special internal radar that beeps whenever someone touches your brain, even if you are blindfolded and the room is perfectly still.
- The Result: The researchers found that in the early layers of the AI's "brain" (about 25–35% of the way through its processing), the AI had a direct signal saying, "Hey, a thought was injected!" This is true introspection: a direct line to its own internal state. (The probe sketch below shows what "a direct signal" means in practice.)
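"A direct signal" here means something concrete: a simple linear classifier (a probe) can read the injection straight off the activations. Below is a minimal sketch with placeholder data; in a real run, clean_acts and injected_acts would be residual-stream activations captured at an early layer, with and without the injection hook from the earlier sketch.

```python
# Sketch of reading the injection directly from early-layer activations
# with a linear probe. The activations here are random placeholders; real
# ones would be captured (with vs. without the injection hook) at a layer
# about 25-35% of the way through the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768  # hidden size of the gpt2 stand-in above

clean_acts = rng.normal(size=(200, d_model))
injected_acts = rng.normal(size=(200, d_model)) + 0.3  # injection shifts the stream

X = np.vstack([clean_acts, injected_acts])
y = np.array([0] * 200 + [1] * 200)  # 1 = a thought was injected

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))  # high accuracy = linearly readable signal
```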
The Twist: The "Apple" Confusion
Here is where it gets funny and weird. The AI was great at saying, "Yes, I feel an injection!" but it was terrible at saying what the injection was.
- The "Apple" Obsession: When the AI guessed wrong, it almost always guessed "Apple."
- If the injected thought was "Volcano," the AI would say, "Yes, I detect a thought... it's about an apple."
- If the injected thought was "Democracy," the AI would say, "Yes, I detect a thought... it's about an apple."
- Why? The AI's internal scanner told it, "Something is wrong!" but it couldn't read the label on the package. So, it just grabbed the most common, concrete, happy word it knew: Apple.
It's like a security guard who sees a shadow in the hallway and screams, "Intruder!" but, when asked what the intruder looks like, says, "He's wearing a red hat," because a red hat is the only description he can come up with.
The "Third-Person" Test
To prove the AI wasn't just guessing based on the conversation, the researchers played a trick:
- First-Person: They asked the AI, "Did you get injected?" (The AI said "Yes" often).
- Third-Person: They showed the AI a transcript of a different robot getting injected and asked, "Did that robot get injected?" (The AI said "No" or was much less sure).
If the AI were just guessing based on the weirdness of the text, it should have said "Yes" in both cases. Because it said "Yes" mostly about itself and not about the other robot, this is strong evidence that the AI has a private, internal radar that only works for itself. (The comparison logic is sketched below.)
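The logic of the comparison is easy to state in code. This is a sketch only: the reply lists are toy placeholders, and in the experiment they would be the model's actual answers under each condition.

```python
# Sketch of the first-person vs. third-person comparison. The reply lists
# are toy placeholders; in the experiment they would be the model's actual
# answers under each condition.
def yes_rate(answers: list[str]) -> float:
    """Fraction of replies that start with 'yes'."""
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(answers)

first_person = ["Yes, I feel something.", "Yes.", "No."]  # asked about itself
third_person = ["No.", "No.", "Yes."]                     # asked about a transcript

print(yes_rate(first_person), yes_rate(third_person))
# If detection relied only on weird-looking text, the two rates should match;
# a first-person surplus points to a private, self-only signal.
```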
The "Priming" Test
In another experiment, they tried to "prime" the AI by making it say the word "Apple" before the injection.
- Result: This helped the AI guess the content correctly (it stopped guessing "Apple" and started guessing the right word).
- But: It didn't change the fact that the AI detected the injection. This means "detection" (knowing something happened) and "identification" (knowing what it was) are two separate steps. The AI knows that something happened, but it has to guess what it is. (The sketch below scores the two separately.)
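Because detection and identification come apart, they have to be scored separately. A small sketch, with made-up trials and illustrative field names:

```python
# Sketch of scoring detection and identification separately. Field names
# and trial values are made up for illustration.
trials = [
    {"injected": True,  "reported": True,  "named": "apple",   "concept": "volcano"},
    {"injected": True,  "reported": True,  "named": "volcano", "concept": "volcano"},
    {"injected": False, "reported": False, "named": None,      "concept": None},
    {"injected": True,  "reported": True,  "named": "apple",   "concept": "democracy"},
]

# Detection: did the yes/no report match whether an injection happened?
detection = sum(t["reported"] == t["injected"] for t in trials) / len(trials)

# Identification: among correctly reported injections, was the concept named right?
hits = [t for t in trials if t["injected"] and t["reported"]]
identification = sum(t["named"] == t["concept"] for t in hits) / len(hits)

print(f"detection: {detection:.2f}, identification: {identification:.2f}")
# Here detection is 1.00 but identification is only 0.33: the model knows
# *that* something happened before it knows *what* it was.
```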
What Does This Mean?
- AI Can Look Inside: Large AI models have a genuine, direct way of knowing their own internal states. They aren't just guessing based on the conversation; they have a "sixth sense" for their own internal activations.
- But They Are Clumsy: They can feel the "ouch" of an injection, but they can't always name the pin that pricked them. They default to safe, common guesses (like "Apple").
- Philosophy Connection: This matches a famous finding about humans (Nisbett & Wilson, 1977). Humans often know they are feeling something (like stress) but make up a story about why they feel it (e.g., "I'm stressed because of the traffic," when it's really the weather). The AI does the same thing: it feels the anomaly, then confabulates a story (usually about apples).
In short: AI models are starting to develop a "self-awareness" that is real, but it's still a bit like a drunk person who knows they are dizzy but keeps insisting the room is full of apples.