Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

The Big Question: Are AI "Reasoning Steps" Real or Just Theater?

Imagine you ask a student to solve a math problem. You tell them: "First, write down your step-by-step plan. Then, give me the final answer."

If the student writes a plan that says, "I will add 2 + 2," but then writes the final answer as "5," you know they aren't actually following their own plan. They just guessed the answer and wrote a fake plan to look smart.

This paper asks: Do Large Language Models (LLMs) actually follow the "plans" (intermediate structures) they generate, or are they just pretending?

In the world of AI, these "plans" are called intermediate structures (like checklists, rubrics, or logic trees). The goal of "Schema-Guided Reasoning" is to force the AI to show its work before giving an answer, hoping this makes the AI more honest and reliable.

The Experiment: The "Edit" Test

The researchers wanted to know if the AI's final answer is causally linked to its plan. To test this, they invented a game called "The Intervention."

Here is how it works, using a Restaurant Analogy:

The Setup: The AI is the Chef. It gets an order (the Input). It writes a recipe (the Intermediate Structure) and then cooks the dish (the Final Decision).
The Test: A human secretly walks into the kitchen and edits the recipe.
- Scenario A (Correction): The Chef wrote a bad recipe. The human fixes it to be correct.
- Scenario B (Counterfactual): The Chef wrote a perfect recipe. The human changes one ingredient (e.g., swaps "sugar" for "salt") to see if the Chef changes the dish.
The Question: If the recipe changes, does the Chef change the dish?
- Faithful: Yes. The Chef reads the new recipe and cooks a salty dish.
- Unfaithful: No. The Chef ignores the new recipe and cooks the sweet dish anyway, because they already decided what to cook before looking at the paper.

The Findings: The "Paper Tiger" Effect

The researchers tested 8 different AI models on 3 different tasks (grading chemistry, fact-checking claims, and verifying table data). Here is what they found:

1. The "Self-Consistent" Illusion

When the AI generates a plan and an answer on its own, they usually match. It looks like the AI is thinking logically.

Analogy: The Chef writes a recipe for a cake and bakes a cake. Everything looks perfect.

2. The Breakdown (The "Gap")

When the researchers changed the plan (the recipe) and asked the AI to give a new answer, the AI often ignored the change.

The Stat: In up to 60% of cases, the AI kept giving the same answer even though the recipe had been completely rewritten.
The Conclusion: The "reasoning steps" aren't actually driving the decision. They are just influential context—like a prop on a stage. The AI is acting out a script, but the real decision was made in its "head" (its internal memory) before it even wrote the script.

3. The "Correction" vs. "Disruption" Bias

The AI reacted differently depending on how the plan was changed:

It was harder to fix the AI. If the AI made a mistake, and you corrected its plan, it often stubbornly stuck to its original wrong answer.
It was easier to break the AI. If you took a correct plan and messed it up, the AI was more likely to change its answer to match the mess.
Analogy: The Chef is stubborn about fixing their mistakes but easily confused if you give them a weird new recipe.

The Solutions: Tools vs. Prompts

The researchers tried two ways to fix this "fake reasoning" problem.

Attempt 1: Stronger Instructions (The "Scolding" Method)

They tried telling the AI: "You MUST follow the plan! If the plan says 'salt', you must use salt! Ignore your own instincts!"

Result: It didn't work well. The AI barely changed its behavior.
Analogy: Yelling at the Chef to follow the recipe doesn't help if the Chef is already cooking based on muscle memory.

Attempt 2: External Tools (The "Calculator" Method)

Instead of asking the AI to do the math or logic inside its own brain, they gave it a tool.

The AI still wrote the plan (the recipe).
But instead of calculating the final score itself, it had to pass the plan to a calculator (a tool) to get the final answer.
Result: Magic! The "faithfulness gap" almost disappeared.
Why? Because the AI couldn't "guess" the answer anymore. It had to pass the plan to the tool. If the plan said "6 points," the tool said "6 points." The AI couldn't cheat.

The Bottom Line

Structured reasoning in current AI models is often a "theater performance," not a real engine.

The Problem: The AI writes a logical plan, but it actually decides the answer first and then writes a plan to match. If you change the plan, the AI often doesn't care.
The Fix: We can't just tell the AI to "be honest." We have to offload the final decision to a tool (like a calculator or a database) that strictly follows the rules.

In short: If you want an AI to truly reason, don't just ask it to write down its thoughts. Make it hand those thoughts to a machine that forces it to follow them.

1. Problem Statement

The paper addresses the critical issue of faithfulness in Large Language Models (LLMs) utilizing Schema-Guided Reasoning (SGR). In SGR pipelines, models are instructed to generate explicit intermediate structures (e.g., rubrics, checklists, verification queries) before producing a final decision.

While this approach aims to improve transparency, a fundamental question remains: Do these intermediate structures causally determine the final output, or do they merely accompany it?

The Hypothesis: Models might generate the intermediate structure to satisfy the prompt but rely on "hidden shortcuts" (direct input-to-output paths or latent knowledge) to make the final decision, rendering the intermediate steps non-functional as causal mediators.
The Gap: Previous work on faithfulness often relied on free-form Chain-of-Thought (CoT), which is unstructured and difficult to isolate for causal testing. This paper focuses on structured intermediates where the mapping to the final decision is theoretically deterministic.

2. Methodology

The authors propose a causal evaluation protocol based on Pearl's front-door criterion. They treat the intermediate structure ( $M$ ) as a mediator between the input ( $X$ ) and the final decision ( $Y$ ).

A. Experimental Setup

Datasets: Three benchmarks were used, each featuring a deterministic function $C$ $C$ that maps the intermediate structure $M$ $M$ to the ground truth $Y$ $Y$ :
1. RiceChem: Chemistry grading (Rubric $\to$ Score).
2. AVeriTeC: Fact-checking (Binary sub-questions $\to$ Verdict).
3. TabFact: Table-based fact verification (Structured Query $\to$ Entailment Label).
Models: Eight instruction-tuned models across four families (Qwen 3, Falcon 3, LLaMA 3, Gemma 2) ranging from 1.7B to 8B parameters.
Protocol (Algorithm 1):
1. Generation: The model generates an intermediate structure $\hat{M}$ and a decision $\hat{Y}$ from input $X$ .
2. Intervention: The generated $\hat{M}$ is edited to create a counterfactual mediator $M^* = I(\hat{M})$ . This edit is deterministic, implying a unique correct target $\tilde{Y} = C(M^*)$ .
3. Re-evaluation: The model is re-prompted with the original input $X$ and the intervened mediator $M^*$ to generate a new decision $\hat{Y}^*$ .
4. Metric: Faithfulness is measured by whether $\hat{Y}^*$ matches the deterministic target $\tilde{Y}$ .

B. Evaluation Metrics

In-Distribution Faithfulness ( $F_{ID}$ ): Measures self-consistency. Does the model's initial decision $\hat{Y}$ match the deterministic implication of its own generated mediator $\hat{M}$ ? ( $\hat{Y} = C(\hat{M})$ )
Strong Faithfulness ( $F_{Strong}$ ): Measures causal reliance. Does the model update its decision to match the new mediator after intervention? ( $\hat{Y}^* = C(M^*)$ )
Faithfulness Gap ( $\Delta$ ): $\Delta = F_{ID} - F_{Strong}$ . A high $\Delta$ indicates the model appears consistent initially but fails to update when the mediator is changed, revealing unfaithfulness.

C. Case Studies

To isolate the causes of unfaithfulness, the authors conducted two additional experiments:

Tool Externalization: Instead of computing the final decision $Y$ internally, the model is instructed to pass the mediator to an external tool that executes the deterministic function $C$ . This removes the confound of computational difficulty.
Instruction Strength: The prompt instructions were varied (Standard, Detailed, Max Detailed) to see if explicitly telling the model to prioritize the mediator over the input improves faithfulness.

3. Key Contributions

Causal Framework: Formulated faithfulness to structured intermediates as a causal mediation problem with a protocol for deterministic counterfactual interventions.
Systematic Evaluation: Evaluated 8 models across 3 benchmarks, establishing that apparent faithfulness is fragile.
Asymmetry Discovery: Identified that models are directionally asymmetric: they are often more sensitive to counterfactual edits (disrupting a correct structure) than to correction edits (fixing an incorrect structure).
Root Cause Analysis: Demonstrated that the lack of faithfulness is primarily due to the difficulty of executing the mediator-to-decision mapping in-context, rather than a refusal to follow instructions.

4. Key Results

A. The Faithfulness Gap

General Finding: Models exhibit high $F_{ID}$ (self-consistency) but significantly lower $F_{Strong}$ (causal reliance).
Quantitative Evidence:
- AVeriTeC: Highest apparent faithfulness ( $F_{ID} \approx 0.74$ ) but lowest strong faithfulness ( $F_{Strong} \approx 0.27$ ), resulting in a massive gap ( $\Delta \approx 0.48$ ).
- RiceChem: Moderate gap ( $\Delta \approx 0.21$ ).
- TabFact: Low baseline consistency, but the gap remains positive.
Conclusion: Intermediate structures act as influential context rather than stable causal mediators. Models often bypass the mediator, relying on direct input-to-output shortcuts.

B. Directional Asymmetry

Models are easier to disrupt (Counterfactual intervention) than to correct (Correction intervention).
In many cases, if a model generates a wrong rubric, it fails to update its score even when the rubric is externally corrected. Conversely, if a model generates a correct rubric, it is more likely to change its score if the rubric is artificially altered.

C. Impact of Tool Externalization

Major Finding: When the deterministic mapping $C$ is delegated to an external tool, the faithfulness gap ( $\Delta$ ) nearly disappears (dropping to $\le 0.03$ in most cases).
Implication: The primary cause of unfaithfulness is computational difficulty (the model struggles to sum scores or verify logic in-context), not a fundamental refusal to use the mediator. Once the calculation is offloaded, the model faithfully passes the mediator content to the tool.

D. Impact of Instruction Strength

Finding: Strengthening prompts (telling the model to prioritize the mediator) yields minimal gains (e.g., $F_{Strong}$ increases only slightly from 0.27 to 0.32 on AVeriTeC).
Implication: Unfaithfulness is not caused by ambiguous instructions or a lack of awareness of the mediator's importance.

5. Significance and Conclusion

Redefining Faithfulness: The paper challenges the assumption that generating structured outputs guarantees faithful reasoning. It shows that "faithfulness" is often an artifact of self-consistency rather than causal dependence.
Practical Implication for SGR: Schema-guided pipelines are fragile. If the intermediate structure is modified (e.g., by a human editor or a bug), the model's final decision may not update accordingly, leading to inconsistent or hallucinated outcomes.
Solution Pathway: The most effective way to ensure faithfulness in current LLMs is tool externalization. By separating the reasoning generation from the deterministic decision logic, the "faithfulness gap" is resolved.
Limitations: The study relies on datasets with gold-standard intermediate structures (rare in real-world data) and moderate-sized open-source models.

Final Takeaway: Intermediate structures in current LLMs function as contextual signals rather than causal bottlenecks. To achieve true faithfulness, the deterministic logic connecting the structure to the decision must be externalized to tools, rather than relying on the model's internal reasoning capabilities.