The Big Problem: The "Cheat Sheet" vs. The "Real Understanding"
Imagine you are taking a math test.
- Student A actually learned how to do long division. They understand the logic.
- Student B didn't learn the math. Instead, they memorized the answers to the specific practice problems the teacher gave them.
If the test questions look exactly like the practice problems, both students get 100%. Standard AI evaluation (like "Accuracy") is like a teacher who only looks at the final answer. They see two 100% scores and assume both students are geniuses.
But if you give them a new problem that requires actual logic, Student A will solve it, and Student B will fail miserably.
The Problem: Current AI models are often like Student B. They are incredibly good at spotting patterns and memorizing data, but they might not actually understand the rules of the task. Standard tests can't tell the difference between a model that "knows" and a model that is just "guessing based on tricks."
The Solution: The "Mechanistic" Detective
The authors of this paper propose a new way to test AI called Symbolic-Mechanistic Evaluation.
Instead of just checking the final answer (the "What"), they want to open up the AI's brain and check how it got there (the "How"). They treat the AI like a machine with gears and circuits, and they want to verify that the right gears are turning.
Think of it like a mechanic checking a car:
- Standard Test: Does the car drive from Point A to Point B? (Yes/No).
- Mechanistic Test: Did the engine actually turn the wheels, or did someone just push the car while the engine was off?
The Experiment: The "Database Translator"
To prove their point, the researchers created a specific test using a task called NL-to-SQL (translating English questions into database commands).
They trained two identical AI models:
- The "Honest" Model: This model was given the database "blueprint" (the schema) so it could learn the real rules of how to translate the question.
- The "Cheater" Model: This model was not given the blueprint. It had to guess the answers based only on the English words, hoping to memorize patterns.
The Shocking Result:
When they tested both models on new questions:
- The "Cheater" model got 93.5% of the answers right!
- The "Honest" model got 99.1% right.
To a standard observer, the Cheater looks almost as smart as the Honest model. But the researchers knew the Cheater was just guessing.
The New Test: The "Rule Check"
The researchers then applied their new Symbolic-Mechanistic test. They didn't just ask, "Did you get the right answer?" They asked three specific questions about the AI's internal brain activity:
Rule 1 (The Sensitivity Check): "If I change a tiny word in your instructions, does your answer change?"
- Analogy: If you tell a chef, "Add salt," and then change it to "Add pepper," a real chef changes the dish. A robot that just memorized "Add salt" might ignore the change.
- Result: The Cheater model barely cared when words changed. The Honest model reacted strongly.
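The chef analogy above can be sketched in a few lines. This is a minimal illustration, not the paper's actual procedure: the two "models" are hypothetical stand-in functions, and the sensitivity score is simply the fraction of small input edits that change the output.

```python
# Toy sketch of the "sensitivity check" (Rule 1).
# Both model functions are illustrative stand-ins, not real trained models.

def honest_model(question: str) -> str:
    # Reads the actual words, so edits to the input change the output.
    ingredient = question.split()[-1]          # e.g. "salt" or "pepper"
    return f"SELECT * FROM dishes WHERE seasoning = '{ingredient}'"

def cheater_model(question: str) -> str:
    # Ignores the details and replays a memorized answer.
    return "SELECT * FROM dishes WHERE seasoning = 'salt'"

def sensitivity(model, original: str, perturbations: list[str]) -> float:
    """Fraction of small input edits that actually change the output."""
    base = model(original)
    changed = [model(p) != base for p in perturbations]
    return sum(changed) / len(changed)

original = "Add salt"
edits = ["Add pepper", "Add cumin", "Add paprika"]

print(sensitivity(honest_model, original, edits))   # 1.0: reacts to every edit
print(sensitivity(cheater_model, original, edits))  # 0.0: ignores every edit
```

A high score means the model is actually reading the instructions; a score near zero is the signature of a memorized answer.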
Rule 2 (The Localization Check): "Can we pinpoint exactly where in your brain this decision happened?"
- Analogy: If you fix a specific gear in a clock, does the clock start working again? If the fix works, it means the problem was in that specific gear, not the whole machine.
- Result: The Honest model had a specific "gear" (a layer in the neural network) that handled the database rules. The Cheater model was messy and scattered.
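The "fix one gear" idea corresponds to a technique called activation patching: run the model on a corrupted input, splice in one layer's activation from a clean run, and see how much of the correct answer comes back. The sketch below is a toy, residual-style stand-in (each "layer" just adds a number to the output); all names and values are illustrative assumptions, not the paper's code.

```python
# Toy sketch of the "localization check" (Rule 2) via activation patching.

def run(x, layers, patch=None):
    """Sum layer contributions; optionally overwrite one layer's contribution."""
    total = 0.0
    for i, f in enumerate(layers):
        c = f(x)
        if patch is not None and patch[0] == i:
            c = patch[1]          # splice in the value from another run
        total += c
    return total

def max_recovery(layers, clean_x=1.0, corrupt_x=0.0):
    """How much of the clean output the *best single layer* can restore."""
    clean_out = run(clean_x, layers)
    corrupt_out = run(corrupt_x, layers)
    recoveries = []
    for i, f in enumerate(layers):
        patched = run(corrupt_x, layers, patch=(i, f(clean_x)))
        recoveries.append((patched - corrupt_out) / (clean_out - corrupt_out))
    return max(recoveries)

# "Honest" model: only layer 1 reads the schema signal x (one clear gear).
honest = [lambda x: 0.0, lambda x: x, lambda x: 0.0]
# "Cheater" model: the signal is smeared across every layer (no single gear).
cheater = [lambda x: x / 3, lambda x: x / 3, lambda x: x / 3]

print(max_recovery(honest))   # 1.0: one layer fully restores the answer
print(max_recovery(cheater))  # ~0.33: no single layer does
```

If one layer restores nearly all of the answer, the mechanism is localized; if every layer restores only a sliver, the computation is scattered.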
Rule 3 (The Consistency Check): "Do you use the same brain-gear for every single question?"
- Analogy: A real driver uses the same steering wheel for every turn. A confused driver might grab the wheel, then the radio, then the window, depending on the moment.
- Result: The Honest model used the same "gear" every time. The Cheater model was inconsistent.
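The consistency check reduces to a simple question: across many inputs, is the causally responsible layer always the same one? A minimal sketch, where the per-question "best layer" lists are made-up stand-ins for the output of a localization probe:

```python
# Toy sketch of the "consistency check" (Rule 3).
from collections import Counter

def consistency(best_layers: list[int]) -> float:
    """Fraction of questions whose best layer matches the most common one."""
    (modal_layer, count), = Counter(best_layers).most_common(1)
    return count / len(best_layers)

honest_best = [1, 1, 1, 1, 1, 1, 1, 1]      # same "gear" every time
cheater_best = [0, 2, 1, 3, 0, 2, 1, 0]     # a different gear each question

print(consistency(honest_best))   # 1.0
print(consistency(cheater_best))  # 0.375 (layer 0 appears 3 of 8 times)
```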
The Verdict
When they ran these new tests:
- Standard Score: The Cheater scored 93.5%, nearly matching the Honest model's 99.1%.
- Mechanistic Score: The Cheater only passed the "real understanding" checks 59% of the time, while the Honest model passed 76% of the time.
The new test revealed that the Cheater was actually failing the core logic of the task, even though it looked perfect on the surface.
Why This Matters
This paper argues that we need to stop just looking at Accuracy (the final score) and start looking at Mechanism (how the model thinks).
- For Safety: In high-stakes fields like medicine or law, we can't just hope the AI gets the right answer by luck. We need to know it followed the correct reasoning steps.
- For the Future: As AI gets better at mimicking human answers, we need "mechanic's tests" to ensure the engine is actually running, not just that the car is moving.
In short: Don't just ask, "Did you get an A?" Ask, "Did you actually learn the material, or did you just memorize the cheat sheet?" This new method gives us the tools to find out.