When Does Critique Improve AI-Assisted Theoretical… — Plain-Language Explanation

Original authors: Vasilis Niarchos, Constantinos Papageorgakis, Alexander G. Stapleton, Sokratis Trifinopoulos

Published 2026-05-11

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to solve a very difficult, graduate-level physics problem (like calculating how particles interact or how strings vibrate). You have a smart AI assistant, but it sometimes gets stuck or makes mistakes. The paper asks a simple question: If you have a second AI act as a "critic" to review and correct the first AI's work, does that actually help? And if so, how should that second AI behave?

To find out, the authors built a system called SCALAR. Think of it as a three-person team working on a math test:

The Actor (The Student): This is the AI trying to solve the problem.
The Critic (The Teaching Assistant): This AI looks at the Student's work, finds errors, and gives feedback.
The Judge (The Teacher): This AI sits outside the conversation, looks at the final answer, and gives it a grade based on a strict rubric. It doesn't talk to the Student or the TA; it just grades the result.

The Experiment: How the Critic Behaves Matters

The researchers tested different "personalities" for the Student and different "teaching styles" for the Critic.

The Student's Personality: They tried telling the AI it was an expert, telling it it was a novice, or leaving the instruction blank. They also varied its reasoning style (meticulous, physical, skeptical, or unspecified).
The Critic's Style: They tried different ways of giving feedback, plus a default setting:
- Pedagogical: Asking guiding questions (Socratic method).
- Lenient: Being gentle and accepting partial progress.
- Strict: Pointing out every single error precisely.
- Adversarial: Aggressively challenging every claim.

What They Found

1. Talking back and forth is better than a one-shot guess.
Just like a human student improves when they get feedback and try again, the AI "Student" almost always got a better score when it was allowed to have a conversation with the "Critic" rather than just giving one answer. The multi-turn dialogue fixed errors that the first attempt missed.

2. The "Expert" Persona is a myth.
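The authors tested whether telling the AI "You are a genius" made it smarter. It didn't. Whether the AI was prompted to be an expert, a novice, or just itself, the results were basically the same. For the large DeepSeek model, all 12 persona combinations landed within about 5 points of each other, which is no more than the normal run-to-run variation.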

3. The Critic's style depends on the Student.
This is the most important finding. The "best" way for the Critic to talk depends on which pair of AI models is playing the Student and the Critic.

For a lightweight Student coached by a stronger Critic ("Haiku" with "Sonnet"): The Critic worked best when it was constructive, meaning pedagogical, lenient, or default. Strict and adversarial feedback produced lower average scores.
For a Student and Critic from the same family ("DeepSeek"): The Critic's style mattered much less. Whether the Critic was strict, lenient, or neutral, scores and pass rates were statistically about the same. There was a slight tilt toward lenient feedback, and harsh feedback never reliably helped.

4. Bigger isn't always a magic bullet.
They tested a small version of the DeepSeek model (8 billion parameters) and a large version (70 billion parameters).

The bigger model was better at the "easy" physics problems.
However, on the hardest problem (an exercise from a string theory textbook), both the small and big models hit a "wall." Even with a helpful critic and extra rounds of feedback, their scores stalled at around 63 out of 100. Scaling up the model size didn't fix the hardest bottleneck.

The Big Picture

The paper concludes that if you want to use AI to help with complex scientific reasoning:

Don't just ask once: Let the AI try, get feedback, and try again.
Don't waste time on "role-playing" prompts: Telling the AI to "act like an expert" doesn't help.
Tune your feedback: If a lightweight AI is being coached by a stronger one, give it gentle, constructive feedback. If both AIs come from the same model family, the feedback style matters less, but being harsh doesn't help either.

The study suggests that the interaction between the AI and the feedback loop is more important than the specific "personality" you assign to the AI. It's not about who the AI thinks it is, but how it is guided during the process.

Title: When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning

Problem Statement
As Large Language Models (LLMs) and agentic AI systems increasingly engage in research-level tasks, a critical question arises regarding the efficacy of human-AI or AI-AI collaboration structures. While early evidence suggests LLMs can contribute to theoretical physics, mathematical discovery, and scientific workflows, the optimal structure for this collaboration remains an open question. Existing literature notes that multi-turn interactions often suffer from "sticky error states" and capability degradation, yet structured multi-agent approaches can reduce hallucinations. Furthermore, while prompt-engineering folklore suggests that assigning specific personas or feedback styles significantly alters performance, these claims have not been systematically tested on current-generation reasoning models within the specific context of theoretical physics. The authors aim to determine which interaction structures between an "Actor" (problem solver) and a "Critic" (feedback provider) effectively improve outcomes in graduate-level quantum field theory (QFT) and string theory problems.

Methodology: The SCALAR Pipeline
The authors introduce SCALAR (Structured Critic–Actor Loop for Agentic Reasoning), a controlled testbed designed as an Actor–Critic–Judge pipeline. This framework is modeled after pedagogical scaffolding (Wood et al., 1976; Vygotsky, 1978), where an AI agent attempts a problem, receives formative feedback, and is ultimately evaluated against a ground truth.
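
The attempt, feedback, and evaluation cycle described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function `run_dialogue`, the `RunRecord` container, the pass threshold of 90, the turn budget of 5, and the toy stand-ins for the three agents are all assumptions made for the example. It also assumes the Judge scores every turn and that a run stops as soon as it passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical settings: the summary does not state the pass mark or turn budget.
PASS_THRESHOLD = 90.0
MAX_TURNS = 5


@dataclass
class RunRecord:
    scores: List[float] = field(default_factory=list)  # one Judge score per turn
    passed: bool = False


def run_dialogue(
    actor: Callable[[str, str], float],
    critic: Callable[[str, float], str],
    judge: Callable[[str, float], float],
    problem: str,
    max_turns: int = MAX_TURNS,
    pass_threshold: float = PASS_THRESHOLD,
) -> RunRecord:
    """Run one Actor-Critic dialogue, with the Judge scoring each attempt."""
    record = RunRecord()
    feedback = ""  # turn 0 is the single-shot attempt, made without feedback
    for _ in range(max_turns + 1):
        attempt = actor(problem, feedback)
        score = judge(problem, attempt)  # the Judge stays outside the dialogue
        record.scores.append(score)
        if score >= pass_threshold:
            record.passed = True
            break
        feedback = critic(problem, attempt)  # the Critic never sees the score
    return record


# Toy stand-ins so the loop runs without any model calls.
def toy_actor(problem: str, feedback: str) -> float:
    # An "attempt" is just a quality number: 60 at first, +15 per revision.
    return 60.0 if not feedback else float(feedback) + 15.0


def toy_critic(problem: str, attempt: float) -> str:
    return str(attempt)  # echoes the current quality back as "feedback"


def toy_judge(problem: str, attempt: float) -> float:
    return min(attempt, 100.0)


record = run_dialogue(toy_actor, toy_critic, toy_judge, "toy problem")
print(record.scores, record.passed)
```

In a real pipeline each callable would wrap an LLM request, and the Critic and Judge would also receive the reference solution; those details are left out here.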

Roles:
- Actor: An LLM agent tasked with solving graduate-level physics problems. The Actor's behavior is modulated by a Persona, defined by two orthogonal dimensions: Expertise Level (Expert, Novice, Default) and Reasoning Style (Meticulous, Physical, Skeptical, Default). This yields 12 distinct persona configurations.
- Critic: An LLM agent that reviews the Actor's attempt, flags errors, and provides structured feedback without revealing the reference solution. The Critic's behavior is modulated by a Feedback Strategy: Adversarial, Strict, Pedagogical, Lenient, or Default.
- Judge: An independent LLM evaluator that scores the Actor's solution against a reference solution. The Judge operates outside the dialogue loop, scoring based on six dimensions: Correctness (50 pts), Mathematical Rigor, Logical Flow, Justification Quality, Completeness, and Physical Consistency (10 pts each).
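
As a quick check on the numbers in the role descriptions, the sketch below enumerates the persona grid and totals the Judge's rubric. The names `PERSONAS`, `RUBRIC_MAX`, and `total_score` are hypothetical, and treating the overall score as a plain sum of the six dimensions is an assumption that the summary implies but does not state.

```python
from itertools import product

# The two persona dimensions named in the Actor description.
EXPERTISE = ["Expert", "Novice", "Default"]
REASONING = ["Meticulous", "Physical", "Skeptical", "Default"]
PERSONAS = list(product(EXPERTISE, REASONING))

# Maximum points per rubric dimension, as listed for the Judge.
RUBRIC_MAX = {
    "Correctness": 50,
    "Mathematical Rigor": 10,
    "Logical Flow": 10,
    "Justification Quality": 10,
    "Completeness": 10,
    "Physical Consistency": 10,
}


def total_score(points: dict) -> int:
    """Sum per-dimension points (assumed aggregation), checking each cap."""
    for name, value in points.items():
        if name not in RUBRIC_MAX:
            raise KeyError(f"unknown rubric dimension: {name}")
        if not 0 <= value <= RUBRIC_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {RUBRIC_MAX[name]}")
    return sum(points.values())


print(len(PERSONAS))             # 3 expertise levels x 4 reasoning styles
print(sum(RUBRIC_MAX.values()))  # maximum attainable score
```

The grid gives 12 personas and the rubric caps add up to 100, so Correctness alone carries half of the available points.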

Experimental Setup:
- Problems: Three problems from standard textbooks were selected to test different facets of reasoning: Peskin 2.3 (Feynman propagator calculation), Peskin 4.2 (Scalar particle decay lifetime), and Polchinski 2.7 (Operator Product Expansion coefficients in CFT).
- Model Variations: The study varied the Actor model family and scale:
  - DeepSeek-R1-70B (DS70B) and DeepSeek-R1-8B (DS8B), both paired with a DS70B Critic and a QwQ-32B (QWQ) Judge.
  - Claude Haiku 4.5 paired with a Claude Sonnet 4.6 Critic and Judge.
- Metrics: Performance was measured via Mean Per-Turn Score ($\bar{s}$), Gain ($g$, the improvement from turn 0 to the final turn), and Convergence Rate ($R$, the percentage of runs achieving a passing verdict). The authors also utilized problem-normalized contrasts ($D_{\bar{s}}$, $D_R$) to isolate the effects of feedback strategies from baseline problem difficulty.
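
The three headline metrics can be computed from per-run score traces roughly as follows. This is a sketch under assumptions: the run format (a dict with `scores` and `passed`), the function names, and the toy data are invented for illustration, and pooling all turns for $\bar{s}$ is one plausible reading of "mean per-turn score" rather than the paper's confirmed definition. The problem-normalized contrasts are not reproduced, since their formula is not given here.

```python
from statistics import mean


def mean_per_turn_score(runs):
    """Average Judge score pooled over every turn of every run (assumed pooling)."""
    return mean(score for run in runs for score in run["scores"])


def mean_gain(runs):
    """Average improvement from turn 0 to the final turn of each run."""
    return mean(run["scores"][-1] - run["scores"][0] for run in runs)


def convergence_rate(runs):
    """Percentage of runs that ended with a passing verdict."""
    return 100.0 * sum(run["passed"] for run in runs) / len(runs)


# Hypothetical score traces, not data from the paper.
runs = [
    {"scores": [60, 75, 90], "passed": True},
    {"scores": [50, 55, 60], "passed": False},
    {"scores": [95], "passed": True},   # passed before any Critic feedback
    {"scores": [70, 70, 80], "passed": False},
]

print(mean_per_turn_score(runs), mean_gain(runs), convergence_rate(runs))
```

A run that passes at turn 0 contributes a gain of zero, which is why gain and convergence rate are worth reporting side by side.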

Key Results

Multi-Turn Dialogue Improves Outcomes: Across all model settings, iterative dialogue significantly improved upon single-shot attempts. For the DS70B model, the mean score increased from ~67.3 to ~80.6, closing roughly 40% of the gap to saturation. This improvement is attributed to the iterative structure rather than prompt optimization alone.
Critic Feedback Strategy is Model-Dependent:
- Asymmetric Pairing (Haiku + Sonnet): The feedback strategy had a statistically significant impact. Constructive feedback (Pedagogical, Lenient, Default) yielded higher mean scores than Strict or Adversarial strategies.
- Same-Family Pairings (DeepSeek): In settings where the Actor and Critic were from the same model family (e.g., DS70B Actor with DS70B Critic), the feedback strategy had negligible statistical effect on mean scores or convergence rates. While a slight tendency toward Lenient feedback was observed, strict or adversarial feedback was never stably beneficial.
Actor Persona Prompting is Ineffective: Varying the Actor's persona (expertise level and reasoning style) produced no measurable or consistent effect on performance for either the DeepSeek or Haiku models. The 12 persona configurations for DS70B spanned a score range of only 5 points, indistinguishable from sampling variation.
Scaling Effects and Bottlenecks: Increasing the parameter count within the DeepSeek family (from 8B to 70B) improved performance on easier problems (e.g., Peskin 4.2) but did not remove the bottleneck observed on the hardest problem (Polchinski 2.7). Score-update curves revealed that while DS70B remained in a positive-drift regime for intermediate problems, both DS8B and DS70B exhibited a "fixed point" (stagnation) near a score of 63 on Polchinski 2.7, indicating that scaling alone does not solve the hardest reasoning challenges.
Dialogue Dynamics: The authors analyzed score-update curves to identify "regimes" of interaction. Easy instances often passed before Critic feedback was needed; intermediate instances benefited from structured feedback; and hard instances often remained stuck despite additional turns.
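
The "roughly 40% of the gap to saturation" figure quoted for DS70B can be reproduced from the two reported means. The helper `gap_closed` is a hypothetical name, and taking saturation to be 100 (the rubric maximum) is an assumption.

```python
def gap_closed(single_shot: float, final: float, ceiling: float = 100.0) -> float:
    """Fraction of the distance from the single-shot score to the ceiling
    that was recovered by the end of the dialogue."""
    return (final - single_shot) / (ceiling - single_shot)


# DS70B means from the summary: ~67.3 single-shot, ~80.6 after dialogue.
fraction = gap_closed(67.3, 80.6)
print(f"{fraction:.1%}")  # prints 40.7%
```

The gain of about 13.3 points against a remaining gap of about 32.7 points gives just over 40%, consistent with the stated figure.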

Significance and Claims
The paper positions SCALAR as a controlled testbed for evaluating interaction structures in AI-driven scientific discovery. Its primary contributions are:

Empirical Validation of Interaction Structures: It demonstrates that while multi-turn dialogue is generally superior to single-shot queries, the specific mechanism of improvement depends heavily on the Actor–Critic pairing.
Refutation of Prompt Engineering Folklore: The study provides evidence that assigning specific personas to reasoning models does not reliably improve outcomes in complex scientific tasks, challenging the notion that "role-playing" is a universal lever for performance.
Conditional Value of Critique: The paper argues that the value of Critic feedback is not universal; it is most effective in asymmetric settings (lightweight Actor, strong Critic) and with constructive (lenient/pedagogical) strategies. In same-family settings, the specific feedback style matters less.
Limitations of Scaling: The results suggest that simply increasing model scale within a family improves performance on easier tasks but fails to resolve fundamental bottlenecks in harder, conceptually dense problems.

The authors conclude that for AI-assisted scientific discovery, the focus should shift from static prompt engineering (personas) to dynamic interaction design (feedback strategies and agent pairing). They note that their current setup relies on reference-conditioned Critic feedback, and future work must address how to scaffold agents for open-ended problems where the "answer" is not known in advance.
