Here is an explanation of the paper "Agentic Critical Training" (ACT), broken down into simple concepts with creative analogies.
The Big Problem: The "Parrot" vs. The "Detective"
Imagine you are teaching a robot butler how to clean a house.
The Old Way (Imitation Learning):
You show the robot a video of a human expert cleaning. The robot watches and says, "Okay, I see. The human picked up the cup, walked to the sink, and put it down."
- The Flaw: The robot is just a parrot. It memorized the moves, but it doesn't understand why those moves work. If the cup is slippery and falls, or if the sink is full, the robot doesn't know what to do. It just keeps trying to put the cup in the sink, even if it's already full, because that's what it saw in the video. It has no concept of "good" vs. "bad" actions; it only knows "copy the human."
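The "parrot" flaw above can be caricatured in a few lines of code. This is a toy sketch of my own (the trajectory, policy, and state dictionary are illustrative assumptions, not anything from the paper): the policy replays memorized expert actions and never consults the current state.

```python
# Toy caricature of imitation learning's flaw: replay the memorized expert
# trajectory regardless of whether the current state still makes it sensible.
# (Illustrative sketch only, not the paper's actual training code.)

expert_trajectory = ["pick up cup", "walk to sink", "put cup in sink"]

def imitation_policy(step: int, state: dict) -> str:
    # Note: `state` is never consulted -- that is the whole flaw.
    return expert_trajectory[step % len(expert_trajectory)]

state = {"sink": "full"}
print(imitation_policy(2, state))  # "put cup in sink" -- even though the sink is full
```

Because the state is ignored, the same action comes out whether the sink is empty or overflowing.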
The "Early Experience" Attempt:
Researchers tried to fix this by making the robot watch the expert, then watch a wrong action (like trying to put the cup in the fridge), and then read a script that says, "The sink was better because..."
- The Flaw: The robot is now a script-reader. It memorized the text of the explanation. It didn't actually learn to think; it just learned to recite the right answer when asked. If the situation changes slightly, it gets confused because it's just reciting a script, not reasoning.
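The script-reader flaw comes down to what the training target is. Here is a toy sketch (the `script_loss` function and example strings are my own illustration, not the actual objective from any paper): when the target is the explanation text itself, any rewording is penalized, even if the reasoning is equally valid.

```python
# Toy caricature of the "Early Experience" flaw: the training target is the
# explanation *text*, so the model is effectively graded on reciting a script.
# (Illustrative assumption, not the actual training objective.)

script = "The sink was better because it is where dirty cups go."

def script_loss(model_output: str) -> int:
    # Word-level mismatch count against the fixed script.
    target = script.split()
    output = model_output.split()
    return sum(a != b for a, b in zip(target, output)) + abs(len(target) - len(output))

# A correct but differently worded explanation still scores badly:
paraphrase = "The sink is the right place because cups get washed there."
```

Only verbatim recitation gets zero loss here; a perfectly sound paraphrase is punished, which is exactly why the robot ends up memorizing rather than reasoning.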
The New Solution: Agentic Critical Training (ACT)
The authors propose a new method called Agentic Critical Training (ACT). Instead of making the robot copy actions or read scripts, they turn it into a Judge or a Detective.
How it Works (The Analogy)
Imagine a cooking competition.
- The Setup: The robot is shown two options for the next step in a recipe.
- Option A (The Expert): "Add salt to the soup."
- Option B (The Robot's Guess): "Add sugar to the soup."
- The Task: The robot isn't asked to cook yet. It is asked to critique. It must look at both options and decide: "Which one is better, and why?"
- The Reward:
- If the robot correctly picks Option A and explains why (e.g., "Soup needs salt, not sugar"), it gets a point.
- If it picks Option B, it gets zero points.
- Crucially: The robot isn't told what to say in its explanation. It has to figure out the reasoning itself to win the point.
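The reward scheme above can be sketched as a toy function. The function name, signature, and reward values below are illustrative assumptions, not the paper's exact formulation; the key property it demonstrates is that only the verdict is scored, never the wording of the rationale.

```python
def critique_reward(chosen_option: str, expert_option: str) -> float:
    """Toy version of ACT's critique reward (illustrative sketch).

    The agent earns reward only for judging correctly -- picking the
    expert's action -- not for reproducing any particular explanation.
    """
    return 1.0 if chosen_option == expert_option else 0.0

# The agent writes its own free-form rationale; only the verdict is scored.
rationale = "Soup needs salt, not sugar."
reward = critique_reward(chosen_option="Add salt to the soup",
                         expert_option="Add salt to the soup")
# reward == 1.0: correct judgment, regardless of how the rationale is worded
```

Because the rationale never appears in the reward computation, the robot is free to phrase its reasoning however it likes, which is what pushes it to build genuine internal logic rather than recite a script.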
The Magic Result: "Genuine" Thinking
Because the robot is rewarded for getting the judgment right (not for copying a specific sentence), it is forced to build its own internal logic. It learns:
- "Oh, I see. Putting the cup in the sink works because the sink is empty. Putting it in the fridge fails because the fridge is for cold things."
- It develops critical reasoning. It learns to evaluate the quality of an action before doing it.
Why This is a Game-Changer
The paper tested this on three different "worlds":
- ALFWorld: A text-based house cleaning game.
- WebShop: An online shopping simulator.
- ScienceWorld: A chemistry lab simulator.
The Results:
- Better at Tasks: Robots trained with ACT were much better at completing tasks than those trained by just copying (Imitation Learning) or just guessing.
- Handling Mistakes (The "Loop" Breaker):
- Old Robot: If it tries to open a locked door and fails, it tries again. And again. And again. It gets stuck in an infinite loop of failure.
- ACT Robot: It tries, fails, and then its internal "Judge" says, "Wait, that didn't work. The door is locked. I need to find a key first." It breaks the loop and finds a new solution.
- The "Superpower" (General Reasoning):
- This is the most surprising part. The robot was only trained on house cleaning and shopping tasks. It never saw a math problem or a science quiz.
- However, when tested on hard math problems (like the MATH-500 benchmark), the ACT robot performed better than the original robot.
- Why? Because the "Judge" muscle it built while deciding between "put cup in sink" vs. "put cup in fridge" is the same muscle used to decide between "Option A" vs. "Option B" in a math problem. It learned how to think, not just what to do.
Summary Metaphor
- Imitation Learning is like a student who memorizes the answer key. If the test question changes slightly, they fail.
- Early Experience is like a student who memorizes the teacher's explanation. If the teacher explains it differently, they fail.
- Agentic Critical Training (ACT) is like a student who is forced to grade their own practice tests. They have to figure out why an answer is right or wrong. By the time they take the real exam, they aren't just reciting answers; they are thinking critically, which helps them solve problems they've never seen before.
In short: ACT teaches AI agents not just to do, but to judge. And by learning to judge, they become smarter, more flexible, and better at solving complex problems.