Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents

Imagine you are hiring a brilliant but very expensive consultant to help you solve a complex mystery. This consultant has two modes:

The "Deep Thinker" Mode: They spend hours analyzing every clue, cross-referencing databases, and writing a 50-page report. This is incredibly accurate but costs a fortune in time and money.
The "Quick Glance" Mode: They look at the clue and give an answer in 10 seconds. This is cheap and fast, but they might miss a crucial detail.

The Problem:
Most AI agents (the "consultants" of the digital world) currently work like this: They either use "Deep Thinker" mode for every single step of a task (wasting money on easy steps like "open the door") or they use "Quick Glance" mode for everything (saving money but failing the hard parts like "solve the algebra puzzle").

The Solution: ARES
The paper introduces ARES (Adaptive Reasoning Effort Selection). Think of ARES as a smart project manager who sits next to the consultant.

Here is how ARES works in everyday terms:

1. The Smart Project Manager (The Router)

ARES, rather than the consultant, decides how hard to think. Before the consultant takes each step, ARES looks at the current situation and asks: "Do we really need a 50-page report for this, or is a quick glance enough?"

Scenario A (Easy Step): The agent needs to click a link to open a website.
- ARES says: "Easy peasy! Use Quick Glance mode." (Saves money).
Scenario B (Hard Step): The agent needs to figure out why a flight booking failed or navigate a confusing website maze.
- ARES says: "This is tricky. Switch to Deep Thinker mode immediately." (Ensures accuracy).

2. How Did We Teach the Project Manager?

You can't just tell a project manager to "guess." You have to train them. The authors created a special training pipeline:

Phase 1: The Gold Standard: First, they let the consultant work in "Deep Thinker" mode to solve the whole mystery perfectly. This gives them the "correct answer key."
Phase 2: The Stress Test: They go back through the steps one by one. For each step, they ask: "Could the consultant have solved this specific step using 'Quick Glance' mode and still gotten it right?"
- If yes, they label that step as "Easy."
- If no, they label it as "Hard."
Phase 3: The "Why" (Rationale): They don't just teach the manager what to do; they teach them why. The manager learns to say, "I'm choosing 'Quick Glance' because this is just opening a door," or "I'm choosing 'Deep Thinker' because this involves complex logic."

3. The Results: Getting the Best of Both Worlds

When they tested ARES, the results were like finding a magic switch that saved money without losing quality:

The "Always Deep Thinker" approach: Spent a lot of money (tokens) to get a high score.
The "Always Quick Glance" approach: Spent very little money but failed the hard tasks.
The "Random" approach: Sometimes worked, sometimes failed.
ARES: It spent about half the money (up to 52.7% fewer reasoning tokens) while scoring about as well as the expensive "Deep Thinker" mode. In some cases, it even did better, possibly because it avoided "overthinking" simple tasks, which can confuse the AI.

The Big Picture Analogy

Imagine you are driving a car across the country.

Old Way: You drive at 200 mph (Deep Thinker) the whole time, burning tons of gas, even when you are just turning into your driveway. Or, you drive at 10 mph (Quick Glance) the whole time, which saves gas but means you'll never make it to the destination on time.
ARES Way: You have a smart autopilot. It drives at 200 mph on the open highway (hard steps) but slows down to 30 mph in the parking lot (easy steps). You arrive on time, but you used way less fuel.

In summary: ARES is a system that teaches AI agents to know when to think hard and when to think fast. It stops them from wasting energy on easy tasks while ensuring they don't get lazy on the hard ones, making AI cheaper and faster without sacrificing smarts.

1. Problem Statement

Modern Large Language Model (LLM) agents achieve high accuracy in complex, multi-step tasks by utilizing extended Chain-of-Thought (CoT) reasoning. However, this comes at a substantial inference cost due to the accumulation of reasoning tokens at every step.

The Dilemma: While many state-of-the-art LLMs support configurable "reasoning levels" (e.g., High/Medium/Low), current strategies are suboptimal:
- Static Low-Effort: Applying low-effort modes uniformly across all steps leads to significant performance degradation (e.g., ~20% drop in accuracy) because complex steps require deep reasoning.
- Static High-Effort: Applying high-effort modes to every step is computationally wasteful, as simple steps (e.g., opening a URL) do not require intensive reasoning.
- Random Selection: Randomly choosing effort levels fails to preserve accuracy or provide consistent cost reductions.
The Gap: Existing model routing methods often route tasks between different models (heterogeneous), incurring overhead and losing KV cache reuse. There is a lack of frameworks that dynamically select reasoning effort within a single model to balance efficiency and accuracy in multi-turn agent trajectories.

2. Methodology: The ARES Framework

ARES (Adaptive Reasoning Effort Selection) is a plug-and-play framework that employs a lightweight Reasoning-Effort Router to dynamically predict the optimal reasoning level (Low, Medium, High) for each step of an agent's trajectory based on interaction history.

A. Core Architecture

Router: A small, lightweight LLM (e.g., Qwen3-1.7B) that takes the task query, interaction history, and current observation as input.
Output: The router generates a rationale (analyzing task complexity and progress) followed by a predicted effort level $e_t \in \{\text{low}, \text{mid}, \text{high}\}$.
Integration: The predicted effort level is passed to the main Agent LLM (e.g., gpt-oss-20b) to configure its reasoning depth for the next action. This allows for KV cache reuse across different effort levels within the same model, avoiding the latency of switching models.
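The control flow described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `router`, `agent`, and `env_step` callables, their signatures, and the fallback to "high" on a malformed prediction are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

EFFORT_LEVELS = ("low", "mid", "high")


@dataclass
class Step:
    observation: str
    rationale: str
    effort: str
    action: str


# Hypothetical interfaces (not from the paper):
#   router(task, history, observation) -> (rationale, effort)
#   agent(task, history, observation, effort) -> action
#   env_step(action) -> (next_observation, done)
Router = Callable[[str, List[Step], str], Tuple[str, str]]
Agent = Callable[[str, List[Step], str, str], str]
EnvStep = Callable[[str], Tuple[str, bool]]


def run_episode(task: str, first_observation: str, router: Router,
                agent: Agent, env_step: EnvStep, max_steps: int = 20) -> List[Step]:
    """Ask the router for an effort level before every agent step."""
    history: List[Step] = []
    observation = first_observation
    for _ in range(max_steps):
        rationale, effort = router(task, history, observation)
        if effort not in EFFORT_LEVELS:
            effort = "high"  # assumption: fall back to the most careful level
        action = agent(task, history, observation, effort)
        history.append(Step(observation, rationale, effort, action))
        observation, done = env_step(action)
        if done:
            break
    return history


# Toy stand-ins so the loop can be exercised without any model.
def toy_router(task, history, observation):
    if "error" in observation:
        return ("Unexpected failure; needs diagnosis.", "high")
    return ("Routine navigation step.", "low")


def toy_agent(task, history, observation, effort):
    return f"act[{effort}] on {observation}"


def make_toy_env(next_observations):
    remaining = iter(next_observations)

    def env_step(action):
        nxt = next(remaining, None)
        return (nxt or "", nxt is None)

    return env_step


steps = run_episode("book a flight", "home page", toy_router, toy_agent,
                    make_toy_env(["booking error page"]))
print([s.effort for s in steps])
```

Because the same agent is called at every step and only the effort argument changes, a real deployment could keep one model instance (and its KV cache) alive across steps, which is the property the Integration paragraph highlights.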

B. Training Pipeline

The training process involves three distinct phases to generate high-quality supervision data:

Phase 1: Trajectory Collection:
- Collect successful trajectories using the Agent LLM at the maximum effort level (High).
- Select the most concise successful trajectory to serve as the "Ground Truth" path, minimizing noise from unnecessary steps.
Phase 2: Reasoning Effort Annotation (Labeling):
- Decompose the successful trajectory into individual steps.
- For each step, test the Agent LLM at Low, Medium, and High effort levels multiple times (e.g., $K=3$ trials).
- Verification: Use an LLM judge to verify if the action produced at a lower effort level is functionally equivalent to the ground truth action.
- Labeling: Assign the lowest sufficient effort level that reliably reproduces the correct action. If no level works, the step is discarded.
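The annotation rule in Phase 2 can be sketched as follows. This is an illustrative reading of the description above, not the paper's code: `run_agent` and `is_equivalent` are hypothetical callables standing in for the Agent LLM and the LLM judge, and interpreting "reliably" as "all K trials succeed" is an assumption (the threshold is exposed as `min_successes`).

```python
from typing import Callable, Optional

EFFORT_ORDER = ("low", "mid", "high")  # cheapest first


def label_step(run_agent: Callable[[str, int], str],
               is_equivalent: Callable[[str, str], bool],
               reference_action: str,
               k: int = 3,
               min_successes: Optional[int] = None) -> Optional[str]:
    """Return the lowest effort level that reproduces the reference action.

    run_agent(effort, trial) -> action        (hypothetical agent call)
    is_equivalent(action, reference) -> bool  (hypothetical judge call)
    Returns None when no level is sufficient, meaning the step is discarded.
    """
    if min_successes is None:
        min_successes = k  # assumption: "reliably" means every trial succeeds
    for effort in EFFORT_ORDER:
        successes = sum(
            1 for trial in range(k)
            if is_equivalent(run_agent(effort, trial), reference_action)
        )
        if successes >= min_successes:
            return effort
    return None


# Toy agent: only succeeds once it is given at least medium effort.
def toy_agent(effort: str, trial: int) -> str:
    return "click(cancel)" if effort == "low" else "click(submit)"


def exact_match(action: str, reference: str) -> bool:
    return action == reference


print(label_step(toy_agent, exact_match, "click(submit)"))
```

Iterating from the cheapest level upward means the first level that passes is, by construction, the minimum sufficient one.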
Phase 3: Rationale Generation:
- A powerful teacher model generates a semantic justification (rationale) for why a specific effort level is appropriate for a given step, considering task complexity and current progress.
- This creates a dataset of $(\text{History}, \text{Observation}, \text{Rationale}, \text{Effort Label})$ tuples.
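One way to picture a record in this dataset, and how it might be turned into a prompt/target pair for fine-tuning, is sketched below. The field names follow the tuple above; the prompt wording and the `<rationale>`/`<effort>` tags are hypothetical and are not taken from the paper. The only property carried over from the source is that the rationale precedes the effort label in the router's output.

```python
from dataclasses import dataclass
from typing import List, Tuple

VALID_LABELS = ("low", "mid", "high")


@dataclass
class RouterExample:
    history: List[str]   # earlier steps, rendered as text
    observation: str     # current observation
    rationale: str       # teacher-written justification
    effort_label: str    # minimum sufficient effort from Phase 2


def to_sft_pair(example: RouterExample) -> Tuple[str, str]:
    """Render one example as (prompt, target); the template is illustrative only."""
    if example.effort_label not in VALID_LABELS:
        raise ValueError(f"unknown effort label: {example.effort_label!r}")
    history_text = "\n".join(example.history) or "(no previous steps)"
    prompt = (
        f"History:\n{history_text}\n\n"
        f"Observation:\n{example.observation}\n\n"
        "Choose a reasoning effort level."
    )
    # Rationale first, label second, mirroring the router's output order.
    target = (
        f"<rationale>{example.rationale}</rationale>\n"
        f"<effort>{example.effort_label}</effort>"
    )
    return prompt, target


example = RouterExample(
    history=["opened search page"],
    observation="results list shown",
    rationale="Routine click on the first result.",
    effort_label="low",
)
prompt, target = to_sft_pair(example)
print(target)
```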

C. Optimization Strategies

Supervised Fine-Tuning (SFT): The router is fine-tuned to predict the rationale and the effort label using the generated dataset.
Reinforcement Learning (RL): To address the limitations of SFT (which treats steps independently), the authors employ Group Relative Policy Optimization (GRPO).
- Reward Function: A composite reward $R(\tau)$ includes:
  1. Outcome Reward ( $R_{out}$ ): High positive reward for task success.
  2. Cost Reward ( $R_{cost}$ ): A penalty that grows with the effort level chosen at each step (e.g., -1.0 for High, -0.2 for Low), normalized by trajectory length.
  3. Format Reward ( $R_{form}$ ): Penalty for violating output templates.
- Filtering: Only prompts where the agent succeeds in every rollout but the rollouts vary widely in cost are used for RL, so that the router learns to reduce cost without sacrificing success.
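The reward and filtering rules above can be sketched under some stated assumptions. The summary gives only the High (-1.0) and Low (-0.2) penalties, so the Medium penalty, the success reward magnitude, the format penalty, the unweighted sum of the three terms, and the cost-variance threshold below are all placeholder values chosen for illustration. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# -1.0 (high) and -0.2 (low) come from the text; -0.6 (mid) is an assumption.
EFFORT_PENALTY = {"low": -0.2, "mid": -0.6, "high": -1.0}
SUCCESS_REWARD = 10.0   # assumption: "high positive reward", magnitude unknown
FORMAT_PENALTY = -1.0   # assumption: magnitude unknown


def cost_reward(efforts: Sequence[str]) -> float:
    """R_cost: summed per-step penalties, normalized by trajectory length."""
    if not efforts:
        return 0.0
    return sum(EFFORT_PENALTY[e] for e in efforts) / len(efforts)


def trajectory_reward(success: bool, efforts: Sequence[str],
                      format_ok: bool = True) -> float:
    """R(tau) = R_out + R_cost + R_form (unweighted sum, an assumption)."""
    r_out = SUCCESS_REWARD if success else 0.0
    r_form = 0.0 if format_ok else FORMAT_PENALTY
    return r_out + cost_reward(efforts) + r_form


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_for_rl(rollouts: Sequence[Tuple[bool, Sequence[str]]],
                min_cost_std: float = 0.05) -> bool:
    """Keep a prompt only if every rollout succeeds and costs vary enough."""
    all_succeed = all(success for success, _ in rollouts)
    costs = [cost_reward(efforts) for _, efforts in rollouts]
    return all_succeed and pstdev(costs) >= min_cost_std


rollouts = [(True, ["low", "low"]), (True, ["high", "high"])]
rewards = [trajectory_reward(s, e) for s, e in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards, advantages, keep_for_rl(rollouts))
```

In this toy group both rollouts succeed, so the only difference in reward is cost, and the cheaper rollout receives the positive advantage. That is the situation the filtering step is designed to select for.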

3. Key Contributions

Dynamic Effort Selection: Proposes the first framework for per-step, intra-model reasoning effort selection tailored for multi-step agent tasks, moving beyond static configurations.
Efficient Training Pipeline: Develops a novel data generation pipeline that identifies the minimum sufficient reasoning effort for each step via multi-trial verification, decoupling effort selection from error propagation.
KV Cache Preservation: Unlike heterogeneous model routing, ARES maintains the same model instance, allowing for KV cache reuse and eliminating the latency/cost of re-encoding context.
Rationale-Driven Routing: Demonstrates that forcing the router to generate a reasoning rationale before selecting an effort level significantly improves selection accuracy compared to direct classification.

4. Experimental Results

The framework was evaluated on three diverse benchmarks: TAU-Bench (Tool Use), BrowseComp-Plus (Deep Research), and WebArena (Web Navigation).

Performance vs. Cost Trade-off:
- Token Reduction: ARES reduces reasoning token usage by up to 52.7% compared to fixed High-effort strategies.
- Accuracy: ARES maintains task success rates comparable to (and sometimes exceeding) fixed High-effort baselines.
  - TAU-Bench Retail: 54.8% accuracy (matching High baseline) with ~35% fewer tokens.
  - BrowseComp-Plus: 41.3% accuracy (near High baseline of 42.7%) with ~42% fewer tokens.
  - WebArena: 46.5% accuracy, actually outperforming the fixed High-effort baseline (45.0%), suggesting High effort can cause "overthinking" in web navigation.
RL Impact: The RL phase further improved results. In the TAU-Bench Airline domain, RL increased accuracy from 36.0% to 42.0% while reducing token consumption by nearly 80% compared to the SFT-only version.
Generalization: A router trained with a 20B-parameter agent backbone transferred to a 120B-parameter backbone, achieving 65.2% accuracy (vs. 67.8% for fixed High), which suggests that the learned routing patterns carry over across model scales.
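To make the accuracy trade-off easier to compare across settings, the gaps to the fixed High-effort baseline can be computed directly from the figures quoted above. This is simple arithmetic on the reported numbers. The TAU-Bench Retail baseline of 54.8% is inferred from the phrase "matching High baseline", and the dictionary keys are labels chosen for this example.

```python
# (ARES accuracy %, fixed-High accuracy %), as quoted in the results above.
reported = {
    "TAU-Bench Retail": (54.8, 54.8),  # baseline inferred from "matching"
    "BrowseComp-Plus": (41.3, 42.7),
    "WebArena": (46.5, 45.0),
    "120B backbone": (65.2, 67.8),
}


def accuracy_gap(ares: float, high: float) -> float:
    """Percentage-point difference (ARES minus fixed High), rounded to 0.1."""
    return round(ares - high, 1)


gaps = {name: accuracy_gap(ares, high) for name, (ares, high) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
```

A positive value means ARES scored higher than the fixed High-effort baseline in that setting; a negative value means it scored lower.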

5. Significance

Economic Efficiency: ARES offers a practical way to lower the cost of deploying reasoning LLM agents, cutting reasoning token usage by up to roughly half while keeping task success close to the fixed High-effort baseline.
Paradigm Shift: It shifts the focus from "more reasoning is always better" to "context-aware reasoning," recognizing that complex environments require variable cognitive depth.
Scalability: The ability to generalize across model sizes suggests that the principles of adaptive effort selection are fundamental to agent design, not just specific to a single model architecture.
Future Direction: The work lays the groundwork for more sophisticated agent behaviors, such as multi-modal reasoning and self-correction mechanisms that dynamically allocate computational resources.
