Imagine you have a brilliant student named LLM (Large Language Model). This student has read almost every book in the library, so they know a lot. But they have two big problems:
- They don't know today's news: Their knowledge is frozen in time (like a textbook from 2023).
- They are bad at math: They are great at writing essays but terrible at working through complex calculations in their head.
To fix this, we want to teach the student to use tools: a Search Engine (to find fresh info) and a Python Calculator (to do the math).
The Old Way: The Expensive Tutor
Traditionally, to teach the student these skills, schools used a two-step process:
- Supervised Fine-Tuning (SFT): You hire a team of expensive human tutors to write thousands of examples showing the student exactly how to use the tools. "First, write this tag. Then, ask the search engine this question. Then, read the answer."
- The Problem: This is incredibly expensive and slow. You need a massive library of perfect examples before the student can even start learning.
- Reinforcement Learning (RL): Once the student knows the basics from the tutors, you let them practice on their own, giving them a gold star (reward) when they get the right answer.
The New Way: ICRL (The "Shadowing" Method)
The paper introduces ICRL (In-Context Reinforcement Learning). Think of this as a smarter, cheaper way to train the student without hiring a massive army of tutors.
Here is how ICRL works, using a Video Game Analogy:
1. The "Training Wheels" Phase (Few-Shot)
Imagine you are teaching someone to ride a bike. Instead of writing a 50-page manual (SFT), you put training wheels on the bike.
- In ICRL, these "training wheels" are examples (demonstrations) pasted right into the student's prompt.
- Example: "Here is how I solved a similar problem: I thought, then I searched, then I answered."
- The student watches these examples and tries to copy the pattern while playing the game (generating answers).
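The "training wheels" step above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the demonstration texts, the tag names (`<think>`, `<search>`, `<answer>`), and the `build_prompt` helper are all hypothetical stand-ins for whatever format the authors use.

```python
# Hypothetical worked examples pasted into the prompt as "training wheels".
# Tag names and demo contents are illustrative assumptions, not the paper's format.
DEMONSTRATIONS = [
    (
        "Question: Who wrote the novel Dune?\n"
        "<think>I should look this up rather than guess.</think>\n"
        "<search>author of the novel Dune</search>\n"
        "<result>Dune (1965) was written by Frank Herbert.</result>\n"
        "<answer>Frank Herbert</answer>"
    ),
    (
        "Question: What is 17 * 24?\n"
        "<think>This needs arithmetic, so I should use the calculator.</think>\n"
        "<search>17 * 24</search>\n"
        "<result>408</result>\n"
        "<answer>408</answer>"
    ),
]

def build_prompt(question: str, num_demos: int) -> str:
    """Paste `num_demos` worked examples above the new question."""
    parts = DEMONSTRATIONS[:num_demos] + [f"Question: {question}"]
    return "\n\n".join(parts)
```

With `num_demos=2` the model sees both worked examples before the new question; with `num_demos=0` it sees only the question, which is exactly the "training wheels off" end state described below.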
2. The "Practice" Phase (Reinforcement Learning)
The student plays the game.
- If they use the search engine correctly and get the right answer, they get a Gold Star (Reward).
- If they mess up the format or get the wrong answer, they get a Time-Out (Penalty).
- Crucially, the student learns by doing, not just by memorizing a manual.
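The gold-star/time-out scheme above amounts to a simple reward function. The sketch below is an assumption about how such a reward could look, not the paper's exact scoring rule: the `<answer>` tag format and the specific reward values are hypothetical.

```python
import re

def reward(response: str, gold_answer: str) -> float:
    """Hypothetical reward: gold star for a correct answer,
    time-out for breaking the required format."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return -1.0  # time-out: the response broke the format
    if match.group(1).strip().lower() == gold_answer.strip().lower():
        return 1.0   # gold star: well-formed and correct
    return 0.0       # well-formed but wrong answer
```

Because the signal only checks the final outcome, the model is free to discover for itself *when* a search or a calculation helps, which is the "learning by doing" point made above.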
3. The "Gradual Release" (The Curriculum)
This is the magic trick.
- Start: The student sees 3 examples (training wheels are on).
- Middle: After a few days of practice, you remove one example. Now they only see 2. They have to rely a bit more on their own brain.
- End: You remove all examples. The training wheels are gone. The student is now riding the bike completely on their own, having internalized the skill.
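The gradual release can be written as a tiny schedule that decides how many examples to show at each training step. This is a sketch under stated assumptions: the paper may well use a different decay shape, so the linear phase split here is purely illustrative.

```python
def demos_at(step: int, total_steps: int, start_demos: int = 3) -> int:
    """Hypothetical curriculum: split training into equal phases and
    drop one in-context example per phase, ending with zero."""
    phase_len = total_steps / (start_demos + 1)
    phase = min(int(step / phase_len), start_demos)
    return start_demos - phase
```

Early in training the prompt carries all three examples; by the final phase it carries none, matching the start/middle/end description above.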
Why is this a Big Deal?
- It's Cheaper: You don't need thousands of human-written examples. You just need a few good ones to start, and the AI learns the rest through trial and error.
- It's Smarter: Because the AI learns by doing (RL) rather than just copying (SFT), it becomes better at figuring out when to use a tool, not just how.
- It Works Everywhere: The paper tested this on:
- Web Search: Answering tricky questions that need up-to-date info (like "Who won the game last night?").
- Math: Using a calculator to solve hard math problems.
- Results: The AI using ICRL beat the "Old Way" (SFT + RL) on almost every test, even though it never saw a single human-written example of how to solve the specific test questions!
The Bottom Line
ICRL is like teaching a child to cook.
- Old Way: You write a 100-page cookbook, make them memorize it, and then let them cook.
- ICRL Way: You stand next to them, show them how to chop an onion once or twice, and let them try. If they burn the onions, you say "Ouch, try again." If they make a great soup, you say "Yum!" Slowly, you step back until they are cooking a gourmet meal all by themselves, without needing the cookbook anymore.
This method makes AI smarter, faster to train, and much cheaper to run.