Imagine you have a brilliant, super-smart chef (the Large Language Model, or LLM) who can write code, solve math problems, and tell stories. This chef is incredibly talented, but they have a habit: they always cook with the exact same settings.
Whether they are making a simple sandwich or a complex 10-course banquet, they always use the same heat, the same amount of salt, and the same stirring speed. Sometimes, this works fine. But often, for a tricky dish, they might be too cautious (leaving the food bland) or too chaotic (burning the kitchen).
The problem is that we, the users, usually just set these "cooking knobs" (like temperature or randomness) once at the beginning and leave them alone. We don't tell the chef, "Hey, this step is tricky, be more creative!" or "This next step is simple, just be precise."
This paper introduces a "Smart Sous-Chef" (the Decoding Adapter) that sits next to the main chef and adjusts the cooking settings in real-time.
Here is how it works, broken down into simple concepts:
1. The Problem: The "One-Size-Fits-All" Trap
Currently, when an AI generates text, it picks words based on fixed rules.
- Low Randomness (Greedy): The AI plays it safe, picking the most obvious word every time. It's like a robot reading a script. Good for facts, bad for creativity.
- High Randomness: The AI goes wild, picking surprising words. Good for poetry, bad for math.
The issue? A single math problem might need the AI to be rigid when doing simple arithmetic but creative when figuring out a new strategy. Using the same setting for the whole problem is like trying to drive a car with the gas pedal stuck in one position.
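In decoding terms, the "heat" knob is the softmax temperature. Here is a minimal sketch of temperature-scaled sampling (plain Python over a toy list of logits, no real LLM involved) that shows the two extremes the bullets above describe:

```python
import math
import random

def temperature_sample(logits, temperature, rng):
    """Sample a token index from logits after temperature scaling.

    Temperature 0 means greedy decoding (always the argmax);
    a high temperature flattens the distribution, so unlikely
    tokens get picked much more often.
    """
    if temperature == 0:
        # Greedy: always the single most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

rng = random.Random(0)
logits = [4.0, 1.0, 0.5]  # token 0 is clearly favoured
greedy = temperature_sample(logits, 0, rng)          # always index 0
hot = [temperature_sample(logits, 5.0, rng) for _ in range(1000)]
print(greedy)
print(len(set(hot)))  # at high temperature, multiple tokens show up
```

The fixed-setting trap is exactly this: whatever `temperature` you pass in at the start is what every single token gets.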
2. The Solution: The "Smart Sous-Chef"
The authors created a tiny, lightweight AI (the Adapter) whose only job is to watch the main chef and tweak the settings. They didn't retrain the main chef (which is expensive and slow); they just trained this new, tiny assistant.
The assistant learns two ways to help:
A. The "Big Picture" Strategist (Sequence-Level)
Before the chef starts cooking a dish, this strategist looks at the recipe (the prompt) and the budget (how much time/compute we have).
- Scenario: "We have a huge budget and a hard math problem."
- Action: The strategist says, "Let's try a chaotic, high-temperature approach to explore many different solutions!"
- Scenario: "We have a tiny budget and a simple question."
- Action: The strategist says, "Let's be super precise and stick to the most likely answer."
It picks one strategy for the whole task, like choosing the right tool for the job before starting.
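The sequence-level idea can be sketched as a tiny policy that maps a difficulty estimate and a compute budget to one temperature for the entire task. This is a hand-written caricature, not the paper's learned adapter; the function name, thresholds, and numbers are invented for illustration:

```python
def pick_sequence_strategy(prompt_difficulty, sample_budget):
    """Hypothetical sequence-level policy: choose ONE temperature
    and sample count for the whole task, before generation starts.

    prompt_difficulty: rough score in [0, 1] (e.g. from a small classifier)
    sample_budget: how many full solutions we can afford to generate
    """
    if sample_budget == 1:
        # Only one shot: play it safe regardless of difficulty.
        return {"temperature": 0.0, "num_samples": 1}
    if prompt_difficulty > 0.5:
        # Hard problem plus spare budget: explore diverse solutions.
        return {"temperature": 1.0, "num_samples": sample_budget}
    # Easy problem: stay precise and don't burn the full budget.
    return {"temperature": 0.2, "num_samples": min(4, sample_budget)}

print(pick_sequence_strategy(0.9, 16))  # hard + big budget -> explore widely
print(pick_sequence_strategy(0.1, 16))  # easy -> precise, few samples
```

In the paper this decision is learned rather than hand-coded, but the interface is the same: look at the recipe and the budget once, then commit to a strategy.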
B. The "Micro-Manager" (Token-Level)
This is the more impressive part. The assistant watches the chef word by word.
- The "Forking" Moment: Imagine the chef is solving a math problem. For 90% of the steps, the answer is obvious (e.g., "2 + 2 ="). The assistant says, "Keep it simple, Chef. Just pick the obvious answer."
- The "Critical" Moment: Suddenly, the problem hits a tricky logic jump. The chef hesitates. The assistant notices the uncertainty and says, "Whoa, this is a fork in the road! Turn up the heat! Let's explore a few different possibilities here!"
- The Result: The AI becomes deterministic (precise) when it's sure, and stochastic (creative/exploratory) exactly when it's confused.
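One natural signal for "the chef hesitates" is the entropy of the next-token distribution: near zero when one token dominates, high at a genuine fork. The sketch below hard-codes that rule to show the switching behavior; the adapter in the paper is a learned network, and the threshold and temperature values here are made up:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: ~0 when the model is sure,
    high when probability is spread across several tokens."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def token_temperature(probs, low=0.0, high=1.2, threshold=1.0):
    """Hand-coded stand-in for the learned token-level adapter:
    stay greedy when confident, turn up the heat at a fork."""
    return high if entropy(probs) > threshold else low

confident = [0.97, 0.01, 0.01, 0.01]  # the "2 + 2 =" moment
fork = [0.30, 0.28, 0.22, 0.20]       # a genuine fork in the road
print(token_temperature(confident))   # low temperature: stay precise
print(token_temperature(fork))        # high temperature: explore
```

Run per token, this is exactly the micro-manager: deterministic on the easy 90%, exploratory precisely at the uncertain steps.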
3. How Did They Teach It? (The "Reward" System)
They didn't teach the assistant with human feedback or complex rules. They used Reinforcement Learning with Verifiable Rewards.
Think of it like training a dog:
- You don't tell the dog how to sit.
- You just say "Good boy!" when it sits correctly.
- If it fails, you say nothing.
In this paper, the "dog" is the AI. The "Good boy!" is a correct answer on a math test or a working code snippet. The assistant learns: "Every time I chose 'High Randomness' at step 5 and 'Low Randomness' at step 10, the final answer was correct. I should do that again!"
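The training signal can be caricatured as a tiny bandit loop: a binary, automatically checkable reward, plus preference scores nudged toward whichever decoding choice produced correct answers. Everything below (the success rates, the two "schedules", the update rule) is invented for illustration; the paper's actual RL setup over per-token decisions is far richer:

```python
import random

def verifiable_reward(answer, expected):
    """Binary reward: 1 for a verifiably correct answer, else 0.
    No human feedback needed -- just check the math or run the tests."""
    return 1.0 if answer == expected else 0.0

rng = random.Random(0)
# Toy stand-in for the adapter: preference scores over two schedules.
# Hypothetical numbers: "adaptive" solves this toy task 80% of the
# time, "fixed" only 30%.
prefs = {"adaptive": 0.0, "fixed": 0.0}
success_rate = {"adaptive": 0.8, "fixed": 0.3}
lr = 0.1

for _ in range(500):
    # Epsilon-greedy: mostly exploit the best schedule, sometimes explore.
    if rng.random() < 0.1:
        choice = rng.choice(list(prefs))
    else:
        choice = max(prefs, key=prefs.get)
    answer_correct = rng.random() < success_rate[choice]
    reward = verifiable_reward(answer_correct, True)
    # Reinforce choices that led to a verifiably correct answer.
    prefs[choice] += lr * (reward - prefs[choice])

print(max(prefs, key=prefs.get))  # the toy adapter learns to prefer "adaptive"
```

The "Good boy!" is the `verifiable_reward` call: no one explains *why* a schedule worked, the score for it simply drifts upward every time it ends in a correct answer.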
4. The Results: Why It Matters
The paper tested this on hard math problems (MATH dataset) and coding contests (CodeContests).
- The Win: By letting the AI switch strategies on the fly, they got significantly better results without needing more computer power or a bigger model.
- The Analogy: It's like giving a student a calculator that knows when to switch from "Standard Mode" to "Scientific Mode" automatically. The student (the model) was already smart; they just needed the right tool at the right moment.
Summary
This paper is about teaching AI to be self-aware about its own uncertainty. Instead of blindly following a fixed set of rules, the AI learns to:
- Know when to be boring and precise.
- Know when to be wild and exploratory.
- Switch between these modes instantly based on the difficulty of the specific sentence it is writing.
It turns the AI from a rigid robot into a flexible, adaptive thinker, all by adding a tiny, smart layer that decides how to think, rather than what to think.