Inference-time Alignment in Continuous Space

This paper introduces Simple Energy Adaptation (SEA), a novel inference-time alignment algorithm. Rather than searching over a fixed set of sampled responses as existing discrete methods do, SEA directly adapts the base policy's response toward an optimal one via gradient-based sampling in a continuous latent space, achieving significant performance gains on benchmarks like AdvBench and MATH.

Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng

Published 2026-03-17

The Big Problem: The "Lottery Ticket" Approach

Imagine you have a large language model (like a very smart but sometimes mischievous robot) and you want to make sure it gives you safe, truthful, and helpful answers.

Currently, most methods work like a lottery (a strategy formally known as best-of-N sampling).

  1. You ask the robot a question.
  2. The robot generates 64 different answers (tickets).
  3. You have a "Judge" (a reward model) look at all 64 tickets and pick the best one.

The Flaw: If the robot is bad at math or safety, or if you only have time to generate 10 tickets, you might just get 10 bad tickets. You can't pick a winner if there are no winners in the pile. This is called searching in a discrete space (picking from a fixed list of separate options).
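The "lottery" procedure above is easy to sketch in code. This is a toy illustration, not the paper's implementation: `best_of_n` is a hypothetical helper, and the model and judge are stand-ins (a random number generator and a score that prefers 42).

```python
import random

def best_of_n(generate, reward, prompt, n=64):
    """Best-of-N ("lottery") selection: sample n candidate answers,
    score each with the reward model, and keep the top-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: the "model" guesses numbers, the "judge" prefers 42.
random.seed(0)
generate = lambda prompt: random.randint(0, 100)
reward = lambda answer: -abs(answer - 42)

best = best_of_n(generate, reward, "what is the answer?", n=64)
```

Note the flaw baked into the code: `max` can only return something that is already in `candidates`. If none of the 64 samples is any good, the judge has nothing good to pick.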

The New Solution: "Simple Energy Adaptation" (SEA)

The authors propose a new method called SEA. Instead of buying 64 lottery tickets and hoping one is a winner, SEA is like navigating a ship toward a lighthouse.

The Analogy: The Foggy Mountain vs. The Compass

The Old Way (Discrete Search):
Imagine you are in a thick fog on a mountain. You want to find the highest peak (the best answer).

  • Old Method: You take 64 random steps in 64 different directions. You shout out, "Which of these 64 spots is the highest?" If you didn't happen to step near the peak, you fail. If the mountain is huge and you are a slow walker (a weak model), you will never find the peak.

The New Way (SEA - Continuous Optimization):
Imagine you have a magical compass that points uphill (toward the best answer).

  • SEA Method: You start at your current spot. Instead of jumping randomly, you look at the compass (the gradient from the reward model). It tells you, "The ground slopes up that way." You take a small step in that direction. Then you check the compass again and take another step.
  • You keep doing this, sliding smoothly up the hill until you reach the very top. You aren't guessing; you are optimizing your path.
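The compass analogy is plain gradient ascent. Here is a minimal one-dimensional sketch (the hill, its slope, and the `climb` helper are all illustrative, not from the paper): the "compass" is the slope of the reward, and each step nudges the current position uphill.

```python
def climb(reward_slope, x, step=0.1, n_steps=100):
    """Follow the 'compass' (the gradient): repeatedly nudge x uphill."""
    for _ in range(n_steps):
        x = x + step * reward_slope(x)  # small step in the uphill direction
    return x

# A hill with its peak at x = 3: reward(x) = -(x - 3)^2, so slope = -2(x - 3).
slope = lambda x: -2.0 * (x - 3.0)
peak = climb(slope, x=-10.0)  # start far from the peak; converges near 3
```

Unlike the lottery, the starting point barely matters: even from x = -10, each step shrinks the distance to the peak, which is the "weak robot still finds the top" property described below.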

How It Works (The "Secret Sauce")

The paper introduces a few clever tricks to make this "sliding up the hill" possible for a computer:

  1. The "Soft" Version: Computers usually speak in discrete words (like "cat" or "dog"). But to slide smoothly up a hill, you need a continuous surface. SEA temporarily turns the robot's output into "soft" numbers (probabilities) instead of hard words. This creates a smooth landscape where the robot can slide.
  2. The Energy Function: Think of "Energy" as "Badness." The robot wants to minimize its energy (be as good as possible). The reward model acts like a gravity well, pulling the robot toward the "low energy" (high quality) zone.
  3. The Iterative Dance: The robot starts with a rough answer. It then runs a loop (like a dance) where it:
    • Looks at the "slope" of the reward.
    • Adjusts its answer slightly to go "uphill" (better).
    • Repeats this for a fixed number of steps (say, 10, 20, or 50) until the answer stops improving.
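The three tricks above can be combined into one toy sketch. This is an illustrative simplification, not the paper's actual algorithm: the "reward model" here is just a fixed goodness score per vocabulary item (so the gradient has a closed form via the softmax Jacobian), whereas a real reward model would be differentiated with autograd. `sea_refine` and `token_goodness` are made-up names.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sea_refine(logits, token_goodness, step=1.0, n_steps=50):
    """Toy sketch of the iterative loop: relax hard tokens into 'soft'
    probabilities, treat energy = -reward, and repeatedly step the logits
    downhill in energy (uphill in reward)."""
    z = logits.copy()
    for _ in range(n_steps):
        p = softmax(z)                          # trick 1: soft tokens
        expected = (p * token_goodness).sum(-1, keepdims=True)
        grad = p * (token_goodness - expected)  # slope of the reward
        z += step * grad                        # trick 3: one step uphill
    return softmax(z).argmax(-1)                # snap back to hard tokens

# 3-word vocabulary; word 2 is the "best"; start biased toward word 0.
goodness = np.array([0.0, 1.0, 5.0])
start = np.array([[3.0, 0.0, 0.0], [3.0, 0.0, 0.0]])  # 2 positions
refined = sea_refine(start, goodness)
```

The loop never samples new candidates; it reshapes the one answer it has, sliding every position of the sequence toward higher reward at once.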

Why Is This Better?

The paper shows that SEA beats the old "Lottery" methods in three major ways:

  • It works even with weak robots: If the base model is bad, the "Lottery" method needs millions of tickets to find a good one. SEA just needs a few steps up the hill. It doesn't matter how bad the starting point is; the compass will guide it to the top.
  • It fixes "Shallow" Safety: Sometimes, robots say "I can't do that" at the start but then give you the bad instructions anyway (like a polite liar).
    • Analogy: The old method only checks the first few words.
    • SEA: Because it optimizes the whole response at once, it ensures the entire answer is safe, not just the opening. It fixes the "deep" alignment problem.
  • It's efficient: Generating 64 full answers takes a lot of computer power. SEA generates one answer and refines it. It's like editing a draft 10 times vs. writing 10 different drafts.

The Results

When the researchers tested this:

  • Safety: On a test of harmful requests (AdvBench), SEA reduced harmful answers by 77% compared to the second-best method.
  • Math: On hard math problems, it improved accuracy by 16%.
  • Truthfulness: It was more effective than prior methods at stopping the robot from making up facts.

Summary

Inference-time Alignment is about fixing a robot's behavior while it is talking, without retraining it from scratch.

  • Old Way: Throw darts at a board 64 times and hope one hits the bullseye.
  • SEA: Use a laser-guided missile that adjusts its flight path in real-time to hit the bullseye perfectly.

It's a simple, elegant shift from guessing and checking to guiding and refining, making AI safer and smarter without needing a massive computer upgrade.
