When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

Imagine you have a team of three brilliant chefs (Large Language Models, or LLMs) trying to write a long, complex recipe together. Each chef has their own unique way of chopping vegetables and measuring spices (their tokenizers).

Usually, when you want the best result, you ask all three chefs to taste the dish at every single step and vote on the next ingredient. This is called Ensembling.

However, the paper "When to Ensemble" discovers a major problem with this approach: If you ask them to vote on every single word, the recipe often turns into a disaster.

Here is the simple breakdown of why that happens and how the authors' new method, SAFE, fixes it.

The Problem: The "Bad Ingredient" Trap

Imagine Chef A wants to write the word "Sofia."

Chef A sees it as three small pieces: So, fi, a.
Chef B sees it as one big chunk: Sofia.

If the team votes on the first piece, So, and decides to move on, Chef B is now confused. Chef B was expecting the whole word Sofia to appear at once, but instead, they are forced to continue from just So. To Chef B, So looks like a weird, broken fragment (an "OOV-like token").

Because Chef B is confused, they might start hallucinating. Instead of finishing the word with fia, they might write ~A or repeat the same weird letter over and over. Once one chef gets confused, the whole team's output starts to degrade, especially in long stories or complex math problems.

The Old Way: The team votes on every single word. This happens so often that the chefs get confused constantly, leading to gibberish.

The Solution: The SAFE Method

The authors propose a new system called SAFE (Stable And Fast LLM Ensembling). Think of SAFE as a smart Project Manager who knows exactly when to call a meeting and when to let the chefs cook alone.

SAFE uses a Draft-and-Verify strategy, similar to how a writer might draft a paragraph and then have an editor check it.

1. The "Drafter" (The Fast Writer)

One chef (the best one) is chosen to write a whole chunk of the text quickly, say 5 words at a time, without stopping to ask the others.

Analogy: The lead writer types out a sentence: "The quick brown fox jumps."

2. The "Verifiers" (The Quality Checkers)

The other chefs look at that chunk of text. They don't rewrite it; they just check two things:

Check 1: Is the text broken? (Tokenization Mismatch)
- They ask: "If we continue from this word, will it confuse any of us?"
- If the word is Sofia but Chef B only sees So, they flag it. No voting happens here. The team just accepts the Drafter's word to avoid confusion.
Check 2: Do we all agree? (Consensus)
- They ask: "Do we already agree that this is the right word?"
- If everyone agrees the word is correct, no voting happens here. Why waste time?
- If they disagree (e.g., one chef thinks it's "fox" and another thinks it's "box"), THEN they stop and vote.

3. The "Ensemble" (The Vote)

Only at the specific points where the text is safe for every chef to continue from AND the chefs disagree does the team stop, combine their knowledge, and pick the best word.

Analogy: The team only gathers to discuss the tricky parts of the recipe, not the easy parts like "add salt."

Why is this "SAFE"?

Stability (No More Gibberish): By skipping the votes on words that might confuse the tokenizers, the team avoids the "broken ingredient" problem. The text flows naturally without weird typos or repetitions.
Speed (Fast & Efficient): Since they only vote on a tiny fraction of the words (sometimes less than 1%), the process is almost as fast as just using one chef. They don't waste time voting on obvious words.
Sharpening the Vote: Sometimes, when they do vote, the results are too "mushy" (everyone is 50% sure of two different things). SAFE uses a trick called Probability Sharpening to force the team to pick the most confident answer, like a referee blowing a whistle to make a final call.

The Real-World Result

The paper tested this on hard math problems and logic puzzles.

Old Method: Tried to vote on every word. Result: Slow, and the math answers were often wrong because the models got confused by the token mismatches.
SAFE Method: Only voted on the critical moments. Result: Faster (almost as fast as a single model) and Smarter (higher accuracy), even when the models had very different ways of processing language.

In a nutshell: Don't ask a committee to vote on every single step of a journey. Let the leader drive, and only stop to consult the group when the road gets tricky or the leader is unsure. That is how you get a stable, fast, and accurate journey.

1. Problem Statement

While ensembling Large Language Models (LLMs) by aggregating their next-token probability distributions has proven effective for short-form answers and multiple-choice questions, its application to long-form generation (e.g., Chain-of-Thought reasoning) remains problematic. The paper identifies two critical issues with existing "ensemble-at-every-token" methods:

Instability (Tokenization Mismatch): Different LLMs often use different tokenizers. When an ensemble selects a token that is natural for one model but cuts through a word that another model would tokenize differently, the selected token becomes an "OOV-like" (Out-of-Vocabulary-like) token for that other model.
- Example: If Model A generates "So" and Model B tokenizes "Sofia" as a single token, feeding "So" to Model B forces it to predict the next token based on an unnatural prefix. This corrupts the probability distribution, leading to error propagation (e.g., generating "˜A" instead of "fia"), which degrades output quality in long sequences.
Inefficiency: Standard ensembling requires aligning probability distributions across different vocabularies at every generation step. This alignment is computationally expensive, and performing it for every token in a long sequence significantly increases inference latency, negating the speed benefits of using smaller models.
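The tokenization mismatch described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two vocabularies, the greedy longest-match tokenizer, and the helper `is_oov_like` are hypothetical and are not the paper's detection procedure or any real model's tokenizer. The idea is only that a prefix is risky for a model when it ends in the middle of what that model would treat as a single token.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def boundaries(tokens):
    """Character offsets at which each token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def is_oov_like(prefix_len, full_text, vocab):
    """True if cutting full_text at prefix_len splits a token of this vocab."""
    return prefix_len not in boundaries(greedy_tokenize(full_text, vocab))


# Hypothetical vocabularies: model A splits "Sofia", model B keeps it whole.
VOCAB_A = {"So", "fi", "a"}
VOCAB_B = {"Sofia", "So"}

tokens_a = greedy_tokenize("Sofia", VOCAB_A)  # ['So', 'fi', 'a']
tokens_b = greedy_tokenize("Sofia", VOCAB_B)  # ['Sofia']

# Stopping after "So" (2 characters) is natural for A but cuts B's token.
risky_for_a = is_oov_like(2, "Sofia", VOCAB_A)  # False
risky_for_b = is_oov_like(2, "Sofia", VOCAB_B)  # True
```

In this toy setting, forcing model B to continue from "So" is exactly the situation the paper calls OOV-like: B has the piece "So" in its vocabulary, but it would never have produced it as the start of "Sofia".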

2. Methodology: SAFE Framework

The authors propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles models only at specific token positions where it is both safe and necessary. SAFE operates on a Generate-Verify-Ensemble cycle, inspired by speculative decoding but adapted for heterogeneous tokenizers.

Roles

Drafter ($M_{draft}$): The best-performing model among the ensemble. It generates a lookahead sequence of $n$ tokens.
Verifiers ($M_{ver}$): The remaining models. They do not generate autoregressively but verify the Drafter's tokens in a single forward pass.

The Three-Step Cycle

Generate: The Drafter generates a sequence of tokens ($t_i, \dots, t_{i+n-1}$).
Verify: The Verifiers examine the Drafter's tokens to determine if ensembling is required. Ensembling is triggered only if two conditions are met for a specific token $t_j$:
- Condition 1: No OOV-like Token. The preceding token ($t_{j-1}$) must not be an OOV-like token. The system checks if the token boundary up to $t_j$ aligns with the tokenization boundaries of all Verifiers. If a Verifier cannot tokenize the prefix validly, ensembling is skipped to prevent distribution corruption.
- Condition 2: Lack of Consensus. The token $t_j$ must not be the most confident token in the ensemble distribution. The system checks for consensus without full alignment:
  - Unanimous Consensus: All Verifiers agree $t_j$ is the most probable token.
  - High Average Probability: The average probability of $t_j$ across all models exceeds 0.5.
  - If either holds, ensembling is skipped to save computation.
Ensemble: If a token meets both conditions (i.e., it does not follow an OOV-like token and it lacks consensus), the system performs ensembling:
- It aggregates the probability distributions of all models (aligned to a shared vocabulary).
- Probability Sharpening: To counteract the "smoothing" effect of averaging (which can lower confidence), SAFE applies a sharpening strategy. This either reallocates probability mass from variant subword tokens to a common prefix or uses a geometric mean instead of an arithmetic mean to concentrate probability on tokens supported by all models.
- The most confident token from the sharpened distribution replaces the Drafter's token.
- KV Cache Management: The system updates the Key-Value (KV) caches of all models to align with the new ensembled token, ensuring consistency for the next generation step.
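The decision made in the Verify step can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `should_ensemble`, the representation of each model's next-token distribution as a plain dictionary keyed by token string, and the `prev_is_oov_like` flag (assumed to be computed elsewhere) are all hypothetical. Only the two consensus tests and the 0.5 threshold come from the description above.

```python
def top_token(dist):
    """Most probable token in a {token: probability} distribution."""
    return max(dist, key=dist.get)


def should_ensemble(token, prev_is_oov_like, drafter_dist, verifier_dists,
                    threshold=0.5):
    """Decide whether to ensemble at the position of the drafter's token.

    Ensembling is skipped if the preceding token is OOV-like for some
    verifier (Condition 1 fails) or if the models already agree
    (Condition 2 fails).
    """
    # Condition 1: never ensemble right after an OOV-like token.
    if prev_is_oov_like:
        return False

    # Condition 2a: all verifiers rank the drafter's token first.
    unanimous = all(top_token(d) == token for d in verifier_dists)

    # Condition 2b: average probability of the token across all models.
    all_dists = [drafter_dist] + list(verifier_dists)
    avg_prob = sum(d.get(token, 0.0) for d in all_dists) / len(all_dists)

    # Either form of consensus means ensembling would not change much.
    if unanimous or avg_prob > threshold:
        return False
    return True


drafter = {"fox": 0.6, "box": 0.4}
agreeing = [{"fox": 0.7, "box": 0.3}, {"fox": 0.55, "box": 0.45}]
disagreeing = [{"box": 0.8, "fox": 0.2}, {"box": 0.7, "fox": 0.3}]

skip_on_consensus = should_ensemble("fox", False, drafter, agreeing)
vote_on_conflict = should_ensemble("fox", False, drafter, disagreeing)
skip_after_oov = should_ensemble("fox", True, drafter, disagreeing)
```

Both consensus tests only need each model's probability for the single drafted token, which is why they can be evaluated without aligning full vocabularies.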

3. Key Contributions

Identification of Critical Factors: The paper establishes that tokenization mismatch and distribution consensus are the two governing factors for successful long-form ensembling.
SAFE Algorithm: A novel "Generate-Verify-Ensemble" framework that:
- Prevents OOV-like token injection, ensuring stability in long sequences.
- Skips unnecessary ensembling operations when models agree, drastically improving efficiency.
- Introduces Probability Sharpening to handle smooth ensemble distributions.
KV Cache Optimization: A specific implementation strategy to maintain cache consistency when tokens are replaced during ensembling, a challenge previously unaddressed in ensemble settings.
Plug-and-Play Compatibility: SAFE can be integrated with existing ensemble methods (like UniTE or GaC) simply by adding the verification logic.

4. Experimental Results

The authors evaluated SAFE on diverse benchmarks (MMLU-redux, MATH500, GSM8K, BBH, ARC-Challenge) using 7B and 32B scale models with heterogeneous tokenizers (e.g., InternLM, Qwen, EXAONE).

Accuracy:
- Existing methods (like UniTE) often degrade performance in Chain-of-Thought (CoT) settings due to OOV-like token accumulation.
- SAFE + UniTE outperformed individual models and the baseline UniTE in 9 out of 15 experimental configurations.
- In math datasets, SAFE achieved significant gains (e.g., +4.2% on MATH500 with UniTE) while ensembling fewer than 5% of tokens.
Efficiency:
- SAFE reduced the number of ensembling operations (E/T ratio) to <20% of tokens (often <5% in math tasks).
- Latency: SAFE achieved inference speeds comparable to running a single model, even for long sequences, whereas standard ensembling was significantly slower.
- The KV cache management strategy further reduced latency compared to approaches without cache alignment.
Ablation Studies:
- Sharpening: Geometric mean sharpening yielded the best results, though the heuristic approach was also effective.
- Sequence Length: A drafter sequence length of 5 tokens provided the best balance between capturing tokenization differences and maintaining efficiency.
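To see why a geometric mean concentrates probability compared with an arithmetic mean, a small worked example may help. This is a generic sketch of the two averaging rules with renormalization, assuming distributions already aligned to shared token strings; the function names and the numbers are made up for illustration, and the paper's exact sharpening procedure may differ.

```python
import math


def arithmetic_mean(dists):
    """Average the probabilities of each token across models."""
    tokens = set().union(*dists)
    return {t: sum(d.get(t, 0.0) for d in dists) / len(dists) for t in tokens}


def geometric_mean(dists):
    """Normalized geometric mean: a token needs support from every model."""
    tokens = set().union(*dists)
    raw = {
        t: math.prod(d.get(t, 0.0) for d in dists) ** (1.0 / len(dists))
        for t in tokens
    }
    total = sum(raw.values())
    if total == 0.0:
        # No token is supported by all models; fall back to the plain average.
        return arithmetic_mean(dists)
    return {t: v / total for t, v in raw.items()}


# Two hypothetical models that agree only on "fox".
model_1 = {"fox": 0.6, "box": 0.4}
model_2 = {"fox": 0.5, "cat": 0.5}

smooth = arithmetic_mean([model_1, model_2])
sharp = geometric_mean([model_1, model_2])
```

Here the arithmetic mean leaves "fox" at 0.55 with the rest spread over "box" and "cat", while the normalized geometric mean puts all of the mass on "fox", the only token both models support.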

5. Significance

This paper addresses a fundamental bottleneck in LLM ensembling: the trade-off between the performance gains of combining models and the stability/efficiency costs of doing so in long-form generation.

Practicality: It demonstrates that LLM ensembling is viable for complex reasoning tasks (CoT) without requiring massive computational overhead or sacrificing output quality.
Robustness: By solving the tokenization mismatch problem, SAFE makes ensembling robust across models with different architectures and training data.
Deployment: The "Plug-and-Play" nature and efficiency gains suggest that SAFE is a practical step toward deploying robust, high-performance LLM systems in real-world applications where reliability and speed are critical.