Imagine you are a master chef (the Target Model) trying to cook a complex meal. You are incredibly talented, but you are also very slow because you have to taste and adjust every single ingredient one by one before moving to the next. This is how current AI models generate text: word by word, very carefully.
To speed this up, you hire a fast, energetic sous-chef (the Draft Model). The sous-chef quickly guesses the next 5 or 10 ingredients you might need and lines them up on the counter. You then glance at them. If your master chef intuition agrees, you accept the whole batch instantly. If the sous-chef guessed wrong partway through, you keep everything up to the first mistake and toss the rest.
The speed of your kitchen depends entirely on how often the sous-chef guesses correctly. This is called the Acceptance Rate.
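The kitchen loop above can be sketched in a few lines. This is a toy, greedy-acceptance version (real speculative sampling accepts draft tokens probabilistically, comparing the two models' probabilities); the tiny vocabulary and both "models" are made up for illustration:

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_guess(prefix, k):
    # Toy sous-chef: a hypothetical stand-in for a small, fast draft model
    # that proposes the next k tokens in one go.
    return [random.choice(VOCAB) for _ in range(k)]

def target_next(prefix):
    # Toy master chef: a deterministic stand-in for the big target model.
    return VOCAB[len(prefix) % len(VOCAB)]

def speculative_step(prefix, k=5):
    """One round of speculative decoding (greedy-acceptance sketch):
    accept the draft's guesses until the first mismatch, at which point
    the target model supplies the correct token itself."""
    accepted = list(prefix)
    for tok in draft_guess(prefix, k):
        if tok == target_next(accepted):
            accepted.append(tok)          # sous-chef guessed right: keep it
        else:
            accepted.append(target_next(accepted))  # wrong: target fixes it
            break
    return accepted[len(prefix):]         # tokens produced this round
```

The acceptance rate is simply the fraction of drafted tokens that survive this check; the more that survive, the fewer slow target-model rounds you need per word of output.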
The Problem: The "Good Enough" Trap
For a long time, the way we trained these sous-chefs was like this:
"Sous-chef, try to copy my exact recipe book as closely as possible."
In math terms, this is called minimizing KL Divergence (Kullback-Leibler divergence, a measure of how different two probability distributions are). It's like asking the sous-chef to memorize your entire cookbook.
- The Theory: If the sous-chef copies you perfectly, they will guess every word right, and you'll be super fast.
- The Reality: The sous-chef is small and has a tiny brain (limited computing power). They can't memorize your whole 1,000-page cookbook. So, they try their best to copy the overall style of the book.
- The Glitch: Sometimes, the sous-chef copies the style so well that they look like a perfect student, but they still guess the wrong specific words for your current dish. They minimized the "difference in style" but failed to maximize the "number of correct guesses."
It's like a student who memorizes the vibe of a history textbook but fails the specific multiple-choice questions. They look smart, but they don't get the points.
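The "good enough" trap can be shown with three made-up next-word distributions. Below, draft `q_a` copies the target's overall shape closely (lower KL), yet backs the wrong top word; draft `q_b` exaggerates (higher KL) but backs the right one. The numbers are invented purely to illustrate the mismatch:

```python
import math

def kl(p, q):
    # KL(p || q): how badly q "copies" the target distribution p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def top(dist):
    # Index of the most probable word.
    return max(range(len(dist)), key=lambda i: dist[i])

p   = [0.40, 0.35, 0.25]   # target model's distribution over three words
q_a = [0.34, 0.36, 0.30]   # hugs p closely overall -> lower KL
q_b = [0.60, 0.20, 0.20]   # cruder copy -> higher KL, but right top word

# q_a "looks like a perfect student" by the KL grade...
assert kl(p, q_a) < kl(p, q_b)
# ...yet its top guess disagrees with the target's, so it gets rejected.
assert top(q_a) != top(p) and top(q_b) == top(p)
```

Under KL grading, `q_a` is the better student; under acceptance-rate grading, `q_b` is, because its guess actually survives the master chef's check.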
The Solution: LK Losses (The "Direct Hit" Strategy)
The authors of this paper say: "Stop asking the sous-chef to copy the whole book. Just ask them to guess the next word correctly."
They introduced a new training method called LK Losses. Instead of saying, "Be like me," they say, "If you guess this word, you get a point. If you don't, you get zero."
They offer two ways to do this:
1. The "All-or-Nothing" Coach (Likelihood-Based)
This coach looks at the sous-chef's guesses and says: "I don't care how close your grammar is to mine. I only care if the word you picked is the one I would have picked. If it is, great! If not, try again."
- Analogy: It's like a dartboard. You don't care if the dart landed near the bullseye; you only care if it hit the bullseye. This forces the sous-chef to focus entirely on the high-probability targets.
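One way to render "only the bullseye counts" in code is to grade the draft solely on the probability it assigns to the target's top word. This is a hypothetical likelihood-style objective for illustration; the paper's exact LK loss may be formulated differently:

```python
import math

def kl_loss(p, q):
    # "Copy my whole book": penalize any difference from the full distribution.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def direct_hit_loss(p, q):
    # "Hit the bullseye": only the probability the draft puts on the
    # target's top word matters. (Hypothetical likelihood-style loss.)
    bullseye = max(range(len(p)), key=lambda i: p[i])
    return -math.log(q[bullseye])

p   = [0.40, 0.35, 0.25]   # target's distribution
q_a = [0.34, 0.36, 0.30]   # low KL, but wrong top word
q_b = [0.60, 0.20, 0.20]   # higher KL, right top word

# The two coaches rank the sous-chefs in opposite order:
assert kl_loss(p, q_a) < kl_loss(p, q_b)            # KL coach prefers q_a
assert direct_hit_loss(p, q_b) < direct_hit_loss(p, q_a)  # hit coach prefers q_b
```

Training against the direct-hit grade pushes the sous-chef's probability mass toward the words the master chef will actually pick, which is what the acceptance rate measures.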
2. The "Smart Hybrid" Coach (Adaptive Blending)
This is the paper's superstar. This coach is very smart about when to use which strategy.
- Early Training (The "Learning to Walk" Phase): When the sous-chef is a total beginner and guessing randomly, the coach says, "Okay, just try to copy my style (KL Divergence) so you don't get totally lost." This gives the sous-chef a smooth path to follow.
- Late Training (The "Pro" Phase): Once the sous-chef is decent, the coach switches tactics. "Okay, you know the style. Now, stop worrying about looking like me and start worrying about hitting the target (Acceptance Rate)."
- Analogy: Think of it like learning to drive. First, you learn the rules of the road and how to steer (copying the teacher). Once you're comfortable, the instructor stops caring about your steering technique and only cares if you stay in your lane and avoid hitting the curb (the actual goal).
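The hybrid coach can be sketched as a weighted mix of the two grades that shifts over training. The linear schedule below is an assumption for illustration; the paper's adaptive rule may instead react to how well the draft model is already doing:

```python
def blended_loss(kl_term, hit_term, progress):
    """Hypothetical adaptive blend of the two objectives.

    `progress` runs from 0.0 (first training step) to 1.0 (last step).
    Early on the KL "copy my style" term dominates, giving the beginner
    a smooth path; later the "direct hit" term takes over.
    """
    alpha = 1.0 - progress                    # weight on the KL term
    return alpha * kl_term + (1.0 - alpha) * hit_term

# Same raw losses, graded at two points in training:
early = blended_loss(kl_term=2.0, hit_term=5.0, progress=0.1)  # mostly KL
late  = blended_loss(kl_term=2.0, hit_term=5.0, progress=0.9)  # mostly hits
```

At `progress=0.1` the blend is 90% KL, so the beginner is judged on style; by `progress=0.9` it is 90% direct-hit, so only the target word matters.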
Why This Matters
The paper tested this on massive AI models (some as big as 685 billion parameters) and found that:
- It works everywhere: Whether the AI is writing code, solving math problems, or chatting, the "Direct Hit" strategy makes the sous-chef guess correctly more often.
- It helps the small guys most: The smaller, weaker sous-chefs (low-capacity models) benefited the most. They couldn't memorize the whole book, so forcing them to focus on the specific "next word" was a game-changer.
- It's free: The new method doesn't slow training down or require extra computing power. It's just a different way of grading the student.
The Bottom Line
The paper fixes a flaw in how we train AI assistants to be faster. Instead of training them to be perfect copies of the big brain (which is impossible for small brains), we train them to be sharpshooters who hit the right answer more often.
By switching from "Copy my style" to "Hit the target," the AI can generate text 8% to 10% faster on average, making our interactions with AI feel much snappier and more responsive.