Imagine you are a master chef (the Large Language Model) trying to cook a complex, multi-course meal for a huge banquet. The chef is incredibly talented but also very slow and expensive to run. Every time they chop an onion or stir a sauce (generating a single word), they have to stop, think, and check their recipe book. If they have to do this for a 1,000-word essay, the kitchen grinds to a halt, and the customers (users) get impatient.
To speed things up, the chef hires a fast, junior sous-chef (the Draft Model). The sous-chef is quick but not perfect. They try to guess the next few ingredients before the master chef even finishes chopping the current one.
The Problem: The "All-or-Nothing" Guess
In the old days, the sous-chef would just shout out a list of guesses: "Next is salt, then pepper, then garlic!" The master chef would check them one by one.
- If the master chef said, "Yes, salt is right," they kept going.
- But if the master chef said, "No, that's too much salt," they had to throw away everything the sous-chef guessed after that point. The pepper and garlic were wasted.
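The accept-until-first-mismatch behavior above can be sketched in a few lines of toy Python. This is an illustrative simplification, not the paper's implementation: `target_next_token` is a hypothetical stand-in for the master chef (the large model), and real systems verify all the draft tokens in a single batched forward pass rather than one at a time.

```python
# Toy sketch of chain (linear) speculative decoding.
# `target_next_token` is a hypothetical stand-in for the large model.

def verify_chain(context, draft_tokens, target_next_token):
    """Accept draft tokens left to right; stop at the first mismatch."""
    accepted = []
    for guess in draft_tokens:
        truth = target_next_token(context + accepted)
        if guess == truth:
            accepted.append(guess)   # "Yes, salt is right" -- keep going
        else:
            accepted.append(truth)   # take the target's token instead
            break                    # everything drafted after this is wasted
    return accepted

# Toy "master chef": the true continuation is a fixed sequence.
truth_seq = ["salt", "pepper", "nutmeg", "garlic"]
target = lambda ctx: truth_seq[len(ctx)]

print(verify_chain([], ["salt", "pepper", "garlic"], target))
# "salt" and "pepper" are accepted; "garlic" is rejected, and the
# target's "nutmeg" is emitted in its place
```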
Later, smart cooks invented a Tree Structure (like in the EAGLE-2 and EAGLE-3 methods). Instead of one long line of guesses, the sous-chef would branch out: "Maybe it's salt? Or maybe it's pepper? Or maybe it's garlic?" This way, if the master chef rejects "salt," they might still accept "pepper." It's like having a backup plan ready.
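The "backup plan" idea can be sketched by representing the drafts as a small tree and letting verification follow whichever branch the target model agrees with. Again, the names and data structure here are invented for illustration and are not EAGLE's actual API:

```python
# Toy sketch of tree-shaped drafting: several alternatives per step,
# represented as nested dicts. Illustrative only.

def verify_tree(context, tree, target_next_token):
    """Walk the draft tree, following the branch the target agrees with."""
    accepted = []
    node = tree
    while node:
        truth = target_next_token(context + accepted)
        if truth in node:            # a backup branch matched
            accepted.append(truth)
            node = node[truth]       # descend into that branch's subtree
        else:                        # no branch matched: stop here
            accepted.append(truth)
            break
    return accepted

# Draft tree: "salt" OR "pepper" first; if "pepper", then "garlic".
tree = {"salt": {}, "pepper": {"garlic": {}}}
truth_seq = ["pepper", "garlic", "butter"]
target = lambda ctx: truth_seq[len(ctx)]

print(verify_tree([], tree, target))
# "salt" is rejected, but the "pepper" backup branch is accepted,
# followed by "garlic"
```

A chain drafter that had guessed only "salt" would have gained nothing here; the tree's second branch salvages the step.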
But here's the catch: The previous methods (EAGLE-2/3) were like a sous-chef who just kept branching out forever because they thought "more guesses = better." They didn't care about the kitchen's reality.
- If the kitchen is small (a weak GPU), making too many branches causes a traffic jam.
- If you are cooking for 100 tables at once (Batch Size), the sous-chef gets overwhelmed trying to manage too many branches, and the whole line slows down.
The Solution: CAST (The Smart Kitchen Manager)
The paper introduces CAST (Cost-Aware Speculative Tree). Think of CAST not just as a sous-chef, but as a Smart Kitchen Manager who understands the economics of the kitchen.
CAST asks two critical questions before the sous-chef starts guessing:
- "How expensive is it to check these guesses?" (This depends on your specific computer hardware/GPU).
- "How many people are we cooking for?" (This is the Batch Size).
The Analogy of the "Diminishing Returns"
Imagine the sous-chef is trying to guess the next 10 words.
- Guesses 1–3: High confidence. The master chef will likely say "Yes." Great!
- Guesses 4–6: Okay confidence. Maybe the master chef says "Yes."
- Guesses 7–10: Low confidence. The master chef will likely say "No."
In the old methods, the sous-chef would waste energy generating guesses 7–10 just to have them rejected. It's like the sous-chef running to the pantry to fetch a spice the master chef will definitely throw away. That running takes time (inference cost).
CAST's Strategy:
CAST looks at the "cost" of those extra guesses. It realizes that after a certain point, the time spent generating and verifying the extra guesses exceeds the time they save by being ready in advance.
- CAST says: "Stop! We have enough branches. If we add more, we'll actually slow down the whole kitchen because the GPU is getting crowded."
- It dynamically shrinks or expands the tree based on how busy the kitchen is. If the kitchen is huge (large batch), it keeps the tree smaller to avoid traffic. If the kitchen is empty (small batch), it lets the tree grow bigger to grab every possible speed boost.
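The trade-off above can be made concrete with a minimal cost model in the spirit of CAST. Everything here is invented for illustration (the acceptance probability, the cost function, and the constants are not from the paper): expected accepted tokens grow with diminishing returns as the tree grows, while verification cost grows with tree size, and grows faster at larger batch sizes, so the best tree size shrinks as the kitchen gets busier.

```python
# Illustrative cost model, NOT CAST's actual one: pick the draft-tree
# size that maximizes expected tokens per unit of verification time.

def expected_accepted(n_drafts, p=0.6):
    """Expected accepted tokens: diminishing returns as drafts pile up."""
    return 1 + sum(p ** k for k in range(1, n_drafts + 1))  # +1 bonus token

def verify_time(n_drafts, batch_size, base=1.0, per_token=0.02):
    """Verification cost grows with tree size, faster at large batches."""
    return base + per_token * n_drafts * batch_size

def best_tree_size(batch_size, max_drafts=16):
    rates = {n: expected_accepted(n) / verify_time(n, batch_size)
             for n in range(1, max_drafts + 1)}
    return max(rates, key=rates.get)

for bs in (1, 8, 64):
    # busier kitchens (bigger batches) favor smaller trees
    print(f"batch={bs:3d} -> best tree size {best_tree_size(bs)}")
```

Under these toy constants, the optimal tree shrinks monotonically as the batch grows, which is exactly the "shrink when the kitchen is crowded" behavior described above.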
Why This Matters (The Results)
The researchers tested this "Smart Manager" in six different scenarios (like writing code, solving math, or chatting) using six different chefs (various AI models).
- The Result: CAST was consistently faster than the previous best methods.
- The Speedup: In some cases, it was 5.2 times faster than standard decoding (the old, slow way of cooking). Compared to the previous best "smart" method (EAGLE-3), it was still 5% to 20% faster.
In a Nutshell
Previous AI speed-up methods were like a driver who floors the gas pedal no matter what, ignoring traffic jams or road conditions. They thought "more speed = better."
CAST is the driver who checks the GPS, sees the traffic (hardware limits), and adjusts their speed and route dynamically. It knows exactly when to push the gas and when to coast, ensuring the car (the AI) gets to the destination (the answer) as fast as possible without crashing or getting stuck in a jam.
The Code: If you want to see this "Smart Manager" in action, the authors have made their code available on GitHub (linked in the paper), allowing others to build faster, more efficient AI systems.