Imagine you are running a busy coffee shop (the Large Language Model or LLM). Your goal is to serve as many customers (requests) as possible as quickly as possible.
The Problem: The "One-at-a-Time" Bottleneck
Normally, your barista makes coffee one cup at a time. They grind beans, brew, and pour, then wait for the next order. This is how AI models usually work: they generate text one word (token) at a time, and each word can only start once the previous one is finished. It's slow, and your expensive espresso machine (the GPU) spends a lot of time just waiting for ingredients (data) to arrive.
The Old Solution: The "Speedy Assistant" (Speculative Decoding)
To fix this, you hire a fast, cheap intern (the Draft Model) to guess the next few words before the master barista even finishes the current one.
- The Intern guesses: "The cat sat on the..." -> "mat", "rug", "sofa".
- The Master checks: The master barista verifies all the guesses at once in a single pass, which is far cheaper than brewing each cup separately. If the guesses are right, great! You serve three cups of coffee in the time it usually takes to serve one.
- The Catch: If the intern guesses wrong, the master has to throw away the wrong guesses and start over. This "checking" takes time and energy.
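The guess-then-check loop above can be sketched in a few lines. This is a toy illustration, not the paper's code: the "models" here are simple arithmetic rules (a real draft model is a small LLM, and real verification compares probability distributions rather than exact tokens).

```python
def draft_model(prefix, k):
    # Toy "intern": guesses the next token correctly for the first two
    # positions, then guesses badly (a real draft model is a small LLM).
    return [(prefix[-1] + i + 1) % 100 if i < 2 else -1 for i in range(k)]

def target_model(prefix):
    # Toy "master barista": the true next token is just last + 1.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k):
    """One draft-then-verify round: propose k tokens, keep the longest
    verified prefix, and always emit at least one token from the target."""
    guesses = draft_model(prefix, k)
    accepted = []
    for g in guesses:
        truth = target_model(prefix + accepted)
        if g == truth:
            accepted.append(g)       # guess verified: served "for free"
        else:
            accepted.append(truth)   # wrong guess: substitute and stop
            break
    else:
        # Every guess was right; the verification pass yields one more.
        accepted.append(target_model(prefix + accepted))
    return prefix + accepted
```

With `k = 4`, one verification round here emits three tokens instead of one, which is exactly the "three cups in the time of one" payoff; when a guess is wrong, everything after it is thrown away.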
The Dilemma:
- When the shop is empty (Low Load): The intern is a hero. The master has time to check the guesses, and you get a huge speed boost.
- When the shop is packed (High Load): The master is already running at full speed. The intern's guesses now become a distraction: the time spent checking the intern's work slows the master down. Plus, the intern needs their own little workspace (GPU memory), which takes up space that could otherwise seat more waiting customers (the KV Cache that stores each request's context).
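This dilemma can be made concrete with a back-of-the-envelope model (my own illustration with made-up constants, not the paper's analysis): if each draft token is accepted with probability `alpha`, a draft of length `k` yields `1 + alpha + ... + alpha^k` expected tokens per verification round, while the cost of that round grows with how congested the GPU already is.

```python
def expected_tokens(alpha, k):
    # Expected tokens emitted per verification round when each draft
    # token is independently accepted with probability alpha: the
    # accepted prefix plus the one token the target always produces.
    return sum(alpha**i for i in range(k + 1))

def tokens_per_second(alpha, k, t_target, t_draft, load_factor):
    # Illustrative cost model: k cheap draft steps plus one verification
    # whose cost grows with k when the GPU is congested. load_factor
    # near 0 means an idle GPU (extra tokens verify almost for free);
    # a large load_factor means a packed shop.
    step_time = k * t_draft + t_target * (1 + load_factor * k)
    return expected_tokens(alpha, k) / step_time
```

With these toy numbers, a draft length of 4 beats no drafting at all when the GPU is idle and loses to it when the GPU is congested, which is precisely the trade-off Nightjar's manager has to navigate.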
The New Solution: Nightjar (The Smart Manager)
The paper introduces Nightjar, a smart manager that watches the coffee shop and makes real-time decisions. It doesn't just blindly use the intern; it adapts to the crowd.
Nightjar does two main things:
1. The "Traffic Light" System (Dynamic Length Selection)
Nightjar uses a clever learning system (a Multi-Armed Bandit: imagine a row of slot machines where you keep pulling different levers to learn which one pays off best) to decide:
- Is it worth using the intern?
- If yes, how many words should the intern guess? (1 word? 5 words?)
- If no, should we fire the intern for now?
If the shop is empty, Nightjar tells the intern to guess 5 words ahead. If the shop is packed, Nightjar says, "Stop guessing! Just let the master work alone." This prevents the "checking" overhead from slowing things down when the system is already stressed.
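A minimal sketch of this "traffic light" idea, assuming an epsilon-greedy bandit that keeps separate statistics per load level (the paper's actual bandit algorithm, reward signal, arm values, and bucket names may all differ; everything below is illustrative):

```python
import random

class DraftLengthBandit:
    """Epsilon-greedy bandit: each arm is a candidate draft length,
    including 0 ("fire the intern"), learned separately per load bucket."""

    def __init__(self, arms=(0, 1, 3, 5), epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.stats = {}  # (load_bucket, arm) -> [reward_sum, pulls]

    def choose(self, load_bucket):
        if random.random() < self.epsilon:
            return random.choice(self.arms)  # explore: try a random lever
        # Exploit: pull the lever with the best observed average payoff.
        return max(self.arms, key=lambda a: self._mean(load_bucket, a))

    def update(self, load_bucket, arm, reward):
        # Reward would be something like observed tokens per second.
        s = self.stats.setdefault((load_bucket, arm), [0.0, 0])
        s[0] += reward
        s[1] += 1

    def _mean(self, load_bucket, arm):
        s = self.stats.get((load_bucket, arm))
        # Unseen arms look infinitely good, so each gets tried at least once.
        return float("inf") if s is None else s[0] / s[1]
```

Arm 0 is what makes this a traffic light rather than just a tuner: under high load the bandit can learn that disabling speculation entirely pays best.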
2. The "Hot Desk" System (Elastic Memory Management)
Here is the cleverest part. The intern needs a desk (GPU memory) to work. The waiting customers also need space to sit (KV Cache memory).
- When the shop is crowded: Nightjar kicks the intern out of the building (offloads the draft model to the CPU) and gives their desk to the waiting customers. This allows you to serve more people at once.
- When the shop empties out: Nightjar quietly brings the intern back, sets up their desk, and starts guessing again to speed things up.
Most other systems keep the intern's desk reserved even when the intern isn't working, wasting valuable space. Nightjar dynamically reclaims that space.
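The "hot desk" policy can be sketched as a toy block pool with hysteresis (all names, numbers, and thresholds below are my own invention for illustration, not Nightjar's API):

```python
class ElasticPool:
    """Toy model of elastic memory: a fixed budget of GPU blocks shared
    between the KV cache and the draft model, with the draft model
    evicted to the CPU when the cache is under pressure."""

    def __init__(self, total_blocks=100, draft_blocks=20,
                 high_water=0.8, low_water=0.5):
        self.total = total_blocks
        self.draft_blocks = draft_blocks
        self.draft_on_gpu = True
        self.high_water, self.low_water = high_water, low_water

    def kv_capacity(self):
        # Blocks available for requests ("seats for customers"): the
        # intern's desk is reclaimed whenever the intern is offloaded.
        return self.total - (self.draft_blocks if self.draft_on_gpu else 0)

    def tick(self, kv_blocks_in_use):
        usage = kv_blocks_in_use / self.kv_capacity()
        if self.draft_on_gpu and usage > self.high_water:
            self.draft_on_gpu = False  # crowded: offload draft to CPU
        elif not self.draft_on_gpu and usage < self.low_water:
            self.draft_on_gpu = True   # quiet again: bring the draft back
        return self.draft_on_gpu
```

The two separate thresholds matter: with a single threshold, load hovering right at the line would bounce the draft model between CPU and GPU on every tick, and each move has a real transfer cost.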
The Result
By being a "smart manager" that knows when to use the intern and when to kick them out to save space, Nightjar achieves:
- 27% more customers served per hour (Throughput).
- 20% faster service for the customers (Latency).
In a Nutshell
Think of Nightjar as a traffic cop for AI.
- When traffic is light, it opens all lanes and lets the "guessing" cars speed ahead.
- When traffic is heavy, it closes the "guessing" lanes to clear space for more cars, preventing a gridlock.
- It constantly watches the road and changes the rules instantly to keep everything flowing smoothly.
This ensures that the expensive AI hardware is always working at its peak efficiency, whether the demand is low or overwhelming.