Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading

The paper proposes MCCom, a model-cascading framework that balances code completion latency and accuracy: a lightweight local SLM serves completions by default, and a cloud-based LLM is triggered only when the local model fails. By using user actions as a failure signal and speculative decoding to accelerate the cloud model, MCCom significantly reduces inference time and cloud costs while improving completion quality.

Hanzhen Lu, Lishui Fan, Jiachi Chen, Qiuyuan Chen, Zhao Wei, Zhongxin Liu

Published Mon, 09 Ma

Imagine you are writing a novel on your computer, and you have a helpful assistant sitting right next to you, trying to guess the next word you'll type. This is what code completion does for programmers.

However, there's a classic problem with these assistants:

  • The "Super-Genius" Assistant: This assistant (a massive AI model running in the cloud) knows almost everything. It can write perfect sentences. But, it lives far away. Every time you ask it a question, it takes time to think and send the answer back. If it's too slow, you get annoyed and stop using it.
  • The "Quick-Thinker" Assistant: This assistant (a small AI model on your laptop) is right there. It answers instantly. But, it's not very smart. It often guesses wrong, giving you suggestions that don't make sense.

For a long time, developers had to choose: speed (with bad suggestions) or accuracy (with slow responses).

The Solution: The "Local-Cloud Relay Team" (MCCom)

The authors of this paper came up with a brilliant idea behind MCCom: Why not use both?

They created a system that acts like a relay race team. Here is how it works, using a simple analogy:

1. The "Default" Player: The Local Model

When you start typing a line of code, the system first asks the Local Model (the Quick-Thinker on your laptop).

  • Why? Because it's super fast. It can guess the next few words almost instantly.
  • The Trick: The system doesn't just blindly trust it. It checks the Local Model's "confidence." If the Local Model is 90% sure it's right, it just gives you the suggestion. If you accept it, great! You saved time.
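The confidence check above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact scoring rule: the averaged per-token probability and the 0.9 threshold are assumptions made here for clarity.

```python
def confidence(token_probs):
    """Average per-token probability as a simple confidence score.

    This averaging rule is an illustrative assumption; the paper may use
    a different confidence estimate.
    """
    return sum(token_probs) / len(token_probs)


def complete_locally(suggestion, token_probs, threshold=0.9):
    """Return the local model's suggestion only when it is confident enough.

    Returning None signals that the cloud model should take over instead.
    """
    if confidence(token_probs) >= threshold:
        return suggestion  # Confident: show the local suggestion right away.
    return None            # Unsure: escalate to the cloud model.
```

With a confident draft (`complete_locally("x += 1", [0.95, 0.92, 0.94])`) the local suggestion is shown immediately; with shaky probabilities the function returns `None` and the request falls through to the cloud.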

2. The "Safety Net": The Cloud Model

What if the Local Model is unsure, or you type over its suggestion because it was wrong?

  • The Signal: The system watches your behavior. If you ignore the suggestion and keep typing, it knows, "Oh, the local model messed up."
  • The Escalation: Instead of wasting time asking the Cloud Model for every single word, it only calls the Cloud Model (the Super-Genius) when the Local Model fails or seems unsure. This saves a massive amount of time and money.
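That escalation rule can be sketched as a pair of tiny helpers. Both function names and the "typed text diverges from the pending suggestion" heuristic are hypothetical simplifications of the user-action signal described above:

```python
def user_rejected(suggestion, typed):
    """Infer rejection from user behaviour: if what the user actually typed
    is no longer a prefix of the pending suggestion, treat the local model
    as having failed."""
    return not suggestion.startswith(typed)


def should_escalate(local_confidence, suggestion, typed, threshold=0.9):
    """Call the cloud LLM only when the local model is unsure or was
    contradicted by the user's keystrokes."""
    return local_confidence < threshold or user_rejected(suggestion, typed)
```

So typing `ret` while `return total` is suggested keeps everything local, whereas typing `pri` (diverging from the suggestion) or a low-confidence draft triggers the cloud call.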

3. The "Magic Shortcuts": Two-Stage Speculative Decoding

This is the paper's secret sauce. Imagine the Cloud Model is a chef cooking a complex meal. Usually, they chop one vegetable, then the next, then the next (one by one). That takes time.

MCCom uses a trick called Speculative Decoding:

  • Step 1: The Local Model quickly writes a "draft" of the sentence.
  • Step 2: The Cloud Model doesn't start from scratch. It looks at the Local Model's draft and says, "Okay, I'll check if this is right."
  • The Result: The Cloud Model can verify many words at once in a single glance, rather than checking them one by one. It's like the Cloud Model is speed-reading the Local Model's work to confirm it, making the whole process incredibly fast.
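The verify step can be illustrated with a toy loop. In real speculative decoding the cloud model scores every draft position in one forward pass; here `target_next_token` is a hypothetical stand-in for the cloud model's greedy choice at each position, and the Python loop merely simulates that single-pass check:

```python
def speculative_verify(draft_tokens, target_next_token):
    """One simplified round of speculative decoding.

    Walk the local model's draft left to right; keep each token the cloud
    model agrees with, and on the first disagreement substitute the cloud
    model's own token and stop.  (A real implementation verifies all
    positions in a single batched forward pass.)
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(accepted)
        if expected == tok:
            accepted.append(tok)       # Cloud agrees: accept the draft token.
        else:
            accepted.append(expected)  # Cloud disagrees: substitute and stop.
            break
    return accepted
```

For example, if the cloud model would greedily produce `def add ( a` and the local draft is `def add ( x`, the first three draft tokens are accepted in one round and the fourth is corrected to `a`.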

4. The "Second Chance": Iterative Retrieval

Sometimes, even the Cloud Model needs a hint.

  • If the Local Model makes a mistake, the system looks at what the Local Model wrote. Even if it was wrong, it might contain a clue (like a specific variable name).
  • The system uses that clue to search the code repository for similar examples and feeds them to the Cloud Model. It's like saying to the Cloud Model, "Hey, the local guy tried to write 'obs_pool' but got it wrong. Here are some examples of how we usually write that in this project. Try again."
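The retrieval idea above can be sketched with simple identifier overlap standing in for the paper's actual retriever. The `identifiers` helper and the overlap-count scoring are assumptions made purely for illustration:

```python
import re


def identifiers(text):
    """Crude identifier extraction (e.g. variable and function names);
    good enough for this sketch."""
    return set(re.findall(r"[A-Za-z_]\w*", text))


def retrieve_examples(failed_draft, repository, top_k=2):
    """Use identifiers from the local model's (wrong) draft as search clues,
    and return the repository snippets that share the most identifiers,
    to be fed to the cloud model as extra context."""
    clues = identifiers(failed_draft)
    scored = [(len(clues & identifiers(s)), s) for s in repository]
    scored = [(n, s) for n, s in scored if n > 0]  # Drop unrelated snippets.
    scored.sort(key=lambda t: -t[0])               # Most overlap first.
    return [s for _, s in scored[:top_k]]
```

Even a wrong draft like `obs_pool.append(x)` still mentions `obs_pool`, which is enough to pull the project's real `obs_pool` usages out of the repository and hand them to the cloud model.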

Why This Matters

The paper tested this system and found amazing results:

  • Speed: It's up to 48% faster than just using the slow, smart Cloud Model alone.
  • Accuracy: It's actually 9% more accurate than just using the Cloud Model alone! (Yes, you read that right. By using the Local Model to filter out easy cases and the Cloud Model to fix the hard ones, the final result is better).
  • Cost: It reduces the need to use the expensive Cloud Model by nearly 50%.

The Big Picture

Think of MCCom as a smart manager.

  • It lets the junior employee (Local Model) handle the easy, routine tasks instantly.
  • It only calls the senior expert (Cloud Model) when the junior employee is stuck or makes a mistake.
  • And when the senior expert does show up, the manager gives them a head start with notes from the junior employee, so the expert can finish the job faster.

This approach solves the age-old trade-off between speed and quality, making coding feel seamless and uninterrupted, just like a human typing naturally.