RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Imagine you run a busy restaurant with a large roster of chefs, ranging from a quick, affordable food truck to a world-famous, Michelin-starred culinary genius.

The Food Truck (Small Model) is cheap and fast but might struggle with a complex dish like "Deconstructed Beef Wellington."
The Michelin Chef (Large Model) can make almost anything, but they are expensive and slow, and they might overthink a simple request like "Make me a grilled cheese," wasting time and money.

In the world of Artificial Intelligence, we have "Reasoning Models" (AI chefs) that can think through problems step-by-step. But just like our restaurant, we face a dilemma: How do we decide which chef to use for which order without wasting money or time?

If you always hire the Michelin chef, you go broke. If you always hire the food truck, you fail the hard orders.

This is exactly the problem the RADAR paper tackles.

What is RADAR?

RADAR stands for Reasoning-Ability and Difficulty-Aware Routing. Think of it as a super-smart, instant maître d' who stands at the front of your restaurant.

When a customer walks in with an order (a question), RADAR doesn't just guess. It instantly analyzes two things:

How hard is the dish? (Is it a grilled cheese or a 10-course tasting menu?)
Who is the best chef for this specific dish? (Do we need the genius, or will the food truck do?)

How Does It Work? (The Magic Behind the Curtain)

The paper uses a clever concept borrowed from psychology and education, called Item Response Theory (IRT). You might know this from standardized tests like the SAT or GRE.

The Old Way: In school, you take a test, and the teacher figures out your "score" based on how many questions you got right.
The RADAR Way: RADAR goes a step further. It looks at the questions (the "items") and the AI models (the "students") together to figure out:
- Question Difficulty: How hard is this specific math problem?
- Model Ability: How good is this specific AI configuration at solving problems of that difficulty?

The "Budget" Twist:
In this paper, the "chefs" aren't just different AI models; each model can also run with different settings, and every model-plus-setting combination counts as its own chef.

Low Budget: The AI is told, "Think for 5 seconds and give me an answer." (Fast, cheap).
High Budget: The AI is told, "Think for 5 minutes, write a long essay, and then answer." (Slow, expensive).

RADAR learns that for a simple question, a "Low Budget" setting on a small model is perfect. For a complex physics problem, it routes the question to a "High Budget" setting on a giant model.

The "Pareto Front" (The Perfect Balance)

The paper talks about something called the Pareto Front. Imagine a graph where the X-axis is Cost and the Y-axis is Quality.

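Wasteful Strategy: You pay $100 for a quality of 90% on an order that only needed 85%.
RADAR Strategy: You pay $10 for a quality of 85%, and spend more only when the order calls for it.

RADAR looks for the "sweet spot" on the curve: the options where you cannot get more quality without paying more. It aims to ensure you are never paying for more quality than you need, and never skimping on quality when you need it. It's like picking the right balance of price and quality for every single order.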
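For readers who like to see the idea in code, here is a minimal sketch of what "being on the Pareto front" means. The cost and quality numbers are made up for illustration and do not come from the paper; the function simply keeps every option that no other option beats on both cost and quality.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, quality)

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point.

    One point dominates another if it costs no more AND is at least as good,
    and is strictly better on at least one of the two.
    """
    front = []
    for i, (cost_i, qual_i) in enumerate(points):
        dominated = any(
            cost_j <= cost_i and qual_j >= qual_i
            and (cost_j < cost_i or qual_j > qual_i)
            for j, (cost_j, qual_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((cost_i, qual_i))
    return sorted(front)

# Hypothetical (cost in dollars, quality in %) options, invented for this example.
options = [(1.0, 60.0), (8.0, 82.0), (20.0, 80.0), (40.0, 91.0), (60.0, 88.0)]
front = pareto_front(options)
print(front)
```

In this toy menu, the $20 and $60 options drop out because a cheaper option is at least as good. Everything left is a defensible choice, and which one is best depends on how much quality the order needs.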

Why Is This a Big Deal?

It's Fast: RADAR makes its decision in about 7 milliseconds. That's faster than a human can blink. It decides before the AI even starts thinking.
It's Adaptable: If you hire a new chef (a new AI model), RADAR doesn't need to be retrained from scratch. It can test the new chef on just a few sample dishes, figure out their skill level, and immediately start using them correctly.
It Saves Money: The paper shows that on the MATH-500 math benchmark, RADAR can reach 90% of the performance of a top-tier configuration (OpenAI o4-mini with a high reasoning budget) at only about 1.3% of the cost. That's like getting a 5-star meal for the price of a coffee.
It Handles the Unknown: Even if you ask a kind of question the router wasn't trained on (like one about a weird, long document), RADAR is surprisingly good at guessing, "This is hard, let's use the big brain," which makes a bad answer less likely.

The Bottom Line

RADAR is the ultimate traffic controller for AI. Instead of blindly throwing every question at the biggest, most expensive AI (which is wasteful) or the smallest one (which is risky), it acts as a smart router. It matches the difficulty of the question with the right amount of brainpower and budget, saving companies massive amounts of money while keeping performance high.

It turns the chaotic "guess and check" of AI usage into a precise, scientific, and highly efficient operation.

1. Problem Statement

Reasoning Language Models (RLMs) have achieved state-of-the-art performance on complex tasks (math, science, coding) by utilizing Chain-of-Thought (CoT) reasoning. However, practical deployment faces a critical performance-cost trade-off at two levels:

Model Size: Larger models generally perform better but are more expensive and slower.
Reasoning Budget: RLMs often offer configurable "reasoning budgets" (e.g., low, medium, high token limits). Higher budgets improve accuracy but increase latency and cost.

The Challenge: Blindly using the most capable (largest, highest budget) configuration for every query is inefficient. Simple queries can be solved by smaller models with minimal budgets, while complex queries require powerful configurations. Furthermore, "over-thinking" (using high budgets on simple queries) can degrade performance. The goal is to dynamically route each query to the optimal {Model, Reasoning Budget} configuration that maximizes performance while minimizing cost, without requiring access to model weights (black-box setting).

2. Methodology: The RADAR Framework

RADAR (Reasoning-Ability and Difficulty-Aware Routing) addresses this by formulating routing as a Multi-Objective Optimization (MOO) problem, solved via Item Response Theory (IRT).

A. Discretization of Configurations

RADAR treats every combination of a specific RLM and a specific reasoning budget as a distinct "configuration" ($g$).

Example: A Qwen3-8B model with a 4k token budget is a different configuration than Qwen3-8B with an 8k budget.
This discretization allows the router to select from a pool of heterogeneous configurations (e.g., OpenAI o4-mini with varying budgets, Qwen3 models of varying sizes with varying budgets).
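As a rough illustration of the discretization, the pool can be built as a flat list of (model, budget) pairs. This is a sketch, not the authors' code: the `Configuration` class is a hypothetical helper, and the exact budget values are assumptions chosen to resemble the examples in this summary.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Configuration:
    """One routable option: a model paired with a reasoning budget."""
    model: str
    budget: str  # a token limit or an effort level, depending on the provider

# Hypothetical pool; the budget values are illustrative assumptions.
qwen_models = ["Qwen3-0.6B", "Qwen3-8B"]
qwen_budgets = ["0", "4k", "8k", "16k"]
o4_budgets = ["low", "medium", "high"]

pool = [Configuration(m, b) for m, b in product(qwen_models, qwen_budgets)]
pool += [Configuration("o4-mini", b) for b in o4_budgets]

# The same model with two budgets yields two distinct configurations.
print(len(pool), Configuration("Qwen3-8B", "4k") != Configuration("Qwen3-8B", "8k"))
```

Treating each pair as its own item means the router never has to reason about "model" and "budget" separately; it only ranks entries in one list.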

B. Multi-Objective Optimization (MOO) Formulation

The routing decision is framed as finding the configuration $g^*$ that optimizes two objectives for a query $q$:

Performance ($p_q(g)$): Probability of a correct answer.
Cost ($c_q(g)$): Normalized cost (tokens $\times$ price per token).

The goal is to find $g^*$ on the Pareto front of the performance-cost trade-off curve. RADAR solves this using Scalarization:

Linear Scalarization: Maximizes $w_1 \cdot p_q(g) - (1-w_1) \cdot c_q(g)$, where the weight $w_1 \in [0, 1]$ sets how much performance matters relative to cost.
Chebyshev Scalarization: Minimizes the maximum weighted distance from an ideal point (perfect performance, zero cost). The authors find Chebyshev scalarization superior for Out-of-Distribution (OOD) queries as it can handle non-convex Pareto fronts.
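The difference between the two scalarizations is easiest to see on a small example. The sketch below assumes the router already has a predicted performance and a normalized cost for each configuration. The configuration names and numbers are invented, and weighting the Chebyshev gaps by $w_1$ and $1 - w_1$ is an assumption made here for symmetry with the linear form.

```python
from typing import Dict, Tuple

# config name -> (predicted performance p in [0, 1], normalized cost c in [0, 1])
Predictions = Dict[str, Tuple[float, float]]

def route_linear(preds: Predictions, w1: float) -> str:
    """Pick the configuration maximizing w1 * p - (1 - w1) * c."""
    return max(preds, key=lambda g: w1 * preds[g][0] - (1.0 - w1) * preds[g][1])

def route_chebyshev(preds: Predictions, w1: float) -> str:
    """Pick the configuration minimizing the largest weighted gap
    to the ideal point (p = 1, c = 0)."""
    def worst_gap(g: str) -> float:
        p, c = preds[g]
        return max(w1 * (1.0 - p), (1.0 - w1) * c)
    return min(preds, key=worst_gap)

# Hypothetical predictions for a single query. The middle option is on the
# Pareto front but sits in a non-convex "dent" of it.
preds = {
    "small-low": (0.40, 0.01),
    "mid-medium": (0.60, 0.40),
    "large-high": (0.95, 1.00),
}

print(route_linear(preds, 0.5), route_chebyshev(preds, 0.5))
```

With these numbers, no value of `w1` makes the linear rule choose `mid-medium`, because that point lies below the straight line joining the other two. The Chebyshev rule can reach it, which is the property the text attributes to it on non-convex fronts.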

C. IRT-Based Calibration

To estimate $p_q(g)$ without running the model, RADAR uses a Two-Parameter Logistic (2PL) Item Response Theory model:

Query Difficulty ($b_j$) & Discrimination ($a_j$): Derived from query embeddings using learnable linear transformations. This allows generalization to unseen (OOD) queries.
Model Ability ($\theta_i$): A scalar value representing the capability of configuration $g_i$.
Probability Model: $P(\text{correct}) = \sigma(a_j(\theta_i - b_j))$, where $\sigma$ is the logistic sigmoid.
Training: The model is trained on a binary response matrix (correct/incorrect) from a calibration set of queries.
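To make the prediction step concrete, here is a minimal NumPy sketch of the 2PL formula applied to one query and several configurations. The weight vectors `w_a` and `w_b` stand in for the learned linear transformations and are random here. Passing the discrimination through a softplus to keep it positive is an assumption of this sketch, not something the summary states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_correct(embedding, w_a, w_b, thetas):
    """2PL probability that each configuration answers the query correctly."""
    # Discrimination a_j: linear map of the embedding, then softplus (assumption).
    a = np.log1p(np.exp(embedding @ w_a))
    # Difficulty b_j: linear map of the embedding.
    b = embedding @ w_b
    # P(correct) = sigma(a_j * (theta_i - b_j)), one value per configuration.
    return sigmoid(a * (thetas - b))

rng = np.random.default_rng(0)
dim = 16                                  # hypothetical embedding size
embedding = rng.normal(size=dim)          # stand-in for a real query embedding
w_a = rng.normal(size=dim) * 0.1          # stand-in for learned weights
w_b = rng.normal(size=dim) * 0.1
thetas = np.array([-1.0, 0.0, 2.0])       # hypothetical abilities, weakest to strongest

probs = predict_correct(embedding, w_a, w_b, thetas)
print(probs)
```

Because the per-configuration state is a single scalar, scoring every configuration for a query costs one embedding lookup and a handful of arithmetic operations, which is consistent with the low routing latency reported later.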

D. Adaptive Testing for New Models

To integrate a new RLM configuration without retraining the entire system:

RADAR uses Computerized Adaptive Testing (CAT) principles.
It dynamically selects a small subset of queries (based on Fisher Information) to evaluate the new model.
This allows the system to estimate the new model's scalar ability ($\theta$) rapidly with minimal computational overhead.
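The adaptive-testing loop can be sketched as follows. This is an illustrative procedure under stated assumptions, not the paper's algorithm: it uses the standard 2PL Fisher information $a^2 p (1 - p)$ to pick the next query, refits $\theta$ by gradient ascent on the log-likelihood, and clips $\theta$ to a fixed range to keep early estimates finite. The function `estimate_ability`, the item parameters, and the simulated responses are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fisher_information(theta, a, b):
    """Fisher information of 2PL items at ability theta: a^2 * p * (1 - p)."""
    p = sigmoid(a * (theta - b))
    return a ** 2 * p * (1.0 - p)

def estimate_ability(a, b, answer_fn, n_items=20, lr=0.5, steps=50):
    """Ask n_items adaptively chosen queries and return (theta, asked indices)."""
    theta, asked, responses = 0.0, [], []
    available = np.ones(len(a), dtype=bool)
    for _ in range(n_items):
        # Choose the unasked query that is most informative at the current theta.
        info = np.where(available, fisher_information(theta, a, b), -np.inf)
        j = int(np.argmax(info))
        available[j] = False
        asked.append(j)
        responses.append(float(answer_fn(j)))  # 1.0 if the new model is correct
        idx, y = np.array(asked), np.array(responses)
        # Refit theta on all responses so far (item parameters stay fixed).
        for _ in range(steps):
            p = sigmoid(a[idx] * (theta - b[idx]))
            grad = float(np.mean(a[idx] * (y - p)))
            theta = float(np.clip(theta + lr * grad, -4.0, 4.0))
    return theta, asked

# Simulated calibration pool and a simulated new configuration (true ability 1.0).
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=200)
b = rng.normal(size=200)
true_theta = 1.0
answer_fn = lambda j: rng.random() < sigmoid(a[j] * (true_theta - b[j]))

theta_hat, asked = estimate_ability(a, b, answer_fn)
print(theta_hat, len(asked))
```

Only 20 of the 200 queries are ever run on the new configuration, and the existing difficulty and discrimination parameters are left untouched. That is the sense in which a new model can be added without retraining the whole system.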

3. Key Contributions

Novel Problem Formulation: Casts adaptive reasoning as routing over discretized {Model, Budget} configurations, optimizing via MOO rather than simple regression.
Interpretable IRT Adaptation: Adapts 2PL-IRT to learn scalar model abilities and query difficulties, enabling low-latency routing and interpretability (unlike opaque neural routers).
Plug-and-Play Scalability: Introduces an adaptive testing mechanism to estimate the ability of new models using only a small, dynamically selected subset of queries, avoiding full retraining.
Superior Performance: Demonstrates state-of-the-art results across 8 benchmarks, achieving strong Pareto-optimal trade-offs and robust generalization to OOD queries (including long-context tasks).

4. Experimental Results

The authors evaluated RADAR on 8 challenging reasoning benchmarks (MATH-500, AIME, GPQA-Diamond, LSAT, MMLU variants, FRAMES).

Performance-Cost Trade-off:
- On MATH-500, RADAR matches 90% of the performance of OpenAI o4-mini (high budget) at only 1.31% of the cost.
- On GPQA-Diamond, RADAR outperforms the second-best baseline by 8% in hypervolume (area under the performance-cost curve).
- RADAR consistently outperforms baselines like RouterBench, IRT-Router, and heuristic methods (Random, All-Large, All-Small).
Generalization (OOD):
- RADAR maintains strong performance on OOD queries (e.g., long-context multi-document QA in FRAMES), despite being trained primarily on shorter queries.
- It successfully generalizes to new model configurations (e.g., adding Qwen3-14B) using the adaptive testing module, improving routing performance immediately.
Efficiency & Latency:
- Latency Overhead: The routing decision adds negligible latency (~7ms per query), which is insignificant compared to the inference time of even the smallest RLM (~870ms).
- Throughput: The routing overhead reduces throughput by less than 1%.
Interpretability:
- The estimated query difficulties correlate moderately (Pearson $r \approx 0.51$) with ground-truth difficulty levels.
- The scalar ability parameters provide a clear, interpretable ordering of model configurations (e.g., Qwen3-8B with 16k budget > Qwen3-0.6B with 0 budget).
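The hypervolume figure quoted in the performance-cost results can be read as the area a router's operating points cover in the cost-performance plane. The sketch below computes that area for a two-objective case. The reference point (maximum cost, zero performance), the assumption that costs are normalized to [0, 1], and the two example curves are assumptions made for illustration and are not results from the paper.

```python
from typing import List, Tuple

def hypervolume_2d(points: List[Tuple[float, float]], max_cost: float = 1.0) -> float:
    """Area dominated by (cost, performance) points, measured against the
    reference point (max_cost, 0). Costs are assumed to lie in [0, max_cost]."""
    pts = sorted(points)  # increasing cost
    best_perf = 0.0
    area = 0.0
    for k, (cost, perf) in enumerate(pts):
        # Best performance reachable at or below this cost.
        best_perf = max(best_perf, perf)
        next_cost = pts[k + 1][0] if k + 1 < len(pts) else max_cost
        # Add the strip between this cost and the next one.
        area += best_perf * (next_cost - cost)
    return area

# Two made-up routers described by their (normalized cost, accuracy) operating points.
router_a = [(0.01, 0.60), (0.10, 0.90), (1.00, 0.95)]
router_b = [(0.01, 0.50), (0.50, 0.80), (1.00, 0.95)]

hv_a = hypervolume_2d(router_a)
hv_b = hypervolume_2d(router_b)
print(hv_a > hv_b)
```

A larger area means the router reaches high accuracy at lower cost across the whole range of budgets, which is why a single hypervolume number is a convenient way to compare trade-off curves.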

5. Significance

RADAR changes how reasoning models can be deployed in several ways:

From Static to Adaptive: It moves away from static model selection to dynamic, query-aware configuration routing.
Black-Box Friendly: It operates entirely in a black-box setting, making it applicable to proprietary APIs (like OpenAI) and open-source models alike without requiring weight access.
Cost Efficiency: It offers a principled way to drastically reduce inference costs for RLMs by avoiding "over-thinking" on simple tasks while ensuring high performance on complex ones.
Scalability: The adaptive testing mechanism ensures the routing framework can evolve rapidly as new, more capable models are released.

In summary, RADAR provides a lightweight, interpretable, and highly effective framework for optimizing the performance-cost trade-off in the era of complex reasoning LLMs.