Predicting LLM Reasoning Performance with Small Proxy Models

The paper introduces rBridge, a method that enables small proxy models (≤1B) to effectively predict the reasoning performance of much larger language models (up to 32B) by aligning pre-training objectives with task-specific reasoning traces, thereby significantly reducing the cost of dataset optimization for emergent reasoning capabilities.

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jamin Shin

Published 2026-02-27

Imagine you are a chef trying to create the world's most delicious, complex dish (a Large Language Model). To do this, you need a massive kitchen, tons of expensive ingredients, and months of cooking time. But before you commit to cooking the giant dish, you want to know: Which recipe ingredients will actually make it taste good?

Traditionally, to test a new recipe, you'd have to cook the full giant dish every time. That's incredibly expensive and slow. So, chefs usually try to cook a tiny "taster" version (a Small Proxy Model) to guess how the big dish will turn out.

The Problem:
For simple dishes (like making toast), the tiny taster works great. If the tiny toast is burnt, the big toast will be burnt too.

But for Reasoning (like solving a complex math problem or writing a clever story), the tiny taster fails miserably. It's like trying to predict how a 10-year-old chess player will do in the World Championship by watching a 3-year-old play. The 3-year-old doesn't just play "badly"; they play in a completely different, chaotic way. The tiny model gets confused, makes random guesses, and gives you the wrong signal about whether the big model will succeed.

The Solution: rBridge
The authors of this paper built a new tool called rBridge. Think of it as a "Magic Translator" that helps the tiny taster understand the big chef's mind.

Here is how rBridge works, using simple analogies:

1. The "Gold Standard" Guide (The Frontier Model)

Instead of just asking the tiny model, "What do you think?", rBridge first asks a Super-Expert Chef (a massive, state-of-the-art AI like GPT-4) to solve the problem and write down their step-by-step thought process.

  • Old Way: The tiny model just guesses the final answer.
  • rBridge Way: The tiny model is shown the Super-Expert's detailed notes (the "Reasoning Trace") and asked, "How well does your cooking match these notes?"
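In rough code terms, that "how well does your cooking match these notes?" check is a likelihood measurement: the proxy model is scored by how probable it finds each token of the expert's reasoning trace. A minimal Python sketch under that reading (the probability numbers and function names are illustrative, not the paper's actual API):

```python
import math

def trace_nll(token_probs):
    """Average negative log-likelihood of an expert reasoning trace
    under the proxy model (lower = proxy tracks the expert better)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy numbers: the proxy's probability for each token of the
# expert's trace, after pre-training on two candidate datasets.
probs_after_dataset_a = [0.40, 0.35, 0.90, 0.30]
probs_after_dataset_b = [0.10, 0.08, 0.85, 0.05]

# Dataset A makes the expert's reasoning more likely under the
# proxy (lower NLL), so it is predicted to scale better.
a_better = trace_nll(probs_after_dataset_a) < trace_nll(probs_after_dataset_b)  # True
```

The key shift is that the tiny model is never asked to solve the problem itself; it only has to recognize good reasoning when it sees it, which is a much easier signal to extract from a small model.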

2. The "Highlighter" Pen (Token Weighting)

Even with the expert's notes, the tiny model might get distracted by boring stuff.

  • Imagine the expert's notes say: "First, I need to add salt. Then, I need to stir. Finally, I need to serve."
  • The words "First," "Then," and "Finally" are just formatting. They aren't the real cooking.
  • rBridge acts like a smart highlighter. It looks at the expert's notes and realizes: "'Stirring' and 'Adding salt' are the critical steps. The word 'Then' is just a connector."
  • It tells the tiny model: "Ignore the boring words. Focus your energy on the critical steps where the expert was confident."
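One way to picture the highlighter is as a per-token weight on the trace score: tokens where the expert was confident count heavily, while connectors barely count. A hedged Python sketch (this particular weighting formula is illustrative, not necessarily the paper's exact one):

```python
import math

def weighted_trace_nll(proxy_probs, expert_confidence):
    """Token-weighted NLL: each token's loss counts in proportion to
    how confident the expert was when writing it, so connector words
    contribute almost nothing to the score."""
    weighted = [-c * math.log(p) for p, c in zip(proxy_probs, expert_confidence)]
    return sum(weighted) / sum(expert_confidence)

# Trace: ["add", "salt", "then", "stir"] -- the expert is confident
# on the real cooking steps and near-indifferent on "then".
proxy_probs       = [0.30, 0.25, 0.95, 0.20]
expert_confidence = [0.90, 0.90, 0.10, 0.90]

score = weighted_trace_nll(proxy_probs, expert_confidence)
```

With uniform weights this reduces to the plain average loss; the non-uniform weights are what keep the tiny model's score from being dominated by easy filler tokens.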

3. The Result: A Crystal Ball

By combining the expert's step-by-step notes with this "highlighter" focus, the tiny model suddenly becomes incredibly accurate at predicting the big model's performance.
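The "crystal ball" is ultimately just a fitted mapping: once you have proxy scores and true large-model results for a handful of reference datasets, you can fit a simple curve and read off predictions for new candidates without any large training runs. A toy Python sketch using a plain least-squares line (all numbers hypothetical; the paper's actual fitting procedure may differ):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a stand-in for
    whatever fitting procedure the paper actually uses)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical calibration runs: proxy trace-NLL vs. the accuracy
# the big model actually reached on a few reference datasets.
proxy_nll = [1.8, 1.5, 1.2, 0.9]
big_model_acc = [0.35, 0.45, 0.55, 0.65]

a, b = fit_line(proxy_nll, big_model_acc)

# Predict the big model's accuracy for a new candidate dataset
# from its cheap proxy score alone -- no large-scale training run.
predicted_acc = a * 1.0 + b
```

The expensive large-model runs happen only once, to calibrate the line; after that, every new dataset is evaluated at proxy cost.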

Why is this a big deal?

  • Massive Cost Savings: Instead of spending $50,000 to train a big model just to test one idea, you can use rBridge on a tiny model for pennies. The paper reports compute savings of over 100×.
  • It Works on Hard Stuff: It works even for the hardest tasks (math, science, coding) where tiny models usually fail.
  • It's a "One-Time" Setup: You only need to ask the Super-Expert to write the notes once. After that, you can use your tiny, cheap model to test thousands of different recipes instantly.

The Bottom Line

rBridge is like giving a small, cheap car a GPS system that connects directly to a supercomputer's map. Even though the car is small, it can now navigate complex terrain perfectly because it's following the right path, highlighted by an expert. This allows researchers to experiment with AI recipes much faster, cheaper, and smarter than ever before.
