Eliciting Numerical Predictive Distributions of LLMs Without Autoregression

This paper demonstrates that statistical functionals of Large Language Models' numerical predictive distributions, including uncertainty, can be efficiently recovered from internal representations using regression probes, offering a lightweight alternative to computationally expensive autoregressive sampling.

Julianna Piskorz, Katarzyna Kobalczyk, Mihaela van der Schaar

Published 2026-03-04

Imagine you have a super-smart, all-knowing librarian (the Large Language Model, or LLM) who has read every book in the world. You ask this librarian to predict the weather for tomorrow.

Usually, when you ask an LLM for a number, it acts like a very slow, meticulous scribe. It doesn't just "know" the number; it has to write it down one digit at a time. If the answer is "1,234.56," the librarian has to think: "Okay, first I'll write '1', then I'll think about '2', then '3'..." It has to generate every single digit sequentially. This is called autoregressive generation.

If you want to know not just what the weather will be, but how sure the librarian is (e.g., "It might be 10 degrees, or maybe 12, or maybe 8"), you have to ask the librarian to write the answer 100 different times to see the spread of possibilities. This is incredibly slow and expensive, like asking a scribe to write the same book 100 times just to check if the spelling is consistent.
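
This "write it 100 times" baseline can be sketched in a few lines. The sketch below is purely illustrative: `sample_llm_number` is a hypothetical stand-in (here, a random draw) for one full, slow autoregressive generation; the point is that estimating the spread requires many expensive calls.

```python
import random
import statistics

# Hypothetical stand-in for ONE autoregressive generation: in reality,
# each call would decode a full digit sequence, which is slow and costly.
def sample_llm_number(rng: random.Random) -> float:
    return rng.gauss(10.0, 1.5)  # pretend the model's answer centres on 10 degrees

rng = random.Random(0)
samples = [sample_llm_number(rng) for _ in range(100)]  # 100 expensive calls

mean = statistics.mean(samples)    # "it's probably about 10"
spread = statistics.stdev(samples)  # "...give or take a degree or so"
print(f"mean ~ {mean:.2f}, spread ~ {spread:.2f}")
```

The probe approach described next replaces all 100 of those calls with a single read of the model's internal state.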

The Big Discovery: The "Brain" Knows Before the "Hand" Writes

This paper asks a fascinating question: Does the librarian's brain actually know the answer before the hand starts writing the first letter?

The researchers discovered that yes, it does.

They found that the internal "thoughts" of the LLM (its hidden states) already contain the full picture of the number it intends to generate, including the uncertainty, long before it starts typing out the digits.

The Analogy: The Architect vs. The Bricklayer

Think of the LLM as a construction project:

  • The Autoregressive Process (The Bricklayer): This is the slow part where the machine lays bricks one by one to build a wall. To get a wall 100 feet high, it takes a long time.
  • The Internal Representation (The Architect): This is the blueprint hidden inside the machine's mind. The blueprint already shows the entire wall, its height, its width, and even the probability that a brick might fall off.

The researchers built a special tool called a "Probe" (think of it as an X-ray machine or a decoder ring). Instead of waiting for the bricklayer to finish the wall, they used the X-ray to look at the Architect's blueprint.
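
In machine-learning terms, a probe is typically a small regression model trained to map the LLM's hidden-state vector to the target number. The sketch below uses synthetic data and a simple ridge-regression fit to illustrate the idea; it is not the paper's exact probe architecture, just the general recipe: one linear read-out, no sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 prompts, each with a 64-dim "hidden state".
# We assume (as the paper finds) the answer is decodable from that state.
d, n = 64, 500
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))                      # hidden states
y = H @ w_true + rng.normal(scale=0.1, size=n)   # numeric answers

# A ridge-regression "probe": one closed-form fit over hidden states.
lam = 1e-2
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

pred = H @ w_hat  # one matrix multiply per question, no generation needed
r2 = 1 - np.var(y - pred) / np.var(y)
print("probe R^2:", round(r2, 4))
```

Once trained, the probe answers each new question with a single dot product, which is why it is so much cheaper than repeated generation.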

How They Did It (The "Magnitude-Factorised" Trick)

Predicting numbers is hard for AI because numbers vary wildly in size. A number like "0.0001" is very different from "1,000,000." If you try to teach a student to guess both, they get confused.

The researchers solved this with a clever two-step strategy, which they call Magnitude-Factorisation:

  1. The Magnitude Classifier (The "Order of Magnitude" Guess): First, the probe asks, "Is the answer in the thousands? The millions? Or is it a tiny decimal?" It guesses the scale of the number.
  2. The Value Regressor (The "Fine-Tuning" Guess): Once it knows the scale, it asks, "Okay, if it's in the thousands, is it 1,200 or 1,800?"

By splitting the problem into "How big is it?" and "What is the exact number?", the probe can accurately predict the answer without the LLM ever having to type a single digit.
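
The two-step split mirrors scientific notation: any positive number can be written as a mantissa in [1, 10) times a power of ten. A minimal sketch of that factorisation (the classifier would predict `k`, the regressor would predict `m`; the exact heads in the paper may differ):

```python
import math

def factorise(y: float) -> tuple[int, float]:
    """Split a positive number into (order of magnitude k, mantissa m in [1, 10))."""
    k = math.floor(math.log10(y))
    return k, y / 10 ** k

def reconstruct(k: int, m: float) -> float:
    """Recombine the two heads' predictions into the final number."""
    return m * 10 ** k

# "Is it in the thousands?" -> k = 3; "is it 1,200 or 1,800?" -> m = 1.23456
k, m = factorise(1234.56)
print(k, round(m, 4))                   # 3 1.2346
print(round(reconstruct(k, m), 2))      # 1234.56
```

Because the regressor only ever sees mantissas in a narrow range, it never has to cope with "0.0001" and "1,000,000" on the same scale.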

What They Found

  1. The Blueprint is Complete: The probe could accurately predict the average answer (the mean), the most likely answer (the mode), and even the "middle" answer (the median) just by looking at the LLM's internal state.
  2. Uncertainty is Visible: The probe could also tell you how confident the LLM is. It could predict the range of possible answers (e.g., "It's likely between 10 and 12") without needing to ask the LLM to generate 100 different samples.
  3. Speed and Cost: Because the probe only needs to look at the blueprint once, it is massively faster than the traditional method. It's like reading the architect's plan in 0.03 seconds versus waiting for the bricklayer to build the wall 100 times (which takes seconds or minutes).
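
How can a probe predict a range like "between 10 and 12" directly? A standard tool for this (one plausible mechanism; the paper's exact training objective may differ) is the pinball loss, which is minimised exactly when the prediction equals the chosen quantile. The sketch below verifies that property on synthetic "LLM answers" drawn from a known distribution:

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: float, q: float) -> float:
    """Quantile ("pinball") loss: minimised when y_pred is the q-th quantile."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Synthetic answers centred on 10 with spread 1.5; the true 90th
# percentile of this distribution is about 11.92.
rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.5, size=10_000)

# Grid search for the value minimising the q=0.9 pinball loss.
candidates = np.linspace(5.0, 15.0, 201)
losses = [pinball_loss(y, c, 0.9) for c in candidates]
best = float(candidates[int(np.argmin(losses))])
print("estimated 90th percentile:", round(best, 2))
```

A probe trained with this loss on hidden states learns to emit the interval bounds directly, so the "between 10 and 12" answer costs one forward read instead of 100 generations.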

Why This Matters

This is a game-changer for using AI in real life, especially for things like:

  • Financial forecasting: Where you need to know not just the stock price, but the risk.
  • Medical predictions: Where knowing the uncertainty of a diagnosis is as important as the diagnosis itself.
  • Robotics: Where a robot needs to make quick decisions without waiting for a slow computer to "think" through every possibility.

In short: The paper proves that LLMs don't need to "talk" to give you a number. They already "know" the number and how sure they are about it deep inside their neural networks. We just needed to build a better way to listen to that internal thought process without waiting for them to speak out loud.
