Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you have a brilliant, world-class librarian (the Large Language Model, or LLM) who has read every book in existence. This librarian is amazing at understanding stories, writing poems, and answering questions in full sentences.
But now, you have a different job for them: You need them to give you a single number.
Maybe you want to know:
- How similar are these two sentences? (Score: 0 to 5)
- How good is this machine translation? (Score: 0 to 100)
- How likely is this movie to be a hit? (Score: 0 to 10)
The Problem: The "Wordy" Librarian
The problem is that our librarian is trained to speak in words, not numbers. If you ask them, "How similar are these sentences?", they might try to answer by writing out a number like "4.5".
This is like asking a chef to measure salt by writing the word "salt" on a piece of paper instead of just pinching the right amount. It's inefficient and prone to errors.
- The "Wordy" Approach (Autoregressive Decoding): The librarian writes "4.5". But what if they wrote "4.49" or "4.500"? To a computer, these are completely different token sequences, even though they mean almost the same number. The librarian gets tripped up by formatting that has nothing to do with the actual quantity.
- The "Voting" Approach (Regression-Aware Inference): You ask the librarian to write down 16 different guesses ("4.5", "4.6", "4.4"...), and then you take the average. This is accurate, but it's slow and exhausting for the librarian.
- The "Simple Summary" Approach (Predictive Heads): You tell the librarian, "Don't write anything. Just look at the book and point to a hidden switch that controls the number." Previous methods used a very simple switch (like a single lightbulb) that tried to summarize the whole book into one glow. It often missed the fine details.
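The contrast between the "wordy" and "voting" approaches can be sketched in a few lines of Python. The `sample_score` function below is a hypothetical stand-in for an LLM decoding a numeric answer as text; the point is that strings like "4.50" and "4.500" are different text but identical numbers once parsed, and that averaging many parsed samples smooths out single-guess noise at the cost of many generations.

```python
import random
from statistics import mean

def sample_score(rng):
    """Hypothetical stand-in for an LLM sampling a numeric answer as text.
    Real decoding would return strings like "4.5", "4.49", "4.500"."""
    return f"{rng.gauss(4.5, 0.1):.2f}"

rng = random.Random(0)

# "Wordy" approach: take one decoded string at face value.
single = float(sample_score(rng))

# "Voting" approach (regression-aware inference): decode many times,
# parse each string to a number, and average. More stable, but it
# needs 16 full generations instead of one.
votes = [float(sample_score(rng)) for _ in range(16)]
averaged = mean(votes)

# Different strings, same number -- the formatting problem vanishes
# once you treat the output numerically instead of textually.
assert float("4.50") == float("4.500")
print(round(single, 2), round(averaged, 2))
```

This is only an illustration of the inference strategies, not the paper's code: the Gaussian sampler stands in for whatever distribution the real model's decoded scores follow.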
The Solution: RELISH (The "Iterative Refiner")
The authors of this paper created a new tool called RELISH. Think of RELISH as a specialized, high-tech magnifying glass that sits on top of the librarian's brain.
Here is how it works, using a simple analogy:
1. The "Silent Observer" (Frozen Backbone)
The librarian (the LLM) stays exactly the same. We don't retrain them or change their personality. They are "frozen" because they already know everything they need to know about language.
2. The "Iterative Detective" (Latent Iterative State)
Instead of asking the librarian to write a number, RELISH sends a silent detective (a "latent state") into the librarian's mind.
- Round 1: The detective looks at the first few words of the sentence and forms a rough guess.
- Round 2: The detective goes back, looks at the whole sentence again, and asks, "Wait, did I miss something in the middle? Let me adjust my guess."
- Round 3: The detective does one more pass, refining the guess one last time.
This is the "Iterative" part. Unlike previous methods that just took a quick, one-glance summary (like squinting at a painting), RELISH takes three careful, focused looks, refining its understanding with every pass.
3. The "Translator" (Linear Regressor)
Once the detective has a perfect, refined internal understanding, it hands that understanding to a simple calculator (a linear regressor) which instantly converts that "feeling" into a precise number (e.g., 4.7).
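The three pieces above (frozen backbone, iterative latent state, linear regressor) can be sketched as a toy in plain Python. This is a minimal illustration of the idea, not the authors' implementation: the hidden states are random stand-ins for the frozen LLM's token representations, and all sizes, update rules, and names here are illustrative assumptions.

```python
import math
import random

rng = random.Random(0)
DIM = 8       # size of each token hidden state -- illustrative, not from the paper
ROUNDS = 3    # number of refinement passes, matching the detective's three rounds

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for the frozen backbone: one hidden-state vector per token.
# In the real method these come from the LLM and are never updated.
hidden_states = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# The "silent detective": a small latent state (learned in the real method).
latent = [0.1] * DIM

for _ in range(ROUNDS):
    # Each pass, score every token state against the current latent...
    weights = softmax([dot(latent, h) for h in hidden_states])
    # ...pool the token states weighted by those scores...
    summary = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(DIM)]
    # ...and refine the latent with what this pass found (residual update).
    latent = [x + s for x, s in zip(latent, summary)]

# The "translator": a linear regressor maps the refined latent to one number.
w_out = [rng.gauss(0, 0.1) for _ in range(DIM)]
score = dot(w_out, latent)
print(round(score, 3))
```

Note what is expensive and what is cheap here: the backbone's hidden states are computed once, and each refinement round only re-reads those cached states, which is why the iterative part adds almost no cost.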
Why is RELISH a Game Changer?
1. It's a Speed Demon (Efficiency)
- The Competitors: The "Voting" method is like asking the librarian to write 16 essays to get one number. It's slow and uses a lot of energy.
- RELISH: It's like a single, focused conversation. The LLM reads the text once, and the tiny refiner takes its few extra looks on top of that single reading. It's incredibly fast.
2. It's a Lightweight Champion (Parameter Efficiency)
- The Competitors: To make the "Voting" method work better, you often have to teach the librarian new tricks, which means bolting new parameters onto their brain. For a huge librarian (32 billion parameters), that can mean roughly 0.4% more parameters just to get reliable numbers out.
- RELISH: It only adds a tiny, tiny "add-on" (about 0.01% to 0.04% of the brain size). It's like adding a single, specialized pair of glasses to a giant robot. It's so small it barely weighs anything, yet it makes the robot see numbers perfectly.
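Those percentages sound similar until you turn them into absolute counts. A quick back-of-envelope check, using the 32-billion-parameter figure and the overhead percentages quoted above:

```python
backbone = 32e9  # a 32-billion-parameter model

# Competitor-style add-on: roughly 0.4% of the backbone.
competitor_addon = backbone * 0.004

# RELISH-style add-on: roughly 0.01% to 0.04% of the backbone.
relish_low = backbone * 0.0001
relish_high = backbone * 0.0004

print(f"competitor add-on: {competitor_addon / 1e6:.0f}M parameters")
print(f"RELISH add-on: {relish_low / 1e6:.1f}M to {relish_high / 1e6:.1f}M parameters")
```

So the competitor add-on is on the order of 128 million parameters, while RELISH's is roughly 3 to 13 million: an order of magnitude (or more) smaller.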
3. It's Smarter than it Looks (Performance)
In the paper's tests, RELISH outperformed the alternative approaches described above.
- It was better at guessing the "similarity" of sentences.
- It was better at judging the quality of translations.
- It did this across different sizes of librarians (from small 8B models to giant 32B models).
The Bottom Line
Imagine you need to measure the temperature of a soup.
- Old Way 1: Ask the chef to describe the temperature in words ("hot," "very hot," "scalding") and guess the number. (Inaccurate).
- Old Way 2: Ask the chef to taste it 16 times and average the results. (Slow).
- Old Way 3: Stick a simple thermometer in the soup that only reads "hot" or "cold." (Too blunt).
- RELISH: You use a high-tech probe that dips in, checks the heat, adjusts its sensor, checks again, and then gives you the exact temperature in one second.
RELISH is that high-tech probe. It lets us use the world's smartest AI models for precise number-crunching tasks without slowing them down or needing to rebuild their brains. It's fast, cheap, and surprisingly accurate.