LLMs can construct powerful representations and streamline sample-efficient supervised learning

This paper proposes an agentic pipeline that uses Large Language Models to automatically synthesize programmatic rubrics, transforming complex, heterogeneous clinical data into standardized formats. The result is sample-efficient supervised learning that outperforms both traditional models and larger foundation models, while offering significant advantages in auditability and deployment cost.

Ilker Demirel, Larry Shi, Zeshan Hussain, David Sontag

Published 2026-03-13

Imagine you are trying to teach a brilliant but slightly literal-minded robot (an AI) how to be a doctor. You have a massive stack of patient files, but these files are a mess. They contain typed notes, handwritten scribbles, lab numbers, and codes, all jumbled together in a long, unorganized stream of text.

If you just hand this messy stack to the robot and say, "Figure out who is sick," the robot gets confused. It's like trying to find a specific needle in a haystack while blindfolded and wearing boxing gloves.

This paper introduces a clever new way to solve this problem. Instead of forcing the robot to learn from the messy haystack, the authors use a "Super-Intelligent Librarian" (a Large Language Model or LLM) to organize the haystack into neat, labeled boxes before the robot ever sees it.

Here is the breakdown of their method, Rubric Representation Learning, using simple analogies:

1. The Problem: The "Messy Desk"

Real-world medical data is like a doctor's desk that hasn't been cleaned in years.

  • The Data: It has blood pressure numbers from 1998 mixed with notes about a broken toe from last week, and a list of medications written in shorthand.
  • The Old Way: Previous methods tried to feed this whole messy desk to the AI. The AI had to guess what was important. Sometimes it worked, but often it missed critical clues because the signal was buried in the noise.

2. The Solution: The "Smart Librarian"

The authors propose using an LLM not as the final doctor, but as a Librarian who reorganizes the desk before the AI sees it. They call this process creating a "Rubric."

Think of a Rubric like a strict, pre-printed form or a checklist that the Librarian fills out for every single patient.

How the Librarian Works (The Two Types of Rubrics)

A. The "Global Rubric" (The Master Blueprint)
Imagine the Librarian looks at 40 different patient files (a mix of sick and healthy people) and says, "Okay, to predict if someone will get high blood pressure next year, I need to ignore the noise and only look at these specific things."

The Librarian then writes a Master Blueprint (the Global Rubric) that says:

  • "Look at blood pressure readings from the last 30 days."
  • "Check if they are taking specific heart meds."
  • "Ignore the notes about their broken toe."
  • "Convert all weight to kilograms."

Once this blueprint is written, the Librarian can use a simple, fast computer script (a parser) to fill out this form for thousands of other patients instantly. It's like having a stamp that automatically organizes every new file into the perfect format.

  • Why it's great: It's cheap, fast, and consistent. Every patient gets the exact same organized form.
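Concretely, a global rubric can be thought of as a small, fixed set of extraction rules that run as plain code over any record, with no LLM in the loop. The sketch below is a hand-written toy version (in the paper's pipeline, the LLM synthesizes the rubric); the field names, regexes, and drug list are illustrative assumptions, not the paper's actual rules.

```python
import re

# Toy "global rubric": extraction rules written once, then applied
# cheaply and consistently to every patient record.
GLOBAL_RUBRIC = {
    # Most recent systolic blood pressure mentioned in the note.
    "systolic_bp": lambda note: (
        float(m.group(1)) if (m := re.search(r"BP (\d+)/\d+", note)) else None
    ),
    # Whether any of a fixed list of heart medications appears.
    "on_heart_meds": lambda note: any(
        drug in note.lower() for drug in ("lisinopril", "metoprolol")
    ),
    # Weight, normalized to kilograms regardless of the unit in the note.
    "weight_kg": lambda note: (
        round(float(m.group(1)) * 0.4536, 1)
        if (m := re.search(r"(\d+(?:\.\d+)?) lbs", note))
        else float(m.group(1))
        if (m := re.search(r"(\d+(?:\.\d+)?) kg", note))
        else None
    ),
}

def apply_rubric(note: str) -> dict:
    """Fill out the same fixed form for any raw note -- no LLM call needed."""
    return {field: extract(note) for field, extract in GLOBAL_RUBRIC.items()}

note = "BP 142/90 today. Takes lisinopril 10mg. Weight 176 lbs. Stubbed toe last week."
print(apply_rubric(note))
# → {'systolic_bp': 142.0, 'on_heart_meds': True, 'weight_kg': 79.8}
```

Note how the broken-toe mention simply falls through: nothing in the rubric asks about it, so the noise never reaches the downstream model.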

B. The "Local Rubric" (The Personal Summary)
Sometimes, the Librarian looks at a specific patient and writes a short, custom summary just for them.

  • "This patient is young but has a rare heart defect and smokes. Even though they are young, the combination of the defect and smoking makes them high risk."
  • Why it's great: It captures the unique story of the patient very well.
  • The downside: It takes a long time and costs money to have the Librarian write a custom summary for every single patient.
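The cost difference between the two rubric types can be sketched in a few lines. Everything here is illustrative: `call_llm` is a stand-in for a real LLM API, and the canned summary string is an example, not the paper's actual prompt or output.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (stubbed for illustration)."""
    return "High risk: rare heart defect combined with active smoking."

def local_rubric(note: str) -> str:
    # One LLM call PER patient: expressive and personalized,
    # but slow and expensive at scale.
    prompt = f"Summarize this patient's risk factors:\n{note}"
    return call_llm(prompt)

def global_rubric(note: str) -> dict:
    # Zero LLM calls at inference time: the rules were written once,
    # up front, and now run as plain code on every new patient.
    return {"mentions_smoking": "smok" in note.lower()}

note = "28yo, rare congenital heart defect, smokes daily."
print(local_rubric(note))   # costs one LLM call per patient
print(global_rubric(note))  # costs essentially nothing per patient
```

The trade-off in miniature: the local rubric captures the patient's unique story as free text, while the global rubric pays its LLM cost exactly once, no matter how many patients follow.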

3. The Results: Why This Matters

The authors tested this on 15 different medical prediction tasks (like predicting heart attacks, diabetes, or hospital readmissions).

  • The Competition: They compared their method against:
    1. The "Naive" approach: Just feeding the messy text to the AI.
    2. The "Super-Model": A massive, expensive AI trained on millions of patients (CLMBR-T).
  • The Winner: The Rubric method won.
    • It beat the "Naive" approach easily.
    • Crucially, it beat the "Super-Model" that had seen 2.5 million patients, even though the Rubric method only looked at a tiny handful of examples to learn the rules.

4. The "Aha!" Moment

The paper's key finding is that how you organize the information can matter more than how big the AI is.

Think of it like this:

  • The Super-Model is a genius student who has read every book in the library but is trying to read a messy, scribbled note.
  • The Rubric Method is a smart assistant who takes that messy note, rewrites it into a clear, perfect sentence, and hands it to a regular student.
  • Result: The regular student with the clear note understands the problem better than the genius student with the messy note.

5. Why This is a Big Deal for the Real World

  • Auditability: Because the "Global Rubric" is a fixed form (like a spreadsheet), doctors can look at it and say, "Yes, this makes sense," or "No, we should change this rule." You can't easily do that with a giant, black-box AI.
  • Cost: Once the "Master Blueprint" is written, you don't need to pay the expensive AI to process every new patient. You can use a simple, free computer script to fill out the forms. This makes it possible to use this technology in hospitals with tight budgets.
  • Flexibility: The organized forms can be turned into simple tables (like Excel sheets), allowing hospitals to use any standard statistical tool they already know how to use.
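Because every patient ends up with the same fixed set of fields, the rubric's output drops straight into any standard tabular workflow. A minimal sketch with scikit-learn, using made-up rubric rows and toy labels (feature names and numbers are illustrative, not from the paper):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical rubric output for four patients: each row is the same
# fixed form (systolic BP, on heart meds, weight in kg), so an
# off-the-shelf tabular model can consume it directly.
X = [
    [142.0, 1, 79.8],
    [118.0, 0, 64.1],
    [150.0, 1, 91.2],
    [110.0, 0, 58.5],
]
y = [1, 0, 1, 0]  # toy labels: developed hypertension within a year

clf = LogisticRegression().fit(X, y)
print(clf.predict([[145.0, 1, 85.0]]))  # classify a new rubric row
```

The point is not the model choice: once the data is a clean table, a hospital can swap in logistic regression, gradient boosting, or whatever statistical tooling its team already trusts.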

Summary

This paper shows that we don't always need bigger, more expensive AI models to solve complex problems. Instead, we can use AI to act as a smart organizer, turning messy, chaotic data into clean, structured information. By doing the "heavy lifting" of organization first, even simple models can become incredibly powerful doctors.