GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency

Imagine you are trying to teach a brilliant but hungry student (the AI) how to become a master coder. You have a library with 300 billion pages of books (the data).

In the past, the strategy was simple: "Feed the student as many pages as possible, as fast as possible." But we've hit a wall. The library is running out of good books, and most of the remaining pages are just repetitive, noisy, or boring. If you just throw random pages at the student, they get overwhelmed, waste time on things they already know, and miss the few pages that could actually make them smarter.

This paper introduces GRIP, a new way to curate the student's diet. Instead of just counting pages, GRIP looks at the shape and value of the information.

Here is how GRIP works, broken down into three simple steps using everyday analogies:

1. The "Smart Map" (Geometric Refinement)

Imagine the library isn't just a pile of books, but a giant, 3D landscape.

The Problem: Some areas of the landscape are packed with identical books (dense clusters). Other areas are vast, empty deserts with only a few rare, valuable books (sparse clusters).
The Old Way: A robot would just grab books from the nearest pile, ignoring the empty deserts.
The GRIP Way: GRIP creates a 3D map of the library. It realizes that the "dense" areas are boring (the student already knows this stuff) and the "sparse" areas are where the student is missing knowledge. It decides to send the student specifically to the empty deserts to find those rare gems.

2. The "Quick Test" (Adaptive Information Potential)

Now that GRIP knows where to look, it needs to know what to pick up.

The Problem: How do you know if a book is hard to learn or easy?
The GRIP Way: GRIP uses a Rapid Adaptation Probe (RAP). Think of this as a "pop quiz."
- GRIP takes a small group of books from a specific area and gives them to the student for a quick, intense study session.
- If the student learns it instantly: The book was too easy. GRIP says, "Skip this, we don't need more of it."
- If the student struggles but then has an "Aha!" moment: The book is valuable. It fills a gap in their knowledge. GRIP says, "Get more of these! This is exactly what the student needs right now."
- This allows GRIP to constantly shift the student's diet based on what they are currently struggling to learn, rather than what was popular yesterday.

3. The "Long-Story Fix" (Length-Rectified Selection)

This is the most clever part of the paper.

The Problem: In the world of AI, long stories (long code snippets) often get "squished" together. Imagine trying to organize a library where all the short stories are spread out on the shelves, but all the long, complex novels are crammed into a single, tiny corner. Because they are crammed so tightly, a robot looking for "variety" might think, "Oh, these long novels all look the same," and throw them away.
The GRIP Way: GRIP realizes this is a trick of the light (a geometric distortion). It applies a "Length-Rectifier."
- It essentially says, "Wait a minute, just because these long stories are crowded together doesn't mean they are the same. They are actually unique and critical."
- It forces the selection process to pull out those long, complex stories that were being ignored, ensuring the student learns how to handle long, complicated logical chains (like writing a whole software program instead of just a single line of code).

The Result

When the researchers tested this on AI models:

They trained a model using GRIP's curated diet on a smaller amount of data.
They compared it to a model trained on 3 times more data that was just randomly picked (the "junk food" diet).
The Winner: The GRIP-trained model was smarter, better at reasoning, and more robust, even though it studied less.

Summary

GRIP is like a personal tutor who doesn't just hand you a stack of books. Instead, the tutor:

Maps your knowledge gaps.
Tests you to see what you are ready to learn next.
Fixes the bias that makes long, complex topics look boring.

By doing this, the AI learns more efficiently, skipping the noise and focusing on the high-value information that actually makes it smarter.

1. Problem Statement

Progress in Large Language Models (LLMs) increasingly depends on data efficiency rather than raw data volume. As high-quality public corpora are depleted, simply aggregating noisy web-scale data yields diminishing returns. Existing data selection methods are fragmented into two families, each of which misses part of the picture:

Structural Budgeting: Adjusts mixture weights across predefined domains but ignores intra-cluster semantic variance and instance quality.
Instance-Level Saliency: Filters data based on difficulty or training dynamics but often decouples local importance from global topology, disrupting the hierarchical integrity of the dataset.

This is particularly critical for code corpora, which possess rigid logical topologies. Furthermore, Transformer embeddings suffer from Length-Induced Geometric Collapse, where long sequences collapse into dense, narrow cones in the embedding space, causing standard filters to erroneously discard them as "redundant" due to artificially high cosine similarities.
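
A toy numerical model can make this collapse concrete. The sketch below assumes (as an illustration only, not as the paper's analysis) that a sequence embedding is the mean of its token vectors and that every token vector is a shared offset plus independent noise. Under that assumption, averaging more tokens cancels the noise, so unrelated long sequences end up almost parallel. All names and constants here are hypothetical.

```python
import numpy as np

def mean_pooled_embedding(rng, length, shared, noise_scale=1.0):
    """Toy embedding: mean of `length` token vectors (shared offset + noise)."""
    tokens = shared + noise_scale * rng.standard_normal((length, shared.shape[0]))
    return tokens.mean(axis=0)

def average_pairwise_cosine(length, dim=64, pairs=200, seed=0):
    """Mean cosine similarity between independently generated sequences."""
    rng = np.random.default_rng(seed)
    shared = np.full(dim, 0.2)  # hypothetical component common to all tokens
    sims = []
    for _ in range(pairs):
        a = mean_pooled_embedding(rng, length, shared)
        b = mean_pooled_embedding(rng, length, shared)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))

# Unrelated short sequences stay spread out; unrelated long ones look alike.
short_sim = average_pairwise_cosine(length=8)
long_sim = average_pairwise_cosine(length=2048)
```

In this toy setting a cosine-similarity filter would flag the long sequences as near-duplicates even though their content is independent, which is the failure mode described above.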

2. Methodology: The GRIP Framework

GRIP (Geometric Refinement and Adaptive Information Potential) unifies macro-level budgeting and micro-level selection within a hierarchical geometric optimization framework. It treats the corpus as an information-dense geometric space and operates in two coupled stages:

A. Inter-Cluster Budgeting (Macro-Level)

This stage dynamically allocates the global sampling budget ($B_{total}$) across semantic clusters ($C_k$).

Probe-Centric Representation: The corpus is partitioned into $K$ disjoint semantic clusters using spherical k-means. A Neyman-Optimal Probe Set is constructed to estimate cluster properties (size, geometric consistency $\sigma_k$, and quality $Q_k$) with minimal variance.
Static Baseline Budget: A non-linear capacity allocation rule distributes the initial budget based on cluster size, geometric dispersion, and static quality scores (estimated via an LLM-as-a-Judge).
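
For reference, classical Neyman allocation gives each stratum a sample share proportional to its size times its standard deviation, $n_k \propto N_k \sigma_k$. The sketch below applies that rule to probe sizes. How GRIP actually constructs its probe set is not detailed here, so the function name, the flooring, and the one-probe minimum are illustrative assumptions.

```python
import numpy as np

def neyman_allocation(cluster_sizes, cluster_stds, probe_budget):
    """Split a probe budget across clusters in proportion to N_k * sigma_k.

    Assumption: sigma_k is a per-cluster dispersion estimate. Counts are
    floored, so the total can fall slightly below `probe_budget`.
    """
    sizes = np.asarray(cluster_sizes, dtype=float)
    stds = np.asarray(cluster_stds, dtype=float)
    weights = sizes * stds
    if weights.sum() == 0:            # all clusters degenerate: fall back to size
        weights = sizes
    raw = probe_budget * weights / weights.sum()
    # At least one probe per cluster, never more than the cluster contains.
    return np.clip(np.floor(raw), 1, sizes).astype(int)

# Hypothetical clusters: the second is small but highly dispersed.
probe_counts = neyman_allocation([1000, 1000, 4000], [0.5, 2.0, 0.5], probe_budget=90)
```

With these hypothetical inputs, the small high-dispersion cluster receives as many probes as the cluster four times its size, because more samples are needed to estimate a noisy cluster to the same precision.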
Dynamic Replay via Loss Dynamics: To adapt to the model's evolving state, GRIP employs a Rapid Adaptation Probe (RAP):
- The model is split into Frozen Layers and Retraining Layers.
- For each cluster, Retraining Layers are reset to a shared initialization, and $N$-step gradient descent is performed.
- The Adaptation Delta ($\Delta L_k$) measures the loss reduction. A small drop indicates a "Representation Deficit" (the model struggles to learn from this data), signaling a need for increased budget.
- A Replay Multiplier ($r_k$) is applied inversely to the loss drop, prioritizing clusters where the model has high epistemic uncertainty.
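
The probe loop above can be sketched on a toy model. Here the "frozen layers" are a fixed random projection with a tanh, the "retraining layers" are a linear head, and the loss is mean squared error. The exponential form of the multiplier, the temperature, and every name below are assumptions made for illustration; the text only states that $r_k$ varies inversely with the loss drop.

```python
import numpy as np

def rapid_adaptation_probe(clusters, frozen_w, head_init, n_steps=50, lr=0.1):
    """Return the loss drop (Delta L_k) per cluster after n_steps of GD on the head."""
    deltas = []
    for X, y in clusters:
        feats = np.tanh(X @ frozen_w)      # frozen layers: never updated
        head = head_init.copy()            # reset to the shared initialization
        loss_start = np.mean((feats @ head - y) ** 2)
        for _ in range(n_steps):
            grad = 2.0 * feats.T @ (feats @ head - y) / len(y)
            head -= lr * grad
        loss_end = np.mean((feats @ head - y) ** 2)
        deltas.append(loss_start - loss_end)
    return np.array(deltas)

def replay_multipliers(deltas, temperature=1.0):
    """Assumed form: r_k decreases with the loss drop and averages to 1."""
    scores = np.exp(-(deltas - deltas.min()) / temperature)
    return scores / scores.mean()

def rebudget(static_budget, multipliers):
    """Scale the static budget by r_k while keeping the total unchanged."""
    scaled = static_budget * multipliers
    return static_budget.sum() * scaled / scaled.sum()

rng = np.random.default_rng(0)
frozen_w = rng.standard_normal((8, 4))
head_init = np.zeros(4)

# "Easy" cluster: targets are exactly representable by the head.
X_easy = rng.standard_normal((200, 8))
y_easy = np.tanh(X_easy @ frozen_w) @ np.array([1.0, -1.0, 0.5, 2.0])
# "Hard" cluster: targets the head can barely fit in a few steps.
X_hard = rng.standard_normal((200, 8))
y_hard = rng.standard_normal(200)

deltas = rapid_adaptation_probe([(X_easy, y_easy), (X_hard, y_hard)], frozen_w, head_init)
multipliers = replay_multipliers(deltas)
dynamic_budget = rebudget(np.array([500.0, 500.0]), multipliers)
```

The cluster with the smaller loss drop ends up with the larger share of the budget, matching the rule stated above. Because the head is reset for every cluster, the drops are comparable across clusters rather than dependent on probe order.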

B. Intra-Cluster Selection (Micro-Level)

Once a budget $n_k$ is assigned to a cluster, specific instances are selected to maximize local geometric coverage while correcting for embedding artifacts.

Kernel-Based Diversity Sampling: Uses Inverse Propensity Sampling to down-weight samples near dense cluster centroids (common patterns) and to favor distinct examples that define the convex hull of the semantic span.
Length-Rectified Geometric Prior: To counteract the Length-Induced Collapse, a Length-Rectification Term ($\beta$) is introduced. It up-weights long sequences that would otherwise be suppressed by pseudo-density artifacts, preserving long-tail logical sequences.
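
The two selection rules above can be sketched together. The code estimates each sample's density with an exponential kernel on cosine similarity, weights samples by the inverse of that density, and then multiplies by a length factor. Treating $\beta$ as an exponent on length relative to the cluster median is an assumption made for illustration, as are the kernel, the bandwidth, and all function names.

```python
import numpy as np

def selection_weights(embeddings, lengths, bandwidth=0.1, beta=0.5):
    """Sampling probabilities: inverse kernel density times a length factor."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Kernel density estimate: high when many neighbors point the same way.
    density = np.exp((cos - 1.0) / bandwidth).mean(axis=1)
    inverse_propensity = 1.0 / density
    # Assumed rectifier: (length / median length) ** beta; beta=0 disables it.
    lengths = np.asarray(lengths, dtype=float)
    length_factor = (lengths / np.median(lengths)) ** beta
    weights = inverse_propensity * length_factor
    return weights / weights.sum()

def select_instances(embeddings, lengths, n_k, seed=0, **kwargs):
    """Draw n_k distinct indices according to the selection weights."""
    probs = selection_weights(embeddings, lengths, **kwargs)
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n_k, replace=False, p=probs)

# Four near-duplicate directions (index 3 is a long sequence) and one outlier.
toy_embeddings = [[1, 0], [1, 0.01], [1, -0.01], [1, 0.02], [0, 1]]
toy_lengths = [100, 100, 100, 1600, 100]

plain = selection_weights(toy_embeddings, toy_lengths, beta=0.0)
rectified = selection_weights(toy_embeddings, toy_lengths, beta=1.0)
chosen = select_instances(toy_embeddings, toy_lengths, n_k=2, beta=1.0)
```

Without rectification the outlier dominates and the long sequence is treated like any other member of the dense group. With the length factor enabled, the long sequence regains a high selection probability despite sitting in a crowded region.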

3. Key Contributions

Unified Selection Framework: GRIP bridges the gap between macro-budgeting and micro-instance selection. Experiments on 8B and 16B Mixture-of-Experts (MoE) models show a +4.6% average improvement across benchmarks, outperforming models trained on 3x larger uncurated datasets.
Adaptive Information Potential (RAP): Introduces a mechanism grounded in V-usable information theory to identify "representation deficits" dynamically, reallocating resources based on the model's current learnability rather than static heuristics.
Length-Rectified Geometric Selection: Characterizes and corrects the geometric collapse in Transformer embeddings for long-context data, preventing the loss of structurally critical, long-tail patterns.
Loss-Driven Quality Dynamics: Establishes a theoretical link between instantaneous loss reduction and data learnability, prioritizing samples that offer maximum incremental gain throughout pre-training.

4. Experimental Results

The framework was evaluated on 8B and 16B MoE models trained from scratch on a hybrid corpus (CommonCrawl + The Stack v2) up to 300B tokens.

Scaling Efficiency: GRIP consistently outperforms random sampling baselines. The performance gap widens with model capacity (e.g., +4.8% on 16B models), indicating effective saturation of model capacity with high-density information.
Reasoning and Robustness: Significant gains were observed in reasoning-intensive benchmarks:
- LiveCodeBench: +4.1% improvement (8B).
- MultiPL-E (Multilingual): +10.2% improvement (8B).
- HumanEval/MBPP: Substantial gains in code generation pass rates.
Ablation Studies:
- Static vs. Dynamic: Switching from static quality replay to loss-based replay yielded a +1.0% gain, supporting the value of dynamic feedback.
- The "Diversity Trap": Using diversity sampling without length correction caused a performance drop in structural tasks (MultiPL-E), confirming that naive density sampling discards valuable long-context data.
- Length Rectification: Adding the length-correction term recovered multilingual and reasoning capabilities, validating the geometric collapse hypothesis.
Transferability: The learnability signal derived from lightweight proxy models (e.g., SmolLM-135M) was shown to be robust and transferable to larger target architectures (16B), with negligible computational overhead (<1% of total FLOPs).

5. Significance

GRIP establishes a robust geometric foundation for adaptive data curation. It demonstrates that prioritizing informative geometry and dynamic learnability over raw data volume is the key to scaling LLMs efficiently in an era of data scarcity. By addressing both macro-redundancy and micro-geometric collapse (specifically for long contexts), GRIP provides a scalable path for training high-performance models under fixed computational constraints, particularly for complex domains like code generation and logical reasoning.
