Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks

The Big Picture: Teaching a Small Student with a Big Teacher

Imagine you have a brilliant but exhausted professor (a Large Language Model or "Teacher") who knows everything. You also have a bright but small student (a Small Language Model) who needs to learn the same material but can't carry as many books or think as fast.

Usually, to teach the student, you just hand them a random pile of practice questions. The problem? The pile might have 1,000 questions about "adding apples" and only 5 questions about "dividing fractions." The student gets great at apples but fails the fractions test.

This paper proposes a smarter way to create those practice questions using Synthetic Data Generation (SDG). Instead of just grabbing random questions, the authors use a map to find exactly where the student is weak and create new questions specifically for those weak spots.

The Core Problem: The "Random Shuffle" Trap

Most current methods of making practice questions work like this:

Take a big bag of existing questions.
Shake the bag and pull out a few random ones.
Ask the "Teacher" to write new, similar questions based on those.

The Flaw: If the bag is mostly full of "apple" questions, you will keep pulling out "apple" questions. The "Teacher" will just write more "apple" questions. The student never gets enough practice on the rare, difficult topics (like "fractions") because those questions are hidden in the bottom of the bag.
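
The trap can be reproduced with a toy simulation. The sketch below is only an illustration: the pool sizes mirror the apples/fractions example above, and the topic labels, the sample size of 200, and the random seed are all made up.

```python
import random
from collections import Counter

# Toy seed pool mirroring the example above: 1,000 "apples" questions, 5 "fractions" questions.
pool = ["apples"] * 1000 + ["fractions"] * 5

rng = random.Random(0)            # fixed seed so the run is reproducible
seeds = rng.choices(pool, k=200)  # uniform random seed selection, with replacement

counts = Counter(seeds)
print(counts)
print(f"share of rare topic in pool:  {5 / len(pool):.3%}")
print(f"share of rare topic in seeds: {counts['fractions'] / len(seeds):.3%}")
```

Because the draw is uniform, the seeds inherit the pool's imbalance, and any questions the "Teacher" writes from them will inherit it too.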

The Solution: The "Heat Map" Approach

The authors suggest looking at the data not as a bag of questions, but as a geographic map.

1. The Embedding Space (The Map)

Imagine every math problem is a dot on a giant map.

Problems about "adding" are clustered in the North.
Problems about "geometry" are clustered in the South.
Problems about "fractions" are in a tiny, lonely village in the East.

When we look at the map, we see that the "North" is packed with people (dense), but the "East" is almost empty (sparse).

2. The Student's Weakness

The paper discovered a golden rule: Where the map is empty, the student is confused.
If a region on the map has very few examples, the student model performs poorly there. If a region is crowded with examples, the student does great.

3. The New Strategy: Targeted Sampling

Instead of shaking the bag randomly, the authors use a GPS to find the empty villages (sparse regions) on the map.

Step A: They find two existing questions that are on the edges of an empty village.
Step B: They draw a straight line between those two questions. The middle of that line is a brand new, imaginary question that fits perfectly in that empty spot.
Step C: They ask the "Teacher" to write a real, high-quality question based on that imaginary middle point.

This is like saying, "We have a question about '2 apples' and a question about '3 apples,' but we have nothing about '2.5 apples.' Let's invent a question about 2.5 apples to fill that gap."

The Analogy: Filling the Gaps in a Jigsaw Puzzle

Think of the student's knowledge as a jigsaw puzzle.

Old Method: You keep grabbing random pieces from the box. You end up with 50 pieces of the sky and 0 pieces of the dog's face. The picture is incomplete.
New Method (This Paper): You look at the puzzle on the table. You see a huge gap where the dog's face should be. You specifically look for the two pieces on the edge of that gap, imagine what the missing piece looks like, and ask a master artist to paint that exact missing piece for you.

Why This Matters

The paper tested this on math problems (like the GSM8K and MATH datasets) using different small AI models.

The Result: The models trained with this "Targeted Map" method got significantly higher scores than those trained with random questions.
The Efficiency: They didn't need thousands of new questions. The gains were largest when only a few hundred targeted questions were added, so focusing on the "empty" areas improved the student with far less data.
The Correlation: They showed empirically that more data in a specific area goes hand in hand with better accuracy in that area. The takeaway: fill the gap, fix the grade.

Summary in One Sentence

Instead of randomly throwing practice questions at a student, this paper teaches us to look at a map of the student's knowledge, find the empty, confusing spots, and specifically generate new questions to fill those holes, making the student smarter with less effort.

1. Problem Statement

The paper addresses the challenge of improving smaller, resource-efficient language models (typically under 20B parameters) so that they can rival much larger models (100B+ parameters). While Synthetic Data Generation (SDG) using a "teacher" model to fine-tune a "student" model is a proven strategy, existing methods suffer from two main limitations:

Lack of Diversity: Traditional SDG often relies on random sampling from a pool of seed examples. This leads to over-sampling of dominant modes in the teacher model's distribution, resulting in low-diversity synthetic data.
Model-Agnostic Approaches: Prior works generally ignore the specific shortcomings of the target "student" model. They do not tailor the synthetic data generation process to address the specific gaps in the student model's knowledge or embedding space.

The authors propose a Targeted Synthetic Data Generation approach that specifically analyzes and targets the deficiencies of a given student model ($SM$) by operating within its embedding space.

2. Methodology: Embedding-based SDG (EmbedSDG)

The proposed pipeline generates synthetic data by identifying and filling "sparse" regions in the student model's embedding space. The process involves six key steps:

A. Embedding Computation & Dimensionality Reduction

Input: A labeled dataset $D$ and the target student model $SM$.
Process: The student model computes token embeddings and attention weights for each example. To manage memory and to handle the anisotropic geometry of transformer embedding spaces, the authors apply dimensionality reduction (e.g., PCA, TruncatedSVD, or t-SNE), projecting the high-dimensional embeddings ($N \approx 4000$ or more) into a low-dimensional space ($K = 2$ or $3$).
Result: A compact embedding space $E$ representing the distribution of the training data.
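
A minimal sketch of this step, under stated assumptions: each example is pooled into one vector by an attention-weighted average of its token embeddings (a guess at how the weights are used), and PCA is done with a plain SVD in NumPy. The function names, the random stand-in data, and the hidden size of 64 are hypothetical; a real run would take the embeddings from $SM$.

```python
import numpy as np

def pool_embedding(token_embs: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """Attention-weighted average of token embeddings: (tokens, N) -> (N,)."""
    w = attn / attn.sum()  # normalise the weights so they sum to 1
    return w @ token_embs

def pca_project(X: np.ndarray, k: int = 2):
    """Project the rows of X onto the top-k principal components."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]
    return (X - mean) @ components.T, mean, components

rng = np.random.default_rng(0)
N = 64  # stand-in for the real hidden size (assumption)
# Six fake examples with different token counts: (token embeddings, attention weights).
examples = [(rng.normal(size=(t, N)), rng.random(t)) for t in (5, 9, 7, 12, 6, 8)]

X = np.stack([pool_embedding(e, a) for e, a in examples])
E, mean, components = pca_project(X, k=2)
print(E.shape)  # (6, 2)
```

Keeping `mean` and `components` around matters later: with a linear projection, any new vector can be mapped into the same space $E$ with the same two quantities.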

B. Identifying Sparse Regions

The authors visualize the embedding space and observe that data is not uniformly distributed; some areas are dense (common topics), while others are sparse (rare topics).
Grid Analysis: The space is divided into a grid. Regions with a sample count below a specific threshold $T$ are identified as sparse regions ($l$).
Hypothesis: Sparse regions in the embedding space correlate with areas where the student model performs poorly due to a lack of training examples.
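
The grid analysis can be sketched for the 2D case as follows. This is an illustration only: the equal-width binning, the function name `sparse_cells`, and the toy points are assumptions, and the sketch treats empty cells as sparse and assumes the points are not all identical along either axis.

```python
from collections import Counter

def sparse_cells(points, bins, threshold):
    """Return grid cells (ix, iy) whose sample count is below `threshold`, plus all counts."""
    xs, ys = zip(*points)
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)

    def cell(p):
        # Map a coordinate to a bin index in [0, bins - 1]; the maximum goes in the last bin.
        ix = min(int((p[0] - x0) / (x1 - x0) * bins), bins - 1)
        iy = min(int((p[1] - y0) / (y1 - y0) * bins), bins - 1)
        return ix, iy

    counts = Counter(cell(p) for p in points)
    all_cells = [(i, j) for i in range(bins) for j in range(bins)]
    return [c for c in all_cells if counts[c] < threshold], counts

# Toy 2D embeddings: a dense cluster near the origin and a few scattered points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.1), (0.1, 0.3),
          (0.9, 0.9), (1.0, 1.0), (0.8, 0.2)]
sparse, counts = sparse_cells(points, bins=2, threshold=2)
print(sparse)  # [(0, 1), (1, 0)]
```

Here `threshold` plays the role of $T$: the cell holding the cluster has four points and is skipped, while the two cells with fewer than two points are flagged.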

C. Seed Selection

For each identified sparse region, the algorithm selects two "seed examples" from the existing dataset $D$.
These seeds are chosen from opposing sides (or surfaces in 3D) of the sparse region to maximize the interpolation span within that specific gap.
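
The summary does not spell out the exact selection rule, so the sketch below swaps in a simple heuristic: among the points nearest to a sparse region's centre, pick the pair whose midpoint lands closest to that centre, which forces the two seeds onto roughly opposite sides. The function name, the neighbour limit `k`, and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist

def pick_seed_pair(points, center, k=8):
    """Pick the indices of two points that straddle `center` (hypothetical heuristic)."""
    # Restrict to the k nearest neighbours of the region centre to keep the pair search small.
    near = sorted(range(len(points)), key=lambda i: dist(points[i], center))[:k]

    def midpoint_error(pair):
        a, b = points[pair[0]], points[pair[1]]
        mid = tuple((u + v) / 2 for u, v in zip(a, b))
        return dist(mid, center)

    # The pair whose midpoint is closest to the centre lies on roughly opposite sides of it.
    return min(combinations(near, 2), key=midpoint_error)

points = [(0.0, 0.5), (1.0, 0.5), (0.1, 0.1), (0.9, 0.95), (0.5, 2.0)]
pair = pick_seed_pair(points, center=(0.5, 0.5))
print(pair)  # (0, 1)
```

Points 0 and 1 sit directly left and right of the centre, so their midpoint hits it exactly; the diagonal pair (2, 3) comes close but misses slightly.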

D. Interpolation

The embeddings of the two selected seeds are interpolated to create a new vector.
Technique: The method averages the weighted embedding sequences of the two seeds. If the dimensionality reduction is linear (e.g., PCA), the new vector projects exactly onto the midpoint between the two seeds in the reduced space, so with seeds on opposite sides of the region it falls inside the target sparse region.
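
Why linearity gives the midpoint can be seen in one line. The notation here is an assumption, not the paper's: write the PCA projection as $z = W^\top (x - \mu)$, with $W \in \mathbb{R}^{N \times K}$ holding the top $K$ principal directions and $\mu$ the dataset mean, and let $x_1, x_2$ be the pooled embeddings of the two seeds with projections $z_1, z_2$.

$$
W^\top\!\left(\frac{x_1 + x_2}{2} - \mu\right)
= \frac{1}{2}\, W^\top (x_1 - \mu) + \frac{1}{2}\, W^\top (x_2 - \mu)
= \frac{z_1 + z_2}{2}
$$

The same argument does not carry over to a nonlinear reduction such as t-SNE, where the image of the averaged embedding need not sit between $z_1$ and $z_2$.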

E. Decoding

The interpolated embedding vector is decoded back into natural language text.
This is achieved by prompting the student model ($SM$) with a specific decoding prompt ($P_d$) that instructs it to reconstruct text from the provided embedding representation.

F. Final Generation

The decoded text, along with the two original seed examples, is fed into a powerful Teacher LLM ($TM$) via a prompt ($P_g$).
The Teacher LLM generates a high-quality, new synthetic example (question and answer) that is semantically situated in the previously sparse region.
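
As a rough illustration of what a generation prompt like $P_g$ might contain, the sketch below assembles one from the two seeds and the decoded text. The wording of the template, the function name, and the example strings are all invented here; the paper's actual prompt is not reproduced in this summary, and no teacher model is called.

```python
def build_generation_prompt(seed_a: str, seed_b: str, decoded: str) -> str:
    """Assemble a teacher prompt from two seeds and the decoded interpolation (hypothetical wording)."""
    return (
        "You are writing a new math word problem with a worked solution.\n\n"
        f"Seed example 1:\n{seed_a}\n\n"
        f"Seed example 2:\n{seed_b}\n\n"
        f"Rough draft decoded from the interpolated embedding (may be noisy):\n{decoded}\n\n"
        "Write one new, self-contained question and its step-by-step answer that is "
        "related to both seeds and to the draft, without copying any of them."
    )

prompt = build_generation_prompt(
    "Q: Tom has 3 apples and buys 4 more. How many does he have now? A: 7",
    "Q: What is 3/4 divided by 1/2? A: 3/2",
    "apples shared in fractions between friends",
)
print(prompt)
```

Passing both seeds alongside the decoded text gives the teacher concrete anchors in case the decoded draft is noisy.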

3. Key Contributions

Targeted SDG Pipeline: A novel framework that tailors synthetic data to a given student model, improving the diversity and quality of its training data, rather than using a generic approach.
Embedding Space Analysis: An empirical analysis demonstrating a strong correlation between the density of training examples in a specific neighborhood of the embedding space and the model's prediction accuracy in that region.
Performance Validation: Experimental evidence showing that targeting sparse regions consistently outperforms random seed selection across different models and benchmarks.

4. Experimental Results

The authors evaluated their method (EmbedSDG) against Random Seed Selection and Base Models using:

Models: Granite 3 8B, Granite 3.1 8B, and Mistral 7B.
Datasets: MetaMathQA (seed pool), GSM8K, and MATH (benchmarks).
Setup: Fine-tuning with varying amounts of synthetic data (500, 1000, 4500 examples).

Key Findings:

Consistent Improvement: EmbedSDG consistently outperformed random sampling across all models and datasets.
Significant Gains with Low Data: The improvement was most pronounced with smaller datasets. For example, Mistral 7B on GSM8K showed a roughly 1.8x improvement (0.62 vs. 0.35 accuracy) when using 500 EmbedSDG examples compared to 500 random examples.
Correlation Analysis: Statistical analysis (Pearson $r = 0.813$, Spearman $\rho = 0.806$) showed a strong positive correlation between data density in the embedding space and model accuracy: as the number of examples in a region increases, accuracy tends to improve.
Efficiency: The method allows smaller models to achieve performance closer to larger models with fewer, higher-quality synthetic examples.
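
For readers who want to run the same kind of correlation analysis on their own per-region counts, here is a small pure-Python sketch of the two coefficients named above. The `density` and `accuracy` numbers are made up for illustration and are not the paper's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-region numbers for illustration only (not from the paper).
density = [3, 8, 15, 40, 120, 300]
accuracy = [0.20, 0.35, 0.30, 0.55, 0.70, 0.80]
print(f"Pearson r = {pearson(density, accuracy):.3f}")
print(f"Spearman rho = {spearman(density, accuracy):.3f}")
```

Pearson measures how linear the relationship is, while Spearman only asks whether accuracy rises as density rises, so reporting both guards against a few very dense regions dominating the result.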

5. Significance and Limitations

Significance:

Resource Efficiency: Provides a pathway to make smaller, cheaper LLMs viable for complex reasoning tasks by optimizing the data they are trained on.
Data Quality over Quantity: Validates the hypothesis that addressing data sparsity in the embedding space is more effective than simply increasing the volume of random synthetic data.
Model-Centric Design: Shifts the SDG paradigm from "generate everything" to "generate what the model needs."

Limitations:

Domain Specificity: The study was limited to mathematical reasoning and specific models (Granite/Mistral) where the fine-tuning data history was known. Generalization to other domains or models with unknown training histories is unproven.
Computational Cost: While the goal is to help smaller models, the process requires running a large Teacher LLM and performing dimensionality reduction/interpolation, which still demands significant compute resources compared to simple random sampling.

Conclusion

The paper demonstrates that analyzing the geometric distribution of data in a student model's embedding space allows for the targeted generation of synthetic data. By specifically filling "sparse" regions, the method boosts the reasoning accuracy of smaller LLMs on math benchmarks, offering a more efficient alternative to random data generation.