p1: Better Prompt Optimization with Fewer Prompts

The paper introduces p1, a method that improves prompt optimization by filtering the set of user prompts down to a small subset on which performance varies strongly across system prompts. This high-variance subset makes superior system prompts much easier to identify, and the prompts optimized this way outperform existing baselines on reasoning benchmarks.

Original authors: Zhaolin Gao, Yu (Sid) Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun

Published 2026-04-13

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a brilliant but slightly confused student (the AI) who is trying to solve very hard math problems. You can't change how their brain is wired (you can't retrain the model), but you can give them a set of instructions on how to think. This set of instructions is called a System Prompt.

The goal of this paper is to figure out how to write the perfect set of instructions to make the student solve more problems correctly.

The Problem: Why "More Data" Sometimes Makes Things Worse

Usually, when you teach a human, you give them a huge textbook with thousands of examples. The more examples they see, the better they get.

The researchers found something weird happening with AI prompts: Giving the AI a huge textbook of math problems actually made it harder to find the perfect instructions.

Here is the analogy:
Imagine you are trying to find the best pair of running shoes.

  • Scenario A (Homogeneous Task): You are training for a 100-meter sprint. You try 100 different pairs of shoes on the track. Some are terrible, some are great. It's very easy to see which shoes make you faster because the track is the same every time.
  • Scenario B (Heterogeneous Task - The Math Problem): You are training for a multi-sport event (sprinting, swimming, and climbing).
    • If you try to find "one perfect shoe" for all three sports, you get confused. The shoes that are great for sprinting are terrible for swimming. The shoes for climbing are useless for running.
    • When you average the results across all the different sports, the "good" shoes and "bad" shoes cancel each other out. The data looks like noise. You can't tell which shoe is actually the best because the task is too mixed up (a tiny numeric example follows this list).
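
To see why the averaging kills the signal, here is a tiny invented example (all numbers made up for illustration): two sets of instructions that each solve a different half of the problems look identical once you average over everything.

```python
# Invented 0/1 results on four mixed problems: 1 = solved, 0 = failed.
instructions_a = [1, 1, 0, 0]  # great on problems 1-2, bad on 3-4
instructions_b = [0, 0, 1, 1]  # the exact opposite

# Averaged over all four problems, the two look identical (0.5 vs 0.5),
# even though each is clearly better on specific problems.
print(sum(instructions_a) / 4, sum(instructions_b) / 4)
```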

The paper frames this as a tension between two kinds of variance.

  • Response Variance (Noise): The AI is just being random. Sometimes it gets the answer right by luck, sometimes wrong, even with the same instructions.
  • Prompt Variance (Signal): How much the instructions actually change the outcome.

On hard math problems (like the AIME competition), the "noise" (randomness) is so loud that it drowns out the "signal" (the quality of the instructions). When you add more math problems to the training set, you add more noise, making it even harder to hear the signal.
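
To make the two variances concrete, here is a minimal Python sketch for a single math problem. Everything here is invented for illustration: the `score` function merely simulates an LLM call with a made-up success rate, and this is not the paper's code.

```python
import random
import statistics

def score(system_prompt: str, problem: str) -> int:
    # Stand-in for one LLM call: 1 if the model solves the problem,
    # 0 otherwise. We simulate a model whose success rate depends on
    # which instructions it was given. (Hypothetical numbers.)
    success_rate = {"instructions A": 0.7, "instructions B": 0.3}
    return 1 if random.random() < success_rate[system_prompt] else 0

system_prompts = ["instructions A", "instructions B"]
problem = "hard math problem"  # one fixed user prompt
n_samples = 100                # repeated runs per system prompt

# Mean score of each set of instructions on this one problem.
means = [
    statistics.mean(score(p, problem) for _ in range(n_samples))
    for p in system_prompts
]

# Response variance (noise): disagreement between repeated runs of
# the SAME instructions. For a 0/1 score with mean m it is m*(1-m),
# averaged here over the candidate instructions.
response_variance = statistics.mean(m * (1 - m) for m in means)

# Prompt variance (signal): how much the mean score moves when you
# swap the instructions.
prompt_variance = statistics.pvariance(means)

print(f"signal (prompt variance):  {prompt_variance:.3f}")
print(f"noise (response variance): {response_variance:.3f}")
```

When the problem is "sensitive," the signal term dominates; on a noisy, instruction-insensitive problem the noise term swamps it.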

The Solution: The "p1" Filter (The Spotlight Method)

Instead of trying to teach the AI with the whole set of 30 math problems, the researchers proposed a method called p1.

The Analogy:
Imagine you are a coach trying to pick the best running shoes. Instead of testing the shoes on 30 different athletes doing 30 different sports, you pick just two athletes who are extremely sensitive to shoe quality.

  • Athlete A runs fast in Shoe X but trips in Shoe Y.
  • Athlete B runs fast in Shoe Y but trips in Shoe X.

By focusing only on these two "sensitive" athletes, you can clearly see which shoe is better. You ignore the 28 other athletes who don't care much about the shoes because their results are just random noise.

How p1 works:

  1. Test the Waters: The method tries a bunch of different candidate instructions on the available math problems.
  2. Find the "Sensitive" Problems: It looks for the specific math problems where changing the instructions causes a huge difference in the score (one instruction gets 100%, another gets 0%). These are the problems where the instructions matter.
  3. Filter: It throws away the "boring" problems where the instructions don't seem to change anything.
  4. Train: It optimizes the instructions using only those few, high-sensitivity problems (a code sketch follows this list).
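
Here is a minimal sketch of the filtering step, assuming you have already scored every candidate instruction on every problem. The score matrix, the `top_k` parameter, and all toy numbers are illustrative assumptions, not the paper's exact selection rule.

```python
import numpy as np

# scores[i, j] = mean accuracy of system prompt i on math problem j,
# averaged over several sampled responses. (Toy numbers only.)
scores = np.array([
    [0.9, 0.5, 0.1, 0.5],   # system prompt 0
    [0.1, 0.5, 0.9, 0.5],   # system prompt 1
    [0.8, 0.5, 0.2, 0.5],   # system prompt 2
])

# Step 2: for each problem, measure how much the score varies when
# the system prompt changes. High variance = a "sensitive" problem
# where the instructions clearly matter.
sensitivity = scores.var(axis=0)

# Steps 3-4: keep only the top-k most sensitive problems and run the
# prompt optimizer on that small subset.
top_k = 2
selected = np.argsort(sensitivity)[::-1][:top_k]
print("per-problem sensitivity:", np.round(sensitivity, 3))
print("problems kept for optimization:", sorted(selected.tolist()))
```

In this toy matrix, problems 0 and 2 are the ones where swapping instructions flips the outcome, so they are the two the filter keeps; problems 1 and 3 score 0.5 no matter what and are discarded as noise.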

The Results: Less is More

The results were surprising:

  • The Old Way (Full Dataset): Optimizing the instructions on all 30 math problems produced almost no improvement; the signal was buried in noise.
  • The p1 Way (Filtered Dataset): Optimizing on just two carefully selected math problems produced a set of instructions that was much better than the old way's.
  • Generalization: Even better, the instructions learned from just two problems also worked well on other math competitions the model had never seen before.

The Takeaway

When you are trying to teach an AI a complex, messy skill (like advanced math), don't throw everything at it.

Sometimes, the best way to learn is to find the specific, tricky examples where the difference between "good" and "bad" is most obvious, and focus your energy there. It's like trying to tune a radio: if you turn the volume up on a station full of static (noise), you can't hear the music. But if you tune to a clear frequency (a sensitive problem), the music comes through crystal clear.

In short: To get better AI instructions, stop trying to please everyone. Find the few problems where the instructions make the biggest difference, and focus on those.
