p1: Better Prompt Optimization with Fewer Prompts

The paper introduces p1, a method that improves prompt optimization by filtering the set of user prompts down to a small subset on which performance varies strongly across system prompts. This high-variance subset makes superior system prompts much easier to identify, and the prompts optimized this way outperform existing baselines on reasoning benchmarks.

Original authors: Zhaolin Gao, Yu (Sid) Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun

Published 2026-04-13

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a brilliant but slightly confused student (the AI) who is trying to solve very hard math problems. You can't change how their brain is wired (you can't retrain the model), but you can give them a set of instructions on how to think. This set of instructions is called a System Prompt.

The goal of this paper is to figure out how to write the perfect set of instructions to make the student solve more problems correctly.

The Problem: Why "More Data" Sometimes Makes Things Worse

Usually, when you teach a human, you give them a huge textbook with thousands of examples. The more examples they see, the better they get.

The researchers found something weird happening with AI prompts: Giving the AI a huge textbook of math problems actually made it harder to find the perfect instructions.

Here is the analogy:
Imagine you are trying to find the best pair of running shoes.

  • Scenario A (Homogeneous Task): You are training for a 100-meter sprint. You try 100 different pairs of shoes on the track. Some are terrible, some are great. It's very easy to see which shoes make you faster because the track is the same every time.
  • Scenario B (Heterogeneous Task - The Math Problem): You are training for a multi-sport event (sprinting, swimming, and climbing).
    • If you try to find "one perfect shoe" for all three sports, you get confused. The shoes that are great for sprinting are terrible for swimming. The shoes for climbing are useless for running.
    • When you average the results across all the different sports, the "good" shoes and "bad" shoes cancel each other out. The data looks like noise. You can't tell which shoe is actually the best because the task is too mixed up (a tiny numeric example follows this list).
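
To see why the averaging kills the signal, here is a tiny invented example (all numbers made up for illustration): two sets of instructions that each solve a different half of the problems look identical once you average over everything.

```python
# Invented 0/1 results on four mixed problems: 1 = solved, 0 = failed.
instructions_a = [1, 1, 0, 0]  # great on problems 1-2, bad on 3-4
instructions_b = [0, 0, 1, 1]  # the exact opposite

# Averaged over all four problems, the two look identical (0.5 vs 0.5),
# even though each is clearly better on specific problems.
print(sum(instructions_a) / 4, sum(instructions_b) / 4)
```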

The paper frames this as a tension between two kinds of variance.

  • Response Variance (Noise): The AI is just being random. Sometimes it gets the answer right by luck, sometimes wrong, even with the same instructions.
  • Prompt Variance (Signal): How much the instructions actually change the outcome.

On hard math problems (like the AIME competition), the "noise" (randomness) is so loud that it drowns out the "signal" (the quality of the instructions). When you add more math problems to the training set, you add more noise, making it even harder to hear the signal.
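
To make the two variances concrete, here is a minimal Python sketch for a single math problem. Everything here is invented for illustration: the `score` function merely simulates an LLM call with a made-up success rate, and this is not the paper's code.

```python
import random
import statistics

def score(system_prompt: str, problem: str) -> int:
    # Stand-in for one LLM call: 1 if the model solves the problem,
    # 0 otherwise. We simulate a model whose success rate depends on
    # which instructions it was given. (Hypothetical numbers.)
    success_rate = {"instructions A": 0.7, "instructions B": 0.3}
    return 1 if random.random() < success_rate[system_prompt] else 0

system_prompts = ["instructions A", "instructions B"]
problem = "hard math problem"  # one fixed user prompt
n_samples = 100                # repeated runs per system prompt

# Mean score of each set of instructions on this one problem.
means = [
    statistics.mean(score(p, problem) for _ in range(n_samples))
    for p in system_prompts
]

# Response variance (noise): disagreement between repeated runs of
# the SAME instructions. For a 0/1 score with mean m it is m*(1-m),
# averaged here over the candidate instructions.
response_variance = statistics.mean(m * (1 - m) for m in means)

# Prompt variance (signal): how much the mean score moves when you
# swap the instructions.
prompt_variance = statistics.pvariance(means)

print(f"signal (prompt variance):  {prompt_variance:.3f}")
print(f"noise (response variance): {response_variance:.3f}")
```

When the problem is "sensitive," the signal term dominates; on a noisy, instruction-insensitive problem the noise term swamps it.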

The Solution: The "p1" Filter (The Spotlight Method)

Instead of trying to teach the AI with the whole set of 30 math problems, the researchers proposed a method called p1.

The Analogy:
Imagine you are a coach trying to pick the best running shoes. Instead of testing the shoes on 30 different athletes doing 30 different sports, you pick just two athletes who are extremely sensitive to shoe quality.

  • Athlete A runs fast in Shoe X but trips in Shoe Y.
  • Athlete B runs fast in Shoe Y but trips in Shoe X.

By focusing only on these two "sensitive" athletes, you can clearly see which shoe is better. You ignore the 28 other athletes who don't care much about the shoes because their results are just random noise.

How p1 works:

  1. Test the Waters: The method tries a bunch of different candidate instructions on the available math problems.
  2. Find the "Sensitive" Problems: It looks for the specific math problems where changing the instructions causes a huge difference in the score (one instruction gets 100%, another gets 0%). These are the problems where the instructions matter.
  3. Filter: It throws away the "boring" problems where the instructions don't seem to change anything.
  4. Train: It optimizes the instructions using only those few, high-sensitivity problems (a code sketch follows this list).
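
Here is a minimal sketch of the filtering step, assuming you have already scored every candidate instruction on every problem. The score matrix, the `top_k` parameter, and all toy numbers are illustrative assumptions, not the paper's exact selection rule.

```python
import numpy as np

# scores[i, j] = mean accuracy of system prompt i on math problem j,
# averaged over several sampled responses. (Toy numbers only.)
scores = np.array([
    [0.9, 0.5, 0.1, 0.5],   # system prompt 0
    [0.1, 0.5, 0.9, 0.5],   # system prompt 1
    [0.8, 0.5, 0.2, 0.5],   # system prompt 2
])

# Step 2: for each problem, measure how much the score varies when
# the system prompt changes. High variance = a "sensitive" problem
# where the instructions clearly matter.
sensitivity = scores.var(axis=0)

# Steps 3-4: keep only the top-k most sensitive problems and run the
# prompt optimizer on that small subset.
top_k = 2
selected = np.argsort(sensitivity)[::-1][:top_k]
print("per-problem sensitivity:", np.round(sensitivity, 3))
print("problems kept for optimization:", sorted(selected.tolist()))
```

In this toy matrix, problems 0 and 2 are the ones where swapping instructions flips the outcome, so they are the two the filter keeps; problems 1 and 3 score 0.5 no matter what and are discarded as noise.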

The Results: Less is More

The results were surprising:

  • The Old Way (Full Dataset): Optimizing the instructions on all 30 math problems produced almost no improvement; the signal was buried in noise.
  • The p1 Way (Filtered Dataset): Optimizing on just two carefully selected math problems produced a set of instructions that was much better than the old way's.
  • Generalization: Even better, the instructions learned from just two problems also worked well on other math competitions the model had never seen before.

The Takeaway

When you are trying to teach an AI a complex, messy skill (like advanced math), don't throw everything at it.

Sometimes, the best way to learn is to find the specific, tricky examples where the difference between "good" and "bad" is most obvious, and focus your energy there. It's like trying to tune a radio: if you turn the volume up on a station full of static (noise), you can't hear the music. But if you tune to a clear frequency (a sensitive problem), the music comes through crystal clear.

In short: To get better AI instructions, stop trying to please everyone. Find the few problems where the instructions make the biggest difference, and focus on those.
