The Big Picture: The "Try Again" Trap
Imagine you are training a student (an AI) to solve math problems.
- Pass@1 is the score you get if the student is allowed one shot to solve a problem. If they get it right, great. If not, they fail.
- Pass@k is the score you get if the student is allowed k tries (say, k = 5) and you only care whether at least one of those attempts is correct.
In the real world, we often use Pass@k to train AI because it feels like a "safer" metric. If the AI can solve a hard problem on its 3rd try, we count it as a success. Researchers have been tweaking AI training to maximize this "Pass@k" score, hoping the AI gets better at solving hard problems.
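The two metrics can be sketched numerically. Here is a minimal illustration, assuming each attempt succeeds independently with a fixed probability p (the function name and numbers are ours, not from the paper):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

# A "hard" problem the model solves 10% of the time per attempt:
single_shot = pass_at_k(0.10, 1)   # Pass@1  = 0.10
five_shots = pass_at_k(0.10, 5)    # Pass@5  ≈ 0.41
```

Notice how generous Pass@k is: a model that fails 90% of the time on a single try still "passes" about 41% of such problems when given five attempts.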
The Problem: The paper discovers a scary side effect. When you train the AI specifically to get better at "Pass@k" (trying multiple times), its ability to get the answer right on the first try (Pass@1) actually gets worse.
It's like a student who learns to cheat by writing down 5 different answers on a test sheet. They might get the right answer on the sheet, but if you ask them to solve the problem instantly in their head, they fail.
The Core Analogy: The "Noisy Classroom"
To understand why this happens, imagine a classroom with two types of students:
- The Easy Students: They already know the answers. They get 90% of the questions right immediately.
- The Hard Students: They struggle. They only get 10% of the questions right immediately.
The Teacher's Goal (Pass@1)
If the teacher wants to improve the class average for the first try (Pass@1), they should focus on helping the Easy Students get even better, or gently nudging the Hard Students without messing up the Easy ones. The goal is to make everyone slightly better at their first attempt.
The Teacher's New Goal (Pass@k)
Now, imagine the teacher decides to optimize for Pass@k (getting the right answer at least once within 5 tries).
- The Easy Students are already doing great. They barely need any help to land a "success" within 5 tries, because they are already good.
- The Hard Students are failing almost every time. To land even one "success" within 5 tries, they need massive help.
The "Reweighting" Effect:
The Pass@k training algorithm acts like a teacher who becomes obsessed with the Hard Students. It says, "The Easy students are fine; let's ignore them. Let's pour ALL our energy into the Hard students so they can finally get one right answer out of five."
The algorithm heavily upweights the Hard students (giving them 1,000x more attention) and downweights the Easy students (ignoring them almost completely).
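This reweighting can be made concrete with a small sketch. It is not the paper's algorithm, just the derivative of the smooth Pass@k curve 1 - (1 - p)^k from above, which tells you how much each prompt's success rate p influences the Pass@k objective (assuming independent attempts):

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of 1 - (1 - p)^k with respect to p: k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

k = 5
easy_weight = pass_at_k_weight(0.90, k)  # easy prompt (p = 0.9): tiny weight
hard_weight = pass_at_k_weight(0.10, k)  # hard prompt (p = 0.1): large weight
ratio = hard_weight / easy_weight        # ≈ 6561x more weight on the hard prompt
```

Under this toy model, the hard prompt gets thousands of times more gradient weight than the easy one, which is the "obsessed teacher" in formula form.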
The "Interference" (The Crash)
Here is the twist: The math problems the Hard students are struggling with are confusingly similar to the problems the Easy students are good at, but with a slight twist that requires a different solution.
- The Conflict: When the teacher tries to teach the Hard students a new trick to solve their specific hard problems, that trick accidentally breaks the logic the Easy students were using.
- The Result: The Hard students get slightly better within their 5 tries (Pass@k goes up), but the Easy students, who were previously near-perfect, now get confused by the new teaching method and start failing their first try.
Because the teacher was so obsessed with the Hard students (due to the Pass@k weighting), the overall class average for the first try (Pass@1) drops, even though the "5-attempt" score went up.
The Technical "Secret Sauce" (Simplified)
The paper introduces a concept called Prompt Interference.
- Gradient Conflict: In AI training, "gradients" are like arrows pointing the way to improve.
- The arrow for Pass@1 points in a direction that helps everyone a little bit.
- The arrow for Pass@k points in a direction that helps the "Hard" problems a lot, but hurts the "Easy" ones.
- The Angle: The paper proves that for certain types of problems, these two arrows conflict: the angle between them is obtuse (more than 90 degrees, like 120 degrees), so following one means partially undoing the other.
- The Outcome: If you follow the Pass@k arrow (to get more "5-attempt" successes), you are mathematically forced to move away from the Pass@1 direction. You are literally walking backward on the metric that matters most for real-world speed and cost.
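The obtuse-angle idea can be checked numerically. Below is a toy example with made-up 2-D gradient vectors (the numbers are illustrative, not taken from the paper): when the cosine between the two gradients is negative, any step along the Pass@k arrow has a component pointing against the Pass@1 arrow.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

grad_pass1 = [1.0, 0.2]   # hypothetical: mostly improves the easy prompts
grad_passk = [-0.6, 1.0]  # hypothetical: mostly improves the hard prompts

c = cosine(grad_pass1, grad_passk)          # negative -> conflict
angle = math.degrees(math.acos(c))          # obtuse: more than 90 degrees
```

A negative cosine is exactly the "walking backward" situation: progress on Pass@k is partially paid for with regress on Pass@1.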
Why Should We Care?
In the real world, we can't always wait for an AI to try 5 times.
- Latency: Waiting for 5 tries takes too long for a chatbot.
- Cost: Generating 5 answers costs 5x more money.
- Reliability: Sometimes, you only get one shot (e.g., a medical diagnosis or a self-driving car decision).
If we train AI only to be good at "Pass@k," we might end up with a model that is worse at being reliable on the first try, which is exactly what we need for safe, fast, and cheap AI.
The Takeaway
The paper warns us: Don't just optimize for "eventual success" (Pass@k) without checking if you are breaking "immediate success" (Pass@1).
The AI training process is like a seesaw. If you push down too hard on the "Hard Problems" side to get them to succeed eventually, you might accidentally launch the "Easy Problems" side into the air, causing the whole system to become less stable for single-shot tasks. The authors suggest we need new ways to train AI that balance these two goals so we don't lose the ability to get the right answer the first time.