Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

This paper demonstrates that large language models suffer significant prospective memory failures when asked to follow formatting constraints under concurrent task load, with terminal constraints the most fragile. Salience-enhanced prompting largely restores compliance, but the constraints themselves can also severely degrade accuracy on the main task.

Avni Mittal

Published 2026-03-26

Imagine you are a highly skilled chef (the AI) who has been asked to cook a complex, multi-course meal (solving a math problem or writing a summary). But before you start chopping, the customer hands you a sticky note with a very specific rule: "When you put the final dish on the table, you must write 'Bon Appétit!' in all capital letters."

If the meal is simple, like a grilled cheese sandwich, you'll probably remember the note. But if the meal is a complex 10-course banquet requiring intense focus, you might get so lost in the cooking that you forget the sticky note entirely. You serve a perfect meal, but you forget to write the sign-off.

This paper, "Did You Forget What I Asked?", investigates exactly this phenomenon in Large Language Models (LLMs). The researchers call it "Prospective Memory Failure."

Here is the breakdown of their findings using simple analogies:

1. The "Cognitive Load" Problem

The Analogy: Imagine you are juggling.

  • Task A: Keep the formatting rules in your head (the sticky note).
  • Task B: Solve a difficult math problem or summarize a long story (the juggling).

The researchers found that when the "juggling" gets too hard (like solving complex math), the model drops the "sticky note." The harder the task, the more likely the AI is to forget the formatting rules.

  • The Result: When asked to do a hard task and follow a rule at the same time, the AI's compliance dropped by 2 to 21 percentage points compared with following the rule alone. For some models and hard tasks, the drop reached 50 points.

2. The "End of the Line" Trap

The Analogy: Think of a train journey.

  • Continuous Rules: "Don't wear a hat during the whole trip." (Easy to remember because you check it every time you get on a train car).
  • Terminal Rules: "When the train stops at the final station, wave a red flag." (Hard to remember because you have to wait until the very end, after hours of travel).

The study found that Terminal Constraints (rules that must be done at the very end, like "end your sentence with a specific phrase") are the most likely to be forgotten.

  • Why? By the time the AI finishes writing hundreds of words of content, the instruction to "end with X" has faded from its "working memory."
  • The Exception: Avoidance Rules (like "don't use commas") are very hard to forget because the AI has to check for them every single time it types a word.
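The difference between these rule types comes down to *when* a violation can even be detected. A minimal sketch below makes this concrete; the checker functions and the specific required ending are illustrative stand-ins, not code or constraints from the paper:

```python
def violates_avoidance(text: str) -> bool:
    """Avoidance rule ('don't use commas'): the first comma anywhere
    is a violation, so the rule is implicitly re-checked on every token."""
    return "," in text

def violates_continuous(text: str) -> bool:
    """Continuous rule ('use all caps'): must hold for every letter."""
    return any(c.islower() for c in text)

def violates_terminal(text: str, required_ending: str = "THE END") -> bool:
    """Terminal rule ('end with X'): only the final characters matter,
    so the check cannot fire until generation is complete."""
    return not text.strip().endswith(required_ending)

output = "A SHORT ANSWER WITHOUT COMMAS. THE END"
print(violates_avoidance(output))   # False
print(violates_continuous(output))  # False
print(violates_terminal(output))    # False
```

Because the terminal check only becomes relevant after hundreds of tokens have been generated, there is no intermediate moment at which the model is forced to notice a slip, which matches the paper's finding that these rules are forgotten most often.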

3. The "Highlighter" Solution

The Analogy: Imagine you are studying for a test.

  • Natural Method: The rule is buried in a paragraph of text.
  • Salience Method: You use a bright yellow highlighter and write the rule in big, bold letters at the top, then write "DON'T FORGET THIS!" at the bottom.

The researchers discovered a simple fix: make the instruction salient, so the model cannot miss it.

  • They added a clear header like "IMPORTANT FORMATTING INSTRUCTION:" and a reminder at the end like "Remember to follow ALL instructions above."
  • The Result: This simple trick recovered the AI's performance, bringing compliance back up to 90–100%, even on the hardest tasks. It's like giving the AI a second pair of eyes right before it finishes.
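The article quotes the header and reminder text directly, so the technique can be sketched as a small prompt wrapper. The overall layout (constraint first, task in the middle, reminder last) and the function name are my assumptions; the header and footer strings come from the article:

```python
def make_salient(task: str, constraint: str) -> str:
    """Wrap a formatting constraint with a prominent header and a
    trailing reminder, per the salience fix described in the article.
    The exact ordering of sections is an illustrative choice."""
    return (
        "IMPORTANT FORMATTING INSTRUCTION:\n"
        f"{constraint}\n\n"
        f"{task}\n\n"
        "Remember to follow ALL instructions above."
    )

prompt = make_salient(
    task="Solve: what is 17 * 24? Show your reasoning.",
    constraint="End your final answer with the phrase 'DONE'.",
)
print(prompt)
```

Placing the reminder at the very end is the key move: it sits closest to where generation begins, right when the model is about to "walk away from the sticky note."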

4. The "Two-Way Traffic" Jam

The Analogy: The traffic runs both ways. It's not just the formatting that suffers; the cooking suffers too. If you force the chef to focus too hard on the "Bon Appétit!" sign, they might burn the steak.

  • The Finding: Adding formatting rules actually made the AI worse at the main task.
  • Example: One model's math accuracy dropped from 93% to 27% just because it was trying to follow a formatting rule at the same time. The AI was so busy trying to remember the rule that it messed up the math.

5. The "Stacking" Disaster

The Analogy: Asking the chef to do five things at once.

  • "Use all caps."
  • "No commas."
  • "End with a poem."
  • "Use exactly 3 bullet points."
  • "Summarize this 50-page book."

When you stack multiple rules on top of a hard task, the AI's performance collapses.

  • The Result: With 5 rules and a hard task, one model's ability to follow all rules dropped below 50%. The "highlighter" trick (the reminder) stopped working as well when there were too many rules to remember.

The Big Takeaway

AI models aren't "forgetting" because they are stupid or because the text disappeared from the screen. They are forgetting because their "attention" is being pulled in two directions at once.

What should we do?

  1. Don't bury the lead: If you want an AI to follow a rule, put it in a big, bold box and remind the model again at the end.
  2. Watch out for the end: Rules that need to happen at the very end of the response are the most fragile.
  3. One thing at a time: If you need the AI to do a hard math problem, don't ask for 5 different formatting tricks at the same time. It will likely fail at both.

In short: AI is like a brilliant but easily distracted student. If you want them to remember the rules, you have to shout them out clearly right before they finish the test.
