Imagine you have a very smart, helpful robot assistant that remembers everything about you. It knows you love telling jokes, using emojis, and signing off as "The Joker." It remembers this so it can be your perfect, personalized buddy.
Now, imagine this robot is also your agent. Sometimes, it needs to talk to your friends (casual chat), but other times, it needs to talk to the IRS, a judge, or a bank loan officer (serious business).
The Problem
The paper "BenchPreS" asks a simple but tricky question: Can this robot know when to be "you" and when to be "professional"?
Right now, most AI models are like a toddler who just learned a new word. Once they learn "I love emojis," they put emojis in everything—even when writing a letter to the tax man. They don't understand that "being funny" is great for a birthday card but terrible for a legal dispute. They treat your preferences like a global "On" switch that never turns off.
The "BenchPreS" Test
The researchers created a test called BenchPreS to see if AI can figure this out. They gave the AI a "User Profile" (your preferences) and a "Task" (like writing to the IRS).
They looked for two things (a toy scoring sketch follows this list):
- The "Oops" Rate (Misapplication Rate): How often did the AI use your preferences when it shouldn't have? (e.g., Calling the IRS agent "Buddy" and using a clown emoji).
- The "Good Job" Rate (Appropriate Application Rate): How often did the AI use your preferences when it should have? (e.g., Using your preferred bold text in a casual email to a friend).
What They Found
They tested the smartest AI models available (such as GPT-5, Claude, and Gemini) and found some surprising results:
- The "Over-Enthusiastic" AI: The smartest models were actually the worst at this. Because they are so good at following instructions, they thought, "The user said 'be funny,' so I will be funny everywhere!" They got the "Good Job" rate high, but their "Oops" rate was also huge. They couldn't tell the difference between a party and a courtroom.
- The "Shy" AI: Some smaller models were better at not making mistakes, but only because they barely used your preferences at all. They were too scared to be "you."
- The "Thinking" Trap: The researchers tried turning on "Reasoning Mode" (making the AI think before it speaks). They hoped this would help the AI pause and say, "Wait, is this a joke?" Instead, the AI just thought harder about how to be funny, making the problem worse.
- The "Please Don't" Prompt: They tried telling the AI, "Only be funny if it's appropriate." This helped a little, but the AI still slipped up often. It's like telling a toddler, "Don't run in the house," and they still run because they don't truly understand the why.
The Big Picture
The main takeaway is that current AI treats your preferences like hard-coded rules (e.g., "Always use emojis") rather than context clues (e.g., "Use emojis when the vibe is right").
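In code terms, the gap looks something like the sketch below. Nothing here comes from the paper; the Task structure and both functions are purely illustrative.

```python
from dataclasses import dataclass

# Illustrative contrast only; none of these names come from the paper.

@dataclass
class Task:
    audience: str
    body: str

def write(task: Task, emojis: bool) -> str:
    return task.body + (" 😄" if emojis else "")

# Current models: preference as a global "On" switch.
def current_behavior(task: Task) -> str:
    return write(task, emojis=True)  # user likes emojis -> always on

# What we need: preference gated by context.
def desired_behavior(task: Task) -> str:
    casual = task.audience in {"friend", "family"}
    return write(task, emojis=casual)

letter = Task("irs", "Regarding my 2023 return...")
print(current_behavior(letter))  # emoji to the IRS
print(desired_behavior(letter))  # plain and professional
```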
The Analogy
Imagine you hire a personal stylist.
- Current AI: This stylist puts a neon clown nose on you every single time you leave the house, whether you are going to a wedding, a funeral, or a job interview. They think, "You said you like clown noses! I must follow the rule!"
- What We Need: A stylist who knows that a clown nose is perfect for a birthday party but disastrous for a job interview. They need to understand the situation, not just the instruction.
Why This Matters
As we start using AI to write emails, file taxes, and talk to government agencies, we need them to be smart enough to know the difference between "casual me" and "professional me." If they can't learn this, they might accidentally send a joke-filled letter to a judge, causing real trouble for the user.
The paper concludes that we need to teach AI not just how to follow your preferences, but when to hold back. It's about teaching the robot social intelligence, not just memory.