Imagine you have a digital assistant, like a super-smart butler named "Alex." You've been chatting with Alex for months. You've told him you hate cilantro, you love jazz music, and you get anxious in crowded elevators. You've mentioned these things in passing, sometimes directly, sometimes by complaining about a salad, or by sighing when a jazz band plays too loudly.
Now, imagine you ask Alex: "What should I do this weekend?"
A truly personalized Alex wouldn't just suggest a generic jazz club or a crowded concert. He would remember your specific history, your subtle hints, and your long-term habits to suggest a quiet jazz listening session at a small, uncrowded venue.
The Problem:
Current AI assistants are great at following simple, one-time instructions ("Write a poem about a cat"). But when it comes to being a personal assistant over a long period, they often forget things, miss subtle hints, or get confused when the conversation gets very long. They act more like a robot that just read a manual than a friend who actually knows you.
The Solution (RealPref):
The authors of this paper built a large benchmark called RealPref to see how good these AI "butlers" really are at remembering and respecting your personal quirks over time.
Here is how they did it, using some fun analogies:
1. The "Memory Palace" Dataset
Instead of just asking the AI simple questions, they created 100 fake people with detailed lives.
- The Backstory: Each fake person has a biography, a job, and a history of life events (like graduating or starting a business).
- The Preferences: Between them, these people have 1,300 specific likes and dislikes. Some are obvious ("I love pizza"), but many are hidden.
- The Conversation: They generated thousands of chat sessions. Sometimes the person says, "I hate spicy food" (Direct). Other times, they say, "Ugh, this curry is too hot, I'm sweating!" (Implicit). Sometimes, they only reveal a preference after talking about it for three different weeks (Long-Horizon). (See the sketch just after this list.)
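To make the dataset concrete, here is a minimal Python sketch of how one of these simulated people might be represented. The class and field names are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative schema only: these names are assumptions, not the paper's format.

@dataclass
class Preference:
    topic: str      # e.g. "food"
    statement: str  # ground-truth preference, e.g. "dislikes spicy food"
    style: str      # how it surfaces in chat: "direct", "implicit", or "long_horizon"

@dataclass
class Persona:
    name: str
    biography: str  # backstory: job, life events, habits
    preferences: list[Preference] = field(default_factory=list)

user = Persona(
    name="Jordan",
    biography="Graduate student who recently started a small catering business.",
    preferences=[
        Preference("food", "dislikes spicy food", style="implicit"),
        Preference("music", "loves jazz", style="direct"),
    ],
)
```

Roughly speaking, each generated chat session then leaks these ground-truth preferences in the labeled style, which is what makes grading the AI's answers possible later.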
2. The "Detective" Test
To test the AI, they didn't just ask, "Do you remember I hate spicy food?" That's too easy. Instead, they set up a mystery.
- The Scenario: The user asks, "I'm going to a new restaurant; what should I order?"
- The Trap: The AI has to look back through hundreds of pages of previous chat logs to find the clue that the user hates spicy food.
- The Challenge: The clues might be buried in a conversation from three weeks ago, under 50 pages of talk about the weather, or wrapped inside a metaphor ("I prefer my life as calm as a lake, not a boiling pot"). (A toy version of this setup is sketched below.)
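Here is what that buried-clue construction might look like in code. The `ask_model` helper is a hypothetical placeholder for whatever LLM API you would actually call; the point is how the clue gets hidden:

```python
def build_haystack(clue: str, filler_turns: list[str], position: float = 0.2) -> str:
    """Bury one preference clue inside a long run of unrelated chat turns.

    `position` controls how deep the clue sits (0.0 = oldest message).
    """
    turns = filler_turns[:]
    turns.insert(int(len(turns) * position), clue)
    return "\n".join(f"User: {t}" for t in turns)

# Hypothetical filler: pages of small talk about the weather.
filler = [f"The weather was odd on day {i}." for i in range(500)]
history = build_haystack("Ugh, this curry is too hot, I'm sweating!", filler)

prompt = history + "\nUser: I'm going to a new restaurant; what should I order?\nAssistant:"

# `ask_model` is a stand-in for a real LLM call, and the string check is a
# crude proxy for "did the answer respect the buried preference?"
# answer = ask_model(prompt)
# passed = "mild" in answer.lower()
```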
3. The Three Types of "Secrets"
The paper tested how well AI handles different ways people hide their preferences:
- The Shout: "I hate spicy food!" (Direct: easy for the AI.)
- The Whisper: "I usually skip the spicy options on the menu." (Implicit: medium difficulty.)
- The Riddle: "I prefer things that don't make my eyes water." (Also implicit, and hard for the AI.)
- The Slow Reveal: Over three months, the user complains about hot sauce, then mentions a bad experience with chili, then finally says they only eat mild food. The AI has to connect the dots over time. (Long-Horizon; see the sketch after this list.)
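That Slow Reveal case is worth spelling out: no single message settles the question, but together they do. Below is a toy illustration of the "connect the dots" idea; the evidence weights and threshold are invented for this example and are not the paper's method:

```python
# Each signal: (when it appeared, what was said, rough evidence weight).
signals = [
    ("week 1", "complained that the hot sauce ruined dinner", 0.3),
    ("week 5", "mentioned a bad experience with chili", 0.3),
    ("week 9", "said they only ever order the mild option", 0.6),
]

CONFIDENCE_THRESHOLD = 0.8  # arbitrary cutoff for this toy example

# Treat each signal as independent evidence for "dislikes spicy food":
# belief = 1 - product of (1 - weight), so weak hints compound over time.
doubt = 1.0
for _when, _what, weight in signals:
    doubt *= 1.0 - weight
belief = 1.0 - doubt

print(f"Belief that the user dislikes spicy food: {belief:.2f}")  # 0.80
if belief >= CONFIDENCE_THRESHOLD:
    print("Dots connected: avoid recommending spicy dishes.")
```

No single hint clears the bar here; only the accumulated history does, which is exactly what the long-horizon cases are probing.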
4. What They Found (The Results)
The results were a bit of a wake-up call for the AI industry:
- The "Needle in a Haystack" Problem: As the conversation got longer (more "haystack"), the AI got worse at finding the "needle" (the preference). Even the smartest models started forgetting things when the chat history got too long.
- The "Subtlety" Gap: When people were direct, AI did okay. But when people were subtle or implied their preferences, the AI often failed completely. It's like a waiter who only understands "I want water" but gets confused if you say, "I'm feeling a bit parched."
- The "Stranger" Problem: If the AI learned you liked jazz, could it guess you'd also like a jazz-themed movie? Often, no. It struggled to apply what it learned about one thing to a totally new situation.
- The "Magic Trick" Fix: They tried helping the AI by giving it a "cheat sheet" (a reminder) or by letting it search its own memory (Retrieval-Augmented Generation). This helped, but it wasn't a perfect fix.
Why This Matters
Think of this paper as a driver's license test for AI personalization.
Right now, most AI assistants are like learner drivers who can drive in an empty parking lot (short, simple chats) but panic when you put them in rush-hour traffic with complex rules (long, nuanced, real-life conversations).
RealPref is the test that proves they aren't ready for the highway yet. It shows us that to build an AI that truly feels like a helpful, personal friend, we need to teach it to listen better, remember longer, and understand the "unsaid" things we say.
In short: We want AI that doesn't just answer questions, but knows us. This paper is the first big step in measuring how far we are from that dream.