Aligning Language Models from User Interactions

This paper proposes a scalable self-distillation method that aligns and personalizes language models by leveraging their inherent ability to revise responses based on user follow-ups. It demonstrates that raw multi-turn interaction data from real-world deployments can improve instruction-following and adaptation without explicit feedback labels and without regressing on existing capabilities.

Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, Andreas Krause

Published 2026-03-16

The Big Idea: Learning from the "Oops" and "Try Again" Moments

Imagine you are a chef cooking a meal for a customer.

  • The Old Way: You cook the meal, serve it, and then the customer leaves. You never know if they liked it, hated it, or found it too salty, unless they send a formal survey later. Most of the time, you just throw the data away and cook the next meal the exact same way.
  • The New Way (This Paper): The customer takes one bite, frowns, and says, "It's a bit too salty, could you fix it?" or "I actually wanted it spicy, not sweet."
    • The Magic: Instead of waiting for a survey, the chef immediately thinks, "Oh, if I had known they wanted it spicy, I would have added chili right now."
    • The Paper's Method: This paper teaches AI models to do exactly that. It uses the customer's immediate reaction (the follow-up message) to teach the AI how it should have acted in the first place.

The Problem: We Have Too Much Data, But We Ignore It

Right now, AI models talk to millions of people every day. These conversations are goldmines of information.

  • If a user asks for code, gets an error, and says, "This doesn't work," that is a learning signal.
  • If a user asks for a poem, gets a sad one, and says, "Make it happier," that is a learning signal.

But currently, AI companies mostly throw this data in the trash. They only use carefully curated datasets where humans have explicitly written labels like "Good" or "Bad." The paper argues: Why throw away the messy, real-world conversations when they contain the most honest feedback?

The Solution: "Self-Distillation" (The Time-Travel Trick)

The authors call their method SDPO (Self-Distillation Policy Optimization). It sounds complicated, but here is the simple analogy:

The "Hindsight" Time Machine
Imagine the AI is a student taking a test.

  1. The Original Answer: The student writes an answer on the paper.
  2. The Teacher's Note: The teacher (the user) writes a note on the back saying, "You missed the point here; try again."
  3. The Time Travel: The student goes back in time, reads the teacher's note before writing the answer, and re-writes the test.
  4. The Comparison: The student compares their Original Answer with their Re-written Answer (with the note).
    • Where did the re-written answer change? That's where the student made a mistake.
    • What stayed the same? That's what they got right.

The "Self-Distillation" Part:
The AI doesn't need a human teacher to grade it. It acts as its own teacher. It asks itself: "If I had known what the user was going to say next, how would I have answered?" It then compares that "perfect hindsight answer" to its "original answer" and updates its brain to match the "perfect" one.

It's like looking in a mirror after you've spilled coffee on your shirt. You see the stain, realize you should have been more careful, and promise to be more careful next time.
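The time-travel trick above can be sketched in a few lines of Python. This is my own toy simplification, not the paper's actual SDPO objective: `generate` is a hypothetical stand-in for sampling from a language model, and the output is a simple chosen/rejected pair rather than the paper's distillation loss.

```python
# Toy sketch of the hindsight self-distillation loop (an illustrative
# simplification, not the paper's exact method).

def generate(prompt: str) -> str:
    # Hypothetical stand-in for sampling from the language model.
    if "happier" in prompt:
        return "A cheerful poem about sunshine."
    return "A melancholy poem about rain."

def build_distillation_pair(user_msg: str, follow_up: str) -> dict:
    """Compare the original answer with a hindsight answer that saw the follow-up."""
    original = generate(user_msg)
    # Time-travel step: re-answer the *first* message with the follow-up in context.
    hindsight_prompt = (
        f"User: {user_msg}\n"
        f"(The user will later say: {follow_up!r}. Answer accordingly.)"
    )
    hindsight = generate(hindsight_prompt)
    # The training signal: nudge the model's first-turn answer toward `hindsight`.
    return {"prompt": user_msg, "rejected": original, "chosen": hindsight}

pair = build_distillation_pair("Write me a poem.", "Make it happier.")
print(pair["chosen"])  # the hindsight answer, informed by the follow-up
```

The key point the sketch captures: the "teacher" answer comes from the same model, just with one extra turn of context, so no human labeling is needed.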

What Happened When They Tried It?

The researchers tested this on real conversations from a public dataset called WildChat (which is full of messy, unfiltered human chats).

  1. It Got Smarter: The AI got better at following instructions and being helpful, even though it was only trained on "messy" data without any human labels.
  2. It Didn't Forget: Usually, when you teach an AI something new, it forgets old things (like how to do math). This method improved the AI's personality and helpfulness without making it worse at math or coding.
  3. It Learned Personalities: The AI could adapt to specific users. If User A likes short, funny answers and User B likes long, serious answers, the AI learned to switch between them just by talking to them, without needing a "User Profile" setting.

Why Is This a Big Deal?

  • No More Waiting for Surveys: We don't need to wait for humans to label data. The AI learns in real-time from the conversation itself.
  • Scalable: Since AI is already talking to millions of people, this method allows the AI to learn from all those interactions automatically.
  • Robust: Even if the user is confused, angry, or sends a random message ("What is 2+2?"), the AI is smart enough to realize, "This message doesn't tell me I made a mistake," and it ignores it. It only learns when the feedback is actually useful.
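The robustness point can be pictured as a gate that only admits follow-ups that actually refer back to the previous answer. The keyword heuristic below is purely illustrative (in the paper it is the model itself that implicitly judges whether a follow-up carries feedback, not a hand-written filter):

```python
# Illustrative gate: treat a follow-up as a learning signal only if it
# reads like a correction of the previous answer. Marker list is made up.

FEEDBACK_MARKERS = ("doesn't work", "instead", "actually", "make it", "fix")

def is_corrective(follow_up: str) -> bool:
    msg = follow_up.lower()
    return any(marker in msg for marker in FEEDBACK_MARKERS)

print(is_corrective("This doesn't work"))  # corrective: worth learning from
print(is_corrective("What is 2+2?"))       # topic change: no update
```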

The Catch (Safety)

The paper admits a potential risk: If a user tries to trick the AI into being mean or unsafe, the AI might learn that behavior because it's "adapting to the user." Just like a student might learn bad habits from a bad teacher, the AI could learn bad habits from a manipulative user. The authors suggest we need safety guardrails to make sure the AI doesn't learn to be dangerous just because a user asked it to.

Summary

This paper proposes a way for AI to learn from its own mistakes in real-time. Instead of waiting for a human to grade it, the AI looks at what the user said next, realizes what it should have done differently, and teaches itself to do better next time. It's like giving the AI a superpower: the ability to learn from every single conversation it has, turning "oops" moments into "aha!" moments.
