Influencing Humans to Conform to Preference Models for RLHF

This paper demonstrates that the quality of human preference data for Reinforcement Learning from Human Feedback (RLHF) can be significantly improved by designing targeted interventions that align how humans express preferences with the preference model the learning algorithm assumes, without altering the humans' underlying reward functions. The interventions include visualizing the model's underlying quantities, training users on the model, and modifying the elicitation question.

Original authors: Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot how to drive a car. You can't just tell the robot, "Drive safely and efficiently," because the robot doesn't understand those abstract concepts. Instead, you have to show it two different driving routes and ask, "Which one is better?"

This is the core of RLHF (Reinforcement Learning from Human Feedback). The robot learns by watching humans choose between options.

However, there's a hidden problem. The robot has a specific "mathematical recipe" (a Preference Model) for how it thinks humans make choices.

  • Recipe A (Partial Return): "Humans pick the path that racks up the most points during the clip they're shown."
  • Recipe B (Regret): "Humans pick the path that shows the smartest decisions, even if the points are lower."

The problem is that humans are messy. Sometimes we pick the path with the most points, and sometimes we pick the smartest path. If the robot is using Recipe A, but you (the human) are thinking like Recipe B, the robot learns the wrong thing. It's like trying to bake a cake using a recipe for bread; the ingredients (your feedback) are real, but the instructions (the robot's model) don't match how you actually use them.
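To make the two recipes concrete, here is a minimal Python sketch of the logistic ("Boltzmann-style") preference models commonly used in this line of work. It assumes undiscounted segment rewards and privileged access to the optimal state values V* (which a real deployment would not have); the function names and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def partial_return(rewards):
    """Recipe A's quantity: the total points a segment racks up."""
    return float(np.sum(rewards))

def regret(rewards, v_start, v_end):
    """Recipe B's quantity: how many points the segment leaves on the table
    compared with acting optimally from its start state (needs V*)."""
    return v_start - (partial_return(rewards) + v_end)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_prefer_a_by_points(rewards_a, rewards_b):
    """Recipe A: probability the human picks segment A, driven by raw points."""
    return sigmoid(partial_return(rewards_a) - partial_return(rewards_b))

def p_prefer_a_by_regret(rewards_a, rewards_b, values_a, values_b):
    """Recipe B: probability the human picks segment A, driven by how close
    each segment is to optimal play (lower regret is better)."""
    reg_a = regret(rewards_a, *values_a)
    reg_b = regret(rewards_b, *values_b)
    return sigmoid(reg_b - reg_a)

# Segment A earns more points, but segment B starts in a worse spot and plays
# it nearly optimally -- so the two recipes disagree about which one "wins".
seg_a, seg_b = [5.0, 5.0], [1.0, 1.0]
print(p_prefer_a_by_points(seg_a, seg_b))                           # > 0.5: A wins on points
print(p_prefer_a_by_regret(seg_a, seg_b, (20.0, 8.0), (2.5, 0.0)))  # < 0.5: B wins on smart play
```

If the human answers like Recipe B while the learner assumes Recipe A (or vice versa), the learner systematically misreads the feedback, which is exactly the mismatch the paper targets.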

The Big Idea of This Paper:
Instead of trying to change the robot's recipe to match our messy human brains, the authors asked: "Can we train humans to speak the robot's language?"

They wanted to see if they could influence humans to give feedback that perfectly matches the robot's specific mathematical recipe, without changing what the humans actually want (their underlying values).

The Three "Tricks" They Tried

The researchers tested three ways to get humans to "speak" the robot's language better. Think of these as different ways to coach a player before a game.

1. The "Privileged" Cheat Sheet

  • The Analogy: Imagine you are playing a video game, and before you choose a path, the game tells you, "Path A gives you 50 points. Path B gives you 40 points."
  • What they did: They showed humans the exact math the robot uses (the "points" or "regret" scores) while they were making their choice.
  • The Result: Humans immediately started picking the path the robot wanted. It proved that if you give humans the right information, they can perfectly align with the robot's model.
  • The Catch: In the real world, we can't show humans the "secret math" because we don't know the answer yet! So this was just a proof of concept (a sketch of what those displayed numbers could look like follows this list).
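To picture what that "cheat sheet" could look like in code, here is a hypothetical helper that formats the model's own quantity into a label shown beside each clip. The function name, wording, and numbers are invented for illustration; the paper's actual interface may differ.

```python
def privileged_label(rewards, v_start, v_end, target_model):
    """Text an interface might display next to a clip, exposing the exact
    quantity the learning algorithm assumes drives the human's choice."""
    points = sum(rewards)                          # partial return of the clip
    if target_model == "partial_return":
        return f"Points earned in this clip: {points:+.1f}"
    if target_model == "regret":
        points_lost = v_start - (points + v_end)   # deterministic regret
        return f"Points lost to mistakes in this clip: {points_lost:.1f}"
    raise ValueError(f"unknown preference model: {target_model}")

# Only possible here because the true rewards and optimal values are known;
# in a real RLHF deployment they are exactly what we are trying to learn.
print(privileged_label([5.0, 5.0], v_start=20.0, v_end=8.0, target_model="regret"))
```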

2. The "Training Camp"

  • The Analogy: Instead of giving a cheat sheet during the game, you hold a workshop before the game. You teach the players, "Hey, in this game, the goal isn't just points; it's about making the smartest moves. Here is how you calculate a 'smart move'."
  • What they did: They taught humans how to calculate the specific metric the robot cared about (like "Regret" or "Total Score") and practiced with them.
  • The Result:
    • When humans were trained on the Total Score (Partial Return), they became excellent at giving that kind of feedback, even in new, different game maps.
    • When humans were trained on Regret (smart moves), it was too hard. They got tired and confused, especially when switching to a new game map. It was like asking someone to do advanced calculus while running a marathon.
  • Lesson: Training works, but the concept must be simple enough for humans to grasp without burning out. (One way to check whether training "worked" is sketched below.)
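How would you tell whether the training camp worked? One generic way (not necessarily the paper's exact evaluation protocol) is to ask which recipe assigns higher probability to the choices a person actually made. Below is a sketch with toy numbers; the helper name and data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_log_likelihood(choices, score_diffs):
    """Average log-likelihood of a person's choices under one preference model.
    choices[i] is 1 if they picked segment A on trial i, else 0; score_diffs[i]
    is that model's score for A minus its score for B on the same trial
    (partial-return difference for Recipe A, negated regret difference for B)."""
    total = 0.0
    for chose_a, diff in zip(choices, score_diffs):
        p_a = sigmoid(diff)
        total += math.log(p_a if chose_a else 1.0 - p_a)
    return total / len(choices)

# Toy numbers: the same four choices scored by both recipes.
choices = [1, 1, 0, 1]
points_diffs = [8.0, 3.0, -2.0, 1.0]    # Recipe A's score differences
regret_diffs = [1.5, -0.5, -2.0, 0.5]   # Recipe B's (negated-regret) differences
print(mean_log_likelihood(choices, points_diffs))  # higher value = better fit
print(mean_log_likelihood(choices, regret_diffs))
```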

3. The "Magic Question"

  • The Analogy: You don't teach the player anything new. You just change the question on the screen. Instead of asking, "Which path do you like?", you ask, "Which path had the best immediate results?" or "Which path shows better decision-making?"
  • What they did: They simply tweaked the wording of the question asked during the feedback process.
  • The Result: Surprisingly, just changing the words had an effect!
    • Asking about "immediate outcomes" nudged people toward the "Total Score" model.
    • Asking about "decision-making" nudged people toward the "Regret" model.
  • Lesson: The way you ask a question shapes the answer. A tiny change in the interface can steer human thinking in a specific direction (the sketch after this list shows what such a tweak might look like in code).
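As a rough sketch of what "just changing the words" might look like on the engineering side, here is a hypothetical mapping from the target preference model to the question an interface would show. The wordings are paraphrases of the idea above, not the paper's exact prompts.

```python
# Hypothetical question wordings keyed by the preference model the learner assumes.
ELICITATION_QUESTIONS = {
    "neutral":        "Which path do you prefer?",
    "partial_return": "Which path had the better immediate outcomes?",
    "regret":         "Which path shows better decision-making?",
}

def render_query(target_model: str) -> str:
    """Pick the wording intended to nudge annotators toward the model the
    learning algorithm assumes; fall back to a neutral phrasing."""
    return ELICITATION_QUESTIONS.get(target_model, ELICITATION_QUESTIONS["neutral"])

print(render_query("regret"))
```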

Why Does This Matter?

Think of RLHF as a conversation between a human and a robot.

  • Old Way: We assume the robot understands us perfectly. If the robot learns the wrong thing, we blame the robot's math.
  • New Way (This Paper): We realize the robot and human are speaking different dialects. Instead of just fixing the robot, we can design better interfaces and training to help humans speak the robot's dialect.

The Takeaway:
If you want an AI to learn exactly what you want, don't just hope it guesses right. Design the interface and the training so that your natural feedback lines up perfectly with the AI's learning style.

It's like tuning a radio. You can't change the station (the human's true preference), but you can tune the dial (the interface and training) so the signal comes in crystal clear, without static. This paper shows us exactly how to tune that dial.
