Influencing Humans to Conform to Preference Models for RLHF

This paper demonstrates that the quality of human preference data for Reinforcement Learning from Human Feedback (RLHF) can be significantly improved by designing targeted interventions that align how humans express preferences with the preference model the learning algorithm assumes, without altering the humans' underlying reward functions. The interventions include visualizing the model's underlying quantities, training users on the model, and modifying the elicitation question.

Original authors: Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot how to drive a car. You can't just tell the robot, "Drive safely and efficiently," because the robot doesn't understand those abstract concepts. Instead, you have to show it two different driving routes and ask, "Which one is better?"

This is the core of RLHF (Reinforcement Learning from Human Feedback). The robot learns by watching humans choose between options.

However, there's a hidden problem. The robot has a specific "mathematical recipe" (a Preference Model) for how it thinks humans make choices.

  • Recipe A (Partial Return): "Humans pick the path that racks up the most points during the clip they're shown."
  • Recipe B (Regret): "Humans pick the path that shows the smartest decisions, even if the points are lower."

The problem is that humans are messy. Sometimes we pick the path with the most points, and sometimes we pick the smartest path. If the robot is using Recipe A, but you (the human) are thinking like Recipe B, the robot learns the wrong thing. It's like trying to bake a cake using a recipe for bread; the ingredients (your feedback) are real, but the instructions (the robot's model) don't match how you actually use them.
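To make the two recipes concrete, here is a minimal Python sketch of the logistic ("Boltzmann-style") preference models commonly used in this line of work. It assumes undiscounted segment rewards and privileged access to the optimal state values V* (which a real deployment would not have); the function names and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def partial_return(rewards):
    """Recipe A's quantity: the total points a segment racks up."""
    return float(np.sum(rewards))

def regret(rewards, v_start, v_end):
    """Recipe B's quantity: how many points the segment leaves on the table
    compared with acting optimally from its start state (needs V*)."""
    return v_start - (partial_return(rewards) + v_end)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_prefer_a_by_points(rewards_a, rewards_b):
    """Recipe A: probability the human picks segment A, driven by raw points."""
    return sigmoid(partial_return(rewards_a) - partial_return(rewards_b))

def p_prefer_a_by_regret(rewards_a, rewards_b, values_a, values_b):
    """Recipe B: probability the human picks segment A, driven by how close
    each segment is to optimal play (lower regret is better)."""
    reg_a = regret(rewards_a, *values_a)
    reg_b = regret(rewards_b, *values_b)
    return sigmoid(reg_b - reg_a)

# Segment A earns more points, but segment B starts in a worse spot and plays
# it nearly optimally -- so the two recipes disagree about which one "wins".
seg_a, seg_b = [5.0, 5.0], [1.0, 1.0]
print(p_prefer_a_by_points(seg_a, seg_b))                           # > 0.5: A wins on points
print(p_prefer_a_by_regret(seg_a, seg_b, (20.0, 8.0), (2.5, 0.0)))  # < 0.5: B wins on smart play
```

If the human answers like Recipe B while the learner assumes Recipe A (or vice versa), the learner systematically misreads the feedback, which is exactly the mismatch the paper targets.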

The Big Idea of This Paper:
Instead of trying to change the robot's recipe to match our messy human brains, the authors asked: "Can we train humans to speak the robot's language?"

They wanted to see if they could influence humans to give feedback that perfectly matches the robot's specific mathematical recipe, without changing what the humans actually want (their underlying values).

The Three "Tricks" They Tried

The researchers tested three ways to get humans to "speak" the robot's language better. Think of these as different ways to coach a player before a game.

1. The "Privileged" Cheat Sheet

  • The Analogy: Imagine you are playing a video game, and before you choose a path, the game tells you, "Path A gives you 50 points. Path B gives you 40 points."
  • What they did: They showed humans the exact math the robot uses (the "points" or "regret" scores) while they were making their choice.
  • The Result: Humans immediately started picking the path the robot wanted. It proved that if you give humans the right information, they can perfectly align with the robot's model.
  • The Catch: In the real world, we can't show humans the "secret math" because we don't know the answer yet! So this was just a proof of concept (a sketch of what those displayed numbers could look like follows this list).
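To picture what that "cheat sheet" could look like in code, here is a hypothetical helper that formats the model's own quantity into a label shown beside each clip. The function name, wording, and numbers are invented for illustration; the paper's actual interface may differ.

```python
def privileged_label(rewards, v_start, v_end, target_model):
    """Text an interface might display next to a clip, exposing the exact
    quantity the learning algorithm assumes drives the human's choice."""
    points = sum(rewards)                          # partial return of the clip
    if target_model == "partial_return":
        return f"Points earned in this clip: {points:+.1f}"
    if target_model == "regret":
        points_lost = v_start - (points + v_end)   # deterministic regret
        return f"Points lost to mistakes in this clip: {points_lost:.1f}"
    raise ValueError(f"unknown preference model: {target_model}")

# Only possible here because the true rewards and optimal values are known;
# in a real RLHF deployment they are exactly what we are trying to learn.
print(privileged_label([5.0, 5.0], v_start=20.0, v_end=8.0, target_model="regret"))
```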

2. The "Training Camp"

  • The Analogy: Instead of giving a cheat sheet during the game, you hold a workshop before the game. You teach the players, "Hey, in this game, the goal isn't just points; it's about making the smartest moves. Here is how you calculate a 'smart move'."
  • What they did: They taught humans how to calculate the specific metric the robot cared about (like "Regret" or "Total Score") and practiced with them.
  • The Result:
    • When humans were trained on the Total Score (Partial Return), they became excellent at giving that kind of feedback, even in new, different game maps.
    • When humans were trained on Regret (smart moves), it was too hard. They got tired and confused, especially when switching to a new game map. It was like asking someone to do advanced calculus while running a marathon.
  • Lesson: Training works, but the concept must be simple enough for humans to grasp without burning out. (One way to check whether training "worked" is sketched below.)
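How would you tell whether the training camp worked? One generic way (not necessarily the paper's exact evaluation protocol) is to ask which recipe assigns higher probability to the choices a person actually made. Below is a sketch with toy numbers; the helper name and data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_log_likelihood(choices, score_diffs):
    """Average log-likelihood of a person's choices under one preference model.
    choices[i] is 1 if they picked segment A on trial i, else 0; score_diffs[i]
    is that model's score for A minus its score for B on the same trial
    (partial-return difference for Recipe A, negated regret difference for B)."""
    total = 0.0
    for chose_a, diff in zip(choices, score_diffs):
        p_a = sigmoid(diff)
        total += math.log(p_a if chose_a else 1.0 - p_a)
    return total / len(choices)

# Toy numbers: the same four choices scored by both recipes.
choices = [1, 1, 0, 1]
points_diffs = [8.0, 3.0, -2.0, 1.0]    # Recipe A's score differences
regret_diffs = [1.5, -0.5, -2.0, 0.5]   # Recipe B's (negated-regret) differences
print(mean_log_likelihood(choices, points_diffs))  # higher value = better fit
print(mean_log_likelihood(choices, regret_diffs))
```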

3. The "Magic Question"

  • The Analogy: You don't teach the player anything new. You just change the question on the screen. Instead of asking, "Which path do you like?", you ask, "Which path had the best immediate results?" or "Which path shows better decision-making?"
  • What they did: They simply tweaked the wording of the question asked during the feedback process.
  • The Result: Surprisingly, just changing the words had an effect!
    • Asking about "immediate outcomes" nudged people toward the "Total Score" model.
    • Asking about "decision-making" nudged people toward the "Regret" model.
  • Lesson: The way you ask a question shapes the answer. A tiny change in the interface can steer human thinking in a specific direction (the sketch after this list shows what such a tweak might look like in code).
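As a rough sketch of what "just changing the words" might look like on the engineering side, here is a hypothetical mapping from the target preference model to the question an interface would show. The wordings are paraphrases of the idea above, not the paper's exact prompts.

```python
# Hypothetical question wordings keyed by the preference model the learner assumes.
ELICITATION_QUESTIONS = {
    "neutral":        "Which path do you prefer?",
    "partial_return": "Which path had the better immediate outcomes?",
    "regret":         "Which path shows better decision-making?",
}

def render_query(target_model: str) -> str:
    """Pick the wording intended to nudge annotators toward the model the
    learning algorithm assumes; fall back to a neutral phrasing."""
    return ELICITATION_QUESTIONS.get(target_model, ELICITATION_QUESTIONS["neutral"])

print(render_query("regret"))
```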

Why Does This Matter?

Think of RLHF as a conversation between a human and a robot.

  • Old Way: We assume the robot understands us perfectly. If the robot learns the wrong thing, we blame the robot's math.
  • New Way (This Paper): We realize the robot and human are speaking different dialects. Instead of just fixing the robot, we can design better interfaces and training to help humans speak the robot's dialect.

The Takeaway:
If you want an AI to learn exactly what you want, don't just hope it guesses right. Design the interface and the training so that your natural feedback lines up perfectly with the AI's learning style.

It's like tuning a radio. You can't change the station (the human's true preference), but you can tune the dial (the interface and training) so the signal comes in crystal clear, without static. This paper shows us exactly how to tune that dial.
