Aligning Large Language Models with Searcher Preferences

This paper introduces SearchLLM, presented as the first large language model designed for open-ended generative search on platforms like RedNote. It combines a hierarchical multi-dimensional reward system with a Gated Aggregation Strategy under GRPO to balance safety, factual grounding, and user alignment, yielding measurable improvements in generation quality and user engagement.

Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

Published Thu, 12 Ma

Imagine you are asking a librarian for help finding a book.

The Old Way (Traditional Search):
The librarian hands you a stack of 20 different books and says, "Here, these are related to your question. Good luck!" You have to read the titles, flip through the pages, and figure out which one is actually useful. This is how most search engines work today: they give you a list of links.

The New Way (Generative Search):
The librarian reads all 20 books, summarizes the best parts, checks if the information is true, and writes you a single, perfect letter with the answer. This is what "Generative Search" tries to do. It uses a super-smart AI (a Large Language Model) to read the search results and write a direct answer for you.

The Problem:
But here's the catch: If you ask a super-smart AI to write an answer, it might get too excited. It might:

  1. Lie (make up facts because it sounds cool).
  2. Be unsafe (give dangerous advice, like telling you how to build a bomb).
  3. Get confused by bad information (if the search results are messy, the AI might get messy too).
  4. Talk too much (write a novel when you just wanted a quick fact).

This paper introduces a new system called SearchLLM that fixes these problems. Here is how they did it, using some simple analogies.

1. The "Two-Layer" Rulebook

The authors realized they couldn't just tell the AI, "Be helpful." They needed a strict rulebook with two layers, like a security checkpoint at an airport.

  • Layer 1: The "Hard Stop" Rules (Bottom-line Constraints)
    Think of this as the Security Gate. Before the AI is even allowed to speak, it must pass a strict check.

    • Did you lie? (If yes, stop immediately).
    • Is this dangerous? (If yes, stop immediately).
    • Did you follow the format? (If you wrote a poem instead of a list, stop).
    • The Analogy: This is like a bouncer at a club. If you don't have ID or are wearing the wrong shoes, you don't get in, no matter how funny you are.
  • Layer 2: The "Star Performer" Rules (Behavioral Objectives)
    Once the AI passes the security gate, it gets to the stage. Now, the goal is to be awesome.

    • Is the answer easy to read?
    • Did you cover all the angles?
    • Is it short and sweet?
    • The Analogy: This is the talent show. Once you're on stage, the judges want you to be creative, clear, and engaging.
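The two layers can be pictured as a simple scoring function. This is a minimal sketch, not the paper's actual implementation: the check names and weights below are illustrative assumptions.

```python
# Illustrative sketch of the two-layer rulebook.
# Layer 1 is pass/fail; Layer 2 is only scored if Layer 1 passes.

def passes_hard_constraints(answer: dict) -> bool:
    """Layer 1: bottom-line checks. Any failure is an immediate stop."""
    return (
        answer["is_grounded"]        # no made-up facts
        and answer["is_safe"]        # no dangerous content
        and answer["follows_format"] # required output structure
    )

def behavioral_score(answer: dict) -> float:
    """Layer 2: graded objectives (weights are made up for illustration)."""
    weights = {"readability": 0.4, "coverage": 0.4, "conciseness": 0.2}
    return sum(w * answer[k] for k, w in weights.items())

def evaluate(answer: dict) -> float:
    """The security gate: no Layer 2 score unless Layer 1 passes."""
    if not passes_hard_constraints(answer):
        return 0.0
    return behavioral_score(answer)
```

The key design point is the ordering: a perfectly written but unsafe answer never even reaches the talent show.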

2. The "Smart Coach" (The Reward System)

How do you teach an AI to follow these rules? You can't just say "Good job" or "Bad job." You need a Coach that gives specific feedback.

The authors built a Hybrid Coach Team:

  • The Robot Referee: This part checks the hard rules automatically (like a spellchecker or a fact-checker). It's fast and never gets tired.
  • The Human-like Expert: This part uses a second AI, trained on human judgments, to check the "vibe" of the answer. Does it sound natural? Is it helpful?
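One way to picture the coach team (function names and checks are hypothetical, and the real reward model is far richer): the cheap automated referee runs first, and only answers that pass it get scored by the expensive model-based judge.

```python
import re

def robot_referee(answer: str) -> bool:
    """Fast automated check, e.g. requiring a citation marker like [1].
    (The marker format is an assumption for this sketch.)"""
    return bool(re.search(r"\[\d+\]", answer))

def model_judge(answer: str) -> float:
    """Stand-in for a learned reward model scoring helpfulness in [0, 1].
    Here a toy length-based proxy, purely for illustration."""
    return min(len(answer) / 200, 1.0)

def hybrid_reward(answer: str) -> float:
    """Referee gates; judge grades the survivors."""
    if not robot_referee(answer):
        return 0.0
    return model_judge(answer)
```

Running the free checks before the costly model call is a common pattern when one signal is cheap and exact while the other is expensive and fuzzy.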

The Secret Sauce: The "Gated Aggregation" Strategy
This is the most clever part. Imagine you are training a race car driver.

  • If the driver crashes (violates a safety rule), it doesn't matter how fast they were going; they get a zero score.
  • If the driver stays on the track, then we look at how fast they went.

The authors created a mathematical "Gate." If the AI fails the safety check (Layer 1), the gate slams shut, and the reward is zero. The AI learns: "Safety first, speed second." If it tries to cheat to get a high score on "creativity" but fails the safety check, it gets punished heavily. This stops the AI from "gaming the system."
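Putting the gate together with GRPO's group-relative scoring, here is a minimal sketch. The normalization, the hard-zero gate, and the sample values are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def gated_reward(safe: bool, quality: float) -> float:
    """The gate: any hard-rule failure zeroes the reward outright,
    no matter how high the quality score was."""
    return quality if safe else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style: normalize each sampled answer's reward
    relative to the other answers in its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one query: (passed_safety, quality_score).
# The unsafe one is gated to 0, so it gets the lowest advantage
# even though its raw quality score was the highest.
samples = [(True, 0.9), (True, 0.6), (False, 0.95), (True, 0.7)]
rewards = [gated_reward(s, q) for s, q in samples]
adv = group_advantages(rewards)
```

Because the gated reward feeds directly into the advantage, the model is pushed hardest away from exactly the "clever but unsafe" answers that would otherwise game the quality score.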

3. The Result: A Better Search Experience

The team tested this new AI (SearchLLM) on a huge app called RedNote (similar to Instagram/TikTok but with search).

  • Before: Users had to click through many links, sometimes getting bad or outdated info.
  • After: The AI gave them a direct, safe, and accurate answer.

The Numbers:

  • More People Read the Answer: The "Valid Consumption Rate" went up by 1.03%. (More people actually found the answer useful enough to read it).
  • Fewer People Had to Search Again: The "Re-search Rate" went down by 2.81%. (People got the answer the first time and didn't have to ask the same question again).

Summary

Think of this paper as teaching a robot librarian how to be a perfect assistant.

  1. Don't lie (Safety).
  2. Don't be dangerous (Safety).
  3. Be helpful and clear (Quality).
  4. Use a "Gate" system so the robot knows that being safe is more important than being clever.

By doing this, they turned a chaotic list of search results into a reliable, friendly conversation that actually solves the user's problem.