When & How to Write for Personalized Demand-aware Query Rewriting in Video Search

Imagine you are at a massive, chaotic library (a video search engine like WeChat Channels). You walk up to the librarian and say, "I want to see Guang Liang."

The librarian is confused. Is Guang Liang a famous singer? Is it a brand of liquor? Without knowing who you are, the librarian has to guess. If they guess wrong, you get frustrated, say, "No, I meant the liquor," and ask again. This is the problem the paper solves.

The authors, from Tencent, built a smart system called WeWrite. Think of it as a Super-Intelligent Personal Assistant that stands next to the librarian, whispering the exact right request based on what it knows about you.

Here is how they built this assistant, broken down into three simple steps:

1. The "When" Question: Knowing When to Whisper

The Problem: If your assistant whispers a suggestion for every question you ask, it becomes annoying. If you ask, "How do I cook with an air fryer?", your assistant shouldn't suddenly suggest "Funny air fryer pranks for couples" just because you watched a comedy last week. That would be a distraction, not a help.

The Solution (Posterior Mining):
The team taught the assistant to look at your past behavior to decide when to speak up.

The "Frustration Signal": They looked at logs where people asked a question, got bad results (didn't watch the video), and immediately asked a new question.
The "Aha! Moment": They checked if the new question was related to things the user had watched before.
The Filter: They used a "Teacher AI" to double-check: "Did this user actually need help because of their history, or did they just make a typo?"
The Result: The assistant now whispers only when your history gives it good reason to think you need a personalized nudge. If you ask a clear, functional question (like "air fryer recipes"), it stays silent.

2. The "How" Question: Learning to Speak the Library's Language

The Problem: Even if the assistant knows what you want, it might write the request in a weird way that the library's computer system can't understand. Imagine the assistant whispering, "Show me the spicy liquid that makes people happy," when the library only understands the word "Liquor." The library would return zero results.

The Solution (SFT + GRPO):
They trained the assistant in two stages:

Stage 1 (The Student - SFT): They showed the assistant thousands of examples of "Bad Request → Good Request" pairs. The assistant learned to mimic these corrections, just like a student copying a teacher's handwriting.
Stage 2 (The Coach - GRPO): This is the clever part. They didn't just let the assistant guess; they gave it a scorecard.
- If the assistant wrote a query that the library system could easily find (high "Index Hit Rate"), it got a gold star.
- If it wrote a fancy, confusing query that the library couldn't find, it got a penalty.
- The assistant practiced thousands of times, learning to write requests that are not only personal to you but also perfectly formatted for the library's database.

3. The "Speed" Question: Doing It Without Waiting

The Problem: Smart assistants usually take a long time to think. In a video app, if you wait 2 seconds for a result, you'll just scroll away. You can't wait for the AI to think before showing you videos.

The Solution (Fake Recall):
They built a parallel highway.

The Main Road: The traditional search engine starts looking for videos immediately (Text/Vector search).
The Side Road: At the exact same time, the AI assistant starts thinking about your personalized rewrite.
The Magic Cache: The assistant doesn't search the whole library. It checks a pre-made "Cheat Sheet" (a Fake Index) that already contains the top results for popular personalized queries.
The Merge: By the time the Main Road finishes gathering results, the Side Road has already grabbed the personalized ones from the Cheat Sheet. The two sets are merged together, so you get the best of both worlds with no extra waiting that you would notice.

The Real-World Result

When they tested this in the real world:

More Happy Viewers: Views lasting over 10 seconds went up, suggesting the results more often matched what people wanted.
Less Frustration: People had to re-type their search queries less often because the system more often understood them the first time.

In a nutshell: WeWrite is a smart, fast, and polite assistant that knows exactly when to help you find what you want, speaks the language of the search engine perfectly, and does it all without making you wait a single second.

1. Problem Statement

Short-form video search platforms (e.g., WeChat Channels) face significant challenges due to user queries being brief and ambiguous.

Ambiguity: Queries like "Guang Liang" can refer to a singer or a liquor brand. Generic search engines fail to resolve this without user context.
Intent Drift: Existing personalized methods often rewrite queries indiscriminately. This can cause "intent drift," where functional queries (e.g., "air fryer") are incorrectly biased by a user's entertainment history, leading to poor retrieval results.
Latency Constraints: Integrating Large Language Models (LLMs) into real-time search systems is difficult due to high inference costs, which violate strict latency requirements for synchronous search paths.
Signal Dilution: Traditional methods relying on implicit history features often suffer from delayed feedback and diluted signals.

The core challenge is determining When to rewrite a query (to avoid unnecessary noise) and How to rewrite it (to ensure the output aligns with the retrieval system's index and user intent).

2. Methodology: The WeWrite Framework

The authors propose WeWrite, a Personalized Demand-aware Query Rewriting framework consisting of three main modules:

A. Posterior-based Sample Mining (The "When")

To determine when personalization is strictly necessary, the system mines high-quality training data from user logs using a posterior-based strategy.

Context Definition: User context ( $C_u$ ) includes historical queries ( $H_{query}$ ), watched videos ( $H_{video}$ ), and geolocation ( $G$ ).
Positive Sample Mining (Rewrite): Identifies scenarios where a user was dissatisfied with an original query ( $Q_{orig}$ ) but satisfied with a subsequent reformulated query ( $Q_{next}$ ).
- Constraints: $Q_{orig}$ had short dwell time (<2.4s), while $Q_{next}$ had valid consumption (>10s).
- Filtering: A two-stage filter ensures the rewrite is context-driven:
  1. Context Overlap: New terms in $Q_{next}$ must appear in the user's history.
  2. LLM Intent Verification: A teacher model (Qwen3-32B) verifies if the reformulation is explicitly supported by context, filtering out typos or unrelated changes.
Negative Sample Mining (Reject): Identifies cases where the original query led to immediate satisfaction (long dwell time) without reformulation. These are labeled as <reject> to teach the model when not to rewrite.
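
The mining rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `SearchEvent` and `UserContext` types, the whitespace tokenizer, and the reuse of the 10s threshold for "long dwell time" in negative samples are all assumptions made for the example. The LLM intent verification stage is left out; in the described system it would run on the pairs that pass the context-overlap check.

```python
from dataclasses import dataclass, field
from typing import Optional

SHORT_DWELL_S = 2.4   # from the text: Q_orig dwell below this signals dissatisfaction
VALID_DWELL_S = 10.0  # from the text: dwell above this counts as valid consumption


@dataclass
class SearchEvent:
    """Hypothetical log record: one query and the longest dwell on its results."""
    query: str
    max_dwell_s: float


@dataclass
class UserContext:
    """Hypothetical flattened view of C_u: terms drawn from H_query and H_video."""
    history_terms: set = field(default_factory=set)


def new_terms(q_orig: str, q_next: str) -> set:
    # Assumption: whitespace tokenization; a real system would use its own tokenizer.
    return set(q_next.lower().split()) - set(q_orig.lower().split())


def label_sample(ctx: UserContext, orig: SearchEvent,
                 nxt: Optional[SearchEvent]) -> Optional[str]:
    """Return the training target: a rewritten query, "<reject>", or None to discard."""
    if nxt is None:
        # Negative sample: satisfied without reformulating.
        # Assumption: "long dwell time" reuses the 10s consumption threshold.
        return "<reject>" if orig.max_dwell_s > VALID_DWELL_S else None
    dissatisfied_then_satisfied = (
        orig.max_dwell_s < SHORT_DWELL_S and nxt.max_dwell_s > VALID_DWELL_S
    )
    if not dissatisfied_then_satisfied:
        return None
    added = new_terms(orig.query, nxt.query)
    # Context overlap: every new term must already appear in the user's history.
    if added and added <= ctx.history_terms:
        return nxt.query
    return None
```

For a user whose history contains "liquor", the pair ("guang liang", "guang liang liquor") is kept as a rewrite sample, while a reformulation to "guang liang concert" is discarded because "concert" has no support in the context.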

B. Style-aligned LLM Fine-tuning (The "How")

The framework uses a hybrid training paradigm to ensure rewrites are both semantically accurate and system-friendly (retrievable).

Supervised Fine-Tuning (SFT): The model is trained on the mined dataset ( $S_{pos} \cup S_{neg}$ ) to learn the mapping from $(C_u, Q_{orig})$ to either a rewritten query or the <reject> token.
Reinforcement Learning (RL) with GRPO: To prevent the "zero-recall" problem (where the LLM generates semantically correct but unindexable queries), the model undergoes RL using Group Relative Policy Optimization (GRPO).
- Reward Function: Based on historical logs, the reward $R(Q_{rew})$ combines:
  - Log-frequency of the query (encouraging popular, indexed terms).
  - Historical Click-Through Rate (CTR).
  - A penalty for generating unknown terms (hallucinations).
- Optimization: GRPO optimizes the policy by sampling a group of rewrites and normalizing rewards against group statistics, eliminating the need for a separate value network.
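
A rough sketch of the reward and the group-relative normalization follows. The summary does not give the exact formula or weights, so the linear combination, the weights `w_freq`, `w_ctr`, and `penalty`, the lookup tables, and the whitespace tokenizer are all assumptions for illustration. The advantage computation uses the commonly described GRPO form (reward minus group mean, divided by group standard deviation); the population standard deviation is chosen here for simplicity.

```python
import math
from statistics import mean, pstdev


def reward(query, query_freq, query_ctr, vocab,
           w_freq=1.0, w_ctr=1.0, penalty=1.0):
    """Hypothetical R(Q_rew): log-frequency + historical CTR - unknown-term penalty.

    query_freq: dict mapping a logged query to its historical count
    query_ctr:  dict mapping a logged query to its historical click-through rate
    vocab:      set of terms known to the index
    """
    freq_term = math.log1p(query_freq.get(query, 0))
    ctr_term = query_ctr.get(query, 0.0)
    # Assumption: a term is "unknown" if it is missing from the index vocabulary.
    unknown = sum(1 for term in query.split() if term not in vocab)
    return w_freq * freq_term + w_ctr * ctr_term - penalty * unknown


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled rewrite's reward against its own group.

    No value network is needed: the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Within one sampled group, a rewrite that exists in the logs and uses indexed terms scores above the group mean and receives a positive advantage, while a fluent but unindexable rewrite is pushed down. If every rewrite in the group scores the same, all advantages are zero and the group contributes no gradient signal.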

C. Deployment: Fake Recall Architecture

To solve the latency bottleneck, the system employs a parallel "Fake Recall" architecture.

Fake Index ( $I_{fake}$ ): A pre-built Key-Value index maps valid system queries to their top-performing documents (Top-K). It is constructed using interaction-based caching for head queries and retrieval-based mining for long-tail queries.
Parallel Execution:
1. When a user query arrives, the Traditional Search Path (Text/Vector Recall) and the Personalized Rewriting Path run simultaneously.
2. The LLM generates a rewrite asynchronously.
3. If a rewrite is generated, it queries the Fake Index (O(1) lookup) to retrieve candidates immediately, bypassing the heavy online retrieval chain.
4. A lightweight relevance filter removes irrelevant results, and the final candidates are fused with the main search results.
Result: This achieves "zero-perceived-latency" personalization.
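
The parallel flow can be illustrated with `asyncio`. Everything below is a toy stand-in rather than the production design: `FAKE_INDEX`, `traditional_recall`, `llm_rewrite`, `is_relevant`, the simulated delays, and the fusion order (personalized candidates first, then deduplicated main results) are assumptions made so the sketch runs on its own.

```python
import asyncio

# Hypothetical Fake Index: valid system query -> precomputed Top-K document ids.
FAKE_INDEX = {
    "guang liang liquor": ["v_liquor_1", "v_liquor_2"],
}


async def traditional_recall(query):
    """Stand-in for the text/vector recall path."""
    await asyncio.sleep(0.05)
    return [f"main:{query}:1", f"main:{query}:2"]


async def llm_rewrite(query, user_context):
    """Stand-in for the LLM; returns a rewrite, or None for <reject>."""
    await asyncio.sleep(0.03)
    if query == "guang liang" and "liquor" in user_context:
        return "guang liang liquor"
    return None


def is_relevant(query, doc_id):
    """Placeholder for the lightweight relevance filter; accepts everything here."""
    return True


async def search(query, user_context):
    # Both paths start at the same time; neither waits on the other.
    main_task = asyncio.create_task(traditional_recall(query))
    rewrite_task = asyncio.create_task(llm_rewrite(query, user_context))
    main_results, rewrite = await asyncio.gather(main_task, rewrite_task)

    personalized = []
    if rewrite is not None:
        # Dictionary lookup instead of running the full retrieval chain again.
        candidates = FAKE_INDEX.get(rewrite, [])
        personalized = [d for d in candidates if is_relevant(query, d)]

    # Assumed fusion rule: personalized first, then main results, no duplicates.
    fused, seen = [], set()
    for doc in personalized + main_results:
        if doc not in seen:
            seen.add(doc)
            fused.append(doc)
    return fused
```

Calling `asyncio.run(search("guang liang", {"liquor"}))` returns the two cached liquor videos ahead of the main results, while a query the rewriter rejects, such as "air fryer", returns only the main results. Because the rewrite path is just an LLM call plus a key lookup, its cost overlaps with the main recall instead of adding to it.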

3. Key Contributions

Posterior-based "When" Strategy: An automated mining mechanism that uses user feedback (dwell time, reformulation) to identify strictly necessary personalization scenarios, effectively mitigating intent drift.
GRPO-aligned "How" Training: A hybrid SFT + RL training paradigm. By optimizing for retrieval-oriented rewards (Index Hit Rate and CTR), it aligns the LLM's output style with the specific constraints of the search index.
Fake Recall Deployment: A novel parallel architecture that decouples LLM inference from the main search path, ensuring low latency suitable for real-time video search.

4. Experimental Results

The framework was evaluated via large-scale online A/B testing on a major video platform.

Click-Through Video Volume (VV>10s): Increased by 1.07%.
Query Reformulation Rate: Reduced by 2.97% (indicating users found what they needed faster without needing to re-type queries).
Latency: The parallel architecture successfully maintained system latency within acceptable limits despite LLM integration.

5. Significance

This paper addresses a critical gap in search personalization: the trade-off between intent resolution and retrieval feasibility.

Explicit vs. Implicit: It moves beyond implicit history modeling to explicit, demand-aware rewriting, ensuring personalization is triggered only when it adds value.
System Alignment: It demonstrates how to fine-tune generative models not just for language quality, but specifically for the constraints of a retrieval system (index hit rates).
Scalability: The "Fake Recall" architecture provides a practical blueprint for deploying heavy LLM inference in latency-sensitive production environments, making personalized generative search viable for millions of users.
