Imagine you are a personal shopper for a massive, chaotic department store. Your job is to pick the perfect items for your customers to keep them happy and coming back.
However, there's a catch: the store is rigged.
The Problem: The "Popularity Trap"
In this store, the most popular items (like the latest viral sneakers) are placed right at the entrance with flashing neon signs. The quieter, niche items (like a hand-knitted scarf) are hidden in the back.
Because the popular items are so easy to see, customers grab them first. They leave positive reviews for the sneakers because those are the items they ended up buying, not necessarily because they loved them more than the scarf.
The Mistake:
Your boss (the AI algorithm) looks at the data and thinks, "Wow, everyone loves the sneakers! I should only show sneakers!"
But the boss is wrong. The customers didn't choose the sneakers because they are the best; they chose them because they were the only ones they saw. The boss is confusing exposure with preference.
This creates a vicious cycle:
- The boss shows sneakers.
- Customers buy sneakers (because they are the only option).
- The boss thinks, "See? They love sneakers!" and shows even more sneakers.
- The customers get bored, stop coming, and the store loses money.
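The loop above can be sketched as a toy simulation. Everything here is illustrative (the item names, starting click counts, and equal true preferences are made up): the point is that when exposure is driven by past clicks rather than true preference, an early popularity edge compounds on its own.

```python
import random

random.seed(0)

# Toy "Rich-Get-Richer" loop: the store shows items in proportion to
# past clicks, so early exposure compounds regardless of true taste.
true_preference = {"sneakers": 0.5, "scarf": 0.5}  # customers like both equally
clicks = {"sneakers": 10, "scarf": 1}              # sneakers start with a head start

for _ in range(1000):
    # Exposure depends on past clicks, not on true preference.
    total = clicks["sneakers"] + clicks["scarf"]
    shown = "sneakers" if random.random() < clicks["sneakers"] / total else "scarf"
    # The customer clicks based on their true preference for the shown item.
    if random.random() < true_preference[shown]:
        clicks[shown] += 1

share = clicks["sneakers"] / (clicks["sneakers"] + clicks["scarf"])
print(f"Sneaker share of clicks: {share:.2f}")  # stays far above the true 0.5
```

Even though customers like both items equally, the sneakers' click share never falls back toward 0.5: the biased exposure keeps manufacturing its own evidence.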
This is what the paper calls the "Rich-Get-Richer" loop. Existing debiasing methods try to fix it by telling the boss, "Hey, maybe show a scarf sometimes," but the boss is still looking at the same distorted data where everyone seems to love sneakers. The boss stays confused and keeps making mistakes.
The Solution: "Cleaning the Glasses"
The authors of this paper, Yun Lu and his team, say: "Stop trying to fix the boss's decisions. First, fix what the boss is seeing."
They propose a two-step system called DSRM-HRL. Think of it as giving the boss a pair of magic glasses and a smart assistant.
Step 1: The Magic Glasses (DSRM)
Before the boss looks at the customer, they put on a pair of "Diffusion Model" glasses.
- What it does: These glasses filter out the "neon sign" noise. They ignore the fact that the sneakers were just placed at the front. They look deep into the customer's history to find out what they actually liked, even if they never got a chance to see it.
- The Analogy: Imagine looking at a muddy puddle. You can't see the fish swimming underneath. The "Diffusion" process is like slowly stirring the water and filtering out the mud until the water is crystal clear. Now, the boss can see the true fish (the customer's real interests), not just the mud (the popularity bias).
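The "stirring and filtering" analogy maps onto the standard diffusion-model mechanics. Below is a minimal sketch of those mechanics only, not the paper's DSRM: in the real model a trained network predicts the noise, whereas here we hand the reverse step the true noise just to show that subtracting it recovers the clean state. The user-interest vector and noise schedule are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50
betas = np.linspace(1e-4, 0.05, T)     # noise schedule (toy values)
alpha_bars = np.cumprod(1.0 - betas)

x0 = np.array([0.9, 0.1, 0.7])         # "true" user interests (hypothetical)
eps = rng.standard_normal(x0.shape)    # the "mud": popularity noise

# Forward process: corrupt the clean state (closed form at step t).
t = T - 1
xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Reverse process: a denoiser that predicts eps can invert the corruption.
# (DSRM would use a learned network here; we use the true eps as a stand-in.)
x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
print(np.allclose(x0_hat, x0))  # True: removing the noise restores the state
```

The algebra is the whole trick: if you can estimate the noise that was mixed in, you can solve for the clean signal underneath it, which is exactly the "clear water" the boss needs before making any decisions.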
Step 2: The Smart Assistant (Hierarchical RL)
Once the boss sees the clear picture through the glasses, they don't just pick one item. They use a two-level management team:
- The CEO (High-Level Policy): This is the long-term strategist. Their only job is to make sure the store is fair. They say, "We need to make sure the hand-knitted scarves get a chance to be seen, or the store will lose its soul." They set the rules for the day.
- The Salesperson (Low-Level Policy): This is the day-to-day worker. They listen to the CEO. They say, "Okay, I need to be fair. But I also need to sell something the customer likes right now." They pick the perfect item that satisfies both the customer's taste and the CEO's fairness rules.
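The CEO/Salesperson split can be sketched in a few lines. This is a hypothetical illustration of the division of labor, not the paper's actual objective: the function names, the linear relevance-minus-exposure trade-off, and the item scores are all invented for the example.

```python
# Toy catalog: relevance is how much this customer likes the item,
# exposure is how over-shown the item already is (hypothetical values).
items = {"sneakers": {"relevance": 0.9, "exposure": 0.95},
         "scarf":    {"relevance": 0.7, "exposure": 0.05}}

def high_level_policy(exposure_gap):
    # The "CEO": the bigger the exposure imbalance, the more weight
    # fairness gets in the low-level decision.
    return min(1.0, exposure_gap)

def low_level_policy(items, fairness_weight):
    # The "Salesperson": pick the item that balances what the customer
    # likes right now against the CEO's fairness rule.
    def score(item):
        return item["relevance"] - fairness_weight * item["exposure"]
    return max(items, key=lambda name: score(items[name]))

gap = abs(items["sneakers"]["exposure"] - items["scarf"]["exposure"])
weight = high_level_policy(gap)
print(low_level_policy(items, weight))  # prints "scarf"
```

With a large exposure gap, the CEO pushes the fairness weight up and the Salesperson's scoring flips in favor of the scarf; if exposure were balanced, the weight would shrink and the more relevant sneakers would win again.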
Why This Works Better
In the old way, the boss tried to be fair while looking at muddy water. They kept making mistakes because the data was corrupted.
In the new way:
- Clean the Water: The "Magic Glasses" remove the popularity noise, revealing the customer's true taste.
- Split the Job: The "CEO" handles the long-term fairness, and the "Salesperson" handles the immediate sale. They don't fight each other; they work together.
The Result
When the authors tested this in a simulated video store (using real data from apps like TikTok), the results were amazing:
- Customers stayed longer: They were happier because they found things they actually liked, not just what was popular.
- The "Long Tail" survived: The hidden, niche items finally got a chance to be seen and sold.
- No more confusion: The system stopped the "Rich-Get-Richer" loop and created a healthy, balanced ecosystem.
In short: You can't make a fair decision if you are looking at a distorted reality. This paper teaches us that to build a fair AI, we must first clean the data (the state) before we try to teach the AI (the policy) how to be fair.