Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Imagine you are an advertiser running a massive online ad campaign. You have a fixed budget (say, $10,000) for the day, and you need to decide how much to bid for thousands of ad spots every second. If you bid too low, you miss out on customers. If you bid too high, you run out of money too early and stop showing ads for the rest of the day.

This is the Auto-Bidding problem. It's like trying to drive a car through a crowded city while keeping your gas tank from emptying before you reach your destination.

The Old Way: The "Copycat" Driver

For a long time, computers solved this by looking at a huge notebook of past driving trips (offline data). They tried to learn the best routes by simply imitating what worked in the past.

The Problem: If the traffic changes slightly (a new road opens, a storm hits), the "copycat" driver gets confused. They are afraid to try anything new because they only know what's in the notebook. They stick to the safe, old routes, even if a faster one exists. They can't "explore" safely.

The New Way: The "Smart Navigator" (AIGB-Pearl)

This paper introduces a new system called AIGB-Pearl. Think of it as upgrading that copycat driver into a Smart Navigator that has two special tools:

1. The "Quality Judge" (The Trajectory Evaluator)

Imagine a strict coach sitting in the passenger seat. Every time the driver (the AI) suggests a new route, the coach doesn't just guess; they have a scorecard.

The coach looks at the proposed route and gives it a score: "This looks like a $90 route," or "This looks like a $110 route."
The Innovation: In the past, the driver had to guess if a new route was good. Now, the coach gives a concrete score before the car even moves. This tells the driver exactly how good a new idea might be.

2. The "Safety Fence" (KL-Lipschitz Constraint)

Here is the tricky part. If the coach says, "Hey, try that new route over there!" the driver might get too excited and drive off a cliff (this is called Out-of-Distribution risk in tech terms).

The paper builds a Safety Fence around the driver.
The Rule: "You can explore new routes, but you must stay within a certain distance of the roads we know are safe."
Mathematically, this is called a KL-Lipschitz Constraint. In plain English, it means: "Don't jump too far away from what you know works. Take small, safe steps into the unknown."

How It Works Together

The Coach (Evaluator) learns from the old notebook to predict how good a route would be.
The Driver (Planner) tries to find a route that gets the highest score from the coach.
The Safety Fence ensures the driver doesn't wander into dangerous territory where the coach's predictions might be wrong.

Why Is This a Big Deal?

Stability: Some older methods (offline reinforcement learning in particular) were like a rollercoaster that kept crashing during training. This new method is smooth and steady.
Safety: It prevents the AI from making crazy, expensive mistakes that could burn the advertiser's budget in minutes.
Performance: In live tests on Taobao (a massive Chinese e-commerce site), the new system delivered about 3% more sales value (GMV) for advertisers than the previous best method. At that scale, even a few percent adds up to a very large amount of extra sales.

The Bottom Line

AIGB-Pearl is like giving an auto-bidding AI a smart coach and a safety harness. It allows the AI to try new, better strategies to win more ads, but it keeps the AI from doing anything reckless that could ruin the campaign. It's the difference between a reckless gambler and a professional poker player who knows when to take a calculated risk.

1. Problem Statement

The paper addresses the auto-bidding problem in online advertising, where an advertiser must automatically adjust bids to maximize cumulative value (e.g., GMV) within a fixed budget over a bidding episode.
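As a compact restatement of the sentence above, the budget-constrained objective can be written as follows. The notation is an assumption made here for illustration and is not taken from the paper: $\pi$ is the bidding policy, $r_t$ the value won at step $t$, $c_t$ the cost paid at step $t$, $T$ the number of steps in the episode, and $B$ the budget.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} r_t \right]
\quad \text{s.t.} \quad \sum_{t=1}^{T} c_t \le B
```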

Context: This is modeled as an offline sequential decision-making problem (Markov Decision Process) constrained by a static dataset due to safety concerns in real-world systems.
Limitations of Existing Methods:
- Offline RL: While widely used, methods relying on bootstrapped value estimates (like Q-learning) suffer from training instability and the "deadly triad" (function approximation, bootstrapping, off-policy data), leading to unreliable policies.
- Generative Auto-Bidding (AIGB): Recent methods (e.g., DiffBid, Decision Transformer) treat bidding as a trajectory generation task. They offer stable training but lack a mechanism to explore beyond the static dataset. They essentially imitate offline data; when forced to extrapolate (generate trajectories with higher rewards than seen in data), they lack explicit reward guidance, leading to unreliable or risky generations (Out-of-Distribution or OOD failures).

Core Challenge: How to integrate policy optimization (exploration) into a generative auto-bidding framework to improve performance beyond the offline dataset while ensuring safety and avoiding OOD pitfalls.

2. Methodology: AIGB-Pearl

The authors propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel framework that integrates a learned trajectory evaluator with a generative planner, optimized under KL and Lipschitz constraints that come with a theoretical guarantee.

A. Trajectory Evaluator

Instead of relying on bootstrapped value functions, AIGB-Pearl trains a trajectory evaluator ( $\hat{y}_\phi$ ) using supervised learning on the offline dataset $D$ .

Goal: Predict the trajectory quality (normalized cumulative reward) $y(\tau)$ for any given trajectory $\tau$ .
Training: Minimizes the mean squared error (MSE) between predicted scores and ground-truth trajectory quality.
Enhancements: The evaluator is augmented with LLM embeddings (to capture semantic advertiser features) and pairwise ranking losses (to improve relative score accuracy).
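A minimal sketch of how an MSE term and a pairwise ranking term might be combined into one evaluator loss. This is an illustration under stated assumptions, not the paper's exact loss: the function name `evaluator_loss`, the hinge form of the ranking term, and the parameters `rank_weight` and `margin` are all hypothetical.

```python
import numpy as np

def evaluator_loss(pred, target, rank_weight=0.5, margin=0.0):
    """MSE plus a pairwise hinge ranking penalty (illustrative only)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)

    # Entry [i, j] compares trajectory i with trajectory j.
    pred_diff = pred[:, None] - pred[None, :]
    target_diff = target[:, None] - target[None, :]

    # Only pairs where trajectory i is truly better than j contribute.
    better = target_diff > 0
    if not better.any():
        return mse

    # Hinge: penalize pairs whose predicted gap falls below the margin.
    hinge = np.maximum(0.0, margin - pred_diff[better])
    return mse + rank_weight * hinge.mean()

# Two predictions with the same MSE; only the second gets the order wrong.
true_quality = [0.5, 0.6]
good = evaluator_loss([0.4, 0.7], true_quality)
bad = evaluator_loss([0.6, 0.5], true_quality)
print(good, bad)
```

The two predictions are equally far from the targets in squared error, so MSE alone cannot tell them apart. The ranking term is what penalizes the second one for scoring the worse trajectory higher.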

B. KL-Lipschitz-Constrained Score Maximization

The core innovation is a constrained optimization objective for the generative planner ( $p_\theta$ ). The planner aims to maximize the evaluator's score $L(\theta) = \mathbb{E}[\hat{y}_\phi(\tau)]$ but is restricted to regions where the evaluator's predictions remain reliable.

The objective is formulated as:
$\max_\theta L(\theta) \quad \text{s.t.} \quad \text{KL Constraint} \land \text{Lipschitz Constraint}$

KL Constraint (Behavior Cloning):
$\mathbb{E}_{y \sim p_D(y)} [D_{KL}(p_D(\tau|y) \| p_\theta(\tau|y))] \leq \delta_K$
This ensures the planner stays close to the offline data distribution, preventing catastrophic divergence.
Lipschitz Constraint (Stability):
The authors prove that trajectory quality $y(\tau)$ is Lipschitz continuous in the trajectory $\tau$. They then constrain the planner to be Lipschitz continuous with respect to the condition $y$ (the desired reward level).
$\text{Lip}_{W_1}(p_\theta(\tau|y)) \leq L_p$
This guarantees that small changes in the target reward condition do not lead to drastic, unpredictable changes in the generated trajectory, keeping the generation within a "certified neighborhood" of high-quality offline trajectories.
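Both constraints can be unpacked with standard identities, which may help explain the labels attached to them. The first line uses the fact that the entropy of the data distribution does not depend on $\theta$, so bounding the KL term amounts to bounding the conditional negative log-likelihood (hence "behavior cloning"). The second line is the usual reading of a Lipschitz bound in the $W_1$ metric, written out for two arbitrary conditions $y_1, y_2$; it is an interpretation of the notation above and the paper's precise definition may differ in detail.

```latex
% KL constraint: equal to the conditional negative log-likelihood up to a constant in \theta
\mathbb{E}_{y \sim p_D(y)}\!\left[ D_{KL}\big(p_D(\tau|y) \,\|\, p_\theta(\tau|y)\big) \right]
  = -\,\mathbb{E}_{(\tau, y) \sim p_D}\big[\log p_\theta(\tau|y)\big] + \text{const}

% Lipschitz constraint: nearby conditions must yield nearby trajectory distributions
W_1\big(p_\theta(\cdot|y_1),\, p_\theta(\cdot|y_2)\big) \le L_p\,|y_1 - y_2|
\quad \text{for all } y_1, y_2
```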

C. Theoretical Guarantees

The paper provides a sub-optimality gap bound (Theorem 3). It proves that the difference between the optimal performance and the proposed method's performance is bounded in terms of the following quantities:

The evaluator's training error ( $\delta_D$ ).
The Lipschitz constant of the evaluator ( $k$ ).
The KL divergence error ( $\delta_K$ ) and the Lipschitz constant of the planner ( $L_p$ ).
This gives theoretical support for the claim that the method can explore higher-reward regions without incurring unbounded performance degradation.

D. Practical Algorithm: Synchronous Coupling

To enforce the Lipschitz constraint during training, the authors introduce a Synchronous Coupling technique.

Instead of random sampling, two trajectories conditioned on different rewards ( $y_1, y_2$ ) are generated using the same sequence of Gaussian noise.
Because any coupling of two distributions upper-bounds their Wasserstein distance ( $W_1$ ), sharing the noise yields a tighter upper-bound estimate than independent sampling, making the Lipschitz penalty computationally feasible and effective.
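The effect of sharing noise can be sketched on a toy example. Everything here is an assumption for illustration: `toy_planner` is a hypothetical one-step stand-in for the diffusion planner, `base` is an arbitrary trajectory shape, and `L_p = 1.0` is an arbitrary Lipschitz budget. The only fact relied on is that the expected distance under any coupling is an upper bound on $W_1$, so the smaller estimate is the tighter bound.

```python
import numpy as np

def toy_planner(y, noise, base):
    """Hypothetical planner: maps a reward condition y and noise to a trajectory."""
    return y * base + 0.5 * noise

def coupling_distance(y1, y2, base, n_samples, rng, shared_noise):
    """Monte Carlo estimate of E||tau1 - tau2|| under the chosen coupling."""
    eps1 = rng.standard_normal((n_samples, base.size))
    # Synchronous coupling reuses eps1; otherwise draw fresh independent noise.
    eps2 = eps1 if shared_noise else rng.standard_normal((n_samples, base.size))
    tau1 = toy_planner(y1, eps1, base)
    tau2 = toy_planner(y2, eps2, base)
    return np.linalg.norm(tau1 - tau2, axis=1).mean()

rng = np.random.default_rng(0)
base = np.linspace(1.0, 0.0, 8)  # toy trajectory shape
y1, y2 = 0.6, 0.7

sync = coupling_distance(y1, y2, base, 2000, rng, shared_noise=True)
indep = coupling_distance(y1, y2, base, 2000, rng, shared_noise=False)

# Hinge penalty on the estimated Lipschitz ratio exceeding the budget L_p.
L_p = 1.0
penalty = max(0.0, sync / abs(y1 - y2) - L_p)
print(sync, indep, penalty)
```

With shared noise the random parts cancel and only the effect of changing the condition remains, so the estimate reflects how sensitive the planner is to $y$. With independent noise the estimate is dominated by sampling variance and would penalize the planner for randomness it is supposed to have.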

3. Key Contributions

AIGB-Pearl Framework: A novel method combining generative modeling with RL-based policy search, enabling continuous improvement beyond static datasets.
Theoretical Safety Mechanism: A KL-Lipschitz-constrained score maximization objective with a proven guarantee. According to the authors, this is the first approach to provide a sub-optimality bound for generative auto-bidding that explicitly handles OOD risks.
Synchronous Coupling Algorithm: A practical implementation technique to satisfy Lipschitz constraints in generative models, ensuring stable training.
Enhanced Evaluator: Integration of LLM embeddings and pairwise learning to create a robust trajectory quality scorer.

4. Experimental Results

The authors validated AIGB-Pearl through extensive simulations and real-world A/B tests on Taobao (Alibaba).

Simulated Experiments (30 advertisers):
- Outperformed all baselines (including USCB, BCQ, CQL, IQL, DiffBid, DT) across all budget levels.
- Achieved a +4.62% improvement in GMV over the strongest baseline (DiffBid) at the 1.5k budget level.
Real-World A/B Tests (6,000 advertisers, 19 days):
- GMV: +3.00% improvement over DiffBid.
- BuyCnt (number of purchases): +2.20% improvement.
- ROI: +1.89% improvement.
- Cost: Fluctuated within acceptable tolerance (<2%).
- Generalization: On unseen advertisers (OOD), AIGB-Pearl still outperformed DiffBid by +3.32% in GMV, demonstrating superior generalization capabilities.
Ablation Studies:
- Removing the KL constraint led to a 1.1% drop in GMV.
- Removing the Lipschitz constraint led to a 1.8% drop in GMV and resulted in "pathological" trajectories (e.g., excessive budget consumption, backward-trending pacing).
- Training Stability: AIGB-Pearl showed significantly smoother learning curves and lower variance across random seeds compared to offline RL methods with bootstrapping.
TargetROAS Extension:
- Applied to a more complex TargetROAS bidding problem (300k advertisers), achieving a +5.1% GMV uplift over the SOTA DiffBid.

5. Significance

Bridging Generative AI and RL: The paper combines the training stability of generative modeling with the optimization power of RL, addressing the "exploration vs. safety" dilemma in offline settings.
Industrial Impact: The method is deployed on a massive scale (Taobao), directly translating to millions of RMB in additional daily GMV. It addresses a critical industry need for stable, safe, and high-performing auto-bidding systems.
Theoretical Rigor: By providing a sub-optimality bound and Lipschitz constraints, the paper moves generative decision-making from heuristic improvements to theoretically grounded optimization, offering a blueprint for safe exploration in other offline RL domains.