Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

The paper proposes Consensus Aggregation for Policy Optimization (CAPO), a method that improves policy learning by training multiple PPO replicates on the same data with different minibatch shuffles and then aggregating them. By redirecting compute from optimization depth (more epochs) to width (more replicates), CAPO cancels out optimization noise and achieves significantly higher sample efficiency than simply training deeper.

Zelal Su (Lain) Mustafaoglu, Sungyoung Lee, Eshan Balachandar, Risto Miikkulainen, Keshav Pingali

Published 2026-03-16

Imagine you are trying to teach a robot dog how to walk. You give it a command, it tries to move, and you tell it, "Good job!" or "Try again!" This is how Reinforcement Learning (RL) works. The robot (the "policy") is constantly adjusting its brain (its neural network) to get better at walking.

The most popular way to teach the robot is a method called PPO (Proximal Policy Optimization). Think of PPO as a very diligent student who, after seeing a single example of how to walk, tries to practice that same example over and over again in their head.

The Problem: "Over-Practicing" Makes You Worse

In the paper, the authors discovered a funny problem with this "over-practicing."

Imagine you are trying to walk in a straight line toward a goal.

  1. The Signal: This is the part of your brain that says, "Okay, step forward." This is the useful information.
  2. The Waste: This is the part of your brain that says, "Wait, maybe I should wiggle my left ear, or tilt my head, or step on my toes." This is noise. It doesn't help you walk; it just confuses you.

When PPO practices the same data too many times (let's say 40 times instead of 10), the "Signal" (the useful step) stops getting stronger. It hits a ceiling. But the "Waste" (the confusing wiggles) keeps growing and growing.

The Analogy:
Imagine you are trying to tune a radio to a clear station.

  • The Signal is the music.
  • The Waste is static noise.
  • If you turn the dial slightly (10 practices), you get a clear song.
  • If you keep turning the dial wildly in the same direction (40 practices), the music doesn't get any clearer, but the static noise gets so loud it drowns out the song. You end up with a worse result than if you had stopped earlier.

The paper calls this the "Optimization-Depth Dilemma": The deeper you dig (more practice rounds), the more useless garbage you find.

The Solution: "Optimize Wider, Not Deeper"

Instead of having one student practice the same lesson 40 times, the authors propose a new method called CAPO (Consensus Aggregation for Policy Optimization).

The Analogy: The Committee of Experts
Imagine you have a difficult math problem.

  • The Old Way (PPO): You give the problem to one genius student and say, "Solve this, then solve it again, then again, 40 times." Eventually, they get tired, start making silly mistakes, and their answer gets worse.
  • The New Way (CAPO): You give the same problem to four different students.
    • Student A solves it.
    • Student B solves it (but they shuffled the order of the numbers in their head).
    • Student C solves it.
    • Student D solves it.

Each student makes their own unique mistakes (their "Waste"). But they all agree on the main answer (the "Signal").

Now, you ask them to average their answers.

  • Student A's weird mistake cancels out Student B's weird mistake.
  • Student C's weird mistake cancels out Student D's.
  • The "Signal" (the correct math) stays because they all agreed on it.

The result? You get a much better answer than any single student could have gotten, even though they all looked at the exact same data.
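
The committee analogy is really just averaging independent noisy estimates, and you can check the arithmetic in a few lines. This is a toy sketch, not the paper's code: `SIGNAL`, `NOISE_STD`, and the Gaussian noise model are illustrative assumptions. The point is that averaging four independent estimates cuts the mean-squared noise by roughly a factor of four, while the shared signal is untouched.

```python
import random

random.seed(0)

SIGNAL = 1.0     # the "correct answer" every student agrees on
NOISE_STD = 0.5  # each student's private, random mistakes
K = 4            # committee size

def noisy_estimate():
    """One student's answer: the shared signal plus their own random mistake."""
    return SIGNAL + random.gauss(0.0, NOISE_STD)

# Compare the error of a single student vs. a 4-student average, over many trials.
trials = 100_000
single_err = sum((noisy_estimate() - SIGNAL) ** 2 for _ in range(trials)) / trials
committee_err = sum(
    (sum(noisy_estimate() for _ in range(K)) / K - SIGNAL) ** 2
    for _ in range(trials)
) / trials

print(f"single student mean-squared error:  {single_err:.4f}")    # ~0.25
print(f"committee (K=4) mean-squared error: {committee_err:.4f}")  # ~0.0625, about 4x smaller
```

The committee's error shrinks by roughly the committee size, which is exactly the "mistakes cancel out" effect described above.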

How CAPO Works in Real Life

  1. Gather Data: The robot takes a walk and records what happened.
  2. Split the Team: Instead of training one brain, the computer creates 4 copies of the robot's brain.
  3. Shuffle the Deck: Each copy looks at the same walk data, but they read it in a different random order (like shuffling a deck of cards). This makes them think slightly differently.
  4. Train: Each copy tries to learn from the data.
  5. Vote: The computer takes the 4 brains and blends them together into one "Consensus Brain."
    • If the "Waste" (noise) was different for each copy, it cancels out.
    • If the "Signal" (good learning) was the same, it gets stronger.
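
The five steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the paper's implementation: PPO is replaced by plain SGD on a toy regression problem, `train_replica` is a hypothetical helper name, and the "Consensus Brain" is assumed to be a uniform average of the replicates' parameters.

```python
import random

random.seed(1)

# Step 1: one batch of "rollout" data. Toy stand-in: noisy samples of y = 2x.
data = [(x, 2.0 * x + random.gauss(0.0, 0.3)) for x in [i / 10 for i in range(20)]]

def train_replica(data, seed, epochs=40, lr=0.05):
    """Train one copy of the 'brain' (a single weight w) on the SAME data,
    but with a replica-specific minibatch shuffle (step 3)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)                 # each copy reads a different order
        for x, y in shuffled:
            w -= lr * 2 * (w * x - y) * x     # step 4: gradient step on squared error
    return w

# Step 2: four copies trained from the same data; step 5: average ("vote").
replicas = [train_replica(data, seed=s) for s in range(4)]
consensus = sum(replicas) / len(replicas)

print("replica weights:", [round(w, 3) for w in replicas])
print("consensus weight:", round(consensus, 3))  # close to the true value 2.0
```

Each replica ends up slightly different because of its shuffle order; the averaging step keeps what they agree on and blends away the order-dependent wobble.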

Why is this a Big Deal?

  • No Extra Walking: The robot doesn't have to walk any further to learn this. It uses the exact same amount of data.
  • Massive Gains: On simple tasks, it was 2x better. On a very complex task (teaching a human-sized robot to stand up), it was 8.6 times better than the old method!
  • Efficiency: It's like hiring a team of experts instead of overworking one person. It's faster to get a good answer by asking four people once than asking one person four times.

The Bottom Line

The paper teaches us a simple lesson for AI (and maybe for us too): Don't just drill the same thing over and over until you get confused.

Instead, try looking at the same problem from four different angles, listen to four different opinions, and then find the middle ground. By going wider (more perspectives) instead of deeper (more repetition), you get smarter, faster, and with less wasted effort.
