Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

The paper proposes Consensus Aggregation for Policy Optimization (CAPO), a method that improves policy learning by training multiple PPO replicates on the same data with different minibatch shuffles and then aggregating them. By redirecting compute from optimization depth (more epochs) to width (more replicates), CAPO cancels out optimization noise and achieves significantly higher sample efficiency than simply training deeper.

Zelal Su (Lain) Mustafaoglu, Sungyoung Lee, Eshan Balachandar, Risto Miikkulainen, Keshav Pingali

Published 2026-03-16

Imagine you are trying to teach a robot dog how to walk. You give it a command, it tries to move, and you tell it, "Good job!" or "Try again!" This is how Reinforcement Learning (RL) works. The robot (the "policy") is constantly adjusting its brain (its neural network) to get better at walking.

The most popular way to teach the robot is a method called PPO (Proximal Policy Optimization). Think of PPO as a very diligent student who, after seeing a single example of how to walk, tries to practice that same example over and over again in their head.

The Problem: "Over-Practicing" Makes You Worse

In the paper, the authors discovered a funny problem with this "over-practicing."

Imagine you are trying to walk in a straight line toward a goal.

  1. The Signal: This is the part of your brain that says, "Okay, step forward." This is the useful information.
  2. The Waste: This is the part of your brain that says, "Wait, maybe I should wiggle my left ear, or tilt my head, or step on my toes." This is noise. It doesn't help you walk; it just confuses you.

When PPO practices the same data too many times (let's say 40 times instead of 10), the "Signal" (the useful step) stops getting stronger. It hits a ceiling. But the "Waste" (the confusing wiggles) keeps growing and growing.

The Analogy:
Imagine you are trying to tune a radio to a clear station.

  • The Signal is the music.
  • The Waste is static noise.
  • If you turn the dial slightly (10 practices), you get a clear song.
  • If you keep turning the dial wildly in the same direction (40 practices), the music doesn't get any clearer, but the static noise gets so loud it drowns out the song. You end up with a worse result than if you had stopped earlier.

The paper calls this the "Optimization-Depth Dilemma": The deeper you dig (more practice rounds), the more useless garbage you find.

The Solution: "Optimize Wider, Not Deeper"

Instead of having one student practice the same lesson 40 times, the authors propose a new method called CAPO (Consensus Aggregation for Policy Optimization).

The Analogy: The Committee of Experts
Imagine you have a difficult math problem.

  • The Old Way (PPO): You give the problem to one genius student and say, "Solve this, then solve it again, then again, 40 times." Eventually, they get tired, start making silly mistakes, and their answer gets worse.
  • The New Way (CAPO): You give the same problem to four different students.
    • Student A solves it.
    • Student B solves it (but they shuffled the order of the numbers in their head).
    • Student C solves it.
    • Student D solves it.

Each student makes their own unique mistakes (their "Waste"). But they all agree on the main answer (the "Signal").

Now, you ask them to average their answers.

  • Student A's weird mistake cancels out Student B's weird mistake.
  • Student C's weird mistake cancels out Student D's.
  • The "Signal" (the correct math) stays because they all agreed on it.

The result? You get a much better answer than any single student could have gotten, even though they all looked at the exact same data.
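
The committee analogy is really just averaging independent noisy estimates, and you can check the arithmetic in a few lines. This is a toy sketch, not the paper's code: `SIGNAL`, `NOISE_STD`, and the Gaussian noise model are illustrative assumptions. The point is that averaging four independent estimates cuts the mean-squared noise by roughly a factor of four, while the shared signal is untouched.

```python
import random

random.seed(0)

SIGNAL = 1.0     # the "correct answer" every student agrees on
NOISE_STD = 0.5  # each student's private, random mistakes
K = 4            # committee size

def noisy_estimate():
    """One student's answer: the shared signal plus their own random mistake."""
    return SIGNAL + random.gauss(0.0, NOISE_STD)

# Compare the error of a single student vs. a 4-student average, over many trials.
trials = 100_000
single_err = sum((noisy_estimate() - SIGNAL) ** 2 for _ in range(trials)) / trials
committee_err = sum(
    (sum(noisy_estimate() for _ in range(K)) / K - SIGNAL) ** 2
    for _ in range(trials)
) / trials

print(f"single student mean-squared error:  {single_err:.4f}")    # ~0.25
print(f"committee (K=4) mean-squared error: {committee_err:.4f}")  # ~0.0625, about 4x smaller
```

The committee's error shrinks by roughly the committee size, which is exactly the "mistakes cancel out" effect described above.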

How CAPO Works in Real Life

  1. Gather Data: The robot takes a walk and records what happened.
  2. Split the Team: Instead of training one brain, the computer creates 4 copies of the robot's brain.
  3. Shuffle the Deck: Each copy looks at the same walk data, but they read it in a different random order (like shuffling a deck of cards). This makes them think slightly differently.
  4. Train: Each copy tries to learn from the data.
  5. Vote: The computer takes the 4 brains and blends them together into one "Consensus Brain."
    • If the "Waste" (noise) was different for each copy, it cancels out.
    • If the "Signal" (good learning) was the same, it gets stronger.
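
The five steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the paper's implementation: PPO is replaced by plain SGD on a toy regression problem, `train_replica` is a hypothetical helper name, and the "Consensus Brain" is assumed to be a uniform average of the replicates' parameters.

```python
import random

random.seed(1)

# Step 1: one batch of "rollout" data. Toy stand-in: noisy samples of y = 2x.
data = [(x, 2.0 * x + random.gauss(0.0, 0.3)) for x in [i / 10 for i in range(20)]]

def train_replica(data, seed, epochs=40, lr=0.05):
    """Train one copy of the 'brain' (a single weight w) on the SAME data,
    but with a replica-specific minibatch shuffle (step 3)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)                 # each copy reads a different order
        for x, y in shuffled:
            w -= lr * 2 * (w * x - y) * x     # step 4: gradient step on squared error
    return w

# Step 2: four copies trained from the same data; step 5: average ("vote").
replicas = [train_replica(data, seed=s) for s in range(4)]
consensus = sum(replicas) / len(replicas)

print("replica weights:", [round(w, 3) for w in replicas])
print("consensus weight:", round(consensus, 3))  # close to the true value 2.0
```

Each replica ends up slightly different because of its shuffle order; the averaging step keeps what they agree on and blends away the order-dependent wobble.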

Why is this a Big Deal?

  • No Extra Walking: The robot doesn't have to walk any further to learn this. It uses the exact same amount of data.
  • Massive Gains: On simple tasks, it was 2x better. On a very complex task (teaching a human-sized robot to stand up), it was 8.6 times better than the old method!
  • Efficiency: It's like hiring a team of experts instead of overworking one person. It's faster to get a good answer by asking four people once than asking one person four times.

The Bottom Line

The paper teaches us a simple lesson for AI (and maybe for us too): Don't just drill the same thing over and over until you get confused.

Instead, try looking at the same problem from four different angles, listen to four different opinions, and then find the middle ground. By going wider (more perspectives) instead of deeper (more repetition), you get smarter, faster, and with less wasted effort.
