Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

This paper argues that large pretrained models contain a dense distribution of task-specific experts near their initial weights. This enables a simple, parallel post-training method that samples random perturbations and ensembles the best performers, achieving results competitive with standard optimization techniques like PPO and GRPO.

Yulu Gan, Phillip Isola

Published 2026-03-13

The Big Idea: From "Needle in a Haystack" to a "Thicket of Experts"

Imagine you have a giant library of books (the AI model).

  • Small Models are like a tiny, messy shed. If you want to find a book on "How to bake a cake," you have to search through every single shelf, page by page, using a very smart map (gradient descent). The right book is a needle in a haystack, and you need a smart search algorithm to find it.
  • Large Models are like a massive, sprawling forest. The paper argues that once a model is big enough and well-trained, the "right answers" aren't hidden anymore. Instead, they are everywhere, like a thicket of bushes. If you just walk randomly into the forest, you are almost guaranteed to bump into a bush that has the answer you need.

The authors call this phenomenon "Neural Thickets."


The Problem: Why "Random Guessing" Usually Fails

For decades, scientists believed that if you wanted to teach an AI a new skill (like math or coding), you had to use a slow, step-by-step learning process called Gradient Descent. This is like a hiker carefully climbing a mountain, checking every step to make sure they are going uphill.

The old thinking was: "Randomly changing the AI's brain (weights) is useless. The chance of guessing a smart brain by accident is zero."

The Discovery: The "Thicket" Regime

The researchers found that for large, pre-trained models, the landscape has changed.

  • The Old View: The model is sitting on a flat plateau. To get better, you have to climb a specific, narrow path.
  • The New View: The model is sitting in a valley surrounded by a dense forest of "experts."
    • Some bushes are experts at math.
    • Some are experts at writing stories.
    • Some are experts at chemistry.
    • Crucially: These experts are different. One bush might be great at math but terrible at chemistry. Another might be the opposite.

Because these "expert bushes" are so dense, you don't need a smart map. You can just throw darts at the wall (randomly tweak the model's brain), and you will likely hit a bush that is an expert at something.
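The "different experts for different tasks" claim can be illustrated with a toy sketch (purely illustrative, not the paper's experiment). Here a "model" is just a 2-number weight vector, the two "tasks" are hypothetical targets, and random Gaussian perturbations of the base weights turn out to contain a good specialist for each task, but not the same one:

```python
import random

random.seed(42)

BASE = [0.0, 0.0]            # the "pretrained" weights
MATH_TARGET = [1.0, 0.0]     # hypothetical "math expert" weights
CHEM_TARGET = [0.0, 1.0]     # hypothetical "chemistry expert" weights

def score(weights, target):
    """Higher is better: negative squared distance to the expert weights."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

# Walk randomly into the forest: sample many perturbations of the base.
samples = [[w + random.gauss(0, 1) for w in BASE] for _ in range(2000)]

# Each task finds its own best "bush" among the random samples.
best_math = max(samples, key=lambda s: score(s, MATH_TARGET))
best_chem = max(samples, key=lambda s: score(s, CHEM_TARGET))

# The best "math" perturbation is a poor "chemistry" one, and vice versa:
# the experts found by random search are distinct specialists.
```

With enough samples, random search finds a near-expert for each task without any gradient information; the two winners sit in different parts of the thicket.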

The Solution: "RandOpt" (Random Optimization)

Based on this discovery, the authors created a new, super-simple method called RandOpt. Here is how it works, using a Talent Show analogy:

  1. The Casting Call (Random Guessing): Instead of training one actor for months, the director hires 5,000 people at once and gives each one a tiny, random tweak to their personality.
  2. The Audition (Evaluation): They all try to solve a math problem.
  3. The Selection (Top K): The director picks the top 50 people who got the answer right.
  4. The Ensemble (The Group Vote): Instead of picking just one "winner," the director puts those 50 people in a room and asks them to vote on the final answer.
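The four steps above can be sketched in a few lines (a minimal toy, not the authors' code: the "model" is a weight vector, the "audition" is a toy scoring function, and the ensemble averages weights rather than voting over generated answers, which is what a language model would do):

```python
import random
import statistics

random.seed(0)

BASE = [0.0, 0.0, 0.0]             # pretrained starting weights
EXPERT = [0.5, -0.3, 0.8]          # hypothetical weights that solve the task
NUM_SAMPLES = 5000                 # step 1: the casting call
TOP_K = 50                         # step 3: the selection

def score(weights):
    """Step 2, the audition: higher is better (toy stand-in for task reward)."""
    return -sum((w - t) ** 2 for w, t in zip(weights, EXPERT))

def perturb(base, scale=1.0):
    """Step 1: a tiny, random tweak around the pretrained weights."""
    return [w + random.gauss(0, scale) for w in base]

# Steps 1-2: sample candidates and evaluate them. Note there is no loop of
# sequential updates; every candidate can be scored in parallel.
candidates = [perturb(BASE) for _ in range(NUM_SAMPLES)]

# Step 3: keep the top-K performers.
experts = sorted(candidates, key=score, reverse=True)[:TOP_K]

# Step 4: the group vote. Here we average the experts' weights; an LLM
# ensemble would instead take a majority vote over generated answers.
ensemble = [statistics.mean(ws) for ws in zip(*experts)]

# The ensemble scores far better than the untouched base model.
```

The key structural point is that steps 1 and 2 are embarrassingly parallel: all 5,000 candidates can be scored simultaneously, unlike gradient descent, where each update must wait for the previous one.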

Why this is amazing:

  • Speed: Traditional training (like PPO or GRPO) is like a relay race where runners pass a baton one by one. It takes a long time. RandOpt is like a sprint where 5,000 people run at the exact same time. It finishes in O(1) sequential steps (one round of sampling, scoring, and selecting), regardless of how complex the task is.
  • Efficiency: It uses less computing power (FLOPs) than traditional methods to get the same or better results.
  • Diversity: Because the "thicket" is full of different specialists, the group vote combines the best parts of many different "brains."

The Catch: "Sandbagging" vs. Real Skills

You might ask: "Did the AI just get lucky? Maybe it was pretending to be bad before (sandbagging) and now it's showing its true skills?"

The authors tested this. They found that while some of the improvement comes from fixing formatting (like putting the answer in the right box), a huge chunk comes from actual reasoning. The random tweaks helped the model solve problems it couldn't solve before. It wasn't just a formatting fix; the model actually learned to think differently.

The "Distillation" Trick

One downside of RandOpt is that at the end, you have to run 50 different models to get the final answer (the "Ensemble"). That's slow for a user.

  • The Fix: The authors showed you can take those 50 "expert" models and teach a single, smaller model to mimic them. This is called Distillation. It's like taking the notes from 50 experts and writing one perfect textbook. Now you have the speed of a single model with the smarts of 50.
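The distillation step can also be sketched as a toy (illustrative only; real distillation would train a student network on the ensemble's generated answers, not fit a linear model). Here 50 hypothetical "expert" weight pairs define a teacher ensemble, and a single student is trained by gradient descent to mimic the ensemble's average prediction:

```python
import random

random.seed(1)

# 50 "experts": small random tweaks around a good linear solution.
experts = [[0.5 + random.gauss(0, 0.05), -0.3 + random.gauss(0, 0.05)]
           for _ in range(50)]

def ensemble_predict(x):
    """Teacher: the 'group vote' — average the 50 experts' predictions."""
    return sum(slope * x + bias for slope, bias in experts) / len(experts)

# Student: one linear model trained to mimic the ensemble's answers.
student = [0.0, 0.0]
lr = 0.1
for _ in range(200):
    x = random.uniform(-1, 1)
    err = (student[0] * x + student[1]) - ensemble_predict(x)
    student[0] -= lr * err * x   # gradient step on the slope
    student[1] -= lr * err       # gradient step on the bias

# The student now answers in one forward pass instead of fifty,
# while closely matching the ensemble's behavior.
```

After training, querying the student costs one model call rather than fifty, which is the whole point of the distillation trick.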

Summary: What Does This Mean for the Future?

  1. Pre-training is King: If you train a model well enough on a lot of data, it naturally develops a "thicket" of solutions inside its brain.
  2. Post-training is Easy: Once you have a good base model, you don't need complex, slow algorithms to teach it new things. You can just sample randomly and pick the best ones.
  3. Parallelism is the Future: Instead of one brain thinking hard, it's better to have 5,000 brains thinking in parallel and voting on the answer.

In a nutshell: Large AI models are so rich in knowledge that they are surrounded by a forest of experts. You don't need to be a genius to find them; you just need to walk randomly into the forest, pick the best 50, and let them vote.