Bayesian Lottery Ticket Hypothesis

This paper demonstrates that the Lottery Ticket Hypothesis extends to Bayesian neural networks: sparse subnetworks can match or exceed the accuracy of dense models when pruned primarily by weight magnitude and secondarily by standard deviation. The paper also explores the interplay between mask structure and weight initialization.

Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus

Published 2026-02-24

The Big Picture: Finding the "Golden Ticket" in a Noisy World

Imagine you are trying to teach a robot to recognize cats in photos. You have two ways to do this:

  1. The Standard Way (Deterministic): You give the robot a fixed set of rules. It learns, makes a guess, and says, "That's a cat!" It's fast and efficient, but it doesn't know how sure it is. If it's wrong, it might still sound very confident.
  2. The Bayesian Way: You give the robot a set of rules that are more like "educated guesses." Instead of saying "This is a cat," it says, "I'm 90% sure this is a cat, but there's a 10% chance it's a dog." This is great for safety-critical tasks (like self-driving cars), but it's much slower and heavier because the robot has to carry around all these extra "what-if" scenarios.

The Problem: Bayesian robots are too heavy to run on normal computers. They need supercomputers.

The Goal: The researchers wanted to see if we could find a "Lite Version" of these Bayesian robots—a tiny, sparse version that is just as smart and safe, but runs fast. They wanted to see if the Lottery Ticket Hypothesis works for these fancy robots.


What is the "Lottery Ticket Hypothesis"?

Think of a massive, dense neural network like a giant, crowded orchestra with 10,000 musicians.

  • The Hypothesis: The researchers believe that inside this huge orchestra, there is a tiny, secret group of just 50 musicians (a "Lottery Ticket") who, if they started playing from the very beginning with the exact same sheet music (initialization), could play the symphony just as beautifully as the full 10,000-person orchestra.
  • The Catch: To find this group, you usually have to train the whole orchestra, fire the musicians who aren't playing well, reset the remaining ones to their original starting notes, and try again. It's a lot of work.

What Did This Paper Do?

The team asked: "Does this 'secret group' exist in the heavy, Bayesian robots too?"

They took three types of AI models (ResNet, VGG, and Vision Transformers) and turned them into Bayesian versions. Then, they tried to find these "Lottery Tickets" using a process called Iterative Magnitude Pruning (IMP).

The Process (The "Cut and Reset" Game):

  1. Train: Let the Bayesian robot learn.
  2. Prune: Cut out the "weakest" connections. In a Bayesian robot, a connection isn't just a number; it's a number plus a measure of uncertainty (how shaky the robot is about that number).
  3. Reset: Take the remaining connections and reset them to their very first random values.
  4. Repeat: Do this over and over until the robot is tiny (very sparse).
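The "Cut and Reset" loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `train` is a placeholder that just perturbs the surviving weights, and the 20% prune fraction per round is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(weights, mask):
    """Placeholder for a real training loop: perturb the surviving
    weights a little to stand in for learning."""
    return (weights + 0.1 * rng.standard_normal(weights.shape)) * mask

def imp(init_weights, rounds=3, prune_frac=0.2):
    """Iterative Magnitude Pruning: train, prune the smallest-magnitude
    surviving weights, reset survivors to their initial values, repeat."""
    mask = np.ones_like(init_weights)
    weights = init_weights.copy()
    for _ in range(rounds):
        weights = train(weights, mask)                      # 1. train
        alive = weights[mask == 1]
        threshold = np.quantile(np.abs(alive), prune_frac)  # 2. prune weakest
        mask[np.abs(weights) < threshold] = 0
        weights = init_weights * mask                       # 3. reset to init
    return mask, weights                                    # 4. (repeat)

init = rng.standard_normal((8, 8))
mask, ticket = imp(init)
print(f"sparsity: {1 - mask.mean():.2f}")
```

Each round removes roughly 20% of the remaining connections, so after three rounds about half the network is gone while the survivors still hold their original starting values.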

The Key Findings (The "Aha!" Moments)

1. The Lottery Ticket Exists!

Just like in standard robots, they found that even in these heavy Bayesian robots, there are tiny, sparse sub-networks that can learn just as well as the big, heavy ones. You don't need the whole brain to do the job; you just need the right "neurons."

2. How to Cut the Cake (Pruning Strategy)

When deciding which connections to cut, the researchers tested different rules:

  • Rule A: Cut the ones with the smallest numbers, keep the biggest (Magnitude).
  • Rule B: Cut the ones that are "noisy" or uncertain (High Standard Deviation).
  • Rule C: Cut the ones that are both small and noisy.

The Verdict: The best strategy was surprisingly simple. Just look at the size of the numbers (Magnitude). You don't need to overcomplicate it by looking at the "uncertainty" too much. If a number is tiny, cut it. If it's big, keep it.
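The three rules can be compared on toy data, assuming each Bayesian connection is summarized by a posterior mean `mu` and a standard deviation `sigma`. The score names and the 50% keep fraction here are illustrative choices, not the paper's exact criteria.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.standard_normal(1000)         # posterior means of 1000 connections
sigma = rng.uniform(0.01, 0.5, 1000)   # posterior std devs (uncertainty)

def prune_mask(score, keep_frac=0.5):
    """Keep the connections with the highest score; zero out the rest."""
    threshold = np.quantile(score, 1 - keep_frac)
    return (score >= threshold).astype(float)

# Rule A: keep large-magnitude means (cut small |mu|)
mask_a = prune_mask(np.abs(mu))
# Rule B: keep low-uncertainty connections (cut high sigma)
mask_b = prune_mask(-sigma)
# Rule C: keep connections that are neither small nor noisy (|mu| / sigma)
mask_c = prune_mask(np.abs(mu) / sigma)

for name, m in [("magnitude", mask_a), ("std dev", mask_b), ("combined", mask_c)]:
    print(f"{name}: kept {int(m.sum())} of {m.size}")
```

The verdict in the paper is that Rule A alone, the cheapest score to compute, is the one worth using.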

3. The "Transplant" Trick (The Best Part)

This is the most exciting discovery. Finding a Bayesian Lottery Ticket is expensive because you have to train the heavy robot many times.

  • The Idea: What if we find the "Golden Ticket" in a standard (lightweight) robot first? Then, we take that specific pattern of connections (the mask) and transplant it into the heavy Bayesian robot?
  • The Result: It works! The transplanted Bayesian robot performs almost as well as if it had been trained from scratch, but it saves 50% of the computing time.
  • Analogy: Imagine you want to build a high-tech, solar-powered house (Bayesian). Building it from scratch takes forever. Instead, you find a perfect blueprint for a regular house (Standard), copy the layout, and then just upgrade the materials to solar. You get the high-tech house much faster.
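The transplant mechanics can be sketched as: find a sparse mask on a cheap deterministic model, then reuse it to zero out both the means and the standard deviations of the Bayesian model. Everything below is a stand-in (random matrices instead of trained networks); it only shows how a mask is carried over.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: find a sparse mask on a cheap deterministic model
# (stand-in: prune the smallest-magnitude half of a weight matrix).
det_weights = rng.standard_normal((16, 16))
threshold = np.quantile(np.abs(det_weights), 0.5)
mask = (np.abs(det_weights) >= threshold).astype(float)

# Step 2: transplant the mask into the Bayesian model, which carries a
# mean AND a standard deviation for every connection.
bayes_mu = rng.standard_normal((16, 16)) * mask     # pruned means vanish...
bayes_sigma = np.full((16, 16), 0.1) * mask         # ...and so does their uncertainty

print(f"transplanted sparsity: {1 - mask.mean():.2f}")
```

The expensive part, repeatedly training the Bayesian model to discover the mask, is replaced by a single cheap search on the deterministic model, which is where the reported ~50% compute saving comes from.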

4. Architecture Matters

  • Convolutional Models (ResNet/VGG): These are like traditional brick-and-mortar buildings. They are stable. Even if you shuffle the rooms around a bit, the house still stands.
  • Transformers (ViT): These are like complex, modern glass structures. They are very sensitive. If you move the wrong beam (weight), the whole thing collapses. For these models, you must keep the exact original starting weights to get a winning ticket.

Why Does This Matter?

  1. Saves Money and Energy: Bayesian AI is usually too expensive for regular use. This paper shows we can make them tiny and fast without losing their "safety" features (uncertainty quantification).
  2. Better Safety: We can now run these "safe" AI models on regular laptops or phones, not just supercomputers.
  3. Smarter Training: We don't need to train the heavy models from scratch every time. We can "transplant" good patterns from simpler models to jumpstart the heavy ones.

Summary in One Sentence

The researchers proved that even the heavy, complex "uncertainty-aware" AI robots have hidden, tiny "super-teams" inside them, and we can find these teams by copying patterns from simpler robots, saving us a massive amount of computing power.
