On weight and variance uncertainty in neural networks for regression tasks

This paper extends the framework of weight uncertainty in Bayesian neural networks for regression by incorporating variance uncertainty, demonstrating that explicitly modeling a full posterior distribution over the variance significantly improves generalization performance across various network architectures and datasets.

Moein Monemi, Morteza Amini, S. Mahmoud Taheri, Mohammad Arashi

Published 2026-03-03

Imagine you are trying to teach a robot to predict the weather based on past data. You want the robot not just to say, "It will rain tomorrow," but to also tell you how sure it is about that prediction.

This paper is about teaching that robot to be smarter about its own confidence. Specifically, it tackles a problem where the robot was previously too confident in its guesses, even when the data was messy or confusing.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The Overconfident Robot

In the world of Artificial Intelligence (specifically "Bayesian Neural Networks"), there are two types of uncertainty:

  • Weight Uncertainty: The robot isn't sure about the rules it learned (e.g., "Does a dark cloud mean rain?").
  • Variance Uncertainty: The robot isn't sure about the "noise" or randomness in the data itself (e.g., "Sometimes it rains even when the sky is clear").

The Old Way (The "Fixed Variance" Model):
Previous methods (like the one by Blundell et al.) taught the robot to be unsure about its rules, but they forced it to assume the "noise" in the weather data was a fixed, known number.

  • The Analogy: Imagine a dart player who knows they are shaky with their hands (weight uncertainty), but they are told to assume the wind is always exactly 5 mph (fixed variance). If the wind actually blows at 20 mph, the player will still aim as if it's 5 mph, and they will miss the target. They will think they are aiming perfectly, but they are actually wrong.
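The dart-player intuition can be made concrete with the Gaussian negative log-likelihood that regression models of this kind typically minimize. This is a minimal sketch, not the paper's actual training objective (which also includes variational terms); the function name and numbers are illustrative:

```python
import math

def gaussian_nll(y, mu, sigma):
    # Negative log-likelihood of observation y under N(mu, sigma^2).
    # The first term penalizes claiming a lot of noise; the second
    # penalizes residuals, scaled by how much noise you admit to.
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

# A "fixed variance" model is stuck with one assumed sigma (the "5 mph wind").
# When the residual is large (windy day), admitting more noise scores better:
print(gaussian_nll(4.0, 0.0, 1.0))  # fixed, too-confident sigma
print(gaussian_nll(4.0, 0.0, 4.0))  # sigma matched to the actual spread
```

Running this shows the too-confident model pays a much higher loss on the noisy point; conversely, when residuals are small, a small sigma scores better, which is why letting the model learn sigma helps in both regimes.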

2. The Solution: Letting the Robot Learn the Noise

The authors of this paper said, "Why don't we let the robot learn how noisy the data is, just like it learns the rules?"

They introduced Variance Uncertainty. Now, the robot doesn't just guess the weather; it also guesses, "How chaotic is the weather today?"

  • The Analogy: Now, our dart player has a weather vane. If the wind is calm, they aim tightly. If the wind is howling, they widen their aim and say, "I'm not sure exactly where the dart will land, but I know it's going to be all over the place."

3. How They Did It (The "Magic Trick")

To make this work, they used a mathematical technique called Variational Bayes.

  • The Analogy: Imagine trying to find the best route through a foggy maze. You can't see the whole maze (the math is too hard to solve exactly). So, you send out 100 little scouts (samples) to explore different paths. You ask them, "How foggy is it here?" and "How far off course are we?"
  • The authors added a new scout specifically to measure the "fog" (the variance). They taught the robot to adjust its "fog meter" while it was learning the map. This is done using the "re-parameterization trick": instead of sampling directly from a distribution whose parameters are still being learned, you sample plain, fixed noise and then shift and scale it by those parameters, so ordinary gradient descent can flow through the randomness.
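The reparameterization trick above can be sketched in a few lines. This is a simplified illustration for a single weight, not the paper's full variational scheme (which applies the same idea to the noise variance as well); the names `sample_weight`, `mu`, and `rho` are assumptions of this sketch:

```python
import math
import random

def sample_weight(mu, rho):
    # Reparameterization trick: draw eps from a *fixed* standard normal,
    # then shift and scale it by the learnable parameters (mu, rho).
    # Because the randomness lives in eps, gradients can flow through
    # mu and rho as if the sample were a deterministic function of them.
    sigma = math.log1p(math.exp(rho))  # softplus keeps the std. dev. positive
    eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

# A very negative rho means near-zero sigma, so samples hug mu:
w = sample_weight(2.0, -20.0)
```

In training, one would draw such samples for every weight (and, in this paper's extension, for the noise variance) on each forward pass, then backpropagate through `mu` and `rho`.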

4. The Results: Why It Matters

The team tested this new "smart robot" on two challenges:

Challenge A: A Wiggly Line (Simulation)
They asked the robot to draw a complex, wiggly line based on scattered dots.

  • Old Robot: Drew a line that was close, but its "safety zone" (prediction interval) was too thin. It missed many of the actual dots because it didn't account for the messiness.
  • New Robot: Drew a line that was just as accurate, but its "safety zone" was wider and more realistic. It caught almost all the dots because it admitted, "Hey, this data is a bit messy, so I'll give a wider range of possibilities."

Challenge B: The Riboflavin Dataset (Real Life)
This was a real-world dataset about gene expressions (very complex, with thousands of features but very few samples). It's like trying to predict a person's health based on 4,000 different blood tests, but you only have data on 71 people.

  • The Result: The old robot was dangerously overconfident. Its 95% prediction intervals, which should contain the true value 95% of the time, actually contained it only about 72% of the time.
  • The New Robot: It realized, "Wow, this is a tiny dataset with huge complexity. I can't be sure!" So it widened its prediction intervals, and those wider intervals covered 100% of the test observations. It didn't just predict the answer; it predicted its own uncertainty correctly.
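The coverage numbers above (72% vs. the nominal 95%) come from a simple check: what fraction of held-out observations land inside their prediction intervals? Here is a minimal sketch of that check; the toy numbers are made up for illustration and are not the paper's data:

```python
def empirical_coverage(y_true, lower, upper):
    # Fraction of observations that fall inside their prediction interval.
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

y = [1.0, 2.0, 3.0, 10.0]

# Overconfident model: intervals too narrow, the outlier falls outside.
narrow = empirical_coverage(y, [0.5, 1.5, 2.5, 9.90], [1.5, 2.5, 3.5, 9.95])

# Variance-aware model: wider intervals catch every observation.
wide = empirical_coverage(y, [0.0, 1.0, 2.0, 8.0], [2.0, 3.0, 4.0, 12.0])
```

For a well-calibrated model, `empirical_coverage` on 95% intervals should come out near 0.95; much lower means overconfidence, exactly the failure mode the fixed-variance baseline showed on riboflavin.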

The Big Takeaway

Think of the old model as a confident driver who drives fast but ignores the rain. They might get there fast, but they crash often when conditions change.

The new model is a cautious, experienced driver. They know the road is slippery (variance uncertainty). They drive at a safe speed and keep a huge distance from the car in front. They might not always be the fastest, but they never crash when the unexpected happens.

In short: By teaching neural networks to be unsure about the "noise" in the data, not just the rules, the authors created a system that is much safer, more reliable, and better at handling real-world messiness.
