The Big Problem: The "Popular Kid" Syndrome
Imagine a school where 90% of the students are in the "Popular Club," and only a few students are in the "Art Club" or the "Gardening Club."
If you ask a new teacher (an AI) to learn about these clubs just by looking at the students in the hallway, the teacher will naturally assume that everyone is in the Popular Club. Why? Because they see 90% Popular kids and barely any Art or Gardening kids.
When the teacher is tested later, they will guess "Popular Club" for almost everyone. They will fail miserably at identifying the Art or Gardening students because they were biased by the crowd they saw during training.
In the world of AI, this is called Class Imbalance. The AI gets "stuck" thinking the common things (like "cat" or "car") are the only things that matter, and it ignores the rare things (like "rare disease" or "endangered animal").
The Old Solution: The "Static Rulebook"
For a long time, scientists tried to fix this by giving the AI a Rulebook.
- The Rule: "Hey AI, you saw 1,000 cats and only 10 tigers. So, when you see a tiger, you must boost your confidence score by 10 points."
This works okay, but it has a big flaw: The Rulebook is static.
- It relies on counting the students in the hallway before the class starts.
- If the class changes (e.g., a new group of tigers arrives, or the hallway layout changes), the old Rulebook becomes useless.
- Sometimes, the AI learns to "see" things differently during training (like seeing a tiger as a "striped cat"), making the original counts inaccurate.
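The static rulebook described above is usually implemented as "logit adjustment from counts": count each class before training, then subtract the log of those fixed frequencies from the model's raw scores. A minimal sketch (the counts and logits below are made-up illustration values, not numbers from the paper):

```python
import numpy as np

# The "static rulebook": class frequencies counted once, before training.
class_counts = np.array([1000, 10])  # 1,000 cats, 10 tigers
log_prior = np.log(class_counts / class_counts.sum())

def adjust(logits, tau=1.0):
    """Subtract the (scaled) log prior so rare classes get a boost."""
    return logits - tau * log_prior

raw = np.array([2.0, 1.5])  # the model's raw scores slightly favor "cat"
adjusted = adjust(raw)
print(np.argmax(raw), np.argmax(adjusted))  # prints "0 1": the guess flips to "tiger"
```

Note that `log_prior` is frozen at whatever was counted up front, which is exactly the flaw the bullets above describe: if the data distribution drifts, the rulebook never updates.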
The New Solution: The "Neural Prior Estimator" (NPE)
This paper introduces a new, smarter way to fix the bias. Instead of using a static Rulebook, they give the AI a Little Assistant (called the Prior Estimation Module or PEM).
Here is how the NPE works, step-by-step:
1. The Little Assistant (PEM)
Imagine the main AI teacher is trying to solve a puzzle. The Little Assistant is a tiny, separate brain attached to the teacher.
- Its Job: It doesn't try to guess the answer (Cat vs. Tiger). Instead, it just watches the teacher's thought process (the "latent representations").
- How it learns: It uses a special trick called "One-Way Logistic Loss." Think of this as the Assistant only getting a "high five" when it correctly identifies how common a specific thought is.
- If the teacher thinks about "Cats" 1,000 times, the Assistant learns: "Oh, Cats are very common here."
- If the teacher thinks about "Tigers" only 10 times, the Assistant learns: "Oh, Tigers are very rare here."
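The paper's actual PEM and its "One-Way Logistic Loss" are not reproduced here; as a hedged stand-in, the sketch below shows the general idea of the two bullets above: an estimate of how common each class is, updated continuously from the training stream (here via a simple exponential moving average over simulated labels) rather than counted once up front.

```python
import numpy as np

# Illustrative stand-in for the assistant's job, NOT the paper's exact method:
# keep a running estimate of how often each class appears during training.
rng = np.random.default_rng(0)

num_classes = 2
est_prior = np.full(num_classes, 1.0 / num_classes)  # start with a uniform guess
momentum = 0.99

# Simulate 1,000 training steps where class 0 ("cat") shows up ~99% of the time.
for _ in range(1000):
    label = rng.choice(num_classes, p=[0.99, 0.01])
    one_hot = np.eye(num_classes)[label]
    est_prior = momentum * est_prior + (1 - momentum) * one_hot

print(est_prior)  # drifts toward the true ratio, roughly [0.99, 0.01]
```

Because the estimate is updated every step, it tracks the data distribution as it actually arrives, which is what lets the next step's correction stay current instead of relying on stale counts.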
2. The Magic of "Neural Collapse"
The paper proves mathematically that if you let this Assistant train alongside the teacher, it naturally figures out the exact ratio of how often each class appears, even without counting them manually. It learns the "vibe" or the "density" of the data directly from the AI's own brain.
3. The Correction (NPE-LA)
Once the Little Assistant knows the true ratios, it whispers a correction to the main teacher right before the final answer is given.
- The Whisper: "Wait! You are about to guess 'Cat' again. But remember, the Assistant says 'Cat' is super common, so you need to lower your confidence slightly. And 'Tiger' is rare, so boost your confidence!"
This is called Logit Adjustment. It dynamically shifts the AI's confidence based on what the Little Assistant learned during training, not on what was written in a pre-made book.
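The "whisper" step can be sketched in one line: subtract the log of the *learned* prior from the classifier's raw scores. Here `est_prior` is a placeholder for whatever the assistant module has estimated, and the logit values are made up for illustration:

```python
import numpy as np

def npe_logit_adjust(logits, est_prior, tau=1.0):
    """Dynamic logit adjustment: dampen common classes, boost rare ones,
    using the assistant's learned estimate instead of pre-counted data."""
    return logits - tau * np.log(est_prior)

est_prior = np.array([0.99, 0.01])  # assistant's estimate: cats common, tigers rare
logits = np.array([3.0, 2.0])       # raw scores slightly favor "cat"
adjusted = npe_logit_adjust(logits, est_prior)
print(np.argmax(adjusted))  # prints "1": the rare class wins after adjustment
```

The formula is the same as the static rulebook's; the difference is where the prior comes from. Swap in a fresher `est_prior` and the correction updates automatically.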
Why is this better? (The Creative Metaphors)
The Dynamic GPS vs. The Static Map:
- Old Way: Like using a paper map from 1990. It tells you where the roads used to be. If a road is closed or a new one opened, you get lost.
- NPE Way: Like a live GPS that updates in real-time. It sees the traffic (the data distribution) as it happens and reroutes the AI's decisions instantly.
The "Crowded Room" Analogy:
- Imagine you are in a room where 99 people are shouting "Apple!" and 1 person is whispering "Pear."
- Old AI: "I hear 'Apple' 99 times, so I bet the answer is Apple."
- NPE AI: The Little Assistant listens to the volume of the whispers. It realizes, "Hey, the 'Pear' whisper is being drowned out by the noise. I need to turn up the volume on 'Pear' so we don't miss it."
Does it work? (The Results)
The authors tested this on two types of tasks:
- Image Classification (CIFAR): Identifying objects in photos.
- Result: The AI got much better at spotting the "rare" objects (the tail classes) without forgetting the "common" ones, balancing accuracy across head and tail classes.
- Semantic Segmentation (STARE/ADE20K): Identifying pixels in complex images (like finding blood vessels in eyes or specific objects in a city scene).
- Result: Even though the AI's "eyes" (the main brain) were frozen and couldn't learn new things, the Little Assistant successfully corrected the AI's guesses, helping it find rare details like small blood vessels that it usually ignored.
The Bottom Line
The Neural Prior Estimator is a lightweight, smart add-on that teaches an AI to self-correct its own bias.
Instead of relying on a human to count the data and write a rulebook, the AI builds a tiny internal monitor that learns the true distribution of the world as it trains. This makes the AI fairer, more accurate on rare items, and adaptable to changing environments—all without needing to change the main AI's architecture or slow it down.
In short: It's the difference between a student who memorizes a static list of facts and a student who learns to feel the rhythm of the data and adjust their answers on the fly.