Imagine you are a teacher trying to teach a class of students (an AI model) how to recognize different animals. You have a few flashcards with pictures and names (labeled data), but you also have a huge pile of unmarked photos (unlabeled data).
The goal is for the teacher to look at the unmarked photos, guess what animal is in them, and then use those guesses to help the students learn. This is called Semi-Supervised Learning.
The Problem: The "Popular Kid" Bias
Here's the catch: In the real world, some animals are super common (like cats and dogs), while others are rare (like snow leopards).
In your class, you have 100 flashcards of cats and only 2 flashcards of snow leopards.
- The teacher starts by guessing the unmarked photos. Because there are so many cats, the teacher guesses "Cat" for almost everything.
- The students trust these guesses. They start thinking, "Oh, everything is a cat!"
- The rare animals (snow leopards) get completely ignored. The teacher's bias gets worse and worse, and the students fail to learn the rare animals entirely.
This is the Class Imbalance problem. The AI gets really good at the common stuff but terrible at the rare stuff.
The Solution: The "Class Census"
The authors of this paper came up with a clever, lightweight fix. They realized that even if you only have a few flashcards, you can still count them to get a rough idea of the global ratio.
- "Okay, we have 100 cats and 2 snow leopards. So, in the whole world, for every 50 cats, there's roughly 1 snow leopard."
They call this the Label Proportion Prior. It's like having a "Class Census" that tells the teacher, "Hey, don't guess 'Cat' for everything! Remember, the real world has a specific mix of animals."
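Getting this "Census" is genuinely cheap: you just count the labeled examples you already have. Here's a minimal sketch of that counting step (the class names and counts are illustrative, not taken from the paper):

```python
import numpy as np

# Estimate the label-proportion prior ("Class Census") by counting
# the labeled flashcards. 0 = cat, 1 = snow leopard (illustrative).
labels = np.array([0] * 100 + [1] * 2)
num_classes = 2

counts = np.bincount(labels, minlength=num_classes)  # [100, 2]
prior = counts / counts.sum()                        # ~[0.98, 0.02]
```

That two-line `bincount` is the whole prior; no extra data or training is needed, which is why the fix is so lightweight.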
How It Works: The "Proportion Loss"
The paper introduces a new rule called Proportion Loss. Think of this as a strict supervisor standing next to the teacher.
Every time the teacher makes a batch of guesses on the unmarked photos, the supervisor checks the math:
- Teacher: "I guessed 100% of these are cats and 0% are snow leopards."
- Supervisor: "Wait a minute! The Census says it should be 98% cats and 2% snow leopards. You are over-guessing cats and under-guessing snow leopards. You need to adjust your guesses to match the Census."
This forces the AI to stop ignoring the rare animals. It acts like a regularizer—a gentle hand guiding the model back to the truth, ensuring it doesn't get too obsessed with the majority.
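The supervisor's check can be sketched in a few lines. One common way to penalize a mismatch between the batch-averaged predictions and the prior is a KL divergence; the paper's exact formulation may differ, so treat this as an illustrative stand-in:

```python
import numpy as np

def proportion_loss(probs, prior, eps=1e-8):
    """KL(prior || mean batch prediction): one plausible form of a
    proportion loss. `probs` is (batch, classes) of softmax outputs."""
    batch_mean = probs.mean(axis=0)  # the teacher's average guess per class
    return float(np.sum(prior * np.log((prior + eps) / (batch_mean + eps))))

# Toy check: a 98/2 Census vs. a batch of over-confident "cat" guesses.
prior = np.array([0.98, 0.02])
biased_batch = np.tile([0.999, 0.001], (8, 1))   # over-guessing cats
matched_batch = np.tile(prior, (8, 1))           # matches the Census

# The loss is near zero when the batch matches the Census and grows
# as the guesses drift away from it.
```

Minimizing this term nudges the average prediction back toward the Census, which is exactly the regularizing "gentle hand" described above.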
The Twist: Dealing with "Small Batches"
There's a tricky part. The teacher doesn't look at the whole world at once; they look at small groups of photos (mini-batches) one by one.
Sometimes, by pure luck, a small group might have 5 snow leopards and 5 cats. If the teacher tries to force that specific small group to match the global Census exactly, they might get confused or "overfit" (memorize the wrong pattern). It's like trying to force a single classroom to perfectly reflect the demographics of the entire planet; sometimes a classroom just happens to have more boys than girls by chance.
To fix this, the authors created a Stochastic Variant.
Instead of saying, "You must match the Census exactly right now," they say, "The Census says 2% snow leopards, but because you are looking at a small group, it's fine if your guess lands around 2%, a little higher or lower; just don't stray wildly."
They use a mathematical tool (a hypergeometric distribution) to simulate this natural "wobble" in small groups. This keeps the training stable and prevents the AI from panicking over tiny fluctuations.
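You can see this wobble directly by drawing mini-batches from a finite population with a hypergeometric distribution. The numbers below are illustrative, not taken from the paper:

```python
import numpy as np

# If the world is 2% snow leopards, how many land in a random
# mini-batch of 64? A hypergeometric draw from a finite population
# answers exactly that.
rng = np.random.default_rng(0)
population, rare_frac, batch = 10_000, 0.02, 64
n_rare = int(population * rare_frac)  # 200 rare photos in the pool
draws = rng.hypergeometric(n_rare, population - n_rare, batch, size=100_000)

mean_frac = draws.mean() / batch  # ~0.02 on average across batches
# But individual batches vary a lot: many contain 0 rare photos,
# some contain 4 or more. The stochastic variant tolerates this
# natural spread instead of forcing every batch to hit exactly 2%.
```

Because the long-run average matches the Census while any single batch can wobble, the model is never punished for ordinary sampling luck, which is what keeps training stable.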
The Results: A Fairer Classroom
When they tested this on a famous dataset (CIFAR-10) where some classes were very rare:
- Without the fix: The AI ignored the rare classes.
- With the fix: The AI started recognizing the rare classes much better, without losing its ability to recognize the common ones.
It worked especially well when the teacher had very few flashcards to start with (scarce labels). It was like giving the teacher a cheat sheet that said, "Don't forget the rare animals," which made the whole class smarter and more balanced.
In Summary
This paper is about teaching an AI to be fair. When an AI sees too many examples of one thing, it naturally ignores the rare things. This new method acts like a global compass, constantly reminding the AI of the true balance of the world, ensuring that the rare and the common are both treated with respect.