Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning

Imagine you are a teacher trying to teach a student a new language. You start with Task 1: teaching them 10 words about "Animals." The student learns them well.

Then, you move to Task 2: teaching them 10 words about "Fruits."
Then Task 3: "Vegetables."

In the world of Artificial Intelligence, this is called Class-Incremental Learning (CIL). The problem is that as the teacher introduces new words (Fruits, Vegetables), the student starts to forget the old ones (Animals). This is called Catastrophic Forgetting.

Even worse, the student starts to guess "Fruit" for everything, even when looking at a picture of a dog. They have a bias toward the new things they just learned.

The Old Way of Fixing It

For a long time, researchers thought the problem was simple: "We just have too many pictures of Fruits and Vegetables right now, and not enough of Animals."

So, their solution was like a referee blowing a whistle at the end of the game to tell the student, "Hey, don't guess Fruit so much! Be fair!" They adjusted the final answer sheet (the classifier) to force a balance.

The New Discovery: The "Time" Problem

This paper argues that the old solution is missing the real culprit. It's not just about how many pictures you have; it's about when you saw them.

The authors call this Temporal Imbalance.

Here is the analogy:
Imagine the student is studying for a final exam that is still months away.

Class A (Animals): They studied this 6 months ago. Since then, they haven't seen a single animal picture. Every time they take a practice test, they see a picture of a car, a tree, or a fruit. The teacher keeps saying, "No, that's not an animal!" over and over again. The student gets beaten down by constant "No's" (Negative Supervision).
Class B (Fruits): They just studied this yesterday. They see fruit pictures every day. The teacher says, "Yes, that's a fruit!" constantly.

Even if the student has seen the same number of Animal and Fruit pictures in total, the Animal class has been under constant attack by "No's" for months, while the Fruit class has been under constant praise.

By the time the final exam comes, the student is terrified to guess "Animal" because they've been punished for it so many times recently. They only guess "Fruit" because that's what they are currently being reinforced on.

The Solution: Temporal-Adjusted Loss (TAL)

The authors propose a new rule for the teacher called TAL (Temporal-Adjusted Loss).

Think of TAL as a smart memory filter or a volume knob for the teacher's voice.

Tracking the "Freshness": TAL keeps a score for every category (Animals, Fruits, etc.).
- If a category hasn't been seen in a while, its score drops (it has "low positive supervision").
- If a category is being seen right now, its score is high.
Turning Down the Volume on "No":
- When the teacher sees a picture of a Car and says, "This is NOT an Animal," TAL checks the Animal score.
- If the Animal score is low (because the student hasn't seen animals in a while), TAL says, "Wait, the student is already stressed about Animals. Don't yell 'NO' so loudly!" It turns down the volume of the negative feedback.
- If the Animal score is high (they just saw an animal), TAL says, "Okay, they are confident. You can yell 'NO' normally if they get it wrong."

Why This Matters

By turning down the "negative volume" for old, forgotten classes, the student doesn't get bullied into forgetting them. They stay confident enough to recognize an old friend (an old class) even when new friends (new classes) are walking around.

The Results

The paper shows that when they use this "smart volume knob" (TAL):

The student forgets much less.
They get better at recognizing both old and new things.
It works like a magic plug-in; you can add it to almost any existing AI system without rebuilding the whole thing.

In a Nutshell

Previous methods tried to fix the AI's bias by adjusting the final answer key. This paper realized the bias was actually caused by the timing of the lessons. Old classes get bullied by constant "No's" because they haven't been seen in a while.

TAL fixes this by whispering "No" gently to the old classes and shouting "No" normally to the new ones, keeping the student's memory balanced over time. It's like giving the old friends a little extra protection so they don't get pushed out by the new kids on the block.

1. Problem Statement

Class-Incremental Learning (CIL) aims to train models on a sequence of tasks where new classes are introduced over time, while old class data is largely unavailable. The core challenge is catastrophic forgetting, often manifested as a prediction bias toward new classes.

Existing View: Current literature attributes this bias primarily to intra-task class imbalance (new classes have more samples than old classes in the current task) and focuses on correcting the classifier head (e.g., via balanced fine-tuning or prototype-based classifiers).
The Gap: The authors argue that intra-task imbalance is an oversimplification. Even when old classes have equal sample counts in the current task, temporal imbalance exists.
- Temporal Imbalance Definition: Earlier classes receive stronger negative supervision (they are suppressed whenever they appear as a non-target class for another sample) toward the end of training, because their positive samples appeared early in the training timeline. Conversely, newer classes have received their positive reinforcement recently.
- Consequence: This leads to an asymmetry where earlier classes exhibit high precision but low recall (overly conservative predictions), while newer classes have lower precision but higher recall. This bias affects the entire model backbone, not just the classifier head.
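To make the precision-recall asymmetry concrete, here is a minimal sketch that computes per-class precision and recall on a toy prediction set. The helper name `per_class_precision_recall` and the toy labels are illustrative assumptions, not data from the paper; class 0 plays the role of an early class and class 1 a recent one.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, num_classes):
    """Return (precision, recall) arrays, one entry per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision = np.zeros(num_classes)
    recall = np.zeros(num_classes)
    for k in range(num_classes):
        true_pos = np.sum((y_pred == k) & (y_true == k))
        predicted = np.sum(y_pred == k)   # how often the model says "k"
        actual = np.sum(y_true == k)      # how often "k" is correct
        precision[k] = true_pos / predicted if predicted else 0.0
        recall[k] = true_pos / actual if actual else 0.0
    return precision, recall

# Toy data (hypothetical): the model rarely predicts the old class 0.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]
precision, recall = per_class_precision_recall(y_true, y_pred, num_classes=2)
print("precision:", precision)
print("recall:   ", recall)
```

In this toy case the old class is predicted only once, and correctly, so its precision is perfect while its recall is 0.25; the new class absorbs the missed samples, so its recall is perfect while its precision falls to 4/7.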

2. Methodology: Temporal-Adjusted Loss (TAL)

The authors propose Temporal-Adjusted Loss (TAL), a loss function that dynamically reweights negative supervision based on the temporal history of each class.

A. Temporal Supervision Modeling

The method tracks a Temporal Positive Supervision Strength ( $Q_k$ ) for each class $k$ .

Supervision Polarity: For a sample at step $n$ , the polarity $a_k[n]$ is $+1$ if the sample belongs to class $k$ (positive) and $-1$ otherwise (negative).
Memory Kernel: A time-decay memory kernel $f[n]$ (exponential decay $f[n] = \lambda^{n+1}$ ) is applied to the supervision sequence. Recent samples have higher influence than distant ones.
Calculation: $Q_k[N]$ is the convolution of the decay kernel and the supervision sequence up to step $N$.
- High $Q_k$ : The class has recent positive reinforcement.
- Low $Q_k$ : The class lacks recent positive reinforcement (suffering from negative pressure).
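The computation of $Q_k$ can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one sample per step, that the most recent step is paired with $f[0] = \lambda$, and that $Q_k$ starts at zero. Under those assumptions the convolution collapses to the one-line recursion $Q_k[N] = \lambda (Q_k[N-1] + a_k[N])$. The function names and the toy label stream are hypothetical.

```python
import numpy as np

def polarity_from_labels(labels, num_classes):
    """a_k[n] = +1 if the sample at step n belongs to class k, else -1."""
    labels = np.asarray(labels)
    a = -np.ones((len(labels), num_classes))
    a[np.arange(len(labels)), labels] = 1.0
    return a

def q_direct(polarity, lam):
    """Explicit convolution with the kernel f[n] = lam ** (n + 1)."""
    steps = polarity.shape[0]
    kernel = lam ** (np.arange(steps) + 1)
    # Reverse time so the most recent step is weighted by f[0].
    return kernel @ polarity[::-1]

def q_recursive(polarity, lam):
    """Same quantity, updated online with O(1) work per class per step."""
    q = np.zeros(polarity.shape[1])
    for a in polarity:
        q = lam * (q + a)
    return q

# Equal sample counts, but class 0 is seen early and class 1 is seen late.
labels = [0] * 5 + [1] * 5
a = polarity_from_labels(labels, num_classes=2)
q_conv = q_direct(a, lam=0.9)
q_online = q_recursive(a, lam=0.9)
print("direct:   ", q_conv)
print("recursive:", q_online)
```

Both classes receive five positives, yet the class whose positives arrive later ends with the higher $Q$, which is the temporal imbalance described above.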

B. The Loss Function

TAL modifies the standard Cross-Entropy (CE) loss. In standard CE, the softmax denominator is $e^{z_y} + \sum_{k \neq y} e^{z_k}$ , so every non-target logit contributes with equal weight. In TAL, the non-target terms are reweighted:

$\ell_{TAL} = -\log \left( \frac{e^{z_y}}{e^{z_y} + \alpha \sum_{k \neq y} w(Q_k) e^{z_k}} \right)$

Weight Function $w(Q_k)$ : Defined as $(Q_k / Q_{max})^r$ .
- If $Q_k$ is low (old class, weak recent positive signal), $w(Q_k) \to 0$ . The negative supervision for this class is attenuated, protecting it from being suppressed.
- If $Q_k$ is high (new class, strong recent signal), $w(Q_k) \to 1$ . The class remains sensitive to negative supervision.
Frequency Alignment ( $\alpha$ ): A scaling factor derived to ensure that under perfectly balanced and temporally uniform conditions, TAL degenerates to standard CE, ensuring stability.
Recursive Update: $Q_k$ is updated online after every batch using a recursive formula, so each class update costs $O(1)$ and depends only on the previous value of $Q_k$ (a Markovian property).
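The loss above can be sketched for a single sample in NumPy. Several details are assumptions made for illustration, because this summary does not specify them: $Q_{max}$ is taken as the largest entry of the $Q$ vector, the ratio $Q_k / Q_{max}$ is clipped to $[0, 1]$ so that negative $Q$ values do not break the power, and `alpha` defaults to 1.0 as a placeholder rather than the derived frequency-alignment factor. The names `tal_loss` and `ce_loss` are hypothetical.

```python
import numpy as np

def ce_loss(logits, target):
    """Standard cross-entropy for one sample."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                       # numerical stability
    return -(z[target] - np.log(np.exp(z).sum()))

def tal_loss(logits, target, q, r=1.0, alpha=1.0):
    """Sketch of TAL: non-target terms are scaled by alpha * w(Q_k)."""
    z = np.asarray(logits, dtype=float)
    q = np.asarray(q, dtype=float)
    if q.max() <= 0:
        raise ValueError("sketch assumes at least one positive Q value")
    z = z - z.max()                       # numerical stability
    # Assumption: clip the ratio so the weight stays in [0, 1].
    w = np.clip(q / q.max(), 0.0, 1.0) ** r
    exp_z = np.exp(z)
    neg_mask = np.ones_like(z, dtype=bool)
    neg_mask[target] = False              # sum runs over k != y only
    neg = alpha * np.sum(w[neg_mask] * exp_z[neg_mask])
    return -np.log(exp_z[target] / (exp_z[target] + neg))

logits = np.array([2.0, 1.0, 0.5])        # class 0 is a strong competitor
uniform_q = np.ones(3)                    # temporally balanced case
stale_q = np.array([0.1, 1.0, 1.0])       # class 0 lacks recent positives

loss_ce = ce_loss(logits, target=1)
loss_uniform = tal_loss(logits, target=1, q=uniform_q)
loss_stale = tal_loss(logits, target=1, q=stale_q)
print(loss_ce, loss_uniform, loss_stale)
```

With equal $Q$ values and `alpha=1.0` the sketch coincides with CE, and when class 0 has a low $Q$ its term in the denominator shrinks, so the loss pushes less strongly against its logit.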

3. Key Contributions

Identification of Temporal Imbalance: The paper formally defines and proves that prediction bias in CIL is driven by the temporal distribution of positive vs. negative supervision, independent of current task class imbalance.
Theoretical Framework: They establish a temporal supervision model and prove that under equal sample counts, classes with later-appearing positives attain higher $Q$ values, leading to the observed precision-recall asymmetry.
Proposed Solution (TAL): A plug-and-play loss function that dynamically adjusts negative supervision sensitivity based on a class's temporal status.
Theoretical Guarantees: The authors prove that TAL reduces to standard CE under balanced conditions and derive the steady-state properties of the supervision vector $Q$ .
Efficiency: The method adds negligible computational overhead (approx. 0.8% increase in training time) as it only requires vector operations on the logits.

4. Experimental Results

The authors evaluated TAL on CIFAR-100, ImageNet-100, and Food101 across multiple baselines (iCaRL, FOSTER, DER, MEMO, TagFex).

Performance Gains: TAL consistently improved both Average Accuracy (AMean) and Last Accuracy (ALast) across all datasets and baselines.
- Example: On CIFAR-100 (20-task), iCaRL+TAL achieved 58.68% AMean, surpassing FOSTER (56.99%) and MEMO (56.98%).
Mitigation of Bias: Visualizations showed that TAL significantly reduced the precision-recall asymmetry. Earlier classes saw a boost in recall without sacrificing precision, while newer classes maintained high performance.
Feature Space Stability: UMAP visualizations indicated that TAL prevents the feature regions of older classes from being "occupied" or mixed by newer classes, suggesting the correction happens at the representation level, not just the classifier.
Ablation Studies:
- $\lambda$ (Memory): Performance peaked around 0.995, and the results are reported as robust to this hyperparameter.
- $r$ (Steepness): Controls how sharply the weight changes with $Q$ . Higher $r$ provides stronger protection to old classes but slightly suppresses new classes.
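As a small worked illustration of the steepness parameter, the sketch below evaluates $w = (Q_k / Q_{max})^r$ for a class far below $Q_{max}$ and one close to it. The ratios 0.2 and 0.9 and the values of $r$ are arbitrary choices for illustration, not settings from the paper, and `negative_weight` is a hypothetical helper.

```python
import numpy as np

def negative_weight(q_ratio, r):
    """w = (Q_k / Q_max) ** r, with the ratio clipped to [0, 1]."""
    return np.clip(q_ratio, 0.0, 1.0) ** r

ratios = np.array([0.2, 0.9])   # a stale class and a recently seen class
weights = {r: negative_weight(ratios, r) for r in (0.5, 1.0, 2.0)}
for r, w in weights.items():
    print(f"r={r}: stale={w[0]:.3f}, recent={w[1]:.3f}")
```

Raising $r$ from 0.5 to 2 shrinks the stale class's weight from about 0.45 to 0.04, while the recent class's weight moves only from about 0.95 to 0.81.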
Generalization: TAL was also effective in Pre-trained Model (PTM) based CIL settings (without exemplars) and on Long-Tailed datasets, improving tail-class accuracy.
Standard Supervised Learning: Interestingly, even in standard supervised learning (no CIL), TAL slightly outperformed CE, suggesting it acts as a mild regularizer against subtle temporal biases within epochs.

5. Significance

Paradigm Shift: The paper shifts the focus from "class imbalance within a task" to "temporal imbalance across the training lifecycle." It argues that forgetting is a systemic issue caused by the temporal order of data, not just the quantity of data.
Architecture-Agnostic: TAL operates at the loss level and does not require modifying the network architecture, changing the classifier head, or adding complex post-processing steps such as weight alignment.
Practicality: It is a lightweight, plug-and-play solution that can be integrated into existing CIL frameworks with minimal computational cost, offering a robust way to achieve stable long-term learning.

In conclusion, this work provides a fundamental insight into the mechanics of catastrophic forgetting and offers a theoretically grounded, empirically validated, and computationally efficient solution to mitigate prediction bias in Class-Incremental Learning.