ACD-U: Asymmetric co-teaching with machine unlearning for robust learning with noisy labels

The paper proposes ACD-U, an asymmetric co-teaching framework that combines a CLIP-pretrained Vision Transformer with a CNN and incorporates machine unlearning to actively correct selection errors and achieve state-of-the-art robustness against noisy labels.

Reo Fukunaga, Soh Yoshida, Mitsuji Muneyasu

Published Tue, 10 Ma

Imagine you are training a team of two detectives to solve a mystery (classify images). However, the case files they are given contain a lot of fake clues (noisy labels). Some files say "This is a cat" when it's actually a dog.

If you let the detectives read these files blindly, they will eventually start believing the fake clues, memorizing them as facts. This ruins their ability to solve real cases later. This is the core problem the paper addresses: How do you train smart AI when the data is full of lies?

The authors propose a new system called ACD-U. Think of it as a clever training camp with two special rules: The "Different Personalities" Rule and the "Memory Eraser" Rule.

1. The "Different Personalities" Rule (Asymmetric Co-Teaching)

Most previous methods used two identical detectives (two identical AI models) to check each other's work. If both detectives agreed on a fake clue, they would both get tricked, and the error would stick forever.

ACD-U changes the team dynamic. Instead of two identical detectives, they hire two very different ones:

  • Detective V (The Veteran): This is a Vision Transformer (a type of AI) that has already read millions of books and seen the world before. It's like a seasoned expert who knows what a "cat" looks like immediately. Because it's so experienced, it's very confident and rarely gets confused by bad clues early on.
  • Detective A (The Apprentice): This is a CNN (a standard AI) starting from scratch. It's eager to learn but gets confused easily. It needs to be taught carefully.

How they work together:

  • The Veteran (V) only studies the clues it is highly confident are real (the "low-loss," likely-clean samples). It refuses to touch the messy, confusing files, acting instead as a stable anchor that teaches the Apprentice what "true" looks like.
  • The Apprentice (A) is allowed to look at everything, including the messy files, but it uses a special "semi-supervised" technique to guess the truth.
  • The Magic: Because they are so different, they rarely make the same mistake at the same time. If the Apprentice gets confused by a fake clue, the Veteran usually spots it and says, "No, that's wrong." This stops them from reinforcing each other's errors.
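The split described above rests on the classic "small-loss" heuristic: samples the confident, pretrained model fits with low loss are treated as likely clean, and only those are used for its own training. Here is a minimal, hypothetical sketch of that selection step — the function name, the clean fraction, and the toy losses are all illustrative, not the paper's exact procedure:

```python
def select_clean(losses, clean_fraction):
    """Small-loss selection: keep the fraction of samples with the
    lowest loss, treating them as likely clean. (Hypothetical helper;
    the paper's exact selection criterion may differ.)"""
    k = max(1, int(len(losses) * clean_fraction))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return ranked[:k]

# Toy example: 6 samples; indices 1 and 4 have suspiciously high loss,
# suggesting their labels are noisy ("fake clues").
losses_v = [0.10, 2.50, 0.20, 0.15, 3.00, 0.30]
clean_idx = select_clean(losses_v, clean_fraction=0.67)

# The Veteran would train only on clean_idx; the Apprentice sees every
# sample but handles the rest with semi-supervised pseudo-labels.
print(sorted(clean_idx))  # → [0, 2, 3, 5]
```

Because the Veteran and Apprentice have different architectures and different starting knowledge, their per-sample losses rank the data differently, which is what keeps them from agreeing on the same fake clue.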

2. The "Memory Eraser" Rule (Machine Unlearning)

Here is the paper's biggest breakthrough. Even with two great detectives, sometimes they still accidentally memorize a fake clue. In the past, once a detective memorized a lie, there was no way to fix it. The lie became part of their permanent memory.

ACD-U introduces a "Memory Eraser" (Machine Unlearning).

  • The Detective's Diary: The system keeps a diary of what the detectives thought at the start of the day.
  • The "Oops" Moment: Later in training, the system re-examines the clues. If a clue that used to be confusing suddenly becomes "easy" (low loss) yet contradicts what the Veteran believes (it checks facts against its pre-trained "CLIP" knowledge), the system concludes: "Wait, we just memorized a lie!"
  • The Eraser: Instead of ignoring the mistake, the system actively erases the influence of that specific clue from the detective's brain. It uses a mathematical "force" to push the detective's memory away from that fake clue, effectively saying, "Forget that you ever saw this."

This turns the process from passive (trying not to make mistakes) to active (finding mistakes and fixing them after they happen).
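The detect-then-erase loop above can be sketched in two steps: flag samples that the network now fits confidently but that disagree with the CLIP-based check, then apply a gradient-*ascent* update on those samples (a plain SGD step with the sign flipped). This is a simplified illustration under assumed names and a made-up threshold — the paper's actual unlearning objective is more involved:

```python
def flag_memorized(losses, model_preds, clip_preds, loss_threshold=0.1):
    """Flag samples the network fits confidently (low loss) whose
    predicted label contradicts CLIP's zero-shot label — the
    'memorized a lie' signal. Threshold and names are illustrative."""
    return [i for i, (l, m, c) in enumerate(zip(losses, model_preds, clip_preds))
            if l < loss_threshold and m != c]

def unlearn_step(weights, grad_on_flagged, lr=0.1):
    """Gradient *ascent* on the flagged samples: the sign is flipped
    relative to a normal SGD step, pushing the weights away from the
    memorized fake clue instead of toward it."""
    return [w + lr * g for w, g in zip(weights, grad_on_flagged)]

losses      = [0.05, 0.90, 0.04, 0.50]
model_preds = [1, 0, 2, 3]
clip_preds  = [1, 0, 0, 3]   # sample 2 is fit confidently but contradicts CLIP
flagged = flag_memorized(losses, model_preds, clip_preds)
print(flagged)  # → [2]

# The erase: nudge the weights against the gradient computed on sample 2.
updated = unlearn_step([1.0, -0.5], grad_on_flagged=[0.2, 0.2])
```

Note the only mechanical difference from ordinary training is the `+` in `unlearn_step`: the same gradient that wrote the fake clue into memory is used to write it back out.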

The Analogy: The Classroom

Imagine a classroom with a Professor (The Veteran ViT) and a Student (The Apprentice CNN).

  1. The Problem: The textbook has typos. The Student reads the typos and learns them.
  2. Old Method: Two students sit together. If they both read the typo, they convince each other it's correct.
  3. ACD-U Method:
    • The Professor has read the correct version of the book before class. He only teaches the Student the pages he knows are right.
    • The Student tries to learn from the whole book but listens to the Professor to correct himself.
    • The Twist: If the Student accidentally memorizes a typo, the Professor notices the Student is acting strangely compared to his own "perfect memory." The Professor then uses a special technique to make the Student unlearn that specific typo, wiping it from his short-term memory so he can learn the truth instead.

Why This Matters

  • It fixes the unfixable: Previous AI methods could only try to avoid mistakes. This method can find and fix mistakes even after they happen.
  • It works in chaos: It performs incredibly well even when the data is 80% or 90% wrong (high noise), which is a nightmare for other AI models.
  • It's efficient: By using a pre-trained "Professor" (CLIP) and a learning "Student," they cover each other's weaknesses.

In short: ACD-U is like a smart teacher who not only picks the best study materials but also has a magical eraser to wipe out any wrong facts the student accidentally memorized, ensuring the student learns the truth no matter how messy the source material is.