The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

This paper argues that mainstream Class Incremental Learning evaluation protocols are biased due to insufficient sequence sampling, and proposes EDGE, a new protocol that leverages inter-task similarity to identify extreme sequences for accurately characterizing the full performance distribution.

Guannan Lai, Da-Wei Zhou, Xin Yang, Han-Jia Ye

Published 2026-03-05

🚗 The Problem: The "Average" Driver is a Lie

Imagine you are hiring a driver for a very important job: navigating a city where the traffic rules change every day. Sometimes the streets are wide and empty; other times, they are narrow, filled with construction, and confusing.

In the world of Class Incremental Learning (CIL), the "driver" is an AI model, and the "traffic rules" are new categories of things it needs to learn (like recognizing a new type of animal or a new brand of car). The AI must learn these new things without forgetting the old ones.

The Current Mistake (The RS Protocol):
Right now, scientists test these AI drivers by asking them to drive on just three random routes.

  • Route A: Easy, sunny day.
  • Route B: A bit rainy.
  • Route C: Some traffic.

They then calculate the average speed of the driver across these three trips and say, "Great! This driver is safe and fast!"

The Paper's Warning:
The authors say this is a lie.
Just because a driver is good on three random routes doesn't mean they can handle every route. There might be a specific, nightmare route (a "Hard Sequence") where the driver gets completely lost and crashes. Conversely, there might be a "Dream Route" where they drive perfectly.

If you only look at the average, you might hire a driver who looks great on paper but fails catastrophically in a real-world emergency. The current method hides the worst-case scenarios and underestimates how much the driver's performance can swing.

🔍 The Solution: EDGE (The "Extreme" Test)

The authors propose a new way to test drivers called EDGE. Instead of picking three random routes, EDGE uses a smart strategy to find the three most important routes:

  1. The Nightmare Route (Hard Sequence): A route designed to be as confusing as possible. It forces the driver to make the hardest possible turns between similar-looking streets.
  2. The Dream Route (Easy Sequence): A route designed to be as smooth as possible, grouping similar streets together so the driver never gets confused.
  3. The Normal Route (Medium Sequence): A standard, random route.

How does it find these routes?
EDGE uses a "semantic map" of the categories (like a GPS that understands the meaning of street names): it measures how similar classes are to one another, then arranges them into tasks accordingly.

  • To make a Hard Route, it groups similar classes into the same task (e.g., teaching the AI to distinguish an Apple from a Pear in the same lesson). This is hard because the model must separate look-alikes while also retaining everything from earlier lessons.
  • To make an Easy Route, it spreads similar classes across distant tasks (e.g., Apples in Lesson 1 and Pears in Lesson 10). This is easy because each lesson's classes are visually distinct, so the model is rarely forced to tell look-alikes apart.
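The grouping strategy above can be sketched in a few lines. This is a hedged illustration, not the paper's exact EDGE procedure: it assumes each class comes with an embedding vector (its coordinate on the "semantic map"), and the function name, greedy ordering, and splitting scheme are invented for illustration.

```python
import numpy as np

def build_sequences(class_embeddings, n_tasks):
    """Illustrative sketch: arrange classes into tasks by similarity.

    'hard'   packs mutually similar classes into the same task,
    'easy'   spreads similar classes across different tasks,
    'medium' is a plain random split (like the old RS protocol).
    """
    n = len(class_embeddings)
    # cosine similarity between class embeddings
    e = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)

    # greedy chain: always append the class most similar to the last one,
    # so neighbors in `order` are semantically close
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        nxt = max(remaining, key=lambda c: sim[order[-1], c])
        order.append(nxt)
        remaining.remove(nxt)

    task_size = n // n_tasks
    # hard: consecutive (similar) classes land in the same task
    hard = [order[i * task_size:(i + 1) * task_size] for i in range(n_tasks)]
    # easy: similar classes are dealt round-robin into different tasks
    easy = [order[i::n_tasks] for i in range(n_tasks)]
    # medium: ordinary random split
    shuffled = np.random.default_rng(0).permutation(n).tolist()
    medium = [shuffled[i * task_size:(i + 1) * task_size] for i in range(n_tasks)]
    return hard, easy, medium
```

Every protocol sees the same classes; only the grouping changes, which is exactly the knob EDGE turns to surface the extremes.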

📊 The Results: Why It Matters

When the authors tested this new method, they found shocking differences:

  • The "Average" Lie: A model might have an "Average Score" of 85%. The old method (Random Sampling) would say, "This model is safe!"
  • The EDGE Reality: The EDGE method reveals that on the "Nightmare Route," that same model drops to 70%. On the "Dream Route," it hits 95%.

Why is this a big deal?
If you are building a self-driving car, you don't care about the average performance. You care about the worst-case scenario. If the car crashes 10% of the time because it got confused by a specific sequence of events, an "average" score of 85% is dangerous.
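The gap the authors warn about is easy to reproduce with toy numbers (the values below are illustrative, not the paper's measurements):

```python
# per-sequence final accuracies for one model (illustrative numbers)
rs_sample = [0.84, 0.86, 0.85]  # three random sequences (old RS protocol)
edge_sample = {"hard": 0.70, "medium": 0.85, "easy": 0.95}  # EDGE's sequences

rs_report = sum(rs_sample) / len(rs_sample)   # what RS would publish: ~0.85
worst = min(edge_sample.values())             # what EDGE exposes: 0.70
spread = max(edge_sample.values()) - worst    # the hidden swing: ~0.25
```

Both protocols test the same model; only EDGE reveals that the headline number sits 15 points above the worst case.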

🧠 The Big Takeaway

This paper argues that we need to stop judging AI models based on a few lucky random tests. Instead, we should stress-test them with the hardest possible scenarios and the easiest possible scenarios to get a true picture of their reliability.

Think of it like this:

  • Old Way: Testing a parachute by jumping out of a plane three times on a calm day and saying, "It works 100% of the time!"
  • EDGE Way: Testing the parachute on a calm day, a stormy day, and in the worst turbulence you can simulate, just to see if it really saves you when things go wrong.

By using EDGE, researchers can finally see the full range of an AI's abilities, ensuring that the models we deploy in the real world are truly robust and safe.
