Towards a more realistic evaluation of machine learning models for bearing fault diagnosis

This paper addresses the issue of data leakage in bearing fault diagnosis by proposing a rigorous, leakage-free evaluation methodology based on bearing-wise data partitioning and multi-label classification, which demonstrates that the number of unique training bearings is critical for achieving robust generalization across real-world industrial applications.

João Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva

Published 2026-03-04

Imagine you are a mechanic trying to teach a robot how to listen to a car engine and tell if a specific bearing (a small, spinning part) is broken. You want this robot to be so good that it can fix any car in the world, not just the one it practiced on.

This paper is a wake-up call to the scientists building these robots. It says: "You are cheating, and your robots aren't actually as smart as you think they are."

Here is the breakdown of the problem and the solution, using simple analogies.

1. The Problem: The "Cheat Sheet" (Data Leakage)

In the world of machine learning, scientists train models using "training data" and then test them on "test data" to see how well they learned.

The Mistake:
Many researchers have been making a huge mistake called Data Leakage.

  • The Analogy: Imagine you are studying for a math test. You practice with a specific set of problems from your teacher. Then, on the day of the test, you are given the exact same problems, just with the numbers slightly rearranged. You get a 100% score! You feel like a genius.
  • The Reality: But if you go to a different school and face new problems you've never seen, you might fail. You didn't learn math; you just memorized the specific questions.

In bearing diagnosis, researchers often split the data from a single physical bearing into both the training set and the test set.

  • What happens: The robot learns the unique "voice" or "fingerprint" of that specific bearing (like its background noise or how it was installed). It thinks, "I know this sound, it's broken!" because it heard that exact sound during practice.
  • The Result: The papers report 99% or 100% accuracy. But in the real world, when the robot meets a new bearing it has never seen, it fails miserably. It's like a student who memorized the answer key but can't solve a new equation.
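
The leaky split described above is easy to reproduce in a few lines. This is a minimal, illustrative sketch (synthetic stand-in features, not the paper's dataset): segments are cut from each bearing's signal and shuffled randomly, so the same physical bearing lands in both train and test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bearings, segments_per_bearing = 5, 20

# Each segment carries the ID of the bearing it was cut from.
bearing_id = np.repeat(np.arange(n_bearings), segments_per_bearing)
X = rng.normal(size=(len(bearing_id), 64))  # stand-in vibration features

# A plain random split over segments ignores which bearing each came from.
X_tr, X_te, id_tr, id_te = train_test_split(
    X, bearing_id, test_size=0.25, random_state=0)

# Bearings that appear on BOTH sides of the split -- this is the leak.
leaked = sorted(int(b) for b in set(id_tr) & set(id_te))
print(f"Bearings present in both train and test: {leaked}")
```

Any accuracy measured on such a split partly reflects memorization of bearing-specific "fingerprints" rather than fault detection.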

2. The Solution: The "Strict Teacher" (Bearing-Wise Splitting)

The authors propose a new, strict rule for how to split the data, which they call Bearing-Wise Splitting.

  • The Analogy: Imagine you have 20 different students (bearings).
    • The Old Way: You let Student A practice on questions 1–50, and then you test Student A on questions 51–100. If Student A memorized the style of the questions, they pass.
    • The New Way: Everything from Student A goes into practice, and the test is taken entirely by Student B, who never appeared during practice. Every student (bearing) ends up on exactly one side of the split.
  • Why it works: If the robot can correctly identify a broken bearing in Student B's test, it proves the robot actually learned what a "broken bearing" sounds like, not just what "Student A's broken bearing" sounds like. This is the only way to know if the robot will work in the real world.
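
Bearing-wise splitting maps directly onto group-based splitting in scikit-learn, with the bearing ID as the group key. A minimal sketch (names like `bearing_id` are illustrative; the paper's exact pipeline may differ):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
bearing_id = np.repeat(np.arange(5), 20)    # 5 bearings, 20 segments each
X = rng.normal(size=(len(bearing_id), 64))  # stand-in vibration features

# GroupShuffleSplit assigns whole groups (bearings) to one side or the other.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=bearing_id))

train_bearings = sorted(int(b) for b in set(bearing_id[train_idx]))
test_bearings = sorted(int(b) for b in set(bearing_id[test_idx]))
overlap = set(train_bearings) & set(test_bearings)
print(f"train: {train_bearings}, test: {test_bearings}, overlap: {overlap}")
```

Because whole bearings are assigned to one side of the split, the overlap is empty by construction, and the test score measures generalization to unseen hardware.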

3. The "Magic Trick" (Multi-Label Classification)

The paper also suggests changing how the robot answers questions.

  • The Old Way (Multiclass): The robot has to pick one answer from a list: "Healthy," "Inner Race Broken," "Outer Race Broken," or "Ball Broken." It's like a multiple-choice test where you can only circle one bubble. If a bearing has two things wrong at once, the robot gets confused and guesses wrong.
  • The New Way (Multi-Label): The robot gets a checklist. It asks: "Is the inner race broken? Yes/No. Is the outer race broken? Yes/No."
  • The Benefit: This is more realistic, because real machines often develop several faults at once. It also lets the robot learn more from the same data: a "broken inner race" signal doubles as a negative ("not broken") example for the "outer race" question.
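
The checklist idea corresponds to a multi-label setup: one yes/no column per fault type, so a single signal can carry several faults at once. A hedged sketch using scikit-learn (the fault names and random labels here are illustrative, not the paper's schema):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

FAULTS = ["inner_race", "outer_race", "ball"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))         # stand-in vibration features
Y = rng.integers(0, 2, size=(200, 3))  # one yes/no column per fault type

# One binary classifier per fault; each answers its own checklist question.
clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0))
clf.fit(X, Y)

# The prediction is a row of yes/no answers, e.g. two faults at once.
pred = clf.predict(X[:1])[0].astype(int).tolist()
print(dict(zip(FAULTS, pred)))
```

Contrast with the multiclass setup, where the same model would be forced to pick exactly one of the four labels.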

4. The Big Discovery: Diversity is Key

The researchers ran experiments to see what happens when you follow these new rules. The results were surprising:

  1. The "Fake" Genius: When they used the old, cheating methods, deep learning models (complex AI) looked like super-geniuses with 99% accuracy.
  2. The Real World: When they used the strict "Bearing-Wise" rules, the accuracy of those complex AI models dropped dramatically (sometimes to near 50%, which is basically guessing).
  3. The Underdog Wins: In many cases, simple, old-school math models (like Random Forests) actually performed better than the fancy AI when tested fairly. They were less likely to "memorize" the specific bearing and more likely to learn the actual physics of the fault.

The Lesson on Diversity:
The paper found that the most important thing for a robot to learn isn't how much data it sees, but how many different bearings it sees.

  • Analogy: If you teach a chef to cook a steak using only one specific cow, they might learn that cow's texture. If you teach them using 50 different cows, they learn what "beef" actually tastes like. The robot needs to see many different bearings to learn the universal signs of a broken part.
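
The diversity experiment can be sketched as a loop: hold the total number of training segments fixed, vary how many distinct bearings they come from, and always score on unseen bearings. Everything below is synthetic and illustrative (each fake bearing gets a random "fingerprint" offset plus a small fault signature), so the exact numbers are not the paper's results:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_bearing(bid, n_seg=40):
    """Each bearing has its own random 'fingerprint' plus a fault signature."""
    label = bid % 2                              # alternate healthy / faulty
    fingerprint = rng.normal(scale=2.0, size=8)  # bearing-specific bias
    X = rng.normal(size=(n_seg, 8)) + fingerprint + label * 0.5
    return X, np.full(n_seg, label)

# Test set: bearings the model never trains on.
test_X, test_y = zip(*(make_bearing(b) for b in range(20, 26)))
test_X, test_y = np.vstack(test_X), np.concatenate(test_y)

accs = []
for n_bearings in (2, 6, 20):
    seg = 120 // n_bearings                      # fixed total segment budget
    Xs, ys = zip(*(make_bearing(b, seg) for b in range(n_bearings)))
    clf = RandomForestClassifier(random_state=0)
    clf.fit(np.vstack(Xs), np.concatenate(ys))
    acc = accuracy_score(test_y, clf.predict(test_X))
    accs.append(acc)
    print(f"{n_bearings:2d} training bearings -> accuracy {acc:.2f}")
```

With few bearings, the model can latch onto individual fingerprints; with many, the shared fault signature is the only pattern that survives, which is the diversity effect the paper measures.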

5. The Takeaway

This paper is a call to action for the engineering community. It says:

  • Stop testing your models on the same bearings you trained them on.
  • Stop celebrating 99% accuracy if it's based on a "cheat sheet."
  • Use simpler models if they work better.
  • Test your models on completely new, unseen hardware to ensure they are truly ready for the factory floor.

By following these rules, we can stop building robots that are just good at taking tests and start building robots that can actually fix machines.
