On Linear Separability of the MNIST Handwritten Digits… — Plain-Language Explanation


This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are a teacher trying to sort a massive pile of handwritten homework into ten different boxes, labeled 0 through 9. This is the MNIST dataset, a famous collection of 70,000 tiny pictures of handwritten numbers used to train computers to "see" and recognize digits.

For decades, scientists have argued over a specific question: Can a single, straight line (or a flat sheet in higher dimensions) perfectly separate every single "3" from every single "8," or every "0" from everything else?

This paper by Ákos Hajnal is like a referee stepping in to settle a long-standing debate. Here is the story of what he found, explained simply.

The Core Concept: The "Straight Line" Test

Imagine you have a bag of red marbles (Digit 3) and blue marbles (Digit 5).

Linearly Separable: If you can stick a straight ruler between the red and blue marbles so that all reds are on one side and all blues are on the other, they are "linearly separable."
Not Separable: If the marbles are mixed up in a swirl, or if a red marble is hiding inside a circle of blue ones, no straight ruler can separate them without cutting through a marble. You'd need a curved line or a complex shape.

The big question was: Is the MNIST dataset like the neat bag of marbles, or the messy swirl?

The Experiment: The "Perfect Sorter"

The author didn't just guess; he built a digital "Perfect Sorter" using a powerful math tool called CVXPY. Think of this tool as a super-strict judge that tries to draw a straight line between two groups of numbers.

If the judge finds a line, it says, "Yes, these can be separated!"
If the judge tries every possible angle and fails, it says, "No, these are hopelessly mixed."

He ran two kinds of test on three piles of images:

One-vs-One: Can we separate just 0s from 1s? Just 2s from 3s? (Like separating red marbles from blue ones).
One-vs-Rest: Can we separate all the 0s from everything else (1s, 2s, 3s... all the way to 9)? (Like separating red marbles from a giant pile of every other color).
The Sets: He tested the Training Set (the 60,000 examples the computer learns from), the Test Set (the 10,000 examples used to check if the computer learned), and the Combined Set (all 70,000 together).

The Results: It's Complicated!

Here is the twist: The answer depends entirely on which numbers you are comparing and which pile of homework you are looking at.

1. The "Messy" Training Set (The Learning Phase)

When looking at the main pile of 60,000 images used for training:

One-vs-One: Some pairs were easy to separate (like 0 vs. 1), but others were impossible. For example, you cannot draw a straight line to separate all the 2s from all the 3s because some handwritten 2s look too much like 3s.
One-vs-Rest: This was a total failure. No single digit could be separated from the other nine using a straight line. The shapes are just too varied and overlapping.
Verdict: The training set is NOT linearly separable.

2. The "Clean" Test Set (The Exam Phase)

When looking at the smaller pile of 10,000 images used for testing:

One-vs-One: Surprisingly, every single pair of digits could be separated by a straight line!
Why? Because the test set is smaller. It's like taking a smaller sample of marbles; by pure luck, the messy ones that caused the overlap in the big pile weren't in this smaller group.
One-vs-Rest: Most digits could be separated from the rest, but not all (5 and 8 failed).
Verdict: The test set is mostly linearly separable, but this is likely a fluke of the small sample size, not a rule for the whole dataset.

3. The Combined Set (The Whole Story)

When you mix the training and test sets together, the result looks like the training set: Not separable. The messy, overlapping examples from the training set ruin the perfect separation found in the test set.

The Big Takeaway

For years, people have been arguing: "MNIST is easy!" vs. "MNIST is impossible!"

This paper says: "It depends on what you're asking."

If you ask, "Can a straight line separate a 2 from a 3 in the entire dataset?" The answer is No.
If you ask, "Can a straight line separate a 2 from a 3 in the test set?" The answer is Yes (but only because the test set is small and lucky).
If you ask, "Can a straight line separate all 2s from everything else?" The answer is No for the training and combined sets (the small test set is again the lucky exception).

Why Does This Matter?

Think of it like trying to sort a deck of cards.

If you have a deck where every card is perfectly ordered, a simple rule works.
But the real world (and the MNIST dataset) is messy. Handwriting varies wildly. Some people write a "1" that looks like a "7," and some "8"s look like "3"s.

The paper shows that simple straight-line rules aren't enough to perfectly sort the full MNIST collection. This is one reason people turn to more flexible AI (like Deep Neural Networks) that can draw curved boundaries and learn complex patterns, rather than just simple straight ones.

In short: The MNIST dataset is a beautiful, messy puzzle. You can't solve it with a single straight line, but you can solve it with a smart, flexible mind (or a modern AI).

1. Problem Statement

The MNIST dataset, a foundational benchmark in machine learning containing 70,000 handwritten digit images (28x28 pixels), has a long history of use in evaluating pattern recognition models. Despite its ubiquity, a fundamental theoretical question remains unresolved: Is the MNIST dataset linearly separable?

While informal consensus often claims MNIST is "not linearly separable," and some scientific sources make conflicting assertions, there is no comprehensive empirical verification. The paper addresses this ambiguity by distinguishing between two specific separation scenarios:

Pairwise Separability: Can a single linear hyperplane separate one specific digit class (e.g., '0') from another specific class (e.g., '1')?
One-vs-Rest (OvR) Separability: Can a single linear hyperplane separate one specific digit class from all other nine classes combined?

The study investigates these scenarios across the training set, the test set, and the combined dataset.

2. Methodology

The author employs a rigorous computational approach to determine linear separability, moving beyond heuristic approximations used in previous studies.

Formulation as a Feasibility Problem: The core problem is formulated as a Linear Program (LP). The goal is to find a weight vector $w$ and bias $b$ that define a separating hyperplane.
- The optimization problem minimizes a constant (0) subject to the constraints $y_i(w^\top x_i + b) \geq 1$ for every sample $x_i$ with label $y_i \in \{-1, +1\}$.
- If the solver finds a feasible solution, the dataset is linearly separable. If the solver returns "Infeasible," the dataset is not linearly separable.
Tools and Environment:
- Solver: The study utilizes CVXPY (version 1.6.7), an open-source convex optimization modeling tool, which automatically selects the CLARABEL solver.
- Hardware: Experiments were conducted on Google Colab using a T4 GPU and Intel Xeon CPU.
- Comparison: The author compares execution times against a previous study by Zhong et al. [6] (which used Minimum Enclosing Ball algorithms in MATLAB), demonstrating a significant speedup (4–8x) with the CVXPY approach.
Experimental Scope:
- Pairwise: All 45 unique combinations of digit pairs (0–9) were tested.
- One-vs-Rest: All 10 digits were tested against the remaining 9 classes.
- Datasets: Tests were run on the 60,000-sample training set, the 10,000-sample test set, and the combined 70,000-sample set.
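To make the experimental scope concrete, the sketch below enumerates the pairwise and one-vs-rest tasks and builds the ±1 labels each one needs. The function names and the short label list are illustrative assumptions; the paper's own data handling may differ.

```python
from itertools import combinations

DIGITS = range(10)


def pairwise_tasks():
    """All unordered digit pairs (a, b) with a < b."""
    return list(combinations(DIGITS, 2))


def pairwise_labels(labels, a, b):
    """Keep only samples of digits a and b; map a to +1 and b to -1.

    Returns the kept sample indices and their signs.
    """
    indices = [i for i, d in enumerate(labels) if d in (a, b)]
    signs = [1 if labels[i] == a else -1 for i in indices]
    return indices, signs


def one_vs_rest_labels(labels, k):
    """Map digit k to +1 and every other digit to -1 (no samples dropped)."""
    return [1 if d == k else -1 for d in labels]


tasks_per_dataset = len(pairwise_tasks()) + len(DIGITS)  # 45 + 10
total_tasks = 3 * tasks_per_dataset                      # train, test, combined
print(len(pairwise_tasks()), tasks_per_dataset, total_tasks)  # 45 55 165
```

Note the structural difference between the two settings: a pairwise task discards eight of the ten classes, while a one-vs-rest task keeps every sample and only relabels it.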

3. Key Results

A. Pairwise Linear Separability

Training Set:
- Non-Separable Pairs: Seven specific digit pairs were found to be not linearly separable: (2–3), (2–8), (3–5), (3–8), (4–9), (5–8), and (7–9).
- Separable Digits: Digits 0, 1, and 6 were found to be linearly separable from every other individual digit.
- Most Challenging Digit: Digit 8 was among the most difficult to separate, failing against three other digits (2, 3, and 5); by the pair list above, digit 3 also fails against three (2, 5, and 8).
Test Set:
- All pairs were linearly separable. This is attributed to the smaller sample size of the test set (roughly 1,000 per digit) compared to the training set (roughly 6,000 per digit), which reduces the likelihood of overlapping convex hulls.
Combined Set:
- The results mirrored the training set. The addition of test data did not change the separability status of any pair; if a pair was non-separable in the training set, it remained so in the combined set.
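The pairwise findings for the training set can be cross-checked with a short tally over the seven non-separable pairs listed above. This is only bookkeeping on the reported results (the variable names are illustrative), not a re-run of the experiment.

```python
from collections import Counter

# Non-separable training-set pairs as reported in the summary above.
NON_SEPARABLE_TRAIN_PAIRS = [(2, 3), (2, 8), (3, 5), (3, 8), (4, 9), (5, 8), (7, 9)]


def failures_per_digit(pairs):
    """Count, for each digit, how many non-separable pairs it appears in."""
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    return counts


counts = failures_per_digit(NON_SEPARABLE_TRAIN_PAIRS)
always_separable = sorted(d for d in range(10) if counts[d] == 0)
worst = max(counts.values())
hardest = sorted(d for d, c in counts.items() if c == worst)
separable_pair_count = 45 - len(NON_SEPARABLE_TRAIN_PAIRS)

print(always_separable)      # [0, 1, 6]
print(hardest, worst)        # [3, 8] 3
print(separable_pair_count)  # 38
```

The combined-set result also follows from a general fact: adding samples can only add constraints, so a pair that is non-separable on the training set cannot become separable once the test images are included.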

B. One-vs-Rest (OvR) Linear Separability

Training Set:
- All 10 digits were non-separable from the rest of the classes. Even digits 0, 1, and 6 (which were separable in pairwise tests) failed to be separated from the aggregate of all other digits.
Combined Set:
- Consistent with the training set, all 10 digits were non-separable.
Test Set:
- Mixed Results: Digits 0, 1, 2, 3, 4, 6, and 7 were found to be linearly separable from the rest. Digits 5, 8, and 9 were not separable.
- Note: The author cautions that these positive results on the test set are likely due to the small sample size and may not hold for the full distribution.
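The pairwise and one-vs-rest results are logically linked, which a short derivation makes explicit. The notation $y_i^{(k)}$ (equal to $+1$ when $x_i$ is digit $k$ and $-1$ otherwise) is introduced here for illustration and is not taken from the paper. If a hyperplane $(w, b)$ separates digit $k$ from all other digits, the same hyperplane still works after discarding every class except $k$ and $j$:

$$
y_i^{(k)}\,(w^\top x_i + b) \ge 1 \ \text{ for all } i
\;\Longrightarrow\;
y_i^{(k)}\,(w^\top x_i + b) \ge 1 \ \text{ for all } i \text{ whose digit is } k \text{ or } j .
$$

So one-vs-rest separability of digit $k$ implies pairwise separability of $k$ from every other digit $j$. By contraposition, any digit that appears in a non-separable pair must also fail the one-vs-rest test on the same set. On the training set this already accounts for digits 2, 3, 4, 5, 7, 8, and 9; the additional information in the one-vs-rest results is that 0, 1, and 6 fail as well. The converse does not hold, which is why the test set can have all pairs separable while some digits still fail one-vs-rest.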

C. Performance Metrics

Execution Times:
- Pairwise tests on the training set took 6.4–24.7 seconds (non-separable cases took longer).
- One-vs-Rest tests on the training set took 89–209 seconds due to the larger number of constraints (negative samples).
- The CVXPY/CLARABEL approach proved significantly faster than previous methods (e.g., Zhong et al.'s MATLAB implementation).
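A rough size estimate helps explain the gap between pairwise and one-vs-rest run times. The sketch below assumes the features are the raw 28×28 pixel values and, for the pairwise case, that classes are of equal size (about 6,000 training images each); both are simplifying assumptions, and `lp_size` is a hypothetical helper.

```python
def lp_size(n_samples, n_features):
    """Return (number of variables, number of constraints) of the feasibility LP.

    Variables: one weight per feature plus one bias.
    Constraints: one inequality per sample.
    """
    return n_features + 1, n_samples


N_FEATURES = 28 * 28   # assumed: flattened raw pixels
TRAIN_SIZE = 60_000
AVG_CLASS_SIZE = TRAIN_SIZE // 10  # assumed: equal class sizes (approximation)

pairwise_vars, pairwise_cons = lp_size(2 * AVG_CLASS_SIZE, N_FEATURES)
ovr_vars, ovr_cons = lp_size(TRAIN_SIZE, N_FEATURES)

print(pairwise_vars, pairwise_cons)  # 785 12000
print(ovr_vars, ovr_cons)            # 785 60000
print(ovr_cons / pairwise_cons)      # 5.0
```

Under these assumptions a one-vs-rest problem has about five times as many constraints as a pairwise one, with the same number of variables. Solver time need not scale linearly with constraint count, so this only indicates the direction of the difference.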

4. Key Contributions

Definitive Empirical Evidence: The paper provides the first comprehensive, systematic verification of linear separability for MNIST across all pairwise and one-vs-rest configurations for training, test, and combined sets.
Resolution of Conflicting Claims: It clarifies that neither blanket claim holds on its own. "MNIST is linearly separable" is fully true only for the test set in the pairwise setting, while "MNIST is not linearly separable" holds for every digit in the one-vs-rest setting on the training and combined sets.
Methodological Benchmark: The study establishes a high-performance baseline for testing linear separability using modern convex optimization tools (CVXPY/CLARABEL), showing superior speed compared to older Minimum Enclosing Ball (MEB) or Simplex-based approaches.
Identification of "Hard" Classes: It identifies specific digit pairs (e.g., 3 vs 8, 5 vs 8) and the digit '8' generally as the most geometrically complex and difficult to separate linearly.

5. Significance

This work is significant for both theoretical and practical machine learning:

Theoretical Clarity: It replaces informal claims about MNIST's separability with verified results. Because the training set is not linearly separable in a one-vs-rest setting, simple linear classifiers (like a single Perceptron or Logistic Regression) cannot reach 100% training accuracy on MNIST without feature engineering or non-linear transformations.
Model Selection: The results justify the necessity of using non-linear models (such as Deep Neural Networks, CNNs, or SVMs with non-linear kernels) for high-accuracy MNIST classification.
Reproducibility: By providing open-source code and detailed execution times, the paper enables future researchers to benchmark new separability testing algorithms against a known standard.

In conclusion, the paper demonstrates that while MNIST is a "simple" dataset in terms of resolution and size, its underlying geometry is complex enough to prevent linear separation in realistic (training/combined) scenarios, particularly when distinguishing a single class from all others.

On Linear Separability of the MNIST Handwritten Digits Dataset