Imagine you are hiring a new employee for a very important job: identifying objects in pictures.
For years, the only thing you cared about was their test score. If they could correctly name 90% of the pictures, you hired them. But recently, you've noticed some weird things. The employee with the highest test score:
- Panics if the lighting changes slightly (not robust).
- Is 99% sure they are right, even when they are wrong (poor calibration).
- Only recognizes a "dog" if it's on a grassy background, but calls it a "cat" if it's on a carpet (bad object focus).
- Is great at spotting dogs but terrible at spotting cats (unfair class balance).
This paper, titled "Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?", is like a massive job interview with 326 candidates. The researchers didn't just look at the test scores (Accuracy); they put 326 different AI models through a gauntlet of 9 different challenges to see which ones are truly "well-behaved."
Here is the breakdown of their findings, using simple analogies:
1. The 9 Dimensions of a "Good" Employee
Instead of just one test score, the researchers looked at 9 different traits. Think of these as the "soft skills" and "hard skills" of an AI:
- Accuracy: The standard test score. (Can they name the picture?)
- Adversarial Robustness: Can they still recognize a picture if someone puts a tiny, invisible sticker on it to trick them? (Like a security guard spotting a fake ID).
- Corruption Robustness: Can they recognize a picture if it's blurry, snowy, or low-quality? (Like reading a sign in the rain).
- OOD (Out-of-Domain) Robustness: Can they recognize a dog if it's drawn as a sketch or a painting, even though they only studied photos? (Adaptability).
- Calibration: Do they know when they are guessing? A well-calibrated AI says "I'm 50% sure" when it's unsure, rather than confidently saying "It's a cat!" when it's actually a toaster.
- Class Balance (Fairness): Do they treat all categories equally? Or do they love "dogs" but hate "pandas"?
- Object Focus: Do they look at the animal, or do they get distracted by the background (like a dog on grass)?
- Shape Bias: Do they recognize things by their shape (like a human does) or just by their texture (like a furry blob)?
- Parameters (Efficiency): How big and heavy is their brain? Smaller is usually better for speed and cost.
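To make one of these dimensions concrete, here is a minimal sketch of how calibration is commonly measured: expected calibration error (ECE) bins predictions by confidence and checks whether, say, the "90% sure" bin is actually right about 90% of the time. This is a standard illustration, not necessarily the exact metric the paper uses.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then compare each bin's
    average confidence to its actual accuracy. The weighted sum
    of those gaps is the ECE (0.0 = perfectly calibrated)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A well-calibrated model that says "90% sure" and is right 9 times out of 10 scores near zero; the overconfident "99% sure but often wrong" employee from the intro scores much higher.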
2. The Big Discoveries
The researchers tested 326 models and found some surprising things:
- Bigger Datasets = Better All-Rounders: Just like a student who reads more books performs better in every subject, models trained on larger datasets (like ImageNet-21k) were generally better at almost everything, not just the test score.
- The "Self-Taught" Advantage: Models that taught themselves first (Self-Supervised Learning) before being tested on specific tasks turned out to be the most "well-behaved." They were more robust, fair, and adaptable. It's like hiring someone who learned by exploring the world on their own, rather than just memorizing a textbook.
- Vision-Language Models are the "Superheroes" of Adaptability: Models that learn by looking at pictures and reading text (like CLIP) were terrible at the standard test score (because they didn't memorize the specific answers) but were amazing at recognizing sketches, paintings, and weird angles. They are the most adaptable employees.
- The Old Guard is Falling Behind: The famous "ResNet50" and original "ViT" models, which used to be the gold standard, are actually performing quite poorly when you look at these other 8 dimensions. They are like the employee who got an A on the final exam but fails every other part of the job.
3. The Trade-Offs (The "Juggle")
The paper found that you can't always have it all.
- If you train a model to be super-resistant to hackers (Adversarial Training), it often becomes less accurate and less fair.
- If you make a model super small and efficient, it usually loses some of its "brainpower" (robustness).
4. The New Scorecard: QUBA
Since there is no single "best" model, the authors invented a new score called QUBA (Quality Understanding Beyond Accuracy).
Think of QUBA as a customizable report card.
- If you are a self-driving car company, you might weight the robustness dimensions (adversarial, corruption, OOD) and Calibration heavily.
- If you are a social media app running on phones, you might weight efficiency (Parameters) and Class Balance (fairness) heavily.
The QUBA score takes all 9 dimensions, normalizes them, and gives you a single number that tells you which model is best for your specific needs.
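The "normalize, weight, and average" recipe can be sketched in a few lines. The model names, dimension values, and min-max normalization below are illustrative assumptions, not the paper's exact formula or data:

```python
import numpy as np

# Hypothetical raw scores for two models on four of the nine
# dimensions (made-up numbers, purely for illustration).
dimensions = ["accuracy", "corruption_robustness", "calibration", "parameters"]
higher_is_better = [True, True, True, False]  # parameter count is a cost

scores = {
    "model_a": [0.84, 0.55, 0.90, 85e6],
    "model_b": [0.80, 0.70, 0.95, 25e6],
}

def quba_like_score(scores, weights, higher_is_better):
    """Min-max normalize each dimension across models, flip the
    cost dimensions so higher is always better, then take a
    weighted average -- the 'customizable report card' idea."""
    names = list(scores)
    mat = np.array([scores[n] for n in names], dtype=float)
    lo, hi = mat.min(axis=0), mat.max(axis=0)
    norm = (mat - lo) / np.where(hi > lo, hi - lo, 1.0)
    for j, hib in enumerate(higher_is_better):
        if not hib:
            norm[:, j] = 1.0 - norm[:, j]  # invert cost dimensions
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # weights are relative; normalize to sum to 1
    return dict(zip(names, norm @ w))
```

Changing the weights changes the winner: with equal weights the smaller, more robust `model_b` comes out ahead, but weighting accuracy alone would pick `model_a`. That is the whole point of a customizable scorecard.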
The Bottom Line
The paper argues that the AI world has been too obsessed with the "Accuracy" number. It's time to stop looking at the test score alone and start looking at the whole person.
The takeaway? If you want a truly reliable, fair, and robust AI, don't just pick the one with the highest accuracy. Pick the one that was trained on a massive dataset, learned by itself first, and fits your specific needs. And maybe, just maybe, stop using the old ResNet50 model—it's time for an upgrade!