Imagine you are a chef trying to decide which of two new recipes for a chocolate cake is better.
Currently, most people judge recipes by a single number: the "Average Taste Score." If Recipe A has a score of 8.5 and Recipe B has a score of 8.4, everyone assumes Recipe A is the winner.
But here's the problem: Average scores lie.
- Recipe A might taste perfect 99 times, but the one time it fails, it tastes like burnt rubber (a huge disaster).
- Recipe B might be a little bland every single time, but it never burns.
If you are baking for a wedding, you might prefer the consistent, slightly bland cake (Recipe B) over the risky one (Recipe A). But if you only look at the "Average Score," you miss this crucial difference.
This is exactly what the paper "A Methodology for Graphical Comparison of Regression Models" is about. It argues that in machine learning (where computers predict numbers, like stock prices or machine failure dates), we rely too much on single-number scores like MAE (Mean Absolute Error) or RMSE (Root Mean Square Error). These scores are like the "Average Taste Score"—they hide the messy details of how a model makes mistakes.
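To see how an average score can hide a disaster, here is a minimal sketch with made-up numbers: two models whose MAE is nearly identical, even though one of them fails catastrophically once in a hundred predictions.

```python
import numpy as np

# Hypothetical error magnitudes (not from the paper's data):
# Model A is almost perfect 99 times, then fails badly once.
errors_a = np.full(100, 0.5)
errors_a[0] = 30.0

# Model B is a little off every single time, but never disastrous.
errors_b = np.full(100, 0.79)

def mae(errs):
    """Mean Absolute Error: the 'Average Taste Score'."""
    return np.mean(np.abs(errs))

def rmse(errs):
    """Root Mean Square Error: squaring punishes big misses harder."""
    return np.sqrt(np.mean(errs ** 2))

print(f"MAE  A: {mae(errors_a):.3f}   B: {mae(errors_b):.3f}")
print(f"RMSE A: {rmse(errors_a):.3f}  B: {rmse(errors_b):.3f}")
print(f"Worst miss A: {np.max(np.abs(errors_a)):.1f}  "
      f"B: {np.max(np.abs(errors_b)):.1f}")
```

By MAE the two models look interchangeable (0.795 vs 0.790); only the worst-case miss (30 vs 0.79) reveals that Model A is the risky wedding cake.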
Here is the paper's solution, broken down into simple steps:
1. The Problem: The "Blindfolded" Judge
The authors explain that standard metrics are like judging a marathon runner while blindfolded. You know the final time (the score), but you don't know:
- Did they sprint at the start and collapse at the end?
- Did they run perfectly straight, or did they zigzag wildly?
- Did they run fast but trip over a rock once?
In the paper's examples, two models can have almost identical scores, but one might consistently guess too high (overestimating), while the other guesses too low (underestimating). In real life, this matters!
- Medical Diagnosis: It's better to slightly overestimate a tumor size (to be safe) than to underestimate it.
- Stock Trading: It might be better to be consistently slightly wrong than to occasionally be wildly wrong.
2. The Solution: A Two-Step Visual Inspection
Instead of just looking at a number, the authors propose a visual "health check" for models.
Step 1: The "Box Plot" (The Quick Scan)
First, they use a simple chart called a Box Plot. Imagine looking at a row of boxes, where each box represents a model's errors.
- A short box means the model is consistent (its errors cluster tightly together).
- A tall box means the model's errors are all over the place.
- Dots sticking out beyond the whiskers (outliers) show where the model made a huge, embarrassing mistake.
This helps you quickly filter out the "bad" models and pick the top contenders.
Step 2: The "2D Error Space" (The Deep Dive)
Once you have two top contenders (let's call them Model A and Model B), you don't just compare their scores. You plot them against each other on a special map called the 2D Error Space.
Imagine a graph where:
- The X-axis is how wrong Model A was.
- The Y-axis is how wrong Model B was.
- Every dot on the map represents one specific prediction (e.g., "The prediction for Machine #42").
The Magic of the Map:
- The Diagonal Line: If a dot is on the diagonal line, both models made the exact same mistake.
- The Zones: The map is split into zones. If a dot is in the "Green Zone," Model B was better. If it's in the "Orange Zone," Model A was better.
- The Heat Map (The Colormap): Instead of just dots, they color the map based on density.
  - Hot colors (Red/Orange): Where most predictions cluster (the "safe zone").
  - Cool colors (Blue): Where the rare, weird mistakes happen.
This allows you to see patterns. Maybe Model A is great for small numbers but terrible for big numbers. Maybe Model B is consistent but always guesses too high. You can see these patterns immediately, whereas a single number would hide them.
3. The Secret Weapon: The "Rubber Band" (Mahalanobis Distance)
The authors introduce a fancy math trick called Mahalanobis Distance.
Think of it this way: If you are measuring how far a point is from the center of a group, a normal ruler (Euclidean distance) treats all directions the same. But what if the group of points is shaped like a long, stretched-out oval (like a rubber band)?
- Normal Ruler: Treats a centimeter in every direction the same, so it may flag a point as "far away" simply because it sits far out along the oval's natural stretch.
- Rubber Band (Mahalanobis): Understands the shape of the oval. It knows that being far along the "long" axis is actually normal, but being that same distance off the "short" side is a huge anomaly.
This helps the computer spot the real weird outliers that other methods miss, especially when the errors are correlated (when one model messes up, the other tends to mess up too).
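The rubber-band effect is easy to demonstrate numerically. Below, two test points sit at the same Euclidean distance from the center of a stretched, correlated cloud of synthetic error pairs, yet their Mahalanobis distances differ sharply.

```python
import numpy as np

rng = np.random.default_rng(2)
# A stretched-out oval: strongly correlated 2D error pairs (synthetic).
cov = np.array([[4.0, 3.5],
                [3.5, 4.0]])
points = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

mean = points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(points, rowvar=False))

def mahalanobis(p):
    """Distance that accounts for the cloud's shape and correlation."""
    d = p - mean
    return np.sqrt(d @ inv_cov @ d)

# Two points, equally far by the "normal ruler"...
along = np.array([3.0, 3.0])     # along the oval's long axis: normal
across = np.array([3.0, -3.0])   # off the short side: a real anomaly

for name, p in [("along the oval ", along), ("across the oval", across)]:
    print(f"{name}: Euclidean = {np.linalg.norm(p - mean):.2f}, "
          f"Mahalanobis = {mahalanobis(p):.2f}")
```

The "along" point scores a small Mahalanobis distance (it follows the cloud's stretch), while the "across" point scores several times larger, so it gets flagged as the genuine outlier even though a plain ruler cannot tell them apart.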
4. The Real-World Test: The "Machine Breakdown"
To prove their method works, they tested it on a dataset about predicting when industrial machines will break down (Remaining Useful Life).
- The Scenario: If you guess a machine will last 10 more days but it breaks in 1, that's dangerous (you didn't fix it in time). If you guess it will last 1 day but it lasts 10, that's just annoying (you fixed it too early).
- The Result: Standard scores said Model 1 was slightly better. But the 2D Map showed that Model 1 was "conservative" (always guessing the machine would break sooner than it actually did), while Model 2 was "optimistic" (guessing it would last longer).
- The Decision: Because a machine breaking unexpectedly is dangerous, the visual map proved that Model 1 was the safer, better choice, even though the difference in scores was tiny.
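The conservative-vs-optimistic split is exactly what the *sign* of the error reveals. Here is a toy version with made-up Remaining Useful Life numbers (not the paper's dataset): two models with similar MAE but opposite biases.

```python
import numpy as np

# Hypothetical days-until-failure for five machines.
true_rul = np.array([10, 25, 40, 5, 60])

pred_model1 = np.array([8, 21, 37, 4, 55])   # conservative: predicts earlier failure
pred_model2 = np.array([13, 28, 44, 9, 63])  # optimistic: predicts later failure

for name, pred in [("Model 1", pred_model1), ("Model 2", pred_model2)]:
    err = pred - true_rul  # positive = guessed the machine would last longer
    print(f"{name}: MAE = {np.mean(np.abs(err)):.1f}, "
          f"mean signed error = {np.mean(err):+.1f}")
```

The MAEs are close (3.0 vs 3.4), but the signed means point in opposite directions: Model 1 always errs on the safe side, Model 2 always risks an unexpected breakdown, and that asymmetry, not the average, should drive the decision.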
The Takeaway
The paper is a call to stop relying on "Average Scores" alone. Just like you wouldn't judge a movie only by its Rotten Tomatoes score, you shouldn't judge a machine learning model only by its error number.
By using these visual maps, data scientists can:
- See the shape of the mistakes.
- Spot dangerous outliers that numbers hide.
- Choose the model that fits the specific risks of their real-world problem.
It turns the boring task of "comparing numbers" into an exciting detective game of "finding patterns."