Learning with the Nash-Sutcliffe loss

This paper establishes a decision-theoretic foundation for the Nash-Sutcliffe efficiency by proving that its counterpart, the Nash-Sutcliffe loss, is strictly consistent for a specific multi-dimensional functional. This result enables the development of Nash-Sutcliffe linear regression and extends the metric's applicability to forecasting multiple stationary dependent time series with differing stochastic properties.

Hristos Tyralis, Georgia Papacharalampous

Published 2026-03-03

Imagine you are a coach training a team of 100 different runners. Some run on flat tracks, some on hills, some in the rain, and some in the sun. Your goal is to pick the best training method to help them all run faster.

For decades, coaches have used a specific stopwatch metric called NSE (Nash-Sutcliffe Efficiency) to judge how good a runner is compared to just guessing their average speed. It's a popular metric because it's fair: it doesn't matter if a runner is naturally fast or slow; it only cares if they improved relative to their own baseline.
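In code, that "stopwatch" is simple to write down: NSE compares a model's squared error against the squared error of a baseline that always guesses the average. A score of 1 is a perfect model; 0 means you did no better than the baseline. A minimal sketch (toy numbers, not data from the paper):

```python
def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 minus model SSE over mean-baseline SSE."""
    mean_obs = sum(observed) / len(observed)
    sse_model = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_mean = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse_model / sse_mean

obs = [2.0, 4.0, 6.0, 8.0]
good = [2.1, 3.9, 6.2, 7.8]
lazy = [5.0, 5.0, 5.0, 5.0]   # always predicts the mean

print(nse(obs, good))  # close to 1: a genuinely skilled model
print(nse(obs, lazy))  # exactly 0: no better than guessing the average
```

Because the denominator is the baseline's own error, a naturally "slow runner" (a hard-to-predict series) is graded on the same 0-to-1 scale as an easy one.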

However, this paper reveals a hidden trap. The authors, Hristos Tyralis and Georgia Papacharalampous, discovered that while everyone has been using this stopwatch to judge the runners, they have been using a completely different set of rules to train them.

Here is the breakdown of their discovery in simple terms:

1. The "Wrong Map" Problem

Imagine you want to drive to a specific destination (the best prediction).

  • The Old Way: You use a GPS that tells you to minimize "total distance traveled" (this is the standard Mean Squared Error or MSE). This gets you to the geometric center of all possible paths.
  • The Judge's Rule: But the judge (the NSE metric) doesn't care about total distance. The judge cares about a "weighted score" that penalizes you differently depending on how bumpy the road was.

The paper argues that for a long time, scientists have been training their models using the "distance" GPS (MSE) but then judging them with the "bumpy road" score (NSE).

  • The Result: The models are driving toward the wrong destination. They are optimized for the wrong goal.

2. The "Weighted Average" Secret

The authors prove that the NSE metric isn't just looking for the "average" runner. It is actually looking for a Data-Weighted Average.

The Analogy:
Imagine you are trying to guess the average temperature of a city.

  • Standard Average (MSE): You take every single day's temperature, add them up, and divide by the number of days. A day that is 100°F counts the same as a day that is 70°F.
  • Nash-Sutcliffe Average: This method says, "Wait! If a day has very little variation (it's always 70°F), it's easy to predict, so let's trust it more. But if a day is chaotic (swinging between 50°F and 90°F), it's hard to predict, so let's trust it less."

The NSE metric essentially says: "I care more about the days that are stable and less about the days that are chaotic."

The paper proves that if you want to win the NSE game, you must train your model to target this specific "Weighted Average," not the simple average.
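To make the "Data-Weighted Average" concrete, here is a sketch under a simplifying assumption (not the paper's exact multi-dimensional functional): if one constant must summarize several series, and each series' error is normalized by that series' own variance, the best constant is an inverse-variance-weighted mean.

```python
# Minimizing sum_i (mu_i - c)^2 / var_i over a single constant c yields the
# inverse-variance-weighted mean: stable series pull the answer much harder
# than chaotic ones. (Illustrative assumption; the paper's functional is
# multi-dimensional and more general.)

def weighted_target(means, variances):
    """argmin over c of sum((m - c)**2 / v for each series (m, v))."""
    weights = [1.0 / v for v in variances]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)

# Two stable 70°F series and one chaotic series centered at 90°F:
target = weighted_target([70.0, 70.0, 90.0], [1.0, 1.0, 100.0])
print(target)   # about 70.1 — the chaotic series barely moves the target,
                # while a plain average would say 76.7
```

The stable series dominate the target, exactly as in the temperature analogy above.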

3. The New Solution: "Nash-Sutcliffe Regression"

The authors introduce a new training method called Nash-Sutcliffe Linear Regression.

  • Old Training (OLS): "Hey model, try to be as close to the middle as possible for everyone."
  • New Training (Nash-Sutcliffe): "Hey model, I'm going to give you a special pair of glasses. Through these glasses, some days look bigger and more important than others. Train yourself to hit that specific target."

They show mathematically that if you use this new training method, your model will perform significantly better when judged by the NSE metric. In fact, in their tests with real river flow and temperature data, the new method improved scores by huge margins (sometimes cutting the error in half) compared to the old way.
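The "special pair of glasses" can be sketched as weighted least squares, assuming the weight on each series is the inverse of that series' observed variance (the same quantity NSE normalizes by). This is an illustrative sketch, not necessarily the paper's exact estimator:

```python
import numpy as np

def ns_linear_fit(series):
    """Fit one line y = a*x + b across several (x, y) series,
    weighting every point of a series by 1 / Var(y_series)."""
    xs, ys, ws = [], [], []
    for x, y in series:
        w = 1.0 / np.var(y)            # stable series get large weight
        xs.extend(x); ys.extend(y); ws.extend([w] * len(y))
    X = np.column_stack([xs, np.ones(len(xs))])
    sw = np.sqrt(np.asarray(ws))       # fold weights into a plain lstsq
    coef, *_ = np.linalg.lstsq(X * sw[:, None], np.asarray(ys) * sw, rcond=None)
    return coef                        # (slope, intercept)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
calm  = (x, 2.0 * x + 1.0 + rng.normal(0.0, 0.05, 50))  # stable series
noisy = (x, 2.0 * x + 1.0 + rng.normal(0.0, 2.00, 50))  # chaotic series
slope, intercept = ns_linear_fit([calm, noisy])
print(slope, intercept)  # roughly 2 and 1, driven mostly by the calm series
```

Ordinary least squares would let the chaotic series drag the fit around; the inverse-variance weighting lets the stable series set the target instead.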

4. The "Apples and Oranges" Warning

The paper also gives a very important warning about how we compare different things.

The Analogy:
Imagine you are comparing the performance of a Formula 1 car and a bicycle.

  • If you use a standard ruler (MSE), you might say the car is "better" because it covers more distance.
  • If you use a "relative speed" metric (NSE), you might say the bicycle is "better" because it's doing amazing things relative to a human walking.

The authors say: You cannot mix these comparisons.
If you have 100 rivers, and 50 are small mountain streams (fast, chaotic) and 50 are huge slow rivers (stable), you cannot simply average their NSE scores to say "Our model is 80% good."

  • The NSE score only makes sense if all the rivers are behaving like the same type of river.
  • If you mix different types of rivers, the "Weighted Average" breaks, and the score becomes meaningless.
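A tiny numeric illustration of why the average breaks (toy numbers, not data from the paper): give a model the exact same one-unit error on a calm river and on a flashy stream, and NSE returns wildly different verdicts.

```python
def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 minus model SSE over mean-baseline SSE."""
    mean_obs = sum(observed) / len(observed)
    sse_model = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_mean = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse_model / sse_mean

calm   = [10.0, 11.0, 10.0, 11.0]   # stable big river
flashy = [1.0, 40.0, 5.0, 60.0]     # chaotic mountain stream

# The same +1.0 error everywhere:
print(nse(calm,   [v + 1.0 for v in calm]))    # -3.0  (looks terrible)
print(nse(flashy, [v + 1.0 for v in flashy]))  # ~0.998 (looks superb)
# Their mean, about -1.0, describes neither river.
```

Identical forecasting skill in absolute terms, yet one score looks disastrous and the other near-perfect, so an average over such a mixed bag says nothing about the model.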

The Big Takeaway

This paper is a wake-up call for scientists, data analysts, and machine learning engineers.

  1. Stop Mismatching: If you plan to judge your model with the NSE metric, you must train it with the Nash-Sutcliffe loss function. Using the standard "average" training method is like trying to win a chess tournament by playing checkers.
  2. Respect the Data: Don't just throw all your time series into one big bucket. Make sure the things you are comparing actually belong to the same "family" of data.
  3. The New Tool: They have provided a new mathematical tool (Nash-Sutcliffe Regression) that is easy to use and guarantees that your model is aiming for the right target.

In short: To win the game, you have to play by the rules of the game, not the rules of a different sport.
