Imagine you are trying to track a runaway train moving through a thick fog. You can't see the whole train at once; you only get blurry, partial glimpses from a few windows, and sometimes your binoculars are shaky. You also have a map (a mathematical model) that predicts where the train should be, but the map isn't perfect because the tracks are slippery and the wind is unpredictable.
This is the problem of Data Assimilation: combining a shaky map with blurry glimpses to produce the best possible estimate of where the train is.
Most computer programs trying to solve this act like a single, overconfident detective. They look at the clues and say, "The train is definitely at mile marker 50." But they don't tell you how sure they are. If the fog is thick, they should be saying, "The train is somewhere between mile 40 and 60," but they often just give you one number and hope for the best.
This paper introduces a new way to be a detective: Uncertainty-Aware Data Assimilation. Instead of guessing one single location, the new method guesses a whole range of possibilities, complete with a confidence score.
Here is how the paper breaks it down, using simple analogies:
1. The Old Way vs. The New Way
- The Old Way (Deterministic): Imagine a GPS that gives you one specific route. If you miss a turn, the GPS doesn't know you're lost; it just recalculates a new single route. It doesn't tell you, "Hey, traffic is bad here, so you might be 10 minutes late."
- The New Way (Variational Inference): This new method is like a GPS that says, "Based on the traffic and the fog, there's a 90% chance you're on this road, but a 10% chance you took a wrong turn." It outputs a cloud of possibilities (a Gaussian distribution) rather than a single dot. It tells you not just where the train is, but how sure it is about that location.
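To make "a cloud of possibilities" concrete, here is a minimal 1-D sketch (our illustration, not the paper's network) of combining a Gaussian prior guess (the map) with a Gaussian noisy observation (the blurry glimpse) into a Gaussian posterior, i.e. a location plus a confidence:

```python
import numpy as np

def gaussian_analysis(obs, obs_noise_std, prior_mean, prior_std):
    # Toy Bayesian update: fuse the map's guess with one blurry glimpse.
    # The learned network in the paper produces this kind of mean + spread,
    # but via a neural net rather than this closed-form rule.
    prior_var, obs_var = prior_std**2, obs_noise_std**2
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, np.sqrt(post_var)  # location AND how sure we are

mean, std = gaussian_analysis(obs=52.0, obs_noise_std=5.0,
                              prior_mean=50.0, prior_std=5.0)
print(f"train near mile {mean:.1f} +/- {std:.1f}")  # prints "train near mile 51.0 +/- 3.5"
```

Note that the posterior spread (3.5) is smaller than either input spread (5.0): two uncertain sources, combined honestly, give a more confident answer than either alone.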
2. How They Taught the Computer (The "Unsupervised" Magic)
Usually, to teach a computer to track a train, you need a "teacher" who knows exactly where the train was at every second (the "ground truth"). But in the real world, we rarely have that perfect teacher.
The authors used a clever trick called Unsupervised Learning.
- The Analogy: Imagine you are trying to learn a song by listening to a radio with static. You don't have the sheet music (the ground truth), but you know the song should sound consistent. If you hum a note that doesn't fit the melody, you know you're wrong.
- The Method: The model (called CODA) looks at the blurry glimpses and the map. It tries to guess the train's position. Then, it checks: "If I move my guess forward in time using the map, does it match the next blurry glimpse I see?" If the answer is no, it adjusts its guess. It learns by trying to make its own predictions consistent with the noisy data, without ever seeing the "real" answer.
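The self-supervised idea above can be sketched as a loss with two terms: stay close to the glimpses, and stay consistent with the map. This is our toy illustration (the `dynamics` function is an invented stand-in, not the paper's model), scoring a guessed trajectory with no ground truth anywhere:

```python
import numpy as np

def dynamics(x):
    # stand-in "map": simple known dynamics (hypothetical, for illustration)
    return 0.9 * x + 1.0

def consistency_loss(guess, noisy_obs):
    # observation mismatch: guesses should sit near the blurry glimpses
    obs_term = np.mean((guess - noisy_obs) ** 2)
    # dynamics mismatch: each guess, pushed forward by the map,
    # should land near the next guess in the sequence
    dyn_term = np.mean((dynamics(guess[:-1]) - guess[1:]) ** 2)
    return obs_term + dyn_term

# build a true trajectory and corrupt it with noise
rng = np.random.default_rng(0)
truth = np.array([5.0])
for _ in range(9):
    truth = np.append(truth, dynamics(truth[-1]))
noisy_obs = truth + rng.normal(0.0, 0.5, truth.shape)

# a self-consistent guess scores lower (better) than a shifted, wrong one
print(consistency_loss(truth, noisy_obs), consistency_loss(truth + 10.0, noisy_obs))
```

The key point: nothing in `consistency_loss` ever sees the truth directly, yet the true trajectory still scores best, which is what lets training proceed without an answer key.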
3. The "Spread" vs. The "Skill"
The paper tests how good this new detective is using two main concepts:
- Skill: How close is the guess to the truth? (Did the detective find the train?)
- Spread: How wide is the cloud of possibilities? (Did the detective admit they were unsure?)
A perfect detective has High Skill (found the train) and Perfect Spread (the cloud of uncertainty is exactly the right size).
- If the cloud is too small, the detective is overconfident (they think they know the answer, but they are wrong).
- If the cloud is too big, the detective is underconfident (they know the train is somewhere, but they are too scared to narrow it down).
The authors found that their new method creates clouds that are "well-calibrated." This means when the computer says, "I'm 95% sure the train is in this area," it is actually right 95% of the time.
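The calibration claim can be checked numerically. Below is a toy sketch (our illustration, not the paper's evaluation code) of a coverage test: if the predicted 95% intervals are well-calibrated, the truth should land inside them about 95% of the time; halving the spread makes the same predictions overconfident:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
pred_mean = np.zeros(n)
pred_std = np.ones(n)
# calibrated case: actual errors are drawn from the predicted spread
truth = rng.normal(pred_mean, pred_std)

def coverage95(truth, mean, std):
    # fraction of cases where truth falls inside mean +/- 1.96 * std
    lo, hi = mean - 1.96 * std, mean + 1.96 * std
    return np.mean((truth >= lo) & (truth <= hi))

print(coverage95(truth, pred_mean, pred_std))        # ~0.95: well-calibrated
print(coverage95(truth, pred_mean, 0.5 * pred_std))  # ~0.67: overconfident
```

A cloud that is too small shows up immediately as coverage well below the nominal 95%; a cloud that is too big would show up as coverage near 100% with intervals wider than needed.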
4. The Super-Tool: 4D-Var
The paper doesn't stop there. They took their new, smart, uncertainty-aware detective and plugged it into a massive, old-school supercomputer method called 4D-Var.
- The Analogy: Think of 4D-Var as a giant, slow-motion movie editor. It tries to reconstruct the entire movie of the train's journey by looking at a huge chunk of footage at once. It's very accurate but takes a long time to compute.
- The Innovation: Usually, this movie editor starts with a blank slate or a very rough guess. The authors used their new CODA model to give the editor a smart starting point.
- Instead of saying, "Start guessing from zero," they said, "Start with our smart cloud of possibilities."
- They even used the "uncertainty" part of the cloud to tell the editor, "Be very careful here (high uncertainty), but you can be bold there (low uncertainty)."
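The steps above can be sketched as a simplified 1-D 4D-Var-style cost function (our illustration; the real 4D-Var operates on huge state vectors with adjoint models). The "smart cloud" enters as the background term: the learned mean is the starting guess, and the learned spread sets how hard that guess pulls, exactly the "be careful here, be bold there" idea:

```python
import numpy as np

def dynamics(x):
    # stand-in model ("the map"); hypothetical toy dynamics
    return 0.9 * x + 1.0

def fourdvar_cost(x0, obs, obs_std, bg_mean, bg_std):
    # background term: stay near the smart starting guess,
    # weighted by the learned confidence (small bg_std pulls harder)
    cost = ((x0 - bg_mean) / bg_std) ** 2
    # observation terms: roll the state forward through the window
    # and compare it to every blurry glimpse
    x = x0
    for y in obs:
        cost += ((x - y) / obs_std) ** 2
        x = dynamics(x)
    return cost

# crude grid search over candidate initial states
candidates = np.linspace(0.0, 10.0, 1001)
obs = [5.2, 5.6, 6.1]
costs = [fourdvar_cost(x0, obs, obs_std=0.5, bg_mean=5.0, bg_std=1.0)
         for x0 in candidates]
best = candidates[int(np.argmin(costs))]
print(f"best initial state ~ {best:.2f}")
```

Operational 4D-Var minimises this kind of cost with gradient methods over an entire time window at once; the paper's contribution is supplying `bg_mean` and `bg_std` from the learned, uncertainty-aware model instead of a crude climatological guess.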
The Result: By feeding the "smart guess" into the "slow, powerful editor," they got the best of both worlds. They could reconstruct the train's path over very long periods with much higher accuracy than before, especially when the data was very sparse or noisy.
Summary
This paper is about teaching computers to admit what they don't know.
- They built a neural network that doesn't just guess a number, but guesses a range of numbers with confidence levels.
- They taught it using only noisy data, without needing a "perfect answer key."
- They proved that when you use this "uncertainty-aware" guess to help a powerful, slow computer system, the whole system becomes much better at tracking chaotic, unpredictable things (like weather or ocean currents).
In short: Don't just give me the answer; tell me how sure you are, and I'll trust you more.