The Big Problem: Learning a New Skill with Almost No Practice
Imagine you are a master chef who has spent 10 years perfecting a recipe for a specific type of cake (let's call it the "Source Cake"). You have baked thousands of them, so you know exactly how much sugar, flour, and heat is needed.
Now, imagine you need to bake a new cake for a different kitchen (the "Target System"). This new kitchen has slightly different ovens, and the flour is a tiny bit different. The problem? You only have one egg and a handful of flour left to practice with.
If you try to learn this new recipe from scratch with only one egg, you will likely fail. You won't have enough data to figure out the right ratios. This is the problem scientists face with Dynamical Systems (like chemical reactors, car engines, or weather patterns). They often have plenty of data for one machine, but very little data for a new, similar machine because collecting data is expensive, dangerous, or time-consuming.
The Solution: "Fine-Tuning" Instead of Starting Over
The authors propose a clever solution called Transfer Learning. Instead of throwing away your 10 years of experience and trying to learn the new cake from zero, you take your "Master Chef" brain and make tiny adjustments to fit the new kitchen.
In the world of Artificial Intelligence (AI), this is called Fine-Tuning. You take a pre-trained AI model (the Master Chef) and tweak its internal settings (the "weights" or "parameters") just enough to handle the new, slightly different system.
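To make "fine-tuning" concrete, here is a toy sketch in NumPy. This is my own illustration, not the paper's actual model or data: a one-parameter-pair linear model stands in for the "Master Chef," the source/target numbers are made up, and the point is only that starting from the pre-trained weights lets five samples go a long way.

```python
import numpy as np

# Toy sketch of fine-tuning, NOT the paper's actual model: a linear model
# y = w*x + b is pre-trained on plenty of "source" data, then nudged
# toward a slightly different "target" system using only 5 samples.

rng = np.random.default_rng(0)

def train(w, b, x, y, lr, steps):
    """Plain gradient descent on mean squared error."""
    for _ in range(steps):
        err = (w * x + b) - y
        w -= lr * np.mean(err * x)
        b -= lr * np.mean(err)
    return w, b

# Pre-train on the source system (y = 2.0*x + 1.0, 1000 samples).
x_src = rng.uniform(-1, 1, 1000)
w0, b0 = train(0.0, 0.0, x_src, 2.0 * x_src + 1.0, lr=0.1, steps=500)

# Fine-tune on the target system (y = 2.2*x + 0.9, only 5 samples),
# starting from the pre-trained weights instead of from scratch.
x_tgt = rng.uniform(-1, 1, 5)
y_tgt = 2.2 * x_tgt + 0.9
w1, b1 = train(w0, b0, x_tgt, y_tgt, lr=0.05, steps=200)

loss_before = np.mean(((w0 * x_tgt + b0) - y_tgt) ** 2)
loss_after = np.mean(((w1 * x_tgt + b1) - y_tgt) ** 2)
print(loss_after < loss_before)
```

Because the fine-tuning run starts near the right answer, even five samples reduce the error on the new system; training from zero on those same five points would be far less reliable.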
The Secret Weapon: The Subset Extended Kalman Filter (SEKF)
The paper introduces a special tool called the Subset Extended Kalman Filter (SEKF) to do this tweaking. To understand it, let's use a metaphor:
The GPS vs. The Compass
- Standard AI Training (Gradient Descent): Imagine you are trying to find a hidden treasure. You have a map (the data), but it's very foggy. You take a step, check the map, take another step, and check again. If the map is blurry (limited data), you might wander off a cliff (overfitting) because you trust the blurry map too much.
- The SEKF Approach: Imagine you have a Compass that points to where you already know the treasure is (the pre-trained model). The SEKF is like a smart navigator that says: "You are already very close to the right spot. Trust your compass (the old model) heavily, but if you see a tiny clue in the fog (the new data), adjust your path just a little bit."
The SEKF is special because it doesn't just guess; it calculates uncertainty. It knows, "I am 99% sure the old settings are right, so I will only change them if the new data is very convincing." This prevents the AI from "forgetting" what it already knows.
The "Subset" Part: Only Fixing What's Broken
The word "Subset" in the name is important. The SEKF is smart enough to know it doesn't need to re-calculate every single number in the AI's brain. It picks a subset of the most important numbers to tweak at any given moment. This makes the process faster and less prone to errors, like a mechanic who only tightens the specific bolts that are loose, rather than taking the whole engine apart.
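The "compass plus tiny adjustments" idea can be sketched as a single Kalman-style update. This is a hypothetical simplification written for this summary, not the authors' code: the weight values, Jacobian, and covariance numbers are invented, and a real SEKF chooses its subset adaptively rather than taking it as a fixed argument.

```python
import numpy as np

# Hypothetical sketch of one "subset" EKF-style parameter update (a
# simplification for illustration, not the authors' implementation).
# The state is the model's weight vector; a small prior covariance P
# encodes "I already trust these weights."

def sekf_step(theta, P, H, y, y_hat, R, subset):
    """Update only the weights in `subset`, weighing new evidence (noise R)
    against existing confidence (covariance P)."""
    s = np.array(subset)
    Hs = H[s]                       # sensitivity of the chosen weights
    Ps = P[np.ix_(s, s)]            # their covariance block
    S = Hs @ Ps @ Hs + R            # innovation variance (scalar measurement)
    K = Ps @ Hs / S                 # Kalman gain: small when P is small
    theta[s] += K * (y - y_hat)     # nudge only the selected weights
    P[np.ix_(s, s)] = Ps - np.outer(K, Hs @ Ps)  # confidence grows
    return theta, P

# Example: 4 pre-trained weights; only weights 1 and 3 are in the subset.
theta = np.array([1.0, 2.0, 3.0, 4.0])
P = 0.01 * np.eye(4)                # high confidence in pre-trained values
H = np.array([0.5, 1.0, 0.0, 2.0])  # how the prediction reacts to each weight
y, y_hat, R = 10.5, 10.0, 0.1       # new measurement vs. model prediction
theta, P = sekf_step(theta, P, H, y, y_hat, R, subset=[1, 3])
print(theta)                        # weights 0 and 2 are untouched
```

Notice the two knobs: shrinking P (more trust in the old model) or growing R (less trust in noisy new data) both shrink the gain K, so the weights barely move unless the evidence is convincing.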
What Did They Find? (The Results)
The researchers tested this on two things:
- A Damped Spring (a mathematical model of a bouncing spring).
- A Temperature Control Lab (a real physical device with heaters and sensors).
Here are their four main discoveries, translated into everyday terms:
1. You need very little new data.
They found that by using this "Fine-Tuning" method, they could get the new system performing well with only about 1% of the data usually required. It's like learning to drive a new car model just by driving it for 10 minutes, instead of needing 1,000 hours of practice.
2. Don't freeze the layers (The Surprise).
In computer vision (like teaching AI to recognize cats), experts usually say: "Freeze the early layers (the eyes that see shapes) and only change the last layers (the brain that names the animal)."
This paper says: That doesn't work for physics!
When adapting to a new machine, the AI needs to make tiny adjustments across its entire brain, from the "eyes" to the "brain." It's not just the final decision that changes; the whole system needs to shift slightly to accommodate the new physics.
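Mechanically, "freezing" a layer just means blocking its gradient updates. The toy snippet below (my illustration, with made-up weights and gradients, not the paper's network) contrasts the computer-vision habit with the paper's finding that every weight should stay free to move:

```python
import numpy as np

# Toy numpy sketch of the mechanism only: "freezing" = masking out
# gradient updates so those weights keep their pre-trained values.

weights = np.array([0.5, -1.2, 0.8, 2.0])    # pretend entries 0-1 are "early layers"
grads   = np.array([0.3,  0.1, -0.2, 0.4])   # gradients from the new data

# Computer-vision habit: lock the early layers, tune only the late ones.
freeze_mask = np.array([0.0, 0.0, 1.0, 1.0])
weights_frozen = weights - 0.1 * grads * freeze_mask

# The paper's finding for physics: let every weight shift slightly.
weights_full = weights - 0.1 * grads

print(weights_frozen)   # first two entries unchanged
print(weights_full)     # every entry moves a little
```

In the frozen version the early weights never adapt to the new physics; in the full version the whole network shifts by small amounts, which is what the authors found works for dynamical systems.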
3. It prevents "Overfitting" (The Memory Trap).
If you try to learn a new skill with very little data using standard methods, you tend to memorize the few examples you have instead of learning the general rule. This is called Overfitting.
The SEKF method acts like a strict teacher who says, "Don't memorize that one example; stick close to the general rules you already know." This results in a model that works better on new situations it hasn't seen before.
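The "strict teacher" has a simple mathematical form: add a penalty that pulls the new weights back toward the pre-trained ones. The numbers below are invented for illustration (this is the general prior-penalty idea, not the authors' exact loss), but the one-sample case shows why it blocks memorization:

```python
import numpy as np

# Toy illustration of the "stick close to what you know" rule: fit one
# noisy sample, but add a penalty pulling toward the pre-trained weight.
# (General prior-penalty idea, not the paper's exact loss function.)

w_pretrained = 2.0          # what the old model already believes
x, y = 1.0, 5.0             # a single (possibly noisy) new sample
lam = 10.0                  # strength of the "strict teacher" pull

# Minimize (w*x - y)^2 + lam * (w - w_pretrained)^2  -- closed form:
w_new = (x * y + lam * w_pretrained) / (x * x + lam)

w_naive = y / x             # fitting the one sample alone: pure memorization
print(w_naive, round(w_new, 2))   # 5.0 vs. a value still close to 2.0
```

The naive fit jumps all the way to 5.0 to match the single sample exactly (overfitting); the penalized fit moves only part of the way, staying near the general rule it already knew.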
4. The "How" matters less than the "Where you start."
They tried three different ways to do the tweaking (Adam, L-BFGS, and SEKF). They found that as long as you start with the pre-trained model (Fine-Tuning), it doesn't matter which tool you use to finish the job. They all ended up with a good model. However, SEKF was the best at handling the "uncertainty" of the new data.
The Bottom Line
This paper proves that if you have a smart AI model for one machine, you can easily adapt it to a similar machine even if you have almost no data for the new one.
Instead of building a new AI from scratch (which is expensive and data-hungry), you should take your existing AI, treat it as a "Bayesian Prior" (a strong starting guess), and use the Subset Extended Kalman Filter to gently nudge it toward the new reality. It's the difference between trying to learn a new language from a blank notebook versus taking a fluent speaker and teaching them just a few new slang words.