SODAs: Sparse Optimization for the Discovery of Differential and Algebraic Equations

Imagine you are a detective trying to figure out the rules of a complex game just by watching people play it. You have a video camera recording their every move, but you don't know the rulebook. Your goal is to write down the exact laws that govern how the game works.

This is exactly what scientists do when they try to understand complex systems like chemical reactions, power grids, or even swinging pendulums. They have data (the video), but they need to find the math (the rulebook).

For a long time, scientists have used a method called SINDy (Sparse Identification of Nonlinear Dynamics). Think of SINDy as a detective who assumes every single player in the game is actively moving, and who tries to find a simple equation of motion for each one.

The Problem:
In many real-world systems, not everyone is "moving" in the same way. Some players are actually just anchors or constraints.

The Anchor: Imagine a pendulum. The string has a fixed length. The bob (the weight) can swing, but it must stay exactly that distance from the center. It can't just float away. This is an algebraic constraint.
The Moving Part: The swinging motion itself is the differential equation.

Old methods tried to force the "anchor" (the fixed string length) to act like a "moving part." To do this, they tried to mathematically eliminate the anchor from the equations. This is like trying to describe a game of chess by only looking at the King and ignoring the fact that the board has 64 squares and specific rules about how pieces move. It makes the math incredibly messy, full of fractions and complex terms, and very sensitive to noise (static in the video).

The Solution: SODAs
The authors of this paper introduced a new method called SODAs (Sparse Optimization for the Discovery of Differential and Algebraic Equations).

Here is how SODAs works, using a simple analogy:

1. The "Two-Step" Detective

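Instead of trying to solve the whole puzzle at once, SODAs splits the investigation into two main steps, finding the anchors and then finding the motion, with a cleanup step in between. That gives three phases: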

Phase 1: Find the Anchors (Algebraic Finder)
First, the detective looks at the data to find the "rules that don't change." It asks: "Which variables are locked together?"
- Analogy: If you see a video of a car driving on a track, SODAs first notices, "Hey, the car is always on the track. The distance from the center is fixed." It identifies this as a hard rule (an algebraic equation) without worrying about how fast the car is going.
- Why this helps: It finds the "glue" holding the system together.
Phase 2: Clean the Mess (Library Refinement)
Once the anchor is found, SODAs realizes that the "moving parts" and the "anchors" are mathematically redundant. If you know the car is on the track, you don't need to list every possible position the car could be in; you just need to know it's on the track.
- Analogy: Imagine you have a library of 1,000 books to find the rulebook. You realize 500 of them are just copies of the same rule written in different ways. SODAs throws those 500 copies away. This makes the remaining library much smaller and easier to search. It removes "noise" and confusion.
Phase 3: Find the Motion (Dynamic Finder)
Now that the library is clean and the anchors are identified, the detective looks at the remaining data to figure out how the system moves.
- Analogy: Now that we know the car is on the track, we can easily figure out the engine rules (how fast it accelerates, how it turns) without getting confused by the track's shape.

2. Why is this better?

It handles "Static" better: Old methods got confused when data was slightly noisy (like a shaky camera). SODAs is far less affected by noise when finding the "anchors" because it doesn't need to calculate speed or acceleration for that step, and those calculations are what amplify the noise. It just looks for relationships between the measured quantities.
It keeps the physics real: By keeping the "anchors" as separate equations, the final model looks like the real physical world. It tells you, "This part is fixed, and this part moves," rather than a messy, unrecognizable math soup.
It needs less data: Because it cleans up the library first, it doesn't need hundreds of thousands of data points to find the answer. It can work with smaller, noisier datasets.

Real-World Examples from the Paper

The authors tested SODAs on three very different "games":

Chemical Reactions (The Kitchen):
Imagine mixing ingredients. Some ingredients (enzymes) are used up and regenerated instantly, acting like a fixed rule. SODAs figured out which ingredients were the "fixed rules" (conservation laws) and which were the "cooking process" (dynamics), even when the data was noisy.
Power Grids (The City):
In a city's power grid, electricity must balance perfectly at every node (Kirchhoff's laws). These are the "anchors." SODAs looked at data from power lines and successfully mapped out the entire network topology (who is connected to whom) by first finding these balance rules, then figuring out how the generators swing.
Swinging Pendulums (The Playground):
They filmed a single pendulum and a chaotic double pendulum. The raw video showed the weight moving in X and Y coordinates (a messy circle). SODAs looked at the X and Y data and realized, "Wait, $X^2 + Y^2$ is always constant!" It discovered the hidden rule that the string length is fixed. Once it found that rule, it could easily translate the messy X/Y data into a simple angle (polar coordinates) and write down the simple equation for how the pendulum swings.

The Bottom Line

SODAs is a smarter way to reverse-engineer nature. Instead of trying to force a square peg into a round hole (treating fixed constraints as moving variables), it first identifies the fixed constraints, cleans up the math, and then solves for the movement. It's like realizing that to understand a dance, you first need to know the floor plan (the constraints) before you try to describe the dancers' steps.

1. Problem Statement

Complex dynamical systems in physics, biology, and engineering are often governed by Differential-Algebraic Equations (DAEs). These systems integrate ordinary differential equations (ODEs) with algebraic constraints representing conservation laws, physical limits, or quasi-steady-state approximations (QSSA).
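
As a concrete illustration (a textbook example, not taken from the paper), a semi-explicit DAE separates differential variables $y$ from algebraic variables $z$. The planar pendulum in Cartesian coordinates has this shape. The symbols $q_1, q_2$ (horizontal and vertical position), $\lambda$ (a Lagrange multiplier proportional to the rod tension), $m$, $g_0$ and $l$ are notation assumed here for the sketch:

$$
\begin{aligned}
\dot{y} &= f(y, z, t), \\
0 &= g(y, z, t),
\end{aligned}
\qquad \text{for example} \qquad
\begin{aligned}
m\,\ddot{q}_1 &= -\lambda\, q_1, \\
m\,\ddot{q}_2 &= -\lambda\, q_2 - m\, g_0, \\
0 &= q_1^2 + q_2^2 - l^2 .
\end{aligned}
$$

The last line is the algebraic constraint: it contains no time derivative, yet every trajectory has to satisfy it.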

Existing data-driven model discovery methods, such as SINDy (Sparse Identification of Nonlinear Dynamics), face significant limitations when applied to DAEs:

Reduction Assumption: Most current approaches assume DAEs can be reduced to ODEs by eliminating algebraic variables before discovery. This often results in complex rational functions that are difficult to optimize and lose the physical interpretability of the original constraints.
Variable Identification: Existing methods generally require prior knowledge of which variables are differential and which are algebraic. In many real-world scenarios, this distinction is unknown.
Multicollinearity: The presence of exact algebraic relationships creates perfect multicollinearity in the candidate function library, leading to numerical instability and failure in sparse regression.
Noise Sensitivity: Methods that attempt to discover implicit ODEs (e.g., Implicit-SINDy, SINDy-PI) are highly sensitive to noise, particularly when estimating derivatives or nullspaces, often requiring massive datasets (e.g., >150,000 points) to function robustly.
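
The multicollinearity problem can be seen in a few lines. The sketch below uses synthetic, noise-free points on a unit circle (an assumption made for illustration, not data from the paper) and a degree-2 polynomial library. Because $x^2 + y^2 = 1$ holds exactly, the columns `1`, `x^2` and `y^2` are linearly dependent, and the library matrix is numerically singular until one of them is dropped.

```python
import numpy as np

# Points on the unit circle satisfy the exact relation x^2 + y^2 = 1.
t = np.linspace(0.0, 2.0 * np.pi, 400)
x, y = np.cos(t), np.sin(t)

def library(x, y, drop_y2=False):
    """Degree-2 polynomial library; optionally leave out the y^2 column."""
    cols = [np.ones_like(x), x, y, x**2, x * y]
    if not drop_y2:
        cols.append(y**2)
    return np.column_stack(cols)

theta_full = library(x, y)
theta_pruned = library(x, y, drop_y2=True)

# Condition number: ratio of largest to smallest singular value.
cond_full = np.linalg.cond(theta_full)
cond_pruned = np.linalg.cond(theta_pruned)

# Near-zero singular values count the exact linear relations among columns.
s = np.linalg.svd(theta_full, compute_uv=False)
n_relations = int(np.sum(s < 1e-8 * s[0]))

print(f"full library:   cond = {cond_full:.3g}, relations found = {n_relations}")
print(f"pruned library: cond = {cond_pruned:.3g}")
```

Any regression against the full library has no unique solution, which is why sparse regression tends to fail there. The pruned library is well conditioned.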

The core challenge is to discover the explicit form of DAEs directly from time-series data without prior knowledge of the algebraic constraints or the distinction between differential and algebraic variables, while maintaining robustness to noise and preserving physical interpretability.

2. Methodology: SODAs

The authors propose SODAs (Sparse Optimization for the Discovery of Differential and Algebraic Equations), a sequential, data-driven framework that identifies algebraic and dynamic components separately. The method operates in two primary phases:

Phase 1: Algebraic Relation Finder (Iterative Refinement)

This phase identifies algebraic constraints ( $G(y, z, t) = 0$ ) without using time derivatives, thereby avoiding noise amplification associated with differentiation.
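
The noise amplification that this phase avoids is easy to reproduce. The following sketch uses a synthetic sine signal with assumed values for the sampling step and noise level (none of them from the paper) and compares the error in the measured state with the error in its finite-difference derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.01                                   # sampling step (assumed)
t = np.arange(0.0, 10.0, dt)
sigma = 0.01                                # measurement noise level (assumed)
x_noisy = np.sin(t) + sigma * rng.normal(size=t.size)

# np.gradient uses central differences in the interior of the array.
dx_est = np.gradient(x_noisy, dt)

state_error = np.std(x_noisy - np.sin(t))   # error in the measured state
deriv_error = np.std(dx_est - np.cos(t))    # error in the estimated derivative
amplification = deriv_error / state_error

print(f"state error {state_error:.4f}, derivative error {deriv_error:.4f}, "
      f"ratio {amplification:.1f}")
```

For central differences on independent noise, the ratio is roughly $1/(\sqrt{2}\,\Delta t)$, so finer sampling makes the raw derivative noisier, not cleaner. A step that works only with the state measurements avoids this penalty.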

Library Construction: A candidate library $\Theta$ of functions (e.g., polynomials, trigonometric terms) is constructed from the state variables.
Iterative Sparse Regression: For each term $\theta_l$ in the library, the algorithm performs a sparse regression against all other terms in the library: $\theta_l(x) \approx \sum_{j \neq l} p_j \theta_j(x)$ .
Selection & Refinement:
- The relationship with the highest fit score (e.g., $R^2$ ) is selected as the most prominent algebraic constraint.
- Complexity Scoring: Terms within the discovered relationship are assigned a complexity score (e.g., polynomial degree). The term with the highest complexity is selected for removal.
- Library Pruning: The selected term and all its multiples (which create degenerate relationships) are removed from the library. This step iteratively reduces multicollinearity.
Stopping Criteria: The process continues until the condition number of the library stabilizes or a pre-specified number of constraints ( $K$ ) is found. Singular Value Decomposition (SVD) is used to monitor the nullspace and determine when algebraic constraints have been fully identified.
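
The steps above can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' DaeFinder implementation: the data are a synthetic stand-in for tracked pendulum positions (rod length 2), the helper names `stlsq` and `r_squared` are hypothetical, the constant column is skipped as a regression target because its $R^2$ is undefined, and only a single iteration of the find-and-prune loop is shown.

```python
import numpy as np

def stlsq(A, b, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: solve, zero small terms, refit."""
    coef = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        active = np.abs(coef) >= threshold
        coef = np.zeros_like(coef)
        if not active.any():
            break
        coef[active] = np.linalg.lstsq(A[:, active], b, rcond=None)[0]
    return coef

def r_squared(b, b_hat):
    return 1.0 - np.sum((b - b_hat) ** 2) / np.sum((b - b.mean()) ** 2)

# Synthetic positions on a circle of radius 2, lightly noised (illustrative only).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 20.0, 2000)
angle = 2.0 * np.cos(1.5 * t)
x = 2.0 * np.sin(angle) + 0.002 * rng.normal(size=t.size)
y = -2.0 * np.cos(angle) + 0.002 * rng.normal(size=t.size)

names = ["1", "x", "y", "x^2", "x*y", "y^2"]
degree = {"1": 0, "x": 1, "y": 1, "x^2": 2, "x*y": 2, "y^2": 2}
theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

# Regress every non-constant term on all the others; keep the best-fitting relation.
best = None
for l, name in enumerate(names):
    if np.std(theta[:, l]) < 1e-12:          # constant column: R^2 undefined
        continue
    others = [j for j in range(len(names)) if j != l]
    coef = stlsq(theta[:, others], theta[:, l])
    score = r_squared(theta[:, l], theta[:, others] @ coef)
    if best is None or score > best[0]:
        terms = {names[j]: c for j, c in zip(others, coef) if c != 0.0}
        best = (score, name, terms)

score, target, relation = best

# Prune the highest-degree term in the relation (ties go to the regression target).
involved = [target] + list(relation)
to_remove = max(involved, key=lambda n: degree[n])
keep = [j for j, n in enumerate(names) if n != to_remove]
cond_before = np.linalg.cond(theta)
cond_after = np.linalg.cond(theta[:, keep])

print(f"{target} ~ {relation}  (R^2 = {score:.5f})")
print(f"removed {to_remove}: cond {cond_before:.3g} -> {cond_after:.3g}")
```

The recovered relation expresses one squared coordinate as a constant minus the other, which is the constraint $x^2 + y^2 = l^2$ rearranged. In a full run the loop would repeat on the pruned library until the stopping criterion is met. Note that the fixed threshold depends on how the library columns are scaled, which this sketch does not normalize.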

Phase 2: Dynamic System Finder

Once algebraic constraints are identified and the library is refined:

Variable Assignment: State variables are classified as differential or algebraic based on the discovered constraints, domain knowledge, or measurement quality.
ODE Discovery: Standard sparse regression methods (e.g., SINDy with LASSO and sequential thresholding) are applied to the remaining library terms to discover the differential equations ( $F(y, \dot{y}, z, t) = 0$ ).
Noise Mitigation: Since the algebraic step is derivative-free, the dynamic step can utilize established techniques (e.g., Savitzky-Golay filters, weak formulations) to handle noisy derivatives specific to the differential variables.
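
A minimal sketch of this phase is given below. It combines Savitzky-Golay differentiation (`scipy.signal.savgol_filter`) with sequentially thresholded least squares on a small polynomial library. The test system is a damped linear oscillator chosen because its exact solution is known. The system, the noise level, the filter window and the threshold are all assumptions made for illustration. In a DAE setting, the library passed to the regression would be the refined one left over from Phase 1.

```python
import numpy as np
from scipy.signal import savgol_filter

def stlsq(A, b, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: solve, zero small terms, refit."""
    coef = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        active = np.abs(coef) >= threshold
        coef = np.zeros_like(coef)
        if not active.any():
            break
        coef[active] = np.linalg.lstsq(A[:, active], b, rcond=None)[0]
    return coef

# Known test system: dx/dt = -0.1 x + 2 y,  dy/dt = -2 x - 0.1 y.
rng = np.random.default_rng(0)
dt = 0.01
t = np.arange(0.0, 20.0, dt)
x = np.exp(-0.1 * t) * np.cos(2.0 * t) + 1e-3 * rng.normal(size=t.size)
y = -np.exp(-0.1 * t) * np.sin(2.0 * t) + 1e-3 * rng.normal(size=t.size)

# Savitzky-Golay: local cubic fits give smoothed states and their derivatives.
window, order = 31, 3
x_s = savgol_filter(x, window, order)
y_s = savgol_filter(y, window, order)
dx = savgol_filter(x, window, order, deriv=1, delta=dt)
dy = savgol_filter(y, window, order, deriv=1, delta=dt)

trim = slice(window, -window)   # drop edge samples, where the filter is less accurate
names = ["1", "x", "y", "x^2", "x*y", "y^2"]
theta = np.column_stack(
    [np.ones_like(x_s), x_s, y_s, x_s**2, x_s * y_s, y_s**2]
)[trim]

coef_x = stlsq(theta, dx[trim])
coef_y = stlsq(theta, dy[trim])

for label, coef in (("dx/dt", coef_x), ("dy/dt", coef_y)):
    terms = [f"{c:+.3f} {n}" for n, c in zip(names, coef) if c != 0.0]
    print(label, "=", " ".join(terms))
```

Only the `x` and `y` columns should survive the thresholding for each equation, with coefficients close to the values used to generate the data.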

3. Key Contributions

Sequential Discovery: A novel framework that decouples the identification of algebraic constraints from differential dynamics, eliminating the need to reduce DAEs to ODEs a priori.
Automatic Variable Identification: The method does not require prior knowledge of which variables are algebraic; it infers them through the discovery of conservation laws and QSSA relationships.
Multicollinearity Mitigation: By iteratively removing terms involved in algebraic relationships, SODAs significantly improves the conditioning of the candidate library, solving a major numerical instability issue in DAE discovery.
Derivative-Free Algebraic Discovery: By identifying algebraic constraints without calculating time derivatives, the method is inherently more robust to measurement noise than implicit ODE discovery methods.
Open-Source Implementation: The authors provide DaeFinder, a Python package implementing the SODAs algorithm.

4. Results

The authors validated SODAs on three distinct domains:

Chemical Reaction Networks (CRNs):
- Successfully rediscovered conservation laws and QSSA approximations for enzyme-mediated reactions (Michaelis-Menten kinetics).
- Demonstrated robustness to up to 15% Gaussian noise with significantly fewer data points (5 initial conditions, ~1200 points each) compared to Implicit-SINDy or SINDy-PI, which failed even at 1% noise with similar data volumes.
- Showed that data requirements scale with library complexity but are manageable for low-degree polynomials.
Power Grid Dynamics:
- Applied to IEEE-4, IEEE-9, and IEEE-39 benchmark systems.
- Successfully inferred network topology (admittance matrix) and power flow equations (algebraic constraints) from phase angle and power injection data.
- Achieved 100% recovery of algebraic and dynamic components at Signal-to-Noise Ratios (SNR) of 30 dB and 40 dB.
- Demonstrated that failing to separate algebraic constraints leads to incorrect dynamic models, even at high SNR.
Non-linear and Chaotic Pendulums:
- Applied to pixel data extracted from video footage of single and double pendulums.
- Successfully identified the geometric constraints ( $x^2 + y^2 = l^2$ ) directly from Cartesian coordinates, effectively discovering the reduced polar coordinate system.
- For the single pendulum, the method recovered the correct ODE in the transformed coordinate system. For the double pendulum, it identified the algebraic constraints, though full dynamic recovery required more data or more advanced weak formulations.
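
To make the pendulum coordinate change concrete: once the constraint $x^2 + y^2 = l^2$ is known, the two Cartesian coordinates collapse to a single angle. The sketch below uses synthetic positions with the pivot at the origin and the $y$ axis pointing up. These conventions, the rod length and the trajectory are assumptions for illustration, not the paper's video data.

```python
import numpy as np

l = 2.0                                               # rod length (assumed)
true_angle = 0.8 * np.cos(1.5 * np.linspace(0.0, 10.0, 500))
x = l * np.sin(true_angle)                            # horizontal position
y = -l * np.cos(true_angle)                           # vertical position, y up

# The discovered constraint gives the rod length directly from the data.
l_est = np.sqrt(np.mean(x**2 + y**2))

# Angle measured from the downward vertical; one variable replaces (x, y).
angle = np.arctan2(x, -y)

print(f"estimated length: {l_est:.3f}")
print(f"max angle error:  {np.max(np.abs(angle - true_angle)):.2e}")
```

The dynamic finder can then work with `angle` alone instead of two redundant coordinates.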

5. Significance

SODAs takes a different approach to data-driven model discovery for constrained systems: it keeps the DAE structure explicit instead of reducing it away.

Interpretability: By preserving the explicit DAE structure, the resulting models retain physical meaning (e.g., conservation of mass, Kirchhoff's laws) that is often lost in reduced ODE formulations.
Efficiency: It drastically reduces data requirements compared to implicit methods, making it applicable to experimental settings where data is scarce or expensive to collect.
Robustness: The derivative-free approach to algebraic discovery makes it well suited to noisy experimental data, a common bottleneck in real-world applications.
Generalizability: The framework is applicable across diverse fields, from biochemical networks to electrical grids and mechanical systems, offering a unified approach to handling timescale separation and physical constraints.

The paper concludes that while SODAs currently requires full state observation, it provides a robust foundation for future extensions to systems with unobserved states and partial measurements.
