Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series

This paper demonstrates that in financial time series forecasting, where models often achieve identical out-of-sample error, the choice of optimizer acts as a critical implicit prior that significantly alters learned functions and decision outcomes, necessitating evaluation beyond scalar loss metrics.

Federico Vittorio Cortesi, Giuseppe Iannone, Giulia Crippa, Tomaso Poggio, Pierfrancesco Beneventano

Published 2026-03-04

The Big Idea: "Same Score, Different Story"

Imagine you are a coach trying to pick the best player for your team. You have two players, Alex and Jordan. You run them through a series of drills, and they both get the exact same score: 95 out of 100.

In the world of finance and machine learning, this is what usually happens. Researchers build different AI models (like deep neural networks) to predict stock market volatility (how much prices will jump around). When they test these models, they often find that a complex AI model scores exactly the same as a simple, old-fashioned math model.

The paper's big discovery: Just because two models get the same score doesn't mean they are doing the same thing. They might be solving the puzzle in completely different ways, and that difference matters a lot when you actually try to use them to make money.


The Analogy: The Hiking Trip

Imagine you and a friend are trying to hike to the top of a mountain (the "best prediction"). You both start at the bottom and want to reach the summit with the least amount of effort (lowest "error").

  • The Landscape: The mountain is foggy and flat at the top. There isn't just one peak; there is a huge, flat plateau where many different paths lead to the same height.
  • The Hikers (The Models): You have different hiking styles (Architectures). One is a fast runner (Transformer), one is a steady walker (LSTM), and one is a simple hiker (Linear Model).
  • The Guide (The Optimizer): This is the most important part. The "Optimizer" is like the GPS or the guide telling you which direction to step next.
    • Guide A (SGD): Tells you to take small, steady, cautious steps. You might wander a bit, but you tend to stay on wide, safe paths.
    • Guide B (Adam): Tells you to sprint, slide, and take shortcuts. You move faster and might find a steeper, more direct route.
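The two "guides" correspond to the textbook update rules for SGD and Adam. Here is a minimal NumPy sketch of both (standard formulas, not the paper's training code; the hyperparameter values are illustrative defaults). The key difference: Adam rescales each coordinate by a running estimate of gradient size, so even a nearly flat direction gets a near-full-size step.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: one small step straight down the gradient."""
    return w - lr * grad

def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: rescales each coordinate using running estimates of the
    gradient's mean (m) and uncentered variance (v)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Same gradient, different step: one steep direction, one nearly flat one.
w0 = np.array([1.0, 1.0])
g = np.array([10.0, 0.001])
w_sgd = sgd_step(w0, g)
w_adam, _ = adam_step(w0, g, (np.zeros(2), np.zeros(2), 0))
print(w_sgd)   # SGD barely moves along the flat direction
print(w_adam)  # Adam takes a similar-sized step in both directions
```

On the first step, SGD moves 100x further along the steep coordinate than the flat one, while Adam moves about 0.01 in both; that per-coordinate rescaling is what makes Adam the "sprinting" guide.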

The Paper's Finding:
Even though both you and your friend end up at the exact same altitude (the same prediction error), your paths were totally different.

  • Guide A (SGD) led you to a path that is wide, flat, and stable. If the wind blows (market stress), you don't fall off.
  • Guide B (Adam) led you to a path that is narrow, steep, and full of sharp turns. It gets you there just as fast, but if the wind blows, you might slip and have to scramble back up.
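The wide-path vs. narrow-path picture is the familiar flat vs. sharp minima contrast: two solutions can sit at the same loss, but react very differently to a small shock. A toy illustration (made-up one-dimensional losses, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def flat_loss(w):
    """Wide valley: loss barely rises when w drifts from the minimum."""
    return 0.01 * w**2

def sharp_loss(w):
    """Narrow valley: same minimum at w = 0, but steep walls."""
    return 100.0 * w**2

# Both minima have loss 0 at w = 0 -- the same "altitude".
# Now let the "wind blow": small random shocks to the parameters.
noise = rng.normal(scale=0.1, size=1000)
print(flat_loss(noise).mean())   # tiny average loss increase
print(sharp_loss(noise).mean())  # orders of magnitude larger
```

Under the same perturbation, the sharp solution's loss blows up while the flat one barely moves, which is why the flat path survives "windy" (stressed) market conditions.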

In finance, this difference isn't just about the hike; it's about how often you have to stop and change your gear (trading).


Why Does This Matter? (The "Turnover" Problem)

The paper looks at what happens when you use these models to build a stock portfolio (a basket of investments).

  • The Stable Hiker (SGD): Because this model is "cautious," it doesn't change its mind often. It says, "This stock is risky," and sticks with that view for a while.
    • Result: You trade less. You pay fewer fees. Your portfolio is calm.
  • The Sprinting Hiker (Adam): Because this model is "aggressive," it reacts to tiny changes in the data. It says, "This stock is risky!" then five minutes later, "Wait, it's safe!" then "Risky again!"
    • Result: You are constantly buying and selling (high turnover). Even though your predictions are just as accurate as the stable hiker's, you are bleeding money on transaction fees and taxes because you are moving too much.
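The turnover difference can be made concrete. A common measure is the sum of absolute weight changes between rebalances; the sketch below uses made-up weights for one asset (illustrative numbers only, not the paper's data):

```python
import numpy as np

def turnover(weights):
    """Total turnover: sum of absolute weight changes across rebalances.
    `weights` has shape (n_periods, n_assets)."""
    w = np.asarray(weights)
    return np.abs(np.diff(w, axis=0)).sum()

# One asset over five rebalances. Both models hover around the same
# average view (~0.5), so their "scores" could look identical.
stable = [[0.50], [0.52], [0.51], [0.53], [0.52]]   # "cautious" model
jumpy  = [[0.50], [0.80], [0.30], [0.75], [0.25]]   # "aggressive" model

print(turnover(stable))  # ~0.06 -- little trading, few fees
print(turnover(jumpy))   # ~1.75 -- same average view, ~30x the trading
```

Every unit of turnover costs transaction fees and taxes, so two models with identical prediction error can deliver very different net returns.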

The Paper's Conclusion:
In the financial world, the "Guide" (Optimizer) is actually part of the model. You can't just pick the model with the best score. You have to ask: "Does this model's 'personality' (cautious vs. aggressive) fit my goals?"

If you want a stable portfolio, you might choose the "cautious" optimizer even if it has the exact same score as the "aggressive" one. If you pick the aggressive one, you might end up with a portfolio that turns over 3 times faster, eating up your profits.


Key Takeaways in Plain English

  1. Don't Trust the Scoreboard Alone: In finance, many different AI models get the same "grade." But getting an 'A' doesn't mean they are all doing the same job.
  2. The "Who" Matters as Much as the "What": It's not just about what the model predicts, but how it learns to predict it. The tool used to train the model (the Optimizer) leaves a hidden fingerprint on the final result.
  3. Simplicity is Often Better: The paper found that the simplest training method (SGD) often creates models that are more stable and less "jumpy." In the noisy world of the stock market, a steady model is often worth more than a fancier one that scores the same.
  4. The "Rashomon Effect": This is a reference to Akira Kurosawa's famous film, in which several witnesses give conflicting but individually plausible accounts of the same event. In finance, many different models tell different "stories" (make different predictions) about the market, yet all end up with the same error score. The paper argues we need to ask which story makes the most sense for our money, not just which story has the best score.

The Bottom Line

When building AI for money, don't just look at the final grade. Look at the personality of the model. If two models get the same score, pick the one that behaves more like a steady, reliable friend rather than a nervous, jittery one, because that friend will save you money in the long run.
