Here is an explanation of the paper using simple language, analogies, and metaphors.
The Big Idea: Teaching a Robot to Think Like a Human (and Then Like a Machine)
Imagine you are trying to teach a very smart, but very naive, robot how to manage a busy airport.
The Problem:
If you just throw the robot into the airport and say, "Figure it out," it has to watch thousands of planes take off and land to guess the rules. It might take years, and it might still make mistakes because it doesn't understand the logic of the system. It's like trying to learn chess by watching the pieces move at random, without ever being told the rules.
The Old Way (Symbolic Learning):
There is an older method where you teach the robot the rules explicitly: "If a plane is red, it goes left. If it's blue, it goes right." This is fast and accurate, but it's rigid. It can't handle new situations, like a plane that is "kind of red and kind of blue" or a situation where the rules depend on the entire history of the day (e.g., "Give the red plane a break because it's been waiting since 6 AM").
The New Way (Neural Networks/SSMs):
Modern AI (like the State-Space Models or SSMs mentioned in the paper) is like a super-learner. It can handle complex, messy, real-world data and remember long histories. But, as the paper shows, if you start this learner from scratch (randomly), it needs massive amounts of data to figure out the basic rules. It's inefficient.
The Breakthrough: The "Warm Start"
The authors of this paper discovered a magic trick. They proved that you can translate the rigid, rule-based "Symbolic" brain into the flexible, fluid "Neural" brain perfectly.
Think of it like this:
- The Symbolic Brain is a Map. It has clear roads and intersections.
- The Neural Brain is a Compass. It knows how to navigate, but it doesn't know where the roads are yet.
The paper says: "Don't let the Compass wander aimlessly. Give it the Map first, then let it refine its navigation."
They call this "Warm Starting." Instead of starting the neural network with random guesses, they load it with the "Map" (the rules learned by the old symbolic method). Then, they let the neural network learn the complex, messy details on top of that solid foundation.
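To make the "Map into Compass" translation concrete, here is a minimal sketch (a toy parity checker of my own, not the paper's actual construction): a tiny rule-based automaton can be written exactly as one-hot state vectors and linear transition matrices, which is the same algebraic form a linear recurrent/state-space layer computes. Matrices like these are exactly the kind of "Map" a warm start copies into a network's initial weights.

```python
# Toy example (assumed for illustration, not the paper's construction):
# a 2-state parity automaton encoded as exact linear transition matrices.

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

# One-hot states: [1, 0] = even number of 1s seen so far, [0, 1] = odd.
TRANSITION = {
    0: [[1, 0], [0, 1]],  # input 0: identity matrix, state unchanged
    1: [[0, 1], [1, 0]],  # input 1: swap the two states
}

def run(seq):
    h = [1, 0]  # start in the "even" state
    for x in seq:
        h = matvec(TRANSITION[x], h)  # linear recurrence, like an SSM step
    return h

print(run([1, 1, 0, 1]))  # [0, 1]: three 1s seen, so the "odd" state
```

In an actual warm start, matrices like these (plus a little noise) would initialize the trainable recurrence, so gradient descent refines the known rules instead of rediscovering them from scratch.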
The Analogy: Learning to Drive a Car
Random Initialization (The Old Neural Way):
Imagine putting a person in a car with no driving lessons, no map, and no idea what a steering wheel does. You tell them, "Drive to the store." They will crash a lot, spin in circles, and eventually (maybe) get there after driving 10,000 miles. This is what happens when you train these models from scratch. They need huge amounts of data.
Symbolic Learning (The Old Symbolic Way):
Imagine giving the person a perfect, rigid map that says "Turn left at the red house, right at the tree." They can get to the store instantly. But if the road is blocked or the red house is painted blue, they freeze. They can't adapt.
Warm Starting (The Paper's Solution):
You give the driver the Map (the symbolic rules) so they know the general route. But then, you let them drive the car themselves. Because they already know the basics, they don't crash. They can now focus on the hard parts: "Oh, there's a pothole here, I need to swerve," or "The traffic is heavy, I need to slow down."
- Result: They get to the store 2 to 5 times faster and make fewer mistakes than the person who started with no map.
Why This Matters (The "Cloud" Example)
The paper uses a real-world example of Cloud Computing (like AWS). Imagine a manager who has to decide which customer gets to use a limited number of GPUs (computer power).
- The Simple Rule: "Give everyone 25% of the power." (This is the Symbolic part).
- The Complex Reality: "Actually, Customer A has been waiting all night, and Customer B only needs a tiny bit. We need to be fair but also efficient." (This requires remembering the entire history of who asked for what).
The old symbolic methods couldn't handle the "entire history" part because it's too complex. The new neural methods could handle the history but were too slow to learn the basic fairness rules.
By Warm Starting, the researchers taught the AI the basic fairness rules first (using the symbolic method), and then let the AI learn the complex history-tracking. The result? The AI learned the complex job much faster and better than if it had tried to learn everything from scratch.
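A rough sketch of the two kinds of policy (the function names and numbers here are hypothetical, not taken from the paper): the simple symbolic rule needs only the current requests, while the history-aware refinement also needs per-customer state accumulated over time.

```python
# Hypothetical illustration (my own policies, not the paper's):
# a static symbolic rule vs. a history-aware refinement of it.

def equal_shares(requests, capacity):
    """Symbolic baseline: split capacity evenly among requesters."""
    share = capacity / len(requests)
    return {name: min(share, amount) for name, amount in requests.items()}

def history_aware(requests, capacity, waiting_time):
    """Refinement: weight each share by how long the customer has waited.
    This requires remembering per-customer history, which is the part
    the flexible neural model learns on top of the symbolic rule."""
    total_wait = sum(waiting_time[name] for name in requests)
    return {
        name: min(capacity * waiting_time[name] / total_wait, amount)
        for name, amount in requests.items()
    }

requests = {"A": 10, "B": 3}  # GPUs each customer asked for
waiting = {"A": 8, "B": 2}    # hours each has waited (the "history")
print(equal_shares(requests, 4))            # {'A': 2.0, 'B': 2.0}
print(history_aware(requests, 4, waiting))  # {'A': 3.2, 'B': 0.8}
```

The warm-start idea is that a model initialized with the first policy only has to learn the gap between the two, not the whole allocation problem.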
The Key Takeaways
- Symbolic Structure is a Superpower: The paper proves that the rigid, logical rules of old-school AI are actually a perfect "blueprint" for modern, flexible AI.
- Don't Reinvent the Wheel: You don't need to throw away the old, logical methods. Instead, use them to give the new, powerful AI a head start.
- Efficiency: By using this "Warm Start," the AI learns the same task with orders of magnitude less data. It's the difference between reading a whole library to learn a concept versus reading a single, well-written summary.
In a nutshell: The paper shows that the best way to teach a super-smart AI complex tasks is to first teach it the simple rules, and then let it figure out the rest. It combines the best of both worlds: the logic of the past and the power of the future.