V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space

Imagine you are trying to teach a robot how to stand up or balance a pole. Usually, to make sure the robot doesn't fall, engineers need a perfect, mathematical map of the robot's body, its joints, and its speed. They need to know exactly where every part is at every millisecond.

But what if you only have a camera? What if the robot can only be observed through a video feed, with no access to its internal state? This is the problem the V-MORALS paper addresses.

Here is the story of how they did it, explained with simple analogies.

1. The Problem: The "Black Box" Camera

In the old days, if you wanted to know if a robot was safe, you needed a full report card of its internal state (speed, angle, position).

The Issue: Cameras are messy. A single picture of a robot doesn't tell you if it's moving fast or slow. It's like looking at a single frame of a movie; you don't know if the car is speeding up or stopping. Plus, a picture has millions of pixels (too much data), while the robot's actual "state" is just a few numbers.
The Challenge: How do you predict if a robot will fall (fail) or stand up (succeed) just by watching a video, without knowing the robot's internal math?

2. The Solution: The "Dreaming" Robot

The authors created a system called V-MORALS. Think of it as a robot that learns to dream in a simplified world.

Instead of trying to process millions of pixels, the system does three things:

Step 1: The Silhouette Filter (The Mask)
Imagine you are looking at a busy street scene. To understand a car's movement, you don't care about the trees or the clouds. You only care about the car.
V-MORALS takes the video and turns it into a black-and-white silhouette. It strips away the background, the lighting, and the textures. It only keeps the shape of the robot. This makes the data much simpler to handle.
Step 2: The Time-Lapse Compressor (The Latent Space)
A single picture is confusing, but a short video clip tells a story.
The system takes a short sequence of these silhouettes (a few consecutive frames) and squishes it down into a tiny, abstract "thought bubble."
- Analogy: Imagine summarizing a short movie clip as a single dot on a small map. Clips of a falling robot land in one neighborhood of the map; clips of a balancing robot land in another.
- The small map is called the Latent Space, and each "thought bubble" is one point on it. The robot's complex movements are reduced to a few simple coordinates.
Step 3: The Crystal Ball (The Dynamics Network)
Once the robot is in this "thought bubble" world, the system learns the rules of motion there. It learns: "If the robot is at this point now, it will be at that point a moment later."
It creates a crystal ball that predicts the next point from the current one, entirely within this simplified world.

3. The Map: The Morse Graph

Now that the robot has a crystal ball and a simplified map, it needs to know where it can go safely.

The Analogy: Imagine a topographical map of a mountain range.
- The Valleys (Attractors): These are the places where the robot naturally settles, whether or not they are safe. One valley is "Standing Up" (Success). Another valley is "Lying on the Floor" (Failure).
- The Slopes: These show how the robot moves. If you push the robot from a certain spot, gravity pulls it into one of the valleys.
The Morse Graph: This is a flowchart the system builds. It connects the dots on the map. It draws arrows showing: "If you start here, you will end up in the 'Success' valley. If you start there, you will end up in the 'Failure' valley."

4. Why This Matters

Previously, you needed the robot's internal "soul" (its exact state data, such as joint positions and speeds) to draw this map. V-MORALS shows you can draw the map just by watching the robot move on a screen.

Real-world impact: This means we could eventually test safety for robots in the real world using just cameras, without needing to measure every joint's position and speed. It's like teaching a child to ride a bike by watching them, rather than measuring their heart rate and muscle tension.

Summary

V-MORALS is like a smart observer that:

Watches a robot via a camera.
Filters the video to see only the robot's shape.
Compresses the video into a simple 3D "thought."
Predicts the future by simulating how that "thought" moves.
Draws a map estimating which starting positions lead to success and which lead to failure.

It turns a chaotic, high-dimensional video feed into a clear, simple map of safety, allowing us to reason about a robot's behavior even when we can't see its internal state.

1. Problem Statement

Context: Reachability analysis is critical in robotics for distinguishing safe states from unsafe ones. Traditional methods (e.g., Hamilton-Jacobi reachability) scale poorly to high-dimensional systems or complex controllers due to the "curse of dimensionality."
Limitations of Existing Work:

MORALS: A recent method that uses topological tools (Morse Graphs) to estimate Regions of Attraction (ROA) in a learned latent space. However, MORALS requires full state information (e.g., joint velocities, positions), which is often unavailable in real-world scenarios where only sensor data (images) is accessible.
Visual Challenges: Using images introduces partial observability (a single frame lacks motion data) and high dimensionality. Encoding images into a latent space without losing critical dynamic information (like velocity) is difficult, and learning transitions between image-based latent vectors is ambiguous without explicit state knowledge.

Goal: The paper aims to extend the MORALS framework to operate solely on image-based trajectories (partial observability) to estimate ROAs and generate Morse Graphs, enabling safety analysis without access to the system's internal state.

2. Methodology: V-MORALS

V-MORALS learns system dynamics directly from image sequences to construct a Morse Graph and compute ROAs in a low-dimensional latent space.

A. Data Preprocessing & Representation

Binary Masking: To reduce input complexity and isolate the system from the background, raw images are converted into binary masks ($\psi$). This removes irrelevant texture and lighting data, focusing on the system's physical configuration.
Spatiotemporal Encoding: To address partial observability (lack of velocity in single frames), the method encodes sequences of images rather than single frames. A sequence of $h$ consecutive binary frames is treated as a single input unit.
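The two preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the fixed intensity threshold, the overlapping sliding windows, and the helper names `to_binary_mask` and `stack_windows` are all assumptions made here for the example.

```python
import numpy as np

def to_binary_mask(frames, threshold=0.5):
    """Map grayscale frames with values in [0, 1] to {0, 1} masks.

    A fixed threshold is an assumption; any segmentation step that
    isolates the system from the background would play the same role.
    """
    return (frames > threshold).astype(np.uint8)

def stack_windows(masks, h):
    """Group a (T, H, W) mask array into overlapping windows of h frames.

    Returns an array of shape (T - h + 1, h, H, W); each window is one
    input unit for the encoder.
    """
    T = masks.shape[0]
    if h > T:
        raise ValueError("window length h exceeds trajectory length")
    return np.stack([masks[t:t + h] for t in range(T - h + 1)])

# Toy trajectory: a bright 2x2 "robot" sliding right across a dark 4x6 image.
T, H, W = 5, 4, 6
frames = np.full((T, H, W), 0.1)
for t in range(T):
    frames[t, 1:3, t:t + 2] = 0.9

masks = to_binary_mask(frames)
windows = stack_windows(masks, h=3)
print(windows.shape)  # (number of windows, h, H, W)
```

Because each window holds several consecutive masks, the direction and speed of the sliding square can be recovered from a single input unit, which is exactly what a lone frame cannot provide.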

B. Model Architecture

The system utilizes three neural networks trained jointly:

Encoder ($E$): The encoding half of a 3D convolutional autoencoder. It processes the sequence of binary images to extract spatiotemporal features (motion, velocity) and compresses them into a low-dimensional latent vector $z \in \mathbb{R}^d$. A tanh output activation bounds each coordinate to $[-1, 1]$.
Decoder ($D$): Mirrors the encoder using 3D transposed convolutions to reconstruct the original image sequence from the latent vector. A sigmoid output activation keeps each pixel in $(0, 1)$, so the output can be read as a per-pixel mask probability and compared with the binary input.
Latent Dynamics Network ($LD$): A feedforward neural network that predicts the next latent state ($\hat{z}_{t+1}$) from the current latent state ($z_t$). It also ends in a tanh, so predictions stay inside the same bounded latent space.
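To see why the tanh on the dynamics network matters, the sketch below iterates a small feedforward map with a tanh output layer. The weights are random and untrained, and the layer sizes, the ReLU hidden layer, and the function names are assumptions for illustration only; the point is that every iterate stays inside $[-1, 1]^d$, so a fixed grid over that box covers all predicted states.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 3, 16  # latent and hidden sizes, chosen only for illustration

# Random, untrained weights: this demonstrates the output bound, not the paper's model.
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(d, hidden)), np.zeros(d)

def latent_dynamics(z):
    """One-step predictor z_t -> z_{t+1} with a tanh output layer."""
    hdn = np.maximum(0.0, W1 @ z + b1)  # ReLU hidden layer (an assumption)
    return np.tanh(W2 @ hdn + b2)       # each coordinate confined to [-1, 1]

def rollout(z0, steps):
    """Iterate the predictor and return all visited latent states."""
    states = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        states.append(latent_dynamics(states[-1]))
    return np.stack(states)

traj = rollout(np.array([0.9, -0.5, 0.2]), steps=50)
print(traj.shape, float(np.abs(traj).max()))
```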

C. Training Objectives

The model is trained using a composite loss function ($L_{total}$) with four components:

Reconstruction Loss ($L_{recon}$): Binary Cross-Entropy (BCE) between the input image sequence and the decoder's reconstruction.
Dynamics Loss ($L_{dynamics}$): Mean Squared Error (MSE) between the encoding of the actual next sequence and the latent dynamics network's prediction.
Predictive Reconstruction Loss ($L_{recon\_pred}$): BCE between the actual next image sequence and the sequence reconstructed from the predicted latent state.
Contrastive Loss ($L_{contrast}$): An addition relative to MORALS. It organizes the latent space by:
- Inter-class: Pushing apart latent vectors of successful ($Y=1$) and failed ($Y=0$) trajectories.
- Intra-class: Pulling together vectors within the same class (success or failure) to create tight clusters.
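The four terms can be written out on a toy batch as follows. This is a hedged sketch: the summary does not give the exact contrastive formulation or the loss weights, so the margin-based pairwise form, the `margin` value, and the equal weights below are assumptions, and the arrays stand in for the outputs of $E$, $D$, and $LD$ rather than coming from real networks.

```python
import numpy as np

def bce(target, pred, eps=1e-7):
    """Mean binary cross-entropy between {0,1} targets and (0,1) predictions."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def contrastive(z, y, margin=1.0):
    """Margin-based pairwise contrastive loss (one common form, assumed here).

    Same-label pairs pay their squared distance (intra-class pull);
    different-label pairs pay only when closer than `margin` (inter-class push).
    """
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    same = y[:, None] == y[None, :]
    i, j = np.triu_indices(len(z), k=1)  # each unordered pair once
    pull = dist[i, j] ** 2
    push = np.maximum(0.0, margin - dist[i, j]) ** 2
    return float(np.mean(np.where(same[i, j], pull, push)))

# Toy batch: 4 windows of shape (h=2, H=3, W=3), latent dimension d=2.
rng = np.random.default_rng(0)
x_t = rng.integers(0, 2, size=(4, 2, 3, 3)).astype(float)     # current windows
x_next = rng.integers(0, 2, size=(4, 2, 3, 3)).astype(float)  # next windows
x_rec = x_t * 0.9 + 0.05       # stand-in for D(E(x_t))
x_pred = x_next * 0.8 + 0.1    # stand-in for D(LD(E(x_t)))
z_t = np.array([[0.8, 0.7], [0.9, 0.6], [-0.7, -0.8], [-0.6, -0.9]])  # E(x_t)
z_next = z_t * 0.95            # stand-in for E(x_next)
z_pred = z_next + 0.01         # stand-in for LD(E(x_t))
y = np.array([1, 1, 0, 0])     # success / failure labels

terms = {
    "recon": bce(x_t, x_rec),
    "dynamics": mse(z_next, z_pred),
    "recon_pred": bce(x_next, x_pred),
    "contrast": contrastive(z_t, y),
}
weights = {name: 1.0 for name in terms}  # equal weights are an assumption
L_total = sum(weights[k] * terms[k] for k in terms)
print(terms, L_total)
```

With the labels as given, the two classes already form tight, well-separated clusters, so the contrastive term is small; assigning labels that cut across the clusters makes it much larger, which is the pressure that organizes the latent space during training.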

D. Morse Graph & ROA Generation

Once the latent space is learned:

Discretization: The latent space is discretized into a grid of "cells."
Transition Mapping: The dynamics network ($LD$) is used to propagate the corner points of each cell forward. A "safety bubble" (radius $\delta$) is added around the predicted points to account for prediction uncertainty.
Graph Construction: A directed graph is built over the cells, with an edge from cell A to cell B whenever the predicted image of A, inflated by $\delta$, intersects B.
Morse Graph: The graph is decomposed into Strongly Connected Components (SCCs). The recurrent SCCs (those containing at least one cycle) are collapsed into nodes (Morse Sets), forming a Directed Acyclic Graph (DAG) that represents the system's long-term behavior.
ROA Calculation: The ROA for an attractor (leaf node) is defined as the set of all initial cells that have a path leading to that attractor.
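The five steps above can be traced end to end on a one-dimensional toy problem. This is a sketch under stated assumptions, not the paper's implementation: the bistable map `f` stands in for the learned $LD$, the grid size and `DELTA` are arbitrary, and the components are found by brute-force mutual reachability (a linear-time SCC algorithm such as Tarjan's would replace it on realistic grids). In one dimension with a monotone map, the two corners bound a cell's image exactly; in higher dimensions corner propagation is only an approximation.

```python
import math

LO, HI, N, DELTA = -1.0, 1.0, 20, 0.01  # grid over [-1, 1]; values are illustrative
WIDTH = (HI - LO) / N

def f(z):
    """Stand-in for the learned latent dynamics LD (a bistable toy map)."""
    return math.tanh(3.0 * z)

def cell_of(z):
    """Index of the grid cell containing z, clamped to the grid."""
    return min(N - 1, max(0, math.floor((z - LO) / WIDTH)))

def build_cell_graph():
    """Edge i -> j when the DELTA-inflated image of cell i overlaps cell j."""
    graph = {}
    for i in range(N):
        corners = (LO + i * WIDTH, LO + (i + 1) * WIDTH)
        images = [f(c) for c in corners]
        a, b = min(images) - DELTA, max(images) + DELTA
        graph[i] = list(range(cell_of(a), cell_of(b) + 1))
    return graph

def reachable(graph, start):
    """All cells reachable from `start` in one or more steps."""
    seen, stack = set(), list(graph[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v])
    return seen

def morse_sets(graph):
    """Recurrent strongly connected components, via mutual reachability."""
    reach = {i: reachable(graph, i) for i in graph}
    recurrent = [i for i in graph if i in reach[i]]
    sets, assigned = [], set()
    for i in recurrent:
        if i not in assigned:
            comp = sorted(j for j in recurrent if j in reach[i] and i in reach[j])
            sets.append(comp)
            assigned.update(comp)
    return sets, reach

def find_attractors(sets, reach):
    """Morse sets from which no other Morse set can be reached (DAG leaves)."""
    recurrent = {c for s in sets for c in s}
    return [s for s in sets if (reach[s[0]] & recurrent) <= set(s)]

def roa(graph, reach, attractor, others=()):
    """Cells with a path into `attractor`; cells that can also reach any
    attractor listed in `others` are excluded."""
    target = set(attractor)
    avoid = {c for s in others for c in s}
    return [i for i in graph if reach[i] & target and not reach[i] & avoid]

graph = build_cell_graph()
sets, reach = morse_sets(graph)
attractors = find_attractors(sets, reach)
right = attractors[-1]
roa_any = roa(graph, reach, right)                      # any path to the attractor
roa_only = roa(graph, reach, right, others=attractors[:-1])  # no path to the other one
print(sets, attractors, roa_any, roa_only)
```

The toy map has two stable fixed points near the ends of the interval and an unstable one at zero, so the Morse graph has three nodes: a middle Morse set with edges to two leaf attractors. Cells in the middle set have paths to both attractors, which is why the sketch also reports the stricter variant that keeps only cells unable to reach the competing attractor.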

3. Key Contributions

V-MORALS Framework: The first extension of MORALS to handle partial observability using only high-dimensional image data, removing the dependency on explicit state vectors.
Spatiotemporal Latent Learning: Introduction of a 3D convolutional autoencoder architecture that captures dynamic variables (like velocity) from image sequences, addressing the ambiguity of single-frame observations.
Contrastive Latent Organization: The integration of contrastive loss to explicitly structure the latent space, ensuring distinct clustering of success and failure trajectories, which improves the accuracy of the resulting Morse Graph.
Empirical Validation: Extensive testing on four standard control benchmarks (Pendulum, CartPole, Acrobot, Humanoid) demonstrating the method's ability to learn dynamics and predict outcomes across different controllers (LQR, DDPG, SAC).

4. Experimental Results

The authors evaluated V-MORALS on four tasks with varying complexity and trajectory lengths.

Impact of Latent Dimensionality:
- Increasing the latent dimension from 2 to 3 significantly improved performance.
- Example (CartPole): F-score jumped from 0.29 (2D) to 0.81 (3D).
- Example (Humanoid): F-score improved from 0.54 (2D) to 0.84 (3D).
- Reasoning: A 2D space was insufficient to capture the complex dynamics of high-degree-of-freedom systems, leading to ambiguous Morse Graphs. A 3D space provided a richer topological representation.
Comparison with State-Based MORALS:
- When using a 2D latent space, V-MORALS (image-based) underperformed compared to the original MORALS (state-based) (e.g., Humanoid F-score: 0.54 vs 0.91).
- However, by increasing the latent dimension to 3, V-MORALS narrowed the gap significantly (Humanoid F-score: 0.84 vs 0.94), suggesting that image-based analysis can approach state-based accuracy with sufficient latent capacity.
Robustness:
- The method showed similar performance for both state-based and vision-based controllers in CartPole, suggesting that it generalizes across control policies.
- Limitation: Performance dropped significantly when Gaussian noise was added to images (F-score dropped to ~0.29), attributed to the decoder's inability to reconstruct noisy images effectively.
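For readers unfamiliar with the metric behind these numbers, the F-score is the harmonic mean of precision and recall. The sketch below uses the standard definition; treating "success" as the positive class and comparing a predicted outcome per trajectory against the true outcome is an assumption here, since the summary does not spell out the evaluation protocol.

```python
def f_score(predicted, actual, beta=1.0):
    """F-beta score for binary labels (1 = success, 0 = failure)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical outcomes for five trajectories (not data from the paper).
predicted = [1, 1, 0, 0, 1]
actual = [1, 0, 0, 1, 1]
score = f_score(predicted, actual)
print(round(score, 3))
```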

5. Significance and Future Work

Significance:

Bridging the Gap: V-MORALS enables formal safety analysis (ROA estimation) for systems where only cameras are available, a common constraint in real-world robotics.
Interpretability: It provides a visual, low-dimensional map (Morse Graph) of complex, high-dimensional system behaviors, allowing engineers to predict long-term outcomes (success/failure) without simulating every pixel.
Scalability: It offers a computationally efficient alternative to Hamilton-Jacobi methods for high-dimensional visual inputs.

Limitations & Future Directions:

Partial Observability: The method assumes images are a "relatively complete" representation. It may struggle with severe occlusion or missing dynamic variables not visible in the mask.
Binarization: The requirement to binarize images may discard subtle environmental details.
Simulation Only: Current validation is limited to simulated environments (MuJoCo). Future work aims to test on real-world robotic tasks and explore cross-embodiment transfer.

In conclusion, V-MORALS is a step toward making formal reachability analysis accessible for vision-based robotic systems, using topological tools to analyze safety in complex, high-dimensional environments.