FoSS: Modeling Long Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier State Space Integration

Imagine you are trying to predict where a car will be in the next few seconds. This is a critical job for self-driving cars. If a self-driving car guesses wrong about the vehicles around it, it could cause an accident.

The problem is that current computer models are stuck in a dilemma:

The "Super-Observer" (Transformers): These models look at everything at once and are very accurate, but they are incredibly slow and heavy, like trying to read a whole library to find one word.
The "Speed-Reader" (Recurrent Models): These are fast but often miss the big picture or get confused by complex, long-term patterns.

The paper introduces a new model called FoSS (Fourier–State Space). Think of FoSS as a two-brained detective that solves the mystery of the car's future by looking at the problem in two completely different ways at the same time.

The Two Brains of FoSS

1. The Time-Brain (The "Storyteller")

This part looks at the car's movement exactly as it happens, second by second. It's like watching a movie frame-by-frame.

What it does: It uses a special "Selective State Space" (SSM) engine. Imagine a librarian who only reads the parts of a book that are relevant to the current chapter, ignoring the rest. This allows the model to remember long-term patterns (like "this car usually turns left at this intersection") without getting overwhelmed by too much data. It's fast and efficient.

2. The Frequency-Brain (The "Music Composer")

This is the clever new trick. Instead of looking at the car's path as a line on a graph, this brain breaks the movement down into musical notes (frequencies).

The Low Notes (Bass): These represent the big picture. Is the car going straight? Is it slowing down for a stop sign? These are the "global trends."
The High Notes (Treble): These represent the tiny details. Is the car swerving slightly? Is it jittering because of a bump? These are the "local dynamics."
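
To make the bass/treble picture concrete, here is a small sketch in Python (NumPy). It is only an illustration: the toy signal, the cutoff of 4 frequency bins, and all variable names are assumptions chosen for this example, not details from the paper.

```python
import numpy as np

# Toy 1-D position trace sampled at 64 points (illustrative data only).
t = np.arange(64) / 64.0
trend = 3.0 + 2.0 * np.sin(2 * np.pi * t)      # slow, large-scale motion
jitter = 0.1 * np.sin(2 * np.pi * 20 * t)      # fast, small wobble
signal = trend + jitter

# Move to the frequency domain (real-input FFT: bins run from low to high).
spectrum = np.fft.rfft(signal)

cutoff = 4  # hypothetical split point between "bass" and "treble"
low_spec = spectrum.copy()
low_spec[cutoff:] = 0.0              # keep only the low notes
high_spec = spectrum - low_spec      # everything that was removed

low = np.fft.irfft(low_spec, n=signal.size)    # recovers the slow trend
high = np.fft.irfft(high_spec, n=signal.size)  # recovers the fast wobble
```

Because the Fourier transform is linear, `low + high` adds back up to the original signal, so nothing is lost by splitting it this way.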

The Problem with Music: Usually, when you break a song into notes, the notes do not come out in an order that runs cleanly from low to high. It's hard for a computer to learn from a sequence in which a bass drum sits right next to a cymbal crash for no meaningful reason.

The FoSS Solution (HelixSort): The authors invented a "HelixSort" module. Imagine a spiral staircase. They take all the musical notes and arrange them neatly: Low notes at the bottom, high notes at the top.

Now, the computer can listen to the "bass" first to understand the general direction, and then listen to the "treble" to understand the fine details. It's like reading a book from the beginning to the end, rather than jumping around randomly.

How They Work Together

Once both brains have done their job, they meet in the middle:

The Meeting (Cross-Attention): The "Storyteller" (Time) and the "Composer" (Frequency) compare notes. They ask, "Does the big picture match the tiny details?" If they agree, the prediction becomes very strong.
The Crystal Ball (Multimodal Prediction): The car might turn left, or it might go straight. FoSS doesn't just guess one path; it generates multiple possible futures (like a weather forecast saying "70% chance of rain, 30% chance of sun").
The Final Decision: It weighs these possibilities and gives the most likely path, while also telling the car, "I'm pretty sure about this," or "I'm a bit unsure, be careful."

Why is this a Big Deal?

It's Fast: It needs about 22% less computation than competing models and produces a prediction in about 64 milliseconds on a desktop GPU. This matters for real-time driving, where milliseconds count.
It's Light: It has over 40% fewer parameters, so it needs less memory. This means it could run on smaller, cheaper computers inside cars, not just large servers.
It's Accurate: In tests on real driving data (Argoverse), it predicted car movements more accurately than the earlier methods it was compared against, including for long-term predictions (looking 6 seconds ahead).

The Analogy Summary

Imagine you are trying to predict the path of a dancer.

Old models either stare at every single foot movement (too slow) or just guess the general dance style (too vague).
FoSS is like a choreographer who listens to the music (the rhythm and tempo = Frequency) to know the general style of the dance, while simultaneously watching the dancer's steps (Time) to see exactly where they are moving next. By organizing the music from slow beats to fast beats, the choreographer can predict the dance moves accurately, quickly, and with little wasted effort.

In short, FoSS combines the best of "looking at the big picture" and "watching the details" to make self-driving cars safer, faster, and smarter.

1. Problem Statement

Accurate trajectory prediction is critical for safe autonomous driving, particularly in dense, multi-agent environments. Existing approaches face a fundamental trade-off between modeling power and computational efficiency:

Recurrent Models (RNNs/LSTMs): Struggle to capture long-range dependencies and often suffer from vanishing gradients.
Transformer-based Models: Achieve high accuracy via self-attention but incur quadratic computational complexity ($O(N^2)$) in the sequence length, making them difficult to deploy in resource-constrained real-time systems.
Frequency-Domain Methods: While Fourier transforms can separate global trends from local dynamics, standard outputs lack ordered frequency semantics, making them difficult for sequence models to process effectively.

The core challenge is to design a model that simultaneously captures long-range temporal dependencies, fine-grained local dynamics, and multimodal uncertainty while maintaining linear computational complexity ($O(N)$).
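
As a rough illustration of the gap between the two complexity classes, the sketch below counts the dominant operations for each approach. The function names are hypothetical, and the counts ignore constant factors such as hidden size and number of layers.

```python
def attention_pair_count(n: int) -> int:
    """Pairwise score computations in full self-attention over n time steps."""
    return n * n


def scan_step_count(n: int) -> int:
    """State updates in a recurrent or state-space scan over n time steps."""
    return n


for n in (50, 100, 200):
    print(n, attention_pair_count(n), scan_step_count(n))
# 50 2500 50
# 100 10000 100
# 200 40000 200
```

Doubling the sequence length quadruples the attention count but only doubles the scan count, which is the trade-off the problem statement refers to.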

2. Methodology: The FoSS Framework

The authors propose FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling (Selective State Space Models - SSMs).

A. Dual-Branch Architecture

The framework processes historical trajectory data through two parallel branches:

Frequency-Domain Branch (FD-Mamba):
- Decomposition: Applies a Discrete Fourier Transform (DFT) to decompose trajectories into Amplitude (encoding global motion trends/intent) and Phase (encoding local variations/dynamics).
- HelixSort (Progressive Helix Reordering): Addresses the lack of ordered frequency semantics in standard DFT. Inspired by JPEG's zigzag encoding, it reorders spectral coefficients from low-frequency (global) to high-frequency (local) based on spectral radius. This creates a structured "artificial temporal axis" suitable for sequential processing.
- Selective SSM Submodules:
  - Coarse2Fine-SSM: Processes the reordered amplitude and phase sequences to refine spatial interactions, moving from coarse global trends to fine local details.
  - SpecEvolve-SSM: Operates on the channel dimension to model inter-channel correlations and spectral evolution.
- Complexity: Both submodules operate with $O(N)$ complexity, avoiding the quadratic cost of attention mechanisms.
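
The decomposition and reordering steps above can be sketched roughly as follows. This is not the paper's HelixSort: the exact reordering rule is not given in this summary, so the sketch simply sorts DFT bins by absolute frequency as a stand-in. The function names `spectral_features` and `reconstruct` are hypothetical.

```python
import numpy as np


def spectral_features(traj: np.ndarray):
    """Split a (T, 2) trajectory into amplitude and phase, ordered low to high.

    Returns (amplitude, phase, order), where `order` is the permutation that
    sorts DFT bins from lowest to highest absolute frequency.
    """
    spectrum = np.fft.fft(traj, axis=0)        # (T, 2) complex coefficients
    freqs = np.fft.fftfreq(traj.shape[0])      # signed frequency of each bin
    order = np.argsort(np.abs(freqs), kind="stable")
    amplitude = np.abs(spectrum)[order]
    phase = np.angle(spectrum)[order]
    return amplitude, phase, order


def reconstruct(amplitude: np.ndarray, phase: np.ndarray, order: np.ndarray):
    """Undo the reordering and invert the DFT to recover the trajectory."""
    sorted_spectrum = amplitude * np.exp(1j * phase)
    spectrum = np.empty_like(sorted_spectrum)
    spectrum[order] = sorted_spectrum          # put bins back in DFT order
    return np.fft.ifft(spectrum, axis=0).real


# Toy 8-step (x, y) history, illustrative only.
steps = np.arange(8.0)
traj = np.stack([steps, np.sin(steps)], axis=1)
amplitude, phase, order = spectral_features(traj)
print(order)  # [0 1 7 2 6 3 5 4]
```

The printed permutation shows why reordering is needed at all: the raw DFT output places negative frequencies at the end of the array, so low-frequency bins 1 and 7 sit far apart until they are sorted next to each other. The reordering is a permutation, so it loses no information.
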
Time-Domain Branch (TD-Mamba):
- Uses an input-dependent Selective SSM to model the raw trajectory sequence directly.
- Dynamically adjusts the state-space parameters ($A, B, C, D$) based on the current input and local convolutional features. This allows the model to adaptively capture long-range dependencies and suppress noise, approximating the content-dependent behavior of self-attention but with linear complexity.
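
A minimal NumPy sketch of an input-dependent (selective) state-space scan is shown below. It follows the general recipe popularized by Mamba, where a step size and the $B$ and $C$ projections are computed from the current input while $A$ and $D$ are fixed learned parameters. How FoSS parameterizes each matrix is not specified in this summary, and the local convolution is omitted. The weight names (`W_delta`, `W_B`, `W_C`) and all sizes are assumptions.

```python
import numpy as np


def softplus(z):
    """Smooth positive function, used to keep the step size above zero."""
    return np.logaddexp(0.0, z)


def selective_scan(x, A, W_delta, W_B, W_C, D):
    """Run a selective SSM over x of shape (T, d) with a single pass over T.

    A: (d, n) negative decay rates, D: (d,) skip weights,
    W_delta: (d, d), W_B and W_C: (d, n) input projections.
    """
    T, d = x.shape
    h = np.zeros_like(A)                         # hidden state, (d, n)
    y = np.empty_like(x)
    for t in range(T):
        delta = softplus(x[t] @ W_delta)         # (d,) input-dependent step
        B_t = x[t] @ W_B                         # (n,) input-dependent B
        C_t = x[t] @ W_C                         # (n,) input-dependent C
        A_bar = np.exp(delta[:, None] * A)       # discretized decay, (d, n)
        B_bar = delta[:, None] * B_t[None, :]    # discretized input map
        h = A_bar * h + B_bar * x[t][:, None]    # state update
        y[t] = h @ C_t + D * x[t]                # readout plus skip path
    return y


rng = np.random.default_rng(0)
T, d, n = 50, 4, 8                               # illustrative sizes
x = rng.standard_normal((T, d))
A = -np.exp(rng.standard_normal((d, n)))         # negative for stability
W_delta = 0.1 * rng.standard_normal((d, d))
W_B = 0.1 * rng.standard_normal((d, n))
W_C = 0.1 * rng.standard_normal((d, n))
D = np.ones(d)
y = selective_scan(x, A, W_delta, W_B, W_C, D)   # shape (T, d)
```

The loop touches each time step once, so the cost grows linearly with the sequence length. Practical implementations replace the Python loop with a parallel scan.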

B. Fusion and Prediction

Cross-Attention Fusion: A cross-attention layer fuses the temporal features (from TD-Mamba) and spectral features (from FD-Mamba). This resolves feature-scale mismatches and integrates global priors with local dynamics.
Multimodal Decoding: Learnable query vectors interact with the fused features via cross-attention to generate $K$ candidate trajectories.
Uncertainty Modeling: A weighted fusion head predicts the final trajectory and its associated uncertainty, allowing for multimodal future prediction.
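
The three steps above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: attention is single-head with no learned projections, all weights are random stand-ins for learned parameters, the sizes are arbitrary, and the uncertainty is taken as the confidence-weighted spread of the candidates, which is one plausible choice rather than the paper's stated formula.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Each query row takes a softmax-weighted average of the context rows."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context


rng = np.random.default_rng(0)
T, F, d, K, H = 50, 26, 16, 6, 60        # illustrative sizes only
temporal = rng.standard_normal((T, d))   # stand-in for TD-Mamba features
spectral = rng.standard_normal((F, d))   # stand-in for FD-Mamba features

# 1. Fusion: temporal features query the spectral features.
fused = temporal + cross_attention(temporal, spectral)        # (T, d)

# 2. Multimodal decoding: K mode queries attend to the fused features.
mode_queries = rng.standard_normal((K, d))   # learnable in a real model
mode_feats = cross_attention(mode_queries, fused)             # (K, d)
W_traj = 0.1 * rng.standard_normal((d, H * 2))
candidates = (mode_feats @ W_traj).reshape(K, H, 2)           # K futures

# 3. Weighted fusion: confidence per mode, fused path, and its spread.
w_conf = rng.standard_normal(d)
probs = softmax(mode_feats @ w_conf)                          # (K,)
final = np.einsum("k,khc->hc", probs, candidates)             # (H, 2)
sq_dist = ((candidates - final) ** 2).sum(axis=-1)            # (K, H)
spread = np.sqrt(np.einsum("k,kh->h", probs, sq_dist))        # (H,)
```

A large `spread` at a given future step indicates that the candidates disagree there, which is the kind of signal a downstream planner could treat as low confidence.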

C. Loss Function

A unified loss function is employed to constrain predictions in both domains:

Temporal Loss ($L_{time}$): L1 loss between the predicted and ground-truth trajectories.
Frequency Loss ($L_{freq}$): L1 loss between the Fourier transforms of the predicted and ground-truth trajectories, ensuring consistency in the frequency domain.
Total Loss: $L_{total} = L_{time} + \lambda L_{freq}$, where $\lambda$ weights the frequency term.
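
A minimal sketch of this combined loss in NumPy is given below. The value of `lam`, the use of an unnormalized FFT along the time axis, and taking the L1 penalty on the magnitude of the complex difference are all assumptions; the summary does not state these details. The function name `trajectory_loss` is hypothetical.

```python
import numpy as np


def trajectory_loss(pred: np.ndarray, target: np.ndarray, lam: float = 0.1):
    """Return (total, l_time, l_freq) for (H, 2) predicted and true paths."""
    # Time-domain L1: mean absolute coordinate error.
    l_time = np.abs(pred - target).mean()
    # Frequency-domain L1: mean magnitude of the spectral difference.
    spec_diff = np.fft.fft(pred, axis=0) - np.fft.fft(target, axis=0)
    l_freq = np.abs(spec_diff).mean()
    return l_time + lam * l_freq, l_time, l_freq


H = 60                                           # illustrative horizon
target = np.stack([np.linspace(0.0, 30.0, H), np.zeros(H)], axis=1)
pred = target + 0.5                              # constant 0.5 offset
total, l_time, l_freq = trajectory_loss(pred, target)
```

In this toy case both terms come out at about 0.5, because a constant offset only changes the zero-frequency bin. Errors that oscillate over time would instead show up in the higher bins.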

3. Key Contributions

Principled Integration: One of the first frameworks to integrate frequency-domain analysis (Fourier) with linear-complexity sequence modeling (SSM) for large-scale trajectory prediction.
HelixSort Mechanism: A novel progressive helix reordering module that imposes structural ordering on spectral data, enabling SSMs to process frequency information in a coarse-to-fine manner.
Dual-Branch SSM Design: Introduction of two specialized SSM submodules (Coarse2Fine-SSM and SpecEvolve-SSM) that refine spectral features with linear complexity, disentangling global trends from local dynamics.
Efficient Multimodal Prediction: A cross-attention fusion strategy that enables stable, multimodal trajectory generation with significantly reduced parameters and computational cost compared to Transformers.

4. Experimental Results

The model was evaluated on the Argoverse 1 and Argoverse 2 benchmarks.

Accuracy (State-of-the-Art):
- Argoverse 2: Achieved a minFDE6 of 1.07 m (an 11.6% improvement over SceneTransformer) and a minADE6 of 0.61 m (an 18.7% improvement). It also reduced the Miss Rate (MR6) to 0.11.
- Argoverse 1: Achieved a minADE1 of 1.67 m, outperforming LaneGCN by 13.0%.
Efficiency:
- Parameters: Reduced by >40% compared to competitors (only 4.18M parameters).
- Computation: Reduced FLOPs by 22.5% (22.1 GFLOPs).
- Latency: Achieved an average inference time of 64 ms on an NVIDIA RTX 3090, faster than HiVT (82 ms) and SceneTransformer (76 ms).
Ablation Studies: Confirmed that removing the frequency branch, the HelixSort module, or the specific SSM submodules leads to significant performance degradation, validating the necessity of each component.
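
For readers unfamiliar with the accuracy metrics, the sketch below shows how minADE, minFDE, and a miss flag can be computed for a single agent with K candidate trajectories. The 2 m miss threshold is assumed here, and benchmark toolkits may differ in details, for example in which candidate minADE is computed on. The function name is hypothetical.

```python
import numpy as np


def min_ade_fde_miss(candidates, gt, miss_threshold=2.0):
    """Best-of-K displacement metrics for one agent.

    candidates: (K, H, 2) predicted paths, gt: (H, 2) true path, in meters.
    Returns (minADE, minFDE, missed).
    """
    dists = np.linalg.norm(candidates - gt[None], axis=-1)   # (K, H)
    min_ade = dists.mean(axis=1).min()    # best average error over the path
    min_fde = dists[:, -1].min()          # best error at the final step
    return min_ade, min_fde, bool(min_fde > miss_threshold)


# Toy example: the truth moves along x; two candidates are offset in y.
gt = np.stack([np.arange(3.0), np.zeros(3)], axis=1)
candidates = np.stack([gt + [0.0, 1.0], gt + [0.0, 3.0]])
min_ade, min_fde, missed = min_ade_fde_miss(candidates, gt)
print(min_ade, min_fde, missed)  # 1.0 1.0 False
```

The Miss Rate reported for a benchmark is then the fraction of agents for which the miss flag is true.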

5. Significance

The FoSS framework represents a significant advance in motion prediction for autonomous driving by addressing the scalability bottleneck of Transformer-based models. By leveraging the complementary strengths of Fourier analysis (for separating global/local dynamics) and Selective State Space Models (for efficient long-range modeling), FoSS achieves:

Real-time Viability: Linear complexity makes it a candidate for edge devices (e.g., NVIDIA Jetson Orin).
Robustness: Effective handling of complex scenarios like U-turns and lane changes by explicitly modeling both global intent and local perturbations.
Scalability: The ability to maintain high accuracy while drastically reducing computational resources paves the way for deploying sophisticated prediction models in mass-market autonomous vehicles.

In conclusion, FoSS demonstrates that integrating spectral reasoning with modern sequence modeling is a highly effective strategy for next-generation trajectory prediction, offering a superior trade-off between accuracy, efficiency, and multimodal expressiveness.