Open-World Motion Forecasting

This paper introduces "Open-World Motion Forecasting," an end-to-end class-incremental framework that predicts future trajectories directly from camera images. It mitigates catastrophic forgetting through pseudo-labeling with vision-language models and a novel query-feature-variance-based replay strategy, enabling continual adaptation to evolving object taxonomies in real-world autonomous driving.

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

Published Wed, 11 Ma

Imagine you are teaching a robot driver how to navigate a busy city. Traditionally, we've taught these robots in a "closed classroom." We give them a textbook with pictures of cars, pedestrians, and trucks, and we say, "These are the only three things that exist. Memorize them."

But the real world is messy. Suddenly, a new type of vehicle appears: an electric scooter. Or maybe a self-driving delivery bot. In the old "closed classroom" method, the robot would be completely confused. To teach it about scooters, we'd have to throw away its old textbook, rewrite every single page with scooter pictures, and start from scratch. That's expensive, slow, and impossible to do every time a new object appears on the road.

This paper introduces a new way of teaching called Open-World Motion Forecasting. Think of it as giving the robot a living, growing encyclopedia instead of a static textbook.

Here is how the authors' solution, called OMEN, works, using simple analogies:

1. The Problem: The "Amnesia" Robot

When you try to teach a robot something new without showing it the old stuff, it suffers from Catastrophic Forgetting. It's like a student who studies for a math test, then immediately starts studying for a history test, and suddenly forgets how to do addition. The robot learns about the new "scooter" but forgets how to predict where a "car" is going.

2. The Solution: OMEN's Two Superpowers

The authors built a system that learns new things without forgetting the old things. They do this with two clever tricks:

Trick A: The "Crystal Ball" and the "Fact-Checker" (Pseudo-Labeling)

When the robot encounters a new class of object (like a scooter) for the first time, it doesn't have a perfect teacher to tell it, "That is a scooter, and it will move like this."

  • The Crystal Ball: The robot uses its own "future vision" (a 3D detection model) to guess where objects will be in the next few seconds. It essentially creates its own "practice test" answers (pseudo-labels) for the new objects.
  • The Fact-Checker (VLM): Sometimes, the robot gets overconfident and hallucinates things that aren't there (like predicting a ghost car). To stop this, they use a Vision-Language Model (VLM)—think of it as a very smart, literal-minded librarian. The robot shows the librarian a picture and says, "I think that's a scooter." The librarian looks at the image and says, "No, that's just a shadow. I don't see a scooter there." The librarian filters out the robot's bad guesses, keeping only the reliable ones to learn from.
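The fact-checking loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Detection`, `filter_pseudo_labels`, and the toy VLM below are all hypothetical names invented for this example, and a real system would query an actual vision-language model on image crops rather than look up a hard-coded table.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    label: str        # class name guessed by the 3D detector
    confidence: float  # detector's own confidence in the guess
    crop_id: int       # stand-in for the image crop showing this object

def filter_pseudo_labels(
    detections: List[Detection],
    vlm_confirms: Callable[[int, str], bool],
    min_confidence: float = 0.3,
) -> List[Detection]:
    """Keep a detector guess only if the VLM 'fact-checker' agrees
    that the labeled object is actually visible in the crop."""
    kept = []
    for det in detections:
        if det.confidence < min_confidence:
            continue  # too unsure even before fact-checking
        if vlm_confirms(det.crop_id, det.label):
            kept.append(det)  # reliable pseudo-label, safe to train on
    return kept

# Toy "VLM": pretend only crops 0 and 2 really contain the named object.
def toy_vlm(crop_id: int, label: str) -> bool:
    visible = {0: "scooter", 2: "car"}
    return visible.get(crop_id) == label

dets = [
    Detection("scooter", 0.9, 0),  # real scooter -> kept
    Detection("scooter", 0.8, 1),  # a shadow; the VLM rejects it
    Detection("car", 0.2, 2),      # below the confidence threshold
]
print([d.crop_id for d in filter_pseudo_labels(dets, toy_vlm)])  # [0]
```

Only the detection the "librarian" agrees with survives, which is exactly the filtering role the VLM plays in the analogy above.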

Trick B: The "Highlight Reel" (Experience Replay)

To stop the robot from forgetting the old stuff (like cars and pedestrians), the system needs to review old lessons. But robots have limited memory (storage), so they can't save every single video clip they've ever seen.

  • The Highlight Reel: Instead of saving random clips, OMEN looks at the robot's internal "thoughts" (the latent query features of its prediction network) and measures how much they vary. It asks: "Which of these old clips had the most diverse, complex, or unpredictable movements?"
  • It saves those specific "highlight reels" (sequences with diverse motions) and mixes them into the new training. This ensures the robot keeps practicing its old skills while learning new ones, just like a musician practicing old scales while learning a new song.
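The selection idea can be sketched as follows. This is a deliberately simplified Python illustration: `select_replay_clips` is a hypothetical name, and the scalar "feature summaries" stand in for the high-dimensional query embeddings a real model would produce; the point is only that clips whose features vary the most are the ones kept for replay.

```python
import statistics

def select_replay_clips(clip_features: dict, k: int) -> list:
    """Rank stored clips by the variance of their per-object feature
    summaries and keep the top-k most diverse ones for replay.

    clip_features maps clip id -> list of scalar feature summaries
    (one per tracked object in the clip)."""
    scored = {
        clip_id: statistics.pvariance(feats)  # higher = more diverse motion
        for clip_id, feats in clip_features.items()
        if len(feats) > 1  # variance needs at least two objects
    }
    # Sort clip ids by score, highest variance first, and keep k of them.
    return sorted(scored, key=scored.get, reverse=True)[:k]

clips = {
    "highway_cruise": [0.9, 0.9, 0.9],      # everyone moves alike
    "busy_crossing": [0.1, 0.9, 0.5, 0.3],  # very diverse motions
    "parking_lot": [0.0, 0.1, 0.0],         # almost nothing happens
}
print(select_replay_clips(clips, k=1))  # ['busy_crossing']
```

Under a fixed memory budget, this variance-based ranking spends the robot's limited storage on the clips that exercise the widest range of old skills, rather than on redundant, easy scenes.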

3. The Result: A Robot That Grows With the World

The paper tested this on massive datasets (like nuScenes and Argoverse 2) and even on a real self-driving car.

  • The Test: They taught the robot about cars first, then added pedestrians, then trucks, and so on, one by one.
  • The Outcome: Unlike other methods that forgot the cars when learning about trucks, OMEN remembered everything. It could predict where a car would go and where a new type of scooter would go, all at the same time.
  • Zero-Shot Magic: Even when they drove the robot in a completely new city (real-world testing) with cameras it had never seen before, it still worked. It didn't need to be retrained; it just applied what it learned.

The Big Picture

In the past, autonomous vehicles were like students who could only pass a test if the questions were exactly the same as the ones they studied. OMEN turns them into lifelong learners. They can adapt to new traffic rules, new types of vehicles, and new environments on the fly, without needing a massive library of pre-stored data or a complete system reboot.

It's the difference between a robot that says, "I don't know what that is, I'm confused," and a robot that says, "I've never seen a scooter before, but I know how to watch it, and I haven't forgotten how to watch cars either."