Open-World Motion Forecasting

This paper introduces "Open-World Motion Forecasting," an end-to-end class-incremental framework that predicts future trajectories directly from camera images. It mitigates catastrophic forgetting through pseudo-labeling with vision-language models and a novel query-feature-variance-based replay strategy, enabling continual adaptation to evolving object taxonomies in real-world autonomous driving.

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

Published Wed, 11 Ma

Imagine you are teaching a robot driver how to navigate a busy city. Traditionally, we've taught these robots in a "closed classroom." We give them a textbook with pictures of cars, pedestrians, and trucks, and we say, "These are the only three things that exist. Memorize them."

But the real world is messy. Suddenly, a new type of vehicle appears: an electric scooter. Or maybe a self-driving delivery bot. In the old "closed classroom" method, the robot would be completely confused. To teach it about scooters, we'd have to throw away its old textbook, rewrite every single page with scooter pictures, and start from scratch. That's expensive, slow, and impossible to do every time a new object appears on the road.

This paper introduces a new way of teaching called Open-World Motion Forecasting. Think of it as giving the robot a living, growing encyclopedia instead of a static textbook.

Here is how the authors' solution, called OMEN, works, using simple analogies:

1. The Problem: The "Amnesia" Robot

When you try to teach a robot something new without showing it the old stuff, it suffers from Catastrophic Forgetting. It's like a student who studies for a math test, then immediately starts studying for a history test, and suddenly forgets how to do addition. The robot learns about the new "scooter" but forgets how to predict where a "car" is going.

2. The Solution: OMEN's Two Superpowers

The authors built a system that learns new things without forgetting the old things. They do this with two clever tricks:

Trick A: The "Crystal Ball" and the "Fact-Checker" (Pseudo-Labeling)

When the robot encounters a new class of object (like a scooter) for the first time, it doesn't have a perfect teacher to tell it, "That is a scooter, and it will move like this."

  • The Crystal Ball: The robot uses its own "future vision" (a 3D detection model) to guess where objects will be in the next few seconds. It essentially creates its own "practice test" answers (pseudo-labels) for the new objects.
  • The Fact-Checker (VLM): Sometimes, the robot gets overconfident and hallucinates things that aren't there (like predicting a ghost car). To stop this, they use a Vision-Language Model (VLM)—think of it as a very smart, literal-minded librarian. The robot shows the librarian a picture and says, "I think that's a scooter." The librarian looks at the image and says, "No, that's just a shadow. I don't see a scooter there." The librarian filters out the robot's bad guesses, keeping only the reliable ones to learn from.
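The fact-checking loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Detection`, `filter_pseudo_labels`, and the toy VLM below are all hypothetical names invented for this example, and a real system would query an actual vision-language model on image crops rather than look up a hard-coded table.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    label: str        # class name guessed by the 3D detector
    confidence: float  # detector's own confidence in the guess
    crop_id: int       # stand-in for the image crop showing this object

def filter_pseudo_labels(
    detections: List[Detection],
    vlm_confirms: Callable[[int, str], bool],
    min_confidence: float = 0.3,
) -> List[Detection]:
    """Keep a detector guess only if the VLM 'fact-checker' agrees
    that the labeled object is actually visible in the crop."""
    kept = []
    for det in detections:
        if det.confidence < min_confidence:
            continue  # too unsure even before fact-checking
        if vlm_confirms(det.crop_id, det.label):
            kept.append(det)  # reliable pseudo-label, safe to train on
    return kept

# Toy "VLM": pretend only crops 0 and 2 really contain the named object.
def toy_vlm(crop_id: int, label: str) -> bool:
    visible = {0: "scooter", 2: "car"}
    return visible.get(crop_id) == label

dets = [
    Detection("scooter", 0.9, 0),  # real scooter -> kept
    Detection("scooter", 0.8, 1),  # a shadow; the VLM rejects it
    Detection("car", 0.2, 2),      # below the confidence threshold
]
print([d.crop_id for d in filter_pseudo_labels(dets, toy_vlm)])  # [0]
```

Only the detection the "librarian" agrees with survives, which is exactly the filtering role the VLM plays in the analogy above.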

Trick B: The "Highlight Reel" (Experience Replay)

To stop the robot from forgetting the old stuff (like cars and pedestrians), the system needs to review old lessons. But robots have limited memory (storage), so they can't save every single video clip they've ever seen.

  • The Highlight Reel: Instead of saving random clips, OMEN looks at the robot's internal "thoughts" (the latent query features of its prediction network) and measures how much they vary. It asks: "Which of these old clips had the most diverse, complex, or unpredictable movements?"
  • It saves those specific "highlight reels" (sequences with diverse motions) and mixes them into the new training. This ensures the robot keeps practicing its old skills while learning new ones, just like a musician practicing old scales while learning a new song.
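The selection idea can be sketched as follows. This is a deliberately simplified Python illustration: `select_replay_clips` is a hypothetical name, and the scalar "feature summaries" stand in for the high-dimensional query embeddings a real model would produce; the point is only that clips whose features vary the most are the ones kept for replay.

```python
import statistics

def select_replay_clips(clip_features: dict, k: int) -> list:
    """Rank stored clips by the variance of their per-object feature
    summaries and keep the top-k most diverse ones for replay.

    clip_features maps clip id -> list of scalar feature summaries
    (one per tracked object in the clip)."""
    scored = {
        clip_id: statistics.pvariance(feats)  # higher = more diverse motion
        for clip_id, feats in clip_features.items()
        if len(feats) > 1  # variance needs at least two objects
    }
    # Sort clip ids by score, highest variance first, and keep k of them.
    return sorted(scored, key=scored.get, reverse=True)[:k]

clips = {
    "highway_cruise": [0.9, 0.9, 0.9],      # everyone moves alike
    "busy_crossing": [0.1, 0.9, 0.5, 0.3],  # very diverse motions
    "parking_lot": [0.0, 0.1, 0.0],         # almost nothing happens
}
print(select_replay_clips(clips, k=1))  # ['busy_crossing']
```

Under a fixed memory budget, this variance-based ranking spends the robot's limited storage on the clips that exercise the widest range of old skills, rather than on redundant, easy scenes.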

3. The Result: A Robot That Grows With the World

The paper tested this on massive datasets (like nuScenes and Argoverse 2) and even on a real self-driving car.

  • The Test: They taught the robot about cars first, then added pedestrians, then trucks, and so on, one by one.
  • The Outcome: Unlike other methods that forgot the cars when learning about trucks, OMEN remembered everything. It could predict where a car would go and where a new type of scooter would go, all at the same time.
  • Zero-Shot Magic: Even when they drove the robot in a completely new city (real-world testing) with cameras it had never seen before, it still worked. It didn't need to be retrained; it just applied what it learned.

The Big Picture

In the past, autonomous vehicles were like students who could only pass a test if the questions were exactly the same as the ones they studied. OMEN turns them into lifelong learners. They can adapt to new traffic rules, new types of vehicles, and new environments on the fly, without needing a massive library of pre-stored data or a complete system reboot.

It's the difference between a robot that says, "I don't know what that is, I'm confused," and a robot that says, "I've never seen a scooter before, but I know how to watch it, and I haven't forgotten how to watch cars either."