PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting

The paper introduces PPT, a scalable pretraining framework that leverages pseudo-labeled trajectories generated automatically by off-the-shelf detectors. Pretraining on this cheap, noisy data improves motion forecasting models' performance and generalization, particularly in low-data and cross-domain scenarios, while reducing reliance on costly manual annotations.

Yihong Xu, Yuan Yin, Éloi Zablocki, Tuan-Hung Vu, Alexandre Boulch, Matthieu Cord

Published 2026-02-27

Imagine you are trying to teach a self-driving car how to predict where pedestrians and other cars will go in the next few seconds. This is crucial for safety; if the car guesses wrong, it might crash.

Traditionally, to teach this skill, engineers had to hire armies of human annotators to watch hours of video and manually draw the exact path every single car took. This is like hiring a team of artists to redraw every frame of a movie by hand. It's expensive, slow, and the "rules" the artists follow often change from one city to another, making the car confused when it drives somewhere new.

Enter "PPT" (Pretraining with Pseudo-Labeled Trajectories).

Think of PPT as a revolutionary new way to train the car's brain. Instead of waiting for perfect, hand-drawn maps, PPT says: "Let's just use the raw, messy data the car's sensors see right now."

Here is how it works, broken down with simple analogies:

1. The "Messy Sketch" vs. The "Perfect Portrait"

  • The Old Way (Human Annotation): Imagine an art teacher asking students to draw a perfect portrait of a person. The teacher spends hours correcting every line to make it flawless. This is the "clean" data used in the past. It's great, but you can only get a few portraits because it takes so much time.
  • The PPT Way (Pseudo-Labels): Now, imagine you have a robot that can quickly sketch a person in seconds. The sketch isn't perfect; the nose might be slightly off, or the arm a bit crooked. But, the robot can draw millions of these sketches in the time it takes a human to draw one.
    • PPT uses off-the-shelf 3D object detectors and tracking software (the "robots") to generate these "messy sketches" of vehicle paths automatically.
    • The Magic: The authors discovered that even though these sketches are "noisy" and imperfect, they are actually better for learning than a few perfect portraits. Why? Because the mistakes teach the car to be robust. It learns that a car might drift left or right, rather than assuming it will always drive in a perfect straight line.
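In code, the "robot sketching" step looks roughly like this: link noisy per-frame detections into trajectories over time. The toy greedy nearest-neighbour tracker below is a minimal stand-in for the off-the-shelf 3D detectors and trackers the paper actually uses; the function name and the 2-metre matching radius are illustrative assumptions, not the paper's method.

```python
def build_tracks(per_frame_dets, max_dist=2.0):
    """Link per-frame detections into trajectories (pseudo-labels).

    per_frame_dets: list over time of [(x, y), ...] detected box centers.
    Returns {track_id: [(t, x, y), ...]}.
    """
    tracks = {}    # track_id -> list of (t, x, y)
    active = {}    # track_id -> last known (x, y)
    next_id = 0
    for t, dets in enumerate(per_frame_dets):
        unmatched = dict(active)
        new_active = {}
        for (x, y) in dets:
            # Match to the closest still-active track within max_dist.
            best, best_d = None, max_dist
            for tid, (px, py) in unmatched.items():
                d = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
                if d < best_d:
                    best, best_d = tid, d
            if best is None:          # no match: start a new track
                best = next_id
                next_id += 1
                tracks[best] = []
            else:
                del unmatched[best]
            tracks[best].append((t, x, y))
            new_active[best] = (x, y)
        active = new_active
    return tracks
```

The resulting trajectories are noisy (mis-detections, identity switches), and that is exactly the "messy sketch" data PPT pretrains on.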

2. The "Musical Ear" Analogy

Imagine you want to teach a musician to play jazz.

  • Old Method: You give them sheet music written by a master composer (perfect, labeled data). They practice this specific song until they are perfect at it. But if you ask them to play a different style of jazz, they freeze.
  • PPT Method: You play them thousands of hours of live jazz recordings (the "noisy" pseudo-labels). Some recordings have background noise, some have the drummer rushing, some have the singer slightly off-key.
    • By listening to all this "messy" variety, the musician learns the essence of jazz. They learn how musicians interact, how rhythms shift, and how to adapt.
    • When you finally give them a specific sheet music (the small amount of perfect labeled data) to finish the job, they learn it incredibly fast because they already understand the "feel" of the music.
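Stripped of the jazz analogy, this is the classic two-stage recipe: pretrain on abundant noisy pseudo-labels, then fine-tune on the small clean labeled set. The miniature below illustrates only the shape of that recipe with a trivial constant-velocity "model"; the `alpha` blending weight and the averaging model are illustrative assumptions, not the paper's architecture or training schedule.

```python
def fit_velocity(trajs):
    """Estimate a mean per-step displacement from a set of trajectories."""
    dx = dy = n = 0.0
    for traj in trajs:
        for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
            dx += x1 - x0
            dy += y1 - y0
            n += 1
    return (dx / n, dy / n)

def pretrain_then_finetune(pseudo_trajs, labeled_trajs, alpha=0.8):
    """PPT-style recipe in miniature: learn from plentiful noisy
    pseudo-labels first (stage 1), then adapt on the scarce clean
    labeled data (stage 2). `alpha` weights the fine-tuning stage."""
    v_pre = fit_velocity(pseudo_trajs)    # stage 1: noisy, abundant
    v_ft = fit_velocity(labeled_trajs)    # stage 2: clean, scarce
    return (alpha * v_ft[0] + (1 - alpha) * v_pre[0],
            alpha * v_ft[1] + (1 - alpha) * v_pre[1])
```

The point is the division of labour: the noisy stage supplies broad priors about how agents move, so the clean stage only has to make small corrections.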

3. The "Diversity" Superpower

One of the coolest parts of PPT is that it doesn't just use one robot to draw the sketches. It uses nine different types of 3D detectors and trackers.

  • Think of it like asking nine different people to describe the same car. One might say it's "fast," another "blue," another "slightly to the left."
  • By combining all these different, slightly conflicting descriptions, the AI learns a much richer, more flexible understanding of the world. It stops relying on one specific "truth" and learns to handle the chaos of the real world.
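Mechanically, "nine robots" just means pooling the pseudo-labels from several detector/tracker pipelines into one diverse pretraining set, so the model sees each scene described several slightly different ways. The sketch below assumes a dict-of-dicts layout for the per-detector tracks; that layout and the names are illustrative, not the paper's data format.

```python
def pool_pseudo_labels(per_detector_tracks):
    """Merge pseudo-labels from several detector/tracker pipelines
    into one flat pretraining pool.

    per_detector_tracks: {detector_name: {track_id: [(t, x, y), ...]}}
    Returns a list of (detector_name, trajectory) samples; keeping the
    source name lets you balance or filter per pipeline later.
    """
    pool = []
    for name, tracks in per_detector_tracks.items():
        for traj in tracks.values():
            pool.append((name, traj))
    return pool
```

Overlapping, slightly conflicting tracks of the same agent are kept on purpose: that disagreement is the diversity the model learns from.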

Why Does This Matter?

The paper shows that PPT is a game-changer for three main reasons:

  1. It's Cheap and Fast: You don't need to hire humans to draw paths anymore. You just run the software on existing video data.
  2. It Works with Very Little Data: If you only have 1% of the usual labeled data (like having only 10 minutes of practice instead of 10 hours), a model trained with PPT still performs amazingly well. It's like a student who learns the concepts so well they only need a tiny bit of specific practice to ace the test.
  3. It Generalizes: A car trained with PPT in Paris can drive in Tokyo or New York without getting confused. Because it learned from "messy" and diverse data, it isn't stuck on the specific rules of one city.

The Bottom Line

PPT is like teaching a self-driving car by letting it watch millions of hours of "rough draft" traffic videos instead of waiting for a few hours of "perfect" videos. It turns the "noise" and "imperfections" of raw sensor data into a superpower, making the car safer, smarter, and ready to drive anywhere in the world.
