AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

Imagine you are teaching a robot how to walk through a busy shopping mall without bumping into anyone. The robot needs to predict where people will be in the next few seconds so it can move smoothly and safely. This is the challenge of trajectory forecasting.

For a long time, robots tried to learn this by "trial and error" (like a baby learning to walk) or by following rigid mathematical rules. But these methods often fail in complex, crowded places because they can't really "understand" human behavior.

Recently, scientists started using Large Language Models (LLMs), the same AI brains behind chatbots, to help robots. The idea is: "If the AI can understand language and stories, maybe it can understand the 'story' of how people move."

However, previous attempts had a major flaw. They tried to teach the robot by having it "read" coordinates as text (e.g., writing out "7.133, 3.190" word by word). This was like trying to drive a car by reading the GPS coordinates out loud one number at a time. It was slow, inefficient, and the robot often got lost in the details.

Enter AutoTraces.

The researchers at Southeast University created a new system called AutoTraces. Here is how it works, using some simple analogies:

1. The "Special Token" Shortcut (The Magic Stamps)

Instead of making the robot write out every single number of a coordinate (which is like writing a novel just to say "turn left"), AutoTraces introduces a special stamp called <point>.

The Old Way: The robot sees a path and has to generate a long string of text: 7, ., 1, 3, 3, ,, 3, ., 1, 9... It's clunky and prone to errors.
The AutoTraces Way: The robot uses a special "stamp" token. When it sees a point on the map, it just stamps <point>. Behind the scenes, a tiny, efficient translator (an encoder) instantly converts that stamp into the exact mathematical coordinates the robot needs.

Analogy: Imagine you are sending a package.

Old Way: You write the address out letter by letter on a giant scroll.
AutoTraces Way: You just stick a pre-printed "Address Label" on the box. The delivery system (the LLM) knows exactly what to do with that label without needing to read every letter. This makes the robot much faster and more accurate.

2. The "Thinking Aloud" Mechanism (Chain-of-Thought)

Humans don't just move randomly; we have reasons. "I'm turning left because there's a crowd on the right." Previous AI models just guessed the next step without explaining why.

AutoTraces uses a technique called Chain-of-Thought (CoT). Before the robot decides where to go, it "thinks aloud" (internally).

How it works: The system automatically analyzes the video and the path, asking itself questions like: "Is the path clear? Is the person turning? Are there obstacles?"
The Magic: It doesn't need a human to write these thoughts down. Another AI helps generate these "thoughts" automatically, teaching the robot why a certain path makes sense.

Analogy: Think of a chess player.

Old AI: Moves a piece by reflex because it saw a similar pattern before, without knowing why the move is good.
AutoTraces: Like a grandmaster who pauses and says, "I'm moving here because it blocks their attack and opens a path for my queen." This deeper understanding helps it handle new, weird situations it hasn't seen before.

3. The "Storyteller" Approach (Autoregressive Generation)

Most robots predict a whole path at once (like looking at a map and drawing the whole line). If they make a mistake at the start, the whole path is wrong.

AutoTraces predicts the path one step at a time, like telling a story.

It predicts the next step.
Then it takes that new step, adds it to the story, and predicts the next step based on the new situation.
It can keep going for as long as needed (flexible length), unlike other models that are stuck predicting a fixed number of steps.

Analogy:

Old Way: Trying to guess the ending of a movie by looking at the first frame and writing the whole script at once.
AutoTraces: Watching the movie scene by scene. After every scene, it asks, "Okay, what happens next?" This allows it to adapt if the plot twists unexpectedly.

Why is this a big deal?

The paper reports that AutoTraces is more accurate, more token-efficient, and more flexible than previous methods.

It generalizes better: If you train it in a mall, it can handle a park or a subway station without needing to be retrained from scratch.
It handles long paths: It can predict where a robot should be 20 seconds from now, not just 5.
It's efficient: It needs far fewer output tokens to describe the same path (about 25 per response instead of 375 for a text-based baseline).

In a nutshell: AutoTraces teaches robots to navigate human spaces by giving them a "special vocabulary" for movement and a "thinking process" to understand social cues, allowing them to move through crowds as naturally as a human would.

Here is a detailed technical summary of the paper "AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models."

1. Problem Statement

The core challenge addressed is socially compliant trajectory forecasting for autonomous robots operating in human-populated environments (e.g., campuses, malls).

Limitations of Current Methods:
- Deep Reinforcement Learning (DRL): Relies on trial-and-error, making deployment difficult and data-inefficient.
- Imitation Learning (Transformer-based): Models like ViNT and NoMad typically predict fixed-length trajectory sequences. They struggle with long-horizon predictions and generalize poorly in open-world scenarios because they lack human-like reasoning.
- Existing LLM Approaches: Recent methods attempt to use LLMs but often treat coordinates as raw text (e.g., "[7.133, 3.190]"). This leads to token inefficiency (many tokens per coordinate) and poor spatio-temporal modeling. Furthermore, most LLM-based spatio-temporal models are non-autoregressive, generating the entire future sequence in one pass, which limits their ability to model long-term temporal dynamics and flexible-length forecasting.
The Gap: There is a need for a model that combines the reasoning capabilities of Multimodal LLMs with efficient, autoregressive generation of physical coordinates, capable of handling variable-length predictions and complex social interactions without manual annotation.

2. Methodology: AutoTraces

AutoTraces is an autoregressive vision-language-trajectory model built upon the LLaVA-Video architecture. It introduces three key technical innovations:

A. Novel Trajectory Tokenization Scheme

Instead of outputting coordinates as raw text strings, AutoTraces introduces a specialized Point Token mechanism:

Tokens: A special categorical token <point> is used to mark every waypoint (historical or future).
Point Embeddings: The numerical coordinate values $(x, y)$ are not tokenized as text but are encoded into continuous point embeddings via a lightweight Point Encoder (Transformer-style positional encoding + MLPs).
Integration: These point embeddings are seamlessly integrated into the LLM's latent space alongside visual and textual tokens.
Benefit: This preserves the LLM's native autoregressive generation mechanism while extending it to physical coordinate spaces, allowing for efficient, structured waypoint prediction.
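As a rough illustration of the Point Encoder idea, the following sketch maps one $(x, y)$ pair to a single fixed-size vector using sinusoidal features followed by a two-layer MLP. It is not the authors' implementation: the class name `PointEncoder`, the hidden size of 16, the four frequencies, and the randomly initialised weights (which stand in for learned parameters) are all assumptions made for the example.

```python
import math
import random

def sinusoidal_features(value, num_freqs=4):
    """Transformer-style positional encoding of one scalar coordinate."""
    feats = []
    for k in range(num_freqs):
        freq = 1.0 / (10000 ** (k / num_freqs))
        feats.append(math.sin(value * freq))
        feats.append(math.cos(value * freq))
    return feats

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix and zero bias (stand-in for learned parameters)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def linear(x, layer):
    """Apply y = W x + b for a (weight, bias) pair."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

class PointEncoder:
    """Hypothetical encoder: (x, y) -> one vector of size hidden_dim."""

    def __init__(self, hidden_dim=16, num_freqs=4, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * 2 * num_freqs  # sin and cos per frequency, for x and for y
        self.num_freqs = num_freqs
        self.fc1 = make_linear(in_dim, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, hidden_dim, rng)

    def __call__(self, x, y):
        feats = (sinusoidal_features(x, self.num_freqs)
                 + sinusoidal_features(y, self.num_freqs))
        return linear(relu(linear(feats, self.fc1)), self.fc2)

encoder = PointEncoder(hidden_dim=16)
embedding = encoder(7.133, 3.190)
print(len(embedding))  # the whole waypoint occupies one embedding slot
```

The point of the design is that a waypoint costs one sequence position regardless of how many digits its coordinates have, whereas text serialization spends several tokens per number.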

B. Automated Chain-of-Thought (CoT) Reasoning

To enhance the model's understanding of complex social behaviors and spatio-temporal relationships:

Automation: Instead of relying on costly manual annotations, the authors use a powerful external VLM (Qwen-VL-Max) to automatically generate CoT reasoning.
Process: The system analyzes visual observations and trajectory data (including curvature analysis) to produce structured reasoning traces (e.g., "To avoid pedestrians, the robot veers right...").
Integration: These reasoning traces are included in the input prompt during training, bridging the gap between visual perception and trajectory prediction.
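The summary does not spell out how the curvature analysis or the request to the external VLM is built, so the following is only a minimal sketch of one plausible shape. It assumes a robot-centric frame with x pointing forward and y pointing left, a hypothetical turn threshold of 0.2 radians, and hypothetical helper names (`heading_changes`, `describe_motion`, `build_cot_request`). No real VLM API is called; the function just assembles the text that would be sent alongside the video frames.

```python
import math

def heading_changes(points):
    """Signed turn angle in radians at each interior waypoint (positive = left)."""
    turns = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        h1 = math.atan2(y1 - y0, x1 - x0)
        h2 = math.atan2(y2 - y1, x2 - x1)
        # Wrap the difference into [-pi, pi) so a small turn stays small.
        delta = (h2 - h1 + math.pi) % (2 * math.pi) - math.pi
        turns.append(delta)
    return turns

def describe_motion(points, threshold=0.2):
    """Coarse verbal label for the overall turning direction of a path."""
    total = sum(heading_changes(points))
    if total > threshold:
        return "veers left"
    if total < -threshold:
        return "veers right"
    return "keeps roughly straight"

def build_cot_request(points, scene_note):
    """Assemble the text part of a reasoning request for an external VLM."""
    motion = describe_motion(points)
    return (
        f"Scene: {scene_note}\n"
        f"Trajectory summary: the robot {motion}.\n"
        "Explain step by step why this path is socially appropriate."
    )

path = [(0.0, 0.0), (1.0, 0.0), (2.0, -0.5), (3.0, -1.5)]
print(build_cot_request(path, "pedestrians approaching on the left"))
```

Giving the VLM a pre-computed motion label is one way to keep its rationale consistent with the ground-truth trajectory rather than with what it guesses from the frames alone.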

C. Two-Stage Training Strategy

The model is trained using a progressive two-stage approach:

Stage 1 (Reasoning Pre-training): The model is fine-tuned on video-text pairs with the automatically generated CoT rationales. The objective is to learn interpretable reasoning patterns and ground visual contexts into coherent logic. Only LoRA layers and the Text Head are optimized.
Stage 2 (Trajectory Forecasting): The model is specialized for trajectory prediction by integrating the Point Encoder and Point Head.
- Loss Function: Combines standard Cross-Entropy loss (for sequence structure) with a Point Loss $L_{point}$ (an L1 regression loss) to directly supervise the accuracy of the predicted coordinates.
- Autoregressive Loop: During inference, the model generates one <point> token at a time. The predicted coordinate is decoded, re-encoded by the Point Encoder, and appended to the input sequence for the next step, enabling true autoregressive generation.
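The Stage 2 loss and the inference loop can be sketched as follows. This is a toy under stated assumptions, not the paper's code: `point_weight` is an assumed hyperparameter (the summary only says the two losses are combined), `encode_point` is a trivial stand-in for the Point Encoder, and `predict_next` replaces the LLM and Point Head with constant-velocity extrapolation so that the loop runs on its own.

```python
def point_l1_loss(pred_points, target_points):
    """Mean absolute error over the x and y values of all supervised waypoints."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred_points, target_points):
        total += abs(px - tx) + abs(py - ty)
    return total / (2 * len(target_points))

def total_loss(cross_entropy, pred_points, target_points, point_weight=1.0):
    """Cross-entropy plus weighted point loss (the weighting is an assumption)."""
    return cross_entropy + point_weight * point_l1_loss(pred_points, target_points)

def encode_point(point):
    """Stand-in for the Point Encoder: the 'embedding' is just the raw pair."""
    return (float(point[0]), float(point[1]))

def predict_next(context):
    """Stand-in for LLM + Point Head: extrapolate from the last two waypoints."""
    (x0, y0), (x1, y1) = context[-2], context[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def rollout(history, num_steps):
    """Generate num_steps waypoints, feeding each prediction back as input."""
    context = [encode_point(p) for p in history]
    future = []
    for _ in range(num_steps):
        next_point = predict_next(context)        # decode one <point>
        future.append(next_point)
        context.append(encode_point(next_point))  # re-encode and append
    return future

history = [(0.0, 0.0), (0.5, 0.25)]
print(rollout(history, 3))
print(total_loss(0.3, [(1.0, 0.5)], [(1.0, 1.0)]))
```

Because `num_steps` is just a loop bound, the same model can be asked for 5 or 10 waypoints without any change to its output layer, which is what makes flexible-length forecasting possible in an autoregressive design.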

3. Key Contributions

Novel Tokenization: A trajectory tokenization scheme using <point> tokens and learnable embeddings, enabling autoregressive generation of trajectories with enhanced spatio-temporal modeling, overcoming the inefficiencies of text-based coordinate representation.
Automated CoT: An automated mechanism for generating Chain-of-Thought reasoning using multimodal LLMs, eliminating the need for manual annotation while improving the model's comprehension of complex social behaviors.
Flexible-Length Forecasting: The architecture supports variable-length predictions, allowing robots to adapt to different navigation scenarios and velocities without retraining on fixed horizons.
SOTA Performance: Demonstrates state-of-the-art accuracy in both short-term and long-horizon predictions, with superior cross-scene generalization.

4. Experimental Results

The model was evaluated on the SCAND dataset (social navigation) and tested for generalization on GoStanford (indoor) and RECON (outdoor) datasets.

Accuracy (SCAND):
- Outperformed all baselines (GNM, ViNT, NoMad, CityWalker, LLaVA-Video).
- At $T=10$ (long horizon), AutoTraces achieved an L2 error of 1.089 m, significantly outperforming the second-best (CityWalker at 1.407 m).
- Short-term ($T=5$) error was 0.674 m, surpassing GNM by ~0.18 m.
Cross-Scene Generalization:
- On unseen datasets (GoStanford/RECON), AutoTraces consistently outperformed non-autoregressive baselines and the text-only LLaVA-Video.
- On RECON (outdoor), it reduced L2 error by 30-32% compared to LLaVA-Video at long horizons.
Efficiency & Long-Horizon:
- Instruction Following: Achieved 99.92% accuracy in generating the requested trajectory length, compared to 40.34% for LLaVA-Video.
- Token Efficiency: Reduced Tokens Per Response (TPR) from 375 (LLaVA-Video) to 25 by using single-point tokens instead of text serialization.
- Data Efficiency: Achieved high performance with minimal fine-tuning (1 epoch on 1/8th of the data) compared to baselines requiring more data.
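For readers who want to reproduce this kind of comparison, here is a small sketch of the three quantities mentioned above. The exact metric definitions are assumptions: `l2_error` is taken to be the mean Euclidean distance over the first $T$ waypoints, and `length_accuracy` is taken to be the fraction of responses with the requested number of waypoints. The trajectories and length lists are made-up toy inputs; only the 375 and 25 token counts come from the reported results.

```python
import math

def l2_error(pred, target, horizon):
    """Mean Euclidean distance over the first `horizon` waypoints (assumed definition)."""
    dists = [math.dist(p, t) for p, t in zip(pred[:horizon], target[:horizon])]
    return sum(dists) / len(dists)

def length_accuracy(generated_lengths, requested_lengths):
    """Fraction of responses whose waypoint count equals the requested count."""
    hits = sum(g == r for g, r in zip(generated_lengths, requested_lengths))
    return hits / len(requested_lengths)

def reduction_factor(baseline_tokens, model_tokens):
    """How many times fewer tokens the model emits than the baseline."""
    return baseline_tokens / model_tokens

# Toy data for illustration only.
pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
target = [(1.0, 0.0), (2.0, 1.0), (3.0, 4.0)]
print(l2_error(pred, target, horizon=2))
print(length_accuracy([5, 10, 10], [5, 10, 8]))

# Reported Tokens Per Response: 375 for LLaVA-Video, 25 for AutoTraces.
print(reduction_factor(375, 25))  # 15.0
```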

5. Significance

AutoTraces represents a significant shift in robotic trajectory forecasting by successfully bridging Large Language Models with physical control tasks.

Reasoning over Regression: It moves beyond simple regression to incorporate reasoning (via CoT) about social norms and environmental constraints, leading to more human-compliant behaviors.
Scalability: The autoregressive, flexible-length design allows the model to adapt to diverse robotic platforms and dynamic environments without the rigidity of fixed-horizon models.
Efficiency: The point-tokenization scheme solves the token inefficiency problem of previous LLM approaches, making real-time, long-horizon prediction computationally feasible.
Generalization: The ability to generalize to unseen scenes with minimal retraining suggests a path toward truly general-purpose autonomous agents in complex human environments.
