K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation

Imagine you are teaching a robot to drive a car through a busy city. The hardest part isn't just knowing how to press the gas or brake; it's understanding the story of the road. Is that pedestrian about to step off the curb? Is the car in front of you slowing down because its driver sees a red light, or is the driver just distracted?

Most current AI driving systems are like students who only know how to read a spreadsheet. They see a list of coordinates (x, y, z) and try to guess the next move based on math. They miss the "vibe" of the scene.

K-Gen is a new approach that teaches the AI to be more like a human co-pilot who talks to you while driving. Here is how it works, broken down into simple concepts:

1. The Problem: The "Spreadsheet" vs. The "Movie"

Existing AI models often look at the road as a vector map—a bunch of lines and numbers. It's like trying to understand a movie by reading a list of character names and timestamps. You miss the context, the emotions, and the subtle details.

K-Gen changes the game by feeding the AI two things at once:

The Picture: A bird's-eye-view image of the road (like looking at a map on your phone).
The Story: A text description of what's happening (e.g., "A blue truck is merging slowly").

By combining the visual "movie" with the textual "story," the AI gets a much richer understanding of the scene.

2. The Secret Sauce: "Thinking Before Acting"

If you asked a normal AI to draw a 50-second driving path, it might just guess the whole line at once. If it gets one part wrong, the whole path is a disaster.

K-Gen uses a technique called Chain-of-Thought (CoT). Think of it like a chess player who doesn't just move a piece; they first say out loud: "If I move here, the opponent might move there, so I should be careful."

Instead of drawing the full line immediately, K-Gen:

Reasons: It writes a short paragraph explaining why the car should move a certain way (e.g., "I will stay straight because the intersection is clear").
Dots the Dots: It picks a few Key Points (sparse dots) along the path where the car needs to make a decision or change direction.
Connects the Dots: A special "Refiner" module then smoothly connects these dots into a full, realistic driving path.

This is like an architect sketching the corners of a building first, then filling in the walls, rather than trying to paint the whole building in one messy stroke.

3. The Coach: "T-DAPO" (The Tough Teacher)

Training an AI to drive is hard. If you just show it examples, it might learn to drive safely but boringly (like driving in a straight line forever).

The authors created a special training method called T-DAPO. Imagine a driving instructor who:

Focuses on the Hard Stuff: Instead of practicing on empty highways, the AI spends its reinforcement training on the scenarios it currently gets most wrong (the 30% of samples with the highest prediction errors).
Rewards Good Thinking: The AI gets "points" not just for hitting the right spot, but for keeping its explanation concise and in the required output format.
Punishes Bad Habits: If the AI writes a huge, rambling paragraph or breaks the required output format, it loses points.

This ensures the AI learns to be both safe and smart, not just lucky.

4. The Result: A Driver You Can Trust

When tested on real-world driving data (from Waymo and nuPlan), K-Gen outperformed the baseline methods it was compared against.

It's Safer: It crashes less often in simulations.
It's Smoother: The paths it generates look more natural, like a human driving.
It's Explainable: Because it "thinks" out loud, if something goes wrong, we can read its reasoning to understand why.

The Big Picture

K-Gen is like upgrading a robot driver from a calculator to a narrator. It doesn't just calculate where to go; it understands the scene, explains its intentions, and plans its route step-by-step. By breaking the big, scary task of "driving" into smaller, understandable steps (Reasoning → Key Points → Refinement), it creates a system that is not only more accurate but also easier for humans to trust.

1. Problem Statement

In autonomous driving simulation, generating realistic and diverse traffic scenarios is critical for testing motion planning algorithms. However, existing methods face significant limitations:

Reliance on Structured Data: Most approaches depend on vectorized maps or structured agent representations. These formats abstract away rich spatial details and contextual semantics (e.g., lane markings, traffic signs, local geometry) essential for modeling complex interactions.
Lack of Interpretability and Control: While Large Language Models (LLMs) offer interpretability, direct trajectory generation by LLMs often suffers from coarse-grained motion control, physical inconsistency, and a lack of fine-grained spatial accuracy.
Generalization Issues: Methods relying on rigid intermediate representations struggle to generalize across diverse, unstructured driving environments.

The core challenge is to create a framework that leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs) to understand unstructured visual scenes (rasterized maps) while ensuring the generated trajectories are physically accurate, safe, and interpretable.

2. Methodology: K-Gen Framework

The authors propose K-Gen, a multimodal framework that unifies rasterized Bird's Eye View (BEV) map images with textual scene descriptions. Instead of predicting full trajectories directly, K-Gen employs a two-stage pipeline:

A. Keypoint-Guided Generation (MLLM)

The core of the system is an MLLM that processes:

Visual Input: Rasterized BEV map images.
Textual Input: Structured scene descriptions (agent types, positions, velocities) and system prompts.

The MLLM outputs two distinct components via a Chain-of-Thought (CoT) process:

Reasoning Tokens: Natural language explanations of agent intentions and scene analysis (e.g., "Vehicle 1 is moving southbound...").
Keypoint Tokens: Sparse, interpretable spatial keypoints (coordinates and timestamps) representing critical turning points or changes in velocity.
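
To make the two-part output concrete, here is a minimal Python sketch of how a response might be split into reasoning text and keypoints. The exact output layout is not given in this summary, so the `<point>t, x, y</point>` tag layout, the `Keypoint` class, and `parse_output` are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Keypoint:
    """One sparse keypoint: a timestamp and a 2D position (hypothetical layout)."""
    t: float
    x: float
    y: float


# Assumed tag layout: <point>t, x, y</point>. The real K-Gen format may differ.
POINT_RE = re.compile(
    r"<point>\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*</point>"
)


def parse_output(text):
    """Split a model response into (reasoning text, list of keypoints)."""
    keypoints = [Keypoint(*map(float, m.groups())) for m in POINT_RE.finditer(text)]
    first = POINT_RE.search(text)
    # Everything before the first keypoint tag is treated as the reasoning part.
    reasoning = text[: first.start()].strip() if first else text.strip()
    return reasoning, keypoints


response = (
    "Vehicle 1 is moving southbound. "
    "<point>0.0, 1.5, -2.0</point><point>1.0, 1.5, -6.0</point>"
)
reasoning, keypoints = parse_output(response)
```

Keeping the reasoning and the keypoints in separate fields means the free-text explanation can be shown to a person while the numeric keypoints are passed on for refinement.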

Data Preprocessing:

Keypoint Extraction: Uses the Douglas-Peucker algorithm for geometric simplification and kinematic thresholds for velocity changes to create a sparse set of ground-truth keypoints.
Reasoning Annotation: Uses an external LLM (Claude 3.7 Sonnet) to generate structured reasoning annotations regarding road geometry, collision risks, and intentions.
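
The keypoint extraction step can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: the Douglas-Peucker routine is the standard algorithm, while the function names, the acceleration-based form of the kinematic threshold, and the default values for `epsilon` and `accel_threshold` are assumptions.

```python
import numpy as np


def douglas_peucker(points, epsilon):
    """Return sorted indices of the points kept by Douglas-Peucker simplification."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    if n < 3:
        return list(range(n))
    keep = {0, n - 1}
    stack = [(0, n - 1)]
    while stack:
        start, end = stack.pop()
        if end - start < 2:
            continue  # no interior points left to test
        a, b = points[start], points[end]
        seg = b - a
        seg_len = np.hypot(seg[0], seg[1])
        inner = points[start + 1:end] - a
        if seg_len == 0.0:
            dists = np.hypot(inner[:, 0], inner[:, 1])
        else:
            # Perpendicular distance to the chord via the 2D cross product.
            dists = np.abs(seg[0] * inner[:, 1] - seg[1] * inner[:, 0]) / seg_len
        i = int(np.argmax(dists))
        if dists[i] > epsilon:
            split = start + 1 + i
            keep.add(split)
            stack.append((start, split))
            stack.append((split, end))
    return sorted(keep)


def speed_change_indices(points, dt, accel_threshold):
    """Indices of points where speed changes faster than accel_threshold."""
    points = np.asarray(points, dtype=float)
    speeds = np.linalg.norm(np.diff(points, axis=0), axis=1) / dt
    accel = np.abs(np.diff(speeds)) / dt
    # accel[k] compares the segments on either side of point k + 1.
    return [int(k) + 1 for k in np.nonzero(accel > accel_threshold)[0]]


def extract_keypoints(points, dt=0.1, epsilon=0.5, accel_threshold=2.0):
    """Union of geometric and kinematic keypoints as (index, time, (x, y))."""
    pts = np.asarray(points, dtype=float)
    idx = set(douglas_peucker(pts, epsilon))
    idx |= set(speed_change_indices(pts, dt, accel_threshold))
    return [(i, i * dt, (float(pts[i, 0]), float(pts[i, 1]))) for i in sorted(idx)]


# An L-shaped path at constant speed: only the corner and the endpoints survive.
path = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
corner_indices = douglas_peucker(path, epsilon=0.5)
```

The geometric pass catches turns, and the kinematic pass catches braking or acceleration on straight segments that the geometric pass alone would discard.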

B. Trajectory Refinement (TrajRefiner)

Since MLLMs may produce keypoints that are geometrically correct but kinematically infeasible, a TrajRefiner module is used:

Input: Historical trajectories, agent states, and the sparse keypoints generated by the MLLM.
Process: It performs linear interpolation to create a coarse trajectory, then uses a Transformer-based residual network to predict corrections ($\Delta Y$).
Loss Functions: The module is trained with a composite loss including Motion Loss (accuracy), Kinematic Consistency Loss (enforcing feasible velocity/heading), and Final Point Loss (endpoint precision).
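
The refinement idea can be illustrated with a short Python sketch. It is a rough approximation under stated assumptions: the Transformer-based residual network is replaced by a placeholder correction array, and the exact form and weighting of the three loss terms are not given in this summary, so the L2-based terms, the velocity-mismatch stand-in for kinematic consistency, and all function names are assumptions.

```python
import numpy as np


def interpolate_keypoints(key_times, key_xy, query_times):
    """Linearly interpolate sparse (t, x, y) keypoints onto a dense time grid."""
    key_xy = np.asarray(key_xy, dtype=float)
    x = np.interp(query_times, key_times, key_xy[:, 0])
    y = np.interp(query_times, key_times, key_xy[:, 1])
    return np.stack([x, y], axis=1)


def refine(coarse, delta):
    """Residual refinement: final trajectory = coarse trajectory + correction."""
    return coarse + delta


def composite_loss(pred, target, dt, w_motion=1.0, w_kin=1.0, w_final=1.0):
    """Assumed composite loss: motion + kinematic consistency + final point."""
    # Motion term: mean per-step position error.
    motion = np.mean(np.linalg.norm(pred - target, axis=1))
    # Kinematic term (stand-in): mismatch of finite-difference velocities,
    # which penalizes both speed and heading differences.
    pred_vel = np.diff(pred, axis=0) / dt
    target_vel = np.diff(target, axis=0) / dt
    kinematic = np.mean(np.linalg.norm(pred_vel - target_vel, axis=1))
    # Final point term: endpoint position error.
    final = np.linalg.norm(pred[-1] - target[-1])
    return w_motion * motion + w_kin * kinematic + w_final * final


key_times = [0.0, 1.0, 2.0]
key_xy = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
query_times = np.linspace(0.0, 2.0, 5)  # dense grid: 0.0, 0.5, 1.0, 1.5, 2.0
coarse = interpolate_keypoints(key_times, key_xy, query_times)

# Placeholder for the network output; a trained model would predict this.
delta = np.zeros_like(coarse)
refined = refine(coarse, delta)
```

Predicting a correction to the interpolated path, rather than the whole path, means the network only has to learn small adjustments around a trajectory that already passes through the keypoints.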

C. Training Strategy: T-DAPO

To enhance the MLLM's performance beyond standard Supervised Fine-Tuning (SFT), the authors introduce T-DAPO (Trajectory-aware Decoupled Clip and Dynamic Sampling Policy Optimization):

Hard Sample Mining: Focuses reinforcement learning on the top 30% of samples with the highest prediction errors (mADE/mFDE) to prevent the model from overfitting to easy cases.
Composite Reward Function:
- Accuracy Reward ($R_{acc}$): Based on ADE and FDE metrics.
- CoT Length Reward ($R_{cot}$): Penalizes overly verbose reasoning to encourage conciseness.
- Format Reward ($R_{fmt}$): Ensures strict adherence to output tags (e.g., <point>, <num>).
Stability: Uses a decoupled clipping mechanism to stabilize training on continuous trajectory spaces, preventing gradient oscillation.
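
Hard sample mining and the composite reward can be sketched in Python as below. Only the 30% fraction, the three reward components, and the tag names come from the summary above; the exponential accuracy form, the token budget, the presence-only tag check, the reward weights, and all function names are assumptions made for illustration.

```python
import math


def select_hard_samples(errors, fraction=0.3):
    """Return sorted indices of the `fraction` of samples with the highest error."""
    k = max(1, round(len(errors) * fraction))
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return sorted(ranked[:k])


def accuracy_reward(ade, fde, scale=2.0):
    """Assumed form: decays from 1 toward 0 as displacement errors grow."""
    return math.exp(-(ade + fde) / scale)


def cot_length_reward(n_tokens, max_tokens=256):
    """Full reward within the token budget, linearly reduced beyond it."""
    if n_tokens <= max_tokens:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - max_tokens) / max_tokens)


def format_reward(text):
    """1.0 if both assumed tags appear in the response, else 0.0."""
    return 1.0 if "<point>" in text and "<num>" in text else 0.0


def total_reward(ade, fde, n_tokens, text, weights=(1.0, 0.2, 0.2)):
    """Weighted sum of the three components; the weights are assumptions."""
    w_acc, w_cot, w_fmt = weights
    return (
        w_acc * accuracy_reward(ade, fde)
        + w_cot * cot_length_reward(n_tokens)
        + w_fmt * format_reward(text)
    )


# Per-sample errors from the supervised model; the worst 30% are kept for RL.
errors = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 1.0]
hard = select_hard_samples(errors)
```

Selecting by error rather than by scenario type means the hard set follows whatever the current model finds difficult.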

3. Key Contributions

Multimodal Integration: The authors present K-Gen as the first framework to integrate rasterized map images with textual reasoning for trajectory generation, moving beyond vectorized map limitations to capture rich visual context.
Keypoint-Guided Decomposition: By decomposing the task into "Reasoning + Keypoint Generation" followed by "Refinement," the model achieves higher accuracy and stability than direct trajectory prediction. This allows the MLLM to focus on high-level intent while the Refiner handles kinematic details.
T-DAPO Algorithm: A novel reinforcement fine-tuning algorithm tailored for trajectory generation that combines trajectory-centric rewards with dynamic sampling of hard examples, significantly improving performance in complex scenarios.

4. Experimental Results

The method was evaluated on two major datasets: WOMD (Waymo Open Motion Dataset) and nuPlan.

Quantitative Performance:
- K-Gen (8B model) achieved state-of-the-art results, outperforming baselines like LCTGen, InteractTraj, and various InternVL/Qwen models.
- WOMD: mADE: 0.915, mFDE: 2.422, Scenario Collision Rate (SCR): 0.006.
- nuPlan: mADE: 0.591, mFDE: 1.478, SCR: 0.027.
- Notably, K-Gen significantly reduced the Scenario Collision Rate (SCR), indicating superior safety compared to baselines that often overfit endpoint accuracy at the cost of safety.
Ablation Studies:
- Pipeline: Removing the TrajRefiner or the T-DAPO reinforcement stage resulted in significant performance drops, confirming the necessity of both components.
- TrajRefiner: The inclusion of Relative Coordinate Encoding (RCE) and Kinematic Consistency Loss (KCL) was crucial for reducing errors and ensuring physical feasibility.
Qualitative Analysis:
- Attention heatmaps showed the model correctly focusing on safety-critical regions (intersections, merging points, curved lanes) and interacting agents, validating its "reasoning" capability.
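
For reference, the displacement metrics reported above can be computed along these lines. This Python sketch uses the standard definitions of ADE (mean per-step displacement error) and FDE (displacement error at the final step); how the "m" in mADE and mFDE aggregates is not defined in this summary, so the averaging over agents here is an assumption, as is the function name.

```python
import numpy as np


def ade_fde(pred, target):
    """Mean ADE and FDE over agents.

    pred, target: arrays of shape (num_agents, num_steps, 2).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Per-agent, per-step Euclidean displacement, shape (num_agents, num_steps).
    dist = np.linalg.norm(pred - target, axis=-1)
    ade = dist.mean(axis=1)  # average over time steps for each agent
    fde = dist[:, -1]        # error at the last time step for each agent
    return float(ade.mean()), float(fde.mean())


# One agent whose prediction drifts by 0, 1, and 2 units over three steps.
target = np.zeros((1, 3, 2))
pred = np.array([[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]])
ade, fde = ade_fde(pred, target)
```

Because FDE looks only at the endpoint, a model can score well on it while taking an unrealistic route, which is why it is reported alongside ADE and the collision rate.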

5. Significance

According to the authors' results, K-Gen offers several practical benefits for autonomous driving simulation and trajectory prediction:

Interpretability: It provides human-readable reasoning alongside trajectory data, making the decision-making process of the AI transparent.
Robustness: By utilizing raw visual inputs rather than abstract vectorized maps, the system is better equipped to handle complex, unstructured real-world driving environments.
Safety: The combination of intent reasoning and kinematic refinement leads to trajectories that are not only accurate but also physically feasible and safe, addressing a critical gap in current generative models.
Scalability: The framework demonstrates that combining multimodal reasoning with specialized refinement modules yields better results than simply scaling up model parameters.