CAPS: Context-Aware Priority Sampling for Enhanced Imitation Learning in Autonomous Driving

Imagine you are teaching a brand-new student driver how to navigate a busy city. You have a massive video library of a perfect, professional driver's journey.

The Problem: The "Boring Commute" Trap
If you just show this student 10,000 hours of video, 9,900 of those hours will be boring: driving straight on an empty highway, stopping at a red light, or cruising down a quiet street. The student gets really good at these boring things.

But the real test isn't the boring stuff; it's the rare, scary moments: a kid running into the street, a parked car suddenly pulling out in front of you, or a sudden pile-up ahead. Because these "edge cases" happen so rarely in the video library, the student barely sees them. When they finally face one in real life, they freeze or crash because they never practiced it enough.

The Old Way: Guessing and Checking
Traditionally, to fix this, you might try to manually label the videos. "Okay, this is a 'parking cut-in' scenario, let's show this 100 times." But that takes forever and requires a human to watch every video. Or, you might try to just count how many times the car turns left vs. right, but that misses the context. Is the car turning left because it's a safe turn, or because it's swerving to avoid a crash? Simple counting can't tell the difference.

The New Solution: CAPS (The "Smart Librarian")
This paper introduces CAPS (Context-Aware Priority Sampling). Think of CAPS as a super-smart, AI-powered librarian who doesn't just shelve the books but actually understands the stories in them.

Here is how CAPS works, using a simple analogy:

1. The "Magic Decoder Ring" (VQ-VAE)

Instead of just looking at the car's path (the line it draws on the road), CAPS looks at the whole story. It watches the car, the other drivers, the traffic lights, and the road signs all at once.

It uses a special tool called a VQ-VAE (which sounds complicated, but think of it as a "Story Summarizer"). It takes a complex driving scene and compresses it into a simple ID code (like a sticker with a number on it).

Scenario A: A car slowing down because of a red light gets "Sticker #12."
Scenario B: A car slowing down because a dog is in the road gets "Sticker #45."

Even though both cars are slowing down, the "Story Summarizer" knows they are totally different situations and gives them different stickers.

2. The "Rare Book Club" (Clustering)

Once every video clip has a sticker, the librarian groups them.

Group #12: 1,000 videos of red lights (Very common).
Group #45: Only 5 videos of dogs in the road (Very rare).

In a normal class, the teacher would spend 99% of the time teaching Group #12 because there are so many examples. The student would never learn about the dog.

3. The "Priority Pass" (Re-balancing)

This is where CAPS changes the game. It treats rarity as a signal: Group #45 is small, so it is exactly the group the student has practiced least.

So, CAPS creates a Priority Pass. It tells the training computer:

"Hey, we have 1,000 examples of Red Lights and only 5 of the Dog scenario. Don't pick clips at random from the whole pile. Pick each Dog clip about 200 times as often as each Red Light clip, so the student practices both situations about equally."

It artificially boosts the importance of the rare, difficult situations so the student driver practices them until they are an expert, without needing to film millions of new videos.

The Result: A Safer Driver

The paper tested this in a high-tech driving simulator (CARLA).

Without CAPS: The AI driver was okay at normal driving but crashed often in tricky situations.
With CAPS: The AI driver became much better at handling the scary, rare moments. It didn't just get a higher score; it actually became safer and more reliable.

Why This Matters

No Extra Work: You don't need humans to watch videos and label them. The AI figures out what's important on its own.
Smarter Learning: It teaches the AI to focus on what matters (safety and rare events) rather than what is easy (driving straight).
Scalable: As self-driving cars generate terabytes of data, we can't store or process everything. CAPS helps us pick out the "diamonds" in the rough and ignore the "dirt."

In a nutshell: CAPS is a smart filter that stops self-driving cars from over-practicing the boring stuff and forces them to master the dangerous, rare situations that keep us safe.

Here is a detailed technical summary of the paper "CAPS: Context-Aware Priority Sampling for Enhanced Imitation Learning in Autonomous Driving."

1. Problem Statement

The paper addresses a critical bottleneck in Imitation Learning (IL) for autonomous driving: data imbalance.

The Issue: Expert demonstration datasets are heavily skewed toward trivial scenarios (e.g., straight cruising, stopping at signs) which are easy for rule-based planners to handle. Conversely, rare but critical "edge cases" (e.g., parking cut-ins, sudden stops, near-accidents) are underrepresented.
The Consequence: Models trained on uniformly sampled data overfit to common behaviors and fail to generalize to rare, high-risk situations. In closed-loop evaluation, a single failure in an edge case can lead to catastrophic consequences.
Limitations of Existing Solutions:
- Manual Labeling: Too costly and subjective; criteria vary by task (planning vs. prediction).
- Rule-based Clustering (e.g., KNN on trajectories): Fails to capture context. For example, it cannot distinguish between decelerating for a red light versus decelerating to avoid a collision.
- Data Augmentation/SMOTE: Often relies on offline clustering or manual intervention, which does not scale well.

2. Methodology: CAPS Framework

The authors propose Context-Aware Priority Sampling (CAPS), a two-stage framework that leverages Vector Quantized Variational Autoencoders (VQ-VAE) to automatically identify and prioritize informative samples without human labeling.

Core Architecture

The system integrates a Context Encoder and a Trajectory Decoder with a Clustering Module.

Context Encoder: Uses VectorNet to process scene information, including the ego vehicle's past/future states ( $s_{ego}$ ), surrounding objects ( $s_{obj}$ ), and map context ( $c$ ).
VQ-VAE Clustering: Instead of continuous latent variables, VQ-VAE maps the ego's embedding ( $z_{ego}$ ) to one entry of a discrete set of latent codes (a codebook).
- The encoder maps the scene context to an embedding.
- The embedding is quantized to the nearest codebook vector ( $e_k$ ), assigning a Cluster ID to the sample.
- This process ensures that samples with similar contextual patterns (not just trajectory shapes) are grouped together.
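
The quantization step above can be illustrated with a minimal NumPy sketch. It assumes a standard VQ-VAE nearest-neighbor lookup under Euclidean distance; the function name `assign_cluster_ids`, the toy codebook, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import numpy as np

def assign_cluster_ids(z_ego: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each embedding row, the index k of its nearest codebook vector e_k."""
    # Pairwise differences between N embeddings and K codes: shape (N, K, D).
    diffs = z_ego[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distances: shape (N, K).
    dists = np.sum(diffs ** 2, axis=-1)
    # The Cluster ID is the index of the closest code.
    return np.argmin(dists, axis=1)

# Toy example (hypothetical values): 3 codes and 3 embeddings in 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z_ego = np.array([[0.1, -0.2], [0.9, 1.2], [4.0, 6.0]])
cluster_ids = assign_cluster_ids(z_ego, codebook)
print(cluster_ids)  # [0 1 2]
```

In a trained model the embeddings would come from the context encoder, so two samples share a Cluster ID only when their encoded context is close, not merely when their trajectories look alike.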

Two-Stage Training Process

Stage I (Representation Learning):
- The planner and the VQ-VAE are trained jointly.
- The planner acts as a generative model to reconstruct the ego trajectory.
- The VQ-VAE learns to encode the scene context into discrete latent codes.
- Output: A trained VQ-VAE model capable of assigning a Cluster ID to any new data sample based on its context.
Stage II (Priority Sampling & Planning):
- The trained VQ-VAE is used to assign Cluster IDs to the entire training dataset.
- Re-balancing: Sampling weights are calculated based on the inverse of cluster frequency. Rare clusters (edge cases) receive higher weights.
- The planner is re-trained using these weighted samples, forcing the model to focus on underrepresented but critical scenarios.
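
The re-balancing step in Stage II can be sketched as follows. This is a minimal illustration that assumes plain inverse-frequency weights normalized into sampling probabilities; the paper may use a different normalization or smoothing. The function name and the toy cluster sizes (borrowed from the 1,000 vs. 5 analogy earlier in this article) are illustrative only.

```python
import numpy as np

def inverse_frequency_weights(cluster_ids: np.ndarray) -> np.ndarray:
    """Per-sample sampling probabilities proportional to 1 / (size of the sample's cluster)."""
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    # counts[inverse] gives, for each sample, how many samples share its cluster.
    weights = 1.0 / counts[inverse]
    # Normalize so the weights form a probability distribution over samples.
    return weights / weights.sum()

# Toy dataset: 1,000 samples in cluster 12 and 5 samples in cluster 45.
cluster_ids = np.array([12] * 1000 + [45] * 5)
probs = inverse_frequency_weights(cluster_ids)

# Draw one training batch of sample indices according to the priority weights.
rng = np.random.default_rng(0)
batch = rng.choice(len(cluster_ids), size=64, replace=True, p=probs)
```

With these weights each cluster receives the same total probability mass (0.5 each in the toy case), so a rare sample is drawn 200 times as often as a common one.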

3. Key Contributions

Novel Framework (CAPS): Introduces a method to learn context-aware representations of expert demonstrations for class-balanced training, moving beyond simple trajectory-based clustering.
Automated Edge Case Identification: Utilizes VQ-VAE to automatically cluster scenarios based on rich contextual information (agents, map, ego state), eliminating the need for costly manual labeling.
Decoupled Training: Separates the representation learning (Stage I) from the planner optimization (Stage II), so that clustering quality does not depend on the downstream planner's loss function.
State-of-the-Art Performance: Demonstrates that context-aware sampling significantly outperforms rule-based strategies (endpoint/anchor clustering) and other baselines.

4. Experimental Results

The method was evaluated in the CARLA Leaderboard 2.0 simulator using the Bench2Drive benchmark (220 short-segment scenarios).

Performance Metrics

Driving Score: Route progress discounted by penalties for infractions (collisions, off-road driving, rule violations); higher is better.
Success Rate: Percentage of scenarios completed without critical failure.
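
As a rough illustration of how these two metrics aggregate over scenarios, the sketch below assumes the CARLA Leaderboard convention, in which each route's completion percentage is multiplied by an infraction penalty in (0, 1] and the products are averaged. The per-route numbers, including the 0.6 penalty, are made up for the example and are not results from the paper.

```python
def driving_score(route_completion, infraction_penalty):
    """Mean over routes of completion (0-100) times penalty multiplier (0-1]."""
    products = [r * p for r, p in zip(route_completion, infraction_penalty)]
    return sum(products) / len(products)

def success_rate(successes):
    """Percentage of routes flagged as completed without critical failure."""
    return 100.0 * sum(successes) / len(successes)

# Hypothetical results for four routes.
completion = [100.0, 100.0, 50.0, 100.0]  # percent of each route completed
penalty = [1.0, 0.6, 1.0, 1.0]            # 1.0 means no infractions on that route
succeeded = [True, False, False, True]

ds = driving_score(completion, penalty)
sr = success_rate(succeeded)
```

Here the second route is fully completed but still fails because of an infraction, which is why the Driving Score and the Success Rate can diverge.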

Key Findings

Comparison with Baselines (Table I):
- Using Privileged Inputs (perfect scene representation): CAPS achieved a Driving Score of 68.91 and a Success Rate of 56.97%, outperforming Anchor-based clustering (62.60 / 51.83%) and End-point clustering (59.63 / 48.23%), where each pair reads Driving Score / Success Rate.
- Using Sensor Inputs (camera-based): CAPS achieved 66.76 Driving Score and 52.87% Success Rate, again outperforming all rule-based and Prioritized Experience Replay (PER) baselines.
Ablation Study (Table II):
- Removing agent or map context during clustering degraded performance significantly.
- CAPS reduced Average Completion Time by 32% compared to models without agent/map context, suggesting that context matters for efficient decision-making as well as for safety.
Qualitative Analysis:
- Visual inspection of clusters (Fig. 2) confirmed that CAPS groups semantically similar scenarios (e.g., "parking cut-ins" or "waiting behind obstacles") even if they occur in different locations.
- Temporal analysis (Fig. 3) showed that the VQ-VAE embedding space captures critical transitions (e.g., sudden lane changes due to congestion) via sudden jumps in codebook IDs.
Comparison with SOTA (Table III):
- CAPS outperformed other learning-based planners (e.g., UniAD, VAD, TCP-traj) with similar computational budgets.
- While still below a manually tuned "Expert" planner, CAPS closed the gap significantly compared to standard IL with uniform sampling.
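
The temporal analysis mentioned under Qualitative Analysis (sudden jumps in codebook IDs) can be approximated with a simple change-point check. This is a sketch, not the authors' analysis code: it assumes one Cluster ID per frame and flags every frame whose ID differs from the previous frame's. The function name and the ID sequence are hypothetical.

```python
import numpy as np

def code_transitions(code_ids) -> np.ndarray:
    """Return the frame indices t at which the codebook ID differs from frame t - 1."""
    code_ids = np.asarray(code_ids)
    # Compare each frame with its predecessor; shift by 1 to index the later frame.
    return np.flatnonzero(code_ids[1:] != code_ids[:-1]) + 1

# Hypothetical per-frame Cluster IDs for one driving clip.
ids_over_time = [12, 12, 12, 45, 45, 12]
transition_frames = code_transitions(ids_over_time)
print(transition_frames)  # [3 5]
```

Frames flagged this way are candidates for the kind of critical transition the summary describes, such as a sudden lane change, though deciding which jumps are meaningful would still require inspecting the scenes.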

5. Significance and Impact

Data Efficiency: The results suggest that a model trained with re-balanced sampling (prioritizing rare events) can generalize better than one trained with uniform sampling over the same dataset.
Scalability: The approach eliminates the need for human annotators to label edge cases, making it scalable for large fleets where terabytes of data are generated daily.
Safety: By explicitly prioritizing rare, high-risk scenarios during training, the framework directly addresses the safety gap in autonomous driving, reducing the likelihood of catastrophic failures in closed-loop environments.
Future Application: The authors suggest this framework can also be applied during the data collection phase to selectively store high-value driving experiences, optimizing storage and processing costs.

In summary, CAPS provides a robust, automated solution to the data imbalance problem in autonomous driving by leveraging deep representation learning to identify and prioritize the most critical driving scenarios for training.