UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing

This paper presents a UNet-based neural network, trained on a large custom-labeled dataset, that achieves accurate real-time 3D cone localization and color prediction. It outperforms traditional methods and is validated within an end-to-end autonomous racing system.

Mariia Baidachna, James Carty, Aidan Ferguson, Joseph Agrane, Varad Kulkarni, Aubrey Agub, Michael Baxendale, Aaron David, Rachel Horton, Elliott Atkinson

Published 2026-02-26

Imagine you are teaching a robot car how to race around a track made entirely of traffic cones. The goal is simple: drive fast without hitting the cones. But for the car, this is incredibly hard. The cones are small, they get dirty, the lighting changes, and the car is moving at high speeds. If the car can't see exactly where the cones are, it will crash or drive too slowly.

This paper is about building a "super-eye" for that robot car to spot the cones perfectly. Here is the story of how they did it, explained simply.

1. The Problem: The "Needle in a Haystack"

In a normal race, the track is painted on the ground. In autonomous racing (like Formula Student), the track is defined by blue cones on the left and yellow cones on the right.

The car needs to know two things instantly:

  1. Where the cone is in 3D space (how far away and to the side).
  2. What color it is (blue or yellow).

Old methods were like trying to find a specific grain of sand on a beach using a magnifying glass. They worked okay in perfect weather, but if a cone was muddy, cracked, or the sun was glaring, the computer got confused. Newer methods using deep learning were better but often too slow to run on the car's computer while driving fast.

2. The Solution: A "Digital X-Ray" (The UNet)

The researchers built a new AI model based on something called a UNet.

Think of a standard camera as just taking a photo. A UNet is like a digital X-ray that doesn't just see the cone; it sees the skeleton of the cone.

  • Instead of just saying "There is a cone here," the AI looks at the cone and marks six specific points on it (like the top corners, the bottom corners, and the middle of the stripe).
  • Imagine you are drawing a stick figure on a cone. The AI learns to draw that stick figure perfectly, even if the cone is dirty or half-hidden.
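Keypoint networks like this typically output one small "heat" image per keypoint, and the hottest pixel in each image is taken as that keypoint's location. Here is a minimal sketch of that decoding step (the function name and the peak-picking approach are illustrative assumptions, not the paper's exact head design):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Recover (x, y) pixel coordinates from per-keypoint heatmaps.

    heatmaps: array of shape (6, H, W) -- one channel per keypoint,
    where each channel peaks at its keypoint's predicted location.
    This is an illustrative sketch: it simply takes the argmax of
    each channel, with no sub-pixel refinement.
    """
    coords = []
    for hm in heatmaps:
        # np.argmax flattens the 2D map; unravel_index restores (row, col)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords
```

In practice, sub-pixel refinement (e.g. fitting a small parabola around the peak) is often layered on top of this argmax step, since keypoint precision directly drives the 3D accuracy described below.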

3. The Secret Sauce: A Massive Training Library

To teach this AI, you need a lot of practice. The researchers created the largest dataset of its kind:

  • They took 25,000 photos of cones from every angle imaginable.
  • They manually labeled the "stick figure" points on every single cone.
  • They used this massive library to train the AI until it became an expert at spotting those six points, no matter how messy the cone looked.

4. How It Works in Real Life (The 3D Magic)

Once the AI spots the six points on the cone in the camera image, how does it know how far away the cone is?

The car has two cameras (like human eyes).

  1. The AI finds the six points on the cone in the left camera and the six points in the right camera.
  2. It measures the tiny difference between where the points appear in the left eye versus the right eye (this is called disparity).
  3. Just like your brain uses the difference between your two eyes to judge depth, the car uses this math to calculate exactly how far away the cone is in 3D space.

Because the AI is so good at finding those six points, the 3D calculation is incredibly accurate.
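The depth step above is standard pinhole stereo geometry: depth is focal length times camera baseline divided by disparity. A minimal sketch, assuming a calibrated stereo rig (the focal length and baseline values below are hypothetical, not taken from the paper):

```python
def depth_from_disparity(focal_px, baseline_m, x_left, x_right):
    """Classic stereo triangulation: Z = f * B / d.

    focal_px   -- camera focal length in pixels
    baseline_m -- distance between the two cameras in meters
    x_left     -- horizontal pixel position of a keypoint in the left image
    x_right    -- horizontal pixel position of the same keypoint on the right

    The disparity d = x_left - x_right shrinks as the cone gets
    farther away, so small keypoint errors matter most at range.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("a visible point must sit further left in the left image")
    return focal_px * baseline_m / disparity
```

For example, with an 800-pixel focal length, a 12 cm baseline, and an 8-pixel disparity, the cone sits at 800 * 0.12 / 8 = 12 meters. Averaging the depth over all six matched keypoints, rather than relying on one, is one way such systems damp per-point noise.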

5. The Bonus: Reading the Color

Since the AI marks the specific "stripe" band on the cone, it can also easily check whether that band is white or black, which helps confirm whether the cone is blue or yellow (in Formula Student, blue cones carry a white stripe and yellow cones a black one). The AI doesn't just see a blob of color; it reads the pattern on the cone.

6. The Results: Fast and Accurate

The team tested this on a real racing car simulator and on the actual car hardware.

  • Accuracy: It was much better than previous methods. It made fewer mistakes, even when cones were dirty or partially hidden.
  • Speed: They were worried the AI would be too slow for a racing car. They found that while it did use a little more computer power, it was still fast enough to run in real-time. It's like adding a turbocharger to the car's brain; it uses a bit more fuel (electricity), but the car drives much safer and faster.

The Big Picture

This paper shows that by teaching a robot to look at the specific details of an object (the key points) rather than just the general shape, we can make autonomous cars much safer and faster.

The Analogy:
Imagine playing a game of "Pin the Tail on the Donkey" while blindfolded, but you have a friend whispering exactly where the tail should go.

  • Old methods: The friend guesses vaguely ("It's somewhere near the middle").
  • This new method: The friend says, "It's exactly 2 inches left and 1 inch up."
  • The result: The car doesn't just guess where the track is; it knows exactly where the track is, allowing it to race at top speed without crashing.

This technology is a big step forward for making self-driving race cars that can compete with human drivers.
