Imagine you are driving a wheelchair through a busy, chaotic campus. You need to see pedestrians, cars, and cyclists, guess where they are going next, and avoid crashing—all while your "brain" (the computer) is small, cheap, and has limited battery power.
This paper presents a new "super-brain" for service robots that does exactly that. It's a smart system that combines LiDAR (a laser scanner that sees shapes and distances) with cameras (which see colors and details) to understand the 3D world in real time.
Here is the breakdown of their invention, explained with everyday analogies:
1. The Big Picture: The "Two-Eye" System
Most robots use either lasers or cameras, but both have flaws. Lasers are great at measuring distance but blind to color; cameras are great at seeing details but bad at judging depth.
- The Old Way: Trying to glue these two together was like trying to mix oil and water. It was either too messy (inaccurate) or too heavy (slow), causing the robot to freeze up.
- The New Way: The authors built a system that acts like a bilingual translator. It doesn't just force the laser data and camera data to sit next to each other; it teaches them to speak to each other efficiently, creating a single, clear picture of the world.
2. The Two Main Characters (The Models)
The system is made of two main "actors" that work together:
Actor A: The Detective (UniMT)
- Job: To spot objects (people, cars, bikes) and draw 3D boxes around them instantly.
- The Problem: Previous methods were like trying to find a needle in a haystack by looking at the whole haystack at once. Transformer-style attention compares every piece of data against every other piece, so the cost grows quadratically as the scene gets busier. It took too long and used too much energy.
- The Solution: They used a new technology called Mamba. Think of Mamba as a super-efficient librarian. Instead of reading every single book on the shelf to find one title, Mamba knows exactly which section to look in and skips the rest.
- It uses a "soft" fusion method. Imagine the laser and camera data are two people describing a scene. Instead of shouting over each other (rigid fusion), they whisper to each other, combining their stories smoothly without losing any details.
- Result: It finds objects faster and more accurately than previous "heavy" systems, even on a cheap computer chip.
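The "whispering" idea above can be made concrete with a tiny sketch. This is not the paper's actual UniMT architecture (where the blending weights come from a learned network); it is a minimal illustration, with a made-up sigmoid gate, of what "soft" fusion means compared with rigidly overwriting one sensor with the other:

```python
import numpy as np

def soft_fuse(lidar_feat, cam_feat):
    """Illustrative 'soft' fusion: blend two feature vectors with a
    per-feature gate instead of hard concatenation or overwriting.
    In a real model the gate would be learned; here it is a fixed
    sigmoid of the feature difference, purely for demonstration."""
    # Gate in (0, 1) decides, feature by feature, how much weight
    # each sensor's evidence receives in the fused result.
    gate = 1.0 / (1.0 + np.exp(-(lidar_feat - cam_feat)))  # sigmoid
    return gate * lidar_feat + (1.0 - gate) * cam_feat

lidar = np.array([0.9, 0.1, 0.5])   # e.g., strong depth cues
cam   = np.array([0.2, 0.8, 0.5])   # e.g., strong texture cues
fused = soft_fuse(lidar, cam)
```

The key property is that nothing is discarded: every fused value stays between the two sensors' readings, leaning toward whichever sensor gave the stronger signal.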
Actor B: The Fortune Teller (RTMCT)
- Job: To guess where the detected objects will go in the next few seconds.
- The Problem: Predicting the future is hard. A pedestrian might stop, turn left, or run. Old methods used complex "generative" models (like a chaotic artist trying to paint every possible future), which were slow and often produced weird, unrealistic predictions.
- The Solution: They created a system based on Reference Trajectories.
- Imagine you are playing a game of "Guess the Path." Instead of inventing a path from scratch, the robot has a menu of 49 pre-defined moves (e.g., "go straight," "turn sharp left," "stop").
- The robot looks at the person, checks the menu, and picks the best matching move. It doesn't need to "dream" up a new path; it just selects the most likely one from its list.
- Result: It predicts where people will go in a split second, handling different types of objects (cars vs. people) without getting confused.
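The "menu" idea is simple enough to sketch in a few lines. The names, the four-entry menu, and the dot-product scoring below are all illustrative assumptions (the paper's RTMCT uses 49 reference trajectories and a learned scoring network); the sketch only shows the core selection step of matching observed motion against pre-defined candidates:

```python
import numpy as np

# A toy "menu" of reference trajectories (hypothetical; the paper uses 49).
# Each entry is a sequence of (x, y) offsets over future timesteps.
MENU = {
    "straight": np.array([[0, 1], [0, 2], [0, 3]]),
    "left":     np.array([[-1, 1], [-2, 2], [-3, 2]]),
    "right":    np.array([[1, 1], [2, 2], [3, 2]]),
    "stop":     np.array([[0, 0], [0, 0], [0, 0]]),
}

def pick_reference(observed_heading):
    """Score every menu entry against the observed heading and return
    the best match. A real model would predict scores with a network
    and then refine the winner; this is a nearest-match sketch."""
    best_name, best_score = None, -np.inf
    for name, traj in MENU.items():
        # Score: how well the trajectory's first step aligns
        # with the direction the person is currently moving.
        score = float(np.dot(traj[0], observed_heading))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# A pedestrian currently moving forward and slightly left:
choice = pick_reference(np.array([-0.3, 1.0]))
```

Because the robot only scores a fixed list instead of generating paths from scratch, the cost per object is small and constant, which is what makes split-second prediction possible.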
3. The "Glue": The Tracker
Between spotting the object and predicting its future, the robot needs to know that "Pedestrian #1" at second 1 is the same person at second 2.
- They used a lightweight tracker called SimpleTrack.
- The Analogy: Think of this as a sticky note. When the robot spots a person, it sticks a note on them. As they move, the robot just updates the note's position.
- The Upgrade: The authors made this sticky-note system run on the robot's graphics card (GPU) instead of its main processor. This made it 11 times faster, ensuring the robot never gets "distracted" by the tracking process.
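The sticky-note analogy boils down to ID association across frames. SimpleTrack itself is more sophisticated (3D overlap tests and motion models, run on the GPU); the sketch below only shows the core idea of carrying an ID from one frame to the next by matching each detection to its nearest existing track:

```python
import math

def associate(tracks, detections, max_dist=2.0):
    """Minimal 'sticky note' association: match each new detection to
    the nearest unclaimed track within max_dist; unmatched detections
    get a fresh ID. Illustrative only, not SimpleTrack's actual logic."""
    next_id = max(tracks, default=-1) + 1
    updated = {}
    for det in detections:
        best_id, best_d = None, max_dist
        for tid, pos in tracks.items():
            if tid in updated:       # this note is already claimed
                continue
            d = math.dist(pos, det)
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is None:          # no match: start a new track
            best_id, next_id = next_id, next_id + 1
        updated[best_id] = det       # move the sticky note
    return updated

frame1 = {0: (0.0, 0.0), 1: (5.0, 5.0)}   # two tracked pedestrians
frame2 = associate(frame1, [(0.4, 0.3), (5.2, 4.9), (9.0, 9.0)])
```

Here the first two detections inherit IDs 0 and 1 from the previous frame, while the third detection, too far from any existing track, is assigned the new ID 2.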
4. The Real-World Test: The Wheelchair Robot
The authors didn't just test this on a supercomputer in a lab. They put it on a real wheelchair robot with a modest consumer graphics card (an NVIDIA RTX 3060, the kind found in gaming laptops).
- The Challenge: The robot had to navigate a real campus with real people, bad lighting, and different sensors than the training data.
- The Result: The system ran smoothly at 13.9 frames per second.
- Translation: The robot's "eyes" refreshed about 14 times every second, seeing and reacting to the world fast enough to avoid collisions. It was fast enough to be safe, but light enough to run on a budget robot.
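The safety claim can be grounded with a little arithmetic. At 13.9 frames per second, each frame gets roughly a 72 ms processing budget; the 1.5 m/s wheelchair speed below is our own assumption (a typical walking pace, not a figure from the paper) used to show how far the robot travels between updates:

```python
fps = 13.9                      # measured end-to-end throughput
budget_ms = 1000.0 / fps        # time available per frame, in ms
speed_mps = 1.5                 # assumed wheelchair speed (walking pace)
travel_m = speed_mps / fps      # distance covered between two frames
```

Under that assumption the robot moves only about 11 cm between consecutive perception updates, which is why ~14 fps is treated as fast enough for safe navigation among pedestrians.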
Why This Matters
This paper solves a major problem in robotics: How do we make robots smart enough to navigate a busy world without needing a supercomputer the size of a fridge?
By using efficient "Mamba" technology for detection and a smart "menu-based" approach for prediction, they created a system that is:
- Fast: It reacts in real-time.
- Lightweight: It runs on cheap hardware.
- Accurate: It sees better and guesses future paths more reliably than older methods.
In short, they gave a small, resource-limited robot the "street smarts" it needs to safely share the road with humans.