Person Detection and Tracking from an Overhead Crane LiDAR

Imagine a busy factory floor where giant cranes are moving heavy materials back and forth. In the middle of all this machinery, human workers are walking around. The big question is: How do we make sure the crane doesn't accidentally bump into a person?

Usually, we use cameras for this. But cameras have problems: they get blinded by bright lights, they can't see through smoke or dust, and they invade people's privacy because they capture faces.

This paper proposes a different solution: a "super-sense" laser scanner (LiDAR) mounted on the crane and looking down at the floor.

Here is the story of what the researchers did, explained simply:

1. The Problem: The "Upside-Down" View

Think of a self-driving car. It has sensors on the front that look at the world like a human does: straight ahead. It sees people as tall, thin shapes.

Now, imagine hanging a sensor on the ceiling of a factory. It looks straight down.

The Analogy: If you look at a person from the front, you see their face and chest. If you look at them from directly above (like a drone), you only see the top of their head and their shoulders. They look like a small, flat circle.
The Challenge: The computer programs (AI) that are good at spotting people from cars are trained to look at people from the front. If you give them a "top-down" view, they get confused. It's like trying to recognize a friend by looking only at the top of their head; you might mistake a hat for a person!

2. The Solution: Building a New "Library"

Since there were no existing "top-down" laser scans of people to teach the AI, the researchers had to create their own.

The Dataset: They mounted a laser scanner on a real factory crane. They had three participants walk around, wave, and move in different ways, and then used software to draw 3D boxes around these people in the laser data.
The Result: They created a special "textbook" (dataset) specifically for teaching computers how to see people from the ceiling.

3. The Training: Trying Different "Eyes"

The researchers took five different AI models (the "eyes" of the system) that were originally trained for cars and tried to teach them to see people from the ceiling.

The Analogy: Imagine you have five different chefs who are experts at making Italian pasta. You want them to make sushi. You can't just give them the sushi; you have to show them the ingredients and teach them the new technique. This is called Transfer Learning.
The Winners: Two of the chefs (called VoxelNeXt and SECOND) learned the fastest and became the best sushi chefs.
- VoxelNeXt was amazing at spotting people who were close to the crane (within 3 meters).
- SECOND was the most reliable at spotting people who were further away, even when the laser signal got a bit fuzzy.

4. The Tracking: The "Name Tag" System

Detecting a person once is easy. But what if they walk behind a pillar and come back out? The system needs to know, "That's still John, not a new person."

The Analogy: Imagine a bouncer at a club. He sees a person enter (Detection). He then follows them with his eyes, making sure he doesn't lose track of them even if they move behind a pillar (Tracking).
The researchers used two lightweight "bouncers" (AB3DMOT and SimpleTrack). They found that if the "eyes" (the detector) are good, the "bouncer" (the tracker) does a great job. If the eyes are bad, the bouncer gets confused.

5. The Results: How Good Is It?

Near the Crane: When a person is within 1 meter of the sensor (measured along the floor), the system is almost perfect, with an average precision of 0.97. It's like having a hawk's eye.
Further Away: As you move further out (up to 5 meters), the score drops a bit but stays high (0.84).
Speed: The system is fast enough to run in real time on a small embedded computer (an NVIDIA Jetson Orin NX). It doesn't need a massive supercomputer.

Why Does This Matter?

This paper is a big step forward for industrial safety.

Privacy: It doesn't take photos of faces, so workers don't feel like they are being watched by a camera.
Reliability: It works in the dark, in dust, and in bright sunlight.
Open Source: The researchers shared their "textbook" (dataset) and their "recipes" (code) for free. This means other factories can now build their own safety systems without starting from scratch.

In a nutshell: The researchers taught a computer to look down from a crane and spot humans with laser precision, creating a safety net that is fast, private, and works in the toughest factory conditions.

Here is a detailed technical summary of the paper "Person Detection and Tracking from an Overhead Crane LiDAR".

1. Problem Statement

The paper addresses the critical safety challenge of detecting and tracking human workers in industrial indoor environments (specifically factories and warehouses) where they operate in close proximity to heavy machinery, such as overhead cranes.

Domain Shift: Existing LiDAR-based person detection models are predominantly trained on autonomous driving datasets (vehicle-centric, frontal views). These models suffer from a significant domain shift when applied to overhead viewpoints, where point density distributions, occlusion patterns, and geometric perspectives differ drastically.
Data Scarcity: There is a lack of public, annotated 3D LiDAR datasets specifically for overhead-view person detection and tracking in indoor industrial settings.
Technical Challenges: Overhead LiDAR (OLiDAR) faces issues with sparse point clouds on small targets (humans) and varying point density based on range and incidence angle.

2. Methodology

A. System Architecture

The proposed system consists of a two-stage pipeline:

Detection Module: A deep learning-based 3D detector processes sequential point clouds to output per-frame human instances (oriented 3D bounding boxes).
Tracking Module: A lightweight "tracking-by-detection" module associates detections across time steps to maintain unique IDs and trajectories.
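
The two stages can be pictured as a simple loop: detect people in each frame, then hand the detections to a tracker that assigns IDs. The sketch below is illustrative only and is not the authors' code. `Detection`, `NearestCentroidTracker`, `run_pipeline`, the 0.8 m gate, and the 0.5 score cut-off are all hypothetical, and the nearest-centre matching is a stand-in for the Kalman-filter trackers described later.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    x: float      # box centre in metres, sensor frame (assumed convention)
    y: float
    z: float
    score: float  # detector confidence in [0, 1]


class NearestCentroidTracker:
    """Toy tracker: reuse the ID of the closest previous track within a gate."""

    def __init__(self, gate_m=0.8):
        self.gate_m = gate_m
        self.tracks = {}   # track_id -> last known (x, y)
        self.next_id = 0

    def update(self, detections):
        unused = dict(self.tracks)  # tracks not yet claimed in this frame
        new_tracks, output = {}, []
        for det in detections:
            best_id, best_d = None, self.gate_m
            for tid, (tx, ty) in unused.items():
                d = math.hypot(det.x - tx, det.y - ty)
                if d <= best_d:
                    best_id, best_d = tid, d
            if best_id is None:          # nothing close enough: start a new track
                best_id = self.next_id
                self.next_id += 1
            else:
                del unused[best_id]
            new_tracks[best_id] = (det.x, det.y)
            output.append((best_id, det))
        self.tracks = new_tracks         # unmatched tracks are dropped immediately
        return output


def run_pipeline(frames, detector, tracker, min_score=0.5):
    """Stage 1: per-frame detection. Stage 2: ID association over time."""
    for points in frames:
        detections = [d for d in detector(points) if d.score >= min_score]
        yield tracker.update(detections)
```

In a real deployment `detector` would wrap one of the fine-tuned 3D networks; here any callable that maps a frame to a list of `Detection` objects will do.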

B. Dataset Creation

Data Collection: A site-specific dataset was curated using a 32-channel RS-Bpearl LiDAR mounted on an overhead crane at a height of 2.94 meters.
Annotations: The dataset includes manually annotated 3D bounding boxes for human targets using labelCloud.
Scale: The dataset is intentionally small to minimize labeling overhead but sufficient for transfer learning:
- Training/Validation: 29 annotated frames (3 participants).
- Testing: 76 annotated frames (10 new participants) to test generalization.
- Tracking Evaluation: 80 frames across multiple clips with pseudo-ground truth IDs generated by the trackers themselves (due to lack of manual ID labeling).

C. Model Adaptation & Selection

The authors evaluated five candidate 3D detection architectures, adapting them via transfer learning (fine-tuning from pretrained weights on KITTI/nuScenes) to the overhead viewpoint:

PointPillars (PP): Converts points to vertical pillars for 2D CNN processing.
SECOND: Uses sparse 3D convolutions on voxelized data.
PV-RCNN: A two-stage approach combining voxel and point features.
VoxelNeXt: A fully sparse pipeline avoiding dense conversions.
Voxel RCNN: A two-stage approach using voxel features for refinement.
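
To make the pillar idea concrete, here is a minimal sketch of pillar-style binning in plain Python. It is not taken from PointPillars or from the paper's code; the function name `pillarize`, the 0.25 m cell size, and the ±5 m ranges are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def pillarize(points, cell_m=0.25, x_range=(-5.0, 5.0), y_range=(-5.0, 5.0)):
    """Group (x, y, z) points into vertical columns on a regular x-y grid.

    Returns a dict mapping (ix, iy) grid indices to the points in that pillar.
    Points outside the configured ranges are discarded.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = math.floor((x - x_range[0]) / cell_m)
        iy = math.floor((y - y_range[0]) / cell_m)
        pillars[(ix, iy)].append((x, y, z))  # z is kept as a feature, not binned
    return dict(pillars)
```

A voxel grid would, under the same assumptions, add a third index along z; the grouping step is otherwise analogous.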

Tracking Algorithms: Two lightweight, non-learning trackers were adapted:

AB3DMOT: Uses Kalman Filtering (KF) with Mahalanobis distance and IoU for association.
SimpleTrack: Relies on BEV geometric IoU overlap for association with KF motion prediction.
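
The association step can be sketched as follows. This is a simplified illustration rather than either tracker's implementation: boxes are treated as axis-aligned in the bird's-eye view (the paper uses oriented boxes), matching is greedy for brevity (an optimal assignment such as the Hungarian algorithm is a common alternative), and the names `bev_iou`, `greedy_associate`, and the default threshold are hypothetical.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (cx, cy, length, width)."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))  # overlap along x
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))  # overlap along y
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(predicted_tracks, detections, iou_min=0.1):
    """Pair predicted track boxes with detections, highest IoU first."""
    candidates = sorted(
        ((bev_iou(t, d), i, j)
         for i, t in enumerate(predicted_tracks)
         for j, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for iou, i, j in candidates:
        if iou < iou_min:
            break                      # remaining pairs overlap too little
        if i in used_t or j in used_d:
            continue                   # each track and detection is used once
        matches.append((i, j))
        used_t.add(i)
        used_d.add(j)
    unmatched_t = [i for i in range(len(predicted_tracks)) if i not in used_t]
    unmatched_d = [j for j in range(len(detections)) if j not in used_d]
    return matches, unmatched_t, unmatched_d
```

In a full tracker the `predicted_tracks` boxes would come from the Kalman filter's motion prediction; unmatched detections would start new tracks and unmatched tracks would age out.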

D. Evaluation Protocol

Distance-Sliced Evaluation: Performance was analyzed as a function of horizontal radial distance $r$ from the LiDAR (1.0 m to 5.0 m) to define the practical operating envelope.
Metrics: Precision, Recall, F1-score, Average Precision (AP), Mean IoU (mIoU), and latency (inference time).
Tracking Metrics: MOTA (accuracy), MOTP (localization precision), and IDF1 (ID consistency).
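
A distance-sliced evaluation can be sketched in a few lines. The example below is a hedged illustration, not the paper's evaluation script: it assumes the sensor sits at the origin, treats each person as an (x, y) centre, and matches detections to ground truth by centre distance as a simple stand-in for IoU-based matching. The names `prf1`, `evaluate_slice`, and the 0.5 m matching tolerance are hypothetical.

```python
import math


def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate_slice(gt_centres, det_centres, r_max, match_dist=0.5):
    """Score only the objects whose horizontal distance r = hypot(x, y) <= r_max."""
    gts = [g for g in gt_centres if math.hypot(g[0], g[1]) <= r_max]
    dets = [d for d in det_centres if math.hypot(d[0], d[1]) <= r_max]
    unmatched = list(gts)
    tp = 0
    for d in dets:
        # First still-unmatched ground truth within the tolerance, if any.
        hit = next((g for g in unmatched if math.dist(d, g) <= match_dist), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(dets) - tp       # detections with no ground-truth partner
    fn = len(unmatched)       # ground-truth people that were missed
    return prf1(tp, fp, fn)
```

Calling `evaluate_slice` repeatedly with `r_max` set to 1.0, 2.0, and so on up to 5.0 gives one row of scores per working radius.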

3. Key Contributions

Site-Specific Dataset: Creation and release of the first annotated 3D LiDAR dataset for overhead-view person detection in an industrial crane environment.
Comprehensive Benchmark: A unified evaluation of five state-of-the-art 3D detectors adapted for overhead sensing, highlighting the necessity of transfer learning.
Distance-Sliced Analysis: A novel evaluation method quantifying detection feasibility across different working radii, revealing how performance degrades with distance and point sparsity.
Real-Time Feasibility: Demonstration that lightweight tracking-by-detection pipelines can operate in real-time on edge hardware (Jetson Orin NX) with low latency.
Open Source: Public release of the dataset, code, and trained models on GitHub.

4. Results

Detection Performance

Top Performers: VoxelNeXt and SECOND emerged as the most reliable backbones.
- VoxelNeXt excelled in the near field (< 3.0m), achieving an AP of 0.97 at 1.0m and 0.84 at 5.0m.
- SECOND demonstrated superior robustness at larger distances (> 3.0m) where point density decreases.
Impact of Transfer Learning: Pretrained models (without fine-tuning) performed significantly worse (e.g., SECOND AP dropped from 0.84 to ~0.47 at 5.0m), confirming the large domain gap between driving datasets and overhead sensing.
Latency: All adapted detectors (except PV-RCNN) achieved inference speeds suitable for real-time applications (p50 latency of 32–46 ms on a CPU).
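
The p50 figure is the median per-frame inference time. A minimal way to measure it, assuming a hypothetical `infer` callable and a list of frames (neither comes from the paper's code), might look like this:

```python
import statistics
import time


def measure_latency_ms(infer, frames, warmup=2):
    """Time each call to infer(frame) and return the samples in milliseconds."""
    for frame in frames[:warmup]:
        infer(frame)                   # warm-up runs are not recorded
    samples = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50(samples):
    """Median latency: half of the frames finish at or below this value."""
    return statistics.median(samples)
```

The median is less sensitive than the mean to occasional slow frames, which is one reason it is often reported for real-time systems.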

Tracking Performance

Detector Dependency: Tracking quality (MOTA, IDF1) was heavily dependent on the upstream detector. Using VoxelNeXt with AB3DMOT yielded the best results (MOTA: 0.70 at IoU=0.3; 0.83 at IoU=0.1).
Tracker Comparison: AB3DMOT was significantly faster (1.08ms) than SimpleTrack (6.30ms) while maintaining comparable accuracy.
Localization: While ID tracking was effective, precise 3D localization (MOTP) remained challenging (0.53–0.58 mIoU), likely due to the difficulty of fitting 3D boxes to sparse overhead point clouds.
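
For reference, the commonly used definitions of these tracking metrics are given below. They are the standard CLEAR MOT and identity-metric forms, assumed rather than confirmed to match the paper's exact variants. Here $\mathrm{FN}_t$, $\mathrm{FP}_t$, $\mathrm{IDSW}_t$, and $\mathrm{GT}_t$ are the missed targets, false alarms, identity switches, and ground-truth objects in frame $t$; $c_t$ is the number of matched pairs in frame $t$; and $\mathrm{IoU}_{t,i}$ is the overlap of matched pair $i$. MOTP is written as a mean IoU because the values above are reported as mIoU.

$$
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} \mathrm{IoU}_{t,i}}{\sum_t c_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
$$

Under these definitions, loosening the IoU matching threshold turns some near-misses into matches, which lowers both the FN and FP counts; that is consistent with MOTA being higher at IoU=0.1 than at IoU=0.3.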

5. Significance and Conclusion

This paper bridges the gap between standard autonomous driving perception and industrial safety monitoring. It shows that overhead LiDAR is a viable, privacy-preserving, and lighting-independent option for human detection in factories.

Practical Implication: The study defines a practical operating envelope (up to 5.0m) for such systems, showing that with proper model adaptation (specifically VoxelNeXt or SECOND), high-precision detection is achievable.
Future Work: The authors suggest expanding the dataset size and diversity, extending the evaluation range beyond 5.0 m, and refining tracking ground truth with manual ID validation to further optimize the system for dynamic industrial environments.

The work provides a foundational benchmark and open resources to accelerate research in industrial safety automation and overhead sensing.