LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration

LiREC-Net is a target-free, learning-based network that unifies the calibration of LiDAR, RGB, and event sensors within a single framework by leveraging a shared LiDAR representation to achieve high-accuracy multi-sensor alignment in natural driving scenes.

Aditya Ranjan Dash, Ramy Battrawy, René Schuster, Didier Stricker

Published 2026-02-26

Imagine you are trying to build a perfect 3D map of the world for a self-driving car. To do this, the car uses three different "eyes":

  1. LiDAR: A laser scanner that sees the world as a cloud of 3D dots (like a digital pointillist painting).
  2. RGB Camera: A standard video camera that sees colors and textures (like your own eyes).
  3. Event Camera: A super-fast camera that only sees changes in light (like a high-speed shutter that only snaps when something moves).

The Problem: The "Drunk" Sensors
Even if you bolt these three sensors onto a car perfectly, over time, bumps in the road, temperature changes, or vibrations can knock them slightly out of alignment. It's like wearing three pairs of glasses that are slightly crooked relative to each other. When the car tries to merge the data, the laser dots don't line up with the video pixels, and the "movement" detected by the event camera is in the wrong spot.

Traditionally, engineers fix this by parking the car in a garage and pointing it at a giant checkerboard pattern. This is accurate, but it's slow, expensive, and impossible to do while the car is driving down the highway.

The Solution: LiREC-Net (The "Super-Translator")
The paper introduces LiREC-Net, a smart AI system that can fix these misalignments instantly while the car is driving, without needing any special checkerboards. Think of it as a universal translator that learns to speak "Laser," "Video," and "Motion" simultaneously.

Here is how it works, using some creative analogies:

1. The Shared Brain (The "Shared LiDAR Representation")

Most previous AI systems were like having two separate students: one studying how to match Lasers to Video, and another studying how to match Lasers to Motion. They didn't talk to each other, so they wasted energy and sometimes gave conflicting advice.

LiREC-Net is different. It has one shared brain for the LiDAR data.

  • The Analogy: Imagine a chef who needs to make two different dishes (a Laser-Video stew and a Laser-Motion soup). Instead of buying two separate sets of knives and cutting boards, this chef uses one high-quality cutting board to prep the main ingredient (the LiDAR data) for both dishes.
  • How it works: The AI looks at the LiDAR data in two ways at once: as raw 3D points (the shape) and as a projected 2D depth map (the picture). It fuses these two views together. This ensures the "shape" and the "picture" of the laser scan agree with each other before it tries to match them to the cameras. This saves time and makes the result more consistent.
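The 2D "picture" view comes from projecting the 3D laser points into an image plane. Here is a minimal sketch of that projection step using a standard pinhole camera model; the function name and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative conventions, not details from the paper:

```python
import numpy as np

def project_to_depth_map(points, fx, fy, cx, cy, h, w):
    """Project 3D LiDAR points (N, 3) in camera coordinates into a
    sparse 2D depth map of shape (h, w)."""
    depth = np.zeros((h, w), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    valid = z > 0                                   # keep points in front of the camera
    u = (fx * x[valid] / z[valid] + cx).astype(int)  # pixel column
    v = (fy * y[valid] / z[valid] + cy).astype(int)  # pixel row
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # For simplicity the last point written to a pixel wins;
    # a real pipeline would keep the nearest depth per pixel.
    depth[v[inside], u[inside]] = z[valid][inside]
    return depth
```

In the network itself, learned features from a depth map like this are fused with features from the raw 3D points, so both views of the same scan agree before matching begins.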

2. The "Cost Volume" (The "Puzzle Solver")

Once the AI has processed the data, it needs to figure out exactly how much to rotate or shift the laser scan to make it fit the camera image.

  • The Analogy: Imagine you have a transparent sheet with a laser drawing on it, and you need to slide it over a photograph to make the lines match. You don't just guess; you try sliding it a tiny bit left, a tiny bit right, up, and down, checking every single spot to see where the lines overlap best.
  • How it works: The AI builds a "Cost Volume." This is a giant 3D grid that calculates a "match score" for every possible tiny shift and rotation. It's like a super-fast puzzle solver that checks millions of possibilities in a split second to find the best alignment.
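A toy version of that exhaustive search can be written as a correlation over every small 2D pixel shift between two feature maps. This is only a sketch of the cost-volume idea: the real network correlates learned features and also covers rotations, and all names here are illustrative:

```python
import numpy as np

def build_cost_volume(feat_a, feat_b, max_shift):
    """Score every (dy, dx) shift of feat_b against feat_a by correlation.
    Returns a (2*max_shift+1, 2*max_shift+1) grid of match scores."""
    size = 2 * max_shift + 1
    cost = np.full((size, size), -np.inf)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(feat_b, dy, axis=0), dx, axis=1)
            cost[dy + max_shift, dx + max_shift] = np.sum(feat_a * shifted)
    return cost

def best_shift(cost, max_shift):
    """Read the winning (dy, dx) offset out of the cost volume."""
    idx = np.unravel_index(np.argmax(cost), cost.shape)
    return idx[0] - max_shift, idx[1] - max_shift
```

The grid of scores is the "cost volume": the entry with the highest score tells you exactly which shift lines the two views up best.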

3. The "Iterative Refinement" (The "Fine-Tuning")

The AI doesn't just guess the answer once. It uses a strategy called Iterative Refinement.

  • The Analogy: Imagine you are tuning a radio. First, you turn the dial roughly to the right station (a big correction). Then, you nudge it slightly to the left, then slightly to the right, until the static disappears and the music is crystal clear.
  • How it works: The system has multiple "stages." The first stage makes a big, rough correction to fix the major misalignment. The next stage takes that "almost right" result and makes a smaller, more precise adjustment. By the end, the alignment is highly accurate.
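The coarse-then-fine loop can be illustrated in one dimension: each stage keeps the best of three candidates around the current estimate, then halves its step size before the next stage. This is purely illustrative; in the actual network each stage is a learned module predicting a full 6-degree-of-freedom pose correction:

```python
def refine_iteratively(measure_error, init, stages=5, step=1.0):
    """Coarse-to-fine 1-D refinement: probe left/stay/right around the
    current estimate, keep the best fit, then shrink the step size."""
    est = init
    for _ in range(stages):
        candidates = [est - step, est, est + step]
        est = min(candidates, key=measure_error)  # keep the best candidate
        step *= 0.5                               # smaller correction next stage
    return est
```

Like tuning the radio, the first pass makes big jumps toward the station and the later passes make ever-finer nudges.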

Why This Matters

  • No More Checkerboards: You don't need to stop the car or set up special targets. The AI learns from the natural world (trees, buildings, other cars).
  • One System, Three Sensors: Instead of building three different AI models, this one model handles all three sensors at once. It's like having a single conductor leading an orchestra of three different instruments, rather than hiring three separate conductors.
  • Efficiency: Because it shares the "LiDAR brain," it runs faster and uses less computer power than previous methods.

The Result

The researchers tested this on real driving data (the KITTI and DSEC datasets). They found that LiREC-Net could align the sensors nearly as accurately as the old, slow, checkerboard methods, but it did so instantly while the car was moving. It successfully aligned the laser dots with the video pixels and the motion events, proving that a self-driving car can "fix its own glasses" while driving down the road.

In short: LiREC-Net is a smart, efficient AI that acts as a master aligner, ensuring a self-driving car's laser, video, and motion sensors all agree on where things are, all without ever needing to stop and set up a test pattern.
