Occlusion-Aware Multimodal Beam Prediction and Pose Estimation for mmWave V2I

This paper proposes an occlusion-aware, Transformer-based multimodal learning framework that fuses visual, LiDAR, radar, GNSS, and mmWave data to jointly predict beam indices, blockage probabilities, and vehicle poses for robust 6G V2I communication under dynamic blockage. The approach achieves high accuracy on the DeepSense 6G dataset.

Abidemi Orimogunje, Hyunwoo Park, Kyeong-Ju Cha, Igbafe Orikumhi, Sunwoo Kim, Dejan Vukobratovic

Published 2026-03-30

Imagine you are driving a self-driving car in a busy city. You need to do two things at the exact same time:

  1. Find your way (know exactly where you are on the map).
  2. Keep a super-fast internet connection (so you can talk to traffic lights, other cars, and the cloud).

The problem? Millimeter-wave (mmWave) internet is like a super-bright flashlight. It's incredibly fast, but if a truck, a pedestrian, or even a tree blocks the light, the connection dies instantly. Traditional systems try to "guess" where to point the flashlight by constantly scanning the air, which is slow and wastes energy.

This paper proposes a smarter solution: A "Super-Sense" Brain for the Car.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "Blind" Radio

Imagine trying to find a friend in a crowded, foggy stadium by only shouting their name and listening for a reply. If the crowd is too loud or someone blocks your view, you might shout in the wrong direction.

  • The Old Way: The car's radio system blindly scans 64 different directions (beams) to find the best signal. This is slow and wastes battery.
  • The Risk: If a bus suddenly pulls in front of the car, the radio doesn't know until the connection has already broken.
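The overhead gap is easy to see in a toy sketch (the signal curve and the "best" beam here are made up purely for illustration): the old way measures all 64 beams, while a predictor names one beam up front and measures only that one.

```python
def rss(beam):
    """Toy received-signal-strength curve that peaks at beam 42 (made up)."""
    return -abs(beam - 42)

# Old way: exhaustive sweep, i.e. 64 separate over-the-air measurements.
sweep = [rss(b) for b in range(64)]
best_beam = max(range(64), key=sweep.__getitem__)   # lands on beam 42

# New way: a predictor suggests one beam, so only that beam is measured.
predicted_beam = 42                                 # stand-in for a model output
confirmation = rss(predicted_beam)                  # a single measurement
```

Even in this cartoon, the predictive approach spends 1 measurement where the sweep spends 64, which is exactly the latency and energy saving the paper targets.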

2. The Solution: The "Multimodal" Detective

The authors built an AI system that acts like a detective with five different senses working together, inspired by how humans navigate (SLAM - Simultaneous Localization and Mapping).

Instead of just listening to the radio, the system looks at:

  • 👁️ Eyes (RGB Camera): It sees the street, cars, and buildings.
  • 📏 3D Ruler (LiDAR): It creates a precise 3D map of the surroundings, like a laser scanner.
  • 📡 Radar: It sees through fog and rain to detect moving objects.
  • 🌍 GPS (GNSS): It knows the car's rough location in the world.
  • 📶 Radio Memory: It remembers what the signal strength was just a second ago.

The Magic Ingredient: The system uses a Transformer (the same AI tech behind chatbots) to mix all these senses together. It's like a conductor in an orchestra, making sure the eyes, ears, and memory are playing the same song.
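As a rough sketch of that "conductor" idea (not the paper's actual architecture, and with toy feature vectors invented for illustration): each sensor's features become one token, and scaled dot-product self-attention lets every token weigh every other before they are mixed. A minimal pure-Python version:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fuse(tokens):
    """Single-head self-attention over modality tokens (Q = K = V = tokens)."""
    d = len(tokens[0])
    fused = []
    for q in tokens:
        # How much should this modality listen to each of the others?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Blend all modalities according to those attention weights.
        fused.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return fused

# One toy "token" (feature vector in a shared space) per sensor.
camera = [0.9, 0.1, 0.0, 0.2]
lidar  = [0.8, 0.2, 0.1, 0.1]
radar  = [0.1, 0.9, 0.3, 0.0]
gnss   = [0.0, 0.1, 0.9, 0.4]
beams  = [0.2, 0.0, 0.1, 0.9]

fused = attention_fuse([camera, lidar, radar, gnss, beams])
```

In a real Transformer the queries, keys, and values come from learned projections and there are multiple heads and layers, but the core mixing step is this weighted blend.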

3. What Does It Actually Do?

The AI predicts three things simultaneously:

  1. Where to point the flashlight: Instead of scanning 64 directions, it instantly guesses the one best direction to point the antenna to get the fastest internet.
  2. Is the path blocked? It predicts if a truck is about to block the signal before the signal actually drops.
  3. Where am I? It calculates the car's exact position on the street with high precision.
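A minimal sketch of how those three answers might fall out of the network's final layer (the function name and the toy numbers are hypothetical; the real heads are learned from data):

```python
import math

def predict_outputs(beam_logits, blockage_logit, pose_raw):
    """Turn raw head outputs into the three predictions."""
    # 1) Beam head: pick the single highest-scoring beam (no exhaustive sweep).
    best_beam = max(range(len(beam_logits)), key=beam_logits.__getitem__)
    # 2) Blockage head: squash one logit into a probability with a sigmoid.
    blockage_prob = 1 / (1 + math.exp(-blockage_logit))
    # 3) Pose head: the regressed position is used directly.
    x, y = pose_raw
    return best_beam, blockage_prob, (x, y)

# Toy head outputs for one frame: 64 beam scores, one blockage logit, one 2-D pose.
beam_logits = [0.0] * 64
beam_logits[37] = 3.2          # the network is most confident in beam 37
out = predict_outputs(beam_logits, blockage_logit=1.5, pose_raw=(12.4, -3.1))
# out -> beam 37, blockage probability ≈ 0.82, pose (12.4, -3.1)
```

Because all three heads sit on the same fused features, one forward pass yields the beam, the blockage warning, and the position together.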

4. The Results: Winning the Race

The team tested this on a real-world dataset (DeepSense 6G) that simulates a busy city street. Here is how their "Super-Sense" brain compared to using just one sense:

  • The "Camera-Only" Driver: Good at seeing, but sometimes gets confused by shadows or bad lighting. It got about 50% of the beam directions right.
  • The "Radio-Only" Driver: Terrible at guessing without looking. It only got 6% right.
  • The "Super-Sense" Driver (This Paper): By combining all senses, it got 51% of the beam directions right (beating the camera alone) and was much better at spotting blockages.

Why does this matter?

  • Speed: It keeps the internet connection stable, giving up almost no capacity (only 0.018 bits/s/Hz of spectral efficiency, which is practically nothing).
  • Safety: It knows where the car is within 1.33 meters (about 4 feet), which is much better than using just a camera (2.10 meters).
  • Efficiency: It doesn't need to waste time scanning 64 directions; it just points the flashlight where it needs to go.

The Bottom Line

Think of this technology as giving the car's internet connection eyes and a memory. Instead of blindly shouting into the void, the car looks at the road, remembers what happened a second ago, and instantly points its antenna in the perfect direction to keep the connection alive, even when the city gets chaotic.

This is a big step toward 6G, where your car won't just drive itself; it will stay perfectly connected to the world around it, no matter how many obstacles are in the way.