Structure-Aware Multimodal LLM Framework for Trustworthy Near-Field Beam Prediction

This paper proposes a structure-aware multimodal large language model framework that fuses GPS positions, RGB images, LiDAR data, and textual prompts, leveraging the model's emergent reasoning capabilities for deep environmental understanding to enable efficient and trustworthy near-field beam prediction in complex 3D XL-MIMO environments.

Mengyuan Li, Qianfan Lu, Jiachen Tian, Hongjun Hu, Yu Han, Xiao Li, Chao-kai Wen, Shi Jin

Published 2026-03-18

Imagine you are trying to talk to a friend who is flying a drone through a busy, narrow city canyon. You both have powerful, high-tech walkie-talkies (the XL-MIMO system) that can send incredibly focused beams of sound (or radio waves) to each other.

If your friend were far away (the "far-field"), you could just shout in a general direction, and the sound would spread out like a flashlight beam. But because your friend is close (the "near-field") and the city is full of buildings, the sound doesn't spread; it acts like a laser pointer. It has to hit your friend at exactly the right angle and at exactly the right distance. If you miss by a tiny bit, the connection breaks.

The problem? The "map" of all possible directions and distances is massive. It's like trying to find a specific grain of sand on a beach by checking every single grain one by one. It would take forever, and your friend would fly away before you found them.

This paper proposes a smart solution: A "Super-Intelligent Co-Pilot" (The LLM Framework) that doesn't just guess; it understands the world.

Here is how it works, broken down into simple parts:

1. The "Super-Senses" (Multimodal Inputs)

Instead of just listening to radio signals (which is like trying to navigate in the dark), this system gives the AI a full set of senses:

  • GPS: It knows where the drone was a moment ago.
  • Eyes (RGB Camera): It sees the buildings, trees, and streets.
  • Depth Vision (LiDAR): It knows exactly how far away those buildings are.
  • The "Story" (Text Prompts): The human operator tells the AI, "Hey, the drone is doing a zigzag patrol." This is like giving the AI a hint about the plot of the movie so it can predict what happens next.

2. The "Brain" (The Large Language Model)

Usually, beam-selection systems just crunch numbers. But here, the researchers use a Large Language Model (LLM), the same kind of technology that powers chatbots.

  • Why? Because LLMs are great at reasoning. They can look at the GPS path, the picture of the street, and the text description, and say, "Ah, the drone is turning left around that corner. The signal will bounce off that brick wall. I know exactly where the beam needs to go."
  • It's like having a co-pilot who has read every map of the city and can predict the drone's moves before they happen.
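One plausible way to hand this to an LLM is to serialize the drone's recent track and the operator's hint into text alongside the visual inputs. The function below is only a toy illustration of that idea; the paper's actual encoders and prompt format are not shown here.

```python
from typing import List, Tuple

def build_prompt(gps_track: List[Tuple[float, float, float]], operator_note: str) -> str:
    """Toy illustration: fold the drone's recent track and the operator's hint
    into one text prompt. A real system would feed camera and LiDAR data through
    dedicated encoders rather than describing them in words."""
    track = "; ".join(f"({lat:.4f}, {lon:.4f}, {alt:.0f} m)" for lat, lon, alt in gps_track)
    return (
        f"Recent GPS positions of the drone: {track}. "
        f"Operator note: {operator_note} "
        "Given the attached camera and LiDAR views of the street canyon, "
        "reason about the geometry and predict the azimuth, elevation, and "
        "distance of the best beam."
    )

print(build_prompt(
    [(22.5960, 113.9980, 80.0), (22.5968, 113.9991, 82.0)],
    "The drone is doing a zigzag patrol along the canyon.",
))
```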

3. The "Smart Map" (Structure-Aware Prediction)

The biggest problem was that the map of possible directions and distances was far too big to search exhaustively.

  • The Old Way: Trying to guess one giant, complicated number (e.g., "Beam #4,592,103").
  • The New Way: The AI breaks the problem down into three simple questions, just like giving someone directions:
    1. Left or Right? (Azimuth)
    2. Up or Down? (Elevation)
    3. Near or Far? (Distance)
  • By solving these three small puzzles separately, the AI avoids getting overwhelmed. It's like solving a Rubik's cube one side at a time instead of trying to twist the whole thing at once.
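To get a rough sense of scale, suppose the codebook has 512 azimuth angles, 64 elevation angles, and 32 distance rings (made-up grid sizes, not the paper's actual codebook). As one flat label that is 512 × 64 × 32 = 1,048,576 classes to pick from; as three separate questions it is only 512 + 64 + 32 = 608 outputs. The sketch below shows the decomposition; the index ordering is an arbitrary assumption.

```python
# Assumed (made-up) grid sizes for a near-field beam codebook.
N_AZ, N_EL, N_DIST = 512, 64, 32

def flat_to_structured(beam_id: int) -> tuple[int, int, int]:
    """Split one giant beam index into (azimuth, elevation, distance) indices."""
    az, rest = divmod(beam_id, N_EL * N_DIST)
    el, dist = divmod(rest, N_DIST)
    return az, el, dist

def structured_to_flat(az: int, el: int, dist: int) -> int:
    """Recombine the three small answers into the single codebook index."""
    return (az * N_EL + el) * N_DIST + dist

beam_id = structured_to_flat(az=300, el=10, dist=5)
assert flat_to_structured(beam_id) == (300, 10, 5)

print("joint labels:", N_AZ * N_EL * N_DIST)      # 1,048,576 classes in one shot
print("per-axis outputs:", N_AZ + N_EL + N_DIST)  # only 608 outputs when decomposed
```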

4. The "Safety Net" (Adaptive Refinement)

Even smart AI makes mistakes. What if the AI is only 60% sure?

  • The Trick: The AI also gives itself a "Confidence Score."
  • If the score is high (90%+), it just points the beam and says, "Go!" (Zero delay).
  • If the score is low, it doesn't panic. It says, "I'm not sure, but I think it's in this small neighborhood." It then does a tiny, quick scan of just that small neighborhood to lock on.
  • This is like a detective who is sure of the suspect's location (no search needed) vs. a detective who has a hunch and checks the top 5 likely houses instead of searching the whole city.
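That decision rule can be sketched as a few lines of code: trust the prediction when the confidence clears a threshold, otherwise measure only a small neighborhood of candidate beams and keep the strongest. The 0.9 threshold, the hand-picked neighborhood, and the measure_signal stand-in are all illustrative assumptions, not values from the paper.

```python
from typing import Callable, Iterable, Tuple

Beam = Tuple[int, int, int]  # (azimuth, elevation, distance) indices

def select_beam(
    predicted: Beam,
    confidence: float,
    neighbors: Iterable[Beam],
    measure_signal: Callable[[Beam], float],
    threshold: float = 0.9,  # assumed confidence cutoff
) -> Beam:
    """If the model is confident, use its beam outright (no extra scanning).
    Otherwise, measure only a handful of nearby candidates and keep the strongest."""
    if confidence >= threshold:
        return predicted
    candidates = [predicted, *neighbors]
    return max(candidates, key=measure_signal)

# Toy usage: pretend the true best beam sits one azimuth step away.
best = select_beam(
    predicted=(300, 10, 5),
    confidence=0.6,
    neighbors=[(299, 10, 5), (301, 10, 5), (300, 10, 4), (300, 10, 6)],
    measure_signal=lambda b: -abs(b[0] - 299),  # stand-in for a real power measurement
)
print(best)  # (299, 10, 5)
```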

5. The Result

The paper shows that this "Super-Intelligent Co-Pilot" is much better than:

  • Old methods that just search blindly (too slow).
  • Other AI methods that only look at radio signals (too confused by buildings).
  • Other AI methods that try to guess the whole direction at once (too messy).

In a nutshell:
This paper teaches a computer to be a smart, multi-sensory navigator. Instead of blindly searching for a signal in a complex city, it uses cameras, maps, and "common sense" reasoning to predict exactly where the signal needs to go, saving time and keeping the connection strong even when the drone is flying through a chaotic, obstacle-filled environment.
