Near-Field Multiuser Beam Training for XL-MIMO: An End-to-End Interference-Aware Approach with Pilot Limitations

This paper proposes a deep-learning-based interference-aware multiuser beam training framework (DL-IABT) for near-field XL-MIMO systems that directly predicts analog beam indices from limited uplink sensing measurements to achieve near-optimal sum-rate performance while significantly reducing pilot overhead.

Xinyang Li, Songjie Yang, Xiang Ling, Jianhui Song, Yibo Wang, Hua Chen

Published Fri, 13 Ma

Imagine you are running a massive, high-tech concert hall (the Base Station) with thousands of tiny speakers (the Antennas) arranged in a giant wall. Your goal is to play a perfect, crystal-clear song for 8 different VIP guests (the Users) sitting in different spots in the audience. Some guests are right up close to the stage (Near-Field), while others are far back in the balcony (Far-Field).

In the old days, to make sure everyone hears you clearly, you had to shout a test sound in every single direction, one by one, to see which direction worked best. This is called Beam Training.

The Problem: The "Search" Nightmare

With this new giant wall of speakers, the problem is threefold:

  1. Too Many Directions: Because some guests are close, you have to aim not only left or right but also at the right distance. This turns a simple 2D search (like a map) into a 3D search (like a globe), making the number of directions to check explode.
  2. Too Many Guests: If you try to aim at Guest A, you might accidentally blast noise into Guest B's ear. You need to find a setup where everyone gets their song loud and clear without drowning out the neighbors.
  3. The Time Limit: You only have a tiny window of time to shout these test sounds before the music starts. If you spend too much time testing, you waste the concert time.
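Some rough, made-up numbers (not from the paper) show why adding a distance dimension makes the exhaustive sweep blow up:

```python
# Hypothetical codebook sizes to illustrate the near-field search blow-up.
# A far-field codebook only quantizes angle; a near-field codebook must
# quantize angle AND distance, multiplying the number of candidate beams.
n_angles = 256      # assumed angular grid resolution
n_distances = 16    # assumed distance "rings" added by the near field

far_field_beams = n_angles                 # angle-only search
near_field_beams = n_angles * n_distances  # angle x distance search

# An exhaustive sweep has to test every candidate beam once.
print(far_field_beams)   # candidate beams in the far-field sweep
print(near_field_beams)  # candidate beams in the near-field sweep
```

With these assumed numbers, the near-field sweep is 16x larger, and every extra test beam eats into the "concert time" left for actual data.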

The old method is like a security guard checking every single door in a skyscraper one by one. It takes forever, and by the time they check the top floor, the concert is over.

The Solution: The "AI Crystal Ball" (DL-IABT)

This paper proposes a new system called DL-IABT. Instead of checking every door, it uses a Deep Learning AI that acts like a super-smart crystal ball.

Here is how it works, step-by-step:

1. The "Sub-Array" Trick (Breaking the Wall into Blocks)

Instead of treating the thousands of speakers as one giant, impossible-to-manage unit, the system splits them into smaller blocks (like dividing the concert hall into sections).

  • The Analogy: Imagine the wall of speakers is a giant mosaic. Instead of trying to control every single tile individually, you control 8 large panels. Each panel is small enough that, even for a nearby guest, the sound wave looks flat across it, so the system can treat every panel as a standard far-field speaker, which makes the math much easier.
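The partitioning itself is simple bookkeeping. Here is a minimal sketch with assumed sizes (a 1024-antenna array split into 8 blocks; neither number is from the paper):

```python
import numpy as np

# Sub-array partition sketch: split one huge antenna array into blocks,
# then steer each block ("panel") with an ordinary far-field codeword.
n_antennas = 1024   # assumed total array size
n_subarrays = 8     # assumed number of panels

antenna_indices = np.arange(n_antennas)
blocks = antenna_indices.reshape(n_subarrays, -1)  # one row per panel

print(blocks.shape)   # panels x antennas-per-panel
print(blocks[0][:4])  # first few antennas assigned to panel 0
```

The payoff is that each 128-element block is a small, tractable steering problem instead of one 1024-element near-field monster.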

2. The "Magic Ear" (Complex Sensing)

Before the concert starts, the guests whisper a tiny, secret code (a Pilot Signal) to the stage.

  • The Old Way: The stage tries to guess the direction based on these whispers using a rigid checklist.
  • The New Way: The AI listens to these whispers through a "Complex Sensing Front-End." It's like the AI has a super-ear that doesn't just hear the volume, but understands the shape and texture of the sound waves, even with background noise.
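One way to picture "hearing the shape of the wave" is that the pilot measurements are complex numbers carrying both amplitude and phase. A minimal sketch (sizes and the real/imaginary stacking are illustrative assumptions, not the paper's exact front-end):

```python
import numpy as np

# Each user's uplink pilot yields a few complex measurements.
# A real-valued network can ingest them by stacking the real and
# imaginary parts as two feature channels, preserving phase
# information instead of keeping only the magnitude.
rng = np.random.default_rng(0)
n_users, n_pilots = 8, 4

# Simulated noisy complex measurements (stand-in for real pilots).
y = rng.standard_normal((n_users, n_pilots)) \
    + 1j * rng.standard_normal((n_users, n_pilots))

features = np.stack([y.real, y.imag], axis=-1)  # keep phase, not just |y|

print(features.shape)  # users x pilots x (Re, Im)
```

Crucially, only a handful of pilot measurements per user are needed as input, which is exactly where the overhead savings come from.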

3. The "Group Chat" (Transformer Predictor)

This is the brain of the operation. The AI uses a Transformer (the same technology that powers modern chatbots).

  • The Analogy: Imagine the 8 guests are in a group chat. The AI reads the whispers from all 8 guests at the same time. It understands that if Guest 1 is whispering from the left, Guest 2 might be on the right, and they might interfere with each other.
  • Instead of picking the best spot for Guest 1, then Guest 2, the AI looks at the whole group dynamic. It figures out the perfect combination of speaker angles that makes everyone happy simultaneously, minimizing the "crosstalk" (interference).
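The "group chat" is essentially self-attention across users. A toy numpy sketch (random weights and made-up sizes, just to show the mechanism, not the paper's architecture):

```python
import numpy as np

# One self-attention pass: each of the 8 user tokens attends to all the
# others BEFORE any beam is chosen, so user i's output already accounts
# for everyone else (the interference-aware part).
rng = np.random.default_rng(1)
n_users, d = 8, 16

tokens = rng.standard_normal((n_users, d))  # per-user pilot embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = q @ k.T / np.sqrt(d)               # user-to-user affinities
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)     # softmax over users

out = attn @ v                              # each user's updated view
print(out.shape)                            # still one vector per user
```

Because every row of `attn` mixes in all users, the downstream beam choice for one guest is conditioned on where everyone else is sitting.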

4. The "Instant Decision" (Gumbel-Softmax)

Usually, picking a specific speaker angle is a hard "yes or no" choice (like flipping a switch). Neural networks struggle to learn such discrete choices, because gradients cannot flow backward through an on/off switch during training.

  • The Analogy: The AI uses a special trick called Gumbel-Softmax. Think of it like a "soft" switch that can be slightly on, slightly off, and then quickly snaps to the perfect "on" position. This allows the AI to learn and adjust its choices instantly during training, then lock in the perfect setting for the real show.
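A minimal numpy sketch of the general Gumbel-Softmax trick (the standard technique, not the paper's exact layer; logits and temperature are made-up):

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_softmax(logits, temperature):
    # Gumbel noise turns the argmax into a random sample; the softmax
    # keeps the result "soft" (differentiable) instead of a hard pick.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / temperature
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.1, 2.5, -1.0, 0.3])  # scores for 4 candidate beams
soft = gumbel_softmax(logits, temperature=0.1)
hard_choice = int(soft.argmax())          # the beam the switch "snaps" to

print(np.round(soft, 3))  # low temperature -> typically near one-hot
print(hard_choice)
```

At a low temperature the output is almost one-hot (the switch has nearly snapped), yet it remains a smooth function of the logits, so gradient descent can still adjust the choice during training.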

Why is this a Game Changer?

The paper ran simulations (computer tests) and found two amazing things:

  1. Near-Perfect Performance: The AI's performance was almost as good as a "God-mode" scenario where the system knows the exact location of every guest perfectly (which is impossible in real life).
  2. The Efficiency Win: This is the big one. Because the AI only needed a tiny number of whispers (pilots) to figure out the whole room, it saved a massive amount of time.
    • The Result: When you subtract the time spent testing from the total concert time, the AI system delivered much more actual music (data) to the guests than the old methods. The old methods spent so much time testing that they barely had time to play the song.

Summary

In short, this paper teaches a computer to look at a few clues and instantly guess the perfect way to aim a giant speaker wall for a crowd of people, without wasting time checking every single possibility. It turns a slow, exhausting search into a lightning-fast, smart prediction, ensuring everyone gets a great show without the noise.