Indirect and Direct Multiuser Hybrid Beamforming for Far-Field and Near-Field Communications: A Deep Learning Approach

Imagine you are trying to host a massive, high-stakes dinner party in a giant, circular room (the Base Station) with hundreds of waiters (the Antennas). Your goal is to serve delicious food (data) to many guests (users) sitting at different tables, some very close to the kitchen and some far away in the back.

The challenge? You only have a few head waiters (the RF Chains) to direct the hundreds of waiters. You can't tell every single waiter exactly what to do individually; instead, the head waiters must give general instructions to groups of waiters to point their trays in the right direction.

This is the problem of Hybrid Beamforming in next-generation 6G networks. This paper proposes a new way to solve it using Artificial Intelligence (Deep Learning).

Here is the breakdown of their solution using simple analogies:

1. The Problem: The "Near-Field" Confusion

In older networks, everyone was far away, so the food (signals) traveled in straight, flat lines (like a laser pointer). You just had to point your tray at the right angle.

But in these new massive systems, some guests are sitting right next to the kitchen. Here, the food doesn't travel in a flat line; it travels in a curved wave (like ripples in a pond).

The Issue: If you just point your tray at the right angle, you might miss the guest because they are too close (or too far). You need to aim at the right angle AND the right distance simultaneously.
The Noise: With so many guests close together, their voices (interference) mix up. It's hard to hear one person when everyone is shouting at once.

2. The Old Way vs. The New Way

The Old Way (Traditional Math): Imagine a head waiter trying to calculate the perfect angle and distance for every single guest using a giant calculator. It's accurate, but it takes forever. By the time they finish the math, the guests have moved, and the food is cold.
The "Direct" Way (Current AI): Some existing AI models try to guess the answer by looking at the guests' faces (pilots) without knowing the full room layout. But these models often get confused by the noise or the complex math, leading to dropped trays. The paper's own Direct mode (Mode 2 in Section 4) is designed to avoid this.

3. The Paper's Solution: The "Smart Brain" System

The authors built a Deep Learning Brain that acts like a super-intelligent head waiter. It has three special tricks:

A. The "Magic Glasses" (Grouped Convolution Sensing)

Instead of trying to see the whole room perfectly first, the AI puts on a pair of special glasses that scan the room in specific patterns.

How it works: It doesn't just look at the guests; it learns how to "listen" to the room's echoes. It figures out where the guests are by how the sound bounces off the walls, even if it can't see them clearly yet. This is the Sensing Front-End.

B. The "Group Chat" (Shared MLP)

Once the AI has a rough idea of where everyone is, it uses a "Group Chat" to organize the waiters.

How it works: It runs the same analysis on every guest, one guest at a time, to summarize where each one is and what they need. Those summaries are then combined so that all guests can be served at once without their voices overlapping. This is the Feature Extraction part.

C. The "Instant Recipe" (The KKT Shortcut)

This is the paper's biggest innovation. Usually, AI tries to guess both the angle (analog) and the volume (digital) at the same time. This often leads to the AI getting stuck or confused (unstable gradients).

The Trick: The authors realized that once you know the angle (where to point), the perfect volume (digital settings) can be calculated instantly using a simple math formula (like a shortcut recipe).
The Result: The AI only needs to learn the "pointing" part. It ignores the "volume" part during training because it can calculate that instantly later. This makes the AI learn much faster and more stably.

4. Two Modes of Operation

The system works in two ways, depending on how much information it has:

Mode 1: The "Map Reader" (Indirect)
- Scenario: You have a map of the room (Channel State Information, whether perfect or estimated).
- Action: The AI looks at the map, instantly decides where to point the trays, and uses the "Instant Recipe" to set the volume.
- Benefit: It's incredibly fast and nearly as good as the perfect mathematical solution, but takes a fraction of the time.
Mode 2: The "Eagle Eye" (Direct)
- Scenario: You don't have a map. You only have a few seconds to shout "Hello!" to the guests to see who answers (Short Pilots).
- Action: The AI uses its "Magic Glasses" to listen to those short shouts. It learns to point the trays directly based on the sound, without needing to build a full map first. Then, it does a quick check to fine-tune the volume.
- Benefit: This saves a massive amount of time and energy. In a crowded room, shouting less means less noise for everyone else.

5. Why This Matters

Speed: It produces its answer in a single pass through the network, whereas traditional iterative methods must repeat their calculations many times.
Efficiency: It works even when guests sit very close to the kitchen (Near-Field), a scenario where older methods struggle.
Robustness: It keeps working in cluttered rooms where signals bounce along many paths, a situation where older pilot-based methods degrade quickly.

In a nutshell:
This paper teaches a computer to be a master conductor for a massive orchestra. Instead of trying to write sheet music for every single instrument (which is too slow), the conductor learns to wave the baton in the perfect pattern so that the instruments naturally play the right notes together, even if the musicians are standing right next to the conductor. It is faster than the iterative methods it is compared against, and it handles the chaos of a crowded room more reliably than earlier learning-based approaches.

Here is a detailed technical summary of the paper "Indirect and Direct Multiuser Hybrid Beamforming for Far-Field and Near-Field Communications: A Deep Learning Approach."

1. Problem Statement

The paper addresses the challenges of Hybrid Beamforming (HBF) in Extremely Large-Scale MIMO (XL-MIMO) systems, specifically in near-field communication scenarios.

Near-Field Complexity: Unlike far-field systems where channels depend only on angle, near-field channels (due to large antenna apertures) depend on both angle and distance (spherical-wave model). This enables "beam focusing" but complicates optimization.
Multiuser Interference (MUI): In multiuser settings, users with similar angles but different distances create strong spatial correlation and interference, making beamforming optimization non-convex and computationally intensive.
Limitations of Existing Methods:
- Decoupled Designs: Often optimize analog beamforming without explicitly accounting for MUI, leading to suboptimal performance.
- End-to-End (E2E) Deep Learning: Existing E2E approaches that jointly optimize analog and digital components often suffer from unstable training due to non-convex constant-modulus (CM) constraints, pronounced analog-digital coupling, and gradient instability when optimizing sum-rate directly.
- CSI Overhead: Traditional methods require perfect Channel State Information (CSI), which is expensive to acquire in XL-MIMO. Direct methods (learning from pilots) often rely on sparse recovery, which fails in dense scattering or extreme near-field conditions.
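
To make the angle-versus-distance point concrete, here is a minimal NumPy sketch (not taken from the paper) that compares far-field and near-field steering vectors for a uniform linear array. The function names, the half-wavelength spacing, and the example values (256 antennas, 1 cm wavelength, users at 5 m and 50 m on the same angle) are illustrative assumptions.

```python
import numpy as np

def far_field_steering(M, theta, d_over_lambda=0.5):
    """Planar-wave ULA steering vector: phase is linear in the antenna index."""
    n = np.arange(M) - (M - 1) / 2            # antenna index, centered on the array
    return np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(theta)) / np.sqrt(M)

def near_field_steering(M, theta, r, wavelength, d_over_lambda=0.5):
    """Spherical-wave ULA steering vector: phase follows the exact per-antenna distance."""
    d = d_over_lambda * wavelength            # antenna spacing in meters
    n = np.arange(M) - (M - 1) / 2
    # Exact distance from antenna n to a user at range r and angle theta
    r_n = np.sqrt(r**2 + (n * d) ** 2 - 2 * r * n * d * np.sin(theta))
    return np.exp(-1j * 2 * np.pi * (r_n - r) / wavelength) / np.sqrt(M)

# Illustrative values (assumptions, not from the paper)
M, wavelength, theta = 256, 0.01, 0.3

# Two users on the same angle but at different ranges
a_near_5 = near_field_steering(M, theta, 5.0, wavelength)
a_near_50 = near_field_steering(M, theta, 50.0, wavelength)
a_far = far_field_steering(M, theta)

# Correlation between the two users' array responses under each model
corr_far = abs(np.vdot(a_far, a_far))          # planar model cannot tell them apart
corr_near = abs(np.vdot(a_near_5, a_near_50))  # spherical model separates them by range
print(f"far-field correlation:  {corr_far:.3f}")
print(f"near-field correlation: {corr_near:.3f}")
```

Under the planar-wave model the two users look identical to the array, while under the spherical-wave model their responses differ with range as well as angle. That extra dimension is what makes beam focusing possible and what the optimizer has to account for.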

2. Methodology

The authors propose a fully complex-valued End-to-End (E2E) Deep Learning framework based on Deep Complex Neural Networks (DCNs). The framework supports two operational modes: Indirect (using estimated CSI) and Direct (using raw pilot measurements).

A. Core Optimization Strategy: Variant-MMSE with KKT Elimination

To stabilize training and decouple the analog and digital design, the authors replace the standard sum-rate maximization objective with a Variant Minimum Mean Square Error (Variant-MMSE) criterion.

Digital Precoder Elimination: Using Karush–Kuhn–Tucker (KKT) conditions, the optimal digital precoder ( $F_{BB}$ ) is derived in closed form as a function of the analog precoder ( $F_{RF}$ ) and the channel.
Stable Objective: By substituting the closed-form $F_{BB}$ back into the objective function, the network only needs to learn the analog precoder $F_{RF}$ . This removes the instability caused by jointly optimizing non-convex analog and digital variables.
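
The elimination step can be sketched as follows. The paper's exact variant-MMSE expression is not reproduced in this summary, so the code uses a standard regularized MMSE (regularized zero-forcing) closed form on the effective channel as a stand-in. The function name, the regularizer $K\sigma^2/P$, and the power normalization are assumptions, not the authors' formula.

```python
import numpy as np

def closed_form_digital_precoder(H, F_rf, noise_var=0.1, power=1.0):
    """Stand-in closed form: F_BB as a function of F_RF and the channel.

    H    : M x K channel matrix (column k is user k's channel)
    F_rf : M x N_rf analog precoder
    Returns an N_rf x K digital precoder scaled to the total power budget.
    """
    K = H.shape[1]
    H_eq = H.conj().T @ F_rf                                  # K x N_rf effective channel
    reg = (K * noise_var / power) * np.eye(K)                 # assumed MMSE regularizer
    F_bb = H_eq.conj().T @ np.linalg.inv(H_eq @ H_eq.conj().T + reg)
    # Scale so that the full hybrid precoder F_RF F_BB meets the power budget
    F_bb = F_bb * np.sqrt(power) / np.linalg.norm(F_rf @ F_bb, "fro")
    return F_bb

# Illustrative dimensions and random data (assumptions)
rng = np.random.default_rng(0)
M, N_rf, K = 64, 4, 4
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (M, N_rf))) / np.sqrt(M)

F_bb = closed_form_digital_precoder(H, F_rf)
total_power = np.linalg.norm(F_rf @ F_bb, "fro") ** 2
print(F_bb.shape, round(total_power, 6))
```

Because `F_bb` is a deterministic function of `F_rf` and `H`, a training loss written in terms of this function depends only on the analog precoder, which is the decoupling the section describes.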

B. Network Architecture

The proposed network consists of three main modules:

Grouped Complex-Convolution Sensing Front-end:
- Emulates the uplink (UL) measurement process.
- Uses grouped convolutions to learn a bank of sensing matrices ( $\Phi$ ) that mimic physical beam switching.
- Operates on complex-valued data (Real/Imaginary channels) without bias terms to preserve linearity.
Shared Per-User Complex MLP:
- A shared Multi-Layer Perceptron (MLP) extracts latent features for each user independently.
- Uses Complex Batch Normalization and a custom Complex Hyperbolic Tangent (CTanh) activation function to handle complex signals and prevent gradient explosion/vanishing.
Merged Output Head:
- Aggregates user features and outputs the analog precoder.
- Applies a Constant-Modulus (CM) normalization layer to ensure the output satisfies hardware constraints ( $|[F_{RF}]_{i,j}| = 1/\sqrt{M}$ ).
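
The constant-modulus step at the end of the output head can be illustrated with a few lines of NumPy. This is a sketch of the constraint itself, not the authors' layer; the function name and the small `eps` guard against division by zero are assumptions.

```python
import numpy as np

def constant_modulus_normalize(Z, eps=1e-12):
    """Keep each entry's phase and force its modulus to 1/sqrt(M)."""
    M = Z.shape[0]                              # number of antennas (rows)
    return Z / (np.abs(Z) + eps) / np.sqrt(M)

# Unconstrained complex output of a hypothetical network head
rng = np.random.default_rng(1)
M, N_rf = 64, 4
Z = rng.standard_normal((M, N_rf)) + 1j * rng.standard_normal((M, N_rf))

F_rf = constant_modulus_normalize(Z)
print(np.abs(F_rf).min(), np.abs(F_rf).max())   # both close to 1/sqrt(64) = 0.125
```

With this scaling every column of `F_rf` has unit norm, since it contains `M` entries of modulus $1/\sqrt{M}$.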

C. Operational Modes

Indirect Mode (DL-IMHB):
- Input: Estimated Channel State Information (CSI).
- Process: The network predicts $F_{RF}$ from the CSI. The digital precoder $F_{BB}$ is calculated analytically using the KKT closed-form solution.
- Use Case: High-performance scenarios where CSI is available (e.g., via estimation).
Direct Mode (DL-DMHB):
- Input: Raw UL pilot measurements (no explicit CSI reconstruction).
- Process:
  1. The network learns a sensing operator to map pilots directly to $F_{RF}$ .
  2. The analog precoder is fixed, and a short sequence of pilots is used to estimate the effective equivalent channel ( $H_{eq} = H^H F_{RF}$ ).
  3. The digital precoder is derived from this low-dimensional effective channel using the closed-form KKT solution.
- Use Case: Low-latency, low-overhead scenarios where full CSI estimation is impractical.
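
Step 2 of the direct mode can be illustrated with a least-squares estimate of the effective channel. The summary does not specify the pilot design, so this sketch assumes uplink pilots observed through the fixed analog combiner $F_{RF}^H$, channel reciprocity, and orthonormal DFT pilots of length `tau = K`; all names and values are hypothetical.

```python
import numpy as np

def estimate_effective_channel(Y, X):
    """Least-squares estimate of H_eq = H^H F_RF from Y = F_RF^H H X + noise.

    Y : N_rf x tau received pilots after analog combining
    X : K x tau pilot matrix (row k is user k's pilot sequence)
    """
    G_hat = Y @ X.conj().T @ np.linalg.inv(X @ X.conj().T)   # N_rf x K estimate of F_RF^H H
    return G_hat.conj().T                                     # K x N_rf

# Illustrative setup (assumptions)
rng = np.random.default_rng(0)
M, N_rf, K, tau = 64, 4, 4, 4
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (M, N_rf))) / np.sqrt(M)

X = np.fft.fft(np.eye(tau))[:K] / np.sqrt(tau)                # orthonormal pilot rows
noise = 0.01 * (rng.standard_normal((N_rf, tau)) + 1j * rng.standard_normal((N_rf, tau)))
Y = F_rf.conj().T @ H @ X + noise

H_eq_hat = estimate_effective_channel(Y, X)
H_eq_true = H.conj().T @ F_rf
print(np.abs(H_eq_hat - H_eq_true).max())                     # small estimation error
```

The point of the sketch is the dimension count: only the `K x N_rf` effective channel is estimated, rather than the full `M x K` channel, which is why a short pilot sequence can suffice once the analog precoder is fixed.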

3. Key Contributions

Novel E2E Framework: A fully complex-valued DNN that unifies indirect and direct beamforming for both near-field and far-field XL-MIMO.
Stable Training via KKT: Introduction of a variant-MMSE objective where the digital precoder is analytically eliminated. This circumvents the gradient instability of conventional E2E sum-rate optimization and ensures robust convergence.
Physical-Inspired Architecture:
- A grouped complex-convolution layer that learns sensing matrices, effectively acting as a learnable linear observation operator.
- A shared complex MLP with CTanh activation that captures user-specific features while maintaining scalability.
Direct Beamforming with Short Pilots: A protocol that learns the sensing operator and analog mapping directly from pilots, significantly reducing pilot overhead compared to sparse-recovery baselines.
Interpretability: Visualization of learned sensing patterns shows the network learns to concentrate energy in effective sectors and captures range-dependent phase evolution (spherical wavefronts) in the latent feature space.

4. Simulation Results

The proposed method was evaluated against baselines like SU-DNN, LDMA, TH-HMP (iterative), and sparse recovery methods (P-SOMP, P-SIGW).

Performance (Sum Rate):
- Indirect Mode: Outperforms SU-DNN by ~3 bps/Hz and approaches the performance of iterative optimization (TH-HMP) with significantly lower complexity (single forward pass vs. multiple iterations).
- Direct Mode: Achieves up to 3 bps/Hz higher sum rate than sparse-recovery baselines (P-SIGW) and ~3.3 bps/Hz over SU-DNN under the same pilot budget.
Robustness:
- Near-Field: Maintains stable performance in extreme near-field (10m) where sparse recovery methods fail due to grid mismatch and energy spread.
- Channel Sparsity: Performs well in both sparse and rich scattering environments, whereas sparse recovery degrades rapidly as path count increases.
Complexity:
- Inference: Achieves near-optimal performance with O(M) complexity (linear in antenna count) compared to O(M^2) or O(M^3) for iterative algorithms.
- Pilot Efficiency: Reduces pilot overhead significantly (e.g., 6+2 pilots vs. 24 pilots for baselines) while maintaining high spectral efficiency.
Scalability: Demonstrates excellent scalability with increasing user counts ( $K$ ) and antenna numbers ( $M$ ), including generalization to Uniform Planar Arrays (UPA).
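
For readers who want to see what a sum rate in bps/Hz measures, here is a minimal sketch of the usual multiuser metric for single-antenna users, with each user treating the other users' signals as noise. The function name and the toy inputs are assumptions; the paper's simulation setup is not reproduced here.

```python
import numpy as np

def sum_rate(H, F_rf, F_bb, noise_var=1.0):
    """Sum spectral efficiency in bps/Hz.

    H    : M x K channel matrix (column k is user k's channel)
    F_rf : M x N_rf analog precoder
    F_bb : N_rf x K digital precoder (column k serves user k)
    """
    G = H.conj().T @ F_rf @ F_bb              # G[k, j] = h_k^H F_RF f_BB,j
    P = np.abs(G) ** 2                        # received power of stream j at user k
    signal = np.diag(P)                       # desired-stream power per user
    interference = P.sum(axis=1) - signal     # power leaked from other users' streams
    sinr = signal / (interference + noise_var)
    return float(np.sum(np.log2(1 + sinr)))

# Toy check: two users, no interference, SINR = 1 each
I2 = np.eye(2)
rate_clean = sum_rate(I2, I2, I2, noise_var=1.0)
# Same users, but every stream leaks fully into the other user
rate_interfered = sum_rate(I2, I2, np.ones((2, 2)), noise_var=1.0)
print(rate_clean, rate_interfered)
```

The second case shows how multiuser interference lowers the rate even though each user's own signal power is unchanged.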

5. Significance

This work bridges the gap between theoretical optimization and practical deep learning deployment in 6G XL-MIMO systems.

Solves the Near-Field Challenge: It provides a robust solution for the joint angle-distance beamforming required in near-field communications, a critical enabler for 6G.
Stability vs. Performance: By decoupling the digital precoder via KKT conditions, it solves the long-standing issue of training instability in E2E beamforming networks, offering a reliable alternative to iterative algorithms.
Hardware Efficiency: The direct mode enables high-performance beamforming with minimal pilot overhead and no need for explicit channel estimation, making it highly suitable for real-time, low-latency 6G applications.
Generalizability: The framework is adaptable to various array geometries (ULA, UPA) and can be extended to wideband systems and to models that include hardware impairments (quantization, non-linearities).