Enhancing User Throughput in Multi-panel mmWave Radio Access Networks for Beam-based MU-MIMO Using a DRL Method

This paper proposes a deep reinforcement learning (DRL)-based adaptive beam management framework for multi-panel mmWave networks. By leveraging spatial-domain characteristics and real-time observations, it improves user throughput by up to 16% and reduces latency by a factor of 3–7 compared to legacy methods.

Ramin Hashemi, Vismika Ranasinghe, Teemu Veijalainen, Petteri Kela, Risto Wichman

Published 2026-03-04

Imagine you are the conductor of a massive, high-speed orchestra, but instead of violins and drums, your instruments are radio waves traveling at the speed of light. This is the world of mmWave (millimeter-wave) communication, the technology behind the super-fast 5G networks of the future.

However, there's a problem: these radio waves are like fragile whispers. They can't travel far or go through walls easily. To make them loud enough, we use antennas that act like giant flashlights, focusing the signal into tight beams to hit specific users.

The Problem: The "Flashlight" Dilemma

In a busy city, you have many users (let's call them "Mobile Terminals" or MTs) trying to talk to the network at the same time. Your base station (the "gNB") has multiple panels of antennas, each with its own set of "flashlights" (beams).

The Old Way (The Legacy Approach):
Imagine a traffic cop who only looks at who is standing closest to the intersection. The old system simply picks the beam with the strongest signal (the loudest whisper) for every user.

  • The Flaw: Just because a signal is loud right now doesn't mean it's the best choice for the whole group. If you pick the loudest beam for every user independently, you might point two beams at nearly the same spot, causing them to crash into each other (interference). Or you might pick a beam that is loud but rarely used, leaving other users waiting in line. It's like a chef who only ever cooks the single most popular dish, while most of the other customers go unserved.
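The legacy rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function and variable names are hypothetical, and each user is assumed to report one RSRP value per candidate beam.

```python
def legacy_beam_selection(rsrp_reports):
    """Legacy rule: each user independently gets the beam with the
    highest reported RSRP, ignoring interference between users.

    rsrp_reports: dict mapping user id -> list of RSRP values (dBm),
    one per candidate beam. Returns dict: user id -> chosen beam index.
    """
    assignment = {}
    for user, rsrp_per_beam in rsrp_reports.items():
        # Pick the loudest beam for this user in isolation.
        assignment[user] = max(range(len(rsrp_per_beam)),
                               key=lambda b: rsrp_per_beam[b])
    return assignment

reports = {
    "MT1": [-95.0, -80.0, -102.0],
    "MT2": [-88.0, -79.5, -110.0],  # strongest beam is the same as MT1's
}
print(legacy_beam_selection(reports))  # both users land on beam 1 -> interference
```

Note how both users end up on the same beam: the rule has no notion of the group, which is exactly the flaw the paper targets.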

The Solution: The "Smart Conductor" (DRL)

This paper introduces a Deep Reinforcement Learning (DRL) system. Think of this as a Smart Conductor who doesn't just look at who is closest, but learns from experience to manage the whole orchestra perfectly.

The Smart Conductor looks at three things before deciding which "flashlight" to turn on:

  1. Signal Strength (RSRP): Is the signal loud? (The obvious choice).
  2. Popularity (Beam Usage): Has this beam been used a lot lately? If a beam is "popular," it means the network is already good at handling traffic on that path. Switching to a new, unused beam might cause a delay while the system figures it out.
  3. Compatibility (Cross-Correlation): This is the magic part. Imagine you have two people trying to talk on walkie-talkies. If they stand too close together, their voices overlap and become noise. The Smart Conductor checks the "spatial relationship" between beams. It asks: "If I turn on Beam A for User 1, will it interfere with Beam B for User 2?" If they are compatible, it schedules them together. If not, it picks a different pair.
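The three observations above can be pictured as features the agent reads before choosing beams. The sketch below is illustrative only (the paper's exact feature definitions and thresholds are not given in this summary): RSRP and beam-usage history are stacked into a state vector, and a normalized cross-correlation between beam steering vectors serves as the "walkie-talkie" compatibility check.

```python
import numpy as np

def beam_compatibility(steering_a, steering_b, threshold=0.5):
    """Normalized cross-correlation of two beam steering vectors.
    Near 1 means the beams point almost the same way (risky to schedule
    together); near 0 means they are spatially well separated.
    The 0.5 threshold is an arbitrary illustrative choice."""
    corr = np.abs(np.vdot(steering_a, steering_b)) / (
        np.linalg.norm(steering_a) * np.linalg.norm(steering_b))
    return corr < threshold  # True -> safe to pair these beams

def build_state(rsrp, usage_counts):
    """Stack per-beam signal strength and popularity into one state vector."""
    popularity = usage_counts / max(usage_counts.sum(), 1)  # usage as a fraction
    return np.concatenate([rsrp, popularity])

# Orthogonal beams are compatible; identical beams are not.
print(beam_compatibility(np.array([1, 0], dtype=complex),
                         np.array([0, 1], dtype=complex)))  # True
```

In practice the compatibility check would run over every candidate pair the scheduler is considering, pruning combinations that would interfere before the agent ever scores them.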

How It Learns (The Video Game Analogy)

How does the computer learn to be this smart? It plays a video game.

  • The Goal: Maximize the total data (throughput) sent to everyone and minimize the time they wait (latency).
  • The Trial and Error: At first, the AI makes random choices. Sometimes it picks the wrong beam, and the data rate drops (it loses points). Sometimes it picks a great combination, and the data flies (it gains points).
  • The Reward: Every time the network runs smoothly, the AI gets a "reward." Over thousands of tries, it learns a policy: "When I see this pattern of signals and this history of usage, I should always pick this specific combination of beams."
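The trial-and-error loop above is the core of reinforcement learning. The toy sketch below uses tabular Q-learning to make the reward-driven update concrete; the paper uses a deep neural network instead of a table, and the environment, states, and reward here are made up for illustration.

```python
import random
from collections import defaultdict

def train(env_step, states, actions, episodes=2000,
          alpha=0.1, gamma=0.9, eps=0.2):
    """Tabular Q-learning: learn which action (beam choice) earns the
    most long-run reward (throughput) in each observed state."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        # Explore sometimes; otherwise exploit the best-known action.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_next = env_step(s, a)  # reward, e.g. data delivered this slot
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        # Nudge the estimate toward (reward now + discounted future value).
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Over thousands of episodes, the table (or, in the paper, the network) converges on a policy: for each observed pattern of signals and usage, pick the beam combination that history says pays off best.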

The Results: Why It Matters

The paper tested this "Smart Conductor" against the old "Traffic Cop" method in a simulated city with 210 users. The results were impressive:

  • Faster Speeds: The network delivered up to 16% more data to users. That's like getting a faster download speed on your phone without changing your plan.
  • Less Waiting: The time it takes for a message to go from your phone to the tower and back (latency) was reduced by 3 to 7 times.
    • Analogy: If the old system made you wait 7 seconds for a webpage to load, the new system makes it load in 1 second.
  • Smarter Grouping: The AI learned to group users together more efficiently. Instead of serving them one by one, it found ways to serve multiple users simultaneously without them interfering with each other, much like a bus driver who realizes they can pick up three people on the same side of the street without making a detour.

The Bottom Line

This paper shows that by teaching a computer to learn from the environment rather than just following rigid rules, we can make our 5G networks significantly faster and more efficient. It's the difference between a robot that blindly follows a map and a human driver who knows the shortcuts, the traffic patterns, and how to navigate the city smoothly.

In short: The old way was "Pick the loudest signal." The new way is "Pick the smartest combination of signals to keep everyone happy and moving fast."
