Learning to Reflect: Hierarchical Multi-Agent Reinforcement Learning for CSI-Free mmWave Beam-Focusing

Imagine you are in a large, crowded conference room trying to have a conversation. The walls are thick, and the person you are talking to is on the other side of a pillar. Your voice (the signal) gets blocked, and you can't hear each other.

In the world of wireless internet (specifically the super-fast 60 GHz "mmWave" band used for 5G and beyond), this is a huge problem. The signals are like high-pitched whispers that get blocked easily by walls. To fix this, engineers use Reconfigurable Intelligent Surfaces (RIS): essentially giant, smart mirrors on the walls that can bounce the signal around obstacles to reach the user.

However, there's a catch. Traditional smart mirrors are like super-advanced, electronic mirrors that need to know the exact shape of every single air molecule between them and you to bend the light perfectly. This requires a massive amount of data (called "Channel State Information" or CSI) to be constantly measured and calculated. It's like trying to direct a traffic jam by asking every single driver exactly where they are, how fast they are going, and what they plan to do next, in real-time. It's too slow, too expensive, and too complicated.

This paper proposes a smarter, simpler way: "Learning to Reflect."

Here is the breakdown of their solution using simple analogies:

1. The "CSI-Free" Idea: Stop Measuring the Air, Just Look at the Map

Instead of trying to measure the invisible air currents (the complex radio waves), the authors say: "Let's just look at where the people are."

The Old Way: Like a blindfolded conductor trying to tune an orchestra by listening to every single instrument individually.
The New Way: Like a traffic director who just looks at a map of where the cars are. If you know a car is at the corner, you don't need to measure the wind to know which way to point the traffic sign.
The Benefit: They use user location data (which is easy to get indoors, for example from Wi-Fi or UWB positioning) instead of complex radio measurements. This avoids the constant channel measurements and the heavy computation that goes with them.

2. The "Hierarchical" Team: The Manager and the Workers

The problem of controlling hundreds of tiny mirror tiles is too big for one brain. So, they split the job into two levels, like a company structure:

The High-Level Manager (The Allocator):
- Job: This is the boss. It looks at the whole room and decides: "Okay, User A is in the north corner, so they should be served by Mirror Group 1. User B is in the south, so they get Mirror Group 2."
- Analogy: Think of a restaurant manager assigning tables to waiters. The manager doesn't cook the food; they just decide which waiter serves which table.
- Speed: This manager doesn't need to act at every moment. They make a plan once every fixed number of time steps and stick with it until the next review.
The Low-Level Workers (The Focal Point Optimizers):
- Job: Once the manager assigns a user to a mirror group, these workers take over. Their only job is to tilt the specific tiles in their group to focus the signal exactly on that one user.
- Analogy: These are the waiters. Once they know which table they are serving, they focus entirely on getting the food (the signal) to that specific person perfectly. They don't worry about the other tables.
- Speed: They adjust the mirrors constantly and quickly to track the user if they move.

3. The "Mechanical" Mirrors: No Electronics Needed

Most research focuses on mirrors made of tiny electronic chips that change the signal's phase. These are expensive and hard to build for large surfaces.

This Paper's Twist: They use mechanical mirrors. Imagine a wall covered in hundreds of small, physical metal tiles (like hexagonal scales) that can physically rotate using simple motors (servos).
Why it's cool: It's like using a physical shutter instead of a digital filter. It's cheaper, works across a wide range of frequencies (wideband), and doesn't need complex radio electronics. The "AI" just tells the motors where to point.
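For readers who want the geometry behind "telling the motors where to point": a flat metal tile behaves like a mirror, so its surface normal has to bisect the direction back to the access point and the direction to the spot being targeted. The sketch below is illustrative and not taken from the paper; the function names, the coordinate frame, and the elevation/azimuth convention are all assumptions.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def tile_normal(tile_pos, ap_pos, focal_point):
    """Unit normal that mirrors a ray arriving from the AP toward focal_point.

    By the law of reflection the normal bisects the two outgoing directions.
    (Degenerate if the AP and focal point lie in exactly opposite directions.)
    """
    to_ap = unit(np.subtract(ap_pos, tile_pos))
    to_focus = unit(np.subtract(focal_point, tile_pos))
    return unit(to_ap + to_focus)

def reflect(direction, normal):
    """Specular reflection of a propagation direction off a surface."""
    return direction - 2.0 * np.dot(direction, normal) * normal

def normal_to_angles(n):
    """Assumed convention: elevation above the x-y plane, azimuth within it (degrees)."""
    elevation = np.degrees(np.arcsin(np.clip(n[2], -1.0, 1.0)))
    azimuth = np.degrees(np.arctan2(n[1], n[0]))
    return elevation, azimuth

# Hypothetical layout: tile at the origin, AP and target on opposite sides above it.
tile = np.array([0.0, 0.0, 0.0])
ap = np.array([1.0, 0.0, 1.0])
focus = np.array([-1.0, 0.0, 1.0])
n = tile_normal(tile, ap, focus)
outgoing = reflect(unit(tile - ap), n)  # ray travelling AP -> tile, then reflected
print(n, outgoing, normal_to_angles(n))
```

Because only a location and simple vector math are involved, nothing here depends on measuring the radio channel itself, which is the point of the location-based approach.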

4. The "Teacher" (The Compatibility Matrix)

When the AI starts learning, it's like a student who knows nothing. It has to guess which mirror goes with which user. There are millions of wrong guesses.

The Cheat Sheet: The authors gave the AI a "Compatibility Matrix." This is a simple rule of thumb: "If a user is close to a mirror, that mirror is probably a good choice."
The Result: This acts like a teacher giving a hint to a student. It helps the AI reach a good solution 200 to 300 training episodes sooner, and the solution it ends up with is noticeably better than if it had to learn from scratch.

The Results: Why Does This Matter?

The researchers tested this in a simulated room with moving people. Here is what happened:

Better Signal: Their system improved the signal strength by 2.8 to 7.9 dB compared to a single "all-in-one" AI controller. In plain English, that is roughly 2 to 6 times more received signal power.
Scales Well: When they doubled the number of people in the room (from 2 to 4), each person's signal dropped by only about 1.4 dB, less than half of the 3 dB you would expect if the mirrors were simply split between twice as many users.
Robust: Even if the location data was slightly wrong (up to 0.5 meters off, like a slightly inaccurate GPS), the system still worked well. It didn't break; it just got a tiny bit worse.

The Big Picture

This paper shows that we don't need to build super-complex, expensive electronic brains to control smart wireless environments. Instead, we can use a hierarchical team of simple agents (a manager and workers) controlling physical mechanical mirrors, guided by simple location data.

It's the difference between trying to control a swarm of bees with a laser pointer (complex, expensive, fragile) versus building a beehive with a smart entrance that naturally guides the bees where they need to go (simple, robust, scalable). This approach could make high-speed 6G internet in offices and cities much cheaper and more reliable.

1. Problem Statement

The paper addresses critical bottlenecks in deploying Reconfigurable Intelligent Surfaces (RIS) for millimeter-wave (mmWave) communications:

CSI Overhead: Traditional RIS systems rely on precise Channel State Information (CSI) estimation for every reflecting element. In large-scale arrays (hundreds to thousands of elements), the pilot overhead and computational burden of CSI estimation grow with the number of elements, making per-element estimation impractical in dynamic environments.
Dimensionality Explosion: Centralized optimization of thousands of individual tile orientations creates a massive combinatorial action space, making real-time control intractable.
Hardware Complexity: Conventional electronic RIS requires complex RF circuitry and phase shifters. The paper proposes using mechanically reconfigurable metallic reflectors (rotatable tiles) which are frequency-agnostic and simpler to control but introduce new geometric optimization challenges.

2. Methodology

The authors propose a Hierarchical Multi-Agent Reinforcement Learning (HMARL) framework that operates in a CSI-free manner, relying instead on user localization data.

A. System Architecture

Physical Setup: A mmWave system with an Access Point (AP), multiple User Equipments (UEs), and reflector arrays composed of hexagonal metallic tiles. The tiles are mechanically rotated (elevation $\theta$ and azimuth $\phi$) to steer beams.
CSI-Free Paradigm: The system eliminates pilot-based channel estimation. Instead, it uses readily available user position data ($u_k$) and reflector geometry to optimize beam focusing.
Hierarchical Decomposition: To manage complexity, the problem is split into two abstraction levels:
1. High-Level (Allocation): A centralized controller assigns users to specific reflector segments ($L$ segments) based on global spatial positioning. This is a discrete combinatorial decision made every $T$ time steps.
2. Low-Level (Execution): Decentralized agents for each reflector segment optimize the focal point ($f_l$) of their assigned segment to maximize Received Signal Strength Indicator (RSSI) for their specific user. This is a continuous control problem executed at every time step.
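The two time scales in this decomposition can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the learned allocator is replaced by a nearest-user rule, the learned low-level agents by a rate-limited move toward the assigned user, and the names `allocate`, `refine_focal_point`, `run`, the period `T = 10`, and `max_step` are all hypothetical.

```python
import numpy as np

T = 10  # assumed high-level decision period, in time steps

def allocate(user_pos, segment_pos):
    """Stand-in for the learned allocator: each segment serves its nearest user."""
    diff = segment_pos[:, None, :] - user_pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # shape (L, K)
    return dist.argmin(axis=1)                # one user index per segment

def refine_focal_point(focal, target, max_step=0.2):
    """Stand-in for a learned low-level agent: move toward the target,
    limited to max_step per time step (mimicking actuation limits)."""
    delta = target - focal
    dist = np.linalg.norm(delta)
    if dist <= max_step:
        return target.copy()
    return focal + delta * (max_step / dist)

def run(user_traj, segment_pos, focal_init, period=T):
    """user_traj has shape (steps, K, 3); returns (assignment, focal) per step."""
    focal = np.array(focal_init, dtype=float)
    assignment = None
    history = []
    for t, user_pos in enumerate(user_traj):
        if t % period == 0:                   # slow loop: reassign every `period` steps
            assignment = allocate(user_pos, segment_pos)
        for l in range(len(segment_pos)):     # fast loop: every segment, every step
            focal[l] = refine_focal_point(focal[l], user_pos[assignment[l]])
        history.append((assignment.copy(), focal.copy()))
    return history

# Hypothetical scene: two wall segments, two stationary users.
segments = np.array([[0.0, 0.0, 2.0], [5.0, 0.0, 2.0]])
users = np.array([[0.0, 1.0, 1.0], [5.0, 1.0, 1.0]])
traj = np.repeat(users[None, :, :], 30, axis=0)
final_assignment, final_focal = run(traj, segments, np.zeros((2, 3)))[-1]
print(final_assignment, final_focal)
```

Holding the assignment fixed between high-level decisions gives each low-level agent a stable target to converge on before it can change.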

B. Learning Framework

Algorithm: The system utilizes Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) paradigm.
- Training: A global critic network observes the full system state to stabilize learning and resolve non-stationarity.
- Execution: Agents act based only on local observations (assigned user position, reflector position, current focal point).
Temporal Abstraction: The high-level controller updates slowly (every $T$ steps) to allow low-level controllers time to converge on optimal focal points before reassignment occurs.
Inductive Bias (Compatibility Matrix): To accelerate convergence in the sparse reward environment, a geometric compatibility matrix ($C$) is introduced. It encodes prior knowledge about signal propagation favorability based on distance and reflection angles, guiding the high-level allocator during early training.
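One way such a prior could be built and injected is sketched below. The summary states only that $C$ depends on distance and reflection angles, so the specific score (product of incidence and departure cosines divided by one plus distance), the log-space bias, and the weight `beta` are assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def compatibility_matrix(segment_pos, segment_normal, user_pos, ap_pos):
    """Hypothetical score C[l, k]: larger when segment l faces both the AP
    and user k, and when user k is close to the segment."""
    L, K = len(segment_pos), len(user_pos)
    C = np.zeros((L, K))
    for l in range(L):
        to_ap = ap_pos - segment_pos[l]
        cos_in = max(0.0, np.dot(to_ap, segment_normal[l]) / np.linalg.norm(to_ap))
        for k in range(K):
            to_user = user_pos[k] - segment_pos[l]
            dist = np.linalg.norm(to_user)
            cos_out = max(0.0, np.dot(to_user, segment_normal[l]) / dist)
            C[l, k] = cos_in * cos_out / (1.0 + dist)
    return C

def biased_allocation_probs(logits, C, beta):
    """Per-segment softmax over users, with the prior added in log space.
    beta = 0 disables the prior; it could be annealed toward 0 during training."""
    z = logits + beta * np.log(C + 1e-9)
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical scene: one wall segment facing +x, an AP in front, two users.
seg_pos = np.array([[0.0, 0.0, 0.0]])
seg_normal = np.array([[1.0, 0.0, 0.0]])
ap = np.array([3.0, 0.0, 0.0])
users = np.array([[1.0, 0.0, 0.0], [4.0, 3.0, 0.0]])
C = compatibility_matrix(seg_pos, seg_normal, users, ap)
print(C)
print(biased_allocation_probs(np.zeros((1, 2)), C, beta=1.0))
```

With untrained (all-zero) logits, the prior alone already steers the allocation toward the nearer, better-aligned user, which is the kind of early guidance the compatibility matrix is described as providing.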

C. Mathematical Formulation

Dimensionality Reduction: The focal point abstraction reduces the control parameters from $2N^2$ (individual tile angles) to $3L$ (segment focal points) plus $KL$ (allocations). For typical indoor scenarios, this yields a >5-fold reduction in action space dimensionality.
Constraints: Physical limits on tile rotation angles and mechanical actuation speeds are enforced as hard constraints within the simulation environment, ensuring all actions are physically valid.
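The parameter counts above can be checked with a few lines of arithmetic. The sketch assumes that $N$ counts tiles along one side of a square array (one reading of the $2N^2$ term), and the values of `n`, `L`, and `K` are illustrative rather than taken from the paper.

```python
def centralized_dims(n):
    """Two angles (elevation, azimuth) per tile in an n x n array: 2 * n^2."""
    return 2 * n * n

def hierarchical_dims(num_segments, num_users):
    """Three focal-point coordinates per segment plus a K x L allocation: 3L + KL."""
    return 3 * num_segments + num_users * num_segments

# Illustrative values only (not from the paper).
n, L, K = 10, 4, 4
flat = centralized_dims(n)
hier = hierarchical_dims(L, K)
print(f"centralized: {flat}, hierarchical: {hier}, reduction: {flat / hier:.1f}x")
```

The centralized count grows quadratically in $N$, while the hierarchical count depends only on the number of segments and users, so the gap widens as the array grows.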

3. Key Contributions

CSI-Free Operation: Demonstrated a viable path for RIS control using only user localization, eliminating the prohibitive overhead of high-dimensional channel estimation.
Scalable HMARL Architecture: Introduced a two-tier neural architecture (Allocator + Local Controllers) that effectively decomposes the joint optimization problem, achieving superior scalability compared to centralized baselines.
Hardware-Aware Design: Validated the use of mechanically reconfigurable metallic reflectors, offering a cost-effective, wideband alternative to electronic metasurfaces.
Geometric Inductive Bias: Showed empirically that integrating a domain-specific compatibility matrix significantly accelerates learning and improves final performance in high-dimensional combinatorial spaces.

4. Results and Evaluation

The framework was evaluated using a high-fidelity ray-tracing simulator (NVIDIA Sionna + Blender) in a 60 GHz indoor conference room scenario.

Performance Gains:
- The HMARL framework achieved 2.81 dB to 7.94 dB RSSI improvements over centralized PPO baselines.
- The performance gap widened as system complexity increased (4 users vs. 2 users), highlighting the scalability of the hierarchical approach.
Scalability:
- Doubling user density (from 2 to 4) resulted in only a 1.39 dB per-user degradation, whereas naive resource splitting would imply a 3 dB loss.
- Total system RSSI remained stable, demonstrating effective exploitation of multi-user diversity.
Robustness:
- Localization Errors: Performance degraded gracefully for localization errors up to 0.5 m (typical of UWB/WiFi systems) and dropped significantly only beyond 1.0 m.
- Aperture Size: Performance saturated after a certain tile count (99 tiles vs. 45 tiles), suggesting an optimal hardware trade-off exists.
- Reward Sensitivity: The system was robust to different path-loss compensation factors in the reward function.
Training Efficiency: The inclusion of the compatibility matrix accelerated convergence by 200–300 episodes and improved final rewards by 28–37% compared to learning without domain guidance.
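To put the decibel figures above in linear terms, the short sketch below converts them to power ratios, treating an RSSI difference in dB as a ratio of received powers. The helper names are illustrative; the 3 dB reference is simply $10\log_{10} 2$, the loss from halving the power available to each user.

```python
import math

def db_to_ratio(db):
    """Convert a dB difference to a linear power ratio."""
    return 10 ** (db / 10)

def ratio_to_db(ratio):
    """Convert a linear power ratio to dB."""
    return 10 * math.log10(ratio)

# Reported gains over the centralized PPO baseline.
for gain_db in (2.81, 7.94):
    print(f"{gain_db} dB gain  -> {db_to_ratio(gain_db):.2f}x received power")

# Per-user cost of doubling the number of users.
naive_loss_db = ratio_to_db(2)            # halving each user's share of power
reported_loss_db = 1.39
print(f"naive split: {naive_loss_db:.2f} dB loss, "
      f"each user keeps {db_to_ratio(-naive_loss_db):.0%}")
print(f"reported:    {reported_loss_db:.2f} dB loss, "
      f"each user keeps {db_to_ratio(-reported_loss_db):.0%}")
```

In these terms, the reported gains correspond to roughly 1.9 to 6.2 times the baseline's received power, and each user retains about 73% of their power after doubling instead of the 50% a naive split would leave.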

5. Significance

This work establishes a practical pathway for deploying intelligent surfaces in real-world mmWave networks. By shifting from electromagnetic precision (CSI) to spatial awareness (localization) and leveraging hierarchical multi-agent learning, the authors address the "curse of dimensionality" and the "CSI overhead" problem together.

The proposed solution is particularly significant for:

Indoor mmWave Coverage: Providing reliable, high-gain links in Non-Line-of-Sight (NLOS) scenarios without complex hardware.
Cost-Effective Deployment: Utilizing mechanical reflectors avoids expensive RF circuitry.
Future Networks: Offering a scalable control architecture capable of handling dense user environments and dynamic mobility, which are critical for 6G and beyond.

The code for the implementation is open-sourced, facilitating further research into hierarchical control for wireless environments.